July 2024 Newsletter - LDC - Corpora

15 Jul 2024


In this newsletter:
LDC at IC2S2
Fall 2024 LDC Data Scholarship Program
New publications:
MATERIAL Bulgarian-English Language Pack <https://catalog.ldc.upenn.edu/LDC2024S07>
Dialogs Re-Enacted Across Languages <https://catalog.ldc.upenn.edu/LDC2024S08>
________________________________
LDC at IC2S2
LDC is delighted to be a bronze sponsor for the 10th International Conference on Computational Social Science (IC2S2 <https://ic2s2-2024.org/>) held this year on Penn's campus July 17-20. The conference will feature research from around the world across a broad range of relevant fields to advance the many frontiers of computational social science. Be sure to visit LDC's table during the poster sessions July 18 and 19 from 1:30-2:30 pm.
Fall 2024 LDC Data Scholarship Program
Student applications for the Fall 2024 LDC Data Scholarship program are being accepted now through September 15, 2024. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page <https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________
New publications:
MATERIAL Bulgarian-English Language Pack <https://catalog.ldc.upenn.edu/LDC2024S07> was developed by Appen <http://www.appen.com/> for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL <https://www.iarpa.gov/index.php/research-programs/material> (Machine Translation for English Retrieval of Information in Any Language) program. It contains 80 hours of Bulgarian conversational telephone speech, transcripts, English translations, annotations, and queries.
Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 40% of the speech files, and approximately 10% of the speech files were translated into English. This release also includes domain annotations, English queries, and their relevance annotations.
The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.
2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
*
Dialogs Re-Enacted Across Languages <https://catalog.ldc.upenn.edu/LDC2024S08> was developed at the University of Texas at El Paso <https://www.utep.edu/>. It contains 17 hours of conversational speech in English and Spanish from 129 unique bilingual speakers. The speech consists of short fragments extracted from spontaneous conversations, together with close re-enactments in the other language by the original speakers, for a total of 3816 pairs of matching utterances. Data was collected in 2022-2023. Participants were recruited from among students at the University of Texas at El Paso; all were bilingual speakers of General American English and of Mexico-Texas Border Spanish.
Each speaker pair had a 10-minute conversation in one language. Various fragments from these conversations were chosen for re-enactment, and the original speakers produced equivalents in the other language. Each re-enactment was vetted for fidelity to the original and naturalness in the target language. Also included is metadata about conversations, participants, re-enactments, and utterances.
2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104