August 2024 Newsletter - LDC - Corpora

15 Aug 2024


      In this newsletter:
Fall 2024 LDC Data Scholarship program
New publications:
LORELEI Uyghur Incident Language Pack <https://catalog.ldc.upenn.edu/LDC2024T07>
Ravnursson Faroese Speech and Transcripts <https://catalog.ldc.upenn.edu/LDC2024S09>
________________________________
Fall 2024 LDC Data Scholarship program
Student applications for the Fall 2024 LDC Data Scholarship program are being accepted now through September 15, 2024. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page <https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________
New publications:
LORELEI Uyghur Incident Language Pack <https://catalog.ldc.upenn.edu/LDC2024T07> was developed by LDC and comprises 28 million words of Uyghur monolingual text, 500,000 words of English monolingual text, 3.3 million words of parallel and comparable Uyghur-English text, and 200,000 words annotated for simple named entities and situation frames. It constitutes all of the text data, annotations, supplemental resources, and related software tools for the Uyghur language that were used in the DARPA LORELEI / LoReHLT 2016 Evaluation <https://www.nist.gov/itl/iad/mig/lorehlt-evaluations>.
The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.
Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Named entity annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.
2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Ravnursson Faroese Speech and Transcripts <https://catalog.ldc.upenn.edu/LDC2024S09> contains 109 hours of Faroese prompted speech from 433 speakers (249 female, 184 male), along with corresponding transcripts and speaker metadata. It is an extract from the Basic Language Resource Kit 1.0 (BLARK 1.0) <https://mtd.setur.fo/en/resource/ravnur-blark-1-0/> developed by the Faroe Islands' Ravnur Project <https://mtd.setur.fo/en/>.
Speech data was collected in 2022. Speakers from all major dialect areas in the Faroe Islands in three age groups -- 15-35, 36-60, and 61+ years -- read texts that included a word list, a phrase list, closed vocabulary readings, and short texts. Recordings also contain spontaneous speech. Orthographic transcripts are included.
2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data at no cost.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104