July 2023 Newsletter - LDC - Corpora

17 Jul 2023


In this newsletter:
LanguageArc featured in Babel magazine
Fall 2023 LDC Data Scholarship Program
New publications:
Mixer 7 Spanish Speech (https://catalog.ldc.upenn.edu/LDC2023S04)
LORELEI Thai Representative Language Pack (https://catalog.ldc.upenn.edu/LDC2023T08)
________________________________
LanguageArc featured in Babel magazine
The May 2023 edition of Babel (The Language Magazine, https://cloud.3dissue.com/18743/41457/106040/BabelNo43/index.html) features an article about LDC's citizen science portal LanguageArc (Language Analysis Research Community, https://languagearc.com/) and the diverse projects available there, which use a variety of novel incentives to supplement traditional methods of creating data resources. Consider LanguageArc for your next collection project. Note: a subscription is necessary to view the article.
Fall 2023 LDC Data Scholarship Program
Student applications for the Fall 2023 LDC Data Scholarship program are being accepted now through September 15, 2023. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and a letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page (https://www.ldc.upenn.edu/language-resources/data/data-scholarships).
________________________________
New publications:
Mixer 7 Spanish Speech (https://catalog.ldc.upenn.edu/LDC2023S04) was developed by LDC and contains 9,600 hours of audio recordings of interviews, transcript readings, and conversational telephone speech involving 191 distinct native Spanish speakers. This material was collected by LDC in 2011-2012 as part of the Mixer project, and the recordings were used in the 2012 NIST SRE test set.
Recruited speakers were connected through a robot operator to carry on casual conversations on a pre-set topic lasting up to 10 minutes. Participants also visited LDC's human subjects collection lab equipped with a 14-microphone array where they participated in interviews and transcript readings and conducted up to 3 telephone calls under varying conditions. Selected speaker metadata was also collected.
2023 members can access this corpus through their LDC accounts. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.
*
LORELEI Thai Representative Language Pack (https://catalog.ldc.upenn.edu/LDC2023T08) comprises over 39 million words of Thai monolingual text, 2.85 million words of found Thai-English parallel text, and 141,000 Thai words translated from English data. Over 186,000 words were annotated for named entities, and more than 25,000 words were annotated for entity discovery and linking and for situation frames (identifying entities, needs, and issues). Data was collected from discussion forums, news, reference, social network, and weblog sources.
The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10, https://catalog.ldc.upenn.edu/LDC2020T10).
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account (https://catalog.ldc.upenn.edu/login) and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.