May 2023 Newsletter - LDC - Corpora

15 May 2023


In this newsletter:
LDC at ICASSP 2023
New publications:
2019 NIST Speaker Recognition Evaluation Test Set - CTS Challenge <https://catalog.ldc.upenn.edu/LDC2023S03>
LORELEI Zulu Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2023T06>
________________________________
LDC at ICASSP 2023
LDC will be exhibiting at ICASSP 2023 <https://2023.ieeeicassp.org/>, held this year June 4-10 in Rhodes, Greece. Stop by booth 15 to learn more about recent developments at the Consortium and the latest publications.
LDC will post conference updates via Twitter <https://twitter.com/LDCupenn> and Facebook <https://www.facebook.com/ldc.upenn>. We look forward to seeing you there!
________________________________
New publications:
2019 NIST Speaker Recognition Evaluation Test Set - CTS Challenge <https://catalog.ldc.upenn.edu/LDC2023S03>, developed by LDC and NIST, contains 635 hours of Tunisian Arabic telephone recordings for development and test, answer keys, enrollment, trial files, and documentation from the CTS Challenge portion of the NIST-sponsored 2019 Speaker Recognition Evaluation <https://www.nist.gov/itl/iad/mig/nist-2019-speaker-recognition-evaluation>. The 2019 evaluation was conducted in two parts: (1) a leaderboard-style challenge based on conversational telephone speech from LDC's Call My Net 2 (CMN2) corpus; and (2) a separate evaluation using audio-visual material collected by LDC for the VAST (Video Annotation for Speech Technology) project (released as LDC2023V01 <https://catalog.ldc.upenn.edu/LDC2023V01>).
The telephone speech data for the CTS Challenge was drawn from the CMN2 collection conducted by LDC in Tunisia, in which Tunisian Arabic speakers called friends or relatives who agreed to record their telephone conversations, each lasting 8-10 minutes. The speech segments include PSTN (public switched telephone network) and VOIP (voice over IP) data.
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
LORELEI Zulu Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2023T06> comprises over 5 million words of Zulu monolingual text, 2.7 million words of found Zulu-English parallel text, and 71,000 Zulu words translated from English data. Approximately 100,000 words were annotated for named entities, and over 23,000 words were annotated for entity discovery and linking and for situation frames (identifying entities, needs, and issues). Data was collected from discussion forum, news, reference, social network, and weblog sources.
The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10) <https://catalog.ldc.upenn.edu/LDC2020T10>.
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104