March 2025 Newsletter - LDC - Corpora

17 Mar 2025


In this newsletter:
LDC data and commercial technology development
New publications:
2015 NIST Language Recognition Evaluation Test Set (https://catalog.ldc.upenn.edu/LDC2025S02)
The Xi'an Multi-Language Learner Corpus (https://catalog.ldc.upenn.edu/LDC2025T03)
________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page (https://www.ldc.upenn.edu/data-management/using/licensing) for further information.
________________________________
New publications:
2015 NIST Language Recognition Evaluation Test Set (https://catalog.ldc.upenn.edu/LDC2025S02) was developed by LDC and NIST. It contains the evaluation test set for the 2015 NIST Language Recognition Evaluation (LRE), approximately 867 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) collected by LDC in 20 languages over 6 clusters of related languages: Arabic (Egyptian, Iraqi, Levantine, Maghrebi, Modern Standard Arabic); Spanish (Caribbean, European, Latin American, Brazilian Portuguese); English (British, Indian, General American English); Chinese (Cantonese, Mandarin, Min Nan, Wu); Slavic (Polish, Russian); and French (West African, Haitian Creole).
The CTS data includes calls between individuals in the same social networks lasting 8-15 minutes and telephone speech from the IARPA Babel series collected in 2012-2013 from speakers using a range of phone types in diverse settings with varying noise conditions. The BNBS data was collected by LDC from streaming and satellite radio programming, focusing on programs that included narrowband speech (e.g., call-ins to a talk show).
The goal of NIST's LRE series is to establish the baseline of current performance capability for CTS language recognition and to lay the groundwork for further research efforts. LRE15 expanded the range of test segment durations and added a test condition that allowed systems to make use of unrestricted training data when developing models.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
The Xi'an Multi-Language Learner Corpus (https://catalog.ldc.upenn.edu/LDC2025T03) was developed by Xi'an International Studies University (XISU, https://en.xisu.edu.cn/) and comprises 526 argumentative essays in 15 languages by Chinese L1 university students studying second languages, along with student metadata and writing prompts. It was developed to support second language learner research and to provide a database for cross-linguistic comparison of second languages.
Data was collected in 2023 and 2024 from students at XISU and Yunnan Minzu University (YMU) who were linguistics majors or were studying one of the foreign languages available at XISU and YMU. Off-topic essays and incomplete texts were excluded.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account (https://catalog.ldc.upenn.edu/login) and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium (ldc.upenn.edu)
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104