In this newsletter: LDC data and commercial technology development
New publications:
Mixer 7 English Speech <https://catalog.ldc.upenn.edu/LDC2025S08>
AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment <https://catalog.ldc.upenn.edu/LDC2025T13>
LORELEI Hindi Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2025T12>
________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page <https://www.ldc.upenn.edu/data-management/using/licensing> for further information.
________________________________
New publications:
Mixer 7 English Speech <https://catalog.ldc.upenn.edu/LDC2025S08> was developed by LDC and contains 12,321 hours of audio recordings of interviews, transcript readings, and conversational telephone speech involving 222 distinct English speakers. This material was collected by LDC in 2010-2011 as part of the Mixer project, and the recordings were used in the 2012 NIST SRE test set.
Recruited speakers were connected through a robot operator to carry on casual conversations on a pre-set topic lasting up to 10 minutes. Participants also visited LDC's Human Subjects Collection Lab, which was equipped with a 14-microphone array, where they participated in interviews and transcript readings and conducted telephone calls under varying conditions. Selected speaker metadata was also collected.
2025 members can access this corpus through their LDC accounts. This corpus is a Members-Only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.
*
AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment <https://catalog.ldc.upenn.edu/LDC2025T13> was developed by LDC and comprises English, Russian, and Ukrainian web documents (text, video, image), annotations, and assessments used in the AIDA Phase 1 pilot and final evaluations. The Phase 1 scenario focused on political relations between Russia and Ukraine in the 2010s. The material in this corpus covers the following events: Suspicious Deaths and Murders in Ukraine (January-April 2015); Odessa Tragedy (May 2, 2014); and Siege of Sloviansk and Battle of Kramatorsk (April-July 2014).
The corpus contains 10,522 documents, annotations for 386 of those documents, and assessment results covering 77,965 responses in 1,525 of those documents. Annotations were performed in three steps: (1) within-document labels for scenario-related entities, relations, and events; (2) coreference annotation across documents by linking information elements to a knowledge base; and (3) indications of any relationship between labeled events/relations and hypotheses about the scenario. In the assessment phase, LDC annotators reviewed and judged system response files to provide evaluation organizers with a means for scoring submissions. Assessment tasks included zero-hop assessment, class-based assessment, graph assessment, and hypothesis assessment.
The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations, and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating, and annotating multimodal linguistic resources in multiple languages.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
LORELEI Hindi Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2025T12> contains over 26 million words of Hindi monolingual text, 363,000 words of which were translated into English, 1.07 million words of found Hindi-English parallel text, and 118,000 Hindi words translated from English data. Approximately 103,000 words were annotated for simple named entities and over 25,000 words were annotated for full entity (including nominals and pronouns), entity linking, and situation frames (identifying entities, needs and issues). Data was collected from discussion forums, news, reference, social network, and weblog sources.
The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10) <https://catalog.ldc.upenn.edu/LDC2020T10>.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810 Philadelphia, PA 19104