[Corpora-List] February 2025 Newsletter - LDC

17 Feb 2025


      In this newsletter:
LDC at LT4All 2025
LDC membership discounts expire March 3
Spring 2025 data scholarship recipients
New publications:
AIDA Scenario 3 Practice Topic Source Data and Annotation <https://catalog.ldc.upenn.edu/LDC2025T02>
MATERIAL Georgian-English Language Pack <https://catalog.ldc.upenn.edu/LDC2025S01>
________________________________
LDC at LT4All 2025
LDC is pleased to be a sponsor of The 2nd International Conference on Language Technologies for All (LT4All 2025) <https://www.lt4all2025.eu/overview/>, February 24-26, 2025, organized by ELRA and SIGUL, the ELRA/ISCA Special Interest Group on Under-resourced Languages, and in partnership with UNESCO as part of the International Decade of Indigenous Languages (2022-2032). The conference theme, "Advancing Humanism through Language Technologies," focuses on community empowerment within the larger discussion on the many ways technology impacts language communities. The conference will also commemorate the Silver Jubilee of International Mother Language Day (February 21).
LDC membership discounts expire March 3
Time is running out to save on 2025 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 3 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC <https://www.ldc.upenn.edu/members/join-ldc>.
Spring 2025 data scholarship recipients
Congratulations to the recipients of LDC's Spring 2025 data scholarships:
Sair Buckle, Charles Sturt University (Australia); PhD student, AI and Cyber Futures Institute. Sair is awarded a copy of Avocado Research Email Corpus LDC2015T03 for her work in behavioral science.
Le Phuoc Thinh Tien, Vietnam National University Ho Chi Minh City (Vietnam); Bachelor's student, Faculty of Information Technology. Le is awarded a copy of Penn Discourse Treebank Version 3.0 LDC2019T05 for his research in natural logical reasoning.
The next round of applications will be accepted in September 2025. For information about the program, visit the Data Scholarships page <https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________
New publications:
AIDA Scenario 3 Practice Topic Source Data and Annotation <https://catalog.ldc.upenn.edu/LDC2025T02> was developed by LDC and is comprised of English, Russian, and Spanish web documents (text, video, image) and annotations. Each phase of the AIDA program centered on a specific scenario, or broad topic area, with related subtopics designated as either practice subtopics or evaluation subtopics. The Phase 3 scenario focused on the COVID-19 global pandemic. This corpus contains source documents and annotations for the Scenario 3 practice topics.
The corpus contains 1417 root documents; 279 documents were annotated. Annotations include:
*   Event, relation, and entity annotation (64 documents)
*   Claim frame annotation: claims (true or not) relating to the COVID-19 pandemic (203 documents)
*   Practice topic query claim frames: example claim frames intended to be used by systems as queries to extract similar claims from additional documents (30 documents)
The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations, and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating, and annotating multimodal linguistic resources in multiple languages.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
MATERIAL Georgian-English Language Pack <https://catalog.ldc.upenn.edu/LDC2025S01> was developed by Appen <http://www.appen.com/> for the IARPA MATERIAL <https://www.iarpa.gov/index.php/research-programs/material> program and contains 79 hours of Georgian conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately half of the speech files, and approximately 3% of the speech data was translated into English. This release also includes English queries and their relevance annotations.
The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.
2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
