[Corpora-List] April 2026 Newsletter - LDC

15 Apr 2026


In this newsletter:
New publications:
DEFT Chinese and English Light and Rich ERE Parallel Annotation <https://catalog.ldc.upenn.edu/LDC2026T04>
MATERIAL Tagalog-English Language Pack <https://catalog.ldc.upenn.edu/LDC2026S05>
LORELEI Somali Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2026T03>
________________________________
New publications:
DEFT Chinese and English Light and Rich ERE Parallel Annotation <https://catalog.ldc.upenn.edu/LDC2026T04> was developed by LDC and consists of 179 Chinese discussion forum documents and their English translations annotated for entities, relations, and events (ERE). Light ERE annotation labels mentions of a target set of entity types, along with relations and events between and among those entities, including coreference. Rich ERE annotation expands types and tagging in the entities, relations, and events annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation. All 179 Chinese-English document pairs were annotated following Light ERE annotation guidelines; a subset of 171 pairs was also labeled with Rich ERE annotation. The source data and English translations were drawn from BOLT Chinese Discussion Forum Parallel Training Data (LDC2017T05) <https://catalog.ldc.upenn.edu/LDC2017T05>, originally collected and translated by LDC under the DARPA BOLT program.
DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships, and anomaly detection. LDC supported the DEFT program by collecting, creating, and annotating a variety of data sources.
2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
MATERIAL Tagalog-English Language Pack <https://catalog.ldc.upenn.edu/LDC2026S05> was developed by Appen <http://www.appen.com/> for the IARPA MATERIAL <https://www.iarpa.gov/index.php/research-programs/material> program and contains 100 hours of Tagalog conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, 2% of which were translated into English. This release also includes domain annotations, English queries, and their relevance annotations.
The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.
2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
*
LORELEI Somali Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2026T03> contains over 13 million words of Somali monolingual text, 800,000 words of which were translated into English, and 106,000 Somali words translated from English data. Approximately 73,000 words were annotated for simple named entities, around 23,000 words were annotated for full entities (including nominals and pronouns), and over 10,000 words were covered by noun phrase chunking annotation. Data was collected from discussion forums, news, reference material, social networks, and weblogs.
The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10) <https://catalog.ldc.upenn.edu/LDC2020T10>.
2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
