June 2025 Newsletter - LDC - Corpora

16 Jun 2025


In this newsletter:
LDC data and commercial technology development
New publications:
Chinese Sentence Pattern Structure Treebank <https://catalog.ldc.upenn.edu/LDC2025T06>
IWSLT 2022-2023 Shared Task Training, Development and Test Set <https://catalog.ldc.upenn.edu/LDC2025S05>
KAIROS Schema Learning Complex Event Annotation <https://catalog.ldc.upenn.edu/LDC2025T07>
________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page <https://www.ldc.upenn.edu/data-management/using/licensing> for further information.
________________________________
New publications:
Chinese Sentence Pattern Structure Treebank <https://catalog.ldc.upenn.edu/LDC2025T06> was developed at Beijing Normal University <https://english.bnu.edu.cn/> and Peking University <https://english.pku.edu.cn/>. It contains 5,016 sentences and 119,627 tokens syntactically annotated following the concept of sentence constituent analysis, which emphasizes sentence pattern structure. The source data consists of 27 chapters extracted from modern Mandarin and ancient Chinese works. There are three annotation layers: lexical sense and structural mode for dynamic words; syntactic structure for clauses; and inter-clause relations within complex sentences and sentence clusters. These structures can be visualized using the Jbw-viewer tool <https://github.com/bnucip/jbwviewer>, which is included in the release.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
IWSLT 2022 - 2023 Shared Task Training, Development and Test Set <https://catalog.ldc.upenn.edu/LDC2025S05> was developed by LDC and contains 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation. This material constitutes the training, development, and test data used in the International Conference on Spoken Language Translation (IWSLT) Dialectal Speech Translation task (2022) <https://iwslt.org/2022/dialect> and the Dialectal and Low-resource track (2023) <https://iwslt.org/2023/low-resource>.
The telephone speech was collected by LDC in 2016-2017 from native speakers of Tunisian Arabic in Tunis. Speakers were recruited to make telephone calls to people in their social networks under a variety of noise conditions and using a variety of handsets. Transcripts are orthographic, follow Buckwalter <https://catalog.ldc.upenn.edu/LDC2004L02> transliteration, and cover 175 hours of the collected speech. IPA transcripts were added to a subset of the data. All transcribed segments were translated into English.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
KAIROS Schema Learning Complex Event Annotation <https://catalog.ldc.upenn.edu/LDC2025T07> was developed by LDC to support the DARPA KAIROS program. It contains English and Spanish text, audio, video, and image data labeled for 93 real-world complex events with event, relation, and argument annotations linking to document provenance. Source data was collected from the web; 3,431 root web pages were collected and processed, yielding 1,919 text data files, 24,019 image files, 1,472 video files, and 16 audio files.
The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions, and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large, multilingual, multimedia corpus.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104