July 2025 Newsletter - LDC - Corpora

15 Jul 2025


In this newsletter:
Fall 2025 LDC data scholarship program
New publications:
AnnoDIFP Session Audio and Transcripts <https://catalog.ldc.upenn.edu/LDC2025S06>
Penn Parsed Corpora of Historical English Second Release <https://catalog.ldc.upenn.edu/LDC2025T09>
LoReHLT Uzbek Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2025T08>
________________________________
Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program are being accepted now through September 15, 2025. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and a letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page <https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________
New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) Session Audio and Transcripts <https://catalog.ldc.upenn.edu/LDC2025S06> was developed by LDC, the Florida Institute of Technology (FIT) <https://www.fit.edu/>, and the University of New Haven (UNH) <https://www.newhaven.edu/index.php> to support algorithm development for predicting personality traits. It contains 438.34 hours of English audio and transcripts from in-person interviews of 366 participants, paired with scores from two self-reported personality assessments: the HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and the Short Dark Triad (SD3).
In-person interviews were recorded at LDC, FIT, and UNH. In each session, the participant and interviewer were in separate sound-isolated rooms with communication between them supplied by audio/video hardware. Sessions consisted of the following tasks: rapport building, a YouTube task, a map task, and a business task. Further details on collection methodology and session tasks are contained in the documentation accompanying this release.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Penn Parsed Corpora of Historical English Second Release <https://catalog.ldc.upenn.edu/LDC2025T09> was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE). This second release corrects errors and inconsistencies in Penn Parsed Corpora of Historical English (LDC2020T16 <https://catalog.ldc.upenn.edu/LDC2020T16>), further streamlines annotation, simplifies the directory structure, and includes updated documentation.
This data set contains three corpora covering traditionally recognized periods of English:
*   The Penn-Helsinki Parsed Corpus of Middle English, second edition
*   The Penn-Helsinki Parsed Corpus of Early Modern English
*   The Penn Parsed Corpus of Modern British English, second edition
The texts are in two forms: part-of-speech tagged text and syntactically annotated text. Annotations were manually reviewed for accuracy and consistency. Included in this release are updated annotation guidelines, philological information for each corpus, and the CorpusSearch 2 program, which allows users to search the data for words, word sequences, and syntactic structure.
2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
*
LoReHLT Uzbek Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2025T08> was developed by LDC and comprises approximately 47 million words of Uzbek monolingual text, 563,000 words of found Uzbek-English parallel text, 100,000 Uzbek words translated from English data, and 6.4 hours of Uzbek broadcast news and amateur web audio recordings. Approximately 151,000 words were annotated for named entities, and over 28,000 words received full entity annotation, including nominals and pronouns. Noun-phrase chunking was applied to more than 13,000 words. Over 20,890 words were labeled with simple semantic annotation. Topic annotation was applied to the audio recordings. Data was collected from discussion forums, news, reference sources, social networks, broadcast news, web audio recordings, and weblogs.
LoReHLT was a companion project of the DARPA LORELEI program. The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low-resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104