May 2026 Newsletter - LDC - Corpora

15 May 2026


In this newsletter:
New publications:
MADCAT Phases 1-3 Composite Evaluation Set (https://catalog.ldc.upenn.edu/LDC2026T05)
CALLHOME German Second Edition (https://catalog.ldc.upenn.edu/LDC2026S06)
CALLHOME German Lexicon Second Edition (https://catalog.ldc.upenn.edu/LDC2026L04)
________________________________
New publications:
MADCAT Phases 1-3 Composite Evaluation Set (https://catalog.ldc.upenn.edu/LDC2026T05) contains the evaluation data created by LDC for Phases 1-3 of the DARPA MADCAT program and the NIST OpenHaRT (https://www.nist.gov/itl/iad/mig/openhart) 2010 and 2013 evaluations. It consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token, digital transcripts, and English translations, with content and annotation layers integrated in a single MADCAT XML output.
This release includes 1,643 images and corresponding annotation files. Source documents were web text and newswire collected by LDC. Arabic-speaking scribes copied documents by hand, following specific instructions as to the writing style, writing implement, and paper. Each page was scanned and the images annotated.
The goal of the MADCAT program was to automatically convert foreign language text images into English transcripts for use by humans and downstream processes, including summarization and information extraction. The core evaluation task in MADCAT was the translation of handwritten Arabic documents.
2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
CALLHOME German Second Edition (https://catalog.ldc.upenn.edu/LDC2026S06) was developed by LDC and contains 48 hours of speech from 100 unscripted telephone conversations between native German speakers. This publication is a re-release of the original CALLHOME German collection, combining CALLHOME German Speech (LDC97S43, https://catalog.ldc.upenn.edu/LDC97S43) and CALLHOME German Transcripts (LDC97T15, https://catalog.ldc.upenn.edu/LDC97T15), with additional transcription and updated directory structure, file formats, and documentation.
This release contains the 100 telephone conversations published in CALLHOME German Speech, which comprised training data (80 calls) and development data (20 calls). Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development partitioning was removed.
This release also features revised transcripts conforming to updated LDC transcription guidelines that addressed normalization of annotation formats, standardization of speaker-produced and background noises, application of foreign-language marking, whitespace cleanup, and corrections and consistency fixes.
The CALLHOME series consists of telephone conversations and transcripts developed by LDC and Rutgers, The State University of New Jersey, in support of research in speaker identification, language identification, and related technologies. Languages in the series include American English, Egyptian Arabic, German, Japanese, Mandarin Chinese, and Spanish.
2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
CALLHOME German Lexicon Second Edition (https://catalog.ldc.upenn.edu/LDC2026L04) was developed by LDC and contains 318,809 German words with morphological, phonological, stress, and frequency information. This second edition updates file formats, directory structure, and documentation. The first edition is available as CALLHOME German Lexicon (LDC97L18, https://catalog.ldc.upenn.edu/LDC97L18).
The words in the lexicon were derived from the CELEX German lexicon (CELEX2, LDC96L14, https://catalog.ldc.upenn.edu/LDC96L14) and from the 100 training and development transcripts of unscripted telephone conversations between native German speakers contained in CALLHOME German Second Edition (LDC2026S06).
The lexicon has seven tab-separated information fields: (1) headword: orthographic form; (2) morph: morphological analysis of the headword; (3) pron: pronunciation of the headword; (4) stress: primary stress information of the word; (5) celex: whether the headword appears in the CELEX German lexicon; (6) train_freq: frequency of the headword in the CALLHOME training transcripts; and (7) dev_freq: frequency of the headword in the CALLHOME development transcripts. This release also includes a pronunciation dictionary derived from the lexicon in CMUdict (https://stdlib.io/docs/api/latest/@stdlib/datasets/cmudict) format.
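For readers planning to load the lexicon programmatically, the seven-field layout described above could be read along the following lines. This is a minimal sketch, not LDC's own tooling: it assumes one entry per line with no header row, and that train_freq and dev_freq are integer counts. The names LexiconEntry, parse_line, and read_lexicon are hypothetical, and the sample line is invented for illustration, not taken from the corpus. The celex field is kept as a raw string because its encoding is not specified here.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Field order as listed in the newsletter.
FIELDS = ("headword", "morph", "pron", "stress", "celex", "train_freq", "dev_freq")


@dataclass
class LexiconEntry:
    headword: str
    morph: str
    pron: str
    stress: str
    celex: str       # kept as raw text; the value encoding is an unknown here
    train_freq: int  # assumed to be an integer count
    dev_freq: int    # assumed to be an integer count


def parse_line(line: str) -> LexiconEntry:
    """Split one tab-separated line into the seven named fields."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    headword, morph, pron, stress, celex, train_freq, dev_freq = parts
    return LexiconEntry(
        headword, morph, pron, stress, celex, int(train_freq), int(dev_freq)
    )


def read_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield one entry per non-blank line."""
    for line in lines:
        if line.strip():
            yield parse_line(line)


# Hypothetical sample line; the field values are made up for illustration.
sample = "Haus\tHaus\thaUs\t1\tY\t12\t3\n"
entry = parse_line(sample)
print(entry.headword, entry.train_freq + entry.dev_freq)
```

Passing an open file handle to read_lexicon would stream the entries without loading the whole lexicon into memory. The actual file name and character encoding should be taken from the release documentation.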
2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account (https://catalog.ldc.upenn.edu/login) and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium (ldc.upenn.edu)
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104