New subject: January 2026 Newsletter - LDC

15 Jan 2026


In this newsletter:
Renew your LDC membership today
New publications:
CALLHOME Japanese Second Edition <https://catalog.ldc.upenn.edu/LDC2026S02>
CALLHOME Japanese Lexicon Second Edition <https://catalog.ldc.upenn.edu/LDC2026L01>
MATERIAL Swahili-English Language Pack <https://catalog.ldc.upenn.edu/LDC2026S01>
________________________________
Renew your LDC membership today
The importance of curated resources for language-related education, research, and technology development drives LDC's mission to create them, to accept data contributions from researchers across the globe, and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 1000 holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.
Now through March 2, 2026, any organization that joins the Consortium or renews its membership will receive a 10% discount on the 2026 membership fee. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC <https://www.ldc.upenn.edu/members/join-ldc> for more details on membership options and benefits.
________________________________
New publications:
CALLHOME Japanese Second Edition <https://catalog.ldc.upenn.edu/LDC2026S02> was developed by LDC and contains 49 hours of speech from 120 telephone conversations between native Japanese speakers. This publication is a re-release of the original CALLHOME Japanese collection, combining CALLHOME Japanese Speech (LDC96S37) <https://catalog.ldc.upenn.edu/LDC96S37> and CALLHOME Japanese Transcripts (LDC96T18) <https://catalog.ldc.upenn.edu/LDC96T18> with additional transcription and updated directory structure, file formats, and documentation.
This corpus contains the 120 calls from CALLHOME Japanese Speech which represented training and development data and a subset of evaluation data. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development/test partitioning was removed.
This release also features revised transcripts conforming to updated LDC transcription guidelines that addressed normalization of annotation formats, standardization of speaker-produced and background noises, application of foreign-language marking, whitespace cleanup, and corrections and consistency fixes.
The CALLHOME series consists of telephone conversations and transcripts developed by LDC and Rutgers, The State University of New Jersey, in support of research in speaker identification, language identification, and related technologies. Languages in the series include American English, Egyptian Arabic, German, Japanese, Mandarin Chinese, and Spanish.
2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
CALLHOME Japanese Lexicon Second Edition <https://catalog.ldc.upenn.edu/LDC2026L01> was developed by LDC and contains 80,688 Japanese words with morphological, phonological, and stress information. This second edition updates file formats, directory structure, and documentation. The first edition is available as CALLHOME Japanese Lexicon (LDC96L17) <https://catalog.ldc.upenn.edu/LDC96L17>. The words in the lexicon were derived from 80 transcripts representing telephone conversations between native Japanese speakers contained in CALLHOME Japanese Second Edition (LDC2026S02) <https://catalog.ldc.upenn.edu/LDC2026S02>.
The lexicon contains seven tab-separated information fields: (1) headword: orthographic form in kanji or katakana or hiragana (if only written in hiragana); (2) hiragana: orthographic form in hiragana; (3) romanization: orthographic form in romaji; (4) pron: pronunciation of the headword; (5) morph: morphological analysis of the headword; (6) train freq: frequency of the headword in the transcripts; and (7) gloss: glosses of the headword. This release also includes a pronunciation dictionary derived from the lexicon in CMUdict <https://stdlib.io/docs/api/latest/@stdlib/datasets/cmudict> format and the grapheme-to-phoneme (G2P) tools used to automatically generate pronunciations for the original lexicon.
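For readers planning to load the lexicon programmatically, the seven-field layout described above could be read along the following lines. This is only a rough sketch under stated assumptions: the names LexiconEntry and read_lexicon are hypothetical, the file is assumed to be UTF-8 text with one entry per line and no header row, the train freq field is assumed to hold an integer, and the sample row is invented for illustration rather than taken from the corpus. Consult the documentation included with the release for the authoritative format.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon row; field names follow the newsletter's description."""
    headword: str
    hiragana: str
    romanization: str
    pron: str
    morph: str
    train_freq: int  # assumption: stored as an integer count
    gloss: str


def read_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield entries from tab-separated lines, skipping rows without 7 fields."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # blank or malformed line; skipped in this sketch
        headword, hiragana, romanization, pron, morph, freq, gloss = fields
        yield LexiconEntry(
            headword, hiragana, romanization, pron, morph, int(freq), gloss
        )


# Invented sample row (not real corpus data), used only to exercise the parser.
sample = "猫\tねこ\tneko\tneko\tnoun\t3\tcat\n"
entries = list(read_lexicon(io.StringIO(sample)))
print(entries[0].headword, entries[0].train_freq, entries[0].gloss)
```

With a real file, the same function could be given an open file handle, for example open(path, encoding="utf-8"), where path is whatever location the lexicon was unpacked to.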
2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
*
MATERIAL Swahili-English Language Pack <https://catalog.ldc.upenn.edu/LDC2026S01> was developed by Appen <http://www.appen.com/> for the IARPA MATERIAL <https://www.iarpa.gov/index.php/research-programs/material> program and contains 112 hours of Swahili conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, 3% of which were translated into English. This release also includes domain annotations, English queries, and their relevance annotations.
The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.
2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
