In this newsletter:
New publications:
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio <https://catalog.ldc.upenn.edu/LDC2025S04>
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations <https://catalog.ldc.upenn.edu/LDC2025T05>
________________________________
New publications:
BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio <https://catalog.ldc.upenn.edu/LDC2025S04> was developed by LDC and consists of 93 hours of speech from 236 unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in mainland China. The calls were collected by LDC in the CALLFRIEND and CALLHOME series where participants called family members or close friends and spoke on topics of their choice. Around 60% of the recordings (141 calls) are publicly released for the first time. The remaining 95 recordings were previously published by LDC in various CALLFRIEND, CALLHOME, and HUB5 Mandarin datasets. The data is divided into training, development, and evaluation partitions.
The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, conversational telephone speech, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The material in this release represents the unannotated Chinese source conversational telephone speech. The telephone data was transcribed, translated, and annotated for various tasks in the BOLT program including word alignment, treebanking, and co-reference.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations <https://catalog.ldc.upenn.edu/LDC2025T05> contains transcripts and corresponding English translations for the conversational telephone speech in BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio <https://catalog.ldc.upenn.edu/LDC2025S04> and was developed by LDC to support the DARPA BOLT program.
Transcribers were required to produce a verbatim transcript of all speech within a file using simplified Chinese orthography and to add minimal markup to capture salient features of the speech. Some transcripts include redactions for potential personally identifying information. All speech data was transcribed and is divided into training, development, and evaluation partitions.
The goal of the BOLT translation task was to translate the Chinese transcripts into fluent English while preserving the meaning present in the original Chinese text. Transcripts in the development and evaluation partitions received first-pass and gold-standard translations. 89% of the transcripts were translated into English.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810 Philadelphia, PA 19104