[Corpora-List] September 2023 Newsletter - LDC

15 Sep 2023


In this newsletter:
LDC data and commercial technology development
New publications:
CALLFRIEND Russian Speech <https://catalog.ldc.upenn.edu/LDC2023S08>
CALLFRIEND Russian Text <https://catalog.ldc.upenn.edu/LDC2023T09>
________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing <https://www.ldc.upenn.edu/data-management/using/licensing> page for further information.
________________________________
New publications:
CALLFRIEND Russian Speech <https://catalog.ldc.upenn.edu/LDC2023S08> was developed by LDC and consists of 48 hours of telephone conversations (100 recordings) between native speakers of Russian. The calls were recorded in 1999 as part of the CALLFRIEND collection, a project designed primarily to support research in automatic language identification. One hundred native Russian speakers living in the continental United States each made a single phone call, lasting up to 30 minutes, to a family member or friend living in the United States.
All recordings involved domestic calls routed through LDC's automated telephone collection platform and stored as 2-channel (4-wire) 8-kHz mu-law samples taken directly from a public telephone network via a T-1 circuit. Each audio file is a FLAC-compressed MS-WAV (RIFF) file containing 2-channel, 8-kHz, 16-bit PCM sample data.
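As a rough illustration of the audio format described above, the following sketch reads the header of an already-decoded WAV file and compares it with the documented parameters (2 channels, 8 kHz, 16-bit PCM). It assumes the FLAC file has first been decompressed to plain WAV with an external tool (for example the reference `flac` decoder); the function name and the returned dictionary keys are hypothetical and not part of the corpus documentation.

```python
import wave


def check_callfriend_wav(source) -> dict:
    """Read a decoded WAV header and compare it with the documented format.

    `source` may be a file path or a binary file-like object. The key names
    below are illustrative choices, not anything defined by LDC.
    """
    with wave.open(source, "rb") as wav:
        info = {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    # Documented format: 2-channel, 8-kHz, 16-bit PCM.
    info["matches_documented_format"] = (
        info["channels"] == 2
        and info["sample_rate_hz"] == 8000
        and info["bits_per_sample"] == 16
    )
    return info
```

A call such as `check_callfriend_wav("example_call.wav")` (a hypothetical file name) would return the header values along with a flag indicating whether they agree with the description above.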
This release includes call metadata, including speaker gender, the number of speakers on each channel, and call duration.
Corresponding transcripts and a lexicon are available in CALLFRIEND Russian Text (LDC2023T09 <http://catalog.ldc.upenn.edu/LDC2023T09>).
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
CALLFRIEND Russian Text <https://catalog.ldc.upenn.edu/LDC2023T09> contains the corresponding transcripts and a lexicon for CALLFRIEND Russian Speech <https://catalog.ldc.upenn.edu/LDC2023S08>, that is, 48 hours of telephone conversations (100 recordings) between native Russian speakers.
The transcripts have four main fields on each line (begin_offset, end_offset, speaker_label, transcript_text) separated by tabs. Each transcript file contains a list of time-stamped segments ordered by their begin_offset values, with no blank lines.
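The transcript layout described above could be read along the following lines. This is a minimal sketch, not LDC's own tooling: it assumes the offsets are decimal numbers that `float()` can parse, and it folds anything after the third tab into the text field. The `Segment` class and `parse_transcript` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    begin: float   # begin_offset (assumed to be a decimal number)
    end: float     # end_offset (assumed to be a decimal number)
    speaker: str   # speaker_label
    text: str      # transcript_text


def parse_transcript(lines: Iterable[str]) -> List[Segment]:
    """Parse tab-separated transcript lines into ordered segments."""
    segments: List[Segment] = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Split on the first three tabs only, so the text field stays whole.
        fields = line.split("\t", 3)
        if len(fields) != 4:
            # Blank or malformed lines end up here.
            raise ValueError(
                f"line {lineno}: expected 4 tab-separated fields, got {len(fields)}"
            )
        begin, end, speaker, text = fields
        segment = Segment(float(begin), float(end), speaker, text)
        # Segments are documented as ordered by begin_offset.
        if segments and segment.begin < segments[-1].begin:
            raise ValueError(f"line {lineno}: begin_offset out of order")
        segments.append(segment)
    return segments
```

With an open transcript file `f` (read as text in the appropriate encoding), `parse_transcript(f)` would return the segments in file order and raise `ValueError` on a blank or out-of-order line.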
The lexicon covers the word forms in the 97 transcript files. The main lexicon table contains three columns per row: Cyrillic orthography, phonetic transliteration, and numeric representation of syllabic stress.
Corresponding speech data is available as CALLFRIEND Russian Speech (LDC2023S08 <http://catalog.ldc.upenn.edu/LDC2023S08>).
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
