August 2023 Newsletter - LDC - Corpora

15 Aug 2023


In this newsletter:
LDC at Interspeech 2023
LDC releases speech activity detector
Fall 2023 LDC Data Scholarship Program
New publications:
2019 OpenSAT Public Safety Communications Simulation (https://catalog.ldc.upenn.edu/LDC2023S06)
Samrómur Queries Icelandic Speech 1.0 (https://catalog.ldc.upenn.edu/LDC2023S05)
________________________________
LDC at Interspeech 2023
LDC is happy to be back in person as an exhibitor and longtime supporter of Interspeech, taking place this year August 20-24 in Dublin, Ireland. Stop by Stand A2 to say hello and learn about the latest developments at the Consortium. LDC is also delighted to once again be a silver sponsor for the Young Female Researchers in Speech Workshop (https://sites.google.com/view/yfrsw-2023) and to provide data in support of the CHiME-7 challenge satellite workshop (https://www.chimechallenge.org/current/workshop/index) and the MERLIon CCS Challenge (https://sites.google.com/view/merlion-ccs-challenge).
LDC will post conference updates via our social media platforms. We look forward to seeing you in Dublin!
LDC releases speech activity detector
LDC announces the release of the LDC Broad Phonetic Class Speech Activity Detector. Based on the broad phonetic class recognizer implemented in the HTK Speech Recognition Toolkit (https://htk.eng.cam.ac.uk/), LDC's speech activity detector model runs the speech signal through a GMM-HMM recognizer to identify five broad phonetic classes: vowel, stop/affricate, fricative, nasal, and glide/liquid. The LDC Broad Phonetic Class Speech Activity Detector is available at no cost on GitHub (https://github.com/Linguistic-Data-Consortium/ldc-bpcsad) under a GPL v3 license (https://www.gnu.org/licenses/gpl-3.0.en.html).
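For readers curious how broad phonetic class labels could yield speech activity regions, the following is a minimal illustrative sketch, not the ldc-bpcsad implementation or its API. It assumes a recognizer has already produced time-stamped segments labeled with one of the five classes above or with a hypothetical "nonspeech" label. The function name, label strings, and max_gap parameter are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed label strings for the five broad phonetic classes named above.
# Any other label (e.g. a hypothetical "nonspeech") is treated as non-speech.
SPEECH_CLASSES = {"vowel", "stop/affricate", "fricative", "nasal", "glide/liquid"}

Segment = Tuple[float, float, str]  # (onset in seconds, offset in seconds, label)


def merge_speech_segments(
    segments: List[Segment], max_gap: float = 0.0
) -> List[Tuple[float, float]]:
    """Collapse phonetic-class segments into (onset, offset) speech regions.

    Consecutive speech-class segments separated by at most max_gap seconds
    are merged into a single region.
    """
    speech: List[Tuple[float, float]] = []
    for onset, offset, label in sorted(segments):
        if label not in SPEECH_CLASSES:
            continue  # skip anything that is not one of the five classes
        if speech and onset - speech[-1][1] <= max_gap:
            # Extend the previous region instead of starting a new one.
            speech[-1] = (speech[-1][0], max(speech[-1][1], offset))
        else:
            speech.append((onset, offset))
    return speech


if __name__ == "__main__":
    example = [
        (0.0, 0.5, "nonspeech"),
        (0.5, 0.7, "fricative"),
        (0.7, 1.0, "vowel"),
        (1.0, 1.6, "nonspeech"),
        (1.6, 1.9, "nasal"),
    ]
    print(merge_speech_segments(example))
```

A larger max_gap bridges short pauses between speech regions; the choice of value here is arbitrary and would need tuning for real recordings.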
Fall 2023 LDC Data Scholarship Program
Student applications for the Fall 2023 LDC Data Scholarship program are being accepted now through September 15, 2023. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page (https://www.ldc.upenn.edu/language-resources/data/data-scholarships).
________________________________
New publications:
2019 OpenSAT Public Safety Communications Simulation (https://catalog.ldc.upenn.edu/LDC2023S06) contains 141 hours of English speech recordings and transcripts used in the NIST Open Speech Analytic Technologies (OpenSAT, https://www.nist.gov/itl/iad/mig/opensat) 2019 evaluation's automatic speech recognition, speech activity detection, and keyword search tasks. The data is part of the SAFE-T (Speech Analysis For Emergency Response Technology) corpus created by LDC, which features speakers engaged in a collaborative problem-solving activity representative of public safety communications in terms of speech content, noise types, and noise levels.
US English speakers played the board game Flash Point Fire Rescue. Background noise was played through a participant's headset during the recording session. Each recording session consisted of two 30-minute games. The corpus is divided into training, development, and evaluation data.
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Samrómur Queries Icelandic Speech 1.0 (https://catalog.ldc.upenn.edu/LDC2023S05) was developed by the Language and Voice Lab, Reykjavik University (https://lvl.ru.is/) in cooperation with Almannarómur, Center for Language Technology (https://almannaromur.is/). The corpus contains 20 hours of Icelandic prompted queries, totaling 17,475 utterances from 3,809 speakers.
Speech data was collected between October 2019 and December 2021 using the Samrómur website (https://samromur.is), which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus (http://clarin.is/en/resources/gigaword), which includes text from novels, news, and plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science (https://www.visindavefur.is/), and others were created by combining a name with a following question or demand. Prompts and speaker metadata are included in the corpus.
2023 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account (https://catalog.ldc.upenn.edu/login) and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Linguistic Data Consortium (ldc.upenn.edu)
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810, Philadelphia, PA 19104