December 2022 Newsletter - LDC - Corpora

15 Dec 2022


In this newsletter:
LDC 2023 membership discounts now available
Approaching deadline for Spring 2023 data scholarship applications
30th Anniversary Highlight: AMR
________________________________
New publications:
CAMIO Transcription Languages (https://catalog.ldc.upenn.edu/LDC2022T07)
Global TIMIT Thai (https://catalog.ldc.upenn.edu/LDC2022S13)
Third DIHARD Challenge Evaluation (https://catalog.ldc.upenn.edu/LDC2022S14)
LDC 2023 membership discounts now available
Now through March 1, 2023, current 2022 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC (https://www.ldc.upenn.edu/members/join-ldc) for details on membership options and benefits.
Approaching deadline for Spring 2023 data scholarship applications
Attention students: don't miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2023 data scholarships are due January 15, 2023. For more information on requirements and program rules, see LDC Data Scholarships (https://www.ldc.upenn.edu/language-resources/data/data-scholarships).
30th Anniversary Highlight: AMR
Abstract Meaning Representation (AMR) annotation was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group, and the Information Sciences Institute at the University of Southern California. It is a semantic representation language that captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
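As a rough illustration of "who is doing what to whom", the sketch below encodes the AMR commonly used in introductory material for the sentence "The boy wants to go". The sentence is not taken from this newsletter, and the helper names (roles_by_predicate, reentrant_variables) are hypothetical, written only for this example. It assumes the graph is stored as plain (source, role, target) triples rather than parsed with any particular AMR library.

```python
from collections import defaultdict

# PENMAN-style notation for "The boy wants to go" (shown for reference only).
PENMAN = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
"""

# The same graph as (source, role, target) triples.
# ":instance" links a variable to its concept; other roles link variables.
TRIPLES = [
    ("w", ":instance", "want-01"),
    ("b", ":instance", "boy"),
    ("g", ":instance", "go-01"),
    ("w", ":ARG0", "b"),
    ("w", ":ARG1", "g"),
    ("g", ":ARG0", "b"),
]


def concepts(triples):
    """Map each variable to the concept it instantiates."""
    return {src: tgt for src, role, tgt in triples if role == ":instance"}


def roles_by_predicate(triples):
    """For each concept with outgoing roles, list role -> filler concept."""
    names = concepts(triples)
    result = defaultdict(dict)
    for src, role, tgt in triples:
        if role != ":instance":
            result[names[src]][role] = names[tgt]
    return dict(result)


def reentrant_variables(triples):
    """Variables with more than one incoming role (within-sentence coreference)."""
    incoming = defaultdict(int)
    for src, role, tgt in triples:
        if role != ":instance":
            incoming[tgt] += 1
    return sorted(var for var, count in incoming.items() if count > 1)


if __name__ == "__main__":
    print(roles_by_predicate(TRIPLES))
    print(reentrant_variables(TRIPLES))
```

Here the variable b (the boy) fills the ARG0 role of both want-01 and go-01, so the same participant is the one who wants and the one who goes. That shared node is how this encoding expresses within-sentence coreference without repeating the concept.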
LDC's Catalog contains three cumulative English AMR publications: Release 1.0 (LDC2014T12, https://catalog.ldc.upenn.edu/LDC2014T12), Release 2.0 (LDC2017T10, https://catalog.ldc.upenn.edu/LDC2017T10), and Release 3.0 (LDC2020T02, https://catalog.ldc.upenn.edu/LDC2020T02). The combined result in AMR 3.0 is a semantic treebank of roughly 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction, and web text, and includes multi-sentence annotations.
LDC has also published Chinese Abstract Meaning Representation 1.0 (LDC2019T07, https://catalog.ldc.upenn.edu/LDC2019T07) and 2.0 (LDC2021T13, https://catalog.ldc.upenn.edu/LDC2021T13), developed by Brandeis University and Nanjing Normal University. These corpora contain AMR annotations for approximately 20,000 sentences from Chinese Treebank 8.0 (LDC2013T21, https://catalog.ldc.upenn.edu/LDC2013T21). Chinese AMR follows the basic principles developed for English, making adaptations where necessary to accommodate Chinese phenomena.
Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07, https://catalog.ldc.upenn.edu/LDC2020T07), developed by the University of Edinburgh, School of Informatics, consists of Spanish, German, Italian, and Mandarin Chinese translations of a subset of sentences from AMR 2.0.
Visit LDC's Catalog (https://catalog.ldc.upenn.edu/) for more details about these publications.
________________________________
New publications:
CAMIO Transcription Languages (https://catalog.ldc.upenn.edu/LDC2022T07) was developed by LDC and contains nearly 70,000 images of machine-printed text with corresponding annotations and transcripts in 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese. This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition and related technologies for 35 languages across 24 unique script types.
Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes. In addition, 1250 images per language were annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in an XML output format defined for this corpus. Data for each language is partitioned into test, train, and validation sets.
2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Global TIMIT Thai (https://catalog.ldc.upenn.edu/LDC2022S13) consists of 12 hours of read speech and time-aligned transcripts in Standard Thai from 50 speakers (33 female, 17 male) reading 120 sentences selected from the Thai National Corpus (https://www.arts.chula.ac.th/ling/tnc/), the Thai Junior Encyclopedia (https://www.au.edu/royal-activities/the-thai-encyclopedia-for-youth-project.html), and Thai Wikipedia, for a total of 6000 utterances. Data was collected in 2016. Speakers were recruited in the Bangkok metropolitan area; they were native Thais, fluent in Standard Thai, and literate.
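The figures above can be cross-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes every speaker read all 120 sentences (which is consistent with the stated total of 6000 utterances) and treats the 12 hours as an approximate total that includes any silence in the recordings. The function names are hypothetical.

```python
# Figures as stated in the corpus description.
SPEAKERS = 50
SENTENCES_PER_SPEAKER = 120
TOTAL_HOURS = 12  # approximate, as reported


def total_utterances(speakers, sentences_per_speaker):
    """Utterance count if every speaker reads every sentence once."""
    return speakers * sentences_per_speaker


def mean_utterance_seconds(total_hours, utterances):
    """Average recording time per utterance, in seconds."""
    return total_hours * 3600 / utterances


if __name__ == "__main__":
    n = total_utterances(SPEAKERS, SENTENCES_PER_SPEAKER)
    print(n)
    print(mean_utterance_seconds(TOTAL_HOURS, n))
```

Under these assumptions the speaker and sentence counts reproduce the 6000 utterances, and each utterance averages a little over seven seconds of audio.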
This data set was developed as part of LDC's Global TIMIT project, which aims to create a series of corpora in a variety of languages with key features similar to those of the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1, https://catalog.ldc.upenn.edu/LDC93S1). The original TIMIT was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems.
2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Third DIHARD Challenge Evaluation (https://catalog.ldc.upenn.edu/LDC2022S14) was developed by LDC and contains 33 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge (https://dihardchallenge.github.io/dihard3).
The Third DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.
2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account (https://catalog.ldc.upenn.edu/login) and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104