In this newsletter:
Fall 2022 LDC Data Scholarship Program
30th Anniversary Highlight: ATIS0 Complete
New publications:
Qatari Corpus of Argumentative Writing <https://catalog.ldc.upenn.edu/LDC2022T04>
Second DIHARD Challenge Evaluation - SEEDLingS <https://catalog.ldc.upenn.edu/LDC2022S07>
________________________________
Fall 2022 LDC Data Scholarship Program
Student applications for the Fall 2022 LDC Data Scholarship program are being accepted now through September 15, 2022. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and a letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page <https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
30th Anniversary Highlight: ATIS0 Complete
The ATIS corpora were among the first publications that appeared with the launch of LDC's catalog in 1993. ATIS0 Complete (LDC93S4A) <https://catalog.ldc.upenn.edu/LDC93S4A> consists of the spontaneous speech, read speech, and other material from participants in the ATIS collection that is contained in ATIS0 Pilot (LDC93S4B) <http://catalog.ldc.upenn.edu/LDC93S4B>, ATIS0 Read (LDC93S4B-2) <http://catalog.ldc.upenn.edu/LDC93S4B-2>, and ATIS0 SD-Read (LDC93S4B-3) <http://catalog.ldc.upenn.edu/LDC93S4B-3>.
The ATIS (Air Travel Information Services) collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory for Computer Science, National Institute of Standards and Technology, and SRI International.
The ATIS collection has been widely used to further research in spoken language understanding and slot filling (Kuo et al., 2020 <https://arxiv.org/pdf/2009.14386.pdf>). Other data sets published from the collection include ATIS2 (LDC93S5) <https://catalog.ldc.upenn.edu/LDC93S5>, ATIS3 Training and Test Data (LDC94S19 <https://catalog.ldc.upenn.edu/LDC94S19>, LDC95S26 <https://catalog.ldc.upenn.edu/LDC95S26>), and, more recently, Multilingual ATIS (LDC2019T04) <https://catalog.ldc.upenn.edu/LDC2019T04> and ATIS - Seven Languages (LDC2021T04) <https://catalog.ldc.upenn.edu/LDC2021T04>.
All ATIS corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data <https://www.ldc.upenn.edu/language-resources/data/obtaining> for more information.
________________________________
New publications:
(1) Qatari Corpus of Argumentative Writing <https://catalog.ldc.upenn.edu/LDC2022T04> was developed by Qatar University <http://www.qu.edu.qa/>, University of Exeter <https://www.exeter.ac.uk/>, and Hamad Bin Khalifa University <https://www.hbku.edu.qa/en>. It consists of approximately 200,000 tokens of Arabic and English writing by undergraduate students (159 female, 36 male), along with annotations and related metadata. Students were native Arabic speakers and fluent in English; each student wrote one Arabic and one English essay in response to specific argumentative prompts. They were instructed to include in their essays a clear thesis statement supported by relevant evidence.
The corpus is divided into Arabic and English parts, each of which contains 195 essays. Metadata includes information about the students (gender, major, first language, second language) and information about the essay texts (serial numbers of texts, word limits, genre, date of writing, time spent on writing, place of writing).
Qatari Corpus of Argumentative Writing is distributed via web download.
2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) Second DIHARD Challenge Evaluation - SEEDLingS <https://catalog.ldc.upenn.edu/LDC2022S07> was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Second DIHARD Challenge <https://dihardchallenge.github.io/dihard2>.
Source data is from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus <https://homebank.talkbank.org/access/Password/Bergelson.html>, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings were generated in the home environment of infants in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First and Second DIHARD Challenges.
Second DIHARD Challenge Evaluation - SEEDLingS is distributed via web download.
2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810, Philadelphia, PA 19104