December 2024 Newsletter - LDC - Corpora

16 Dec 2024


      In this newsletter:
LDC 2025 membership discounts now available
Approaching deadline for Spring 2025 data scholarship applications
LDC closed for Winter Break December 25-January 1
New publications:
MATERIAL Farsi-English Language Pack <https://catalog.ldc.upenn.edu/LDC2024S13>
Abstract Meaning Representation 3.0 - Machine Translations <https://catalog.ldc.upenn.edu/LDC2024T11>
________________________________
LDC 2025 membership discounts now available
Now through March 3, 2025, current 2024 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC <https://www.ldc.upenn.edu/members/join-ldc> for details on membership options and benefits.
Approaching deadline for Spring 2025 data scholarship applications
Attention students: don't miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2025 data scholarships are due January 15, 2025. For more information on requirements and program rules, see LDC Data Scholarships <https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
LDC closed for Winter Break December 25-January 1
LDC will be closed from Wednesday, December 25, 2024, through Wednesday, January 1, 2025, in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Thursday, January 2, 2025. Requests received by the Membership Office during Winter Break will be processed when the office reopens.
________________________________
New publications:
MATERIAL Farsi-English Language Pack <https://catalog.ldc.upenn.edu/LDC2024S13> was developed by Appen <http://www.appen.com/> for the IARPA MATERIAL <https://www.iarpa.gov/index.php/research-programs/material> program and contains 61 hours of Farsi conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, and approximately 3% of the speech files were translated into English. This release also includes English queries and their relevance annotations.
The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.
2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
*
Abstract Meaning Representation 3.0 - Machine Translations <https://catalog.ldc.upenn.edu/LDC2024T11> was developed by the Center for Computational Linguistics at KU Leuven <https://www.arts.kuleuven.be/ling/ccl> in the HORIZON2020 <https://research-and-innovation.ec.europa.eu/funding/funding-opportunities/funding-programmes-and-open-calls/horizon-2020_en> project SignON <https://cordis.europa.eu/project/id/101017255>. It is an automatic translation of a subset of sentences from Abstract Meaning Representation (AMR) Annotation Release 3.0 (LDC2020T02) <https://catalog.ldc.upenn.edu/LDC2020T02> into Spanish, Irish Gaelic, and Dutch.
AMR 3.0 training, development, and test splits were translated using Google Translate <https://translate.google.com>. "Unsplit" directories were not translated and are not included in this release. Translations were not manually verified, but formal issues (such as unexpected new lines) were corrected, and special tokens and encoding issues were fixed with the Python tool ftfy.fix_text <https://ftfy.readthedocs.io/en/latest/>.
AMR 3.0 is a semantic treebank of over 59,000 English natural language sentences drawn from material collected by LDC: discussion forum text from the DARPA BOLT and DARPA DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming, Wall Street Journal text, translated Xinhua news texts, various newswire texts from NIST OpenMT evaluations, and weblog data from the DARPA GALE program.
2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104