In this newsletter:
LDC at LREC-COLING 2024

New publications:
Call My Net 1 (https://catalog.ldc.upenn.edu/LDC2024S05)
Automatic Content Extraction for Portuguese (https://catalog.ldc.upenn.edu/LDC2024T05)
________________________________
LDC at LREC-COLING 2024
LDC will be exhibiting at LREC-COLING 2024 (https://lrec-coling-2024.org/), hosted by the European Language Resources Association (ELRA) and the International Committee on Computational Linguistics (ICCL), May 20-25 in Turin, Italy. Stop by our table to learn more about recent developments at the Consortium and the latest publications.
LDC staff members will also be presenting current work on topics including Spanless Event Annotation for Corpus-Wide Complex Event Understanding; Schema Learning Corpus: Data and Annotation Focused on Complex Events; and KoFREN: Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech Corpora.
LDC will post conference updates via social media. We look forward to seeing you in Italy!
________________________________
New publications:
Call My Net 1 (https://catalog.ldc.upenn.edu/LDC2024S05) was developed by LDC and contains 364 hours of conversational telephone speech in four languages (Tagalog, Cebuano, Cantonese, and Mandarin) collected in 2015 from 221 native speakers located in the Philippines and China, along with metadata and speaker demographic information. Recordings and data from this collection were used to support the NIST 2016 Speaker Recognition Evaluation (https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016).
Speakers made 10 telephone calls each to people within their existing social networks, using different handsets and under a variety of noise conditions. Speakers were connected through a robot operator to carry on casual conversations on topics of their choice. All recordings were manually audited to confirm language and speaker requirements. The documentation for this release includes metadata about phone type, noise conditions, and call quality. Speaker demographic information on year of birth, sex, and native language is also included.
2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Automatic Content Extraction for Portuguese (https://catalog.ldc.upenn.edu/LDC2024T05) was developed at INESC TEC - Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência (https://www.inesctec.pt/en) and consists of automatic Brazilian Portuguese and European Portuguese translations of the English text and annotations in ACE 2005 Multilingual Training Corpus (LDC2006T06) (https://catalog.ldc.upenn.edu/LDC2006T06).
ACE 2005 Multilingual Training Corpus was developed by LDC to support the Automatic Content Extraction (ACE) program (https://www.ldc.upenn.edu/collaborations/past-projects/ace), specifically by providing training data for the 2005 technology evaluation. It contains 1,800 files of mixed-genre text in Arabic, English, and Chinese annotated for entities, relations, and events. The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form. Text genres included newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech.
For this translation, the English data was partitioned into training, development, and test sets. The documents were split into sentences and each event mention was assigned to its sentence. Source sentences and their annotations were translated into Brazilian Portuguese using Google Translate (https://translate.google.com/) and into European Portuguese using DeepL Translate (https://www.deepl.com/en/translator). An alignment algorithm and a parallel corpus word aligner were used to handle mismatches between translated annotations and their translated sentences.
2024 members can access this corpus through their LDC account. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account (https://catalog.ldc.upenn.edu/login) and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium (ldc.upenn.edu)
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810 Philadelphia, PA 19104