April 2024 Newsletter - LDC - Corpora

15 Apr 2024


In this newsletter:
New publications:
LoReHLT Hausa Representative Language Pack (https://catalog.ldc.upenn.edu/LDC2024T03)
AIDA Scenario 2 Practice Topic Source Data (https://catalog.ldc.upenn.edu/LDC2024T04)
________________________________
New publications:
LoReHLT Hausa Representative Language Pack (https://catalog.ldc.upenn.edu/LDC2024T03) was developed by LDC and consists of approximately 4.4 million words of Hausa monolingual text, 86,000 Hausa words translated from English data, and 30 minutes of Hausa audio recordings. Approximately 96,000 words were annotated for named entities, and over 13,000 words were annotated for full entity, including nominals and pronouns. Noun-phrase chunking was applied to more than 7,400 words. Over 9,600 words were labeled with simple semantic annotation. Topic annotation was applied to the audio recordings. Data was collected from discussion forums, news, reference sources, social networks, amateur web audio recordings, and weblogs.
LoReHLT was a companion project of the DARPA LORELEI program. The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
AIDA Scenario 2 Practice Topic Source Data (https://catalog.ldc.upenn.edu/LDC2024T04) was developed by LDC and consists of 1,500 root documents (text, image, and video) from English, Russian, and Spanish web sources. Each phase of the AIDA program centered on a specific scenario, or broad topic area, with related subtopics designated as either practice subtopics or evaluation subtopics. The Phase 2 scenario focused on the socioeconomic and political crisis in Venezuela since 2010. This corpus constitutes the full set of topic-focused documents for Phase 2 practice subtopics.
The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.
The knowledge base for entity detection and linking annotation for all AIDA Scenario 1 and 2 corpora is available separately as AIDA Scenario 1 and 2 Reference Knowledge Base (LDC2023T10, https://catalog.ldc.upenn.edu/LDC2023T10).
2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account (https://catalog.ldc.upenn.edu/login) and uncheck the box next to "Receive Newsletter" under Account Options, or contact LDC for assistance.