Dear colleagues,
We are pleased to announce the last call for participation for the 1st Shared Task on Language Identification for Web Data at WMDQS/COLM 2025.
Important information:
🗓️ Registration Deadline: July 23 (AoE)
📍 Montréal, Canada
🌐 https://wmdqs.org/shared-task/
Registration: To register, please submit a one-page document with a title, a list of authors, a provisional list of languages you want to focus on, and a brief description of your approach. This document should be sent to wmdqs-pcs@googlegroups.com. You may change the list of languages or the system description during the shared task; this document's only purpose is to register your participation. The shared task will run until the last week of September.
Motivation: The lack of training data, especially high-quality data, is the root cause of poor language model performance for many languages. One obstacle to improving the quantity and quality of available text data is language identification (LangID or LID), which remains far from solved for many languages. Several of the most commonly used LangID models (e.g., fastText and CLD3) were introduced in 2017. The aim of this shared task is to encourage innovation in open-source language identification and improve accuracy on a broad range of languages.
All participants will be invited to contribute to a larger paper, which will be submitted to a high-impact NLP venue.
Description: The main shared task is to submit LangID models that perform well across a wide variety of languages on web data. We encourage participants to employ a range of approaches, including the development of new architectures and the curation of novel high-quality annotated datasets.
We recommend using the GlotLID corpus as a starting point for training data. Access to the data will be managed through the Hugging Face repository. Please note that this data should not be redistributed. We will use the same language label format as GlotLID: an ISO 639-3 language code plus an ISO 15924 script code, separated by an underscore (e.g., eng_Latn for English in Latin script).
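For illustration only, here is a minimal Python sketch of how a submission might parse and validate labels in this format. The helper name and the regular expression are our own illustration, not part of the task specification; the label examples (eng_Latn, arb_Arab, cmn_Hani) follow the GlotLID convention described above.

import re

# GlotLID-style label: an ISO 639-3 language code (three lowercase letters),
# an underscore, and an ISO 15924 script code (one uppercase letter followed
# by three lowercase letters), e.g. "eng_Latn" or "arb_Arab".
LABEL_PATTERN = re.compile(r"^([a-z]{3})_([A-Z][a-z]{3})$")

def parse_label(label: str) -> tuple[str, str]:
    """Split a label such as 'eng_Latn' into (language, script) codes."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Not a valid <ISO 639-3>_<ISO 15924> label: {label!r}")
    return match.group(1), match.group(2)

if __name__ == "__main__":
    for example in ["eng_Latn", "arb_Arab", "cmn_Hani"]:
        lang, script = parse_label(example)
        print(f"{example}: language={lang}, script={script}")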
Although all systems will be evaluated on the full range of languages in our test set, we encourage submissions that focus on a particular language or set of languages, especially if those languages present particular challenges for language identification.
The shared task will take place in rounds. The first round will only include data from existing datasets; subsequent rounds will include data annotated by the community as it is collected and processed. More languages will also be added in subsequent rounds.
Organizers: For any questions, please send an email to wmdqs-pcs@googlegroups.com
Program Chairs:
Pedro Ortiz Suarez (Common Crawl Foundation)
Sarah Luger (MLCommons)
Laurie Burchell (Common Crawl Foundation)
Kenton Murray (Johns Hopkins University)
Catherine Arnett (EleutherAI)
Organizing Committee:
Thom Vaughan (Common Crawl Foundation)
Sara Hincapié (Factored)
Rafael Mosquera (MLCommons)