Dear colleagues,
We are pleased to announce the first call for papers of the *1st Workshop on Multilingual Data Quality Signals at COLM 2025*
Important information:
🗓️ CfP Deadline: June 23, Workshop: October 10
📍 Montréal, Canada
🌐 https://wmdqs.org
Scope
Recent research has shown that large language models (LLMs) not only need large quantities of data, but also need data of sufficient quality. Ensuring data quality is even more important in a multilingual setting, where the amount of acceptable training data in many languages is limited. Indeed, for many languages even the fundamental step of language identification remains a challenge, leading to unreliable language labels and thus noisy datasets for underserved languages.
In response to these challenges, we will be holding the first Workshop on Multilingual Data Quality Signals (WMDQS) in tandem with COLM. We invite the submission of long and short research papers related to data quality in multilingual data.
Even though most previous work on data quality has been targeted at LLM development, we believe that research in this area can also benefit other research communities in areas such as web search, web archiving, corpus linguistics, digital humanities, political sciences and beyond. We therefore encourage submissions from a wide range of disciplines.
WMDQS will also include a shared task on language identification for web text. We invite participants to submit novel systems which address current problems with language identification for web text. We will provide a training set of annotated documents sourced from Common Crawl to aid development.
Topics
We welcome submissions of (1) original research papers, (2) review/opinion papers, (3) online systems on the topics listed below, and (4) extended abstracts. We especially welcome work-in-progress projects and all novel ideas covering research in multilinguality, underserved/low-resource languages, under-represented linguistic communities and all types of work covering data quality signals. Suggested areas include:
- Data pipelines for data annotation and data filtering
- Undesirable content detection in a multilingual setting
- Multilingual or language independent content ranking
- Human annotation platforms and systems
- Multilingual tokenization mechanisms
- Small language models and embeddings
- Linguistic studies in underserved languages
- Corpus creation and curation methods, especially for underserved languages
- Machine translation
- Digital humanities
- Historical and constructed languages
Shared task
The lack of training data, especially high-quality data, is the root cause of poor language model performance for many languages. One obstacle to improving the quantity and quality of available text data is language identification (LangID or LID). LangID remains far from solved for many languages, and several of the commonly used LangID models were introduced in 2017 (e.g. fastText and CLD3). The aim of this shared task is to encourage innovation in open-source language identification and to improve accuracy on a broad range of languages.
All accepted authors will be invited to contribute a larger paper, which will be submitted to a high-impact NLP venue.
Important dates for the Workshop:
- Workshop paper submission deadline: June 23, 2025
- Workshop paper acceptance notification: July 24, 2025
- Workshop: October 10, 2025
Important dates for the Shared Task:
- 1st Deadline to contribute annotations: July 7, 2025
- 1st Annotations released (train split): July 14, 2025
- Abstract Deadline: July 21, 2025
- Decision Notification: July 24, 2025
- Camera Ready Deadline: September 21, 2025
(All deadlines are 23:59 AoE.)
Organizers:
For any questions, please email wmdqs-pcs@googlegroups.com
Program Chairs:
- Pedro Ortiz Suarez (Common Crawl Foundation)
- Sarah Luger (MLCommons)
- Laurie Burchell (Common Crawl Foundation)
- Kenton Murray (Johns Hopkins University)
- Catherine Arnett (EleutherAI)
Organizing Committee:
- Thom Vaughan (Common Crawl Foundation)
- Sara Hincapié (Factored)
- Rafael Mosquera (MLCommons)