We invite researchers and students to participate in CoLI-Dravidian: Word-level Code-Mixed Language Identification in Dravidian Languages, a shared task held at the 16th
meeting of the Forum for Information Retrieval Evaluation (FIRE 2024).
Language Identification (LI) involves detecting the language(s) used in a given text, which is a preliminary step for many applications such as sentiment analysis, machine translation, information retrieval,
and natural language understanding. In multilingual India, especially among the youth, social media often features code-mixed text that blends local languages with English at various levels. This poses significant challenges for LI, particularly when
languages are mixed within a single word. Dravidian languages, widely spoken in southern India, remain under-resourced despite their rich morphological structure. They also face technological challenges, especially in script representation on digital
platforms, which leads users to prefer Roman or hybrid scripts for communication. This widespread code-mixing yields abundant linguistic data for research, yet it remains understudied.
To address the challenges of word-level LI in code-mixed Dravidian languages, we are organizing a shared task that provides code-mixed datasets for four languages (Kannada, Tamil, Malayalam, and Tulu) to encourage
the development of advanced LI models.
A real-time leaderboard will be available, and participants may make up to 10 submissions in the
training phase and up to 5 submissions in the testing phase through CodaLab. Each team must select its best submission for ranking.