LLMs with Limited Resources for Slavic Languages @ WMT2025 @ EMNLP2025
Website: https://www2.statmt.org/wmt25/limited-resources-slavic-llm.html
Join our Google Group! https://groups.google.com/g/slavic-llms-mt2025
HuggingFace Collection: https://huggingface.co/collections/tum-nlp/llms-for-slavic-languages-67f3993...
This shared task explores how LLMs perform on MT and QA jointly, aiming to investigate task synergy under limited data and compute resources. Ukrainian (uk) is a mid-resource language (~40M L1 speakers), while Upper Sorbian (hsb) and Lower Sorbian (dsb) are minority West Slavic languages (30k and 7k L1 speakers, respectively) spoken in Germany.
Data Overview
Ukrainian
- MT directions: en→uk, cs→uk
- QA: Derived from high-school graduation exams (ZNO)
- Example training sets:
  - MT: WMT24++ https://huggingface.co/datasets/google/wmt24pp, SMOL https://huggingface.co/datasets/google/smol
  - QA: UNLP2024 https://huggingface.co/datasets/osyvokon/zno, ZNO-EVAL https://github.com/NLPForUA/ZNO, Cohere INCLUDE https://huggingface.co/datasets/CohereForAI/include-base-44
Upper Sorbian & Lower Sorbian (two separate tracks)
- MT directions: de→hsb, de→dsb
- QA: Multiple-choice questions based on actual CEFR-based language certification exams (A1–C1 levels)
- We will prepare the following resources:
  - Parallel & monolingual corpora via Witaj-Sprachzentrum and Leipzig Corpora Collection;
  - Previous WMT low-resource tracks (2020–2022);
  - QA task adapted from language certifications of different levels.
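Since a single model has to handle both tasks, it can help to route MT and QA inputs through one pair of prompt builders. The sketch below is only an illustration: the MT directions come from the data overview above, while the prompt wording, the Latin option letters, and the function names mt_prompt and qa_prompt are our own assumptions and not an official task format.

```python
# MT directions per language track, as listed in the data overview.
MT_DIRECTIONS = {
    "uk": [("en", "uk"), ("cs", "uk")],
    "hsb": [("de", "hsb")],
    "dsb": [("de", "dsb")],
}

LANGUAGE_NAMES = {
    "en": "English",
    "cs": "Czech",
    "de": "German",
    "uk": "Ukrainian",
    "hsb": "Upper Sorbian",
    "dsb": "Lower Sorbian",
}


def mt_prompt(src: str, tgt: str, text: str) -> str:
    """Build a translation prompt (hypothetical wording) for a task direction."""
    if (src, tgt) not in MT_DIRECTIONS.get(tgt, []):
        raise ValueError(f"{src}->{tgt} is not a direction in this shared task")
    return (
        f"Translate the following text from {LANGUAGE_NAMES[src]} "
        f"to {LANGUAGE_NAMES[tgt]}.\n\n{text}"
    )


def qa_prompt(question: str, options: list) -> str:
    """Build a multiple-choice prompt (hypothetical wording) with options A, B, C, ..."""
    lines = [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    return f"{question}\n" + "\n".join(lines) + "\nAnswer with the letter only."
```

For example, mt_prompt("de", "dsb", "Guten Morgen") builds a German to Lower Sorbian prompt, while mt_prompt("en", "hsb", ...) raises a ValueError because that direction is not part of the task.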
Submission Guidelines
- Models must produce both MT & QA outputs for the chosen language(s);
- Submissions are language-specific; submit to one or multiple language tracks;
- Participants may use only one of the following base models, each limited to at most 3B parameters:
  - Qwen2.5-3B-Instruct https://huggingface.co/Qwen/Qwen2.5-3B-Instruct
  - Qwen2.5-1.5B-Instruct https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
  - Qwen2.5-0.5B-Instruct https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
  - Quantized or Unsloth variants from HuggingFace collections.
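Before submitting, participants may want to check their setup against the rules above. The following is a minimal sketch, not an official validator: the function check_submission and the layout of the outputs dictionary (track, then task, then list of output strings) are assumptions made for illustration, since the actual submission format is not described here. Quantized and Unsloth variants are permitted but not enumerated in the sketch.

```python
from __future__ import annotations

# Base models named in the guidelines (Hugging Face repository IDs).
# Quantized or Unsloth variants are also permitted by the guidelines but are
# not enumerated here; extend this set as needed.
ALLOWED_BASE_MODELS = {
    "Qwen/Qwen2.5-3B-Instruct",
    "Qwen/Qwen2.5-1.5B-Instruct",
    "Qwen/Qwen2.5-0.5B-Instruct",
}

TRACKS = {"uk", "hsb", "dsb"}    # language tracks of the shared task
REQUIRED_TASKS = {"mt", "qa"}    # every submission must cover both tasks


def check_submission(base_model: str, outputs: dict[str, dict[str, list[str]]]) -> list[str]:
    """Return a list of problems; an empty list means all checks passed.

    `outputs` maps track -> task -> list of output strings (hypothetical layout).
    """
    problems = []
    if base_model not in ALLOWED_BASE_MODELS:
        problems.append(f"base model not in the permitted list: {base_model}")
    if not outputs:
        problems.append("no language track submitted")
    for lang, tasks in outputs.items():
        if lang not in TRACKS:
            problems.append(f"unknown language track: {lang}")
            continue
        for task in sorted(REQUIRED_TASKS):
            if not tasks.get(task):
                problems.append(f"{lang}: missing {task.upper()} outputs")
    return problems
```

Calling check_submission("Qwen/Qwen2.5-0.5B-Instruct", {"hsb": {"mt": [...], "qa": [...]}}) with non-empty lists returns an empty list, while a submission with only MT outputs for a track is flagged as missing QA.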
Key Dates (AoE)
- Registration: open now! Join our Google group https://groups.google.com/g/slavic-llms-mt2025
- Training/dev data release: Late April
- Test data release: Late June
- Submission deadline: Early July
- System description deadline: Late July
- Final workshop: 5–9 November @ EMNLP 2025 in Suzhou, China!
Organisers
TUM Heilbronn:
Daryna Dementieva, Marion di Marco, Lukas Edman, Alexander Fraser, Kathy Hämmerl, Shu Okabe
Witaj-Sprachzentrum:
Beate Brězan, Anita Hendrichowa, Marko Měškank, Tomaš Šołta
Acknowledgements
We thank the UNLP 2024 Shared Task team (Roman Kyslyi, Mariana Romanyshyn, Oleksiy Syvokon) for kindly sharing Ukrainian QA resources.
Best regards,
Daryna Dementieva
On behalf of TUM Heilbronn Workshop Organizers