Test data has been released, and the CodaLab competitions are up and running, so we encourage you to register if you haven't already! There is still a week before the deadline. :)
Summary
In recent years, sets of
downstream tasks called benchmarks have become a very popular, if not
the default, method to evaluate general-purpose word and sentence
embeddings. Starting with decaNLP (McCann et al., 2018) and SentEval
(Conneau & Kiela, 2018), multitask benchmarks for NLU keep appearing
and improving every year. However, even the largest multilingual
benchmarks, such as XGLUE, XTREME, XTREME-R or XTREME-UP (Hu et al.,
2020; Liang et al., 2020; Ruder et al., 2021, 2023), only include modern
languages. When it comes to ancient and historical languages, scholars
mostly adapt/translate intrinsic evaluation datasets from modern
languages or create their own diagnostic tests. We argue that there is a
need for a universal evaluation benchmark for embeddings learned from
ancient and historical language data and view this shared task as a
proving ground for it.
The shared task involves solving the
following problems for 12+ ancient and historical languages that belong
to 4 language families and use 6 different scripts. Participants will be
invited to describe their system in a paper for the
SIGTYP workshop proceedings. The task organizers will write an overview
paper that describes the task, summarizes the different approaches
taken, and analyzes their results.
Subtasks
For subtask A, participants are not allowed to use any additional data;
however, they may reduce and balance the provided training datasets if they
see fit. For subtask B, participants are allowed to use any additional
data in any language, including pre-trained embeddings and LLMs.
A. Constrained
- POS-tagging
- Full morphological annotation
- Lemmatisation
B. Unconstrained
- POS-tagging
- Detailed morphological annotation
- Lemmatisation
- Filling the gaps
  - Word-level
  - Character-level
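To make the task definitions concrete, the sketch below illustrates, on a hypothetical Latin example, what each subtask asks a system to predict. The CoNLL-U-style fields, tag values, and masking notation are assumptions chosen for illustration; the released datasets define the actual format.

```python
# Illustration of the annotation subtasks on a hypothetical Latin example.
# The CoNLL-U-style fields (form, lemma, upos, feats) are an assumption;
# consult the released data for the actual format.

# One token, fully annotated:
#   - POS-tagging predicts "upos"
#   - lemmatisation predicts "lemma"
#   - full/detailed morphological annotation predicts "feats"
token = {
    "form": "puellarum",
    "lemma": "puella",                           # lemmatisation target
    "upos": "NOUN",                              # POS-tagging target
    "feats": "Case=Gen|Gender=Fem|Number=Plur",  # morphology target
}

# "Filling the gaps" (unconstrained subtask only), at two granularities;
# the [_] placeholder is purely illustrative:
word_gap = ["in", "principio", "erat", "[_]"]    # predict the masked word
char_gap = "pri[_]cipio"                         # predict the masked character
```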