In 2024, SIGTYP is hosting a
Shared Task on Word Embedding Evaluation for Ancient and Historical Languages:
https://sigtyp.github.io/st2024.html
The workshop will be co-located with EACL.
Summary
In recent years, sets of
downstream tasks called benchmarks have become a very popular, if not
default, method to evaluate general-purpose word and sentence
embeddings. Starting with decaNLP (McCann et al., 2018) and SentEval
(Conneau & Kiela, 2018), multitask benchmarks for NLU keep appearing
and improving every year. However, even the largest multilingual
benchmarks, such as XGLUE, XTREME, XTREME-R or XTREME-UP (Hu et al.,
2020; Liang et al., 2020; Ruder et al., 2021, 2023), only include modern
languages. When it comes to ancient and historical languages, scholars
mostly adapt/translate intrinsic evaluation datasets from modern
languages or create their own diagnostic tests. We argue that there is a
need for a universal evaluation benchmark for embeddings learned from
ancient and historical language data and view this shared task as a
proving ground for it.
The shared task involves solving the problems listed under Subtasks
for 12+ ancient and historical languages that belong
to 4 language families and use 6 different scripts. Participants will be
invited to describe their systems in a paper for the
SIGTYP workshop proceedings. The task organisers will write an overview
paper that describes the task, summarises the different approaches
taken, and analyses their results.
Subtasks
For subtask A, participants are not allowed to use any additional data;
however, they can reduce and balance provided training datasets if they
see fit. For subtask B, participants are allowed to use any additional
data in any language, including pre-trained embeddings and LLMs.
A. Constrained
- POS-tagging
- Full morphological annotation
- Lemmatisation
B. Unconstrained
- POS-tagging
- Full morphological annotation
- Lemmatisation
- Filling the gaps
  - Word-level
  - Character-level
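To make the difference between the word-level and character-level variants of the gap-filling task concrete, here is a minimal Python sketch. It is an illustration only: the placeholder symbols, the masking rate, and the random selection of gaps are all assumptions, and the organisers' actual procedure for creating gaps is not described in this announcement.

```python
import random

# Hypothetical placeholders; the symbols used in the task data may differ.
MASK_TOKEN = "[MASK]"
MASK_CHAR = "_"


def mask_words(sentence, rate, rng):
    """Replace a share of whitespace-separated tokens with MASK_TOKEN.

    Returns the masked sentence and the hidden tokens in left-to-right order.
    """
    tokens = sentence.split()
    if not tokens:
        return sentence, []
    n = min(len(tokens), max(1, round(len(tokens) * rate)))
    hidden = set(rng.sample(range(len(tokens)), n))
    masked = [MASK_TOKEN if i in hidden else tok for i, tok in enumerate(tokens)]
    targets = [tokens[i] for i in sorted(hidden)]
    return " ".join(masked), targets


def mask_chars(sentence, rate, rng):
    """Replace a share of non-space characters with MASK_CHAR.

    Returns the masked sentence and the hidden characters in left-to-right order.
    """
    positions = [i for i, ch in enumerate(sentence) if not ch.isspace()]
    if not positions:
        return sentence, []
    n = min(len(positions), max(1, round(len(positions) * rate)))
    hidden = set(rng.sample(positions, n))
    masked = "".join(MASK_CHAR if i in hidden else ch for i, ch in enumerate(sentence))
    targets = [sentence[i] for i in sorted(hidden)]
    return masked, targets


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the gaps are reproducible
    example = "gallia est omnis divisa in partes tres"
    print(mask_words(example, 0.3, rng))
    print(mask_chars(example, 0.2, rng))
```

In both variants a system would be asked to recover the hidden targets from the masked input; the word-level variant hides whole tokens, while the character-level variant hides individual characters and leaves word boundaries visible.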
Data
For
tasks 1-3, we use Universal Dependencies v. 2.12 data (Zeman et al.,
2023) in 11 ancient and historical languages, complemented by 5 Old
Hungarian codices from the MGTSZ website (HAS Research Institute for
Linguistics, 2018) that are annotated to the same standard as the
corpora available through UD. For task 4, we add historical Irish data
from CELT (Ó Corráin et al., 1997), Corpas Stairiúil na Gaeilge (Acadamh
Ríoga na hÉireann, 2017), and digital editions of the St. Gall glosses
(Bauer et al., 2017) and the Würzburg glosses (Doyle, 2018) as a case
study of how performance may vary on different historical stages of the
same language. We set the upper temporal boundary to 1700 CE and do not
include texts created later than this date in our dataset.
List of languages:
- Ancient Greek
- Ancient Hebrew
- Classical Chinese
- Coptic
- Gothic
- Classical, Late & Medieval Latin
- Medieval Icelandic
- Old Church Slavonic
- Old East Slavic
- Old French
- Old Hungarian
- Old, Middle & Early Modern Irish
- Vedic Sanskrit
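Universal Dependencies treebanks are distributed in the CoNLL-U format, in which each token is a line of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sketch below reads the three columns that correspond most directly to POS-tagging, morphological annotation and lemmatisation. Whether the files released for the shared task keep exactly this layout is an assumption, and the two-token sample is hand-written for illustration rather than taken from any treebank.

```python
def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences.

    Each sentence is a list of dicts with the keys
    'form', 'lemma', 'upos' and 'feats' (a dict of feature=value pairs).
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword token ranges and empty nodes
        feats = {}
        if cols[5] != "_":
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        current.append(
            {"form": cols[1], "lemma": cols[2], "upos": cols[3], "feats": feats}
        )
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample for illustration only (not taken from a treebank).
SAMPLE = (
    "# text = puella cantat\n"
    "1\tpuella\tpuella\tNOUN\t_\tCase=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tcantat\tcanto\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        for token in sentence:
            print(token["form"], token["lemma"], token["upos"], token["feats"])
```

Under this reading, a system for the tagging and lemmatisation tasks would take the FORM column as input and predict UPOS, FEATS and LEMMA for each token.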
Important dates
05 Nov 2023: Release of training and validation data
02 Jan 2024: Release of test data
08 Jan 2024: Submission of the systems
13 Jan 2024: Notification of results
20 Jan 2024: Submission of shared task papers
27 Jan 2024: Notification of acceptance to authors
03 Feb 2024: Camera-ready
15 Mar 2024: Video recordings due
21/22 Mar 2024: SIGTYP workshop
Task organisers
- Oksana Dereza, Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway
- Priya Rani, SFI Centre for Research and Training in AI, Data Science Institute, University of Galway
- Atul Kr. Ojha, Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway
- Adrian Doyle, Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway
- Pádraic Moran, School of Languages, Literatures and Cultures, Moore Institute, University of Galway
- John P. McCrae, Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway