Dear colleagues,
[apologies for cross-posting]
We would like to remind you that this year SIGTYP is hosting a Shared Task on Word Embedding Evaluation for Ancient and Historical Languages: https://github.com/sigtyp/ST2024/
Test data has been released, and the CodaLab competitions are up and running, so we encourage you to register if you haven't yet! There is still a week left before the deadline. :)
*Summary*
In recent years, sets of downstream tasks called benchmarks have become a very popular, if not default, method to evaluate general-purpose word and sentence embeddings. Starting with decaNLP (McCann et al., 2018) and SentEval (Conneau & Kiela, 2018), multitask benchmarks for NLU keep appearing and improving every year. However, even the largest multilingual benchmarks, such as XGLUE, XTREME, XTREME-R or XTREME-UP (Hu et al., 2020; Liang et al., 2020; Ruder et al., 2021, 2023), only include modern languages. When it comes to ancient and historical languages, scholars mostly adapt/translate intrinsic evaluation datasets from modern languages or create their own diagnostic tests. We argue that there is a need for a universal evaluation benchmark for embeddings learned from ancient and historical language data and view this shared task as a proving ground for it.
The shared task involves solving the problems listed below for 12+ ancient and historical languages that belong to 4 language families and use 6 different scripts. Participants will be invited to describe their systems in a paper for the SIGTYP workshop proceedings. The task organizers will write an overview paper that describes the task, summarizes the approaches taken, and analyzes their results.
*Subtasks*
For subtask A, participants are not allowed to use any additional data; however, they may reduce and balance the provided training datasets as they see fit. For subtask B, participants are allowed to use any additional data in any language, including pre-trained embeddings and LLMs.
A. Constrained
1. POS-tagging
2. Full morphological annotation
3. Lemmatisation
B. Unconstrained
1. POS-tagging
2. Detailed morphological annotation
3. Lemmatisation
4. Filling the gaps
   - Word-level
   - Character-level
*Important links*
- *Registration form*: https://docs.google.com/forms/d/e/1FAIpQLSdINgMfzzZGIZ-uBVQhvyndB6yeaaj-wT7v45A6UB4F2h6QBQ/viewform?usp=sf_link
- Detailed description, incl. submission format: https://github.com/sigtyp/ST2024
- Constrained subtask on CodaLab: https://codalab.lisn.upsaclay.fr/competitions/16822
- Unconstrained subtask on CodaLab: https://codalab.lisn.upsaclay.fr/competitions/16818
*Important dates*
- *05 Nov 2023*: Release of training and validation data
- *02 Jan 2024*: Release of test data
- *09 Jan 2024*: Submission of results for Phase 1 of the Constrained Subtask
- *12 Jan 2024*: Submission of results for Phase 2 of the Constrained Subtask and for the Unconstrained Subtask
- *13 Jan 2024*: Notification of results
- *20 Jan 2024*: Submission of shared task papers
- *27 Jan 2024*: Notification of acceptance to authors
- *03 Feb 2024*: Camera-ready
- *15 Mar 2024*: Video recordings due
- *21/22 Mar 2024*: SIGTYP workshop
Kind regards,
Oksana and the organisers' team