[2nd call] SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages - SIGUL

4 Jan 2024


Dear colleagues,
[apologies for cross-posting]
We would like to remind you that this year SIGTYP is hosting a Shared Task
on Word Embedding Evaluation for Ancient and Historical Languages:
https://github.com/sigtyp/ST2024/
Test data has been released, and CodaLab competitions are up and running,
so we encourage you to register if you haven't already! There is still a week
before the deadline. :)
*Summary*
In recent years, sets of downstream tasks called benchmarks have become a
very popular, if not default, method to evaluate general-purpose word and
sentence embeddings. Starting with decaNLP (McCann et al., 2018) and
SentEval (Conneau & Kiela, 2018), multitask benchmarks for NLU keep
appearing and improving every year. However, even the largest multilingual
benchmarks, such as XGLUE, XTREME, XTREME-R or XTREME-UP (Hu et al., 2020;
Liang et al., 2020; Ruder et al., 2021, 2023), only include modern
languages. When it comes to ancient and historical languages, scholars
mostly adapt/translate intrinsic evaluation datasets from modern languages
or create their own diagnostic tests. We argue that there is a need for a
universal evaluation benchmark for embeddings learned from ancient and
historical language data and view this shared task as a proving ground for
it.
The shared task involves solving the problems listed under Subtasks below
for 12+ ancient and historical languages that belong to 4 language families
and use 6 different scripts. Participants will be invited to describe their
systems in a paper for the SIGTYP workshop proceedings. The task organizers
will write an overview paper that describes the task, summarizes the
different approaches taken, and analyzes their results.
*Subtasks*
For subtask A, participants are not allowed to use any additional data;
however, they can reduce and balance the provided training datasets as they
see fit. For subtask B, participants are allowed to use any additional data
in any language, including pre-trained embeddings and LLMs.
A. Constrained
   1. POS-tagging
   2. Full morphological annotation
   3. Lemmatisation
B. Unconstrained
   1. POS-tagging
   2. Detailed morphological annotation
   3. Lemmatisation
   4. Filling the gaps
      - Word-level
      - Character-level
*Important links*
- Registration form:
  https://docs.google.com/forms/d/e/1FAIpQLSdINgMfzzZGIZ-uBVQhvyndB6yeaaj-wT7v45A6UB4F2h6QBQ/viewform?usp=sf_link
- Detailed description, incl. submission format:
  https://github.com/sigtyp/ST2024
- Constrained subtask on CodaLab:
  https://codalab.lisn.upsaclay.fr/competitions/16822
- Unconstrained subtask on CodaLab:
  https://codalab.lisn.upsaclay.fr/competitions/16818
*Important dates*
- *05 Nov 2023*: Release of training and validation data
- *02 Jan 2024*: Release of test data
- *09 Jan 2024*: Submission of results for Phase 1 of the Constrained
  Subtask
- *12 Jan 2024*: Submission of results for Phase 2 of the Constrained
  Subtask and for the Unconstrained Subtask
- *13 Jan 2024*: Notification of results
- *20 Jan 2024*: Submission of shared task papers
- *27 Jan 2024*: Notification of acceptance to authors
- *03 Feb 2024*: Camera-ready
- *15 Mar 2024*: Video recordings due
- *21/22 Mar 2024*: SIGTYP workshop
Kind regards,
Oksana and the organisers' team
-- 
https://www.insight-centre.org/

Oksana Dereza | PhD student on the Cardamom project
(http://cardamom.insight-centre.org/) | Unit for Linguistic Data |
Insight Centre for Data Analytics | Data Science Institute | University of
Galway
