Call for Participation - VarDial Evaluation Campaign 2023
Within the scope of the tenth VarDial workshop, co-located with EACL 2023, we are organizing an evaluation campaign on similar languages, language varieties, and dialects, with three shared tasks. To participate and to receive the training data, please fill in the registration form on the workshop website: https://sites.google.com/view/vardial-2023/shared-tasks
We are organizing the following tasks this year (please check the website for more information):
1. SID for low-resource language varieties (SID4LR)
This task is Slot and Intent Detection (SID) for low-resource language varieties. Slot detection is a span labeling task, while intent detection is a classification task. The test set will contain Swiss German (GSW), South Tyrolean (DE-ST), and Neapolitan (NAP). This shared task seeks to answer the following question: how can we best perform zero-shot transfer to low-resource language varieties without a standard orthography?
The training data consists of the xSID-0.4 corpus, containing data from Snips and Facebook. The original training data is in English, but we also provide automatic translations of the training data into German, Italian and other languages (the projected nmt-transfer data from van der Goot et al., 2021). Participants are allowed to use other data to train on, as long as it is not annotated for SID in the target languages.
Participants are not required to submit systems for both subtasks; it is also possible to participate in only one of the two, intent detection (classification) or slot detection (span labeling). Systems will be evaluated with the span F1 score for slots and accuracy for intents as the main evaluation metrics, as is standard for these tasks. Participants may also submit systems for only a subset of the three target languages.
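For concreteness, here is a minimal sketch of how the two metrics are computed on a toy example. The utterance, labels, and span representation are hypothetical; this is not the official xSID evaluation script.

    # Toy example: gold and predicted slots as (start, end, label) spans
    # per utterance, plus one intent label per utterance.
    gold_slots = [{(4, 9, "datetime"), (13, 18, "location")}]
    pred_slots = [{(4, 9, "datetime")}]
    gold_intents = ["weather/find"]
    pred_intents = ["weather/find"]

    # Span F1: a predicted span counts as correct only if its start, end,
    # and label all exactly match a gold span.
    tp = sum(len(g & p) for g, p in zip(gold_slots, pred_slots))
    n_pred = sum(len(p) for p in pred_slots)
    n_gold = sum(len(g) for g in gold_slots)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Intent accuracy: fraction of utterances with the correct intent label.
    accuracy = sum(g == p for g, p in zip(gold_intents, pred_intents)) / len(gold_intents)

    print(f"span F1 = {f1:.2f}, intent accuracy = {accuracy:.2f}")
    # span F1 = 0.67, intent accuracy = 1.00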
2. Discriminating Between Similar Languages - True Labels (DSL-TL)
Discriminating between similar languages (e.g., Croatian and Serbian) and language varieties (e.g., Brazilian and European Portuguese) has been a popular topic at VarDial since its first edition. The DSL shared tasks organized in 2014, 2015, 2016, and 2017 addressed this issue by providing participants with the DSL Corpus Collection (DSLCC), a collection of journalistic texts written in multiple similar languages and language varieties. The DSLCC was compiled under the assumption that each instance's gold label is determined by the source from which the text was retrieved. While this is a straightforward (and mostly accurate) practical assumption, previous research has shown the limitations of this problem formulation, as some texts may present no linguistic markers that allow systems or native speakers to discriminate between two very similar languages or language varieties.
We tackle this important limitation by introducing the DSL True Labels (DSL-TL) task. DSL-TL will provide participants with a human-annotated DSL dataset. A subset of nearly 13,000 sentences was retrieved from the DSLCC and annotated by multiple native speakers of the included languages and varieties, namely English (American and British), Portuguese (Brazilian and European), and Spanish (Argentinian and Peninsular). To the best of our knowledge, this is the first dataset of its kind, opening exciting new avenues for language identification research.
3. Discriminating Between Similar Languages - Speech (DSL-S)
In the DSL-S 2023 shared task, participants use the training and development sets from Mozilla Common Voice (CV) to develop a language identifier for speech. The nine languages selected for the task come from four different subgroups of the Indo-European and Uralic language families. The test data used in this task is the Common Voice test data for the nine languages. Participants are asked not to evaluate their systems on the test data themselves, nor to investigate it in any other way, before the shared task results have been published. The total amount of unpacked speech data is around 15 gigabytes. Only the .mp3 files from the test set may be used when generating the results; the metadata concerning the test audio files, including their transcriptions, must not be used. This task is audio only.
The 9-way classification task is divided into two separate tracks. In the closed track, only the training and development data from the Common Voice dataset may be used; no other data is allowed, and this prohibition extends to systems and models trained (supervised or unsupervised) on any other data. In the open track, participants may use any openly available datasets and models (i.e., available to any potential shared task participant), provided that they do not include and were not trained on the Mozilla Common Voice test set.
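For illustration only, here is a minimal closed-track baseline sketch using time-averaged MFCC features and logistic regression. It assumes the librosa and scikit-learn libraries; the file paths and language codes are hypothetical, and this is not an official baseline.

    # Toy closed-track baseline: mean-MFCC features + logistic regression.
    import numpy as np
    import librosa
    from sklearn.linear_model import LogisticRegression

    def mfcc_features(path):
        """Load an audio clip and return its time-averaged MFCC vector."""
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
        return mfcc.mean(axis=1)

    # Hypothetical paths and labels; in practice these would come from the
    # Common Voice training and development splits of the nine task languages.
    train_files = ["cv/train/clip_0001.mp3", "cv/train/clip_0002.mp3"]
    train_labels = ["fi", "sv"]

    X = np.stack([mfcc_features(f) for f in train_files])
    clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

    # At test time, predict one label per .mp3 clip without touching the
    # test metadata or transcriptions, as required by the task rules.
    print(clf.predict(mfcc_features("cv/test/clip_9999.mp3")[None, :]))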
Dates
Training set release: January 23, 2023
Test set release: February 6, 2023
Submissions due: February 17, 2023
Paper submission deadline: February 27, 2023
Notification of acceptance: March 13, 2023
Camera-ready papers due: March 27, 2023
Of course, VarDial also accepts research papers focusing on computational methods and language resources for closely related languages, language varieties, and dialects. The full call for papers can be found here: https://sites.google.com/view/vardial-2023/call-for-papers
Contact: yves.scherrer@helsinki.fi or tommi.jauhiainen@helsinki.fi