Opening of the Faetar Low-Resource ASR Challenge 2025
We are pleased to officially announce the opening of the Faetar Low-Resource ASR Challenge 2025. While we were not able to secure a special session dedicated to the challenge at Interspeech 2025, we strongly encourage you to submit papers describing your systems to the conference. We therefore plan to adhere to a timeline that will allow us to return test results and announce winners in time for participants to prepare Interspeech papers (see below).
Challenge website: https://perceptimatic.github.io/faetarspeech/
The Faetar Low-Resource ASR Challenge aims to focus researchers' attention on several issues common to many archival collections of speech data:
- noisy field recordings
- lack of standard orthography, leading to noise in the transcriptions in the form of transcriber inconsistencies
- only a few hours of transcribed data
- a larger collection of untranscribed data
- no additional data in the language (textual or speech) that is easily available
- “dirty” transcriptions in documents, which contain matter that needs to be filtered out
By focusing multiple research groups on a single corpus of this kind, we aim to gain deeper insights into these problems than can be achieved otherwise.
The challenge uses the Faetar ASR Benchmark Corpus. Faetar (pronounced [fajdar]) is a variety of the Franco-Provençal language which developed in isolation in Italy, far from other speakers of Franco-Provençal, and in close contact with Italian. Faetar has fewer than 1,000 speakers around the world, in Italy and in the diaspora. It is endangered, and preservation, learning, and documentation are a priority for many community members. The benchmark data represents the majority of all archived speech recordings of Faetar in existence, and it is not available from any other source.
We propose four tracks:
- Constrained ASR. Participants should focus on the challenge of improving ASR architectures to work with small, poor-quality data sets. Participants may not use any resources to train / fine-tune their models beyond the files contained in the provided train set. No external pre-trained acoustic models or language models are allowed, and the use of the unlabelled portion of the Faetar challenge data set is not allowed either.
Three other “thematic tracks” can be explored; they should not be considered mutually exclusive:
- Using pre-trained acoustic models or language models. Participants focus on the most effective way to make use of models pre-trained on other languages (see the sketch after this list).
- Using unlabelled data. The challenge data also includes ~20 hrs of unlabelled data. Participants focus on finding the most effective way to make use of it.
- Dirty data. The training data was extracted and automatically aligned from long-form audio and partial transcriptions in “cluttered” word processor files, relying on (error-prone) VAD, scraping, and alignment. Participants focus on improving the pipeline for extracting useful training data, with the ultimate goal of improving performance.
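As a concrete (and purely illustrative) starting point for the pre-trained-models track, the sketch below fine-tunes a multilingual wav2vec 2.0 encoder with a freshly initialized CTC head sized to a phone inventory. The checkpoint name, the phone inventory, and the dummy batch are all placeholder assumptions, not part of the challenge; participants are free to use any toolkit or model.

```python
# A minimal sketch (not an official baseline) of one way to approach the
# pre-trained-models track: fine-tune a multilingual wav2vec 2.0 encoder
# with a new CTC head over the Faetar phone inventory.
# Assumptions: HuggingFace transformers is installed; the checkpoint name
# and the phone inventory below are illustrative placeholders.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor

PHONE_INVENTORY = ["<pad>", "a", "e", "i", "o", "u", "p", "t", "k", "f", "j", "d", "r"]  # placeholder

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",      # one plausible multilingual checkpoint
    vocab_size=len(PHONE_INVENTORY),     # new CTC head sized to the phone set
    ctc_loss_reduction="mean",
    pad_token_id=0,
)
model.freeze_feature_encoder()           # common practice with very small data sets

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")

# Forward pass on one dummy batch of raw 16 kHz audio (labels are phone indices):
waveform = torch.randn(16000)            # stand-in for one second of audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([[1, 2, 3]])       # stand-in phone-index targets
loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()                          # plug into any standard training loop
```

Freezing the convolutional feature encoder while adapting the transformer layers and CTC head is a common choice when only a few hours of labelled audio are available, though it is only one of many reasonable configurations.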
Submissions will be evaluated on phone error rate (PER) on the test set. Participants are provided with a dev kit allowing them to calculate the PER on the dev and train sets, as well as reproduce the baselines.
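For concreteness, PER is the standard edit-distance metric (substitutions + deletions + insertions, divided by the number of reference phones), computed like word error rate but over phone sequences. The dev kit's scorer is authoritative; the snippet below is only an illustrative re-implementation of the usual formula.

```python
# Illustrative re-implementation of phone error rate (PER); the dev kit's
# official scorer should be treated as authoritative.
def phone_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    # Levenshtein distance over phone tokens via dynamic programming.
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                          # i deletions
    for j in range(n + 1):
        d[0][j] = j                          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution / match
    return d[m][n] / max(m, 1)

# Example: one substitution and one deletion against a 6-phone reference:
print(phone_error_rate("f a j d a r".split(), "f a i d a".split()))  # 2/6 ≈ 0.333
```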
For more information, and to register and obtain the data and the dev kit, please visit the challenge website:
https://perceptimatic.github.io/faetarspeech/
If you have any questions, please contact us by writing to faetarasrchallenge@gmail.com.