Welcome to SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes!
Task description: SHROOM participants will need to detect grammatically sound output that contains incorrect semantic information (i.e. unsupported or inconsistent with the source input), with or without having access to the model that produced the output.
Overview of the task: The modern NLG landscape is plagued by two interlinked problems:
On the one hand, our current neural models have a propensity to produce inaccurate but fluent outputs; on the other hand, our metrics are better suited to measuring fluency than correctness. As a result, neural networks "hallucinate", i.e., produce fluent but incorrect outputs that we currently struggle to detect automatically. For many NLG applications, however, the correctness of an output is mission critical. For instance, a plausible-sounding translation that is inconsistent with the source text jeopardizes the usefulness of a machine translation pipeline. With our shared task, we hope to foster the growing interest in this topic in the community.
With SHROOM we adopt a post hoc setting, where models have already been trained and outputs already produced. Participants will be asked to identify cases of fluent overgeneration hallucination in two different tracks: a model-aware and a model-agnostic track. In the former, participants have access to the model that produced the output; in the latter, they do not. To ensure a low barrier to entry, we format the task as a binary classification problem. We now also provide a baseline kit, containing a baseline system, a format checker, and the scoring program.
All systems will be rated on accuracy (i.e., the proportion of test examples correctly labeled) and calibration (i.e., the correlation between the probability assigned by a system and the proportion of annotators marking a production as hallucinatory).
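The two metrics above can be sketched as follows. This is an illustrative implementation, not the official scoring program: the exact rank-correlation variant (Spearman's rho here) and the input conventions are assumptions.

```python
def accuracy(gold, pred):
    """Proportion of test examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def _ranks(xs):
    """Ranks (1-based), averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(system_probs, annotator_proportions):
    """Spearman rank correlation between system-assigned probabilities and
    the proportion of annotators marking each production as hallucinatory."""
    rx, ry = _ranks(system_probs), _ranks(annotator_proportions)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice a library routine such as `scipy.stats.spearmanr` would be used instead of the hand-rolled version above; it is written out here only to make the calibration measure concrete.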
We provide to participants a collection of checkpoints, inputs, references and outputs of systems covering three NLG tasks: definition modeling (DM), machine translation (MT), and paraphrase generation (PG), trained with varying degrees of accuracy. The development set provides binary annotations from five different annotators and a majority vote gold label.
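The aggregation described above (five binary annotations reduced to a majority-vote gold label, plus the annotator proportion used for calibration) can be sketched as follows; the label strings are hypothetical and may differ from those in the released data.

```python
from collections import Counter

def aggregate(annotations):
    """Reduce a list of binary annotations (e.g. five per item) to a
    majority-vote gold label and the proportion marking it a hallucination."""
    counts = Counter(annotations)
    gold_label = counts.most_common(1)[0][0]
    p_hallucination = counts["Hallucination"] / len(annotations)
    return gold_label, p_hallucination

# Hypothetical item annotated by five annotators:
gold, p = aggregate(["Hallucination"] * 3 + ["Not Hallucination"] * 2)
# gold == "Hallucination", p == 0.6
```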
Anyone wishing to participate in the task is welcome! Participants will have to
* Submit at least once during the evaluation phase in January;
* Write a system description paper before February 19;
* Review other system description papers (max. 2).
Trial, dev and train data are now available on the task website: https://helsinki-nlp.github.io/shroom/
Codalab competition: https://codalab.lisn.upsaclay.fr/competitions/15726
Join the mailing group: https://groups.google.com/u/1/g/semeval-2024-task-6-shroom
Updates on Twitter: @shroom2024 (https://twitter.com/shroom2024)
Important dates:
* Sample data ready: July 15th, 2023
* Validation data ready: September 11th, 2023
* Unlabeled train data ready: September 22nd, 2023
* Evaluation period starts (test set released): January 10th, 2024
* Evaluation period ends: January 31st, 2024
* Workshop paper submission deadline: February 19th, 2024
* Notification to authors: March 18th, 2024
* SemEval workshop: June 16–21, 2024, Mexico (co-located with NAACL 2024)
Task organizers
* Elaine Zosa, Silo AI, Finland
* Raúl Vázquez, University of Helsinki, Finland
* Jörg Tiedemann, University of Helsinki, Finland
* Vincent Segonne, Southern Brittany University, France
* Teemu Vahtola, University of Helsinki, Finland
* Alessandro Raganato, University of Milano-Bicocca, Italy
* Timothee Mickus, University of Helsinki, Finland
* Marianna Apidianaki, University of Pennsylvania, USA