LeWiDi: Shared task on Learning With Disagreement
We'd like to invite researchers working on disagreement and variation to participate in the third edition of the LeWiDi shared task, held in conjunction with the NLPerspectives workshop at the EMNLP conference in Suzhou, China. The LeWiDi series is positioned within the growing body of research that questions the practice of label harmonization and the reliance on a single ground truth in AI and NLP. This year's shared task challenges participants to leverage both instance-level disagreement and annotator-level information in classification. The proposed tasks address disagreement in both generation and labeling, with a dataset for Natural Language Inference (NLI) and another for paraphrase detection, as well as subjective tasks, including irony and sarcasm detection.

==== Subtasks and datasets ====
Participants will be able to submit to subtasks exploring different types of disagreement through dedicated datasets:

1. The Conversational Sarcasm Corpus (CSC) – a dataset of context+response pairs rated for sarcasm, with ratings from 1 to 6.
2. The MultiPico dataset (MP) – a crowdsourced multilingual irony detection dataset. Annotators were asked to judge whether a reply was ironic in the context of a brief post-reply exchange on social media. Annotator IDs and metadata (gender, age, nationality, etc.) are available. Languages include Arabic, German, English, Spanish, French, Hindi, Italian, Dutch, and Portuguese.
3. The Paraphrase dataset (Par) – a dataset of question pairs for which annotators judged whether the two questions are paraphrases of each other, using values on a Likert scale.
4. The VariErrNLI dataset (VariErrNLI) – a dataset originally designed for automatic error detection, distinguishing between annotation errors and legitimate human label variation in Natural Language Inference.

Participants will be able to submit to one or multiple datasets.

==== Tasks and Evaluation ====

In this edition, only soft evaluation metrics will be used. We will, however, experiment with two forms of tasks and evaluation:
* TASK A (SOFT LABEL PREDICTION): Systems will be asked to output a probability distribution over the label values. EVALUATION: the distance between this predicted soft label and the one resulting from the human annotations will be computed.
* TASK B (PERSPECTIVIST PREDICTION): Systems will be asked to predict each annotator's label on each item. EVALUATION: the correctness of these predictions will be measured (an illustrative sketch of both evaluation settings follows below).
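To make the two settings concrete, here is a minimal sketch of how a soft label can be derived from per-annotator ratings and how predictions might be scored. The item structure, the 1-to-6 scale (borrowed from the CSC description), the Manhattan distance for Task A, and exact-match accuracy for Task B are illustrative assumptions, not the official data formats or metrics of the shared task.

```python
# Hypothetical sketch only: item structure, label scale, and metrics are
# illustrative assumptions, not the official LeWiDi formats or evaluation code.
from collections import Counter

LABELS = [1, 2, 3, 4, 5, 6]  # assumed label scale (cf. the 1-6 sarcasm ratings in CSC)

# A hypothetical item: per-annotator ratings keyed by annotator ID.
item = {"ann_01": 5, "ann_02": 6, "ann_03": 2, "ann_04": 5}

def soft_label(ratings, labels=LABELS):
    """Normalize the label counts into a probability distribution (the soft label)."""
    counts = Counter(ratings.values())
    total = sum(counts.values())
    return [counts.get(lab, 0) / total for lab in labels]

def manhattan_distance(p, q):
    """One possible distance between predicted and human soft labels (Task A)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def annotator_accuracy(predicted, gold):
    """Fraction of annotators whose label is predicted exactly (assumed Task B measure)."""
    hits = sum(predicted[a] == gold[a] for a in gold)
    return hits / len(gold)

human = soft_label(item)                       # [0.0, 0.25, 0.0, 0.0, 0.5, 0.25]
system = [0.05, 0.15, 0.05, 0.05, 0.45, 0.25]  # a system's predicted soft label
print(manhattan_distance(system, human))       # Task A: lower distance is better

predicted_per_annotator = {"ann_01": 5, "ann_02": 6, "ann_03": 3, "ann_04": 5}
print(annotator_accuracy(predicted_per_annotator, item))  # Task B: 0.75
```

Task A rewards systems whose predicted distribution matches the spread of human judgments, while Task B rewards modeling individual annotators, e.g. by exploiting the annotator metadata released with the datasets.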
Participants will be able to submit to one or both tasks.

==== Important Dates ====

Training data ready: May 15th, 2025
Evaluation starts: June 20th, 2025
Evaluation ends: July 15th, 2025
Paper submission due: TBA
Notification to authors: TBA
NLPerspectives workshop: November 12-14, 2025

We are looking forward to your submissions!

The LeWiDi team