We invite the community to participate in a shared task organized in the context of the CONDA workshop (https://conda-workshop.github.io/).
Data contamination, where evaluation data is inadvertently included in the pre-training corpora of large-scale models, and of language models (LMs) in particular, has become a concern in recent times (Sainz et al., 2023, https://aclanthology.org/2023.findings-emnlp.722/; Jacovi et al., 2023, https://aclanthology.org/2023.emnlp-main.308/). The growing scale of both models and data, coupled with massive web crawling, has led to the inclusion of segments from evaluation benchmarks in the pre-training data of LMs (Dodge et al., 2021, https://aclanthology.org/2021.emnlp-main.98/; OpenAI, 2023, https://arxiv.org/abs/2303.08774; Google, 2023, https://arxiv.org/abs/2305.10403; Elazar et al., 2023, https://arxiv.org/abs/2310.20707). The scale of internet data makes it difficult to prevent this contamination from happening, or even to detect when it has happened (Bommasani et al., 2022, https://arxiv.org/abs/2108.07258; Mitchell et al., 2023, https://arxiv.org/abs/2212.05129). Crucially, when evaluation data becomes part of pre-training data, it introduces biases and can artificially inflate the performance of LMs on specific tasks or benchmarks (Magar and Schwartz, 2022, https://aclanthology.org/2022.acl-short.18/). This poses a challenge for fair and unbiased evaluation of models, as their performance may not accurately reflect their generalization capabilities.
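One simple way contamination is often detected in practice is by measuring n-gram overlap between an evaluation instance and documents in a pre-training corpus. The sketch below is an illustrative assumption, not a method prescribed by the shared task: the function names, the n-gram size, and the idea of thresholding the overlap ratio are all choices of this example.

```python
# Minimal sketch of an n-gram overlap contamination check.
# All names and parameters here are illustrative, not part of the shared task.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text (case-folded)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(eval_example: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the example's n-grams that also occur in the corpus document."""
    example_grams = ngrams(eval_example, n)
    if not example_grams:
        return 0.0
    return len(example_grams & ngrams(corpus_doc, n)) / len(example_grams)

# A high ratio suggests the benchmark instance may have appeared in pre-training data.
doc = "the quick brown fox jumps over the lazy dog near the river bank today"
example = "the quick brown fox jumps over the lazy dog"
print(overlap_ratio(example, doc, n=4))  # 1.0: every 4-gram of the example is in doc
```

In real settings the corpus side is far too large to scan naively, so such checks are usually run with indexed search over the corpus; the ratio computation itself stays the same.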
The shared task is a community effort on centralized data contamination evidence collection. While the problem of data contamination is prevalent and serious, the breadth and depth of this contamination are still largely unknown. Concrete evidence of contamination is scattered across papers, blog posts, and social media, and we suspect that the true scope of data contamination in NLP is significantly larger than reported.
With this shared task we aim to provide a structured, centralized platform for contamination evidence collection to help the community understand the extent of the problem and to help researchers avoid repeating the same mistakes. The shared task also gathers evidence of clean, non-contaminated instances. The platform is already available for perusal at https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report.
Participants in the shared task need to submit their contamination evidence (see instructions below). The CONDA 2024 workshop organizers will review the evidence through pull requests.
*/Compilation Paper/*
As a companion to the contamination evidence platform, we will produce a paper that provides a summary and overview of the evidence collected in the shared task. Participants who contribute to the shared task will be listed as co-authors of the paper.
*/Instructions for Evidence Submission/*
Each submission should report a case of contamination, or the lack thereof. The submission can be either about (1) contamination in the corpus used to pre-train language models, where the pre-training corpus contains a specific evaluation dataset, or about (2) contamination in a model that shows evidence of having seen a specific evaluation dataset during training. Each submission needs to mention the corpus (or model) and the evaluation dataset, in addition to some evidence of contamination. Alternatively, we also welcome evidence of a lack of contamination.
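To make the required elements concrete, the sketch below shows the kind of information a submission should carry: the contaminated corpus or model, the evaluation dataset, and the evidence itself. The field names and values here are purely hypothetical illustrations; the authoritative format is defined by the Contribution Guidelines in the HuggingFace space, not by this example.

```python
# Hypothetical evidence record; field names are illustrative only and do NOT
# reflect the official submission schema (see the Contribution Guidelines).
evidence = {
    "contaminated_source": "example-pretraining-corpus",  # corpus or model name (assumed)
    "evaluation_dataset": "example-benchmark",            # evaluation dataset (assumed)
    "contamination_type": "corpus",                       # case (1) "corpus" or case (2) "model"
    "contaminated": True,                                 # False when reporting a clean, non-contaminated pair
    "evidence": "Description of how the contamination (or its absence) was established.",
}

# A submission is only meaningful if all three required elements are present.
required = {"contaminated_source", "evaluation_dataset", "evidence"}
print(required.issubset(evidence))  # True
```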
Reports must be submitted through a Pull Request in the Data Contamination Report space at HuggingFace. The reports must follow the Contribution Guidelines provided in the space and will be reviewed by the organizers. If you have any questions, please contact us at conda-workshop@googlegroups.com or open a discussion in the space itself.
URL with contribution guidelines: https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report (“Contribution Guidelines” tab)
*/Important dates/*
* Deadline for evidence submission: July 1, 2024
* Workshop day: August 16, 2024
*/Sponsors/*
* AWS AI and Amazon Bedrock
* HuggingFace
* Google
*/Contact/*
* Website: https://conda-workshop.github.io/
* Email: conda-workshop@googlegroups.com
*/Organizers/*
* Oscar Sainz, University of the Basque Country (UPV/EHU)
* Iker García Ferrero, University of the Basque Country (UPV/EHU)
* Eneko Agirre, University of the Basque Country (UPV/EHU)
* Jon Ander Campos, Cohere
* Alon Jacovi, Bar Ilan University
* Yanai Elazar, Allen Institute for Artificial Intelligence and University of Washington
* Yoav Goldberg, Bar Ilan University and Allen Institute for Artificial Intelligence