We invite the community to participate in a shared task organized in the context of the CONDA workshop (https://conda-workshop.github.io/).
Data contamination, where evaluation data is inadvertently included in the pre-training corpora of large-scale models, and of language models (LMs) in particular, has become a concern in recent times (Sainz et al., 2023, https://aclanthology.org/2023.findings-emnlp.722/; Jacovi et al., 2023, https://aclanthology.org/2023.emnlp-main.308/). The growing scale of both models and data, coupled with massive web crawling, has led to the inclusion of segments from evaluation benchmarks in the pre-training data of LMs (Dodge et al., 2021, https://aclanthology.org/2021.emnlp-main.98/; OpenAI, 2023, https://arxiv.org/abs/2303.08774; Google, 2023, https://arxiv.org/abs/2305.10403; Elazar et al., 2023, https://arxiv.org/abs/2310.20707). The scale of internet data makes it difficult to prevent this contamination from happening, or even to detect when it has happened (Bommasani et al., 2022, https://arxiv.org/abs/2108.07258; Mitchell et al., 2023, https://arxiv.org/abs/2212.05129). Crucially, when evaluation data becomes part of pre-training data, it introduces biases and can artificially inflate the performance of LMs on specific tasks or benchmarks (Magar and Schwartz, 2022, https://aclanthology.org/2022.acl-short.18/). This poses a challenge for fair and unbiased evaluation of models, as their performance may not accurately reflect their generalization capabilities.
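One simple way contamination is often detected in practice is by measuring n-gram overlap between an evaluation instance and documents in a pre-training corpus. The sketch below is an illustrative assumption, not a method prescribed by the shared task: the function names, the n-gram size, and the idea of thresholding the overlap ratio are all choices of this example.

```python
# Minimal sketch of an n-gram overlap contamination check.
# All names and parameters here are illustrative, not part of the shared task.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text (case-folded)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(eval_example: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the example's n-grams that also occur in the corpus document."""
    example_grams = ngrams(eval_example, n)
    if not example_grams:
        return 0.0
    return len(example_grams & ngrams(corpus_doc, n)) / len(example_grams)

# A high ratio suggests the benchmark instance may have appeared in pre-training data.
doc = "the quick brown fox jumps over the lazy dog near the river bank today"
example = "the quick brown fox jumps over the lazy dog"
print(overlap_ratio(example, doc, n=4))  # 1.0: every 4-gram of the example is in doc
```

In real settings the corpus side is far too large to scan naively, so such checks are usually run with indexed search over the corpus; the ratio computation itself stays the same.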
The shared task is a community effort on centralized data contamination evidence collection. While the problem of data contamination is prevalent and serious, the breadth and depth of this contamination are still largely unknown. Concrete evidence of contamination is scattered across papers, blog posts, and social media, and we suspect that the true scope of data contamination in NLP is significantly larger than reported.
With this shared task we aim to provide a structured, centralized platform for contamination evidence collection to help the community understand the extent of the problem and to help researchers avoid repeating the same mistakes. The shared task also gathers evidence of clean, non-contaminated instances. The platform is already available for perusal at https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report.
Participants in the shared task need to submit their contamination evidence (see instructions below). The CONDA 2024 workshop organizers will review the evidence through pull requests.
*/Compilation Paper/*
As a companion to the contamination evidence platform, we will produce a paper that provides a summary and overview of the evidence collected in the shared task. Participants who contribute to the shared task will be listed as co-authors of the paper.
*/Instructions for Evidence Submission/*
Each submission should report a case of contamination, or the lack thereof. The submission can be either about (1) contamination in the corpus used to pre-train language models, where the pre-training corpus contains a specific evaluation dataset, or about (2) contamination in a model that shows evidence of having seen a specific evaluation dataset during training. Each submission needs to mention the corpus (or model) and the evaluation dataset, in addition to some evidence of contamination. Alternatively, we also welcome evidence of a lack of contamination.
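To make the required elements concrete, the sketch below shows the kind of information a submission should carry: the contaminated corpus or model, the evaluation dataset, and the evidence itself. The field names and values here are purely hypothetical illustrations; the authoritative format is defined by the Contribution Guidelines in the HuggingFace space, not by this example.

```python
# Hypothetical evidence record; field names are illustrative only and do NOT
# reflect the official submission schema (see the Contribution Guidelines).
evidence = {
    "contaminated_source": "example-pretraining-corpus",  # corpus or model name (assumed)
    "evaluation_dataset": "example-benchmark",            # evaluation dataset (assumed)
    "contamination_type": "corpus",                       # case (1) "corpus" or case (2) "model"
    "contaminated": True,                                 # False when reporting a clean, non-contaminated pair
    "evidence": "Description of how the contamination (or its absence) was established.",
}

# A submission is only meaningful if all three required elements are present.
required = {"contaminated_source", "evaluation_dataset", "evidence"}
print(required.issubset(evidence))  # True
```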
Reports must be submitted through a Pull Request in the Data Contamination Report space at HuggingFace. The reports must follow the Contribution Guidelines provided in the space and will be reviewed by the organizers. If you have any questions, please contact us at conda-workshop@googlegroups.com or open a discussion in the space itself.
URL with contribution guidelines: https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report (“Contribution Guidelines” tab)
*/Important dates/*
* Deadline for evidence submission: July 1, 2024
* Workshop day: August 16, 2024
*/Sponsors/*
* AWS AI and Amazon Bedrock
* HuggingFace
* Google
*/Contact/*
* Website: https://conda-workshop.github.io/
* Email: conda-workshop@googlegroups.com
*/Organizers/*
* Oscar Sainz, University of the Basque Country (UPV/EHU)
* Iker García Ferrero, University of the Basque Country (UPV/EHU)
* Eneko Agirre, University of the Basque Country (UPV/EHU)
* Jon Ander Campos, Cohere
* Alon Jacovi, Bar Ilan University
* Yanai Elazar, Allen Institute for Artificial Intelligence and University of Washington
* Yoav Goldberg, Bar Ilan University and Allen Institute for Artificial Intelligence