*New paper submission and ARR commitment deadlines (see below)*
We invite you to participate and submit your work to the First Workshop on Data Contamination (CONDA) co-located with ACL 2024 in Bangkok, Thailand.
Data contamination, where evaluation data is inadvertently included in the pre-training corpora of large-scale models, and language models (LMs) in particular, has become a growing concern in recent times. The growing scale of both models and data, coupled with massive web crawling, has led to the inclusion of segments from evaluation benchmarks in the pre-training data of LMs. The scale of internet data makes it difficult to prevent this contamination from happening, or even to detect when it has happened. Crucially, when evaluation data becomes part of pre-training data, it introduces biases and can artificially inflate the performance of LMs on specific tasks or benchmarks. This poses a challenge for fair and unbiased evaluation of models, as their performance may not accurately reflect their generalization capabilities.
Although a growing number of papers and state-of-the-art models mention issues of data contamination, there is no agreed-upon definition or standard methodology to ensure that a model does not report results on contaminated benchmarks. Addressing data contamination is a shared responsibility among researchers, developers, and the broader community. By adopting best practices, increasing transparency, documenting vulnerabilities, and conducting thorough evaluations, we can work towards minimizing the impact of data contamination and ensuring fair and reliable evaluations.
We welcome paper submissions on all topics related to data contamination, including but not limited to:
* Definitions, taxonomies, and gradings of contamination
* Contamination detection (both manual and automatic)
* Community efforts to discover, report, and organize contamination events
* Documentation frameworks for datasets or models
* Methods to avoid data contamination
* Methods to forget contaminated data
* Scaling laws and contamination
* Memorization and contamination
* Policies to avoid the impact of contamination in publication venues and open-source communities
* Reproducing and attributing results from previous work to data contamination
* Survey work on data contamination research
* Data contamination in other modalities
*Submission Instructions*
We welcome two types of papers: regular workshop papers and non-archival submissions. Regular workshop papers will be included in the workshop proceedings. All submissions must be in PDF format and made through OpenReview.
* *Regular workshop papers:* Authors can submit papers of up to 8 pages, with unlimited pages for references. Authors may separately submit up to 100 MB of supplementary materials and their code for reproducibility. All submissions undergo a double-blind, single-track review. Best Paper Award(s) will be given based on nominations by the reviewers. Accepted papers will be presented as posters, with the possibility of oral presentations.
* *Non-archival submissions:* Cross-submissions are welcome. Accepted papers will be presented at the workshop but not included in the workshop proceedings. Papers must be in PDF format and will be reviewed in a double-blind fashion by workshop reviewers. We also welcome extended abstracts (up to 2 pages) of work that is in progress, under review, or to be submitted to other venues. Papers in this category must follow the ACL format.
In addition to papers submitted directly to the workshop, which will be reviewed by our Programme Committee, we also accept papers reviewed through ACL Rolling Review (ARR) and committed to the workshop. Please check the relevant dates for each type of submission.

*Important dates*
Relevant deadlines to consider when submitting your paper are:
* *Paper submission deadline: May 31 (Friday), 2024*
* *ARR pre-reviewed commitment deadline: June 14 (Friday), 2024*
* Notification of acceptance: June 17 (Monday), 2024
* Camera-ready paper due: July 1 (Monday), 2024
* Workshop date: August 16, 2024
*Sponsors*
* AWS AI and Amazon Bedrock
* HuggingFace
* Google
*Contact*
* Website: https://conda-workshop.github.io/
* Email: conda-workshop@googlegroups.com
*Organizers*

Oscar Sainz, University of the Basque Country (UPV/EHU)
Iker García Ferrero, University of the Basque Country (UPV/EHU)
Eneko Agirre, University of the Basque Country (UPV/EHU)
Jon Ander Campos, Cohere
Alon Jacovi, Bar Ilan University
Yanai Elazar, Allen Institute for Artificial Intelligence and University of Washington
Yoav Goldberg, Bar Ilan University and Allen Institute for Artificial Intelligence