18th WORKSHOP ON BUILDING AND USING COMPARABLE CORPORA WITH SHARED TASK ON MULTILINGUAL TERMINOLOGY EXTRACTION FROM COMPARABLE CORPORA
Co-located with COLING 2025 (Abu Dhabi)
Paper submission deadline: 30 November, 2024
Workshop website: https://comparable.lisn.upsaclay.fr/bucc2025/
COLING website: https://coling2025.org/
Keynote speaker: Preslav Nakov, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi
**************************************************************
* Motivation
In the language engineering and linguistics communities, research in comparable corpora has been motivated by two main reasons. In language engineering, on the one hand, it is chiefly motivated by the need to use comparable corpora as training data for statistical NLP applications such as statistical and neural machine translation or cross-lingual retrieval. In linguistics, on the other hand, comparable corpora are of interest because they enable cross-language discoveries and comparisons. It is generally accepted in both communities that comparable corpora consist of documents that are comparable in content and form to various degrees and along various dimensions across several languages. Parallel corpora are at one end of this spectrum, and unrelated corpora are at the other.
In recent years, the use of comparable corpora for pre-training Large Language Models (LLMs) has led to their impressive multilingual and cross-lingual abilities, which are relevant to a range of applications such as information retrieval, machine translation, and cross-lingual text classification. The linguistic definitions and observations related to comparable corpora can improve methods for mining such corpora or for cross-lingual transfer in LLMs. It is therefore of great interest to bring together builders and users of such corpora.
* Shared Task
This year we will run a shared task aimed at detecting translations of terms via comparable corpora. Please see the website for details: https://comparable.limsi.fr/bucc2025/bucc2025-task.html
* Topics
We solicit contributions on all topics related to comparable (and parallel) corpora, including but not limited to the following:
Building Comparable Corpora:
- Automatic and semi-automatic methods
- Methods to mine parallel and non-parallel corpora from the web
- Tools and criteria to evaluate the comparability of corpora
- Parallel vs non-parallel corpora, monolingual corpora
- Rare and minority languages, across language families
- Multi-media/multi-modal comparable corpora
Applications of Comparable Corpora:
- Human translation
- Language learning
- Cross-language information retrieval & document categorization
- Bilingual and multilingual projections
- (Unsupervised) Machine translation
- Writing assistance
- Machine learning techniques using comparable corpora
Mining from Comparable Corpora:
- Cross-language distributional semantics, word embeddings and pre-trained multilingual transformer models
- Extraction of parallel segments or paraphrases from comparable corpora
- Methods to derive parallel from non-parallel corpora (e.g. to provide for low-resource languages in neural machine translation)
- Extraction of bilingual and multilingual translations of single words, multi-word expressions, proper names, named entities, sentences, and paraphrases from comparable corpora, etc.
- Induction of morphological, grammatical, and translation rules from comparable corpora
- Induction of multilingual word classes from comparable corpora
Comparable Corpora in the Humanities:
- Comparing linguistic phenomena across languages in contrastive linguistics
- Analyzing properties of translated language in translation studies
- Studying language change over time in diachronic linguistics
- Assigning texts to authors via authors' corpora in forensic linguistics
- Comparing rhetorical features in discourse analysis
- Studying cultural differences in sociolinguistics
- Analyzing language universals in typological research
* Workshop Organizers
- Serge Sharoff (University of Leeds)
- Ayla Rigouts Terryn (Université de Montréal (UdeM), Mila)
- Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, Orsay)
- Reinhard Rapp (University of Mainz, Germany)
* Program Committee
- Ebrahim Ansari (Institute for Advanced Studies in Basic Sciences, Iran)
- Eleftherios Avramidis (DFKI, Germany)
- Gabriel Bernier-Colborne (National Research Council, Canada)
- Thierry Etchegoyhen (Vicomtech, Spain)
- Alex Fraser (University of Munich, Germany)
- Natalia Grabar (University of Lille, France)
- Amal Haddad Haddad (Universidad de Granada, Spain)
- Amir Hazem (University of Tokyo, Japan)
- Kyo Kageura (University of Tokyo, Japan)
- Natalie Kübler (Université Paris Cité, France)
- Philippe Langlais (Université de Montréal, Canada)
- Yves Lepage (Waseda University, Japan)
- Shervin Malmasi (Amazon, USA)
- Michael Mohler (Language Computer Corporation, USA)
- Emmanuel Morin (Nantes Université, France)
- Dragos Stefan Munteanu (RWS, USA)
- Ted Pedersen (University of Minnesota, Duluth, USA)
- Nasredine Semmar (CEA LIST, Paris, France)
- Silvia Severini (Leonardo Labs, Italy)
- Pranaydeep Singh (University of Gent, Belgium)
- Richard Sproat (Google, USA)
- Marko Tadić (University of Zagreb, Croatia)
- François Yvon (Sorbonne Université, France)