October 2025 - Corpora

3rd call, training data ready: UniDive shared tasks on idiomaticity and multiword expressions
by Agata Savary 09 Oct '25

**3rd CALL FOR PARTICIPATION**

Two peas in a pod: PARSEME 2.0 and AdMIRe 2.0 multilingual UniDive shared tasks on idiomaticity and multiword expressions
https://unidive.lisn.upsaclay.fr/doku.php?id=other-events:parseme-admire-st…
Expression of interest: https://forms.gle/rwSfUmNR1sTsHDfx6

====================================================================

The UniDive COST Action <https://unidive.lisn.upsaclay.fr/> is happy to announce the AdMIRe 2.0 and PARSEME 2.0 shared tasks, dedicated to detecting and interpreting idiomaticity and multiword expressions (MWEs). MWEs are groups of words that have non-compositional semantics, i.e. their meanings cannot be straightforwardly deduced from the meanings of their components. For instance, a "bad apple" is a person who has a bad influence on others.

Both shared tasks will take place together, and we hope to co-organise the workshop with the SIGLEX-MWE section <https://multiword.org/> and co-locate it with EACL 2026 in Morocco (24-29 March 2026).

Participating teams are to submit the results of their systems on CodaBench <https://www.codabench.org/>. The submission links will be published at the same time as the test data.

We are delighted to confirm that UniDive <https://unidive.lisn.upsaclay.fr/> will provide funding <https://unidive.lisn.upsaclay.fr/doku.php?id=other-events:parseme-admire-st…> for selected system presenters.

Important dates
-----------------
* [1 OCTOBER] Training data and baseline systems released
* [3 DECEMBER] Publication of blind test data
* [8 DECEMBER] Submission of system predictions
* [19 DECEMBER] Systems evaluated
* [5 JANUARY] Submission deadline for system description papers
* [9-23 JANUARY] Reviewing period (system teams will participate as reviewers)
* [3 FEBRUARY] Submission deadline for camera-ready papers
* [24-29 MARCH 2026] EACL, including the MWE workshop (to be confirmed)

PARSEME 2.0 is a shared task whose main objective is to identify and paraphrase multiword expressions (MWEs) in written text. We propose two subtasks. Subtask 1 is the classical task of identifying MWEs in running text. Subtask 2 consists in paraphrasing a sentence containing an MWE so as to remove idiomaticity. Data annotation is finished and 17 languages are covered: Dutch, Egyptian (ca. 2700-2000 BC), French, Georgian, Greek (Ancient), Greek (Modern), Hebrew, Japanese, Latvian, Persian, Polish, Portuguese (Brazilian), Romanian, Serbian, Slovene, Swedish, and Ukrainian.

AdMIRe 2.0 (Advancing Multimodal Idiomaticity Representation) addresses the challenge of multilingual and multimodal idiomatic language understanding by evaluating how well models interpret potentially idiomatic expressions (PIEs) across languages and across modalities, using both text and images. This new edition extends the AdMIRe 1 task <https://arxiv.org/pdf/2503.15358>, adding more languages from the UniDive network and beyond. Given a context sentence containing a PIE and a set of five images, the task is to rank the images based on how accurately they depict the meaning of the PIE used in that sentence. The task will be zero-shot for newly introduced languages.

While the task is designed to encourage participation from teams working on multilingual and multimodal technologies, it also accommodates approaches focused only on a subset of the languages and on a single modality (text), using automatically generated descriptive captions for each image, which allows models to rely exclusively on text input if desired.

Data
-----------------
* Training data for AdMIRe 2.0: https://semeval2025-task1.github.io/data/training/training_data.html
* Training data for PARSEME 2.0: https://gitlab.com/parseme/sharedtask-data/-/tree/master/2.0

Organizing team
---------------
PARSEME 2.0:
* Manon Scholivet, Université Paris Saclay, LISN, FR
* Takuya Nakamura, Université Paris Saclay, LISN, FR
* Agata Savary, Université Paris Saclay, LISN, FR
* Éric Bilinski, Université Paris Saclay, LISN, FR
* Carlos Ramisch, Aix-Marseille Université, LIS, FR

AdMIRe 2.0:
* Adriana Pagano, Universidade Federal de Minas Gerais, BR
* Aline Villavicencio, University of Exeter, UK
* Dilara Torunoğlu Selamet, Istanbul Technical University, TR
* Doğukan Arslan, Istanbul Technical University, TR
* Gülşen Eryiğit, Istanbul Technical University, TR
* Rodrigo Wilkens, University of Exeter, UK
* Tom Pickard, University of Sheffield, UK
* Wei He, University of Exeter, UK

Shared task for spontaneous speech ASR
by Robert Pugh 08 Oct '25

Mozilla Data Collective (the new platform where Mozilla Common Voice datasets, among other datasets, are hosted) has just kicked off a Shared Task on Spontaneous Speech ASR. It targets 21 underrepresented languages (from Africa, the Americas, Europe, and Asia) and features brand-new datasets and prizes for the best systems in each task.

For more information, visit https://community.mozilladatacollective.com/shared-task-mozilla-common-voic…

Robert Pugh
Senior Community Manager
mozillafoundation.org (UTC-7)

Monthly online ILFC Seminar: interactions between formal and computational linguistics
by Timothée Bernard 08 Oct '25

Monthly online ILFC Seminar: interactions between formal and computational linguistics
https://gdr-lift.loria.fr/monthy-online-ilfc-seminar/

The LIFT 2 research group is happy to announce the forthcoming sessions of the ILFC seminar on the interactions between formal and computational linguistics.

The seminar is held on Zoom. To attend the seminar and get updates, please subscribe to our mailing list (we now only rarely communicate through other mailing lists): https://sympa.inria.fr/sympa/subscribe/seminaire_ilfc

- 2025/10/15 16:30-17:30 UTC+2: *Noga Zaslavsky* (New York University)
  Title: *Cultural evolution of efficient semantic systems in humans and AI*
  Abstract: *Human languages efficiently compress meanings into words, but how did our semantic systems evolve to be that way? Are AI systems capable of evolving efficient semantic systems and representing meaning as we do? In this talk, I address these open questions from cognitive, cultural, and computational perspectives. First, I show that individual human learners favor efficiently compressed semantic representations. This inductive learning bias, when amplified via cultural transmission, drives the evolution of near-optimally efficient semantic systems. Second, I consider large language models (LLMs) and show that while they vary widely in their semantic alignment with humans, they nevertheless exhibit a similar tendency toward efficient compression: when simulating cultural evolution with LLMs, they iteratively restructure initially random semantic systems towards greater efficiency. Finally, I show that introducing an explicit pressure for efficient compression, grounded in the information bottleneck principle, enables multi-agent reinforcement learning systems to evolve efficient, human-like semantic systems without any human supervision. Taken together, these results demonstrate how humans and AI can evolve efficient semantic systems through social interaction and cultural transmission, and more broadly, they suggest that efficient compression may be a fundamental principle of intelligence.*

- 2025/11/26 16:30-17:30 UTC+1: *Ece Takmaz* (Utrecht University)
  Title: [TBA]
  Abstract: [TBA]

- 2025/12/17 16:30-17:30 UTC+1: *Ethan Wilcox* (Georgetown University)
  Title: [TBA]
  Abstract: [TBA]

- 2026/01/21 16:30-17:30 UTC+1: *Gemma Boleda* (Universitat Pompeu Fabra)
  Title: [TBA]
  Abstract: [TBA]

- 2026/03/18 16:30-17:30 UTC+1: *Adele Goldberg* (Princeton University)
  Title: [TBA]
  Abstract: [TBA]

Deadline Extension – International Workshop on Spoken Dialogue Systems (IWSDS) 2026 (Trento, Italy)
by Giuseppe Riccardi 08 Oct '25

Dear all,

We are pleased to announce that the submission deadline for the 16th International Workshop on Spoken Dialogue Systems (IWSDS 2026) has been extended:

📅 Important Dates (Extended):
📝 Paper Submission Deadline: October 12 → October 22, 2025
🔄 Paper Update Deadline: October 18 → October 28, 2025
✅ Acceptance Notification: December 10, 2025
🎤 Workshop Dates: February 26 – March 1, 2026

We invite submissions of long papers, short papers, position papers, industry track papers, and demonstrations on a broad range of topics related to the Theoretical Foundations, Systems and Methods, and Applications of spoken and multimodal dialogue systems. Accepted papers will be included in the ACL Anthology.

This year’s theme is:
🎯 “Human-Machine Dialogue in the Era of Multimodal Foundation Models”

Location: Trento, Italy – the gateway to the Dolomites, right after the Milano Cortina 2026 Winter Olympics
Website & CfP: https://sites.google.com/unitn.it/iwsds26/
Twitter/X: https://x.com/iwsdsmeeting
Bluesky: https://bsky.app/profile/iwsdsmeeting.bsky.social

---
Prof. Dr.-Ing. Giuseppe Riccardi
Founder and Director of the Signals and Interactive Systems Lab
Department of Computer Science and Information Engineering
University of Trento
Room D206, via Sommarive 5
38123 Povo di Trento, Italy
Home Page: http://disi.unitn.it/~riccardi/

Release of the massive HPLT v3.0 multilingual dataset
by Andrey Kutuzov 07 Oct '25

October is back and so are HPLT datasets (we've been doing this for three consecutive years now!).

This time, we announce the release of the massive HPLT v3.0 multilingual dataset, which can be considered a major upgrade for large-scale multilingual corpora. Accounting for 29 billion documents, 198 language-script combinations and 112 trillion characters, v3.0 shows significant gains over v2, driven by several improvements, including a new global deduplication process:

- Unique content boosted from 52% to 73% on average.
- Data substance and robustness remain high, with better extraction and improved language identification.
- Increased variety and better representativeness of natural web content.

This release provides a cleaner, more robust dataset for building powerful LLMs and machine translation systems, including for a myriad of low- to medium-resourced languages.

And we have not said our last word: wait for more data soon, because we are already working on it.

Special thanks to all the collaborators and funding bodies, including the European Union's Horizon Europe program and UK Research and Innovation.

Explore the data and see the full analysis and evaluation highlights on our website: https://hplt-project.org/datasets/v3.0

--
Andrey
Language Technology Group (LTG)
University of Oslo

Online talks Corpus linguistics & applied linguistics research 2025
by PASCUAL FRANCISCO PEREZ PAREDES 07 Oct '25

Dear colleagues,

We are sharing the following details about the online event Corpus linguistics & applied linguistics research 2025, hosted by the University of Murcia from 3 to 27 November 2025. This year’s talks will focus on the impact of AI on corpus linguistics.

Speakers:
* Dr Lisa Cheung, The University of Hong Kong, & Dr Peter Crosthwaite, The University of Queensland (3 Nov)
* Prof Qing Ma, The Education University of Hong Kong (11 Nov)
* Prof Laurence Anthony, Waseda University (25 Nov)
* Prof Atsushi Mizumoto, Kansai University, Osaka (27 Nov)

Info: https://www.um.es/languagecorpora/clresearch2025/

The talks will take place via Zoom at 11:00 a.m. (Madrid) / 5:00 p.m. (Hong Kong).
Registration link: https://umurcia.zoom.us/webinar/register/WN_NAQ8nFTgSCO2obOFNNgc1A#/registr…

For this edition, attendees who request it will receive a certificate of participation.

Corpus linguistics & applied linguistics research 2025 is organized by the University of Murcia research group E020-07 “Lenguajes de especialidad, corpus lingüísticos y lingüística inglesa aplicada a la ingeniería del conocimiento”, with support from the Faculty of Arts, the Department of English Philology at the University of Murcia, and The Education University of Hong Kong.

Follow updates on X: @languagecorpora
Watch talks from previous editions: https://www.youtube.com/@corporaappliedlinguistics8358/playlists

Kind regards from the organizing committee,
Pascual Pérez-Paredes
https://webs.um.es/pascualf

1st CfP: 6th International Workshop on Computational Approaches to Historical Language Change (LChange’26)
by Andrey Kutuzov 07 Oct '25

*** Apologies for possible cross-posting ***

First Call for Papers: 6th International Workshop on Computational Approaches to Historical Language Change (LChange’26)
Co-located with EACL 2026, Rabat, Morocco & Online | March 24–29, 2026

📌 Website: https://www.changeiskey.org/event/2026-eacl-lchange/
📧 Contact: lchange(a)changeiskey.org

== About the Workshop ==
The LChange workshop brings together researchers interested in computational modeling of language change, both historical and synchronic. Following the success of LChange in 2019, 2021, 2022, 2023 and 2024, this sixth edition will be held as a hybrid half-day workshop at the EACL 2026 conference in Rabat.

We welcome contributions addressing all aspects of computational approaches to language change. Our goal is to foster dialogue on state-of-the-art computational methodologies, resources, and theories that explore the dynamic, time-varying nature of language.

In addition to paper presentations and keynotes, we offer a mentorship program for students to engage with experienced researchers, regardless of whether they are submitting a paper or not.

== Important Dates (tentative) ==
Direct submission deadline: December 19, 2025
Pre-reviewed (ARR) submission deadline: January 2, 2026
Notification of acceptance: January 23, 2026
Camera-ready paper due: February 3, 2026
Workshop dates: March 24-29, 2026

== Submission Information ==
We accept the following types of submissions:
- Long papers: up to 8 pages (+ references)
- Short papers: up to 4 pages (+ references). Dataset and model release papers should be submitted as short papers.

Final versions will be given one additional page of content so that reviewers' comments can be taken into account.

== Review Process ==
Papers must be submitted anonymously. All submissions will undergo double-blind peer review by at least three reviewers, with final acceptance decisions made by the workshop organizers. Accepted papers will be published in the workshop proceedings and presented orally or as posters.

Call for reviewers: If you have published in the field previously and are interested in joining the program committee to review papers, please email us at lchange(a)changeiskey.org!

== Topics of Interest ==
We invite original research papers on (but not limited to):
- Novel methods for detecting diachronic semantic change and lexical replacement
- Automatic discovery and quantitative evaluation of laws of language change
- Computational theories and generative models of language change
- Sense-aware (semantic) change analysis
- Diachronic word sense disambiguation
- Novel methods for diachronic analysis of low-resource languages
- Novel methods for diachronic linguistic data visualization
- Novel applications and implications of language change detection
- Quantification of sociocultural influences on language change
- Cross-linguistic, phylogenetic, and developmental approaches to language change
- Novel datasets for cross-linguistic and diachronic analyses of language

== Organizers ==
Nina Tahmasebi, University of Gothenburg
Pierluigi Cassotti, University of Gothenburg
Syrielle Montariol, UC Berkeley, École polytechnique fédérale de Lausanne
Andrey Kutuzov, University of Oslo
Netta Huebscher, University of Gothenburg
Elena Spaziani, Sapienza University of Rome
Naomi Baes, University of Melbourne

--
Andrey
Language Technology Group (LTG)
University of Oslo

Fully Funded PhD Studentship in Multimodal AI @ Queen’s University Belfast, UK
by Mohammed Hasanuzzaman 07 Oct '25

Dear Colleagues,

We are looking for a PhD student in Multimodal AI for Proactive Herd Health and Dairy Farm Management (CLÁR Project) at the School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, UK.

This fully funded PhD studentship is supported by SUSTAIN (https://www.sustain-cdt.ai/), the UKRI Centre for Doctoral Training in Sustainable Understandable agri-food Systems Transformed by Artificial Intelligence. SUSTAIN empowers the next generation of AI scientists to invent, develop, and deploy technologies co-created with growers and agri-food practitioners, facilitating meaningful partnerships between academia and industry.

The successful candidate will work with our multidisciplinary supervisory team and our industry partner, CattleEye, gaining access to global real-world data and industry expertise. This project offers a unique opportunity to develop a skill set in AI applied to globally significant sustainability challenges in dairy farming.

If you have a background in NLP, Data Science, ML/DL, or Computer Vision and are motivated to apply your expertise to real-world sustainability challenges in agriculture, this could be a fantastic opportunity for you.

Location: Queen’s University Belfast, UK
Eligibility: Open to students worldwide
Funding: Fully funded PhD studentship (tuition fees, tax-free stipend, Research Training Support Grant, and additional development support)
Deadline: Friday, October 17, 2025
More details: https://www.findaphd.com/phds/project/project-q3138-cl-r-multimodal-ai-for-…

For enquiries, contact m.hasanuzzaman(a)qub.ac.uk with your CV.

Best,
M

PhD position in Emotionally and Socially Aware Natural Language Processing (LIACS, Leiden University)
by Plaza del Arco, F.M. (Flor Miriam) 07 Oct '25

Dear colleagues,

We are looking for a PhD candidate in Emotionally and Socially Aware Natural Language Processing at LIACS, Leiden University <https://liacs.leidenuniv.nl/>. The PhD will be supervised by myself, Prof. Suzan Verberne <https://www.universiteitleiden.nl/en/staffmembers/suzan-verberne#tab-1>, and Prof. Joost Broekens <https://www.universiteitleiden.nl/en/staffmembers/joost-broekens#tab-1>. The position is part of the Human-AI cluster <https://www.universiteitleiden.nl/en/science/computer-science/research/huma…>, a great environment for interdisciplinary research where AI and machine learning meet philosophy, cognitive science, and the creative arts.

This PhD position focuses on advancing AI models that don’t just optimize for accuracy, but also recognize and respond to emotions responsibly and adapt to social context. Current systems often reproduce or amplify social biases, generate toxic content, or do not respond safely to emotional cues. The goal of this PhD is to design AI systems that promote inclusivity, fairness, and emotional intelligence in human-AI interaction, with a particular focus on applications in mental well-being, education, and other socially sensitive contexts where the way AI interacts with people has a big impact.

The deadline to apply is November 17, 2025. Details on the position and the application procedure can be found in the job ad: https://www.universiteitleiden.nl/en/vacancies/2025/q4/16038-phd-candidate-…

Please send me an email if you have any questions regarding the position.

Best regards,
Flor Plaza

------------------------------------------------
Flor Miriam Plaza del Arco, Ph.D.
Assistant Professor
Leiden Institute of Advanced Computer Science (LIACS), Human AI
Leiden University
Office: BM 3.03, Gorlaeus Gebouw – BE-vleugel
Einsteinweg 55, 2333 CC Leiden, Netherlands
🌐 Website <https://fmplaza.github.io/> | 🦋 BlueSky <http://florplaza.bsky.social/>

Research Internship Opportunities (Master) – TALN Team, LS2N (Nantes Université)
by Florian Boudin 07 Oct '25

The TALN team (Natural Language Processing, https://taln.ls2n.fr/) of the LS2N laboratory (https://www.ls2n.fr/, Nantes Université) is offering 5 research internships at the Master 2 level (duration: 5 to 6 months), starting in February 2026.

These internships will take place within the team's research themes on Natural Language Processing (NLP) for specialized domains, in particular healthcare and science, with a focus on the study and adaptation of large language models (LLMs).

Possible topics include:
- Study and evaluation of LLMs in limited contexts (specialized data, constrained domains);
- Recommendation and navigation systems to explore the scientific literature;
- Text revision and scientific writing assistance tools;
- Verification of claims in scientific articles, including text/image multimodality (in collaboration with the IPI team);
- Clinical text analysis for estimating patient autonomy scores.

Candidate profile: Master’s students in computer science, specializing in Artificial Intelligence (AI), Natural Language Processing (NLP), machine learning, or related fields, with solid programming skills (Python) and a strong interest in research.

Contacts:
- Florian Boudin - florian.boudin(a)inria.fr
- Richard Dufour - richard.dufour(a)univ-nantes.fr
