June 2025 - Corpora - ELRA lists

CfP: Terminology Translation Task @WMT25
by Kirill Semenov 18 Jun '25

[Apologies for cross-posting]

Terminology Translation Task at WMT2025 - Call for Participation

We are excited to announce the third Shared Task on Terminology Translation<https://www2.statmt.org/wmt25/terminology.html>, which will be run within the 10th Conference on Machine Translation (WMT2025) in Suzhou, China.

TL;DR:
- We test sentence-level and document-level translation of texts in the finance and IT domains, given explicit terminology.
- The language pairs are: English -> {Spanish, German, Russian, Chinese}, Chinese -> English.
- We evaluate the overall quality of translation, terminology success rate and consistency. Additionally, we compare the performance of systems given no terms, proper terminology and random terms.
- The task starts on 20th June 2025 AoE; the submission deadline is 20th July 2025 AoE.
- Please pre-register via Google Forms here: https://forms.gle/ZSn2pNJkQJAzHFnA6 .

OVERVIEW

Advances in neural MT and LLM-assisted translation over the last decade have brought general-domain translation close to human quality, at least for high-resource languages. However, in specialized domains like science, finance, or legal texts, where the correct and consistent use of special terms is crucial, the task is far from solved. The Terminology Shared Task aims to assess the extent to which machine translation models can utilize additional information regarding the translation of terminology. Compared to the two previous editions, 2021 and 2023, the new test data have more varied test cases, are more consistent in domain within each translation direction, and are broader in language coverage.

TASK DESCRIPTION

Track №1: Sentence/Paragraph-Level Translation

You will be provided with a sequence of input texts of sentence or paragraph length, and small terminology dictionaries that cover only the terms present in the given input.

Language Pairs:
* en-de (English → German)
* en-ru (English → Russian)
* en-es (English → Spanish)

Domain: information technology

Track №2: Document-Level Translation

The setup is similar to Track №1, with two exceptions: each input text is now a whole document, and the dictionaries cover the whole set of input texts (i.e. they are corpus-level). This brings the task closer to the real-life setup (where dictionaries exist independently of the texts), but it may complicate the implementation (solutions that need to store the whole dictionary will take more memory). Additionally, in the whole-document setup, consistent usage of terms becomes more important.

Language Pairs:
* en-zh-Hant (English → Traditional Chinese)
* zh-Hant-en (Traditional Chinese → English)

Domain: finance

EVALUATION

Terminology Modes: You are expected to compare your system's performance under three modes:

1. No terminology: the system is only provided with input sentences/documents.
2. Proper terminology: the system is provided with input texts (same as 1.) and dictionaries of the format {source_term: target_term}.
3. Random terminology: the system is provided with input texts and translation dictionaries of the same format as in 2. The difference is that the dictionary items are not special terms but words randomly drawn from the input texts. This mode is of special interest because it lets us measure to what extent proper term translations (2.) improve system performance, as opposed to arbitrary additional input that does not contain domain-specific terminology.

Metrics:

1. Overall Translation Quality: we will evaluate general aspects of machine translation outputs such as fluency, adequacy and grammaticality, using general automatic MT metrics such as BLEU or COMET. In addition, we will pay special attention to the grammaticality of the translated terms.
2. Terminology Success Rate: this metric assesses the ability of the system to accurately translate technical terms given the specialized vocabulary. It will be computed by comparing the correct term translations (i.e. the ones present in the dictionary) with the terms that appear in the output. A higher success rate shows closer adherence to the dictionary translations.
3. Terminology Consistency: for domains such as science or legal texts, the consistent use of an introduced term throughout the text is crucial. In other words, we want a system not only to pick a correct term in the target language but to use it consistently once it is chosen. This will be evaluated by comparing all translations of a given source term in a text and measuring the percentage of deviations from the most consistent translation. This metric is more important for the Document-Level track, but it will be used for both tracks.

IMPORTANT DATES

All deadlines are end of day, Anywhere on Earth (AoE).

Data snippets released: 7th May 2025
Dev data released: 22nd May 2025
Test data release, task starts: 20th June 2025 (postponed)
Submission deadline: 20th July 2025 (postponed)
Paper submission to WMT25: in line with WMT25
Camera-ready submission to WMT25: in line with WMT25
Conference in Suzhou, China: 05-09 November 2025

SUBMISSION GUIDELINES

0. Please notify us about your participation prior to submission. This is optional, but it will help us estimate our workload after submission. Please do it through this Google Form: https://forms.gle/ZSn2pNJkQJAzHFnA6
1. Check your submission files with the validation script. It will be published when the test data are released.
2. Write a description of your system (optional).
3. Submit your system via Google Forms. The Google Form with all necessary submission details will be published on the test set release date.

All details on submission, as well as an FAQ, can be found on the webpage of the shared task.

ORGANIZERS

* Kirill Semenov (University of Zurich), main contact: FirstNаmе [dоt] LаstNаmе {аt} uzh /dоt/ ch
* Nathaniel Berger (Heidelberg University)
* Pinzhen Chen (University of Edinburgh & Aveni.ai)
* Xu Huang (Nanjing University)
* Arturo Oncevay (JP Morgan)
* Dawei Zhu (Amazon)
* Vilém Zouhar (ETH Zurich)

WEBSITE: https://www2.statmt.org/wmt25/terminology.html

For any queries, please send an email to Kirill Semenov (see email above).
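
The terminology success rate and terminology consistency metrics described under EVALUATION can be illustrated with a minimal Python sketch. This is not the organizers' scoring script: the function names, the case-insensitive substring matching, and the input layout (one {source_term: target_term} dictionary per hypothesis) are assumptions made for illustration only. In particular, plain substring matching ignores inflected term forms, which matter for languages such as German or Russian.

```python
from collections import Counter


def term_success_rate(hypotheses, term_dicts):
    """Fraction of dictionary entries whose target term appears in the output.

    hypotheses: list of system translations (strings).
    term_dicts: list of {source_term: target_term} dicts, one per hypothesis.
    Matching is a case-insensitive substring test (an assumption of this sketch).
    """
    hits = 0
    total = 0
    for hyp, terms in zip(hypotheses, term_dicts):
        hyp_lower = hyp.lower()
        for target in terms.values():
            total += 1
            if target.lower() in hyp_lower:
                hits += 1
    return hits / total if total else 0.0


def term_deviation_rate(renderings):
    """Share of renderings of one source term that deviate from the most frequent one.

    renderings: target-side strings observed for a single source term
    within a single text. Lower values mean more consistent usage.
    """
    if not renderings:
        return 0.0
    most_common_count = Counter(renderings).most_common(1)[0][1]
    return 1.0 - most_common_count / len(renderings)


if __name__ == "__main__":
    # Hypothetical example data, not taken from the task's data sets.
    hyps = ["Der Server startet neu."]
    dicts = [{"server": "Server", "reboot": "Neustart"}]
    print(term_success_rate(hyps, dicts))
    print(term_deviation_rate(["Zinssatz", "Zinssatz", "Zins"]))
```

In this hypothetical example only one of the two dictionary targets occurs in the output, and one of the three renderings of the source term deviates from the majority choice. How the term renderings are extracted from the output (for instance via word alignment) is left open here, since the call does not specify it.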

Call for papers: The First Workshop on Natural Language Processing and Language Models for Digital Humanities (LM4DH 2025) @ RANLP_2025
by Nanomi Arachchige, Isuri (Postgraduate Researcher) 18 Jun '25

Call for papers: The First Workshop on Natural Language Processing and Language Models for Digital Humanities (LM4DH_2025) @ RANLP_2025

Date: 11th to 13th September 2025 (TBC)
Venue: Varna, Bulgaria
Website: https://www.clarin.eu/event/2025/clarin-workshop-ranlp-2025
Submissions Portal: https://softconf.com/ranlp25/LM4DH2025/

Digital Humanities has emerged as an interdisciplinary field of research, serving as an intersection of computer science with many other fields such as linguistics, social sciences, history, psychology, etc. With the development of Large Language Models (LLMs), state-of-the-art Natural Language Processing (NLP) tasks such as entity recognition, sentiment analysis, and text summarisation have been significantly enhanced. These developments offer powerful tools for analysing and interpreting complex historical, cultural, and social data, including oral histories, archival documents, and literary texts, enabling researchers to identify patterns, extract meaningful relationships, and generate interpretations at unprecedented scale and precision.

This workshop aims to provide a common platform for researchers, practitioners, and students from diverse disciplines to collaboratively explore and apply AI-driven techniques in the Digital Humanities. Through interdisciplinary discussion, the event aims to generate creative approaches, exchange best practices, and create a community committed to furthering AI-based research on human culture and history.

The focus of the workshop is on applying natural language processing techniques to digital humanities research. Topics can be anything of digital humanities interest with a natural language processing or LLM-based application.

We expect contributions related (but not limited) to the following topics:

* Text analysis and processing related to the humanities using computational methods
* Use of the interpretability of large language models' output for DH-related tasks
* Dataset creation and curation for NLP (e.g. digitisation, datafication, and data preservation)
* Automatic error detection, correction, and normalisation of textual data
* Generation and analysis of literary works such as poetry and novels
* Analysis and detection of text genres
* Emotion analysis for the humanities and literature
* Modelling of information and knowledge in the Humanities, Social Sciences, and Cultural Heritage
* Low-resource and historical language processing
* Search for scientific and/or scholarly literature
* Profiling and authorship attribution

Submission & Publication

All papers must represent original and unpublished work that is not currently under review. Papers will be evaluated according to their significance, originality, technical content, style, clarity, and relevance to the workshop.

Submissions must follow the RANLP 2025 submission guidelines<https://ranlp.org/ranlp2025/index.php/submissions/>, using ACL-style templates (LaTeX or MS Word). Papers must be submitted using SoftConf at https://softconf.com/ranlp25/LM4DH2025/

All papers will be double-blind peer reviewed. Authors of accepted papers will present their work in either the oral or the poster session. All accepted papers will appear in the workshop proceedings, which will be published in the ACL Anthology.

Important Dates

* Paper submission deadline: 20th July 2025
* Notification of acceptance: 2nd August 2025
* Camera-ready paper: 20th August 2025
* Workshop date: 11th September 2025

Organising Committee

* Isuri Anuradha, Lancaster University, UK
* Francesca Frontini, CNR-ILC, Italy & CLARIN ERIC
* Paul Rayson, Lancaster University, UK
* Ruslan Mitkov, Lancaster University, UK
* Deshan Sumanathilake, Swansea University, UK

This workshop has been organised with the generous support and coordination of CLARIN-EU.

Email: dhranlp2(a)gmail.com

FIRE 2025- Call for Participation in Tracks - 12 Tracks -17th meeting of the Forum for Information Retrieval Evaluation -Dec. Varanasi
by Thomas Mandl 17 Jun '25

*Call for Participation in Tracks*

*FIRE 2025: 17th meeting of the Forum for Information Retrieval Evaluation*

Indian Institute of Technology (BHU) Varanasi
17th - 20th December
Website: fire.irsi.org.in <http://fire.irsi.org.in/>

FIRE 2025 offers the following exciting tracks this year:

* Cross-Lingual Mathematical Information Retrieval (CLMIR) <https://clmir2025.github.io/>
* Code-Mixed Information Retrieval from Social Media Data (CMIR) <https://cmir-iitbhu.github.io/cmir/index.html>
* Hate Speech and Offensive Content Identification in Memes in Bengali, Hindi, Gujarati and Bodo (HASOC-meme) <https://hasocfire.github.io/hasoc/2025/>
* Information Retrieval in Software Engineering (IRSE) <https://sites.google.com/view/irse-2025/home>
* Misinformation Detection and Prompt Recovery (PROMID) <https://promid.github.io/index.html>
* Multilingual Story Illustration: Bridging Cultures through AI Artistry (MUSIA) <https://cse-iitbhu.github.io/MUSIA/index.html>
* Offensive Language Identification in Dravidian Languages (DravidianCodeMix) <https://dravidian-codemix.github.io/2025/dataset.html>
* Opinion Extraction and Question Answering from CryptoCurrency-Related Tweets and Reddit posts (CryptOQA) <https://sites.google.com/view/cryptoqa-2025/>
* Research Highlight Generation from Scientific Papers (SciHigh) <https://sites.google.com/jadavpuruniversity.in/scihigh2025/home>
* Spoken-Query Cross-Lingual Information Retrieval for the Indic Languages (SqCLIR) <https://sites.google.com/view/sqclir-2025>
* Varanasi Tourism in Question Answer System (VATIKA) <https://sites.google.com/view/vatika-2025/>
* Word-Level Identification of Languages in Dravidian Languages (WILD) <https://www.codabench.org/competitions/7902/>

Research groups are invited to participate in the experiments. Please register directly with the track organizers.

FIRE 2025 is the 17th edition of the annual meeting of the Forum for Information Retrieval Evaluation (fire.irsi.org.in). Since its inception in 2008, FIRE has had a strong focus on shared tasks similar to those offered at evaluation forums like TREC, CLEF, and NTCIR. The shared tasks focus on solving specific problems in the area of information access and, more importantly, help in generating evaluation datasets for the research community.

Visit fire.irsi.org.in <http://fire.irsi.org.in>

Call for papers : Ethical LLMs @ RANLP2025
by Dola Mullage, Damith (Postgraduate Researcher) 17 Jun '25

Ethical LLMs 2025: The first Workshop on Ethical Concerns in Training, Evaluating and Deploying Large Language Models<https://sites.google.com/view/ethical-llms-2025> @ RANLP2025<https://ranlp.org/ranlp2025/> Call for papers: Scope Large Language Models (LLMs) represent a transformative leap in Artificial Intelligence (AI), delivering remarkable language-processing capabilities that are reshaping how we interact with technology in our daily lives. With their ability to perform tasks such as summarisation, translation, classification, and text generation, LLMs have demonstrated unparalleled versatility and power. Drawing from vast and diverse knowledge bases, these models hold the potential to revolutionise a wide range of fields, including education, media, law, psychology, and beyond. From assisting educators in creating personalised learning experiences to enabling legal professionals to draft documents or supporting mental health practitioners with preliminary assessments, the applications of LLMs are both expansive and profound. However, alongside their impressive strengths, LLMs also face significant limitations that raise critical ethical questions. Unlike humans, these models lack essential qualities such as emotional intelligence, contextual empathy, and nuanced ethical reasoning. While they can generate coherent and contextually relevant responses, they do not possess the ability to fully understand the emotional or moral implications of their outputs. This gap becomes particularly concerning when LLMs are deployed in sensitive domains where human values, cultural nuances, and ethical considerations are paramount. For example, biases embedded in training data can lead to unfair or discriminatory outcomes, while the absence of ethical reasoning may result in outputs that inadvertently harm individuals or communities. 
These limitations highlight the urgent need for robust research in Natural Language Processing (NLP) to address the ethical dimensions of LLMs. Advancements in NLP research are crucial for developing methods to detect and mitigate biases, enhance transparency in model decision-making, and incorporate ethical frameworks that align with human values. By prioritising ethics in NLP research, we can better understand the societal implications of LLMs and ensure their development and deployment are guided by principles of fairness, accountability, and respect for human dignity. This workshop will dive into these pressing issues, fostering a collaborative effort to shape the future of LLMs as tools that not only excel in technical performance but also uphold the highest ethical standards.

Submission Guidelines

We follow the RANLP 2025 standards for submission format and guidelines. EthicalLLMs 2025 invites the submission of long papers, up to eight pages in length, and short papers, up to six pages in length. These page limits only apply to the main body of the paper. At the end of the paper (after the conclusions but before the references), papers need to include a mandatory section discussing the limitations of the work and, optionally, a section discussing ethical considerations. Papers can include unlimited pages of references and an unlimited appendix.

To prepare your submission, please make sure to use the RANLP 2025 style files available here:

* LaTeX<https://ranlp.org/ranlp2025/wp-content/uploads/2025/05/ranlp2025-LaTeX.zip>
* Word<https://ranlp.org/ranlp2025/wp-content/uploads/2025/05/ranlp2025-word.docx>

Papers should be submitted through Softconf/START using the following link: https://softconf.com/ranlp25/EthicalLLMs2025/

Topics of interest

The workshop invites submissions on a broad range of topics related to the ethical development and evaluation of LLMs, including but not limited to the following.

1. Bias Detection and Mitigation in LLMs
Research focused on identifying, measuring, and reducing social, cultural, and algorithmic biases in large language models.

2. Ethical Frameworks for LLM Deployment
Approaches to integrating ethical principles (such as fairness, accountability, and transparency) into the development and use of LLMs.

3. LLMs in Sensitive Domains: Risks and Safeguards
Case studies or methodologies for deploying LLMs in high-stakes fields such as healthcare, law, and education, with an emphasis on ethical implications.

4. Explainability and Transparency in LLM Decision-Making
Techniques and tools for improving the interpretability of LLM outputs and understanding model reasoning.

5. Cultural and Contextual Understanding in NLP Systems
Strategies for enhancing LLMs' sensitivity to cultural, linguistic, and social nuances in global and multilingual contexts.

6. Human-in-the-Loop Approaches for Ethical Oversight
Collaborative models that involve human expertise in guiding, correcting, or auditing LLM behaviour to ensure responsible use.

7. Mental Health and Emotional AI: Limits of LLM Empathy
Discussions on the role of LLMs in mental health support, highlighting the boundary between assistive technology and the need for human empathy.

Organisers

Damith Premasiri – Lancaster University, UK
Tharindu Ranasinghe – Lancaster University, UK
Hansi Hettiarachchi – Lancaster University, UK

Contact

If you have any questions regarding the workshop, please contact Damith: d.dolamullage(a)lancaster.ac.uk

Survey on Queries in Syntactically Annotated Corpora
by Niklas Deworetzki 17 Jun '25

Dear all,

We are currently working on a project that aims to make querying syntactically annotated corpora easier and more accessible. For this purpose, we want to know what researchers are actually searching for. If you have a minute to spare, please fill out this form: https://forms.office.com/e/a8DgETSabB

Feel free to reach out to ekavol(a)chalmers.se or nikdew(a)chalmers.se if you have any further questions.

Best regards
Niklas Deworetzki & Katja Voloshina
PhD Students
Department of Computer Science and Engineering
Chalmers University of Technology | University of Gothenburg
SE-412 96 Göteborg, Sweden
www.gu.se<http://www.gu.se/> www.chalmers.se<http://www.chalmers.se/>

CLEF 2025 – Registration Open
by JORGE AMANDO CARRILLO DE ALBORNOZ CUADRADO 17 Jun '25

CLEF 2025 – Registration Open

Conference and Labs of the Evaluation Forum

We are pleased to announce CLEF 2025, taking place 9–12 September 2025 in Madrid, Spain, at UNED. This peer-reviewed conference and its associated labs foster research in multilingual, multimodal, and cross-language information access: https://clef2025.clef-initiative.eu/.

Register now – Early-bird registration is open!

Registration opened earlier this year, and early-bird rates are currently available.

Why attend?

* Present and discuss original research at the main conference.
* Engage in innovative labs and challenges, including LifeCLEF, ImageCLEF, EXIST, eRisk, CheckThat!, and more: https://clef2025.clef-initiative.eu/index.php?page=Pages/labs.html.
* Benefit from rich networking with academic and industry experts in IR, NLP, multimedia retrieval, and evaluation sciences.

For details on conference and lab registration, registration deadlines, and pricing, please visit the official site: https://clef2025.clef-initiative.eu/index.php?page=Pages/registrationConfer…

Important Dates

* Early-bird registration: ongoing
* Registration closes: 31 August 2025
* Conference & labs: 9–12 September 2025, Madrid, Spain

We look forward to welcoming participants from across the global community. See you this September in Madrid at CLEF 2025!

Jorge Carrillo-de-Albornoz
On behalf of the CLEF 2025 Organising Committee

Second call for papers: Special issue on Machine Translation for Low-Resource Languages@Language Resources and Evaluation Journal
by Atul K. Ojha 16 Jun '25

Apologies for cross-posting. --------------------------------------------------------------------------- *CALL FOR PAPERS: Language Resources and Evaluation Journal- Special Issue on Machine Translation for Low-Resource Languages* https://link.springer.com/collections/gbdgacbgbg *Guest Editors:* - Atul Kr. Ojha (Insight Research Ireland Centre for Data Analytics, DSI, University of Galway, Ireland) - Chao-Hong Liu (Industrial Technology Research Institute, Potamu Research Ltd.) - Ekaterina Vylomova (University of Melbourne, Australia) - Flammie Pirinen (UiT The Arctic University of Norway, Tromsø) - Jonathan Washington (Swarthmore College, USA) - Nathaniel Oco (De La Salle University, Philippines) - Xiaobing Zhao (Minzu University of China) Machine translation (MT) technologies have been improved significantly in the last decade using neural MT (NMT) approaches. However, most of these methods rely on the availability of large parallel data for training the MT systems, resources which are not available for the majority of language pairs. Hence, current technologies often fall short in their ability to be applied to low-resource languages. Developing MT technologies using relatively small corpora still presents a major challenge for the MT community. In addition, many methods for developing MT systems still rely on several natural language processing (NLP) tools to pre-process texts in source languages and post-process MT outputs in target languages. The performance of these tools often has a great impact on the quality of the resulting translation. The availability of MT technologies and NLP tools can facilitate equal access to information for the speakers of a language and determine on which side of the digital divide they will end up. The lack of these technologies for many of the world's languages provides opportunities both for the field to grow and for making tools available for speakers of low-resource languages. 
In the past few years, several workshops and evaluations have been organized to promote research on low-resource languages. NIST conducted Low Resource Human Language Technology evaluations (LoReHLT) annually from 2016 to 2019. In LoReHLT evaluations, there is no training data in the evaluation language. Participants receive training data in related languages but need to bootstrap systems in the surprise evaluation language at the start of the evaluation. Methods for this include pivoting approaches and taking advantage of linguistic universals. The evaluations are supported by DARPA's Low Resource Languages for Emergent Incidents (LORELEI) program, which seeks to advance technologies that are less dependent on large data resources and that can be quickly pivoted to new languages within a very short amount of time, so that information from any language can be extracted in a timely manner to provide situation awareness to emergent incidents.

There are also the Workshop on Technologies for MT of Low-Resource Languages (LoResMT), the Special Interest Group on Under-resourced Languages (SIGUL), the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI), the Workshop on Deep Learning Approaches for Low-Resource Natural Language Processing (DeepLo), AfricaNLP, TurkLang, the Conference on Machine Translation (WMT), and the International Conference on Spoken Language Translation (IWSLT), which provide venues for sharing research and development work in this field.

This topical collection solicits original research papers on MT systems/methods and related NLP tools for low-resource languages in general. LoReHLT, LORELEI, LoResMT, SIGUL, EURALI, DeepLo, WMT, and IWSLT participants are very welcome to submit their work to the special issue.
Summary papers on MT research for specific low-resource languages, as well as extended versions (>40% difference) of published papers from relevant conferences/workshops, are also welcome.

Topics of the special issue include, but are not limited to:

* Research and review papers on MT systems/methods for low-resource languages
* Research and review papers on pre-processing and/or post-processing NLP tools for MT
* Word tokenizers/de-tokenizers for low-resource languages
* Word/morpheme segmenters for low-resource languages
* Use of morphological analyzers and/or morpheme segmenters in MT
* Multilingual/cross-lingual NLP tools for MT
* Review of available corpora of low-resource languages for MT
* Pivot MT for low-resource languages
* Zero-shot MT for low-resource languages
* Fast building of MT systems for low-resource languages
* Re-usability of existing MT systems and/or NLP tools for low-resource languages
* Machine translation for language preservation
* Techniques that work across many languages and modalities
* Techniques that are less dependent on large data resources
* Use of language-universal resources
* Bootstrap-trained resources for the short development cycle
* Entity, relation- and event-extraction
* Sentiment detection in MT
* MT Summarisation
* Processing diverse languages, genres (news, social media, etc.) and modalities (text, speech, video, etc.)
* Speech Translation for low-resource languages
* Multimodal MT for low-resource languages
* MT models using LLMs for low-resource languages
* Generative AI models for low-resource languages
* Evaluation metrics and datasets for low-resource languages

For further information on this initiative, please refer to https://link.springer.com/collections/gbdgacbgbg

*IMPORTANT DATES*

* August 26, 2025: Paper submission deadline
* December 05, 2025: Revised papers due
* March 2026: Publication

*SUBMISSION GUIDELINES*

Authors should follow the "Instructions for Authors" <https://link.springer.com/journal/10579/submission-guidelines> (or Overleaf <https://link.springer.com/journal/10579/updates/17234296>) on the LRE journal website <https://link.springer.com/journal/10579>.

Thanks,

June 2025 Newsletter - LDC
by Penn LDC 16 Jun '25

In this newsletter:

LDC data and commercial technology development

New publications:
Chinese Sentence Pattern Structure Treebank<https://catalog.ldc.upenn.edu/LDC2025T06>
IWSLT 2022-2023 Shared Task Training, Development and Test Set<https://catalog.ldc.upenn.edu/LDC2025S05>
KAIROS Schema Learning Complex Event Annotation<https://catalog.ldc.upenn.edu/LDC2025T07>
________________________________

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a pre-requisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing<https://www.ldc.upenn.edu/data-management/using/licensing> page for further information.
________________________________

New publications:

Chinese Sentence Pattern Structure Treebank<https://catalog.ldc.upenn.edu/LDC2025T06> was developed at Beijing Normal University<https://english.bnu.edu.cn/> and Peking University<https://english.pku.edu.cn/>. It contains 5,016 sentences and 119,627 tokens syntactically annotated following the concept of sentence constituent analysis, which emphasizes sentence pattern structure. The source data consists of 27 chapters extracted from modern Mandarin and ancient Chinese works. There are three annotation layers: lexical sense and structural mode for dynamic words; syntactic structure for clauses; and inter-clause relations within complex sentences and sentence clusters. These structures can be visualized using the Jbw-viewer tool<https://github.com/bnucip/jbwviewer>, which is included in the release.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

IWSLT 2022-2023 Shared Task Training, Development and Test Set<https://catalog.ldc.upenn.edu/LDC2025S05> was developed by LDC and contains 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation. This material constitutes the training, development, and test data used in the International Conference on Spoken Language Translation (IWSLT) Dialectal Speech Translation task (2022)<https://iwslt.org/2022/dialect> and the Dialectal and Low-resource track (2023)<https://iwslt.org/2023/low-resource>.

The telephone speech was collected by LDC in 2016-2017 from native speakers of Tunisian Arabic in Tunis. Speakers were recruited to make telephone calls to people in their social networks from a variety of noise conditions and handsets. Transcripts are orthographic, following Buckwalter<https://catalog.ldc.upenn.edu/LDC2004L02> transliteration, and cover 175 hours of the collected speech. IPA transcripts were added to a subset of the data. All transcribed segments were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

KAIROS Schema Learning Complex Event Annotation<https://catalog.ldc.upenn.edu/LDC2025T07> was developed by LDC to support the DARPA KAIROS program. It contains English and Spanish text, audio, video, and image data labeled for 93 real-world complex events with event, relation, and argument annotations linking to document provenance. Source data was collected from the web; 3431 root web pages were collected and processed, yielding 1919 text data files, 24019 image files, 1472 video files, and 16 audio files.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions, and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large, multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc(a)ldc.upenn.edu<mailto:ldc@ldc.upenn.edu>
M: 3600 Market St. Suite 810 Philadelphia, PA 19104


CLIN35 -- Deadline extension
by Vincent Vandeghinste 16 Jun '25


Dear CLIN enthusiasts,

We are extending the submission deadline for CLIN abstracts by one week. The new, final deadline is June 20th. Below you can find the original call for abstracts with a modified date.

Website: https://clin35.ccl.kuleuven.be/

We invite submissions for CLIN35, the 35th edition of the Computational Linguistics in the Netherlands (CLIN) conference, which will take place in Leuven on September 12th, 2025.

Abstracts describing theoretical or applied research in any area of computational linguistics and natural language processing are welcome. We especially encourage submissions related to the Dutch language, but contributions on other languages and multilingual approaches are equally welcome. Abstracts must be written in English and should not exceed 500 words.

Submissions should include:
* Name and affiliation of each author
* Contact details
* Presentation title and short abstract (max. 500 words)
* Keywords
* Your presentation format preference (We will do our best to accommodate your preference but may need to make changes to provide a well-balanced program)

Abstracts must be submitted via the form on the website<https://clin35.ccl.kuleuven.be/call-for-abstracts> by Friday, 20th of June 2025. Notifications of acceptance will be sent out by Friday, 4th of July 2025.

Accepted abstracts will be presented at the conference as oral or poster presentations. Authors with accepted abstracts will also have the opportunity to submit a full paper after the conference for publication in the CLIN Journal<https://www.clinjournal.org/clinj/>.

Please share this call with your interested colleagues and network! For any questions you can reach us at this email address (clin35(a)kuleuven.be<mailto:clin35@kuleuven.be>).

We look forward to your submissions and to welcoming you to CLIN35!

CLIN35 local organizers


EVALITA 2026 - Second Call for Tasks (NEW DEADLINES and TIMELINE)
by Marco Antonio Stranisci 13 Jun '25


******************************************************
********* EVALITA 2026: Call for tasks *********
******* NEW DEADLINES and TIMELINE ******
******************************************************

EVALITA 2026 is an initiative of AILC (Associazione Italiana di Linguistica Computazionale, https://www.ai-lc.it/).

As in the previous editions (https://www.evalita.it/), EVALITA 2026 will be organized along a few selected tasks, which provide participants with opportunities to discuss and explore both emerging and traditional areas of Natural Language Processing and Speech. Participation is encouraged for teams working in both academic institutions and industrial organizations.

TASK PROPOSAL SUBMISSION

Task proposals should be no longer than 4 pages and should include:
- task title and acronym;
- names and affiliation of the organizers (minimum 2 organizers);
- brief task description, including motivations and state of the art;
- explanation of the international relevance of the task;
- description and examples of the data, including information about their availability, development stage, and issues concerning privacy and data sensitivity. The examples are mandatory because they are intended to give potential participants an idea of what the task data will look like, how it will be formatted, etc.;
- expected number of participants and attendees;
- names and contact information of the organizers.

We also accept the re-annotation/expansion of datasets from previous years and previous challenges with new annotation levels, and texts from publicly available corpora. However, test annotations must be new and unpublished, as participants must not have access to the test data annotations until the end of the EVALITA campaign.

For new tasks, organizers must specify in the proposal why the task would attract a reasonable number of participants and why it is needed. For re-runs, organizers must describe the element of novelty with respect to previous challenges.

In submitting your proposal, please bear in mind that we strongly encourage:
- tasks that pose non-trivial challenges and stimulate the creation of innovative systems (i.e., that integrate linguistic insights or external knowledge sources), rather than being easily addressed by off-the-shelf LLM prompting techniques;
- tasks focused on multimodality, e.g., considering both textual and visual or any other modality;
- tasks characterized by different levels of complexity, e.g., with a straightforward main subtask and one or more sophisticated additional subtasks;
- providing competitive baselines (e.g., small-scale LLMs in zero-shot setups), which participants are expected to improve upon, in order to encourage the design of advanced solutions;
- application-oriented tasks, that is, tasks that showcase a clearly defined end-user application;
- multilingual tasks, i.e. with data both in Italian and in other languages;
- industrial tasks, i.e. tasks with real data provided by companies.

The organizers of the accepted tasks should take care of planning, according to the scheduled deadlines (see below):
- the development and distribution of datasets needed for the contest, i.e. data for training and development, and data for testing; the scorer to be used to evaluate the submitted systems should be included in the release of development data;
- the development of task guidelines, where all the instructions for participation are made clear, together with a detailed description of the data and of the metrics applied to evaluate the participants' results;
- the collection of participants' results;
- the evaluation of participants' results according to standard metrics and baseline(s);
- the solicitation of participation and submissions;
- the reviewing process of the papers describing the participants' approach and results (according to the template to be made available by the EVALITA 2026 chairs);
- the production of a paper describing the task (according to the template to be made available by the EVALITA 2026 chairs).

*** Email your proposal in PDF format to evalitacampaign(a)gmail.com with "EVALITA 2026 TASK Proposal" as the subject line by the submission deadline: July 28th 2025. ***

Please feel free to contact the EVALITA 2026 chairs at evalitacampaign(a)gmail.com in case of any questions or suggestions.

Task proposal deadlines:
- July 28th 2025 (previously July 21st): submission of task proposals
- August 7th 2025 (previously July 31st): notification of task proposal acceptance

EVALITA 2026 timeline:
- 22nd September 2025: development data available to participants
- 3rd-17th November 2025: evaluation windows
- 28th November 2025: assessments returned to participants
- 15th December 2025: final reports (from participants) due to task organizers
- 22nd December 2025: final reports (from task organizers) due to EVALITA chairs
- 19th January 2026: review deadline
- 2nd February 2026: camera-ready version deadline
- 26th-27th February 2026: final workshop in Bari

EVALITA 2026 CHAIRS

Francesco Cutugno (Università di Napoli)
Alessio Miaschi (Istituto di Linguistica Computazionale “A. Zampolli” - CNR)
Alessio Palmero Aprosio (Università di Trento)
Giulia Rambelli (Università di Bologna)
Lucia Siciliani (Università di Bari)
Marco Antonio Stranisci (Università di Torino)

FURTHER INFORMATION

Website: https://www.evalita.it/campaigns/evalita-2026/call-for-tasks/
Mail: evalitacampaign(a)gmail.com

Marco, UNITO <https://www.unito.it/persone/mstranis> and aequa-tech <https://aequa-tech.com/>
