*** Last Call for Papers ***
The 16th IEEE International Conference on Knowledge Graphs (ICKG 2025)
November 13-14, 2025, 5* St. Raphael Resort and Marina, Limassol, Cyprus
https://cyprusconferences.org/ickg2025/
(*** Proceedings to be published by IEEE ***)
(*** Submission Deadline: July 4, 2025 AoE (extended and firm!) ***)
The annual IEEE International Conference on Knowledge Graph (ICKG) provides a premier
international forum for presenting original research results in knowledge discovery and
graph learning, discussing opportunities and challenges, and exchanging and disseminating
innovative, practical development experiences. The conference covers all
aspects of knowledge discovery from data, with a strong focus on graph learning and
knowledge graphs, including algorithms, software, and platforms. ICKG 2025 intends to draw
researchers and application developers from a wide range of areas such as knowledge
engineering, representation learning, big data analytics, statistics, machine learning, pattern
recognition, data mining, knowledge visualization, high performance computing, and the World
Wide Web. The conference promotes novel, high-quality research findings and innovative
solutions to the challenges of learning from data with dependency relationships.
All accepted papers will be published in the conference proceedings by the IEEE Computer
Society. Awards, including Best Paper, Best Paper Runner-up, Best Student Paper, and Best
Student Paper Runner-up, will be conferred at the conference, with a check and a certificate for
each award. The conference also features a survey track that accepts survey papers reviewing
recent studies in all aspects of knowledge discovery and graph learning. At least five
high-quality papers will be invited, in expanded and revised form, for a special issue of the
Knowledge and Information Systems Journal. In addition, at least eight quality papers will be
invited for a special issue of the Data Intelligence Journal, in expanded and revised form with
at least 30% new material.
TOPICS OF INTEREST
Topics of interest include, but are not limited to:
• Foundations, algorithms, models, and theory of knowledge discovery and graph learning
• Knowledge engineering with big data
• Machine learning, data mining, and statistical methods for data science and engineering
• Acquisition, representation, and evolution of fragmented knowledge
• Fragmented knowledge modeling and online learning
• Knowledge graphs and knowledge maps
• Graph learning security, privacy, fairness, and trust
• Interpretation, rule, and relationship discovery in graph learning
• Geospatial and temporal knowledge discovery and graph learning
• Ontologies and reasoning
• Topology and fusion on fragmented knowledge
• Visualization, personalization, and recommendation of knowledge graph navigation and interaction
• Knowledge graph systems and platforms, and their efficiency, scalability, and privacy
• Applications and services of knowledge discovery and graph learning in all domains, including the web, medicine, education, healthcare, and business
• Big knowledge systems and applications
• Crowdsourcing, deep learning, and edge computing for graph mining
• Large language models and applications
• Open source platforms and systems supporting knowledge and graph learning
• Datasets and benchmarks for graphs
• Neurosymbolic and hybrid AI systems
• Graph retrieval-augmented generation
SURVEY TRACK
The survey track accepts survey papers reviewing recent studies in key aspects of knowledge
discovery and graph learning.
In addition to the above topics, authors can also select and target the following Special Track
topics.
Each special track is handled by the respective special track chairs, and the papers are also
included in the conference proceedings.
• Special Track 01: KGC and Knowledge Graph Building
• Special Track 02: KR and KG Reasoning
• Special Track 03: KG and Large Language Model
• Special Track 04: GNN and Graph Learning
• Special Track 05: QA and Graph Database
• Special Track 06: KG and Multi-modal Learning
• Special Track 07: KG and Knowledge Fusion
• Special Track 08: Industry and Applications
SUBMISSION GUIDELINES
Paper submissions should be no longer than 8 pages, in the IEEE 2-column format, including
the bibliography and any possible appendices. Submissions longer than 8 pages will be
rejected without review. All submissions will be reviewed by the Program Committee based on
technical quality, originality, significance, and clarity. For survey track papers, please preface the
descriptive paper title with “Survey:”, followed by the actual paper title. For example, a paper
entitled “A Literature Review of Streaming Knowledge Graph” should be changed to “Survey: A
Literature Review of Streaming Knowledge Graph”. This helps the reviewers and chairs clearly
bid on and handle the papers. Once the paper is accepted, the “Survey:” prefix can be
removed from the camera-ready copy.
For special track papers, please preface the descriptive paper title with “SS##:”, where “##” is
the two-digit special track ID. For example, a paper entitled “Incremental Knowledge Graph
Learning” intended for Special Track 01 (KGC and Knowledge Graph Building) should be
changed to “SS01: Incremental Knowledge Graph Learning”.
All manuscripts are submitted as full papers and are reviewed based on their scientific merit.
The reviewing process is single-blind, meaning that each submission should list all authors and
affiliations. There is no separate abstract submission step. There are no separate industrial,
application, or poster tracks. Manuscripts must be submitted electronically via the online
submission system; email submissions will not be accepted. To help ensure correct formatting,
please use the style files for U.S. Letter as a template for your submission. Both LaTeX and
Word templates are available.
SUBMISSION LINK
https://wi-lab.com/cyberchair/2025/ickg25/
IMPORTANT DATES
• Paper submission (abstract and full paper): July 4, 2025 (AoE) (extended and firm!)
• Notification of acceptance/rejection: September 5, 2025
• Camera-ready, copyright forms and author registration: September 20, 2025
• Early (non-author) registration: October 10, 2025
• Conference dates: November 13-14, 2025
ORGANISATION
Conference and Local Organising Chair
• George A. Papadopoulos, University of Cyprus
Conference Co-Chair
• Dan Guo, Hefei University of Technology
Program Chairs
• Cesare Alippi, Università della Svizzera italiana
• Shirui Pan, Griffith University
Local Organising Vice Chair
• Irene Kinlanioti, National Technical University of Athens
Finance Chair
• Constantinos Pattichis, University of Cyprus
Steering Committee Chair
• Xindong Wu, Hefei University of Technology
*** NARNiHS 2026
*** North American Research Network in Historical Sociolinguistics
*** Eighth Annual Meeting
*** 100% IN PERSON
*** Co-Located with the Linguistic Society of America (LSA) Annual Meeting
*** New Orleans, Louisiana USA
*** 8-11 January 2026
This event offers an opportunity for historical sociolinguistics scholars from all over the world to gather and share leading research. We encourage our fellow historical sociolinguists and scholars in related fields from our global scholarly community to **join us in New Orleans** for our Eighth Annual Meeting.
Consult this Call for Abstracts on the web: https://narnihs.org/?page_id=3135 .
--------------- Call for Abstracts ---------------.
Abstract submission online:
https://easyabs.linguistlist.org/conference/NARNiHS_26/ .
Deadline: Friday, 15 August 2025, 11:59 PM US Eastern Time.
Late abstracts will not be considered.
The North American Research Network in Historical Sociolinguistics (NARNiHS) is accepting abstracts for its Eighth Annual Meeting in New Orleans, Thursday, January 8 -- Sunday, January 11, 2026. The 8th edition of this inclusive NARNiHS event seeks to provide a collaborative environment where presenters bring fully developed work for presentation and enrichment. We see the NARNiHS Annual Meeting as a place for showcasing excellent projects in historical sociolinguistics, seeking feedback from peers, and engaging in productive development of the field’s enduring questions.
NARNiHS welcomes papers in all areas of historical sociolinguistics, which is understood as the application and/or development of sociolinguistic theories, methods, and models for the study of historical language variation and change over time, or more broadly, the study of the interaction of language and society in historical periods and from historical perspectives. Thus, a wide range of linguistic areas, subdisciplines, methodologies, and adjacent disciplines easily find their place within historical sociolinguistics, and we encourage submission of abstracts that reflect this broad scope.
Abstracts will be accepted for both 20-minute papers and posters. Please note that, at the NARNiHS annual meeting, poster presentations are an integral part of the conference (not second-tier presentations). Abstracts will be assigned a paper or a poster presentation based on determinations in the review process about the most effective format for the submission. However, if you prefer that your submission be considered primarily for poster presentation, please specify this in your abstract.
Successful abstracts will demonstrate *thorough grounding* in historical sociolinguistics, *scientific rigor* in the formulation of research questions, and promise for rich discussion of ideas. Successful abstracts will be explicit about which *theoretical frameworks*, *methodological protocols*, and *analytical strategies* are being applied or critiqued. *Data sources and examples* should be sufficiently presented, so as to allow reviewers a full understanding of the scope and claims of the research. Please note that the *connection of your research to the field of historical sociolinguistics* should be explicitly outlined in your abstract. Failure to adhere to these criteria will likely result in rejection.
*** Abstract Format Guidelines***.
- Abstracts must be submitted in PDF format.
- Abstracts must fit on one 8.5x11 inch page, with margins no smaller than 1 inch and a font style and size no smaller than Times New Roman 12 point. You are encouraged to use the entire page, providing a full and robust description of the research. All additional supporting content (visualizations, trees, tables, figures, captions, examples, and references) must fit on a single (1) additional page. No exceptions to these requirements are allowed; abstracts longer than one page or with more than one additional page of supporting content will be rejected without review.
- Specify if you prefer your submission be considered primarily for a poster presentation.
- Anonymize your abstract. We realize that sometimes complete anonymity is not attainable, but there is a difference between the nature of the research creating an inability to anonymize and careless non-anonymizing (in citations, references, file names, etc.). Be sure to anonymize your PDF file (you may do so in Adobe Acrobat Reader by clicking on "File", then "Properties", removing your name if it appears in the "Author" line of the "Description" tab, and re-saving the file before submission). Do not use your name when saving your PDF (e.g. Smith_Abstract.pdf); file names will not be automatically anonymized by the EasyAbs system. Rather, use non-identifying information in your file name (e.g. HistSoc4Lyfe.pdf). Your name should only appear in the online form accompanying your abstract submission. Papers that are not sufficiently anonymized wherever possible will be rejected without review.
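For authors who prefer a scripted alternative to editing the document properties by hand, the following minimal sketch (assuming the third-party pypdf Python library; the file names are only examples) copies the abstract into a new PDF with blank identifying metadata and saves it under a non-identifying name:

    from pypdf import PdfReader, PdfWriter

    # Copy the pages of the abstract into a fresh PDF object.
    reader = PdfReader("abstract_draft.pdf")   # original file (example name)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    # Set identifying metadata fields to blank values.
    writer.add_metadata({"/Author": "", "/Creator": ""})

    # Save under a non-identifying file name, as the guidelines request.
    with open("HistSoc4Lyfe.pdf", "wb") as f:
        writer.write(f)

Whichever tool you use, re-open the resulting file and check its properties before uploading.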
*** General Requirements ***.
- Abstracts must be submitted electronically using the following link: https://easyabs.linguistlist.org/conference/NARNiHS_26/ .
- Authors may submit a maximum of two abstracts: One single-author abstract and one co-authored abstract.
- Authors may not submit identical abstracts for presentation at the NARNiHS annual meeting and the LSA annual meeting or another LSA sister society meeting (ADS, ANS, NAHoLS, SCiL, SPCL, or SSILA).
- After submission, no changes of author, title, or wording of the abstract may occur. If your abstract is accepted, adjustment of typographical errors is permitted before a final version of the abstract is printed in the conference booklet.
- Papers and posters must be delivered as projected in the abstract or represent bona fide developments of the same research.
- Authors are expected to attend the conference in-person and present their own papers and posters. This will not be a hybrid event.
Contact us at NARNiHistSoc(a)gmail.com with any questions.
We invite you to submit your ongoing, published or pre-reviewed works to our workshop on Large Language Models for Cross-Temporal Research (XTempLLMs) at COLM 2025.
Our workshop website is available at https://xtempllms.github.io/2025/
*The deadline for submission has been extended to June 30, 2025 AOE*
Workshop Description:
Large language models (LLMs) have been used for a variety of time-sensitive applications such as temporal reasoning, forecasting and planning. In addition, there has been a growing number of interdisciplinary works that use LLMs for cross-temporal research in several domains, including social science, psychology, cognitive science, environmental science and clinical studies. However, LLMs are hindered in their understanding of time for many reasons, including temporal biases and knowledge conflicts in pretraining and RAG data, as well as a fundamental limitation of LLM tokenization that fragments a date into several meaningless subtokens (illustrated briefly below). Such an inadequate understanding of time can lead to inaccurate reasoning, forecasting and planning, and to time-sensitive findings that are potentially misleading.
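As a quick, concrete illustration of the tokenization issue, here is a sketch assuming the Hugging Face transformers library and the publicly available GPT-2 tokenizer (used only as an example; the exact subtokens vary by model):

    from transformers import AutoTokenizer

    # Load a standard BPE tokenizer (GPT-2 serves only as an example).
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # An ISO-formatted date is split into several subtokens (year, hyphens,
    # month, day, or even finer pieces), none of which carries temporal
    # meaning on its own.
    print(tokenizer.tokenize("The report was published on 2025-06-30."))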
Our workshop welcomes (i) cross-temporal work in the NLP community and (ii) interdisciplinary work that relies on LLMs for cross-temporal studies.
Cross-temporal work in the NLP community:
* Novel benchmarks for evaluating the temporal abilities of LLMs across diverse date and time formats, culturally grounded time systems, and generalization to future contexts;
* Novel methods (e.g., neuro-symbolic approaches) for developing temporally robust, unbiased, and reliable LLMs;
* Data analysis such as the distribution of pretraining data over time and conflicting knowledge in pretraining and RAG data;
* Interpretability regarding how temporal information is processed from tokenization to embedding across different layers, and finally to model output;
* Temporal applications such as reasoning, forecasting and planning;
* Consideration of cross-lingual and cross-cultural perspectives for linguistic and cultural inclusion over time.
Interdisciplinary work that relies on LLMs for cross-temporal studies:
* Time-sensitive discoveries, such as social biases over time and personality testing over time;
* Assessment of time-sensitive discoveries to identify misleading findings if any;
* Interdisciplinary evaluation benchmarks for LLMs’ temporal abilities, e.g., psychological time perception and episodic memory evaluation.
Submission Modes:
* Standard submissions: We invite the submission of papers that will receive up to three double-blind reviews from the XTempLLMs committee, and a final decision of acceptance from the workshop chairs.
* Pre-reviewed submissions: We invite unpublished papers that have already been reviewed either through ACL ARR, or recent AACL/EACL/ACL/EMNLP/COLING venues. These papers will not receive new reviews but will be judged together with their reviews via a meta-review from the workshop chairs.
* Published papers: We invite papers that have been published recently elsewhere to present at XTempLLMs. Please send the details of your paper (Paper title, authors, publication venue, abstract, and a link to download the paper) directly to xtempllms(a)gmail.com. This allows such papers to gain more visibility from the workshop audience.
All deadlines are 11:59 pm UTC-12 (“Anywhere on Earth”):
* June 30, 2025: Submission deadline (standard and published papers)
* July 18, 2025: Submission deadline for papers with ARR reviews
* July 24, 2025: Notification of acceptance
* October 10, 2025: Workshop day
Invited Speakers:
* Jose Camacho Collados, Cardiff University, United Kingdom
* Ali Emami, Brock University, Canada
* Alexis Huet, Huawei Technologies, France
* Bahare Fatemi, Google Research, Canada
* Vivek Gupta, Arizona State University, United States
Organizing Committee:
* Wei Zhao, University of Aberdeen, United Kingdom
* Maxime Peyrard, Université Grenoble Alpes & CNRS, France
* Katja Markert, Heidelberg University, Germany
[Apologies for cross-postings]
FIRST CALL FOR PAPERS
LREC 2026
Organised by the ELRA Language Resources Association
Palma, Mallorca, Spain
11-16 May 2026
The Fifteenth biennial Language Resources and Evaluation Conference
(LREC) will be held at the Palau de Congressos de Palma in Palma,
Mallorca, Spain, on 11-16 May 2026. LREC serves as the primary forum for
presentations describing the development, dissemination, and use of
language resources involving both traditional and recently developed
approaches.
The scientific program will include invited talks, oral presentations,
and poster and demo presentations, as well as a keynote address by the
winner of the Antonio Zampolli Prize. Submissions describing all aspects
of language resource development and use are invited, including, but not
limited to, the following:
Language Resource Development
- Methods and tools for mono- and multi-lingual language resource development and annotation
- Knowledge discovery/representation (knowledge graphs, linked data, terminologies, lexicons, ontologies, etc.)
- Resource development for less-resourced/endangered languages
- Guidelines, standards, best practices, and models for interoperability
Language Resource Use
- Use of language resources in systems and applications for any area of language and speech processing
- Use of language resources in assistive technologies, support for accessibility
- Efficient/low-resource methods for language and speech processing
Evaluation
- Methodologies and protocols for evaluation and benchmarking of language technologies
- Measures for validation of language resources and quality assurance
- Usability of user interfaces and dialogue systems
- Bias, safety, and user satisfaction metrics
- Interpretability/explainability of language models and language and speech processing tools
Language Resources and Large Language Models
- Language resource development for LLMs (monolingual, multilingual, multimodal)
- (Semi-)automatic generation of training data
- Training, fine-tuning, adaptation, alignment, and representation learning
- Guardrails, filters, and modules for generative AI models
Policy and Organizational Considerations
- International and national activities, projects, initiatives, and policies
- Language coverage and diversity
- Replicability and reproducibility
- Organisational, economic, ethical, climate, and legal issues
Separate calls will be issued for Workshops, Tutorials and Industry Track.
Submission
Submissions should be 4 to 8 pages in length (excluding references) and
follow the LREC stylesheet, which will soon be available on the
conference website.
At the time of submission, authors are offered the opportunity to share
related language resources with the community. All repository entries
are linked to the LRE Map [https://lremap.elra.info/], which provides
metadata for the resource.
Accepted papers will appear in the conference proceedings, which include
both oral and poster papers in the same format. Determination of the
presentation format (oral vs. poster) is based solely on an assessment
of the optimal method of communication (more or less interactive), given
the paper content.
Important dates
(All deadlines are 11:59 PM UTC-12:00, “anywhere on Earth”)
Oral and poster (or poster+demo) paper submission: 17 October 2025
Notification of acceptance: 13 February 2026
Camera Ready due: 6 March 2026
Workshop and tutorial proposals submission: 17 October 2025
LREC 2026 conference: 11-16 May 2026
More information on LREC 2026: https://lrec2026.info/
Contact: info(a)lrec2026.info
The First Workshop on Optimal Reliance and Accountability in Interactions with Generative Language Models (*ORIGen*) will be held in conjunction with the Second Conference on Language Modeling (COLM) at the Palais des Congrès in Montreal, Quebec, Canada, on October 10, 2025!
*The deadline for submission has been extended to June 27, 2025, Anywhere on Earth.*
With the rapid integration of generative AI, exemplified by large language models (LLMs), into personal, educational, business, and even governmental workflows, such systems are increasingly being treated as “collaborators” with humans. In such scenarios, underreliance on or avoidance of AI assistance may forfeit the potential speed, efficiency, or scalability advantages of a human-LLM team; at the same time, there is a risk that subject matter non-experts may overrely on LLMs and trust their outputs uncritically, with consequences ranging from the inconvenient to the catastrophic. Therefore, establishing optimal levels of reliance within an interactive framework is a critical open challenge as language models and related AI technologies rapidly advance.
* What factors influence overreliance on LLMs?
* How can the consequences of overreliance be predicted and guarded against?
* What verifiable methods can be used to apportion accountability for the outcomes of human-LLM interactions?
* What methods can be used to imbue such interactions with appropriate levels of “friction” to ensure that humans think through the decisions they make with LLMs in the loop?
The ORIGen workshop provides a new venue to address these questions and more through a multidisciplinary lens. We seek to bring together broad perspectives from AI, NLP, HCI, cognitive science, psychology, and education to highlight the importance of mediating human-LLM interactions to mitigate overreliance and promote accountability in collaborative human-AI decision-making.
Submissions are due *June 27, 2025*. Please see our call for papers [1] for more!
[1] https://origen-workshop.github.io/submissions/
Organizers:
- Nikhil Krishnaswamy, Colorado State University
- James Pustejovsky, Brandeis University
- Dilek Hakkani-Tür, University of Illinois Urbana Champaign
- Vasanth Sarathy, Tufts University
- Tejas Srinivasan, University of Southern California
- Mariah Bradford, Colorado State University
- Timothy Obiso, Brandeis University
- Mert Inan, Northeastern University
Dear colleagues,
EUSKORPORA, a newly created Linguistic Data Center for Basque digital technologies based in San Sebastián (Donostia), Spain, is seeking candidates for two key roles in its Technology area:
1) Senior AI and Language Technologies Specialist
2) Junior AI and Language Technologies Specialist
Both positions are part of the Center's mission to position the Basque language in the global digital space through open-source development and cutting-edge research.
=== SENIOR AI AND LANGUAGE TECHNOLOGIES SPECIALIST ===
EUSKORPORA, the Linguistic Data Center for Basque Digital Technologies, a new association based in Donostia/San Sebastián, is seeking an experienced senior expert in AI technologies applied to natural language processing to lead key tasks related to language technologies for the Basque language.
The selected person will be part of an interdisciplinary team and will participate in projects involving the collection, analysis, and annotation of linguistic data, as well as the development of open-source foundational language models (ASR, TTS, MT, NLP) oriented to Basque, in a research and development context closely connected to industry.
Responsibilities:
- Supervise and optimize processes for linguistic corpus collection, annotation, and management
- Lead the design and development of foundational language models applied to Basque (speech recognition, synthesis, translation, text processing, etc.)
- Contribute to the technological architecture of the Center
- Coordinate internal and external teams and mentor junior staff
- Identify innovation opportunities and contribute to proposals, reports, and dissemination
- Establish strategic relationships with ecosystem stakeholders
Requirements:
- Advanced degree (Master or PhD) in Computational Linguistics, NLP, AI, Computer Engineering, Data Science or related fields
- Minimum 5 years of experience in language or speech technologies
- Proven experience with ASR, TTS, MT, or NLP models
- Strong programming skills in Python and familiarity with frameworks such as Hugging Face, PyTorch, TensorFlow, spaCy, Kaldi, ESPnet, Fairseq
- Knowledge of MLOps, Git, and data science best practices
- Familiarity with open repositories and licensing
Languages:
- Basque: desirable, intermediate level (B2 or higher)
- Spanish: fluent
- English: high level (especially technical)
We offer:
- Participation in strategic national and international projects
- Competitive salary according to experience
- Interdisciplinary environment and opportunities for professional growth
=== JUNIOR AI AND LANGUAGE TECHNOLOGIES SPECIALIST ===
EUSKORPORA, the Linguistic Data Center for Basque Digital Technologies, a new association based in Donostia/San Sebastián, is seeking young professionals at the beginning of their careers to support key tasks related to the creation of linguistic resources and language technologies for the Basque language.
Selected individuals will join an interdisciplinary team and participate in projects involving the collection, annotation, and analysis of linguistic data, as well as the development of open-source foundational language models (ASR, TTS, MT, NLP) oriented to Basque, in a research and development context closely connected to industry.
Responsibilities:
- Support the collection, cleaning and annotation of linguistic corpora (text and audio)
- Assist in the training and evaluation of language and speech models
- Collaborate in the documentation and maintenance of language resources
- Contribute to the integration of open-source NLP tools and libraries
- Assist in reports and dissemination activities
- Work in coordination with technical, linguistic and project management profiles
Requirements:
- Degree or Master in Computational Linguistics, Computer Engineering, Data Science, or similar
- Basic knowledge of NLP, language models, or speech technologies
- Python programming (basic/intermediate level)
- Familiarity with linguistic annotation or text processing tools
- Experience with Git and frameworks like Hugging Face or spaCy is a plus
Languages:
- Basque: high level (B2 or higher)
- Spanish: fluent
- English: high level (B2 or higher)
We offer:
- Dynamic and innovative environment based in San Sebastián
- Continuous training in cutting-edge technologies
- Real opportunities for growth within the team
- Competitive salary according to training and experience
For further information or to apply, please contact:
info(a)euskorpora.eus
Best regards,
EUSKORPORA
Euskorpora
https://www.euskorpora.eus/
info(a)euskorpora.eus
+(34) 611 02 81 72
We are pleased to invite submissions for the first Interdisciplinary
Workshop on Observations of Misunderstood, Misguided and Malicious Use of
Language Models (OMMM 2025). The workshop will be held with the RANLP 2025
conference in Varna, Bulgaria, on 11-13 September 2025.
Overview
The use of Large Language Models (LLMs) pervades scientific practices in
multiple disciplines beyond the NLP/AI communities. Alongside benefits for
productivity and discovery, widespread use often entails misuse due to
misalignment of values, lack of knowledge, or, more rarely, malice. LLM
misuse has the potential to cause real harm in a variety of settings.
Through this workshop, we aim to gather researchers interested in
identifying and mitigating inappropriate and harmful uses of LLMs. These
include misunderstood usage (e.g., misrepresentation of LLMs in the
scientific literature); misguided usage (e.g., deployment of LLMs without
adequate training or privacy safeguards); and malicious usage (e.g.,
generation of misinformation and plagiarism). Sample topics are listed
below, but we welcome submissions on any domain related to the scope of the
workshop.
Important Dates
Submission deadline *[NEW]*: *15 July 2025*, at 23:59 Anywhere on Earth
Notification of acceptance: 01 August 2025
Camera-ready papers due: 30 August 2025
Workshop dates: September 11, 12, or 13, 2025
Submission Guidelines
Submissions will be accepted as short papers (4 pages) and as long papers
(8 pages), plus additional pages for references. All submissions undergo a
double-blind review, so they should not include any identifying
information. Submissions should conform to the RANLP guidelines; for
further information and templates, please see
https://ranlp.org/ranlp2025/index.php/submissions/
We welcome submissions from diverse disciplines, including NLP and AI,
psychology, HCI, and philosophy. We particularly encourage reports on
negative results that provide interesting perspectives on relevant topics.
In-person presenters will be prioritised when selecting submissions to be
presented at the workshop, but the workshop will take place in a hybrid
format. Accepted papers will be included in the workshop proceedings in the
ACL Anthology.
Papers should be submitted on the RANLP conference system at
https://softconf.com/ranlp25/OMMM2025/
Keynote Speaker
We are excited to have Dr. Stefania Druga as the keynote speaker for the
inaugural OMMM workshop. Dr. Druga is a Research Scientist at Google
DeepMind, where she designs novel multimodal AI applications.
Topics of Interest
We welcome paper submissions on all topics related to inappropriate and
harmful uses of LLMs, including but not limited to:
- Misunderstood use (and how to improve understanding):
  - Misrepresentation of LLMs (e.g., anthropomorphic language)
  - Attribution of consciousness
  - Interpretability
  - Overreliance on LLMs
- Misguided use (and how to find alternatives):
  - Underperformance and inappropriate applications
  - Structural limitations and ethical considerations
  - Deployment without proper training or safeguards
- Malicious use (and how to mitigate it):
  - Adversarial attacks, jailbreaking
  - Detection and watermarking of machine-generated content
  - Generation of misinformation or plagiarism
  - Bias mitigation and trust design
For more information, please refer to the workshop website:
https://ommm-workshop.github.io/2025/. For any questions, please contact
the organisers at ommm-workshop(a)googlegroups.com.
The organisers,
Piotr Przybyła, Universitat Pompeu Fabra
Matthew Shardlow, Manchester Metropolitan University
Clara Colombatto, University of Waterloo
Nanna Inie, IT University of Copenhagen
[Apologies for cross-posting]
Terminology Translation Task at WMT2025 - Call for Participation
We are excited to announce the third Shared Task on Terminology Translation<https://www2.statmt.org/wmt25/terminology.html>, which will be run as part of the 10th Conference on Machine Translation (WMT2025) in Suzhou, China.
TL;DR:
- We test sentence-level and document-level translation of texts in the finance and IT domains, given explicit terminology.
- The language pairs are: English -> {Spanish, German, Russian, Chinese}, Chinese -> English.
- We evaluate the overall translation quality, the terminology success rate, and the terminology consistency. Additionally, we compare the performance of systems given no terms, proper terminology, and random terms.
- The task starts on 20th June 2025 AOE, the submission deadline is 20th July 2025 AOE.
- Please pre-register via Google Forms here: https://forms.gle/ZSn2pNJkQJAzHFnA6 .
OVERVIEW
Advances in neural MT and LLM-assisted translation over the last decade have brought nearly human quality in general-domain translation, at least for high-resource languages. However, when it comes to specialized domains like science, finance, or legal texts, where the correct and consistent use of special terms is crucial, the task is far from solved. The Terminology Shared Task aims to assess the extent to which machine translation models can utilize additional information regarding the translation of terminology. Compared to the two previous editions, in 2021 and 2023, the new test data have more varied test cases, are more consistent in domain for each translation direction, and are broader in language coverage.
TASK DESCRIPTION
Track №1: Sentence/Paragraph-Level Translation
You will be provided with a sequence of input sentences or paragraphs, together with small terminology dictionaries that correspond only to the terms present in the given sentence.
Language Pairs:
* en-de (English → German)
* en-ru (English → Russian)
* en-es (English → Spanish)
Domain: information technology
Track №2: Document-Level Translation
The setup is similar to Track №1, with two exceptions: the input texts are now full documents, and the dictionaries correspond to the whole set of input texts (i.e. they are corpus-level). This brings the task closer to the real-life setup (where dictionaries exist independently of the texts), though it may complicate the implementation (solutions that need to store the whole dictionary will require more memory). Additionally, in the document-level setup, the consistent usage of terms becomes more important.
Language Pairs:
en-zh-Hant (English → Traditional Chinese)
zh-Hant-en (Traditional Chinese → English)
Domain: finance
EVALUATION
Terminology Modes:
You are expected to compare your system’s performance under three modes:
1. No terminology: the system is only provided with input sentences/documents.
2. Proper terminology: the system is provided with input texts (same as 1.) and dictionaries of the format {source_term: target_term}.
3. Random terminology: the system is provided with input texts and translation dictionaries of the same format as in 2. The difference is that the dictionary items are not special terms but words randomly drawn from the input texts. This mode is of special interest because we want to measure to what extent proper term translations (mode 2) improve system performance, as opposed to arbitrary additional input that does not contain the domain-specific terminology.
Metrics:
1. Overall Translation Quality: we will evaluate the general aspects of machine translation outputs such as fluency, adequacy and grammaticality. We will do that with the general MT automatic metrics such as BLEU or COMET. In addition to that, we will pay special attention to the grammaticality of the translated terms.
2. Terminology Success Rate: this metric assesses the ability of the system to accurately translate technical terms given the specialized vocabulary. It will be computed by comparing the correct term translations (i.e. the ones present in the dictionary) with the terms occurring in the output. A higher success rate indicates closer adherence to the dictionary translations.
3. Terminology Consistency: for domains such as science or legal texts, the consistent use of an introduced term throughout the text is crucial. In other words, we want a system to not only pick up a correct term in a target language but to use it consistently once it is chosen. This will be evaluated by comparing all translations of a given source term in a text and measuring the percentage of deviations from the most consistent translation. This metric is more important for the Document-Level track, but it will be used for both tracks.
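As a rough illustration of metrics 2 and 3, here is a sketch in Python. The official scoring scripts may differ; the matching rule (case-insensitive substring counting), the helper names, and the toy data are all illustrative assumptions:

    from collections import Counter

    def term_success_rate(outputs, term_dicts):
        # Fraction of dictionary entries whose target term appears in the
        # corresponding output segment (illustrative matching rule).
        hits = total = 0
        for out, terms in zip(outputs, term_dicts):
            for src, tgt in terms.items():
                total += 1
                if tgt.lower() in out.lower():
                    hits += 1
        return hits / total if total else 0.0

    def term_consistency(outputs, candidate_translations):
        # Share of occurrences of a source term that use its most frequent
        # translation across a document (1.0 = perfectly consistent).
        counts = Counter()
        for out in outputs:
            for cand in candidate_translations:
                counts[cand] += out.lower().count(cand.lower())
        total = sum(counts.values())
        return counts.most_common(1)[0][1] / total if total else 1.0

    # Toy example (hypothetical data): the dictionary maps "cache" -> "Cache",
    # but the second segment uses "Zwischenspeicher" instead.
    outputs = ["Der Cache wurde geleert.", "Der Zwischenspeicher ist voll."]
    dicts = [{"cache": "Cache"}, {"cache": "Cache"}]
    print(term_success_rate(outputs, dicts))                          # 0.5
    print(term_consistency(outputs, ["Cache", "Zwischenspeicher"]))   # 0.5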
IMPORTANT DATES
All deadlines are end of day, Anywhere on Earth (AoE).
Data snippets released: 7th May 2025
Dev data released: 22nd May 2025
Test data release, task starts: 20th June 2025 (postponed)
Submission deadline: 20th July 2025 (postponed)
Paper submission to WMT25: in-line with WMT25
Camera-ready submission to WMT25: in-line with WMT25
Conference in Suzhou, China: 05-09 November 2025
SUBMISSION GUIDELINES
0. Please notify us about your participation prior to submission. This is optional, but it will help us estimate our reviewing workload. Please do so through this Google Form: https://forms.gle/ZSn2pNJkQJAzHFnA6
1. Check your submission files with the validation script. It will be published together with the test data.
2. Write a description of your system (optional).
3. Submit your system via Google Forms. The Google Form with all necessary submission details will be published when the test set is released.
All details on submission as well as FAQ can be found at the webpage of the shared task.
ORGANIZERS
* Kirill Semenov (University of Zurich), main contact: FirstNаmе [dоt] LаstNаmе {аt} uzh /dоt/ ch
* Nathaniel Berger (Heidelberg University)
* Pinzhen Chen (University of Edinburgh & Aveni.ai)
* Xu Huang (Nanjing University)
* Arturo Oncevay (JP Morgan)
* Dawei Zhu (Amazon)
* Vilém Zouhar (ETH Zurich)
WEBSITE: https://www2.statmt.org/wmt25/terminology.html
In case of query, please send an email to Kirill Semenov (see email above).
Call for papers: The First Workshop on Natural Language Processing and Language Models for Digital Humanities
(LM4DH_2025) @ RANLP_2025
Date: 11th to 13th September 2025 (TBC)
Venue : Varna, Bulgaria
Website: https://www.clarin.eu/event/2025/clarin-workshop-ranlp-2025
Submissions Portal: https://softconf.com/ranlp25/LM4DH2025/
Digital Humanities has emerged as an interdisciplinary field of research at the intersection of computer science and many other fields such as linguistics, the social sciences, history, and psychology. With the development of Large Language Models (LLMs), state-of-the-art Natural Language Processing (NLP) tasks such as entity recognition, sentiment analysis, and text summarisation have been significantly enhanced, offering powerful tools for analysing and interpreting complex historical, cultural, and social data, including oral histories, archival documents, and literary texts. These advances enable researchers to identify patterns, extract meaningful relationships, and generate interpretations at unprecedented scale and precision.
This workshop aims to provide a common platform for researchers, practitioners, and students from diverse disciplines to collaboratively explore and apply AI-driven techniques in the Digital Humanities. Through interdisciplinary discussion, the event aims to generate creative approaches, exchange best practices, and create a community committed to furthering AI-based research on human culture and history. The focus of the workshop is on applying natural language processing techniques to digital humanities research. The topics can be anything of digital humanities interest with a natural language processing or LLM-based application. We expect contributions related (but not limited) to the following topics:
* Text analysis and processing related to the humanities using computational methods
* Usage of the interpretability of large language models' output for DH-related tasks
* Dataset creation and curation for NLP (e.g. digitisation, datafication, and data preservation)
* Automatic error detection, correction, and normalisation of textual data
* Generation and analysis of literary works such as poetry and novels
* Analysis and detection of text genres
* Emotion analysis for the humanities and literature
* Modelling of information and knowledge in the Humanities, Social Sciences, and Cultural Heritage
* Low-resource and historical language processing
* Search for scientific and/or scholarly literature
* Profiling and authorship attribution
Submission & Publication
All papers must represent original and unpublished work that is not currently under review. Papers will be evaluated according to their significance, originality, technical content, style, clarity, and relevance to the workshop.
Submissions must follow the RANLP 2025 submission guidelines<https://ranlp.org/ranlp2025/index.php/submissions/>, using ACL-style templates (LaTeX or MS Word).
Papers must be submitted using SoftConf at https://softconf.com/ranlp25/LM4DH2025/
All papers will be double-blind peer reviewed. Authors of accepted papers will present their work in either the oral or the poster session. All accepted papers will appear in the workshop proceedings, which will be published in the ACL Anthology.
Important Dates
* Paper submission deadline: 20th July 2025
* Notification of acceptance: 2nd August 2025
* Camera-ready paper: 20th August 2025
* Workshop date: 11th September 2025
Organising Committee
* Isuri Anuradha, Lancaster University, UK
* Francesca Frontini, CNR-ILC, Italy & CLARIN ERIC
* Paul Rayson, Lancaster University, UK
* Ruslan Mitkov, Lancaster University, UK
* Deshan Sumanathilake, Swansea University, UK
This workshop has been organised with the generous support and coordination of CLARIN-EU.
Contact: dhranlp2(a)gmail.com
*Call for Participation in Tracks*
*FIRE 2025: 17th meeting of the Forum for Information Retrieval Evaluation*
Indian Institute of Technology (BHU) Varanasi
17th - 20th December 2025
Website: fire.irsi.org.in <http://fire.irsi.org.in/>
*Call for Participation in Tracks*
FIRE 2025 offers the following exciting tracks this year:
* Cross-Lingual Mathematical Information Retrieval (CLMIR) <https://clmir2025.github.io/>
* Code-Mixed Information Retrieval from Social Media Data (CMIR) <https://cmir-iitbhu.github.io/cmir/index.html>
* Hate Speech and Offensive Content Identification in Memes in Bengali, Hindi, Gujarati and Bodo (HASOC-meme) <https://hasocfire.github.io/hasoc/2025/>
* Information Retrieval in Software Engineering (IRSE) <https://sites.google.com/view/irse-2025/home>
* Misinformation Detection and Prompt Recovery (PROMID) <https://promid.github.io/index.html>
* Multilingual Story Illustration: Bridging Cultures through AI Artistry (MUSIA) <https://cse-iitbhu.github.io/MUSIA/index.html>
* Offensive Language Identification in Dravidian Languages (DravidianCodeMix) <https://dravidian-codemix.github.io/2025/dataset.html>
* Opinion Extraction and Question Answering from CryptoCurrency-Related Tweets and Reddit posts (CryptOQA) <https://sites.google.com/view/cryptoqa-2025/>
* Research Highlight Generation from Scientific Papers (SciHigh) <https://sites.google.com/jadavpuruniversity.in/scihigh2025/home>
* Spoken-Query Cross-Lingual Information Retrieval for the Indic Languages (SqCLIR) <https://sites.google.com/view/sqclir-2025>
* Varanasi Tourism in Question Answer System (VATIKA) <https://sites.google.com/view/vatika-2025/>
* Word-Level Identification of Languages in Dravidian Languages (WILD) <https://www.codabench.org/competitions/7902/>
Research groups are invited to participate in the experiments. Please
register directly with the organizers.
FIRE 2025 is the 17th edition of the annual meeting of the Forum for Information Retrieval
Evaluation (fire.irsi.org.in). Since its inception in 2008, FIRE has had a strong focus on shared
tasks similar to those offered at evaluation forums like TREC, CLEF, and NTCIR. The shared
tasks focus on solving specific problems in the area of information access and, more
importantly, help generate evaluation datasets for the research community.
Visit fire.irsi.org.in <http://fire.irsi.org.in>