- Corpora - ELRA lists

Edge Hill Corpus Research Group: Meeting #13
by Costas Gabrielatos 02 Nov '24

The next meeting of the Edge Hill Corpus Research Group will take place online (via MS Teams) on Friday 15 November 2024, 2-4 pm (GMT).

Topic: Discourse-Oriented Corpus Studies

2-3 pm
Katia Adimora (Edge Hill University)
Mexican immigration/immigrants in American and Mexican newspapers

3-4 pm
Dan Malone (Edge Hill University)
When is the extreme also typical? Using prototypicality to investigate representations of the lone-wolf terrorist

Attendance is free. The abstracts and registration link are here: https://sites.edgehill.ac.uk/crg/next
Registration closes on Wednesday 13 November, 11 am (GMT).

If you have any questions, please contact Costas Gabrielatos (gabrielc@edgehill.ac.uk).

Final CFP : Workshop on Challenges in Processing South Asian Languages (CHiPSAL @ COLING 2025)
by Velayuthan Menan 01 Nov '24

Hello Everyone,

Based on the emails received, we have extended the submission deadline to November 7, 2024. Please read below for more information on the workshop and the updated timeline.

----

Don’t miss this unique opportunity to discuss key issues and contribute to the advancement of language processing in the South Asian region, home to 25% of the world’s population and rich in linguistic and cultural diversity. Submit your papers by November 7, 2024, and join us at the first Workshop on Challenges in Processing South Asian Languages (CHiPSAL), taking place at COLING 2025 on January 19, 2025.

Please submit your papers via https://softconf.com/coling2025/CHiPSAL25

----

*CHiPSAL 2025*, the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL), will be held as part of the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, UAE, on *January 19, 2025*. The workshop will be conducted in *virtual mode*.

CHiPSAL 2025 invites the submission of original research papers, review/opinion papers, and system demonstration papers, in short or long forms, on topics that highlight the challenges related to South Asian languages, including but not limited to the following areas:

- Encoding and Unicode Issues in South Asian Scripts
- Orthographic Complexities and Their Impact on Language Technology
- Morphological Analysis and Generation in South Asian Languages
- Dialectal Variations and Language Standardisation
- Code-Mixing and Multilingualism in South Asian Contexts
- Building Linguistic Resources for South Asian Languages
- Speech Recognition and Synthesis for South Asian Languages
- Preserving Linguistic Heritage through Technology
- Benchmarking Models for South Asian Languages

*Important Dates*

All deadlines are 11:59PM UTC-12:00 (“Anywhere on Earth”).

- First CFP: Monday, 15 July 2024
- Submission Deadline: November 7, 2024 (extended from October 30, 2024)
- Notification of acceptance: November 29, 2024
- Camera-ready papers: December 13, 2024
- Pre-recorded video due: January 5, 2025
- Workshop (Virtual): January 19, 2025

For more information: https://sites.google.com/view/chipsal

Opening of the Faetar Low-Resource ASR Challenge 2025
by Ewan Dunbar 01 Nov '24

We are pleased to officially announce the opening of the Faetar Low-Resource ASR Challenge 2025. While we were not able to secure a special session dedicated to the challenge at the conference, we strongly encourage submission of papers describing your systems to Interspeech 2025. As such, we plan to adhere to a timeline that will allow us to return test results and announce winners in time for participants to prepare Interspeech papers (see below).

Challenge website: https://perceptimatic.github.io/faetarspeech/

The Faetar Low-Resource ASR Challenge aims to focus researchers’ attention on several issues which are common to many archival collections of speech data:

- noisy field recordings
- lack of standard orthography, leading to noise in the transcriptions in the form of transcriber inconsistencies
- only a few hours of transcribed data
- a larger collection of untranscribed data
- no additional data in the language (textual or speech) that is easily available
- “dirty” transcriptions in documents, which contain matter that needs to be filtered out

By focusing multiple research groups on a single corpus of this kind, we aim to gain deeper insights into these problems than can be achieved otherwise.

The challenge uses the Faetar ASR Benchmark Corpus. Faetar (pronounced [fajdar]) is a variety of the Franco-Provençal language which developed in isolation in Italy, far from other speakers of Franco-Provençal, and in close contact with Italian. Faetar has fewer than 1000 speakers around the world, in Italy and in the diaspora. It is endangered, and preservation, learning, and documentation are a priority for many community members. The benchmark data represents the majority of all archived speech recordings of Faetar in existence, and it is not available from any other source.

We propose four tracks:

- Constrained ASR. Participants should focus on the challenge of improving ASR architectures to work with small, poor-quality sets. Participants may not use any resources to train / fine-tune their models beyond the files contained in the provided train set. No external pre-trained acoustic models or language models are allowed, and the use of the unlabelled portion of the Faetar challenge data set is not allowed either.

Three other “thematic tracks” can be explored, and should not be considered mutually exclusive:

- Using pre-trained acoustic models or language models. Participants focus on the most effective way to make use of models pre-trained on other languages.
- Using unlabelled data. The challenge data also includes ~20 hrs of unlabelled data. Participants focus on finding the most effective way to make use of it.
- Dirty data. The training data was extracted and automatically aligned from long-form audio and partial transcriptions in “cluttered” word processor files, relying on (error-prone) VAD, scraping, and alignment. Participants focus on improving the pipeline for extracting useful training data, with the ultimate goal of improving performance.

Submissions will be evaluated on phone error rate (PER) on the test set. Participants are provided with a dev kit allowing them to calculate the PER on dev and train, as well as reproduce the baselines.

For more information, and to register and obtain the data and the dev kit, please visit the challenge website: https://perceptimatic.github.io/faetarspeech/

For questions, please contact us by writing to faetarasrchallenge(a)gmail.com.
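For readers unfamiliar with the evaluation metric named above: phone error rate is conventionally the Levenshtein edit distance (substitutions, deletions, insertions) between the reference and hypothesis phone sequences, divided by the number of reference phones. The following is a minimal sketch under that conventional definition, not the challenge's dev kit. The function names, the space-separated phone format, and the example strings are all assumptions made for illustration; the official scoring script may normalise or aggregate differently.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences, unit costs."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion of a reference phone
                curr[j - 1] + 1,          # insertion of a hypothesis phone
                prev[j - 1] + (r != h),   # substitution (cost 0 if phones match)
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level PER: total edits divided by total reference phones.

    Each pair is (reference, hypothesis); phones are assumed to be
    separated by spaces (a hypothetical format, chosen for illustration).
    """
    total_edits = 0
    total_ref_phones = 0
    for ref, hyp in pairs:
        ref_phones = ref.split()
        hyp_phones = hyp.split()
        total_edits += edit_distance(ref_phones, hyp_phones)
        total_ref_phones += len(ref_phones)
    if total_ref_phones == 0:
        raise ValueError("reference contains no phones")
    return total_edits / total_ref_phones


# Hypothetical example: one substitution and one insertion over 8 reference phones.
example = [("f a j d a r", "f a j t a r"), ("a b", "a b c")]
print(phone_error_rate(example))
```

Pooling edits over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score; whether the challenge does the same is not stated in the announcement.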

Deadline extension CFP: The 1st Workshop on NLP for Languages Using Arabic Script (AbjadNLP 2025)
by Amal Haddad 31 Oct '24

CALL FOR PAPERS: THE 1ST WORKSHOP ON NLP FOR LANGUAGES USING ARABIC SCRIPT (ABJADNLP 2025)

Co-located with COLING 2025 Conference, Abu Dhabi, UAE (19-20 January 2025)
https://wp.lancs.ac.uk/abjad/
Submission URL: https://softconf.com/coling2025/AbjadNLP25/

AbjadNLP is dedicated to advancing innovation and gaining deeper insights into Natural Language Processing (NLP) for languages that use the Arabic script. Our primary focus is on Abjad and Ajami languages that utilise the Arabic script or its variations. Traditionally associated with Semitic languages, Abjad scripts represent consonants in every syllable. In contrast, Ajami scripts denote the alphabetic use of the Arabic script in various African contexts, representing non-Arabic languages. We are interested in research on languages that fall under the Abjad or Ajami categories that use the Arabic script or any variations of it.

We invite contributions, discussions, and explorations that delve deep into the unique linguistic structures, resources, challenges, and untapped potential presented by Abjad and Ajami languages within the realm of NLP and language resources. Our goal is to create synergies among researchers by addressing the diverse phenomena and challenges inherent in these rich linguistic traditions.

The workshop is proud to highlight our connections with the Masakhane NLP community and collaborations with institutions worldwide, such as COMSATS on Urdu, and the long-standing UCREL NLP Group at Lancaster University, whose work encompasses over 20 languages worldwide, including Abjad and Ajami languages.

Note: We chose the name Abjad for simplicity, but our focus includes Abjad and other languages that have adopted the Arabic and Perso-Arabic scripts, as well as Ajami languages. We acknowledge that Sorani Kurdish, when written in Arabic script, follows an alphabet style rather than an Abjad style.

TOPICS OF INTEREST:
* Core Technologies: morphological analysis, disambiguation, tokenisation, POS tagging, named entity detection, chunking, parsing, semantic role labelling, sentiment analysis, language modelling, etc.
* Applications: machine translation, speech recognition, speech synthesis, optical character recognition, assistive technologies, social media, etc.
* Resources and Tools: dictionaries, annotated data, corpora, orthography descriptions, font technology, glyph rendering, text input methodologies, spell-checking, speech-to-text solutions, BLARK descriptions, open access corpora.
* Cultural and Sociolinguistic Considerations: text processing, transliteration challenges, and solutions, cultural contexts in NLP applications.

SUBMISSION GUIDELINES:
We follow the COLING 2025 standards for submission format and guidelines. Submissions should conform to the following types:
* Long papers: Up to eight (8) pages, presenting substantial, original, completed, and unpublished work.
* Short papers: Up to four (4) pages, describing a small focused contribution, negative results, system demonstrations, etc.

KEY DATES:
* 1st Call for Papers Announcement: 16 July 2024
* 2nd Call for Papers Announcement: 16 August 2024
* Paper Submission Deadline: 2 December 2024
* Workshop Date: 19 or 20 January 2025

ORGANISING COMMITTEE:
General Chair: Mo El-Haj, Lancaster University
Programme Chairs:
* Hugh Paterson III, Collaborative Scholar
* Saad Ezzini, Lancaster University
* Ignatius Ezeani, Lancaster University
Review Committee:
* Mahum Hayat Khan, University of La Rioja
* Muhammad Sharjeel, COMSATS University Islamabad
Publication Chair: Sina Ahmadi, University of Zurich
Publicity Chairs:
* Cynthia Amol, Maseno University
* Amal Haddad Haddad, University of Granada
* Jaleh Delfani, University of Surrey
Advisory Committee:
* Ruslan Mitkov, Lancaster University
* Paul Rayson, Lancaster University

UMRs in Boston 2025 Summer School - Second Call for Applications
by kristine.stenzel＠colorado.edu 31 Oct '24

UMRs in Boston Summer School – 2nd Call for Applications
June 9-13, 2025
Brandeis University, Massachusetts, USA
URL: https://umr4nlp.github.io/web/SummerSchool2025.html

We invite applications for a five-day summer school on Uniform Meaning Representations (UMR).

Impressive progress has been made in many aspects of natural language processing (NLP) in recent years. Most notably, the achievements of transformer-based large language models such as ChatGPT would seem to obviate the need for any type of semantic representation beyond what can be encoded as contextualized word embeddings of surface text. Advances have been particularly notable in areas where large training data sets exist, and it is advantageous to build an end-to-end training architecture without resorting to intermediate representations. For any truly interactive NLP applications, however, a more complete understanding of the information conveyed by each sentence is needed to advance the state of the art. Here, "understanding" entails the use of some form of meaning representation. NLP techniques that can accurately capture the required elements of the meaning of each utterance in a formal representation are critical to making progress in these areas and have long been a central goal of the field. As with end-to-end NLP applications, the dominant approach for deriving meaning representations from raw textual data is through the use of machine learning and appropriate training data. This allows the development of systems that can assign appropriate meaning representations to previously unseen text.

In this five-day course, instructors from the University of Colorado and Brandeis University will describe the framework of Uniform Meaning Representations (UMRs), a recent cross-lingual, multi-sentence incarnation of Abstract Meaning Representations (AMRs), that addresses these issues and comprises such a transformative representation. Incorporating Named Entity tagging, discourse relations, intra-sentential coreference, negation and modality, and the popular PropBank-style predicate argument structures with semantic role labels into a single directed acyclic graph structure, UMR builds on AMR and keeps the essential characteristics of AMR while making it cross-lingual and extending it to be a document-level representation. It also adds aspect, multi-sentence coreference and temporal relations, and scope.

Each day will include lectures and hands-on practice. Topics to be covered may include the following, among others:

1. The basic structural representation of UMR and its application to multiple languages;
2. How UMR encodes different types of MWE (multi-word expressions), discourse and temporal relations, and TAM (tense-aspect-modality) information in multiple languages, and differences between AMR and UMR;
3. Going from IGT (interlinear glossed text) to UMR graphs semi-automatically;
4. Formal semantic interpretation of UMR incorporating a continuation-based semantics for scope phenomena involving modality, negation, and quantification;
5. Extension to UMR for encoding gesture in multimodal dialogue, Gesture AMR (GAMR), which aligns with speech-based UMR to account for situated grounding in dialogue;
6. UMR parsing and applications.

To apply, please complete this form by Nov. 15, 2024: https://www.colorado.edu/linguistics/umrs-boston-summer-school-application

Other important dates:
● Notification of acceptance: Dec. 15, 2024
● Confirmation of participation: Jan. 31, 2025

Participation will be fully funded (reasonable airfare, lodging, and meals).

This summer school has been made possible by funding from NSF Collaborative Research: Building a Broad Infrastructure for Uniform Meaning Representations (Award # 2213805), with additional support from Brandeis University.

Webinar by Elena Sokolova (Amazon Text-to-Speech Group)
by HiTZ zentroa 31 Oct '24

**** We apologize for the multiple copies of this email. If you are already registered for the next webinar, you do not need to register again. ****

Dear colleague,

We are happy to announce the next webinar in the Language Technology webinar series organized by the HiTZ Chair of AI&LT (https://hitz.eus). You can check the videos of previous webinars and the schedule for upcoming webinars here: http://www.hitz.eus/webinars

Next webinar:

*Speaker:* Elena Sokolova (Amazon Text-to-Speech Group)
*Title:* How we do research in Speech at Amazon
*Date:* Thursday, November 7, 2024 - 15:00 CET
*Summary:* In this talk we will present how Speech technology has developed in the past 20 years. We will take a deep dive into the research that we do at Amazon in our Text-to-Speech lab, describe the challenges that we face and how we solve them at scale. We will also give an overview of the internship opportunities we have in our department for those of you who want to join our team in 2025.
*Bio:* Elena is a Machine Learning team manager at Amazon, where she leads novel research in the field of speech technology. Over the past five years, she has overseen the deployment of machine learning projects into production and collaborated with her team to publish cutting-edge research on text-to-speech technology. Before joining Amazon, Elena completed her PhD at Radboud University Nijmegen in the Netherlands and gained industry experience as a Senior Machine Learning Scientist at Booking.com.

*Upcoming webinars:*
· Javier de la Rosa (December 12, 2024)
· Ekaterina Shutova (January 30, 2025)
· Sebastian Ruder (February 6, 2025)

If you are interested in participating, please complete this registration form: http://www.hitz.eus/webinar_izenematea

If you cannot attend this seminar, but you want to be informed of the following HiTZ webinars, please complete this registration form instead: http://www.hitz.eus/webinar_info

Best wishes,
HiTZ Zentroa

P.S.: HiTZ will not grant any type of certificate for attendance at these webinars.

2 year post-doc position available at Lattice (Paris/Montrouge, France) F/H
by Thierry Poibeau 31 Oct '24

Offer Description

For the full online description, see: https://euraxess.ec.europa.eu/jobs/286797

Lattice is offering a two-year postdoctoral position starting on January 1, 2025 or soon thereafter. Candidates may choose from one of the following research topics:

1. AI and Society: impact of AI on society, including topics such as climate effects, changes in the workforce, issues around data access (especially copyright challenges linked to large models), and more.
2. Interpretation of Large Language Models (LLMs) from a Linguistic Perspective: relationship between LLMs and linguistic theory. Topics might include what LLMs reveal about language, their (non-)alignment with linguistic theory, etc.
3. Cultural Analytics Using Large Corpora: cultural analysis based on large datasets, particularly those from the Bibliothèque nationale de France (which include literature, but also newspapers and francophone web archives).

The successful candidate should have completed their PhD in recent years or be close to completion. A relevant publication track record in a related field is required.

To apply, please submit a CV (including a list of publications), a short research proposal (2–4 pages), a link to one or two relevant publications, and the names of two references to thierry.poibeau(a)ens.psl.eu by November 15, 2024. The proposal should clearly outline the research topic and the intended methodology. I can briefly answer questions (e.g. on the suitability of a research topic), but the successful candidate will be selected through interviews after Nov 15.

The position is based in Montrouge and Paris (Montrouge is a 5-minute walk from the Mairie de Montrouge metro station). Salary will be commensurate with experience and follows the ENS salary scale.

Lattice is an interdisciplinary lab conducting research in linguistics, natural language processing, and computational humanities. International profiles are strongly encouraged to apply. Mastering French is a plus, but is not mandatory.

PhD positions in Research Training Group KEMAI on Medical XAI
by Christiane Böhm 31 Oct '24

Dear all,

Starting January 2025, 9 doctoral positions are available within our DFG Research Training Group KEMAI (Knowledge Infusion and Extraction for Explainable Medical AI) at Ulm University in Germany.

The KEMAI team aims at combining the benefits of knowledge- and learning-based systems, to not only allow for state-of-the-art accuracy in medical diagnosis, but to also clearly communicate the obtained predictions to physicians, considering ethical implications within the medical decision process. KEMAI’s main purpose is to provide interdisciplinary training for PhD students from computer science, medicine, and ethics in the area of explainable medical AI. The RTG offers a structured doctoral program that creates an environment in which young scientists can conduct research at the highest level in the field of medical AI.

We invite highly motivated candidates with a passion for research and a desire to contribute to an interdisciplinary academic environment to apply for these positions. (The positions are fully funded for 3+1 years and come with an E13 salary.)

Projects include:

Data Exploitation

• A1 – Harvesting Medical Guidelines using Pre-trained Language Models
Project Leads: Prof. Scherp (Computer Science), Prof. Braun (Computer Science), Dr. Vernikouskaya (Medicine)
This project focuses on researching multimodal pre-trained language models (LM) that extract symbolic knowledge on medical diagnosis and treatments from input documents. The models will incorporate structured knowledge and represent extracted information using an extended process ontology. The project applies these models to COVID-19 related imaging and treatments, contributing to OpenClinical’s COVID-19 Knowledge reference model and adapting to various benchmarks.

• A2 – Stability Improved Learning with External Knowledge through Contrastive Pre-training
Project Leads: Jun.-Prof. Götz (Medicine), Prof. Scherp (Computer Science)
This project aims to improve machine learning reliability in small data settings by learning from disconnected datasets using contrastive learning. It investigates if contrastive learning can reduce classifier susceptibility to confounders, reverse confounding effects, and identify out-of-distribution test samples. The project seeks to find approaches that address these technical challenges.

Knowledge Infusion

• B2 – Semantic Design Patterns for High-Dimensional Diagnostics
Project Leads: Prof. Kestler (Medicine), Prof. M. Beer (Medicine)
This project defines semantic design patterns for incorporating SemDK in ML algorithms to improve clinical predictions and tumor characterization. The patterns will be categorized by their mechanisms and knowledge representation, providing guidelines for application. The project evaluates these patterns in image analysis and molecular diagnostics based on high-dimensional data.

Knowledge Extraction

• C2 – Learning Search and Decision Mechanisms in Medical Diagnoses
Project Leads: Prof. Neumann (Computer Science), Jun.-Prof. Götz (Medicine)
This project studies human attentive search and object attention principles for vision-based medical diagnosis. It investigates mechanisms of object-based attention and visual routines for task execution. The goal is to formalize human visual search strategies and integrate them into deep neural networks (DNNs) for improved medical diagnosis.

Model Explanation

• D1 – Accountability of AI-based Medical Diagnoses
Project Leads: Prof. Steger (Ethics), Prof. Ropinski (Computer Science)
This project addresses the ethical analysis of AI system designs for medical diagnoses. It focuses on determining which AI-supported processes need to be explainable and transparent, generating comprehensive information to help users understand AI-driven medical decisions.

• D2 – Explainability, Understanding, and Acceptance Requirements
Project Leads: Prof. Hufendiek (Philosophy), Prof. Glimm (Computer Science), Dr. Lisson (Medicine)
This project applies philosophical insights on understanding and explanations to the use of AI in medical diagnosis. It clarifies the roles of understanding and abductive reasoning in medical diagnosis, identifies conflicts between stakeholders, and suggests ways to develop and integrate AI explanations with human experts' reasoning processes.

Medical PhD Projects (10 months)

The PhD projects outlined above are complemented by medical PhD projects, which are targeted at medical researchers.

For further information on KEMAI and on how to apply, please go to https://kemai.uni-ulm.de/

Best regards
Christiane Böhm
- Coordinator -
RTG KEMAI
Ulm University
James-Franck-Ring - O27 Room 321
D-89081 Ulm
Germany
phone: +49 731 50 31321
christiane.boehm(a)uni-ulm.de
kemai(a)uni-ulm.de

[CfP] "Contemporary Trends in English-Language Studies 2" (Zielona Góra)
by Leszek Szymański 30 Oct '24

Dear Colleagues,

The Institute of Modern Languages at the University of Zielona Góra announces the "Contemporary Trends in English-Language Studies 2" conference. This year's edition will be held in a hybrid mode on April 3-4, 2025.

The abstract submission deadline is January 31, 2025.

The 2025 edition of the conference is organized under the honorary patronage of the Polish Linguistic Society.

More information is available at: https://sites.google.com/view/ctiels/

Thank you!
Leszek Szymański

Call for Papers: NAACL Student Research Workshop (SRW) 2025
by mengyan3＠andrew.cmu.edu 30 Oct '24

### Call for Papers

The Student Research Workshop (SRW) provides a venue for student researchers to present their work in computational linguistics and natural language processing. Students receive feedback from the general conference audience as well as from mentors specifically assigned according to the topic of their work.

We invite papers in two different categories:

Research Papers: Papers in this category can describe completed work, or work in progress with preliminary results. For these papers, the first author must be a current student. Topics of interest for the SRW are the same as for the NAACL main conference. See the list of topics here.

Thesis Proposals: This category is appropriate for advanced students who have decided on a thesis topic and wish to get feedback on their proposal and broader ideas for their continuing work.

### Submissions

Papers should be submitted on OpenReview by December 1, 2024.

OpenReview link: https://openreview.net/group?id=aclweb.org/NAACL/2024/Workshop/Student_Rese…

Submissions should be no more than five pages, not including references. Make sure to read through the Author Guidelines on our website for detailed submission instructions. Not following the author guidelines may result in your paper being desk-rejected.

Please use the standard ACL templates and style guide, but keep in mind that the SRW page limit is 5 pages (not including references).

ACL templates: https://github.com/acl-org/acl-style-files

### Website

Please see our CfP on the website for more details: https://naacl2025-srw.github.io/cfp
