- Corpora - ELRA lists

PhD job offer: Language technology for cultural heritage (1.0 FTE; deadline 14 October)
by a.w.van.cranenburgh＠rug.nl 04 Oct '24

The Computational Linguistics group (GroNLP) of the Center for Language and Cognition Groningen (CLCG) is looking for a PhD student in “Language technology for cultural heritage: New discoveries with little data” within the HAICu research project.

The HAICu project is a large-scale Dutch research project by universities and cultural-heritage institutions into new forms of Artificial Intelligence-based access to multimodal cultural-heritage data, both contemporary and historical. Within HAICu, AI researchers, Digital Humanities researchers and a wide range of public and private partners will co-develop scientific solutions to unlock the true societal potential of the current heterogeneous digital heritage collections. It will provide easier, richer and more reliable data access to citizens, journalists, civic organisations, and various other stakeholders. HAICu is funded by the NWO National Science Agenda (NWA) and has a budget of about EUR 10 million. HAICu started in January 2024 and will last 6 years (until January 2030). For more information about HAICu, please see https://www.haicu.science/

The PhD Project

This specific PhD position is about effectively dealing with missing and sparse labels in humanities datasets from fields such as literature, history, and philosophy. Cultural heritage institutions, and especially the National Library of the Netherlands, offer access to a lot of digitized data which can be leveraged through computational approaches. However, it is very common that the data is incomplete. This is a challenge for typical machine learning methods, which rely on being fed representative and complete data and therefore lead to systems that cannot handle distribution shifts or extrapolate beyond their training set. Recent developments in artificial intelligence have shown that large language models are able to learn from small amounts of training data, or even none at all (few-shot and zero-shot learning). Paired with increasingly accessible techniques for specializing existing models to target domains and tasks, this opens up many new possibilities for cultural heritage data, which will be explored within this project.

Examples of possible topics include
- Investigating literary reception and prestige over time.
- Detecting and mapping intertextuality within texts.
- Uncovering the influences and biases over time in datasets.
- Monitoring the evolution of concepts in textual datasets.
- Improving the robustness of models to out-of-distribution data.

The project will, in collaboration with the National Library of the Netherlands, be coordinated by Andreas van Cranenburgh, Tommaso Caselli, and Malvina Nissim at the University of Groningen. This is an interdisciplinary project at the intersection of Computational Linguistics/Natural Language Processing (NLP) and the humanities.

You will be asked to
- Develop a specific research proposal within the proposed theme.
- Review the academic literature relevant to the project’s goals.
- Carry out research, present your results and author scientific articles on the above-mentioned topics.
- Collaborate with members of the Computational Linguistics group at the University of Groningen, the National Library, and with the broader HAICu consortium.
- Engage and collaborate with other researchers working on computational humanities research.
- Complete a PhD thesis written in English in the specified timeframe (4 years).
- Collaborate on outreach and public engagement activities.
- Gain teaching experience.

This PhD project offers a unique opportunity to work in an international environment and to acquire valuable research experience: You will be carrying out research in the context of the Computational Linguistics group of the Center for Language and Cognition (CLCG) of the University of Groningen, and will be spending at least one day a month at the National Library in The Hague.
For more information, see https://www.rug.nl/about-ug/work-with-us/job-opportunities/?details=00347-0…

CFP - SemEval 2025 Task 8: Question Answering on Tabular Data
by Eugenio Martínez Camara 04 Oct '24

[apologies if you receive multiple copies of this call]

CALL FOR PARTICIPATION - SemEval 2025 Task 8: Question Answering on Tabular Data

We are pleased to announce the first SemEval task on Question Answering on Tabular Data.

Our SemEval 2025 task consists of Question Answering over Tabular Data, making use of the DataBench benchmark. DataBench is a benchmark composed of real-world table datasets from different domains, with large numbers of rows and columns and a wide variety of data types, which makes it possible to assess distinct sorts of questions related to each data type.

We propose a task to encourage participants to develop a system that answers questions of the kind present in DataBench over day-to-day datasets, where the answer is either a number, a categorical value, a boolean value or a list of several types. DataBench can be used as a training and validation set, while we will release another test set explicitly compiled for the task competition. The system developed by the participants will be given a series of (dataset, question) pairs and will need to provide an answer, which will then be compared with a gold standard.

The answer may be obtained through a variety of methods. In our paper we illustrate two different approaches: In-Context Learning and Code Generation. You may use either of these or come up with your own approach.

There will be two subtasks:

Subtask I: DataBench QA
Participants will be provided with a dataset (of any size) and a question over it. The question should be answered using the data from the dataset only.

Subtask II: DataBench Lite QA
The task is essentially the same as the previous subtask, but involves using the sampled version of each dataset with a maximum of 20 rows per dataset (see explanation on DataBench Lite). The question should be answered using the data from the sampled dataset only. For the test set, we will similarly provide a reduced version of each dataset for this subtask. This subtask is especially relevant when testing models with a smaller context window.

Important Dates
Official competition start: 10 January 2025
Competition end: 31 January 2025

Task Organizers
Jorge Osés Grijalba - Graphext
L. Alfonso Ureña-López and Eugenio Martínez Cámara - University of Jaén
Jose Camacho-Collados - Cardiff University

Competition website: https://jorses.github.io/semeval/
Codabench: https://www.codabench.org/competitions/3360/
Google Group: https://groups.google.com/g/semeval-25-t8-tabularqa

--
I sometimes work at irregular times and this email might arrive out of working hours, so please be assured that I respect your working pattern and look forward to your response when it suits you.
-------
Eugenio Martínez Cámara
Vice President of the SEPLN <http://www.sepln.org/en>
Postdoctoral Researcher in Natural Language Processing
SINAI <http://sinai.ujaen.es/> Research Group
Associate Professor
Computer Science Department
University of Jaén
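As a rough illustration of the Code Generation approach and the 20-row DataBench Lite setting described in the SemEval 2025 Task 8 call above, the sketch below evaluates a small expression against a table. It is not the organisers' code: the function names, the toy table, the sampling scheme and the "generated" expression are all assumptions made for illustration, and in a real system the expression would come from a language model and would need proper sandboxing.

```python
import csv
import io
import random

LITE_MAX_ROWS = 20  # row cap stated in the call for Subtask II

# Names the generated expression may use; everything else is blocked.
# This is NOT a real sandbox, only a way to keep the sketch small.
ALLOWED_NAMES = {"len": len, "sum": sum, "min": min, "max": max, "float": float}


def load_table(csv_text):
    """Parse CSV text into a list of row dictionaries (all values are strings)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def sample_lite(rows, max_rows=LITE_MAX_ROWS, seed=0):
    """Return at most max_rows rows. Uniform sampling is an assumption here."""
    if len(rows) <= max_rows:
        return list(rows)
    return random.Random(seed).sample(rows, max_rows)


def run_generated_expression(rows, expression):
    """Evaluate a (hypothetically model-generated) expression over `rows`."""
    scope = {"__builtins__": {}, "rows": rows, **ALLOWED_NAMES}
    return eval(expression, scope)


if __name__ == "__main__":
    # Toy table and question: "How many people are older than 30?"
    table = load_table("name,age,city\nAna,34,Jaen\nBo,28,Cardiff\nCy,41,Jaen\n")
    generated = "sum(1 for r in rows if float(r['age']) > 30)"
    print(run_generated_expression(sample_lite(table), generated))
```

The returned value (a number here, but it could equally be a boolean, a category or a list) is what would then be compared with the gold-standard answer.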

2nd Call for Papers: The First Workshop of Evaluation of Multi-Modal Generation
by mongyuansim＠gmail.com 04 Oct '24

Dear Colleagues,

We are delighted to announce the second CfP for The First Workshop of Evaluation of Multi-Modal Generation @ COLING 2025. Submission is now open and closes on 20th November 2024 (11:59PM AoE, UTC-12).

Website link: https://evalmg.github.io/

The 1st Evaluation of Multi-Modal Generation Workshop @ COLING 2025
----------------------------------------------------------------------------------------------------------
Multimodal generation techniques have opened new avenues for creative content generation. However, evaluating the quality of multimodal generation remains underexplored and some key questions are unanswered, such as the contribution of each modality, the utility of pre-trained large language models for multimodal generation, and how to measure faithfulness and fairness in multimodal outputs. This workshop aims to foster discussions and research efforts by bringing together researchers and practitioners in natural language processing, computer vision, and multimodal AI. Our goal is to establish evaluation methods for multimodal research and advance research efforts in this direction.

Call for Papers
----------------------
Both long and short papers (up to 8 pages and 4 pages respectively, with unlimited references and appendices) are welcome. Topics relevant to this workshop include, but are not limited to:
- Evaluation metrics for multimodal text generation for assessing informativeness, factuality and faithfulness
- New benchmark datasets, evaluation protocols and annotations
- Challenges in evaluating multimodal coherence, relevance and contribution of modalities and inter- and intra-interactions
- Assessing information integration and aggregation across multiple modalities
- Adversarial evaluation approaches for testing the robustness and reliability of multimodal generation systems
- Ethical considerations in the evaluation of multimodal text generation, including bias detection and mitigation strategies
- Multilingual multimodal text generation systems for low-resource languages
- Evaluating fairness and privacy in multimodal learning and applications

Important Dates
---------------------------
- Nov 20, 2024: Paper submission due date
- Dec 05, 2024: Notification of acceptance
- Dec 11, 2024: Camera-ready version due
- Jan 20, 2025: Workshop Date
Note: All deadlines are 11:59PM UTC-12:00 (“Anywhere on Earth”)

Submission Instructions
-------------------------------------
You are invited to submit your papers through our START/SoftConf submission portal. All submitted papers have to be anonymous for double-blind review. The content of the paper should not be longer than 8 pages for long papers and 4 pages for short papers, strictly following the COLING 2025 templates, with the mandatory limitations section not counting towards the page limit. Supplementary material and appendices (either as separate files or appended after the main submission) are allowed. We encourage code link submissions for reproducibility.

Submission Link: https://softconf.com/coling2025/EvalMG25
ACL style template: https://coling2025.org/calls/submission_guidlines/

Non-archival Option
-------------------------------
To promote discussions within the community, our workshop includes a non-archival track. Authors have the flexibility to submit their unpublished work or papers accepted to the COLING main conference to our workshop. The organisers may offer the opportunity to give an oral or poster presentation.

Organisers
-------------------------------
* Wei Emma Zhang, The University of Adelaide
* Xiang Dai, CSIRO
* Desmond Elliot, University of Copenhagen
* Byron Fang, CSIRO
* Haojie Zhuang, The University of Adelaide
* Mong Yuan Sim, The University of Adelaide & CSIRO
* Weitong Chen, The University of Adelaide

Kind regards,
COLING25 EvalMG Organisers

ComputEL-8 workshop: regular + special sessions - final Call-for-Papers (DL postponed to Oct. 14)
by Antti Arppe 04 Oct '24

ComputEL-8: Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages

FINAL CALL FOR PAPERS for REGULAR SESSION (and SPECIAL SESSION)

Submission deadline (POSTPONED): October 14, 2024
Submission link: https://easychair.org/conferences/?conf=computel8

REGULAR SESSION
(For details about Special Session, scroll further below.)

We encourage submissions that explore the interface and intersection of computational linguistics, documentary linguistics, and community-based efforts in language revitalization and reclamation. This includes submissions that:
(i) propose or demonstrate new methods or technologies for tasks or applications focused on low-resource settings, and in particular, endangered languages
(ii) examine the use of specific methods in the analysis of data from low-resource languages, or propose new methods for analysis of such data, oriented toward the goals of language reclamation and revitalization
(iii) propose new models for the collection, management, and mobilization of language data in community settings, with attention to e.g. issues of data sovereignty and community protocols
(iv) explore concrete steps for a more fruitful interaction among computer scientists, documentary linguists, and language communities

IMPORTANT DATES
14-Oct-2024 Deadline for submission of papers or extended abstracts
22-Nov-2024 Notification of Acceptance
10-Jan-2025 Camera-ready papers due
4 & 5 March 2025 Workshop

PRESENTATIONS
Presentation of accepted papers will be in both oral sessions and a poster session. The decision on whether a presentation for a paper will be oral and/or poster will be made by the Organizing Committee on the advice of the Program Committee, taking into account the subject matter and how the content might be best conveyed. Oral and poster presentations will not be distinguished in the Proceedings.

SUBMISSIONS
In line with our goal of reaching multiple overlapping communities, we offer two modes of submission: extended abstract and full paper. The mode of submission does not influence the likelihood of acceptance. Either can be submitted to one of the workshop’s tracks: (a) language community perspective and (b) academic perspective.

Submissions must be uploaded to EasyChair (https://easychair.org/conferences/?conf=computel8) no later than October 14, 2024 11:59PM (UTC-12, “anywhere on earth”). Submissions may be considered for both the regular session and the special session. All submissions must be anonymous following ACL guidelines and will be peer-reviewed by the scientific Program Committee.

A. Extended Abstract: Please submit anonymous abstracts of up to 1500 words, excluding references. Extended abstracts must be submitted as attached documents.

B. Full Paper: Please submit anonymously either a) a long paper - max. 8 pages excluding references and appendices; or b) a short paper - max. 4 pages excluding references, according to the style and formatting guidelines provided in the ACL Style Files (download template files for LaTeX or Microsoft Word: https://github.com/acl-org/acl-style-files).

PROCEEDINGS
The authors of selected accepted full papers (long or short) will be invited by the Organizing Committee to submit their papers for online publication via the open-access ACL Anthology. Final versions of long and short papers will be allotted one additional page (altogether 9 and 5 pages, respectively) excluding references. Proceedings papers should be revised and improved versions of the work that was submitted for, and which underwent, review. Any revisions should concern responses to reviewer comments or the addition of relevant details and clarifications, but not entirely new, unreviewed content. Camera-ready versions of the articles for publication will be due on January 10, 2025.

Please see the ComputEL-8 website for further information: https://computel-workshop.org/computel-8/

SPECIAL THEME SESSION - BUILDING TOOLS TOGETHER
In addition to the main session, ComputEL-8 invites self-identified submissions to a special themed session on “Building Tools Together”, oriented toward amplifying our shared understanding of how best to work together across disciplinary and cultural boundaries to build technological tools that support community language revitalization. We invite presentations that: (1) describe collaborations in the development of new tools and technologies; and/or (2) describe or identify technological or computational needs within community language reclamation contexts, and/or propose solutions.

1. For presentations that describe a collaboration among language communities, academic researchers, and (in some cases) industry or non-governmental organizations towards the development of new tools, resources, and technologies, we encourage submissions which address questions such as:
a. How did the idea for the tool or technology come about?
b. How did the team members meet and come to work together?
c. What has been the impact of this tool? How are you evaluating it? How has the project benefitted community efforts at language maintenance and revitalization?
d. What are some challenges (logistical, technical, interdisciplinary, intercultural) that you encountered, and how did you address them?
e. How have you balanced the needs and priorities of different team members through the lifespan of the project?
f. What lessons have you learned that might benefit similar collaborations?

2. For presentations that identify technological or computational needs within community language reclamation contexts, and/or propose solutions, we encourage submissions which address questions such as:
a. What is the need that this tool would meet? Who will it serve?
b. What is the blue-sky version of this tool? What is the minimum viable product version?
c. What kinds of data, digital assets, or media content would be required to create the tool, and how would they be assembled?
d. What challenges might the team face in the development process?
e. How do you anticipate the collaborative process will best incorporate diverse areas of expertise, from cultural and community-grounded knowledge to academic, technical, and production-oriented knowledge?

Please submit anonymous extended abstracts of up to 1500 words, excluding references. Submissions representing community-led collaborations are strongly encouraged.

Submissions must be uploaded to EasyChair (https://easychair.org/conferences/?conf=computel8) no later than October 14, 2024 11:59PM (UTC-12, “anywhere on earth”). Submissions may be considered for both the regular session and the special session. Notification of acceptance to the Special Session will be sent out by November 22, 2024.

All authors of papers in the Special Theme Session will be invited to contribute to a follow-up paper that synthesizes the findings of the Session.

IMPORTANT DATES
14-Oct-2024 Deadline for submission of papers or extended abstracts
22-Nov-2024 Notification of Acceptance
10-Jan-2025 Camera-ready papers due
4 & 5 March 2025 Workshop

Please see the ComputEL-8 website for further information: https://computel-workshop.org/special-theme-session-building-tools-together/

ORGANIZING COMMITTEE
Godfred Agyapong (University of Florida)
Antti Arppe (University of Alberta)
Aditi Chaudhary (Google DeepMind)
Jordan Lachler (University of Alberta)
Sarah Moeller (University of Florida)
Shruti Rijhwani (Google DeepMind)
Daisy Rosenblum (University of British Columbia)
Olivia Waring (University of Hawai'i Mānoa)

CONTACT US
WEB: https://computel-workshop.org/ComputEL-8/
EMAIL: computel.workshop(a)gmail.com

--
======================================================================
Antti Arppe - Ph.D (General Linguistics), M.Sc. (Engineering)
Professor of Quantitative Linguistics
Director, Alberta Language Technology Lab (ALTLab)
Project Director, 21st Century Tools for Indigenous Languages (21C)
Past President, ACL SIG for Endangered Languages (SIGEL)
Department of Linguistics, University of Alberta
E-mail: arppe(a)ualberta.ca, antti.arppe(a)iki.fi
WWW: www.ualberta.ca/~arppe, altlab.artsrn.ualberta.ca
Mānahtu ina rēdûti ihza ummânūti ihannaq - dulum ugulak úmun ingul
----------------------------------------------------------------------

HPLT Datasets version 2.0 released
by Andrey Kutuzov 03 Oct '24

(Apologies for cross-posting)

Version 2.0 of the HPLT Datasets is now published, with web-derived corpora in 193 languages. These collections are available under the Creative Commons CC0 license and bring significant improvements compared to previous releases (version 1.2). As with 1.2, the release comes in two variants: de-duplicated (21 TB in size) and cleaned (15 TB in size). The cleaned variant contains the same documents as the de-duplicated one, minus those filtered out by our cleaning heuristics. The cleaned variant is the recommended one, unless you want to try your own cleaning pipelines.

Download the corpora here: https://hplt-project.org/datasets/v2.0

As with the previous releases, the version 2.0 datasets are hosted by the Sigma2 NIRD Data Lake (https://www.sigma2.no/service/nird-data-lake), and the text extraction pipeline was run on the LUMI supercomputer (https://lumi-supercomputer.eu/).

*What's new*
- The size of the source web collections has increased 2.5x: 4.5 petabytes of compressed web data in total, mostly from Internet Archive (https://archive.org/), but also from Common Crawl (https://commoncrawl.org/).
- The text extraction pipeline now uses Trafilatura (https://trafilatura.readthedocs.io/), which results in more efficient boilerplate removal and thus less noise in the data.
- Language identification now uses a refined version of OpenLID (https://aclanthology.org/2023.acl-short.75/).
- This, in turn, allowed us to publish data in 193 languages, compared to 75 languages in version 1.2.
- We switched from two-letter ISO 639-1 language codes to three-letter ISO 639-3 language codes, augmented with a postfix denoting the writing system. For example, `pol_Latn` is Polish written in Latin script. Mapping from the old to the new codes is available at https://github.com/hplt-project/warc2text-runner/blob/main/stats/_langs/lan….
- The documents are now annotated with their compliance with the robots.txt file of the original website. This metadata field can be used to filter out documents explicitly forbidden for crawling by website owners, making the resulting corpora somewhat less prone to copyright violations. The cleaned variant contains only robots.txt-compliant documents. More details at https://github.com/hplt-project/monotextor-slurm/blob/main/README.md#robots…
- De-duplication is done at collection level, not at dataset level.
- Documents have also been annotated for PII with multilingual-pii-tool (https://github.com/mmanteli/multilingual-PII-tool). Matches are identified in the form of Unicode character offsets for every match.
- Segment-level language-model-based scores have been replaced by document quality scores computed with web-docs-scorer.
- Filtering and cleaning criteria have been simplified (https://github.com/hplt-project/monotextor-slurm?tab=readme-ov-file#filters).

HPLT Monolingual Datasets version 2.0 (the de-duplicated variant) feature about 7.6 trillion whitespace-separated words and about 52 trillion characters extracted from 21 billion documents, compared to 5.6 trillion words and 42 trillion characters extracted from 5 billion documents in version 1.2.

All in all, you can expect less noise and boilerplate, fewer duplicates, more unique documents, and generally better quality texts to train language models on or to use for other NLP tasks.

*How was this dataset produced*
You may want to read section 3 of our deliverable "HPLT pipeline and tools" (https://hplt-project.org/HPLT_D7_2___HPLT_pipelines_and_tools.pdf) for a full description of how we produced this dataset. If you don't have much time for reading, maybe this chart is enough for your purposes: https://hplt-project.org/_next/static/media/dataset-pipeline-light.c2521ee1…

Each language is accompanied by an HPLT Analytics report. These automated reports provide useful information and statistics about the cleaned version of the HPLT v2.0 datasets. They are the result of running the HPLT Analytics Tool (https://github.com/hplt-project/data-analytics-tool) on them. They are helpful for inspecting the datasets even before downloading them.

*What is HPLT?*
HPLT (High Performance Language Technologies) is an EU Horizon Europe funded project which aims at collecting large quantities of data in many languages and training powerful and efficient language and translation models. An important feature of HPLT is openness and transparency: all the artifacts of the project are publicly available under permissive licenses.
https://hplt-project.org/

--
Andrey
Language Technology Group (LTG)
University of Oslo
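For readers who want to work with two of the conventions mentioned in the HPLT announcement above (language tags such as `pol_Latn`, and PII matches given as Unicode character offsets), here is a minimal sketch. It is not part of the HPLT tooling: the function names are hypothetical, and it assumes that offsets are half-open `(start, end)` pairs counted in Unicode code points, which may differ from the actual metadata layout; please check the HPLT documentation before relying on it.

```python
from typing import List, NamedTuple, Tuple


class LangTag(NamedTuple):
    language: str  # three-letter ISO 639-3 code, e.g. "pol"
    script: str    # four-letter writing-system postfix, e.g. "Latn"


def parse_lang_tag(tag: str) -> LangTag:
    """Split a tag such as 'pol_Latn' into its language and script parts."""
    language, sep, script = tag.partition("_")
    if not sep or len(language) != 3 or len(script) != 4:
        raise ValueError(f"not a language_Script tag: {tag!r}")
    return LangTag(language, script)


def mask_pii(text: str, spans: List[Tuple[int, int]], mask: str = "*") -> str:
    """Overwrite every character inside the given spans with `mask`.

    Assumption: each span is (start, end) with `end` exclusive, counted in
    code points, which is how Python indexes `str` objects.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask
    return "".join(chars)


if __name__ == "__main__":
    print(parse_lang_tag("pol_Latn"))
    print(mask_pii("Mail bob@example.org now", [(5, 20)]))
```

Masking keeps the text length unchanged, so any other offset-based annotations on the same document remain valid after the PII spans are hidden.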

Second Call for Papers: The First Workshop on Language Models for Low-Resource Languages (LoResLM 2025@COLING)
by Ranasinghe, Tharindu 03 Oct '24

Neural language models have revolutionised natural language processing (NLP) and have provided state-of-the-art results for many tasks. However, their effectiveness is largely dependent on the pre-training resources. Therefore, language models (LMs) often struggle with low-resource languages in both training and evaluation. Recently, there has been a growing trend in developing and adopting LMs for low-resource languages. LoResLM aims to provide a forum for researchers to share and discuss their ongoing work on LMs for low-resource languages.

>> Topics
LoResLM 2025 invites submissions on a broad range of topics related to the development and evaluation of neural language models for low-resource languages, including but not limited to the following.
* Building language models for low-resource languages.
* Adapting/extending existing language models/large language models for low-resource languages.
* Corpora creation and curation technologies for training language models/large language models for low-resource languages.
* Benchmarks to evaluate language models/large language models in low-resource languages.
* Prompting/in-context learning strategies for low-resource languages with large language models.
* Review of available corpora to train/fine-tune language models/large language models for low-resource languages.
* Multilingual/cross-lingual language models/large language models for low-resource languages.
* Applications of language models/large language models for low-resource languages (e.g. machine translation, chatbots, content moderation, etc.).

>> Important Dates
* Paper submission due – 5th November 2024
* Notification of acceptance – 25th November 2024
* Camera-ready due – 13th December 2024
* LoResLM 2025 workshop – 19th / 20th January 2025, co-located with COLING 2025

>> Submission Guidelines
We follow the COLING 2025 standards for submission format and guidelines. LoResLM 2025 invites the submission of long papers of up to eight pages and short papers of up to four pages. These page limits only apply to the main body of the paper. At the end of the paper (after the conclusions but before the references), papers need to include a mandatory section discussing the limitations of the work and, optionally, a section discussing ethical considerations. Papers can include unlimited pages of references and an unlimited appendix.

To prepare your submission, please make sure to use the COLING 2025 style files available here:
* Latex - https://coling2025.org/downloads/coling-2025.zip
* Word - https://coling2025.org/downloads/coling-2025.docx
* Overleaf - https://www.overleaf.com/latex/templates/instructions-for-coling-2025-proce…

Papers should be submitted through Softconf/START using the following link: https://softconf.com/coling2025/LoResLM25/

>> Organising Committee
* Hansi Hettiarachchi, Lancaster University, UK
* Tharindu Ranasinghe, Lancaster University, UK
* Paul Rayson, Lancaster University, UK
* Ruslan Mitkov, Lancaster University, UK
* Mohamed Gaber, Birmingham City University, UK
* Damith Premasiri, Lancaster University, UK
* Fiona Anting Tan, National University of Singapore, Singapore
* Lasitha Uyangodage, University of Münster, Germany

>> Programme Committee
* Burcu Can - University of Stirling, UK
* Çağrı Çöltekin - University of Tübingen, Germany
* Debashish Das - Birmingham City University, UK
* Alphaeus Dmonte - George Mason University, USA
* Daan van Esch - Google
* Ignatius Ezeani - Lancaster University, UK
* Anna Furtado - University of Galway, Ireland
* Amal Htait - Aston University, UK
* Ali Hürriyetoğlu - Wageningen University & Research, Netherlands
* Diptesh Kanojia - University of Surrey, UK
* Jean Maillard - Meta
* Maite Melero - Barcelona Supercomputing Centre, Spain
* Muhidin Mohamed - Aston University, UK
* Nadeesha Pathirana - Aston University, UK
* Alistair Plum - University of Luxembourg, Luxembourg
* Sandaru Seneviratne - Australian National University, Australia
* Ravi Shekhar - University of Essex, UK
* Taro Watanabe - Nara Institute of Science and Technology, Japan
* Phil Weber - Aston University, UK

URL - https://loreslm.github.io/
Twitter - https://x.com/LoResLM2025

Best Regards
Tharindu Ranasinghe

[CFP] SemEval 2025 Task 5 - LLMs4Subjects - 1st Call for Shared Task Participation
by Jennifer D'Souza 03 Oct '24

[apologies if you receive multiple copies of this call]

Dear colleagues and friends,

*We are pleased to release the 1st Call for Participation to the LLMs4Subjects Shared Task organized as part of SemEval 2025.*

*Overview:* As the first of its kind, LLMs4Subjects invites the research community to *develop cutting-edge LLM-based semantic solutions for the subject tagging of the Leibniz University's Technical Library's open-access collection*. The shared task provides an opportunity for the research community to creatively utilize LLMs for subject tagging of technical records. *Systems need to demonstrate bilingual language modeling in understanding technical documents in both German and English.* Moreover, successful solutions may be directly integrated into the operational workflows of the TIB Leibniz Information Centre for Science and Technology University Library.

*What we provide to participants:* a human-readable form of a subject taxonomy (this is the *GND* or *Gemeinsame Normdatei*, the integrated authority file used for cataloging in German-speaking countries) and a large collection of technical records tagged with these subjects from the TIB's open-access collection called *TIBKAT*. More details on the task website: https://sites.google.com/view/llms4subjects/

*LLMs4Subjects defines the following three tasks:*
- Task 1: Learn the GND
- Task 2: Align subject tagging to the TIBKAT collection
- (Optional and Fun) Task 3: Develop Elegant Frontend Interfaces for Subject Tagging

*LLMs4Subjects will have three separate evaluations:*
- Evaluation 1: Quantitative Metrics-based Evaluations
- Evaluation 2: Qualitative Evaluations by the Human Subject Specialists
- (Optional) Evaluation 3: HCI evaluations for subject indexing interfaces submitted

*To participate in the LLMs4Subjects shared task,*
1. please submit your interest to participate using our online form ( https://forms.gle/YQzupcoySAyJi45c6),
2. sign up to the shared task Google Groups ( https://groups.google.com/u/6/g/llms4subjects) for FAQs, news, and announcements, and,
3. last but not least, download the datasets ( https://github.com/jd-coderepos/llms4subjects/) to begin development.

*Dates*
Training and validation datasets available: October 2, 2024
Test data available/Evaluation starts: January 10, 2025
Evaluation ends: January 31, 2025
Participant paper submissions due: February 28, 2025
Notification to authors: March 31, 2025
Camera ready due: April 21, 2025
SemEval workshop: TBD

*Task Organizers*
Jennifer D'Souza, Sameer Sadruddin, Holger Israel, Mathias Begoin et al.
All organizers are affiliated with the TIB Leibniz Information Centre for Science and Technology - Germany (https://www.tib.eu/en/)

*We look forward to having you on board!*

*Contact: * llms4subjects [at] gmail.com

eLex 2025: Intelligent lexicography - first announcement
by Iztok Kosem 03 Oct '24

Dear all,

This is the first announcement for the eLex 2025 conference on electronic lexicography, so please mark your calendars.

Dates: 18-20 November 2025
Location: Bled, Slovenia
Website: https://elex.link/elex2025/
Organiser: Centre for Language Resources and Technologies, University of Ljubljana

The theme of next year’s conference will be “Intelligent lexicography”. More information on the theme and the first call for papers will be published soon. We are also accepting proposals for pre- or postconference workshops.

Looking forward to seeing you in Slovenia,

Iztok Kosem
In the name of the organising committee


[2nd CFP] The 1st Workshop and Shared Task on Multilingual Counterspeech Generation
by mevallec＠ujaen.es 03 Oct '24


-----------------------------------------------------
Workshop on Multilingual Counterspeech Generation
-----------------------------------------------------

Background and Scope
---------------------
While interest in automatic approaches to Counterspeech generation has been steadily growing, including studies on data curation (Chung et al., 2019a; Fanton et al., 2021), detection (Chung et al., 2021a; Mathew et al., 2018), and generation (Tekiroglu et al., 2020; Chung et al., 2021b; Zhu and Bhat, 2021; Tekiroglu et al., 2022), the large majority of the published experimental work on automatic Counterspeech generation has been carried out for English. This is due both to the scarcity of non-English manually curated training data and to the crushing predominance of English in the generative Large Language Models (LLMs) ecosystem.

A workshop on exploring Multilingual Counterspeech Generation is proposed to promote and encourage research on multilingual approaches for this challenging topic. Thus, this workshop aims to test monolingual and multilingual LLMs in particular, and Language Technology in general, on automatically generating counterspeech not only in English but also in languages with fewer resources. In this sense, an important goal of the workshop will be to understand the impact of using LLMs, considering for example how to deal with pressing issues such as biases, hallucinated content, data scarcity or data contamination.

We seek to maximize the scientific and social impact of this workshop by promoting the creation of a community of researchers from diverse fields, such as computer and social sciences, as well as policy makers and other stakeholders interested in automatic counterspeech generation. By doing so we aim to gain a deeper understanding of how counterspeech is currently used to tackle abuse by individuals, activists, and organizations and how Natural Language Processing (NLP) and Generation (NLG) may be best applied to counteract it.
Call for Papers
---------------------
We welcome submissions on topics including, but not limited to, the following:
- Models and methods for generating counterspeech in different languages.
- Automatic Counterspeech generation for low resource languages with scarce training data.
- Dialogue agents that use counterspeech to combat offensive messages that are directed to individuals or groups, targeted based on various aspects such as ideology, gender, sexual orientation and religion.
- Methods for human and automatic evaluation of counterspeech.
- Multidisciplinary studies providing different perspectives on the topic such as computer science, social science, psychology, etc.
- Development of taxonomies and quality datasets for counterspeech in multiple languages.
- Potentials and limitations (e.g., fairness, biases, hallucinated content) of applying different NLP methods, such as LLMs, to generate counterspeech.
- Social impact and empirical studies of counterspeech in social networks, including research on the effectiveness and consequences for users of using counterspeech to combat hate online.

Submission
---------------------
We welcome two types of papers: regular workshop papers and non-archival submissions. Regular workshop papers will be included in the workshop proceedings. All submissions must be in PDF format and made through START [https://softconf.com/coling2025/MCG25/]
- Regular workshop papers: Authors can submit papers up to 8 pages, with unlimited pages for references. Authors may submit up to 100 MB of supplementary materials separately and their code for reproducibility. All submissions undergo a double-blind single-track review. Accepted papers will be presented as posters with the possibility of oral presentations.
- Non-archival submissions: Cross-submissions are welcome. Accepted papers will be presented at the workshop, but will not be included in the workshop proceedings.
Papers must be in PDF format and will be reviewed in a double-blind fashion by workshop reviewers. We also welcome extended abstracts (up to 2 pages) of papers that are work in progress, under review or to be submitted to other venues. Papers in this category need to follow the COLING format.

Important Dates
---------------------
- Submission: November 20th, 2024
- Notification of Acceptance: December 2nd, 2024
- Camera-Ready Papers Due: December 10th, 2024

-----------------------------------------------------
Shared Task on Multilingual Counterspeech Generation
-----------------------------------------------------
*TRAIN DATA RELEASED!*

In addition to paper contributions, we are organizing a shared task on multilingual counterspeech generation with the aim of bringing together current efforts in a central space, especially those for languages other than English. It is envisaged that the shared task will allow the community to study how to improve counterspeech generation for lower-resource languages and also to reinforce the strong body of research already existing for English. The counterspeech generated by participants should be respectful, non-offensive, and contain information that is specific and truthful with respect to the following targets: Jews, LGBT+, immigrants, people of color, women.

Data
---------------------
We release new data consisting of 596 Hate Speech-Counter Narrative (HS-CN) pairs. In this dataset, the HS are taken from MTCONAN [https://github.com/marcoguerini/CONAN/tree/master/Multitarget-CONAN], while the CN are newly generated. Together with each HS-CN pair, we also provide 5 background knowledge sentences, some of which are relevant for obtaining the Counter Narratives. The dataset is available in 4 different languages (Basque, English, Italian and Spanish) and divided into the following splits:
- Development: 100 pairs. [AVAILABLE NOW!] [https://huggingface.co/datasets/LanD-FBK/ML_MTCONAN_KN]
- Train: 396 pairs [AVAILABLE NOW!] [https://huggingface.co/datasets/LanD-FBK/ML_MTCONAN_KN]
- Test: 100 pairs [AVAILABLE ON 21st OCTOBER]

In order to score the shared task participants, the CNs of the test split will be kept hidden during the shared task, while the HS and the background knowledge will be released for participants to prepare their submissions.

The languages, Basque, English, Italian and Spanish, offer a varied spectrum of complexity including an agglutinative language isolate (Basque), two Romance languages (Italian, Spanish) and a Germanic one (English). The choice of languages reflects the linguistic expertise available among the organizers, which is required to run the shared task successfully.

Additional Multilingual Data
---------------------
Participants also have available the English, Italian, Basque and Spanish CONAN manually curated data:
- CONAN (English and Italian) [https://github.com/marcoguerini/CONAN/tree/master/CONAN]
- CONAN (Basque and Spanish) [https://huggingface.co/datasets/HiTZ/CONAN-EUS]
- CONAN-MT-SP (Spanish) [https://github.com/sinai-uja/CONAN-MT-SP] or [https://huggingface.co/datasets/SINAI/CONAN-MT-SP]
- Multitarget-CONAN (English) [https://github.com/marcoguerini/CONAN/tree/master/Multitarget-CONAN]

Participation
---------------------
The aim of the shared task is, given a HS (and optionally any additional knowledge the participants may like to use), to generate a CN to counteract the HS. An example CN in each language appears below:
- English:
  + HS: We should ban all homosexuals.
  + CN: When will the love prosper and the hatred start to dissipate? I will not only respect my fellow LGBT+ people, I will promote their rights.
- Italian:
  + HS: Dovremmo espellere tutti gli omosessuali.
  + CN: Quand'è che l'amore prospererà e l'odio inizierà a dissiparsi? Non solo rispetterò i miei compagni LGBT+, ma promuoverò i loro diritti.
- Spanish:
  + HS: Deberíamos prohibir a todos los homosexuales.
  + CN: ¿Cuándo prosperará el amor y empezará a disiparse el odio? No sólo respetaré a mis compañeros LGBT+, sino que promoveré sus derechos.
- Basque:
  + HS: Homosexual guztiak debekatu beharko genituzke.
  + CN: Noiz hasiko da maitasuna irabazten eta gorrotoa desagertzen? LGBT+ pertsonak errespetatzeaz gain, haien eskubideak sustatuko ditut.

Participants will download the test HS for the 4 languages and generate at most three different CNs per HS for each language. The test window will last 5 days. Participants are allowed to use any resource (language model, data, etc.) to generate the CN.

**Note:** If you are going to participate in the shared task, please fill in the following form: [https://docs.google.com/forms/d/e/1FAIpQLSeAZTJsrEXt35HfFFchPNdPi289q5kKerq…]

Evaluation
---------------------
The CNs submitted by the participants will be evaluated:
- Using traditional automatic metrics as in Tekiroğlu et al. (2022), which include BLEU, ROUGE, Novelty and Repetition Rate.
- Using an LLM as a judge, following the approach described in this paper: https://arxiv.org/abs/2406.15227

Important Dates
---------------------
- Test dataset release: October 21st, 2024
- Results submission: October 25th, 2024
- Results notification: November 10th, 2024
- Working papers submission: November 20th, 2024
- Notification of Acceptance: December 3rd, 2024
- Camera-Ready Papers Due: December 10th, 2024
- Workshop: January 19th, 2025

For more information you can join the Google group [https://groups.google.com/g/multilingual-cs-generation-coling2025] or visit our website [https://sites.google.com/view/multilang-counterspeech-gen/home] [https://sites.google.com/view/multilang-counterspeech-gen/shared-task]

Best regards,
The Multilingual Counterspeech Generation Workshop Organizers.
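
As a practical note on the participation rule in the call above (at most three different CNs per HS for each of the four languages), a submission could be sanity-checked before upload with a small script. The sketch below is only illustrative: the tuple layout, the language codes and the function name `check_submission` are assumptions made for this example, not the organizers' official submission format.

```python
from collections import defaultdict

# The four task languages are from the call; the codes used here are an assumption.
LANGUAGES = {"eu", "en", "it", "es"}
MAX_CNS_PER_HS = 3  # "at most three different CNs per HS for each language"


def check_submission(rows):
    """Return a list of problems found in a hypothetical submission.

    `rows` is an iterable of (language, hs_id, cn_text) tuples. This layout
    is illustrative only and is not the official submission format.
    """
    cns = defaultdict(list)
    problems = []
    for lang, hs_id, cn in rows:
        if lang not in LANGUAGES:
            problems.append(f"unknown language: {lang}")
            continue
        cns[(lang, hs_id)].append(cn.strip())

    for (lang, hs_id), texts in sorted(cns.items()):
        if len(texts) > MAX_CNS_PER_HS:
            problems.append(
                f"{lang}/{hs_id}: {len(texts)} CNs, limit is {MAX_CNS_PER_HS}"
            )
        if len(set(texts)) < len(texts):
            # The call asks for *different* CNs per HS.
            problems.append(f"{lang}/{hs_id}: duplicate CNs")
        if any(not t for t in texts):
            problems.append(f"{lang}/{hs_id}: empty CN")
    return problems
```

An empty list means no problem was detected; for instance, two distinct English CNs for the same HS pass the check, while a fourth CN for one HS is reported as exceeding the limit.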


Job offer: Researcher in Speech and Natural Language Processing (Junior and Senior Positions)
by Arantza Del Pozo Echezarreta 03 Oct '24


*** Apologies for cross-postings ***

*Researcher positions in Speech and Natural Language Processing (Junior & Senior Positions) @ Vicomtech, San Sebastian/Bilbao, Spain*

Vicomtech (https://www.vicomtech.org/en/), an international applied research centre specialised in Artificial Intelligence, Visual Computing and Interaction located in Spain, has several research positions in the field of speech and natural language processing.

We are seeking talented and motivated individuals to join our dynamic Speech and Natural Language Technologies team in either our Donostia - San Sebastián or Bilbao premises. If you have experience in speech and/or natural language processing technologies and are passionate about applying cutting-edge research to solve real-world needs through advanced prototypes, this opportunity is for you!

Whether you are a junior researcher (BSc/MSc graduate) looking to kickstart your career or a senior researcher (PhD graduate) eager to take on research leadership roles, we are interested in your profile. We offer the perfect environment with outstanding equipment and the best human team for growth. You will participate in advanced research and development projects, with opportunities to manage high-profile projects and/or lead technical teams depending on your experience.

*Key Responsibilities:*
- Conduct cutting-edge research in Speech and Natural Language Processing (NLP) technologies such as automatic speech recognition and synthesis, audio deep fake detection, information extraction, machine translation, text simplification and dialogue systems, among others.
- Contribute to national and international research projects.
- Develop advanced prototypes that transfer technology to businesses and institutions.
- Manage or lead research projects, depending on experience.

*Requirements:*
- Bachelor’s or Master’s degree in Computer Science, Telecommunications Engineering or related fields.
- For senior profiles, a PhD in Speech Processing, NLP, AI or related disciplines is preferred. A PhD is not required for junior candidates.
- Strong programming skills (Python, Bash).
- Fluency in both spoken and written Spanish and English.

*Preferred Skills (Not Required but Valued):*
- Experience with speech and natural language processing tools and libraries (e.g. Kaldi, Whisper, Marian NMT, HuggingFace Transformers, Rasa, etc.).
- Deep learning frameworks (PyTorch, TensorFlow, ONNX).
- Virtualization technologies (Docker, Kubernetes).
- Experience in industrial and/or European research projects.

*What We Offer:*
- A vibrant, innovative research environment with state-of-the-art AI, Visual Computing, and Interaction technologies.
- Exciting national and international research projects.
- A multidisciplinary and renowned team in Speech and Language Technologies.
- Creative freedom in research, aligned with the centre’s goals.
- Opportunities for personal development through continuous learning.
- Clear career progression paths and leadership opportunities.
- Work-life balance policies and a commitment to equal employment opportunities.

If you are passionate about research and eager to develop your expertise and apply it to real-world challenges, we encourage you to send us your CV and join our forward-thinking team!

To apply via LinkedIn: https://www.linkedin.com/jobs/view/4034768411

--
Dr. Arantza del Pozo Echezarreta
Director of Speech and Natural Language Technologies
Vicomtech <https://www.vicomtech.org>
adelpozo(a)vicomtech.org
+(34) 943 30 92 30
+(34) 619 910 422

