August 2024 - Corpora

New Named Entity Corpus for Occupational Substance Exposure Assessment
by Paul Thompson 22 Aug '24


We are pleased to announce the release of a new annotated corpus, consisting of selected sections (i.e., Abstract, Methods and Results) of scientific research articles concerning occupational exposures to two different types of substance, i.e., diesel exhaust (51 articles) and respirable crystalline silica (50 articles). The article sections have been annotated by experts in the field with 6 categories of named entities relevant to the assessment of occupational substance exposures, particularly in the context of Job Exposure Matrices.

The corpus and associated annotation guidelines may be downloaded from: https://zenodo.org/records/11164271

NER models and associated code are available at: https://github.com/panagiotis-geo/Substance_Exposure_NER/

The development of the corpus and the associated NER models is described in more detail in the following article:

Thompson, P., Ananiadou, S., Basinas, I., Brinchmann, B. C., Cramer, C., Galea, K. S., Ge, C., Georgiadis, P., Kirkeleit, J., Kuijpers, E., Nguyen, N., Nuñez, R., Schlünssen, V., Stokholm, Z. A., Taher, E. A., Tinnerberg, H., Van Tongeren, M. and Xie, Q. (2024). Supporting the working life exposome: annotating occupational exposure for enhanced literature search. PLoS ONE 19(8): e0307844. https://doi.org/10.1371/journal.pone.0307844

Abstract
--------

An individual’s likelihood of developing non-communicable diseases is often influenced by the types, intensities and duration of exposures at work. Job exposure matrices provide exposure estimates associated with different occupations. However, due to their time-consuming expert curation process, job exposure matrices currently cover only a subset of possible workplace exposures and may not be regularly updated.
Scientific literature articles describing exposure studies provide important supporting evidence for developing and updating job exposure matrices, since they report on exposures in a variety of occupational scenarios. However, the constant growth of scientific literature is increasing the challenges of efficiently identifying relevant articles and important content within them. Natural language processing methods emulate the human process of reading and understanding texts, but in a fraction of the time. Such methods can increase the efficiency of both finding relevant documents and pinpointing specific information within them, which could streamline the process of developing and updating job exposure matrices. Named entity recognition is a fundamental natural language processing method for language understanding, which automatically identifies mentions of domain-specific concepts (named entities) in documents, e.g., exposures, occupations and job tasks. State-of-the-art machine learning models typically use evidence from an annotated corpus, i.e., a set of documents in which named entities are manually marked up (annotated) by experts, to learn how to detect named entities automatically in new documents. We have developed a novel annotated corpus of scientific articles to support machine learning based named entity recognition relevant to occupational substance exposures. Through incremental refinements to the annotation process, we demonstrate that expert annotators can attain high levels of agreement, and that the corpus can be used to train high-performance named entity recognition models. The corpus thus constitutes an important foundation for the wider development of natural language processing tools to support the study of occupational exposures. 
--
Paul Thompson
Research Fellow
Department of Computer Science
National Centre for Text Mining
Manchester Institute of Biotechnology
University of Manchester
131 Princess Street
Manchester M1 7DN
UK
Tel: 0161 306 3091
http://personalpages.manchester.ac.uk/staff/Paul.Thompson/


Academic positions in the University of Cambridge
by alk23＠cam.ac.uk 22 Aug '24


The Centre for Human-Inspired Artificial Intelligence (CHIA) at the University of Cambridge is looking to appoint new academic staff: two Assistant Professors and one Teaching Associate in Human-Inspired Artificial Intelligence.

Successful candidates will have core technical expertise in machine learning, human-computer interaction, computer vision, speech and/or natural language processing, robotics, mobile systems or another area of human-centred or human-facing AI. In addition to excellence in their area, we emphasise a proven ability to work collaboratively with other disciplines that are central to the development of human-inspired AI. We are also looking for commitment to CHIA's mission to develop AI that can contribute to social and global progress.

For full details and how to apply for these positions, see:

Two Assistant Professors
https://www.jobs.cam.ac.uk/job/47407/
Deadline: 1 September 2024

One Teaching Associate
https://www.jobs.cam.ac.uk/job/47788/
Deadline: 30 September 2024


Call for participation: The 34th Meeting on Computational Linguistics in the Netherlands, Leiden University
by Leiden University - CLIN34 22 Aug '24


Dear colleague,

The 34th Meeting of Computational Linguistics in The Netherlands (CLIN34) will take place soon, on Friday 30 August 2024! We cordially invite you to participate. Online registration <https://clin34.leidenuniv.nl/2024/07/05/registration-open/> ends on Wednesday (21st of August).

Besides a large and diverse programme of posters and oral presentations, we are happy to report that CLIN34 will have two keynote talks by:

* Diana Maynard, Sheffield University
* Dominique Blok and Erik de Graaf, TNO

The programme can also be found at: https://clin34.leidenuniv.nl/program/

We hope to see you in Leiden about two weeks from now!

The CLIN34 organizers
Leiden University


Student contest: 2024 Babel Young Writers' Competition
by Tristan Miller 22 Aug '24


Dear all,

Below is a call for submissions to our annual contest for student writers. Contributions from undergraduate students of computational linguistics are most welcome.

Sincerely,
Tristan Miller
Babel Advisory Panel

----------------------

This year, Babel: The Language Magazine <https://babelzine.co.uk/> will be running the tenth edition of our Young Writers' Competition, which encourages young linguists who are starting out on their study of language. The competition is open to anyone studying a linguistics-related subject at the 16–18-year-old or undergraduate level. The winner(s) will have their article published in Babel's 50th issue (Spring 2025) and receive a year's subscription to the magazine.

Keep an eye on @Babelzine on X or @babel_zine on Instagram for inspiration from previous winners on topics ranging from sign language to spoonerisms, and from language birth to language death.

Competition rules are as follows:

Topic: An original discussion of any linguistic topic, written in an accessible and interesting style
Length: 2000 to 2500 words
Deadline: Monday, 16 December 2024
Format: Word file
Submission: By e-mail to babelthelanguagemagazine(a)gmail.com with the subject "Young Writer's Competition"

Please e-mail babelthelanguagemagazine(a)gmail.com if you have any questions about the competition.

--
Dr. Tristan Miller, Assistant Professor
Department of Computer Science, University of Manitoba
https://clam.logological.org/ | Tel. +1 204 474 6792


Lecturer/Associate Professor in Natural Language Processing at University College London (UCL)
by Yilmaz, Emine 22 Aug '24


University College London (UCL) Department of Computer Science invites applications for a Lecturer/Associate Professor position in Natural Language Processing. Interested applicants can submit their applications until September 5th using this link<https://www.ucl.ac.uk/work-at-ucl/search-ucl-jobs/details?jobId=25979&jobTi…>.

About UCL

UCL’s Department of Computer Science (CS) is a top-ranked Computer Science Department in the UK. In the 2021 Research Excellence Framework (REF) evaluation, UCL Computer Science was ranked second in the UK for research power and first in England. London is a global hub for AI, where UCL plays a central role through close collaborations and joint PhD programmes with, for example, Meta and Google DeepMind.

About the role

University College London, Department of Computer Science is seeking a Lecturer (equivalent of Assistant Professor in the UK)/Associate Professor to join the Natural Language Processing Group. Successful candidates are expected to contribute to the teaching and research activities at the department. Expected duties and responsibilities include conducting research in the broader field of natural language processing, securing funding and engagement in the management of research projects, and dissemination of research through publications at top conferences/journals, talks and external engagements.

About you

Candidates should have a PhD (or equivalent qualification) or have held a previous postdoctoral position in natural language processing, information retrieval, machine learning, or a strongly related field. Candidates are expected to have a strong publication record in top conferences such as ACL, ICLR, NeurIPS, EMNLP, SIGIR. Experience in applying for research funding is not necessary, but highly desired. We also welcome applications from candidates with research experience from industry.

Please contact Emine Yilmaz (emine.yilmaz(a)ucl.ac.uk<mailto:emine.yilmaz@ucl.ac.uk>) if you need any further information.


Release: Qabas - an Open-Source Lexicographic Database
by Mustafa Jarrar 22 Aug '24


We are very happy to release Qabas - an Open-Source Lexicographic Database.

Birzeit University’s SinaLab for Computational Linguistics and Artificial Intelligence <https://sina.birzeit.edu/> has officially launched Qabas <https://sina.birzeit.edu/qabas>, an open-source lexicographic database for Arabic, designed specifically for Natural Language Processing (NLP) applications. Qabas stands out by linking its lexical entries (lemmas) with lemmas from 110 different lexicons and numerous morphologically annotated corpora (around 2 million tokens), creating an extensive lexicographic graph. This project has been under development for over fourteen years.

Lexicons have evolved from being primarily hard-copy resources for human use to having substantial significance in NLP applications. Although Arabic is a highly resourced language in terms of traditional lexicons, not enough attention is given to developing AI-oriented lexicographic databases. Additionally, none of the Arabic lexicons are available open-source, due to copyright restrictions imposed by their owners.

As for Qabas, it is an open-source Arabic lexicon designed for NLP applications, and its novelty lies in its synthesis of many lexical resources. Each lexical entry (i.e., lemma) in Qabas is linked with equivalent lemmas in 110 other lexicons, and with 12 morphologically-annotated corpora (about 2M tokens). The philosophy of Qabas is to construct a large lexicographic data graph by linking existing Arabic lexicons and annotated corpora. Qabas stands as the largest Arabic lexicon, encompassing about 58K lemmas (45K nominal lemmas, 12.5K verbal lemmas, and 500 function word lemmas).

Prof. Mustafa Jarrar, the project’s manager and main author, emphasized the importance of making Qabas freely available as an open-source resource, allowing everyone to access and use it for both commercial and non-commercial purposes. Prof. Jarrar hopes that researchers, companies, and software developers will leverage the lexicon’s data to develop innovative content and applications that benefit humanity.

Prof. Talal Shahwan, President of Birzeit University, stated that despite the challenging conditions in Palestine, the university remains committed to excellence and to its mission towards knowledge. He emphasized that this achievement was made possible by the dedication of the university’s faculty and researchers.

Qabas is publicly available online at: https://sina.birzeit.edu/qabas
To download Qabas and find out more, see: https://sina.birzeit.edu/qabas/about
Article: https://www.jarrar.info/publications/JH24.pdf

We’d love your feedback:
Facebook: https://www.facebook.com/watch?v=880418097306662
LinkedIn: https://www.facebook.com/watch?v=880418097306662

Best
--Mustafa
__________________________
Mustafa Jarrar, PhD
Professor of Artificial Intelligence
Chair, PhD Program in Computer Science
Birzeit University, Palestine
Page: http://www.jarrar.info
SinaLab: https://sina.birzeit.edu


[CfP] Diversity and Change in Easy German @ DGfS 2025
by lena.wieland＠uni-saarland.de 22 Aug '24


CfP: Diversity and Change in Easy German (Workshop at DGfS 2025)

Date: March 5-7, 2025
Location: University of Mainz, Germany
Meeting Email: workshop-easy-german-dgfs2025(a)uni-saarland.de<mailto:workshop-easy-german-dgfs2025@uni-saarland.de>
Website: https://sfb1102.uni-saarland.de/vielfalt-und-wandel-in-leichter-sprache/
Linguistic Field(s): Applied Linguistics, Computational Linguistics, Psycholinguistics
Language Family: Germanic
Call Deadline: August 18, 2024

Shortened Workshop Description:

Easy German, which has been systematically developed since the 2000s to aid individuals with learning difficulties among others, focuses on enhancing text comprehensibility by avoiding linguistic complexity. Despite its intended uniformity, there is a lack of consensus on its precise conceptualization, with various frameworks and guidelines proposing different approaches.

This workshop aims to:

1. Provide a platform for researchers to discuss the production and evolution of Easy German texts.
2. Highlight dynamic changes and variability in Easy German texts compared to Standard German.
3. Examine the cognitive processing of Easy German through psycholinguistic studies involving the target demographic.
4. Critically assess AI-driven systems for Easy German text production, exploring their implications, opportunities, and challenges.

For further information, please visit the workshop website: https://sfb1102.uni-saarland.de/vielfalt-und-wandel-in-leichter-sprache/

Organizers:
Ingo Reich (Saarland University, Germany)
Heike Zinsmeister (University of Hamburg, Germany)
Sarah Jablotschkin (University of Hamburg, Germany)
Lena Wieland (Saarland University, Germany)

Invited Speakers:
Bettina Bock (University of Cologne)
Ted Sanders (Utrecht University)

Call for Papers:

We invite contributions on all aspects of Easy German and easy-to-read variants in other Germanic languages. The workshop will include a small poster session, and submissions for both talks and posters are welcome. Contributions in English are preferred, but submissions in German are also accepted.

Submission Details:
* Abstract submission deadline: August 18, 2024
* Abstracts should be submitted to workshop-easy-german-dgfs2025(a)uni-saarland.de<mailto:workshop-easy-german-dgfs2025@uni-saarland.de>
* Abstracts should not exceed one page (DIN A4, 2.5 cm margins, 12pt font)
* Examples, graphics, or references may be included on a second page

Important Workshop Information:

The workshop is part of the 47th annual meeting of the German Linguistic Society (DGfS 2025) at Johannes Gutenberg University Mainz. Participants must register for the DGfS conference and pay the conference fee. For more information, visit http://dgfs.uni-mainz.de<http://dgfs.uni-mainz.de/>.

Important Dates:
Deadline for abstract submission: August 18, 2024
Notification of acceptance: September 2, 2024
Workshop dates: March 5-7, 2025

--
Lena Wieland
SFB 1102, Project T1 – Information Density and Linguistic Encoding in “Leichte Sprache”
Universität des Saarlandes
Campus A2.2 Raum 3.12
D-66123 Saarbrücken
T: +49 681 302 57543
www.uni-saarland.de/fakultaet-p/nds/team/wieland<https://www.uni-saarland.de/fakultaet-p/nds/team/wieland.html>


Open Position in Human-centric ART(ificial Intelligence) at UNITOV
by znzfms00＠uniroma2.it 22 Aug '24


We are certain that YOU're the kind of researcher that wants to dive deep into Neural Networks (and, clearly, LLMs) to understand what happens inside. If this is the case, YOU are the person Human-centric ART is looking for.

Take your time to solve this >>> CHALLENGE<https://urldefense.com/v3/__https://www.linkedin.com/posts/fabio-massimo-za…> <<< and, meanwhile, apply to our Open Position >>> https://pica.cineca.it/uniroma2/f4-2024-0005/<https://urldefense.com/v3/__https://pica.cineca.it/uniroma2/f4-2024-0005/__…> <<<

To attract exactly YOU, we offer a competitive salary for a one-year position (34K€/annually with low taxation - Assegno di Ricerca IV Fascia) with possibility of renewal.

DEADLINE: August 31st

Drop me an email if you apply: fabio.massimo.zanzotto(a)uniroma2.it

Requirements:
- out-of-the-box thinking attitude
- an adequate publication track record in ML and/or NLP

Encouraged:
- Ph.D. in CS, AI, ML, or NLP

Research Group: Human-centric ART
Institution: University of Rome Tor Vergata
Location: Rome, Italy
Position: Research Assistant (Assegno di Ricerca IV Fascia)
Salary: 34K€/annually with low taxation
Duration: 1 year with the possibility of renewal
Application site: https://pica.cineca.it/uniroma2/f4-2024-0005/<https://urldefense.com/v3/__https://pica.cineca.it/uniroma2/f4-2024-0005/__…>

Check out our latest publications: Fabio Massimo Zanzotto - Google Scholar<https://urldefense.com/v3/__https://scholar.google.com/citations?hl=en&user…>


Call For Papers: RegNLP 2025 - Regulatory Natural Language Processing Workshop
by cengtubagokhan＠gmail.com 22 Aug '24


===Workshop Description=== The RegNLP 2025 Workshop will take place on January 20th, 2025, in conjunction with the COLING 2025 conference in Abu Dhabi, UAE. Regulatory documents are foundational to governance, compliance, and legal frameworks across various sectors. However, the sheer complexity, volume, and constantly evolving nature of these documents present significant challenges. To address these, the field of Natural Language Processing (NLP) is increasingly being harnessed to develop tools and methodologies that enable the effective management, analysis, and utilization of regulatory content. This workshop seeks to bring together researchers and practitioners from NLP, legal informatics, compliance, and related fields to discuss the latest advancements and challenges in regulatory NLP. The focus will be on innovative methods for document parsing, entity recognition, automated compliance checking, and other applications critical to navigating the intricate landscape of regulatory requirements. We will explore how NLP can be adapted to the specialized language and context of regulatory texts and how it can be employed to enhance the accuracy, efficiency, and reliability of regulatory processes. By fostering collaboration and knowledge exchange, RegNLP 2025 aims to build a community dedicated to advancing the application of NLP in the regulatory domain and to identify promising directions for future research. 
===Important Dates===

Paper Submission Deadline: November 5, 2024
Notification of Acceptance: December 3, 2024
Camera-Ready Papers Due: December 10, 2024
Workshop Date: January 20, 2025

===Submission Topics===

We invite submissions of original and high-quality research papers on topics related to the application of NLP in regulatory contexts, including but not limited to:

-Applications of NLP to Regulatory Tasks:
--Compliance monitoring and management
--Risk assessment and regulatory reporting
--Interpretation and classification of regulatory changes
--Summarization of regulatory documents for decision-making
--Creation of domain-specific lexical resources
-Adapting NLP Methods for Regulatory Data:
--Information retrieval and anomaly detection
--Clustering and multimodality analysis
--Entity recognition, linking, and disambiguation
--Syntax: Tagging, chunking, and parsing
--Dialogue and discourse analysis
--Text summarization and relation extraction
--Question answering using regulatory data
-Tasks and Resources:
--New regulatory tasks and datasets for NLP
--Evaluation frameworks for regulatory NLP tasks
-Demos:
--Systems and software solutions utilizing NLP for regulatory text processing
-Industrial Research:
--Case studies of industrial applications in regulatory compliance
--Research involving proprietary regulatory data
-Interdisciplinary Position Papers:
--The role of NLP in the regulatory landscape
--Reflections on the use of Large Language Models (LLMs) in regulatory contexts
--Legal and ethical considerations in regulatory data processing

===More Details===

For more information about the workshop, please visit our website: https://regnlp.github.io/

===Organization===

Workshop Chairs:
Tuba Gokhan - MBZUAI
Kexin Wang - UKP Lab, Technical University of Darmstadt
Iryna Gurevych - UKP Lab, Technical University of Darmstadt & MBZUAI
Ted Briscoe - MBZUAI

===Contact Information===

For inquiries, please contact us via email at: regnlp2025(a)gmail.com


CFP: 3rd Shared Task on Indian Language Summarization (ILSUM 2024)
by Parth Mehta 15 Aug '24


Apologies for the multiple postings.

-----------------------------
*Indian Language Summarization (ILSUM 2024)*
Website: https://ilsum.github.io/
To be organized in conjunction with FIRE 2024 (fire.irsi.org.in)
12th-15th December 2024, Gandhinagar, India
------------------------------

The third shared task on Indian Language Summarization (ILSUM) aims to extend the evaluation benchmark dataset for Indian Language Summarization. Three Dravidian languages (Kannada, Telugu and Tamil) are introduced this year. We also extend the misinformation detection subtask to a cross-lingual setup.

*Subtask 1*: This task builds upon the task from the first two editions. In the previous editions we covered three major Indian languages (Hindi, Gujarati and Bengali) alongside Indian English, a widely recognized dialect of the English language. This year's edition adds the three Dravidian languages Kannada, Tamil and Telugu, and an expanded dataset for the languages from last year. Like the previous edition, this will be a classic summarization task, where we will provide article-summary pairs for each language and the participants are expected to generate a fixed-length summary.

*Subtask 2*: The task is centred around identifying factual errors in machine-generated summaries. While LLMs are very good at summarization, among other NLP tasks, they are often prone to hallucinations. This means the model generates information that is not accurate, not based on its training data, or is completely made up but looks accurate and reliable. Further, such tools can be misused to generate misleading or outright incorrect information. Identifying such inaccuracies can be a challenging task. This year's subtask builds upon a similar task from the previous edition in a cross-lingual setup. Participants will be provided with an article in English and its corresponding machine-generated summary in Hindi and Gujarati. The objective is to identify any factual errors in the summaries and classify them into one of the predefined categories.

*Tentative Timeline*
-------------
15th August - Training Data Released and Registrations open
30th August - Test Data Release
30th September - Run Submission Deadline
10th October - Results Declared
20th October - Working notes due
20th November - Camera Ready Submissions due
12th-15th December - FIRE 2024 at Gandhinagar, India

*Organisers*
----------------
Shrey Satapara, Indian Institute of Technology, Hyderabad, India
Sandip Modha, LDRP-ITR, Gandhinagar, India
Shashirekha HL, Mangalore University, India
Asha Hegde, Mangalore University, India
Parth Mehta, Parmonic, USA
Debasis Ganguly, University of Glasgow, Scotland

For regular updates, subscribe to our mailing list: ilsum(a)googlegroups.com

