October 2024 - Corpora

2nd CfP: First workshop on Data Models, Citation, Access, and Re-usability impacting Historical Linguistic Datasets
by sil.linguist＠gmail.com 20 Oct '24


This workshop proposes to provide a forum to discuss the structures and models of information resources in historical-comparative linguistic research outputs through the integration of informatic models from library science and archivy. We want to address pertinent issues impacting the indexing (for citation) and interoperability of datasets (for sustainability).

WS Title: First workshop on Data Models, Citation, Access, and Re-usability impacting Historical Linguistic Datasets
Workshop at ICHL27, Santiago de Chile, 18-22 August 2025
Workshop Type: in-person
Organizers: Hugh Paterson III & Oksana Zavalina
Abstract Deadline: October 18th (EXTENDED: new deadline November 7th)
Abstract Details: up to 800 words excluding references.
Submission: Email PDF of abstracts to both i(a)hp3.me and oksana.zavalina(a)unt.edu with [ICHL27 w8] in the subject line.
Note: Workshops are in most cases restricted to 6 papers; all other papers, if accepted, will be given as part of the ICHL general sessions. Should there be sufficient interest for an extended workshop (up to 12 papers), we will lobby the local organizers to permit this format.
Workshop Website: https://hughandbecky.us/Hugh-CV/project/2025-ichl27-historical-linguistic-d…
Conference Website: https://ichl27santiago.cl
PDF of Workshop abstract: https://hughandbecky.us/Hugh-CV/project/2025-ichl27-historical-linguistic-d…
Publication: We are pursuing publication via edited volume post-workshop.

==Goal & Questions==
The role of library models (e.g., IFLA-LRM: Riva, Le Bœuf, and Žumer 2017) and archival practice (e.g., lifecycle management: Higgins 2012) is under-explored in relation to the construction and reuse of Historical Linguistic Information Sources. This workshop proposes to provide a forum to discuss the structures and models of information resources in historical-comparative linguistic research outputs through the integration of informatic models from library science and archivy.
We invite papers describing the information models used for assembling large corpora (including wordlists) used in historical linguistics, highlighting assumptions for citation, referencing, segmentation, and reusability of the assembled collection of texts and their digital surrogates. We encourage papers which present typologies of use cases, categories of tracked information, provenance of data content, citability of aggregate content, and the identifiers-for and permanence-of user-generated datasets on research platforms. What are the design patterns within datasets? What are the categories used, and what are their scopes? What are the kinds of objects subsumed into datasets?

==Background==
Significant advances have been made in historical linguistics through the use of large compiled datasets (e.g., Kamholz et al. 2024; Tresoldi 2023; Arora et al. 2023; Dellert et al. 2020; Greenhill 2015; Segerer and Flavier 2013; Mielke 2008; Greenhill, Blust, and Gray 2008). While not precluding the contributions of single historical manuscripts and traditional manuscript consultation methods, the use and creation of datasets (including corpora) have become the de facto way of generating new hypotheses (Wichmann and Saunders 2007; Steiner, Cysouw, and Stadler 2011; Segerer 2015). Datasets in historical linguistics generally do two things: (1) record critical researcher-created information such as reconstructed forms, cognacy judgments, and confidence levels, along with contextual notes; and (2) contain foundational content from sources not created by the dataset compiler. Such source material often includes historically published and unpublished resources, including maps (Hessle and Kirk 2020), language-specific lexicons and published reconstructions (Kamholz et al. 2024), wordlists (Forkel et al. 2024; Segerer and Flavier 2013), transcriptions of manuscripts and texts (Weber et al. 2023; Genee and Junker 2018; Kytö 2011), and even reconstructions by other scholars.
Interactional platform-tools such as RefLex (Segerer and Flavier 2013) or OUTOFPAPUA (Kamholz et al. 2024) allow users to create custom datasets based on specific selected resources available to the platform. They do this without requiring users to interact with the complete set of underlying resources, and/or they allow users to create new derivative aggregate collections (reconstructed forms and cognacy relations) independently of other platform users. Citing, referencing, and redistributing these custom datasets is challenging and impacts the verifiability of claims. It is broadly accepted across linguistic research that scholarly work, including evidence, should be citable, accessible, and reusable (Bird and Simons 2003). Together these issues impact reproducibility, an important tenet in scholarship often overlooked in linguistics (Berez-Kroeker et al. 2018). However, it is also well acknowledged that the citation and reference of original source material for linguistic evidence is lacking across the field (Gawne et al. 2017). More specifically in historical-comparative linguistics, the context of citation and referencing of the evidentiary record, along with current dataset assemblage and distribution practices, generally does not support fine-grained or Work-oriented citation and referencing. This often means that specific and necessary details in comparative linguistics are not retrievable. Therefore, the data models embedded within historical-comparative datasets become all the more important for the reproducibility of work and the testing, verification, and refinement of hypotheses (Bakro-Nagy 2010).

With the exception of leading work around the use of Cross-Linguistic Data Formats (CLDF) with historical-comparative data (Forkel et al. 2018; Forkel, Swanson, and Moran 2024) and approaches using linked data in linguistics (Kesäniemi et al. 2018; Tittel, Gillis-Webber, and Nannini 2020), the literature has been silent about the storage formats for historical-comparative data. The information categories represented in historical-comparative linguistic datasets also remain undiscussed. The informatic arrangement and description of compiled datasets has generally been ad hoc and served the needs of individually funded projects. This has resulted in a proliferation of divergent data categories militating against ease of reuse.

We set out to ignite discussion around compilations of manuscripts, wordlists, and other derivative resources which have become mainstream tools in hypothesis generation related to language evolution. We explore the heretofore unapproached contribution that models such as Work-Expression-Manifestation-Item (WEMI), illustrated in figure 1, from library and information science (Coyle 2023; Riva, Le Bœuf, and Žumer 2017; IFLA 1998) can offer those who compile and cite or reference aggregate linguistic resources, specifically by clarifying linking relationships between the literature and datasets, including dataset portions.

Figure 1 is available on the workshop website and in the PDF version of the abstract.

==References==
Arora, Aryaman, Adam Farris, Samopriya Basu, and Suresh Kolichala. 2023. “Jambu: A Historical Linguistic Database for South Asian Languages.” arXiv. https://doi.org/10.48550/arXiv.2306.02514.
Bakro-Nagy, Marianne. 2010.
“Data in Historical Linguistics: On Utterances, Sources, and Reliability.” Sprachtheorie Und Germanistische Linguistik 20 (2): 133–195, January. https://www.academia.edu/3629841/Data_in_historical_linguistics_On_utteranc….
Berez-Kroeker, Andrea L., Lauren Gawne, Susan Smythe Kung, Barbara F. Kelly, Tyler Heston, Gary Holton, Peter Pulsifer, et al. 2018. “Reproducible Research in Linguistics: A Position Statement on Data Citation and Attribution in Our Field.” Linguistics 56 (1): 1–18. https://doi.org/10.1515/ling-2017-0032.
Bird, Steven, and Gary F. Simons. 2003. “Seven Dimensions of Portability for Language Documentation and Description.” Language 79 (3): 557–82. https://doi.org/10.1353/lan.2003.0149.
Coyle, Karen. 2023. “openWEMI.” In Proceedings of the International Conference on Dublin Core and Metadata Applications. Dublin, Ohio: Dublin Core Metadata Initiative. https://doi.org/10.23106/DCMI.953115290.
Dellert, Johannes, Thora Daneyko, Alla Münch, Alina Ladygina, Armin Buch, Natalie Clarius, Ilja Grigorjew, et al. 2020. “NorthEuraLex: A Wide-Coverage Lexical Database of Northern Eurasia.” Language Resources and Evaluation 54 (1): 273–301. https://doi.org/10.1007/s10579-019-09480-6.
Forkel, Robert, Johann-Mattis List, Simon J. Greenhill, Christoph Rzymski, Sebastian Bank, Michael Cysouw, Harald Hammarström, Martin Haspelmath, Gereon A. Kaiping, and Russell D. Gray. 2018. “Cross-Linguistic Data Formats, Advancing Data Sharing and Re-Use in Comparative Linguistics.” Scientific Data 5 (1): 180205. https://doi.org/10.1038/sdata.2018.205.
Forkel, Robert, Johann-Mattis List, Christoph Rzymski, and Guillaume Segerer. 2024. “Linguistic Survey of India and Polyglotta Africana: Two Retrostandardized Digital Editions of Large Historical Collections of Multilingual Wordlists.” In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), edited by Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, 10578–83. Torino, Italia: ELRA and ICCL. https://aclanthology.org/2024.lrec-main.925.
Forkel, Robert, Daniel G. Swanson, and Steven Moran. 2024. “Converting Legacy Data to CLDF: A FAIR Exit Strategy for Linguistic Web Apps.” In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), edited by Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, 3978–82. Torino, Italia: ELRA and ICCL. https://aclanthology.org/2024.lrec-main.353.
Gawne, Lauren, Barbara F. Kelly, Andrea L. Berez-Kroeker, and Tyler Heston. 2017. “Putting Practice into Words: The State of Data and Methods Transparency in Grammatical Descriptions.” Language Documentation & Conservation 11:157–89. http://hdl.handle.net/10125/24731.
Genee, Inge, and Marie-Odile Junker. 2018. “The Blackfoot Language Resources and Digital Dictionary Project: Creating Integrated Web Resources for Language Documentation and Revitalization.” Language Documentation & Conservation 12:274–314. http://hdl.handle.net/10125/24770.
Greenhill, Simon J. 2015. “TransNewGuinea.Org: An Online Database of New Guinea Languages.” PLOS ONE 10 (10): e0141563. https://doi.org/10.1371/journal.pone.0141563.
Greenhill, Simon J., Robert Blust, and Russell D. Gray. 2008. “The Austronesian Basic Vocabulary Database: From Bioinformatics to Lexomics.” Evolutionary Bioinformatics 4 (January): EBO.S893. https://doi.org/10.4137/EBO.S893.
Hessle, Christian, and John Kirk. 2020. “Digitising Collections of Historical Linguistic Data: The Example of The Linguistic Atlas of Scotland.” Journal of Data Mining & Digital Humanities, Special issue on Visualisations in Historical Linguistics. https://doi.org/10.46298/jdmdh.5611.
Higgins, Sarah. 2012. “The Lifecycle of Data Management.” In Managing Research Data, edited by Graham Pryor, 17–46. London, UK: Facet Publishing.
IFLA Study Group on the Functional Requirements for Bibliographic Records and Plassard, Marie-France. 1998. “Functional Requirements for Bibliographic Records: Final Report.” 2nd ed. [UBCIM Publications, New Series] IFLA Series on Bibliographic Control 19. Munich, Germany: K.G. Saur. http://www.ifla.org/VII/s13/frbr.
Kamholz, David, Anne van Schie, Allahverdi Verdizade, Maria Zielenbach, and Antoinette Schapper. 2024. “OUTOFPAPUA.” Database. 2024. https://outofpapua.com.
Kesäniemi, Joonas, Turo Vartiainen, Tanja Säily, and Terttu Nevalainen. 2018. “Exploring Meta-Analysis for Historical Corpus Linguistics Based on Linked Data.” Journal of Research Design and Statistics in Linguistics and Communication Science 5 (1–2): 4–47. https://doi.org/10.1558/jrds.36709.
Kytö, Merja. 2011. “Corpora and Historical Linguistics.” Revista Brasileira de Linguística Aplicada 11 (2): 417–57. https://doi.org/10.1590/S1984-63982011000200007.
Mielke, Jeff. 2008. The Emergence of Distinctive Features. Oxford, England: Oxford University Press.
Riva, Pat, Patrick Le Bœuf, and Maja Žumer, eds. 2017. IFLA Library Reference Model: A Conceptual Model for Bibliographic Information. December 2017. Den Haag, Netherlands: International Federation of Library Associations and Institutions (IFLA). https://www.ifla.org/publications/node/11412.
Segerer, Guillaume. 2015. “How Databases Shape Research: Labial-Velars Distribution in Africa.” In 8th World Congress of African Linguistics (WOCAL8). Kyoto, Japan. https://inria.hal.science/halshs-01251122.
Segerer, Guillaume, and Sébastien Flavier. 2013. “The RefLex Project: Documenting and Exploring Lexical Resources in Africa.” Oral presentation at Research, Records and Responsibility: Ten Years of the Pacific and Regional Archive for Digital Sources in Endangered Cultures, Sydney, Australia. http://hdl.handle.net/2123/9854.
Steiner, Lydia, Michael Cysouw, and Peter Stadler. 2011. “A Pipeline for Computational Historical Linguistics,” January. https://doi.org/10.1163/221058211X570358.
Tittel, Sabine, Frances Gillis-Webber, and Alessandro A. Nannini. 2020. “Towards an Ontology Based on Hallig-Wartburg’s Begriffssystem for Historical Linguistic Linked Data.” In Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020), edited by Maxim Ionov, John P. McCrae, Christian Chiarcos, Thierry Declerck, Julia Bosque-Gil, and Jorge Gracia, 1–10. Marseille, France: European Language Resources Association. https://aclanthology.org/2020.ldl-1.1.
Tresoldi, Tiago. 2023. “A Global Lexical Database (GLED) for Computational Historical Linguistics.” Journal of Open Humanities Data 9 (1): Article 2. https://doi.org/10.5334/johd.96.
Weber, Natalie, Tyler Brown, Joshua Celli, McKenzie Denham, Hailey Dykstra, Rodrigo Hernandez-Merlin, Evan Hochstein, et al. 2023. “Blackfoot Words: A Database of Blackfoot Lexical Forms.” Language Resources and Evaluation 57 (3): 1207–62. https://doi.org/10.1007/s10579-022-09631-2.
Wichmann, Søren, and Arpiar Saunders. 2007. “How to Use Typological Databases in Historical Linguistic Research.” Diachronica 24 (2): 373–404. https://doi.org/10.1075/dia.24.2.06wic.


50th Anniversary Public Lecture Series: The shared anti-science discourses
by Hardaker, Claire (hardakec) 19 Oct '24


With the usual apologies for any cross-posting:

In the Department of Linguistics and English Language at Lancaster University, we’re celebrating our 50th anniversary in 2024. In a series of public lectures, we will showcase our recent research in different areas of linguistics. Everyone is welcome!

Our next talk should be of particular interest to corpus linguists, forensic linguists, and researchers broadly interested in the security and protection sciences:

The shared anti-science discourses, by Dr Isobelle Clarke<https://www.lancaster.ac.uk/linguistics/about/people/isobelle-clarke>

Anti-science discourse has been studied through the optic of particular governments (Carter et al., 2019) or specific topics, such as anti-vaccination (Davis, 2019), anti-genetically modified organisms (Cook et al., 2004), stem cell research (Marcon, Murdoch and Caulfield, 2017), and climate denial discourse (Park, 2015). This research often details the development and content of the anti-science position and discourses. Yet little is known about how the discourses compare across topics. Are there anti-science discourses that are shared across topics, or does the discourse vary with the topic? In this talk, I will present results on the common discourses shared between texts from websites known to promote pseudoscience and conspiracy on the topics of stem cells, climate change, vaccination and genetically modified organisms.

For details of this talk and the upcoming programme, please visit our 50th anniversary website<https://www.lancaster.ac.uk/linguistics/50th-anniversary/>.

Register
Register here<https://www.trybooking.com/uk/events/landing/68278> to attend.

Date and time
6:00 PM - 8:00 PM (UTC+01), Thursday 24th October 2024

Location
Faraday Lecture Theatre<https://use.mazemap.com/#v=1&config=lancaster&campusid=341&zlevel=1&center=…>, Lancaster University, Lancaster, LA1 4YW

Getting to the University
More information on ways to get to the university can be found here<https://www.lancaster.ac.uk/about-us/maps-and-travel/>.

If you have any questions, please don’t hesitate to ask.

Regards
Claire

Professor Claire Hardaker (she/her)
Professor of Forensic Linguistics
Director of the MSc in Forensic Linguistics & Speech Science<https://www.lancaster.ac.uk/study/postgraduate/postgraduate-courses/forensi…>
Lancaster University, LA1 4YL


Call for Papers: The First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages (IndoNLP2025@COLING)
by Nanomi Arachchige, Isuri (Postgraduate Researcher) 19 Oct '24


Call for papers for the COLING-2025 workshop: The First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages (IndoNLP2025)

Date: 20th January 2025 (full day)
Venue: Abu Dhabi, UAE
Webpage: https://indonlp-workshop.github.io/IndoNLP-Workshop/
Submission Deadline: 5 November 2024
Submission Portal: https://softconf.com/coling2025/IndoNLP25/

Workshop description

The rapid advancement of Natural Language Processing (NLP) and Large Language Models (LLMs) has transformed the landscape of computational linguistics. However, Indo-Aryan and Dravidian Languages (IADL), which represent a significant portion of South Asia's linguistic heritage, remain under-resourced and under-researched in these technological developments. This workshop aims to bridge this gap by bringing together researchers, linguists, and technologists to focus on the unique challenges and opportunities these languages present. Participants will explore innovative methods for creating and annotating digital corpora, develop speech and language technologies suited to IADL, and promote interdisciplinary collaborations. By leveraging LLMs, we seek to address the complexities of syntax, morphology, and semantics in these languages to enhance the performance of NLP applications. Furthermore, the workshop will provide a platform for sharing best practices, tools, and resources, enhancing the digital infrastructure necessary for language preservation. Through collaborative efforts, we aim to build a research community to advance NLP for IADL, contributing to linguistic diversity and cultural preservation in the digital age.

The topics of the workshop include, but are not limited to:
- Large Language Models for Indo-Aryan Languages and Dravidian Languages
- Developing a cleaned Indo-Aryan and Dravidian language corpus (UNICODE) and digital linguistic resources
- Machine Translation and Cross-Lingual Systems
- Speech Technologies: Recognition and Synthesis
- Language Identification and Dialect Detection
- Information Extraction, OCR systems and Knowledge Modelling
- NLP Applications
- Fake News, Spam, and Rumour Detection
- Hate speech and Offensive Language Detection
- Sentiment Analysis and Text Summarisation
- NLP applications: Misinformation, Conspiracy theories, Rumours, SPAM, Phishing, and similar applications

Submission & Publication

Papers will be evaluated according to their significance, originality, technical content, style, clarity, and relevance to the workshop. We welcome the following types of contributions:
- Standard research papers (up to 8 pages, plus more pages for references if needed)
- Short research papers (from 4 to 6 pages, plus more pages for references if needed)

At the end of the paper (after the conclusions but before the references), papers need to include a mandatory section discussing the limitations of the work and, optionally, a section discussing ethical considerations. Papers can include unlimited pages of references and an unlimited appendix.

To prepare your submission, please make sure to use the COLING 2025 style files available here:
- Latex - https://coling2025.org/downloads/coling-2025.zip
- Word - https://coling2025.org/downloads/coling-2025.docx
- Overleaf - https://www.overleaf.com/latex/templates/instructions-for-coling-2025-proce…

All papers should be submitted electronically in PDF format via START<https://softconf.com/coling2025/IndoNLP25/>, the main conference platform.

Important Dates
- Paper submission deadline: November 5, 2024
- Notification of acceptance: November 25, 2024
- Camera-ready paper: December 13, 2024
- Workshop date: January 20, 2025

Organising Committee
- Ruvan Weerasinghe, Informatics Institute of Technology, Sri Lanka
- Isuri Anuradha, Lancaster University, UK
- Deshan Sumanathilaka, Swansea University, UK
- Mo El-Haj, Lancaster University, UK
- Chamila Liyanage, University of Colombo School of Computing, Sri Lanka
- Fahad Khan, Istituto di Linguistica Computazionale in CNR, Italy
- Andrew Hardie, Lancaster University, UK
- Asim Abbas, Birmingham University, UK
- Ruslan Mitkov, Lancaster University, UK
- Julian Hough, Swansea University, UK
- Nicholas Micallef, Swansea University, UK
- Naomi Krishnarajah, Informatics Institute of Technology, Sri Lanka

Programme Committee
- Randil Pushpanandha, University of Colombo, Sri Lanka
- Dulip Herath, Queensland University, Australia
- Daisy Lal, Lancaster University, UK
- Damith Premasiri, Lancaster University, UK
- Venkatesh Raju, Stealth Mode AI Startup, India
- Gayanath Chandrasena, University of Helsinki, Finland
- Kaza Sri Sai Swaroop, IBM, India
- Asanka Wasala, Dell Technologies, Ireland
- Kengatharaiyer Sarveswaran, University of Jaffna, Sri Lanka
- Sinnathamby Mahesan, University of Jaffna, Sri Lanka
- Nishantha Medagoda, Auckland University of Technology, New Zealand
- Prasan Yapa, Kyoto University of Advanced Science, Japan
- Paul Rayson, Lancaster University, UK
- Lochandaka Ranathunga, University of Moratuwa, Sri Lanka
- Kaneeka Vidanage, General Sir John Kotelawala Defence University, Sri Lanka
- Achala Aponso, Edith Cowan University, Australia
- Rajitha Jayasinghe, University of Westminster, UK
- Arjumand Younus, University College Dublin, Ireland
- Abdul Nazeer, National Institute of Technology, Calicut, India
- Pabitra Mitra, Indian Institute of Technology, Kharagpur, India
- Tanmoy Chakraborty, Indian Institute of Technology, Delhi, India
- Tirthankar Dasgupta, Indian Institute of Technology, Kharagpur, India
- Girish Nath Jha, School for Sanskrit and Indic Studies, JNU, India
- Arka Majhi, Indian Institute of Technology, Bombay, India
- Anand Kumar, National Institute of Technology, Karnataka, India
- Kishorjit Nongmeikapam, Indian Institute of Information Technology, India
- Abdullah Alzahrani, Swansea University, Wales, UK

Gmail: indonlp2025(a)gmail.com<mailto:indonlp2025@gmail.com>
Twitter: https://x.com/indo_nlp


Postdoc (or PhD) position in NLU at the University of Technology Nuremberg
by Michael Roth 18 Oct '24


The University of Technology Nuremberg (UTN) is looking to fill a full-time position in the Department of Engineering as soon as possible:

# Research Associate - Postdoc (m/f/d) in Natural Language Understanding (NLU)

UTN is dedicated to harnessing the knowledge and innovation of the humanities to shape a sustainable future. The Department of Engineering seeks to establish strong, dynamic collaboration across disciplines, connecting engineering with the humanities, social sciences, and natural sciences. The position offers an opportunity for scientific qualification, allowing you to build your research profile and gain experience in line with your academic background and personal aspirations. The research project should make a contribution to one of the NLU Lab's key research areas listed below.

## Your tasks
* Active collaboration in research and teaching
* Participation in the lab's research areas "background knowledge in language understanding and misunderstanding" and "implicit and underspecified language"
* Support in the conception and organization of scientific events and public engagement projects (in coordination with the Communication Unit)
* Participation in research cooperations of the department

## Your profile
* Very good academic degree (Master's or comparable) in computational linguistics or a related field
* Doctorate in the field of computational linguistics
* Proven research focus in natural language understanding, as demonstrated by relevant publications
* Interest in interdisciplinary cooperation in research and teaching

## We offer
* An employment contract or position as a civil servant for initially 3 years
* Active support in the development of your own research agenda and corresponding applications for projects or an independent research group
* Salary corresponding to group A13 of the Bavarian Salary Act or pay group E13 TV-L (https://oeffentlicher-dienst.info/tv-l/allg/) if the personal and pay scale requirements are met
* Opportunity to actively participate in the development of the newly founded University of Technology Nuremberg and to take on responsible tasks
* A dynamic and flexible working environment
* A modern workplace with all the attractive social benefits of the public sector
* Flexible working hours to reconcile family and career
* Mobile working opportunities
* Attractive training and development opportunities

The position is suitable for people with severe disabilities. Severely disabled applicants will be given preference if they have the same suitability, qualifications and professional performance. Women are encouraged to apply in accordance with Art. 7 Para. 3 of the Bavarian Equal Opportunities Act. The NLU Lab is committed to fostering a diverse and inclusive work environment, and we highly welcome applications from minority groups and candidates of all backgrounds. The position is open to part-time arrangements, provided that the responsibilities can be fulfilled through job sharing.

## Are you interested?
Please send us your detailed application by 03.11.2024. Please only use our application portal. Interviews for this vacancy are expected to take place in the week commencing 10.11.2024.

Job portal: https://jobs.utn.de/en/jobposting/ead54f69ce980cfe6f7d04fba57812b88328f4a00…

## Do you have questions?
We are happy to receive general questions by e-mail to jobs(a)utn.de, quoting the reference number ENG-2024-04, and we will call you back. If you have content-related questions, please contact Prof. Dr. Michael Roth at michael.roth(a)utn.de

The official version of this announcement is available in German on www.utn.de


Join Us: EMNLP 2024 Tutorial on Countering Hateful and Offensive Speech Online - Open Challenges
by Flor Miriam Plaza Del Arco 18 Oct '24


Dear colleagues,

We are pleased to invite you to the tutorial “Countering Hateful and Offensive Speech Online - Open Challenges,” which will take place on Friday, November 15, from 9:00 to 12:30 at EMNLP 2024 in Miami.

Overview: In today's digital age, hate speech and offensive speech online pose a significant challenge to maintaining respectful and inclusive online environments. This tutorial aims to provide attendees with a comprehensive understanding of the field by delving into essential dimensions such as:
1. Data Creation and Multilingualism.
2. Counter-narrative generation.
3. A hands-on session with one of the most popular APIs for detecting hate speech.
4. Fairness and ethics in AI.
5. The use of recent advanced approaches.

For the full program and detailed agenda, please visit the tutorial website: https://nlp-for-countering-hate-speech-tutorial.github.io/

Tutorial Organizers
* Flor Miriam Plaza-del-Arco, MilaNLP, Bocconi University, Italy.
* Debora Nozza, MilaNLP, Bocconi University, Italy.
* Marco Guerini, LanD_FBK, Fondazione Bruno Kessler, Italy.
* Jeffrey Sorensen, Jigsaw, USA.
* Marcos Zampieri, Language Technology Group, George Mason University, USA.

We look forward to your participation and hope to see you there!

Best regards,
The Tutorial Organizing Team

------------------------------------------------
Flor Miriam Plaza del Arco, Ph.D.
Postdoctoral Researcher
MilaNLP<https://milanlproc.github.io/>, Computing Sciences Department
Bocconi University
Via Röntgen, 1-2, 20136 Milan, MI, Italy
Twitter: @florplaza22
Web: https://fmplaza.github.io/


PhD position in multilingual NLP (4 years, fully funded) at USI, Lugano, Switzerland
by Lonneke van der Plas 18 Oct '24


We are looking for a PhD candidate for a fully funded 4-year project focusing on multilingual NLP, including extremely low-resource languages.

The PhD project is part of the National Centre of Competence in Research (NCCR) Evolving Language (www.evolvinglanguage.ch), a Swiss consortium with the ambitious goal of creating a new discipline, Evolutionary Language Science, that targets the past and future of language. The consortium consists of leading scientists from traditionally separated academic domains, which allows us to harness the diverse expertise of the humanities, social sciences, computational sciences, natural sciences and medicine for a broad-scale interdisciplinary collaboration.

Within this framework, the successful candidate will be expected to investigate how new terms emerge in a given language and how they spread in different contexts, thereby comparing Western societies with hunter-gatherer societies. This task is led by computational linguists and evolutionary anthropologists from the USI Università della Svizzera italiana and the University of Zurich (UZH): Prof. Lonneke van der Plas, Prof. Lena Jäger, and Prof. Andrea Migliano.

The ideal candidate should satisfy the following requirements:
• A Master's degree (or equivalent title) in Computational Linguistics, Natural Language Processing, or related disciplines, such as Computer Science or Computational Cognitive Science
• High personal interest in multilingual approaches to NLP, low-resource languages, and language change
• Expertise in machine learning and computational modelling of language (change)
• Good skills in oral and written English (official language of the Ph.D. program)
• Good oral and writing skills in one (or more) national Swiss languages. Italian is particularly welcome.
• Ability to work independently and to plan and direct own work
• Motivation to engage in the elaboration of a PhD dissertation. Ability to work in a team and autonomy in scheduling research steps.

Interest in teaching and tutoring students and availability to collaborate with colleagues, especially with colleagues from the different disciplines of the NCCR Evolving Language (engage in scientific dialogue, listen and think critically), are required.

More information and a link to apply for the position can be found here: https://sites.google.com/site/lonnekenlp/phd-positions-available


Final Call for Papers: IWCLUL 2024 in Helsinki
by Mika Hämäläinen 18 Oct '24


The 9th International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2024) will be organized by ACL SIGUR. The proceedings of the event will be published in the ACL Anthology. The workshop will take place on November 28-29, 2024 in Helsinki, Finland at Metropolia University of Applied Sciences.

https://acl-sigur.github.io/iwclul2024.html

Submission deadline: October 25, 2024 (extended)
Registration/publication fees: 0€!

We solicit original and unpublished work related to NLP methods for Uralic languages, including multilingual methods that include at least one Uralic language (e.g. Finnish, Estonian, Hungarian etc).

Appropriate topics include (but are not limited to):
- Multilingual approaches in NLP presenting work on at least one Uralic language
- LLMs and their use in the context of (endangered) Uralic languages
- Position papers
- Parsers, analysers and processing pipelines of Uralic languages
- Lexical databases, electronic dictionaries
- Finished end-user applications aimed at Uralic languages, such as spelling or grammar checkers, machine translation or speech processing
- Evaluation methods and gold standards, tagged corpora, treebanks
- Reports on language-independent or unsupervised methods as applied to Uralic languages
- Surveys and review articles on subjects related to computational linguistics for one or more Uralic languages
- Any work that aims at combining efforts and reducing duplication of work
- How to elicit activity from the language community, agitation campaigns, games with a purpose

Short papers can be up to 4 pages in length (5 for camera-ready version). Short papers can report on work in progress or a more targeted contribution such as software or partial results.

Long papers can be up to 8 pages in length (9 for camera-ready version). Long papers should report on previously unpublished, completed, original work.

Lightning talks are submitted as 750-word abstracts. Lightning talks are suited for discussing ideas or presenting work in progress. The abstracts will be published in lightning talk proceedings on Zenodo.

All submission formats can have an unlimited number of pages for references. All submissions must follow the ACL stylesheet. The submissions must be anonymous, and they will be peer-reviewed by our program committee. The peer review is double-blind. Papers must be submitted using the conference submission system by the deadline.

At least one of the authors of an accepted paper must attend the event and present their paper. Accepted papers (short and long) will be published in the joint proceedings that will appear in the ACL Anthology. Accepted papers will also be given an additional page to address the reviewers’ comments. The length of a camera-ready submission can then be 5 pages for a short paper and 9 for a long paper, with an unlimited number of pages for references.

Important dates:
- Paper submission (full and short): October 25, 2024 (extended)
- Notification of acceptance: November 3, 2024
- Camera ready deadline: November 10, 2024
- Registration deadline: November 10, 2024
- Workshop: November 28-29, 2024

[CFP-Extended Shared Task deadlines] The First Workshop and Shared Task on Multilingual Counterspeech Generation at COLING-2025
by mevallec＠ujaen.es 18 Oct '24

-----------------------------------------------------
Shared Task on Multilingual Counterspeech Generation
-----------------------------------------------------

*WE HAVE EXTENDED THE DEADLINES!* (You can find the new deadlines in the "Important Dates" section of this CFP)

In addition to paper contributions, we are organizing a shared task on multilingual counterspeech generation with the aim of bringing together in a central space current efforts, especially those for languages other than English. We envisage that the shared task will allow the community to study how to improve counterspeech generation for lower-resource languages while also reinforcing the strong body of research that already exists for English.

The counterspeech generated by participants should be respectful, non-offensive, and contain information that is specific and truthful with respect to the following targets: Jews, LGBT+, immigrants, people of color, women.

Data
---------------------

We release new data consisting of 596 Hate Speech-Counter Narrative (HS-CN) pairs. In this dataset, the HS are taken from MTCONAN [https://github.com/marcoguerini/CONAN/tree/master/Multitarget-CONAN], while the CN are newly generated. Together with each HS-CN pair, we also provide 5 background knowledge sentences, some of which are relevant for obtaining the Counter Narratives. The dataset is available in 4 languages (Basque, English, Italian and Spanish) and divided into the following splits:

- Development: 100 pairs [AVAILABLE NOW!] [https://huggingface.co/datasets/LanD-FBK/ML_MTCONAN_KN]
- Train: 396 pairs [AVAILABLE NOW!] [https://huggingface.co/datasets/LanD-FBK/ML_MTCONAN_KN]
- Test: 100 pairs [AVAILABLE ON 28th OCTOBER]

In order to score the shared task participants, the CNs will be kept hidden during the shared task, while the HS and the background knowledge will be released for participants to prepare their submissions.
The languages, Basque, English, Italian and Spanish, offer a varied spectrum of complexity, including an agglutinative language isolate (Basque), two Romance languages (Italian, Spanish) and a Germanic one (English). The choice of languages reflects the linguistic expertise available among the organizers, which is required to run the shared task successfully.

Additional Multilingual Data
---------------------

Participants also have access to the manually curated CONAN data in English, Italian, Basque and Spanish:

- CONAN (English and Italian) [https://github.com/marcoguerini/CONAN/tree/master/CONAN]
- CONAN (Basque and Spanish) [https://huggingface.co/datasets/HiTZ/CONAN-EUS]
- CONAN-MT-SP (Spanish) [https://github.com/sinai-uja/CONAN-MT-SP] or [https://huggingface.co/datasets/SINAI/CONAN-MT-SP]
- Multitarget-CONAN (English) [https://github.com/marcoguerini/CONAN/tree/master/Multitarget-CONAN]

Participation
---------------------

The aim of the shared task is, given an HS (and optionally any additional knowledge the participants may like to use), to generate a CN that counteracts the HS. An example of a CN in each language appears below:

- English:
  + HS: We should ban all homosexuals.
  + CN: When will the love prosper and the hatred start to dissipate? I will not only respect my fellow LGBT+ people, I will promote their rights.
- Italian:
  + HS: Dovremmo espellere tutti gli omosessuali.
  + CN: Quand'è che l'amore prospererà e l'odio inizierà a dissiparsi? Non solo rispetterò i miei compagni LGBT+, ma promuoverò i loro diritti.
- Spanish:
  + HS: Deberíamos prohibir a todos los homosexuales.
  + CN: ¿Cuándo prosperará el amor y empezará a disiparse el odio? No sólo respetaré a mis compañeros LGBT+, sino que promoveré sus derechos.
- Basque:
  + HS: Homosexual guztiak debekatu beharko genituzke.
  + CN: Noiz hasiko da maitasuna irabazten eta gorrotoa desagertzen? LGBT+ pertsonak errespetatzeaz gain, haien eskubideak sustatuko ditut.
Participants will download the test HS for the 4 languages and generate at most three different CNs per HS for each language. The test window will last 5 days. Participants are allowed to use any resource (language model, data, etc.) to generate the CN.

**Note:** If you are going to participate in the shared task, please fill in the following form: [https://docs.google.com/forms/d/e/1FAIpQLSeAZTJsrEXt35HfFFchPNdPi289q5kKerq…]

Evaluation
---------------------

The CNs submitted by the participants will be evaluated:

- Using traditional automatic metrics as in Tekiroglu et al. (2022), which include BLEU, ROUGE, Novelty and Repetition Rate.
- Using LLM as a Judge, following the approach described in this paper: https://arxiv.org/abs/2406.15227

Important Dates
---------------------

- Test dataset release: October 28th, 2024
- Results submission: November 4th, 2024
- Results notification: November 15th, 2024
- Working papers submission: November 25th, 2024
- Notification of Acceptance: December 8th, 2024
- Camera-Ready Papers Due: December 13th, 2024
- Workshop: January 19th, 2025

-----------------------------------------------------
Workshop on Multilingual Counterspeech Generation
-----------------------------------------------------

The Shared Task is associated with the First Workshop on Multilingual Counterspeech Generation at COLING 2025.

---------------------
Background and Scope
---------------------

While interest in automatic approaches to Counterspeech generation has been steadily growing, including studies on data curation (Chung et al., 2019a; Fanton et al., 2021), detection (Chung et al., 2021a; Mathew et al., 2018), and generation (Tekiroglu et al., 2020; Chung et al., 2021b; Zhu and Bhat, 2021; Tekiroglu et al., 2022), the large majority of the published experimental work on automatic Counterspeech generation has been carried out for English.
This is due both to the scarcity of non-English manually curated training data and to the overwhelming predominance of English in the generative Large Language Models (LLMs) ecosystem. A workshop on exploring Multilingual Counterspeech Generation is proposed to promote and encourage research on multilingual approaches for this challenging topic.

Thus, this workshop aims to test monolingual and multilingual LLMs in particular and Language Technology in general to automatically generate counterspeech not only in English but also in languages with fewer resources. In this sense, an important goal of the workshop will be to understand the impact of using LLMs, considering for example how to deal with pressing issues such as biases, hallucinated content, data scarcity or data contamination.

We seek to maximize the scientific and social impact of this workshop by promoting the creation of a community of researchers from diverse fields, such as computer and social sciences, as well as policy makers and other stakeholders interested in automatic counterspeech generation. By doing so we aim to gain a deeper understanding of how individuals, activists, and organizations currently use counterspeech to tackle abuse, and how Natural Language Processing (NLP) and Generation (NLG) may be best applied to counteract it.

Call for Papers
---------------------

We welcome submissions on topics including, but not limited to, the following:

- Models and methods for generating counterspeech in different languages.
- Automatic Counterspeech generation for low resource languages with scarce training data.
- Dialogue agents that use counterspeech to combat offensive messages that are directed to individuals or groups, targeted based on various aspects such as ideology, gender, sexual orientation and religion.
- Methods for human and automatic evaluation of counterspeech.
- Multidisciplinary studies providing different perspectives on the topic such as computer science, social science, psychology, etc.
- Development of taxonomies and quality datasets for counterspeech in multiple languages.
- Potentials and limitations (e.g., fairness, biases, hallucinated content) of applying different NLP methods, such as LLMs, to generate counterspeech.
- Social impact and empirical studies of counterspeech in social networks, including research on the effectiveness and consequences for users of using counterspeech to combat hate online.

Submission
---------------------

We welcome two types of papers: regular workshop papers and non-archival submissions. Regular workshop papers will be included in the workshop proceedings. All submissions must be in PDF format and made through START [https://softconf.com/coling2025/MCG25/]

- Regular workshop papers: Authors can submit papers up to 8 pages, with unlimited pages for references. Authors may submit up to 100 MB of supplementary materials separately and their code for reproducibility. All submissions undergo a double-blind, single-track review. Accepted papers will be presented as posters with the possibility of oral presentations.
- Non-archival submissions: Cross-submissions are welcome. Papers already accepted at other venues or journals will be presented at the workshop, but will not be included in the workshop proceedings. Papers must be in PDF format and will be reviewed in a double-blind fashion by workshop reviewers. We also welcome extended abstracts (up to 2 pages) of papers that are work in progress, under review or to be submitted to other venues. Papers in this category need to follow the COLING format.
Important Dates
---------------------

- Submission: November 20th, 2024
- Notification of Acceptance: December 2nd, 2024
- Camera-Ready Papers Due: December 10th, 2024

For more information you can join the Google group [https://groups.google.com/g/multilingual-cs-generation-coling2025] or visit our website [https://sites.google.com/view/multilang-counterspeech-gen/home] [https://sites.google.com/view/multilang-counterspeech-gen/shared-task]

Best regards,
The Multilingual Counterspeech Generation Workshop Organizers.

Computing Conference | London | 19-20 June 2025 - Call for Papers Now Open
by Supriya Kapoor 18 Oct '24

Dear Colleagues,

We are pleased to announce the Call for Papers for the *Computing Conference 2025 <https://saiconference.com/Computing>*, scheduled to take place on 19-20 June 2025 in London. The conference aims to bring together researchers, scholars, practitioners, and experts from around the world to share their insights, discoveries, and research outcomes.

We invite you to submit your original research papers, posters, and proposals on a wide range of topics within the realm of computing, including but not limited to:

- High-Performance Computing
- Quantum Computing
- Data Science and Analytics
- Cloud Computing and Distributed Systems
- Cybersecurity and Privacy
- Emerging Technologies in Computing
- Natural Language Processing and LLMs
- Artificial Intelligence and Machine Learning
- Computer Vision and Image Processing
- Intelligent Educational systems and e-Learning
- Ambient Intelligence and IoT

*Important Dates - Round 2:*

Full Paper Submission Deadline: 15 November 2024
Notification of Acceptance: 15 December 2024
Camera-Ready Paper Submission: 15 January 2025
Conference Dates: 19-20 June 2025

Proceedings will be published in the Springer series "Lecture Notes in Networks and Systems" and submitted for consideration to Web of Science, SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH and SCImago.

Submissions will undergo a rigorous double-blind peer-review process by an esteemed panel of experts, ensuring the highest academic and technical standards.

Detailed submission guidelines and conference updates are available on our official website at https://saiconference.com/Computing.

Looking forward to your valuable contributions to the Computing Conference 2025.

Sincerely,
Kohei Arai
Program Chair
Computing Conference

View 2024 Recap <https://youtu.be/Am9rE4tm3ms>

1st CfP: Asia Pacific Journal of Corpus Research (APJCR) Vol. 5, No. 2 (Deadline: 31 October 2024)
by Prof CK Jung 18 Oct '24

[Apologies for cross-posting]

Dear colleagues,

We are inviting submissions for the next issue of Asia Pacific Journal of Corpus Research, to appear on 31 December 2024.

*ABOUT*
The Asia Pacific Journal of Corpus Research (APJCR, e-ISSN 2733-8096, DOI: https://doi.org/10.22925/apjcr) is an international and interdisciplinary peer-reviewed journal intended to explore corpus research in the Asia Pacific region. APJCR addresses areas of methodological, applied and theoretical work in the field of corpus research. Examples include discourse analysis, lexical studies, grammatical studies, language acquisition, language learning, language education, lexicography, pragmatics, sociolinguistics, (machine) translation studies, (digital) literary studies, computational linguistics, speech, phonetics, deep learning and natural language understanding in conjunction with corpora.

*NO ARTICLE PROCESSING CHARGE*
APJCR does not charge authors an Article Processing Fee (APF).

*OPEN ACCESS POLICY*
APJCR provides open access to its content on the principle that making research freely available to the public supports a greater global exchange of knowledge.
*SUBMISSION*
Papers (in English or Korean) should be sent to *apjcreditor(a)icr.or.kr*
*Full instructions can be found at http://icr.or.kr/apjcr*

*IMPORTANT DATES*
- Manuscript submission: 31 October 2024
- First decision (articles assessed by editors): November 2024
- Final decision: November 2024
- Production: December 2024
- Online publication: 31 December 2024

*APJCR ARCHIVE*
- *Google Scholar*: https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=apjcr&btnG=
- *KoreaScience*: http://koreascience.or.kr/journal/CPSOBX/v1n1.page

*ENQUIRIES*
help(a)icr.or.kr

---
*CK Jung BEng(Hons) Birmingham MSc Warwick EdD Warwick Cert Oxford*
Associate Professor | Department of English Language and Literature, Incheon National University, *South Korea*
President | The Korea Association of Secondary English Education, *South Korea* (http://kasee.org)
Director | Institute for Corpus Research, Incheon National University, *South Korea* (http://icr.or.kr)
Editor-in-Chief | Asia Pacific Journal of Corpus Research, ICR, *International* (http://icr.or.kr/apjcr)
Editorial Board | Corpora, Edinburgh University Press, *UK*
Editorial Board | English Today, Cambridge University Press, *UK*
E: ckjung(a)inu.ac.kr / T: +82 (0)32 835 8129
