November 2024 - Corpora

3-year postdoc position in NLP at the University of Oslo
by Lilja Øvrelid 18 Nov '24


A position as Postdoctoral Research Fellow in Natural Language Processing is available within MediaFutures: Research Centre for Responsible Media Technology & Innovation at the Language Technology Group (LTG) at the University of Oslo (UiO), Norway. The closing date is December 13th, 2024.

For more information about the position and the research group, please see the full announcement here: https://www.jobbnorge.no/en/available-jobs/job/270966/postdoctoral-research…

Please do not hesitate to contact me for any further information.

Best regards,
Lilja


Final CFP: Building and Using Comparable Corpora Workshop
by s.sharoff＠leeds.ac.uk 17 Nov '24


Note the paper submission deadline: 30 November, 2024

Workshop website: https://comparable.lisn.upsaclay.fr/bucc2025/
COLING website: https://coling2025.org/

Keynote speaker: Preslav Nakov, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi

**************************************************************

* Motivation

In the language engineering and linguistics communities, research in comparable corpora has been motivated by two main reasons. In language engineering, on the one hand, it is chiefly motivated by the need to use comparable corpora as training data for statistical NLP applications such as statistical and neural machine translation or cross-lingual retrieval. In linguistics, on the other hand, comparable corpora are of interest because they enable cross-language discoveries and comparisons. It is generally accepted in both communities that comparable corpora consist of documents that are comparable in content and form in various degrees and dimensions across several languages. Parallel corpora are on the one end of this spectrum, and unrelated corpora are on the other.

In recent years, the use of comparable corpora for pre-training Large Language Models (LLMs) has led to their impressive multilingual and cross-lingual abilities, which are relevant to a range of applications, including Information Retrieval, Machine Translation, Cross-lingual text classification, etc. The linguistic definitions and observations related to comparable corpora can improve methods to mine such corpora or to improve cross-lingual transfer of LLMs. Therefore, it is of great interest to bring together builders and users of such corpora.

* Shared Task

This year we will run a shared task aimed at detecting translations of terms via comparable corpora.
Please see the website for details: https://comparable.limsi.fr/bucc2025/bucc2025-task.html

* Topics

We solicit contributions on all topics related to comparable (and parallel) corpora, including but not limited to the following:

Building Comparable Corpora:
- Automatic and semi-automatic methods
- Methods to mine parallel and non-parallel corpora from the web
- Tools and criteria to evaluate the comparability of corpora
- Parallel vs non-parallel corpora, monolingual corpora
- Rare and minority languages, across language families
- Multi-media/multi-modal comparable corpora

Applications of comparable corpora:
- Human translation
- Language learning
- Cross-language information retrieval & document categorization
- Bilingual and multilingual projections
- (Unsupervised) Machine translation
- Writing assistance
- Machine learning techniques using comparable corpora

Mining from Comparable Corpora:
- Cross-language distributional semantics, word embeddings and pre-trained multilingual transformer models
- Extraction of parallel segments or paraphrases from comparable corpora
- Methods to derive parallel from non-parallel corpora (e.g. to provide for low-resource languages in neural machine translation)
- Extraction of bilingual and multilingual translations of single words, multi-word expressions, proper names, named entities, sentences, and paraphrases from comparable corpora, etc.
- Induction of morphological, grammatical, and translation rules from comparable corpora
- Induction of multilingual word classes from comparable corpora

Comparable Corpora in the Humanities:
- Comparing linguistic phenomena across languages in contrastive linguistics
- Analyzing properties of translated language in translation studies
- Studying language change over time in diachronic linguistics
- Assigning texts to authors via authors' corpora in forensic linguistics
- Comparing rhetorical features in discourse analysis
- Studying cultural differences in sociolinguistics
- Analyzing language universals in typological research

* Workshop Organizers
- Serge Sharoff (University of Leeds)
- Ayla Rigouts Terryn (Université de Montréal (UdeM), Mila)
- Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, Orsay)
- Reinhard Rapp (University of Mainz, Germany)

* Program Committee
- Ebrahim Ansari (Institute for Advanced Studies in Basic Sciences, Iran)
- Eleftherios Avramidis (DFKI, Germany)
- Gabriel Bernier-Colborne (National Research Council, Canada)
- Thierry Etchegoyhen (Vicomtech, Spain)
- Alex Fraser (University of Munich, Germany)
- Natalia Grabar (University of Lille, France)
- Amal Haddad Haddad (Universidad de Granada, Spain)
- Amir Hazem (University of Tokyo, Japan)
- Kyo Kageura (University of Tokyo, Japan)
- Natalie Kübler (Université Paris Cité, France)
- Philippe Langlais (Université de Montréal, Canada)
- Yves Lepage (Waseda University, Japan)
- Shervin Malmasi (Amazon, USA)
- Michael Mohler (Language Computer Corporation, USA)
- Emmanuel Morin (Nantes Université, France)
- Dragos Stefan Munteanu (RWS, USA)
- Ted Pedersen (University of Minnesota, Duluth, USA)
- Nasredine Semmar (CEA LIST, Paris, France)
- Silvia Severini (Leonardo Labs, Italy)
- Pranaydeep Singh (University of Gent, Belgium)
- Richard Sproat (Google, USA)
- Marko Tadić (University of Zagreb, Croatia)
- François Yvon (Sorbonne Université, France)


Call for Papers ROMCIR 2025 @ ECIR: The 5th International Workshop on Reducing Online Misinformation through Credible Information Retrieval
by Udo Kruschwitz 17 Nov '24


ROMCIR 2025: The 5th International Workshop on Reducing Online Misinformation through Credible Information Retrieval
Co-located with ECIR 2025: The 47th European Conference on Information Retrieval
Lucca, Italy | April 10, 2025

Workshop website: https://romcir.disco.unimib.it
Submission link: https://easychair.org/conferences/?conf=romcir2025

____________________________________________________________________________________________________

GENERAL DESCRIPTION

The fifth edition of ROMCIR concerns providing users with access to (topically) relevant and factually accurate information, in order to mitigate the human-generated or AI-generated information disorder phenomenon concerning distinct domains. By "information disorder" we mean all forms of communication pollution, from misinformation made out of ignorance or automatically built based on biased content, to the intentional sharing of false content (generated both manually and automatically). In this context, all approaches that can serve to assess the factual accuracy of information circulating online, and in social media in particular, find their place.

This topic is very broad, as it concerns different contents (e.g., Web pages, news, reviews, medical information, online accounts, etc.), different Web and social media platforms (e.g., microblogging platforms, social networking services, social question-answering systems, etc.), different purposes (e.g., identifying false information, accessing information based on its truthfulness, retrieving truthful information, etc.), and different open issues related in particular to AI (e.g., explainability of search results, assessment of the truthfulness of automatically generated content, generative models to support IRSs, etc.).

****************************************************************************************************

THEMES

The themes of interest include, but are not limited to, the following:

* Access to and retrieval of truthful information
* Bot/spam/troll detection
* Computational fact-checking
* Credibility assessment of online documents
* Crowdsourcing for information truthfulness assessment
* Disinformation/misinformation/bias detection
* Evaluation strategies to assess information truthfulness
* Generative models and information truthfulness assessment
* Human-in-the-loop misinformation detection
* Information polarization in online communities, echo chambers
* Propaganda identification/analysis
* Retrieval of credible and truthful information
* Security, privacy, and information truthfulness
* Societal reaction to misinformation
* Stance detection
* Trust and reputation

Data-driven approaches, supported by publicly available datasets, are more than welcome.

****************************************************************************************************

CONTRIBUTIONS

The Workshop solicits two types of contributions that are relevant to the Workshop and suitable to generate discussion:

* Original, unpublished contributions (pre-prints submitted to ArXiv are eligible) that will be included in an open-access post-proceedings volume of CEUR Workshop Proceedings (http://ceur-ws.org/), indexed by both Scopus and DBLP.
* Already published or preliminary work that will not be included in the post-proceedings volume.

All submissions will undergo SINGLE-BLIND peer review by the Program Committee.
Submissions are to be made electronically through EasyChair at: https://easychair.org/conferences/?conf=romcir2025

****************************************************************************************************

SUBMISSION INSTRUCTIONS

Submissions must be:

* Regular papers: between 10 and 14 pages long
* Short papers: between 5 and 9 pages long

We recommend that authors use the new CEUR-ART style for writing papers to be published:

* An Overleaf page for LaTeX users is available at: https://www.overleaf.com/project/671e05abc213fddad9644a94
* An offline version with the style files including DOCX template files is available at: http://ceur-ws.org/Vol-XXX/CEURART.zip
* The paper must contain, as the name of the conference: ROMCIR 2025: The 5th Workshop on Reducing Online Misinformation through Credible Information Retrieval (held as part of ECIR 2025: The 47th European Conference on Information Retrieval), April 10, 2025, Lucca, Italy
* The title of the paper should follow the regular capitalization of English (e.g., Example of a Title of a Paper Correctly Capitalized)
* Please choose the one-column template
* According to CEUR-WS policy, the papers will be published under a CC BY 4.0 license: https://creativecommons.org/licenses/by/4.0/deed.en

If the paper is accepted, authors will be asked to sign by hand (in pen) an author agreement with CEUR:

* If you do not employ Third-Party Material (TPM) in your draft, sign the document at: https://ceur-ws.org/ceur-author-agreement-ccby-ntp.pdf?ver=2024-06-04
* If you do use TPM, the agreement can be found at: https://ceur-ws.org/ceur-author-agreement-ccby-tp.pdf?ver=2024-06-04

For further information: https://ceur-ws.org/HOWTOSUBMIT.html

****************************************************************************************************

IMPORTANT DATES (AoE)

* Abstract submission: January 05, 2025
* Paper submission: January 12, 2025
* Decision notification: February 16, 2025
* Workshop day: April 10, 2025

****************************************************************************************************

ORGANIZERS

* Udo Kruschwitz (https://www.linkedin.com/in/udo-kruschwitz-57106b5/), University of Regensburg, Regensburg, Germany
* Marinella Petrocchi (https://www.iit.cnr.it/en/marinella.petrocchi/), IIT-CNR, Pisa, Italy
* Marco Viviani (https://ikr3.disco.unimib.it/people/marco-viviani/), University of Milano-Bicocca, Milan, Italy


CfP: Workshop on Context and Meaning—Navigating Disagreements in NLP Annotations
by Michael Roth 17 Nov '24


********************************************************************************
CoMeDiNLP: Context and Meaning--Navigating Disagreements in NLP Annotations
https://unimplicit.github.io/
Workshop held in conjunction with COLING 2025
January 19/20, 2025
********************************************************************************

Disagreements among annotators pose a significant challenge in Natural Language Processing, impacting the quality and reliability of datasets and consequently the performance of NLP models. This workshop aims to explore the complexities of annotation disagreements, their causes, and strategies towards their effective resolution, with a focus on meaning in context.

The quality and reliability of annotated data is crucial for the development of robust NLP models. However, managing disagreements among annotators poses significant challenges to researchers and practitioners. Such disagreements can stem from various factors, including subjective interpretations, cultural biases and ambiguous guidelines.

Early research has highlighted the impact of annotator disagreements on data quality and model performance (e.g. Artstein and Poesio, 2008; Pustejovsky and Stubbs, 2012; Plank et al., 2014). More recent work on perspectivism in NLP, such as that by Basile et al. (2021), highlights the importance of embracing multiple perspectives in annotation tasks to better capture the diversity of human language. This approach argues for the inclusion of various viewpoints to improve the robustness and fairness of NLP models.

On the modeling side, various methods for dealing with annotation disagreements have been proposed. For example, Hovy et al.
(2013) and Passonneau and Carpenter (2014) identify and weigh annotator reliability to better aggregate contributions, whereas recent approaches following the perspectivism approach leverage inherent disagreements in subjective tasks to train models handling diverse opinions (Davani et al., 2022; Deng et al., 2023).

== Call for Submissions ==

We invite both long (8 pages) and short (4 pages) papers. The limits refer to the content, and any number of additional pages for references is allowed. The papers should follow the COLING 2025 formatting instructions. Each submission must be anonymized, written in English, and contain a title and abstract.

We especially welcome papers that address the following themes, for a single type of disagreement or annotation disagreements in general:

- New benchmarks for detecting or categorizing disagreements
- Models and modeling strategies for variations in annotation
- Evaluation schemes and metrics for phenomena without a single ground truth
- Phenomena that are not yet within reach with current NLP technology

To encourage discussion and community building and to bootstrap potential collaborations, we also solicit non-archival submissions, in addition to shared task papers and regular "archival" track papers. These can take two forms:

- Works in progress that are not yet mature enough for a full submission can be submitted in the form of a title and abstract. Abstracts may be up to two pages in length.
- Already published work, or work currently under submission elsewhere, can be submitted in the form of an abstract and a copy of the submission/publication.

These works will be reviewed for topical fit and accepted submissions will be presented as posters. Depending on the final workshop program, selected works may be presented in panels. We plan for these to be an opportunity for researchers to present and discuss their work with the relevant community.
Please submit your papers here: https://softconf.com/coling2025/CM-ND-NLP25/

== Important Dates ==

November 18, 2024: Due date for workshop and shared task papers [1]
December 1-3, 2024: Author response period
December 5, 2024: Notification of acceptance
December 13, 2024: Camera-ready submission deadline
January 19/20, 2025: Workshop date

All deadlines are 11:59pm UTC-12 ("anywhere on Earth").

[1] If you plan to submit a paper but require a deadline extension, please send us an email to michael.roth(a)utn.de and dominik.schlechtweg(a)ims.uni-stuttgart.de

== Organizers ==

Michael Roth, University of Technology Nuremberg
Dominik Schlechtweg, University of Stuttgart

== Program Committee ==

David Alfter, University of Gothenburg
Valerio Basile, University of Turin
Felipe Bravo, University of Chile
Jing Chen, Hong Kong Polytechnic University
Naihao Deng, University of Michigan
Aida Mostafazadeh Davani, Google Research
Diego Frassinelli, University of Konstanz / LMU Munich
Haim Dubossarsky, Queen Mary University
Simon Hengchen, iguanodon.ai & Université de Genève
Sandra Kübler, Indiana University
Andrei Kutuzov, University of Oslo
Elisa Leonardelli, Fondazione Bruno Kessler
Marie-Catherine de Marneffe, UCLouvain
Maja Pavlovic, Queen Mary University
Siyao Peng, LMU Munich
Pauline Sander, University of Stuttgart
Pia Sommerauer, Vrije Universiteit Amsterdam
Nina Tahmasebi, University of Gothenburg
Alexandra Uma
Frank D. Zamora-Reina, University of Chile
Wei Zhao, University of Aberdeen

---
Prof. Michael Roth [he/him]
Natural Language Understanding Lab
University of Technology Nuremberg
Technische Universität Nürnberg


Deadline extended: International Workshop on Nakba Narratives as Language Resources (Nakba-NLP 2025)
by Amal Haddad Haddad [She/her] 16 Nov '24


Deadline extended: 28 November

Keynote Speaker: Ilan Pappe
Panel Discussion: Digital Archives and Cultural Heritage in the LLMs Era

Nakba-NLP 2025
International Workshop on Nakba Narratives as Language Resources
Part of the COLING 2025 Conference (virtual)
January 19, 2025
https://sina.birzeit.edu/nakba-nlp

Enriching the Palestinian narrative and the Nakba with language processing and artificial intelligence techniques (corpora, images, video, news, discourse, bias, social networks, language models, classification, events, ...)

We invite submissions for Nakba-NLP 2025, a workshop dedicated to exploring and preserving Nakba narratives through the application of artificial intelligence, natural language processing, and corpus linguistics. We seek contributions on the following topics:

◈ Digitization of oral and written narratives
◈ Creation and labeling of language corpora and datasets
◈ Digital archives, metadata, and semantic/content mark-up
◈ Annotation tools and annotation guidelines
◈ Document classification, topic modeling, and information retrieval
◈ Named entity recognition for identifying people, places, organizations, and events
◈ Entity linking and relationship extraction
◈ Event detection and event argument extraction
◈ Knowledge Graphs and Linked Data
◈ Vocabularies, dictionaries, and ontologies
◈ Data visualization
◈ Knowledge representation
◈ Machine translation, summarisation, and paraphrasing
◈ Natural Language Generation
◈ Large Language Models
◈ Sentiment analysis and emotional content extraction
◈ Discourse analysis (e.g., bias, offensive language, and misinformation) related to Nakba narratives
◈ Voice & dialogue-based systems; ASR
◈ Palestinian dialects (written and spoken)

Suggested Datasets: a list of datasets can be found here: https://t.ly/00Ul6

Important Dates:
=====================
All deadlines are 11:59 pm UTC-12 (anywhere on Earth).
- Submission Deadline: 28 November 2024
- Notifications of Acceptance: 5 December 2024
- Camera Ready Deadline: 13 December 2024 (cannot be changed)

Organizing Committee:
=====================
- Mustafa Jarrar, Birzeit University, Palestine
- Nizar Habash, New York University, UAE
- Mo El-Haj, Lancaster University, UK
- Zeina Jallad, Harvard Law School, USA
- Camille Mansour, Paris-Sorbonne University, France
- Diana Allan, McGill University, Canada
- Paul Rayson, Lancaster University, UK

Publicity Chairs
=====================
- Amal Haddad, University of Granada, Spain
- Sanad Malaysha, Birzeit University, Palestine

Contact: Nakba-NLP25_coling2025(a)softconf.com


CfP: AAAI 2025 Workshop on Advancing LLM-Based Multi-Agent Collaboration (WMAC-2025)
by Yi Zhang 16 Nov '24


Apologies for cross-posting.

We invite submissions to the AAAI 2025 Workshop on Advancing LLM-Based Multi-Agent Collaboration (WMAC 2025), to be held in Philadelphia during the AAAI 2025 conference (February 25 - March 4, 2025). This full-day workshop seeks to ignite discussion on cutting-edge research areas and challenges associated with multi-agent collaboration driven by large language models (LLMs). As LLMs continue to showcase the ability to coordinate multiple AI agents for complex problem-solving, the workshop will delve into pivotal open research questions that advance the understanding and potential of LLM-based multi-agent collaboration.

We invite submissions on a range of topics, including but not limited to:

* Architectures for multi-agent collaboration, hierarchy, and decision-making
* Cross-agent knowledge sharing
* Inter-agent communication protocols
* Distributed and decentralized agents
* Multi-agent group behavior learning
* Strategic planning in multi-agent problem-solving
* Guardrails and ethical considerations in multi-agent systems

[Important Dates]
Submission deadline: November 24, 2024
Notification of acceptance: December 9, 2024
Workshop date: March 4th, 2025

[Submission Guidelines]
We welcome both short papers (up to 4 pages) and long papers (up to 8 pages) following the AAAI format. Submissions may include recently published work, under-review papers, work in progress, and position papers. All submissions will undergo peer review through a single-blind process. While workshop publication is non-archival, accepted papers will be featured on our website with author permission.

[Submission Site]
Please submit your work via: https://openreview.net/group?id=AAAI.org/2025/Workshop/WMAC

[Workshop Format]
The one-day workshop features invited talks, oral presentations, lightning talks, poster sessions, and a panel discussion. Additional details on speakers and the schedule will be available on our website.

[More Information]
Workshop website: https://multiagents.org/workshop
Contact for inquiries: pc(a)multiagents.org

We look forward to your submissions and to seeing you at the workshop!


November 2024 Newsletter - LDC
by Penn LDC 15 Nov '24


In this newsletter:

Join LDC for membership year 2025
Spring 2025 data scholarship application deadline

New publications:
LORELEI Yoruba Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2024T10>
Samrómur Synthetic <https://catalog.ldc.upenn.edu/LDC2024S12>

________________________________

Join LDC for membership year 2025

It's time to renew your LDC membership for 2025. Current (2024) members who renew their membership before March 3, 2025, will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 3.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 950+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for next year's publications are in progress. Among the expected releases are:

* Iraqi Arabic - English Lexical Database: a set of six interrelated tables (roots, lemmas, wordforms, multi-word expressions, English definitions, example phrases) presenting each Iraqi Arabic word in Arabic script and IPA format, a result of LDC's collaboration with Georgetown University Press to enhance and update three dialectal Arabic dictionaries
* AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, English, Spanish) for information and entity extraction
* 2015 NIST Language Recognition Evaluation Test Set: 164,000+ segments of conversational telephone speech and broadcast narrow band speech in six linguistic varieties (Arabic, Spanish, English, Chinese, Slavic, French) representing 20 languages, used in NIST's 2015 language recognition evaluation
* BOLT CALLFRIEND CALLHOME CTS audio, transcripts and translations: previously unpublished Chinese and Egyptian Arabic telephone conversations from the CALLFRIEND and CALLHOME collections, with transcripts and translations developed by LDC for the DARPA BOLT program
* Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from ancient and modern Chinese texts with syntactic annotation based on sentence constituent analysis, developed by Beijing Normal University and Peking University
* IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations, and queries in multiple languages (e.g., Georgian, Kazakh, Lithuanian)
* LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools in various languages (e.g., Hungarian, Hindi, Amharic, Somali)

For full descriptions of all LDC data sets, browse our Catalog <https://catalog.ldc.upenn.edu/>. Visit Join LDC <https://www.ldc.upenn.edu/members/join-ldc> for details on membership, user accounts and payment.

Spring 2025 data scholarship application deadline

Applications are now being accepted through January 15, 2025, for the Spring 2025 LDC data scholarship program which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships <https://www.ldc.upenn.edu/language-resources/data/data-scholarships> page for more information about program rules and submission requirements.

________________________________

New publications:

LORELEI Yoruba Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2024T10> was developed by LDC and is comprised of approximately 7.2 million words of Yoruba monolingual text, 127,000 Yoruba words translated from English data, and 810,000 words of Yoruba-English parallel text. Approximately 77,000 words were annotated for named entities, over 25,000 words were annotated for full entity (including nominals and pronouns) and simple semantic annotation, and around 10,000 words were annotated for noun phrase chunking. Data was collected from discussion forum, news, reference, social network, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10) <https://catalog.ldc.upenn.edu/LDC2020T10>.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

Samrómur Synthetic <https://catalog.ldc.upenn.edu/LDC2024S12> was developed by the Language and Voice Lab, Reykjavik University <https://lvl.ru.is/> and contains 72 hours of Icelandic synthetic speech, transcripts and metadata.

Source sentences were extracted from the Samrómur platform <https://samromur.is>, comprised of texts and transcripts covering various genres. Text was processed through a text-to-speech system developed by Reykjavik University's Language and Voice Lab to generate speech files. Synthesized speech was created with 44 voices (22 male, 22 female) at four different speed rates for a total of 220 speakers and 62,700 utterances (with 285 sentences/speaker).

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc(a)ldc.upenn.edu
M: 3600 Market St. Suite 810 Philadelphia, PA 19104


Universal Dependencies, release 2.15
by Dan Zeman 15 Nov '24


We are very happy to announce the twenty-first release of annotated treebanks in Universal Dependencies, v2.15, available at https://universaldependencies.org/. Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective (de Marneffe et al., 2021; Nivre et al., 2020). The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary. The *296* treebanks in v2.15 are annotated according to version 2 of the UD guidelines and represent the following *168* languages: Abaza, Abkhaz, Afrikaans, Akkadian, Akuntsu, Albanian, Amharic, Ancient Greek, Ancient Hebrew, Apurina, Arabic, Armenian, Assyrian, Azerbaijani, Bambara, Basque, Bavarian, Beja, Belarusian, Bengali, Bhojpuri, Bororo, Breton, Bulgarian, Buryat, Cantonese, Cappadocian, Catalan, Cebuano, Chinese, Chukchi, Classical Armenian, Classical Chinese, Coptic, Croatian, Czech, Danish, Dutch, Egyptian, English, Erzya, Estonian, Faroese, Finnish, French, Frisian Dutch, Galician, Georgian, German, Gheg, Gothic, Greek, Guajajara, Guarani, Gujarati, Gwichin, Haitian Creole, Hausa, Hebrew, Highland Puebla Nahuatl, Hindi, Hittite, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kaapor, Kangri, Karelian, Karo, Kazakh, Khunsari, Kiche, Komi Permyak, Komi Zyrian, Korean, Kurmanji, Kyrgyz, Latgalian, Latin, Latvian, Ligurian, Lithuanian, Livvi, Low Saxon, Luxembourgish, Macedonian, Madi, Maghrebi Arabic 
French, Makurap, Malayalam, Maltese, Manx, Marathi, Mbya Guarani, Middle French, Moksha, Munduruku, Naija, Nayini, Neapolitan, Nheengatu, North Sami, Northwest Gbaya, Norwegian, Old Church Slavonic, Old East Slavic, Old French, Old Irish, Old Turkish, Ottoman Turkish, Pashto, Paumari, Persian, Pesh, Phrygian, Polish, Pomak, Portuguese, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sinhala, Skolt Sami, Slovak, Slovenian, Soi, South Levantine Arabic, Spanish, Spanish Sign Language, Swedish, Swedish Sign Language, Swiss German, Tagalog, Tamil, Tatar, Teko, Telugu, Telugu English, Thai, Tswana, Tupinamba, Turkish, Turkish German, Ukrainian, Umbrian, Upper Sorbian, Urdu, Uyghur, Uzbek, Veps, Vietnamese, Warlpiri, Welsh, Western Armenian, Western Sierra Puebla Nahuatl, Wolof, Xavante, Xibe, Yakut, Yoruba, Yupik and Zaar. The 168 languages belong to *33* families: Afro-Asiatic, Arawakan, Arawan, Austro-Asiatic, Austronesian, Basque, Bororoan, Chibchan, Chukotko-Kamchatkan, Code switching, Creole, Dravidian, Eskimo-Aleut, Indo-European, Japanese, Kartvelian, Korean, Macro-Je, Mande, Mayan, Mongolic, Na-Dene, Niger-Congo, Northwest Caucasian, Pama-Nyungan, Sign Language, Sino-Tibetan, Tai-Kadai, Tungusic, Tupian, Turkic, Uralic and Uto-Aztecan. Depending on the language, the treebanks range in size from less than 1,000 tokens to over 3 million tokens. We expect the next release to be available in May 2025. 

The size of the following 24 treebanks changed significantly since the last release:

Abkhaz AbNC              : 2444 → 6363
Albanian STAF            : 0 → 3563
Beja Autogramm           : 0 → 11951
Beja NSC                 : 5888 → 0
Cappadocian AMGiC        : 0 → 451
Egyptian UJaen           : 5515 → 14650
Georgian GLC             : 2335 → 60173
Gwichin TueCL            : 0 → 1008
Hebrew IAHLTknesset      : 0 → 67007
Italian Old              : 82644 → 122038
Korean KSL               : 0 → 66989
Kyrgyz KTMU              : 7451 → 23654
Nheengatu CompLin        : 15036 → 19278
Northwest Gbaya Autogramm: 0 → 2417
Old East Slavic RNC      : 95551 → 168064
Old East Slavic Ruthenian: 96803 → 111503
Pashto Sikaram           : 0 → 995
Pesh ChibErgIS           : 0 → 2508
Phrygian KUL             : 0 → 1687
Portuguese DANTEStocks   : 0 → 80997
Slovenian SST            : 76341 → 98393
Spanish Sign Language LSE: 0 → 1393
Ukrainian ParlaMint      : 0 → 51997
Uzbek UT                 : 0 → 5850

In total, the new release contains *1,939,085* sentences, 32,078,118 surface tokens and *32,741,781* syntactic words.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, Elia Ackermann, Noëmi Aepli, Hamid Aghaei, Željko Agić, Amir Ahmadi, Lars Ahrenberg, Chika Kennedy Ajede, Arofat Akhundjanova, Furkan Akkurt, Gabrielė Aleksandravičiūtė, Ika Alfina, Avner Algom, Khalid Alnajjar, Chiara Alzetta, Erik Andersen, Matthew Andrews, Lene Antonsen, Tatsuya Aoyama, Katya Aplonova, Angelina Aquino, Carolina Aragon, Glyd Aranes, Maria Jesus Aranzabe, Bilge Nas Arıcan, Þórunn Arnardóttir, Gashaw Arutie, Jessica Naraiswari Arwidarasti, Masayuki Asahara, Katla Ásgeirsdóttir, Deniz Baran Aslan, Cengiz Asmazoğlu, Luma Ateyah, Furkan Atmaca, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Mariana Avelãs, Elena Badmaeva, Keerthana Balasubramani, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Bryan Khelven da Silva Barbosa, Verginica Barbu Mititelu, Starkaður Barkarson, Rodolfo Basile, Victoria Basmov, Colin Batchelor, John Bauer, Seyyit Talha Bedir, Shabnam Behzad, Juan Belieni, Kepa Bengoetxea, İbrahim Benli, Yifat Ben Moshe, Ansu Berg, Gözde Berk, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė
Bielinskienė, Esma Fatıma Bilgin Taşdemir, Kristín Bjarnadóttir, Verena Blaschke, Rogier Blokland, Nina Böbel, Victoria Bobicev, Loïc Boizou, Johnatan Bonilla, Emanuel Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Anouck Braggaar, António Branco, Kristina Brokaitė, Aljoscha Burchardt, Carmen Cabeza, Natalia Cáceres Arandia, Marisa Campos, Marie Candito, Bernard Caron, Gauthier Caron, Catarina Carvalheiro, Rita Carvalho, Lauren Cassidy, Maria Clara Castro, Sérgio Castro, Tatiana Cavalcanti, Gülşen Cebiroğlu Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Anila Çepani, Slavomír Čéplö, Neslihan Cesur, Savas Cetin, Özlem Çetinoğlu, Fabricio Chalub, Liyanage Chamila, Claudine Chamoreau, Shweta Chauhan, Yifei Chen, Ethan Chi, Taishi Chika, Yongseok Cho, Jinho Choi, Bermet Chontaeva, Jayeol Chun, Juyeon Chung, Alessandra T. Cignarella, Silvie Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Claudia Corbetta, Daniela Corbetta, Francisco Costa, Marine Courtin, Benoît Crabbé, Mihaela Cristescu, Vladimir Cvetkoski, Netanel Dahan, Ingerid Løyning Dale, Philemon Daniel, Elizabeth Davidson, Leonel Figueiredo de Alencar, Mathieu Dehouck, Martina de Laurentiis, Marie-Catherine de Marneffe, Valeria de Paiva, Mehmet Oguz Derin, Elvis de Souza, Arantza Diaz de Ilarraza, Roberto Antonio Díaz Hernández, Carly Dickerson, Ariani Di Felippo, Arawinda Dinakaramani, Elisa Di Nuovo, Bamba Dione, Peter Dirix, Hoa Do, Kaja Dobrovoljc, Caroline Döhmer, Adrian Doyle, Timothy Dozat, Kira Droganova, Magali Sanches Duran, Puneet Dwivedi, Christian Ebert, Hanne Eckhoff, Masaki Eguchi, Sandra Eiche, Roald Eiselen, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Olga Erina, Tomaž Erjavec, Soudabeh Eslami, Farah Essaidi, Aline Etienne, Wograine Evelyn, Sidney Facundes, Richárd Farkas, Ján Faryad, Federica Favero, Jannatul Ferdaousi, Marília Fernanda, Hector Fernandez Alcalde, Amal Fethi, Jennifer Foster, Theodorus Fransen, Cláudia Freitas, Kazunori 
Fujita, Katarína Gajdošová, Daniel Galbraith, Edith Galy, Federica Gamba, Marcos Garcia, José María García-Miguel, Moa Gärdenfors, Tanja Gaustad, Efe Eren Genç, Fabrício Ferraz Gerardi, Kim Gerdes, Luke Gessler, Filip Ginter, Gustavo Godoy, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Gili Goldin, Xavier Gómez Guinovart, Berta González Saavedra, Bernadeta Griciūtė, Matias Grioni, Loïc Grobol, Normunds Grūzītis, Bruno Guillaume, Kirian Guiller, Céline Guillot-Barbance, Tunga Güngör, Vladimir Gurevich, Nizar Habash, Hinrik Hafsteinsson, Jan Hajič, Jan Hajič jr., Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Muhammad Yudistira Hanifmuti, Takahiro Harada, Sam Hardwick, Kim Harris, Naïma Hassert, Dag Haug, Johannes Heinecke, Oliver Hellwig, Felix Hennig, Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Diana Hoefels, Petter Hohle, Nick Howell, Yidi Huang, Marivel Huerta Mendez, Jena Hwang, Takumi Ikeda, Inessa Iliadou, Anton Karl Ingason, Radu Ion, Elena Irimia, Ọlájídé Ishola, Artan Islamaj, Kaoru Ito, Federica Iurescia, Sandra Jagodzińska, Siratun Jannat, Tomáš Jelínek, Apoorva Jha, Katharine Jiang, Mayank Jobanputra, Anders Johannsen, Hildur Jónsdóttir, Fredrik Jørgensen, Markus Juutinen, Hüner Kaşıkara, Nadezhda Kabaeva, Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Neslihan Kara, Ritván Karahóǧa, Andre Kåsen, Tolga Kayadelen, Sarveswaran Kengatharaiyer, Václava Kettnerová, Lilit Kharatyan, Jesse Kirchner, Elena Klementieva, Elena Klyachko, Petr Kocharov, Arne Köhn, Abdullatif Köksal, Kamil Kopacewicz, Timo Korkiakangas, Mehmet Köse, Alexey Koshevoy, Nelda Kote, Natalia Kotsyba, Barbara Kovačić, Jolanta Kovalevskaitė, Emmanuelle Kowner, Simon Krek, Parameswari Krishnamurthy, Sandra Kübler, Adrian Kuqi, Oğuzhan Kuyrukçu, Aslı Kuzgun, Sookyoung Kwak, Kris Kyle, Käbi Laan, Veronika Laippala, Lorenzo Lambertino, Israel Landau, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phương Lê Hồng, Alessandro Lenci, Saran Lertpradit, 
Herman Leung, Maria Levina, Lauren Levine, Cheuk Ying Li, Josie Li, Keying Li, Yixuan Li, Yuan Li, KyungTae Lim, Bruna Lima Padovani, Yi-Ju Jessica Lin, Krister Lindén, Yang Janet Liu, Nikola Ljubešić, Irina Lobzhanidze, Olga Loginova, Lucelene Lopes, Edita Luftiu, Arsenii Lukashevskyi, Stefano Lusito, Anne-Marie Lutgen, Andry Luthfi, Mikko Luukko, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz, Menel Mahamdi, Jean Maillard, Ilya Makarchuk, Aibek Makazhanov, Francesco Mambrini, Michael Mandl, Christopher Manning, Ruli Manurung, Büşra Marşan, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Stella Markantonatou, Héctor Martínez Alonso, Lorena Martín Rodríguez, André Martins, Cláudia Martins, Jan Mašek, Hiroshi Matsuda, Yuji Matsumoto, Alessandro Mazzei, Ryan McDonald, Sarah McGuinness, Maitrey Mehta, Pierre André Ménard, Gustavo Mendonça, Hilla Merhav, Tatiana Merzhevich, Paul Meurer, Niko Miekka, Emilia Milano, Aaron Miller, Yael Minerbi, Karina Mischenkova, Anna Missilä, Cătălin Mititelu, Maria Mitrofan, Yusuke Miyao, AmirHossein Mojiri Foroushani, Judit Molnár, Amirsaeid Moloodi, Simonetta Montemagni, Amir More, Laura Moreno Romero, Giovanni Moretti, Shinsuke Mori, Tomohiko Morioka, Shigeki Moro, Bjartur Mortensen, Bohdan Moskalevskyi, Kadri Muischnek, Robert Munro, Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani, Mariam Nakhlé, Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Manuela Nevaci, Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Yoshihiro Nikaido, Vitaly Nikolaev, Rattima Nitisaroj, Victor Norrman, Alireza Nourian, Maria das Graças Volpe Nunes, Hanna Nurmi, Stina Ojala, Atul Kr. 
Ojha, Hulda Óladóttir, Adédayọ̀ Olúòkun, Mai Omura, Emeka Onwuegbuzia, Noam Ordan, Petya Osenova, Robert Östling, Annika Ott, Lilja Øvrelid, Şaziye Betül Özateş, Merve Özçelik, Arzucan Özgür, Balkız Öztürk Başaran, Teresa Paccosi, Alessio Palmero Aprosio, Anastasia Panova, Thiago Alexandre Salgueiro Pardo, Hyunji Hayley Park, Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Giulia Pedonese, Oggi Peeters, Angelika Peljak-Łapińska, Siyao Peng, Siyao Logan Peng, Rita Pereira, Sílvia Pereira, Cenel-Augusto Perez, Natalia Perkova, Guy Perrier, Slav Petrov, Daria Petrova, Andrea Peverelli, Jason Phelan, Claudel Pierre-Louis, Jussi Piitulainen, Yuval Pinter, Clara Pinto, Rodrigo Pintucci, Tommi A Pirinen, Emily Pitler, Magdalena Plamada, Barbara Plank, Alistair Plum, Thierry Poibeau, Larisa Ponomareva, Martin Popel, Lauma Pretkalniņa, Rigardt Pretorius, Sophie Prévost, Prokopis Prokopidis, Adam Przepiórkowski, Robert Pugh, Tiina Puolakainen, Christoph Purschke, Sampo Pyysalo, Peng Qi, Andreia Querido, Andriela Rääbis, Ella Rabinovich, Alexandre Rademaker, Mizanur Rahoman, Taraka Rama, Loganathan Ramasamy, Carlos Ramisch, Joana Ramos, Fam Rashel, Mohammad Sadegh Rasooli, Vinit Ravishankar, Livy Real, Petru Rebeja, Siva Reddy, Mathilde Regnault, Georg Rehm, Arij Riabi, Ivan Riabov, Michael Rießler, Erika Rimkutė, Larissa Rinaldi, Laura Rituma, Putri Rizqiyah, Luisa Rocha, Eiríkur Rögnvaldsson, Ivan Roksandic, Norton Trevisan Roman, Mykhailo Romanenko, Rudolf Rosa, Valentin Roșca, Paulette Roulon, Davide Rovati, Ben Rozonoyer, Olga Rudina, Jack Rueter, Paolo Ruffolo, Kristján Rúnarsson, Rozana Rushiti, Shoval Sadde, Pegah Safari, Aleksi Sahala, Shadi Saleh, Alessio Salomoni, Tanja Samardžić, Konstantinos Sampanis, Stephanie Samson, Xulia Sánchez-Rodríguez, Manuela Sanguinetti, Ezgi Sanıyar, Dage Särg, Marta Sartor, Albina Sarymsakova, Mitsuya Sasaki, Baiba Saulīte, Agata Savary, Yanin Sawanakunanon, Shefali Saxena, Kevin Scannell, 
Salvatore Scarlata, Emmanuel Schang, Nathan Schneider, Sebastian Schuster, Lane Schwartz, Djamé Seddah, Wolfgang Seeker, Sven Sellmer, Mojgan Seraji, Syeda Shahzadi, Mo Shen, Atsuko Shimada, Gyu-Ho Shin, Hiroyuki Shirasu, Yana Shishkina, Muh Shohibussirri, Maria Shvedova, Janine Siewert, Einar Freyr Sigurðsson, João Silva, Aline Silveira, Natalia Silveira, Sara Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Haukur Barri Símonarson, Kiril Simov, Dmitri Sitchinava, Ted Sither, Aaron Smith, Isabela Soares-Bastos, Per Erik Solberg, Barbara Sonnenhauser, Shafi Sourov, Rachele Sprugnoli, Vivian Stamou, Steinþór Steingrímsson, Antonio Stella, Abishek Stephen, Milan Straka, Omer Strass, Emmett Strickland, Jana Strnadová, Alane Suhr, Yogi Lesmana Sulestio, Umut Sulubacak, Hakyung Sung, Shingo Suzuki, Daniel Swanson, Zsolt Szántó, Chihiro Taguchi, Dima Taji, Luigi Talamo, Fabio Tamburini, Mary Ann C. Tan, Takaaki Tanaka, Dipta Tanaya, Mirko Tavoni, Samson Tella, Isabelle Tellier, Marinella Testori, Guillaume Thomas, Tarık Emre Tıraş, Sara Tonelli, Liisi Torga, Marsida Toska, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Utku Türk, Francis Tyers, Sveinbjörn Þórðarson, Vilhjálmur Þorsteinsson, Sumire Uematsu, Roman Untilov, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit, Andrius Utka, Elena Vagnoni, Sowmya Vajjala, Socrates Vak, Rob van der Goot, Martine Vanhove, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Uliana Vedenina, Giulia Venturi, Eric Villemonte de la Clergerie, Veronika Vincze, Anishka Vissamsetty, Natalia Vlasova, Eleni Vligouridou, Aya Wakasa, Joel C. 
Wallenberg, Lars Wallin, Abigail Walsh, John Wang, Jonathan North Washington, Leonie Weissweiler, Maximilan Wendt, Paul Widmer, Shira Wigderson, Sri Hartati Wijono, Vanessa Berwanger Wille, Seyi Williams, Miriam Winkler, Shuly Wintner, Mats Wirén, Christian Wittern, Tsegay Woldemariam, Tak-sum Wong, Alina Wróblewska, Qishen Wu, Mary Yako, Kayo Yamashita, Naoki Yamazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Arife Betül Yenice, Enes Yılandiloğlu, Olcay Taner Yıldız, Zhuoran Yu, Arlisa Yuliawati, Zdeněk Žabokrtský, Shorouq Zahra, Amir Zeldes, He Zhou, Hanzhi Zhu, Yilun Zhu, Anna Zhuravleva, Rayan Ziane, Artūrs Znotiņš References Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, Daniel Zeman. 2021. Universal Dependencies. In Computational Linguistics 47:2, pp. 255–308. Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman. 2020. Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection. In Proceedings of LREC. -------------------------------------------------------------------------------- Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC. Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In COLING Workshop on Cross-framework and Cross-domain Parser Evaluation. Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher Manning. 2014. Universal Stanford Dependencies: A cross-linguistic typology. In Proceedings of LREC. Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. 
In Proceedings of LREC. Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of LREC. Daniel Zeman. 2008. Reusable Tagset Conversion Using Tagset Drivers. In Proceedings of LREC.


1st CfP: 9th SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
by Anna Kazantseva 14 Nov '24


LaTeCH-CLfL 2025: The 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

to be held on May 3rd or 4th, 2025 in conjunction with NAACL 2025 <https://2025.naacl.org/> in Albuquerque, NM.

https://sighum.wordpress.com/latech-clfl-2025/

First Call for Papers (with apologies for cross-posting)

Organisers: Diego Alves, Yuri Bizzoni, Stefania Degaetano-Ortlieb, Anna Kazantseva, Janis Pagel, Stan Szpakowicz

LaTeCH-CLfL 2025 is the ninth in a series of meetings for NLP researchers who work with data from the broadly understood arts, humanities and social sciences, and for specialists in those disciplines who apply NLP techniques in their work. The workshop continues a long tradition of annual meetings. The SIGHUM Workshops on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH) ran ten times in 2007-2016. The five Workshops on Computational Linguistics for Literature (CLfL) took place in 2012-2016. The first eight joint workshops (LaTeCH-CLfL) were held in 2017-2024.

Topics and content

In the Humanities, Social Sciences, Cultural Heritage and literary communities, there is increasing interest in, and demand for, NLP methods for semantic and structural annotation, intelligent linking, discovery, querying, cleaning and visualization of both primary and secondary data. This is true even of primarily non-textual collections, given that text is also the pervasive medium for metadata. Such applications pose new challenges for NLP research: noisy, non-standard textual or multi-modal input, historical languages, vague research concepts, multilingual parts within one document, and so on. Digital resources often have insufficient coverage; resource-intensive methods require (semi-)automatic processing tools and domain adaptation, or intense manual effort (e.g., annotation).
Literary texts bring their own problems, because navigating this form of creative expression requires more than the typical information-seeking tools. Examples of advanced tasks include the study of literature of a certain period, author or sub-genre, recognition of certain literary devices, or quantitative analysis of poetry.

Topics of interest include, but are not limited to, the following:

• adaptation of NLP tools to Cultural Heritage, Social Sciences, Humanities and literature;
• automatic error detection and cleaning of textual data;
• complex annotation schemas, tools and interfaces;
• creation (fully- or semi-automatic) of semantic resources;
• creation and analysis of social networks of literary characters;
• discourse and narrative analysis/modelling, notably in literature;
• emotion analysis for the humanities and for literature;
• generation of literary narrative, dialogue or poetry;
• identification and analysis of literary genres;
• interpretability of large language model output for DH-related tasks (explainable AI);
• linking and retrieving information from different sources, media, and domains;
• low-resource and historical language processing;
• modelling dialogue literary style for generation;
• modelling of information and knowledge in the Humanities, Social Sciences, and Cultural Heritage;
• profiling and authorship attribution;
• search for scientific and/or scholarly literature;
• work with linguistic variation and non-standard or historical use of language.

Information for authors

We invite papers on original, unpublished work in the topic areas of the workshop. In addition to long papers, we will consider short papers and system descriptions (demos). We also welcome position papers. Please find submission requirements on the website https://sighum.wordpress.com/latech-clfl-2025/.
Important dates (tentative)

Workshop paper due: January 30, 2025
Notification of acceptance: March 1, 2025
Camera-ready papers due: March 10, 2025
Workshop date: May 3rd or 4th, 2025

More on the organizers

Diego Alves, Language Science and Technology, Saarland University
Yuri Bizzoni, Center for Humanities Computing / School for Communication and Culture, Århus University
Stefania Degaetano-Ortlieb, Language Science and Technology, Saarland University
Anna Kazantseva, National Research Council Canada
Janis Pagel, Department of Digital Humanities, University of Cologne
Stan Szpakowicz, School of Electrical Engineering and Computer Science, University of Ottawa

Contact: latech-clfl@googlegroups.com


Call for Papers: 1st Workshop for Non-Arabic Languages Using Arabic Script
by Amal Haddad 14 Nov '24


AbjadNLP 2025 [1]
The 1st Workshop on NLP for Languages Using Arabic Script
https://wp.lancs.ac.uk/abjad/cfp/

CALL FOR PAPERS: THE 1ST WORKSHOP ON NLP FOR LANGUAGES USING ARABIC SCRIPT (ABJADNLP 2025)

Co-located with COLING 2025 Conference, Abu Dhabi, UAE (19-20 January 2025)

Submission URL [2]

AbjadNLP is dedicated to advancing innovation and gaining deeper insights into Natural Language Processing (NLP) for languages that use the Arabic script. Our primary focus is on Abjad and Ajami languages that utilise the Arabic script or its variations. Traditionally associated with Semitic languages, Abjad scripts represent consonants in every syllable. In contrast, Ajami scripts denote the alphabetic use of the Arabic script in various African contexts, representing non-Arabic languages. We are interested in research on languages that fall under the Abjad or Ajami categories that use the Arabic script or any variations of it.

We invite contributions, discussions, and explorations that delve deep into the unique linguistic structures, resources, challenges, and untapped potential presented by Abjad and Ajami languages within the realm of NLP and language resources. Our goal is to create synergies among researchers by addressing the diverse phenomena and challenges inherent in these rich linguistic traditions.

The workshop is proud to highlight our connections with the Masakhane NLP community and collaborations with institutions worldwide, such as COMSATS on Urdu, and the long-standing UCREL NLP Group at Lancaster University, whose work encompasses over 20 languages worldwide, including Abjad and Ajami languages.

Note: We chose the name Abjad for simplicity, but our focus includes Abjad and other languages that have adopted the Arabic and Perso-Arabic scripts, as well as Ajami languages. We acknowledge that Sorani Kurdish, when written in Arabic script, follows an alphabet style rather than an Abjad style.
TOPICS OF INTEREST:

* Core Technologies: morphological analysis, disambiguation, tokenisation, POS tagging, named entity detection, chunking, parsing, semantic role labelling, sentiment analysis, language modelling, etc.
* Applications: machine translation, speech recognition, speech synthesis, optical character recognition, assistive technologies, social media, etc.
* Resources and Tools: dictionaries, annotated data, corpora, orthography descriptions, font technology, glyph rendering, text input methodologies, spell-checking, speech-to-text solutions, BLARK descriptions, open access corpora.
* Cultural and Sociolinguistic Considerations: text processing, transliteration challenges and solutions, cultural contexts in NLP applications.

SUBMISSION GUIDELINES:

We follow the COLING 2025 standards for submission format and guidelines. Submissions should conform to the following types:

* Long papers: Up to eight (8) pages, presenting substantial, original, completed, and unpublished work.
* Short papers: Up to four (4) pages, describing a small focused contribution, negative results, system demonstrations, etc.
KEY DATES:

* 1st Call for Papers Announcement: 16 July 2024
* 2nd Call for Papers Announcement: 16 August 2024
* Paper Submission Deadline: 2 December 2024
* Notification of Paper Acceptance: 6 December 2024
* Camera-ready Paper Deadline: 13 December 2024
* Workshop Date: 19 or 20 January 2025

ORGANISING COMMITTEE:

General Chair: Mo El-Haj, Lancaster University

Programme Chairs:
* Hugh Paterson III, Collaborative Scholar
* Saad Ezzini, Lancaster University
* Ignatius Ezeani, Lancaster University

Review Committee:
* Mahum Hayat Khan, University of La Rioja
* Muhammad Sharjeel, COMSATS University Islamabad

Publication Chair: Sina Ahmadi, University of Zurich

Publicity Chairs:
* Cynthia Amol, Maseno University
* Amal Haddad Haddad, University of Granada
* Jaleh Delfani, University of Surrey

Advisory Committee:
* Ruslan Mitkov, Lancaster University
* Paul Rayson, Lancaster University

--
Amal Haddad Haddad (She/her)
Facultad de Traducción e Interpretación
Universidad de Granada | https://www.ugr.es/personal/amal-haddad-haddad
Lexicon Research Group | http://lexicon.ugr.es/haddad
Co-Convenor, BAAL SIG 'Humans, Machines, Language' | https://r.jyu.fi/humala
Event Coordinator, BAAL SIG 'Language, Learning and Teaching'

Links:
------
[1] https://wp.lancs.ac.uk/abjad/
[2] https://softconf.com/coling2025/AbjadNLP25/

