May 2023 - Corpora - ELRA lists

3-year PhD position in Computational Models of Semantic Memory and its Acquisition (Inria and University of Lille, France)
by Pascal Denis 13 May '25


Hello,
Could you please distribute the following job offer? Thanks.
Best, Pascal
-------------------------------------------------------------------------------------
3-year PhD position in Computational Models of Semantic Memory and its Acquisition (Inria and University of Lille, France)
We invite applications for a 3-year PhD position at the University of Lille in the context of the recently funded research project "COMANCHE" (Computational Models of Lexical Meaning and Change). The position is funded by Inria, the French national research institute in Computer Science and Applied Mathematics.
COMANCHE proposes to transfer and adapt neural word embedding algorithms to model the acquisition and evolution of word meaning, by comparing them with linguistic theories on language acquisition and language evolution. At the intersection between Natural Language Processing, psycholinguistics and historical linguistics, this project intends to validate or revise some of these theories, while also developing computational models that are less data-hungry and less computationally intensive, as they exploit new inductive biases inspired by these disciplines.
The first strand of the project, on which the successful candidate will work, focuses on the development of computational models of semantic memory and its acquisition. Two main research directions will be pursued. On the one hand, we will compare the structural properties associated with different semantic spaces derived from word embedding algorithms to those found in human semantic memory as reflected in behavioral data (such as typicality norms) as well as brain imaging data. The latter data will then be used as additional supervision to inject more hierarchical structure into the learned semantic spaces.
On the other hand, we intend to experiment with training regimes for word embedding algorithms that are closer to those of humans when they acquire language, controlling the quantity as well as the linguistic complexity of the inputs fed to the learning algorithms through the use of longitudinal and child-directed speech corpora (e.g., CHILDES, Colaje). In both cases, both English and French data will be considered.
The successful candidate holds a Master's degree in computational linguistics, computer science or cognitive science and has prior experience with word embedding models. Furthermore, the candidate will have strong programming skills and expertise in machine learning approaches, and will be eager to work across languages.
The position is affiliated with the MAGNET team at Inria, Lille [1] as well as with the SCALAB group at University of Lille [2] in an effort to strengthen collaborations between these two groups, and ultimately foster cross-fertilization between Natural Language Processing and Psycholinguistics.
Applications will be considered until the position is filled. However, you are encouraged to apply early as we shall start processing the applications as and when they are received. Applications, written in English or French, should include a brief cover letter with research interests and vision, a CV (including your contact address, work experience, publications), and contact information for at least 2 referees. Applications (and questions) should be sent to Angèle Brunellière (angele.brunelliere(a)univ-lille.fr) and Pascal Denis (pascal.denis(a)inria.fr).
The starting date of the position is 1 October 2022 or soon thereafter, for a total of 3 full years.
Best regards, Angèle Brunellière and Pascal Denis
[1] https://team.inria.fr/magnet/
[2] https://scalab.univ-lille.fr/
-- Pascal ---- For independent, transparent and rigorous evaluation! I support Inria's Commission d'Évaluation.
---- +++++++++++++++++++++++++++++++++++++++++++++++ Pascal Denis Equipe MAGNET, INRIA Lille Nord Europe Bâtiment B, Avenue Heloïse Parc scientifique de la Haute Borne 59650 Villeneuve d'Ascq Tel: ++33 3 59 35 87 24 Url: http://researchers.lille.inria.fr/~pdenis/ +++++++++++++++++++++++++++++++++++++++++++++++


NLP4CALL 2023 Final call for papers
by David Alfter 22 Aug '24


== 12th NLP4CALL, Tórshavn, Faroe Islands ==
The workshop series on Natural Language Processing (NLP) for Computer-Assisted Language Learning (NLP4CALL) is a meeting place for researchers working on the integration of Natural Language Processing and Speech Technologies in CALL systems and exploring the theoretical and methodological issues arising in this connection. The latter include, among others, drawing on insights from Second Language Acquisition (SLA) research, on the one hand, and promoting the development of “Computational SLA” through setting up Second Language research infrastructure(s), on the other.
The intersection of Natural Language Processing (or Language Technology / Computational Linguistics) and Speech Technology with Computer-Assisted Language Learning (CALL) brings “understanding” of language to CALL tools, thus making CALL intelligent. This fact has given this area of research its name: Intelligent CALL, or ICALL. As the definition suggests, apart from having excellent knowledge of Natural Language Processing and/or Speech Technology, ICALL researchers need good insights into second language acquisition theories and practices, as well as knowledge of second language pedagogy and didactics. This workshop therefore invites a wide range of ICALL-relevant research, including studies where NLP-enriched tools are used for testing SLA and pedagogical theories, and vice versa, where SLA theories, pedagogical practices or empirical data are modeled in ICALL tools.
The NLP4CALL workshop series is aimed at bringing together competences from these areas for sharing experiences and brainstorming around the future of the field.
We welcome papers:
- that describe research directly aimed at ICALL;
- that demonstrate actual or discuss the potential use of existing Language and Speech Technologies or resources for language learning;
- that describe the ongoing development of resources and tools with potential usage in ICALL, either directly in interactive applications, or indirectly in materials, application or curriculum development, e.g. learning material generation, assessment of learner texts and responses, individualized learning solutions, provision of feedback;
- that discuss challenges and/or research agenda for ICALL;
- that describe empirical studies on language learner data.
This year a special focus is given to work done on error detection/correction and feedback generation. We encourage paper presentations and software demonstrations describing the above-mentioned themes primarily, but not exclusively, for the Nordic languages.
==Shared task==
NEW for this year is the MultiGED shared task on token-level error detection for L2 Czech, English, German, Italian and Swedish, organized by the Computational SLA working group. For more information, please see the Shared Task website: https://github.com/spraakbanken/multiged-2023
==Invited speakers==
This year, we have the pleasure to announce two invited talks. The first talk is given by Marije Michel from the University of Amsterdam. The second talk is given by Pierre Lison from the Norwegian Computing Center.
==Submission information==
Authors are invited to submit long papers (8-12 pages) or short papers (4-7 pages), page count not including references. We will be using the NLP4CALL template for the workshop this year.
The author kit can be accessed here, alternatively on Overleaf: <https://spraakbanken.gu.se/sites/default/files/2023/NLP4CALL%20workshop%20t…> <https://spraakbanken.gu.se/sites/default/files/2023/nlp4call%20template.doc> <https://www.overleaf.com/latex/templates/nlp4call-workshop-template/qqqzqqy…>
Submissions will be managed through the electronic conference management system EasyChair <https://easychair.org/conferences/?conf=nlp4call2023>. Papers must be submitted digitally through the conference management system, in PDF format. Final camera-ready versions of accepted papers will be given an additional page to address reviewer comments.
Papers should describe original unpublished work or work-in-progress. Papers will be peer reviewed by at least two members of the program committee in a double-blind fashion. All accepted papers will be collected into a proceedings volume to be submitted for publication in the NEALT Proceedings Series (Linköping Electronic Conference Proceedings) and, additionally, double-published through the ACL Anthology, following experiences from the previous NLP4CALL editions (<https://www.aclweb.org/anthology/venues/nlp4call/>).
==Important dates==
03 April 2023: paper submission deadline
21 April 2023: notification of acceptance
01 May 2023: camera-ready papers for publication
22 May 2023: workshop date
==Organizers==
David Alfter (1), Elena Volodina (2), Thomas François (3), Arne Jönsson (4), Evelina Rennes (4)
(1) Gothenburg Research Infrastructure for Digital Humanities, Department of Literature, History of Ideas, and Religion, University of Gothenburg, Sweden
(2) Språkbanken, Department of Swedish, Multilingualism, Language Technology, University of Gothenburg, Sweden
(3) CENTAL, Institute for Language and Communication, Université Catholique de Louvain, Belgium
(4) Department of Computer and Information Science, Linköping University, Sweden
==Contact==
For any questions, please contact David Alfter, david.alfter(a)gu.se
For further information, see the workshop website <https://spraakbanken.gu.se/en/research/themes/icall/nlp4call-workshop-serie…>
Follow us on Twitter @NLP4CALL <https://twitter.com/NLP4CALL/>


3-year PhD position in Automatic Argumentation Mining in French Legal Decisions (Inria Lille, University of Lille, and LexisNexis France)
by Pascal Denis 10 Nov '23


Hi there,
Could you please distribute the following job offer? Thanks.
Best, Pascal
-------------------------------------------------------------------------------------
We invite applications for a 3-year PhD position co-funded by Inria, the French national research institute in Computer Science and Applied Mathematics, and LexisNexis France, leader of legal information in France and subsidiary of the RELX Group.
The overall objective of this project is to develop an automated system for detecting argumentation structures in French legal decisions, using recent machine learning-based approaches (i.e. deep learning approaches). In the general case, these structures take the form of a directed labeled graph, whose nodes are the elements of the text (propositions or groups of propositions, not necessarily contiguous) which serve as components of the argument, and whose edges are relations that signal the argumentative connection between them (e.g., support, offensive). By revealing the argumentation structure behind legal decisions, such a system will provide a crucial milestone towards their detailed understanding and their use by legal professionals, and above all will contribute to greater transparency of justice.
The main challenges and milestones of this project start with the creation and release of a large-scale dataset of French legal decisions annotated with argumentation structures. To minimize the manual annotation effort, we will resort to semi-supervised and transfer learning techniques to leverage existing argument mining corpora, such as the European Court of Human Rights (ECHR) corpus, as well as annotations already started by LexisNexis. Another promising research direction, which is likely to improve over state-of-the-art approaches, is to better model the dependencies between the different sub-tasks (argument span detection, argument typing, etc.) instead of learning these tasks independently.
A third research avenue is to find innovative ways to inject domain knowledge (in particular the rich legal ontology developed by LexisNexis) to enrich the representations used in these models. Finally, we would like to take advantage of other discourse structures, such as coreference and rhetorical relations, conceived as auxiliary tasks in a multi-tasking architecture.
The successful candidate holds a Master's degree in computational linguistics, natural language processing or machine learning, ideally with prior experience in legal document processing and discourse processing. Furthermore, the candidate will have strong programming skills and expertise in machine learning approaches, and will be eager to work at the interplay between academia and industry.
The position is affiliated with MAGNET [1], a research group at Inria, Lille, which has expertise in Machine Learning and Natural Language Processing, in particular Discourse Processing. The PhD student will also work in close collaboration with the R&D team at LexisNexis France [2], who will provide their expertise in the legal domain and the data they have collected.
Applications will be considered until the position is filled. However, you are encouraged to apply early as we shall start processing the applications as and when they are received. Applications, written in English or French, should include a brief cover letter with research interests and vision, a CV (including your contact address, work experience, publications), and contact information for at least 2 referees. Applications (and questions) should be sent to Pascal Denis (pascal.denis(a)inria.fr).
The starting date of the position is 1 November 2022 or soon thereafter, for a total of 3 full years.
Best regards, Pascal Denis
[1] https://team.inria.fr/magnet/
[2] https://www.lexisnexis.fr/
-- Pascal ---- For independent, transparent and rigorous evaluation! I support Inria's Commission d'Évaluation.
---- +++++++++++++++++++++++++++++++++++++++++++++++ Pascal Denis Equipe MAGNET, INRIA Lille Nord Europe Bâtiment B, Avenue Heloïse Parc scientifique de la Haute Borne 59650 Villeneuve d'Ascq Tel: ++33 3 59 35 87 24 Url: http://researchers.lille.inria.fr/~pdenis/ +++++++++++++++++++++++++++++++++++++++++++++++
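For readers unfamiliar with the representation mentioned in the announcement above (argumentation structures as a directed labeled graph whose nodes are possibly non-contiguous text elements), here is a minimal Python sketch of one way such a structure could be stored. It is only an illustration: the class names, field names and the character-offset encoding of spans are assumptions made for this example and are not taken from the project or from LexisNexis.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """One argument component: a proposition or group of propositions."""

    comp_id: str
    # Character spans (start, end) in the decision text. Several spans
    # allow a component that is not contiguous. (Hypothetical encoding.)
    spans: tuple[tuple[int, int], ...]


@dataclass
class ArgumentGraph:
    """Directed labeled graph over argument components."""

    components: dict[str, Component] = field(default_factory=dict)
    # Each edge is (source id, target id, relation label).
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_component(self, comp: Component) -> None:
        self.components[comp.comp_id] = comp

    def add_relation(self, source: str, target: str, label: str) -> None:
        # Both endpoints must already be registered components.
        if source not in self.components or target not in self.components:
            raise KeyError("unknown component id")
        self.edges.append((source, target, label))

    def incoming(self, target: str, label: str) -> list[str]:
        """Ids of components linked to `target` by relation `label`."""
        return [s for s, t, rel in self.edges if t == target and rel == label]


# Usage: a two-span premise that supports a claim.
graph = ArgumentGraph()
graph.add_component(Component("claim", ((0, 40),)))
graph.add_component(Component("premise", ((45, 80), (120, 150))))
graph.add_relation("premise", "claim", "support")
```

Relation labels are kept as free-form strings here so that any label inventory can be plugged in; a real annotation scheme would presumably fix a closed set.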


Core metadata scheme for learner corpora - feedback needed!
by Magali Paquot 30 Oct '23


Dear colleagues,
Last month, we shared the result of our collaborative work on a core metadata scheme for learner corpora with LCR2022 participants. Our proposal builds on Granger and Paquot's (2017) first attempt to design such a scheme. During our presentation, we explained the rationale for expanding on the initial proposal and discussed selected aspects of the revised scheme. Our proposal is available at https://docs.google.com/spreadsheets/d/1-RbX5iUCUtCBkZU9Rfk-kv-Vzc--F-eUW2O…
We firmly believe that our efforts to develop a core metadata scheme for learner corpora will only be successful to the extent that (1) the LCR community is given the opportunity to engage with our work in various ways (provide feedback on the general structure of the scheme, the list of variables that we identified as core and their operationalization; test the metadata on other learner corpora; use the scheme to start a new corpus compilation, etc.) and (2) the core metadata scheme is the result of truly collaborative work.
As mentioned at LCR2022, we will be collecting feedback on the metadata scheme until the end of October. The online feedback form is available at: https://docs.google.com/document/d/1NeDUuxGJlPSJI9wHVA1xgGM-aV8jXTa8Qlb45K-…
We'd like to thank all the colleagues who already got back to us (at LCR2022, by email or via the online form). We also thank them for their appreciation and enthusiasm for our work! We'd also like to encourage more colleagues (and particularly those of you who have experience in learner corpus compilation) to provide feedback! We need help in finalizing the core metadata scheme to make sure that it can be applied in all learner corpus compilation contexts. In short, we need you to make sure the scheme meets the needs of the LCR community at large.
With very best wishes, Magali Paquot (also on behalf of Alexander König, Jennifer-Carmen Frey, and Egon W. Stemle) Reference Granger, S. & M. Paquot (2017). Towards standardization of metadata for L2 corpora. Invited talk at the CLARIN workshop on Interoperability of Second Language Resources and Tools, 6-8 December 2017, University of Gothenburg, Sweden. Dr. Magali Paquot Centre for English Corpus Linguistics Institut Langage et Communication UCLouvain https://perso.uclouvain.be/magali.paquot/


Call For Participation - PLABA @ TAC 2023
by Ondov, Brian (NIH/NLM/LHC) [F] 20 Jul '23


We are pleased to announce the inaugural offering of the Plain Language Adaptation of Biomedical Abstracts (PLABA) track, as part of the 2023 Text Analysis Conference (TAC) hosted by the U.S. National Institute of Standards and Technology (NIST). This track is an opportunity to showcase your cutting-edge research on an important topic, and to take advantage of large amounts of expert-annotated data and manual evaluation.
Background: Deficits of Health Literacy are linked to worse outcomes and drive health disparities. Though unprecedented amounts of biomedical knowledge are available online, patients and caregivers face a type of “language barrier” when confronted with jargon and academic writing. Advances in language modeling have improved plain language generation, but the task of automatically and accurately adapting biomedical text for a general audience has thus far lacked high-quality, standardized benchmarks.
Task: Systems will adapt biomedical abstracts to plain language. This includes substituting medical jargon, providing explanations for necessary terms, simplifying sentences, and other modifications. The training set is the publicly available PLABA dataset <https://doi.org/10.1038%2Fs41597-022-01920-3>, which contains 750 abstracts with manual, sentence-aligned adaptations for each, totaling more than 7k sentence pairs with document context.
Evaluation: Participating systems will be evaluated on 400 held-out abstracts, manually adapted four-fold by different annotators for robust automatic metrics. Additionally, a subset of system output will be manually evaluated along several axes to ensure that the adaptations are accurate and faithful to the original, which is crucial for the biomedical domain.
URL: https://bionlp.nlm.nih.gov/plaba2023/
Mailing list: https://groups.google.com/g/plaba2023
Key dates:
Jul 19: Evaluation data released
Aug 16: Submissions due
Oct 18: Results posted
We look forward to your submissions.


Six PhD students and one postdoc: Neurosymbolic Models of Language, Vision, and Action - Saarbrücken, Germany
by Alexander Koller 16 Jul '23


The Research Training Group 2853 “Neuroexplicit Models of Language, Vision, and Action” is looking for 6 PhD students and 1 postdoc, starting October 2023 or later.
Neuroexplicit models combine neural and human-interpretable (“explicit”) models in order to overcome the limitations that each model class has separately. They include neurosymbolic models, which combine neural and symbolic models, but also e.g. combinations of neural and physics-based models. In the RTG, we will improve the state of the art in natural language processing (“Language”), computer vision (“Vision”), and planning and reinforcement learning (“Action”) through the use of neuroexplicit models and investigate the cross-cutting design principles of effective neuroexplicit models (“Foundations”).
The RTG is scheduled to grow to a total of 24 PhD students and one postdoc by 2025. Through the inclusion of ~20 further PhD students and postdocs funded from other sources, it will be one of the largest research centers on neuroexplicit or neurosymbolic models in the world. The RTG brings together researchers at Saarland University, the Max Planck Institute for Informatics, the Max Planck Institute for Software Systems, the CISPA Helmholtz Center for Information Security, and the German Research Center for Artificial Intelligence (DFKI). All of these institutions are colocated on the same campus in Saarbrücken, Germany.
The positions are funded as follows:
• PhD students will be funded for up to four years at the TV-L E13 100% pay scale. You should have or be about to complete an MSc degree in computer science or a related field and have demonstrated expertise in one of the research areas of the RTG, e.g. through an excellent Master’s thesis or relevant publications.
• The postdoc will initially be funded for three years, with the possibility of extension up to five years, at the TV-L E13 100% pay scale. As the RTG postdoc, you will pursue your own research agenda in the field of neuroexplicit models and work with the PhD students to identify and pursue opportunities for collaborative research. You should have or be about to complete a PhD in computer science or a related field and have demonstrated your expertise in one or more of the RTG’s research areas through publications in top venues.
The RTG is part of the Saarland Informatics Campus (SIC), one of the leading centers for research in computer science, artificial intelligence, and natural language processing in Europe. The Saarland Informatics Campus brings together 900 researchers and 2500 students from 81 countries. The CISPA Helmholtz Center, located on the same campus, is home to an additional 350 researchers and on track to grow to 800 by 2026. Researchers at SIC and CISPA are part of the ELLIS network and have been awarded more than 35 ERC grants.
Each PhD student in the RTG will be jointly supervised by two PhD advisors from the list of Principal Investigators below. Each student will freely define their own research topic; we encourage the choice of topics that cross the traditional boundaries of research fields. Students may be affiliated with Saarland University or with one of the participating institutes.
Vera Demberg, Saarland University - Computational Linguistics
Jörg Hoffmann, Saarland University - AI Planning
Eddy Ilg, Saarland University - Computer Vision, Machine Learning
Dietrich Klakow, Saarland University - Natural Language Processing
Alexander Koller, Saarland University - Computational Linguistics
Bernt Schiele, MPI for Informatics - Computer Vision, Machine Learning
Philipp Slusallek, DFKI and Saarland University - Computer Graphics, Artificial Intelligence
Christian Theobalt, MPI for Informatics - Visual Computing, Machine Learning
Mariya Toneva, MPI for Software Systems - Computational Neuroscience, Machine Learning
Isabel Valera, Saarland University - Machine Learning
Jilles Vreeken, CISPA - Machine Learning, Causality
Joachim Weickert, Saarland University - Mathematical Data Analysis
Verena Wolf, DFKI and Saarland University - Modeling and Simulation, Reinforcement Learning
Ellie Pavlick, Brown University and Google AI, will join us regularly as a Mercator Fellow.
Please send your application by 31 May 2023 to bewerbung(a)uni-saarland.de. Include the reference number W2298 for the postdoc position and the reference number W2299 for the PhD positions. We aim to conduct job interviews in July (for a start in October) and September (for a later start).
The legally binding version of this job ad is at https://www.uni-saarland.de/fileadmin/upload/verwaltung/stellen/Wissenschaf… (postdoc) and https://www.uni-saarland.de/fileadmin/upload/verwaltung/stellen/Wissenschaf… (PhD), respectively.
For details on what materials to submit with your application and all other information about the RTG, please see our website: https://www.neuroexplicit.org/jobs/#phd-2023


Call for Participation: Challenge on Medical Video Question Answering at TRECVID 2023
by Deepak Gupta 15 Jul '23


Dear colleagues and friends,
This year, we are organizing the MedVidQA <https://medvidqa.github.io/> challenge with TRECVID 2023 <https://www-nlpir.nist.gov/projects/tv2023/index.html>. This challenge aims at developing models for (1) retrieving the relevant videos and locating the visual answer in those videos for a medical or health-related question and (2) generating medical instructional questions from video segments.
Following the success of the 1st MedVidQA shared task <https://aclanthology.org/2022.bionlp-1.25/>, MedVidQA at TRECVID 2023 expands the tasks and introduces a new track considering language-video understanding and generation. This track comprises two main tasks: Video Corpus Visual Answer Localization (VCVAL) and Medical Instructional Question Generation (MIQG).
For more details, please visit the challenge website (https://medvidqa.github.io/) and the TRECVID 2023 website (https://www-nlpir.nist.gov/projects/tv2023/index.html).
Submission links:
- Task 1 (VCVAL): https://codalab.lisn.upsaclay.fr/competitions/13445 <https://codalab.lisn.upsaclay.fr/competitions/13546>
- Task 2 (MIQG): https://codalab.lisn.upsaclay.fr/competitions/13546
*Important Dates*
- *Release of the training and validation datasets:* April 30, 2023
- *Release of the video corpus:* May 12, 2023
- *Release of the test sets:* July 14, 2023
- *Run submission deadline:* August 4, 2023
- *Release of the official results:* September 29, 2023
We look forward to your participation in MedVidQA at TRECVID 2023. Join our Google Group <https://groups.google.com/g/trecvid-medvidqa2023> for important updates! If you have any questions, ask in our Google Group <https://groups.google.com/g/trecvid-medvidqa2023> or email us at <deepak.gupta(a)nih.gov>.
Thank you,
MedVidQA 2023 Organizers


New articles for Asia Pacific Journal of Corpus Research (APJCR) Vol. 3, No. 1 are available online (Open Access)
by Prof CK Jung 09 Jul '23


Dear all Just wanted to let you know that APJCR Vol. 3, No. 1 is now available to view online. http://icr.or.kr/ejournals-apjcr CK --- *CK Jung BEng(Hons) Birmingham MSc Warwick EdD Warwick Cert Oxford* Department of English Language and Literature, Incheon National University, *South Korea* Vice President | The Korea Association of Primary English Education (KAPEE), *South Korea* Vice President | The Korea Association of Secondary English Education (KASEE), *South Korea* Director | Institute for Corpus Research, Incheon National University, *South Korea* (http://icr.or.kr) Editor | Asia Pacific Journal of Corpus Research, ICR, *International* ( http://icr.or.kr/apjcr) Deputy Editor | Korean Journal of English Language and Linguistics, KASELL, *South Korea* Editorial Board | Corpora, Edinburgh University Press, *UK* Editorial Board | English Today, Cambridge University Press, *UK* E: ckjung(a)inu.ac.kr / T: +82 (0)32 835 8129 H(EN): http://ckjung.org H(KR): http://prof1.inu.ac.kr/user/ckjung


Call for Shared Task Participation: Collecting and Geocoding Armed Clash Events in Russian Ukrainian Conflict at CASE @ RANLP 2023
by ali hürriyetoglu 07 Jun '23


CASE-2023 Shared Task - Task 2: Collecting and Geocoding Armed Clash Events in Russo-Ukrainian Conflict
================================================
The unprecedented quantity of easily accessible data on social, political, and economic processes offers ground-breaking potential in guiding data-driven analysis of socio-political phenomena: armed conflicts, political movements, fights for economic and social rights, and various related socio-political happenings are reported in news articles and social media posts and recorded in curated databases. On the other hand, automatic event detection from texts and event geocoding has long been a challenge for the natural language processing (NLP) community. It requires sophisticated methods and resources, such as Machine Learning models, linguistic rules and dictionaries, and geographic gazetteers.
Task definition
The task Collecting and Geocoding Armed Clash Events in Russo-Ukrainian Conflict is being held as a sub-task of the 6th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2023). The task will use data from the Russo-Ukrainian Conflict to test the capabilities of event detection systems to extract, geocode and de-duplicate armed clashes in news and social media posts. Evaluation will be based on the correlation between the spatio-temporal distribution and number of the extracted events and those which are in the ground truth data set.
We invite contributions from researchers in NLP, ML, Deep Learning, and AI. The call is directed also towards socio-political scientists, researchers in conflict analysis and forecasting, peace studies, and computational social science. All participating teams will be able to publish their system description paper in the workshop proceedings published by ACL.
For more information on the workshop, please visit the Workshop website https://emw.ku.edu.tr/case-2023/ <https://emw.ku.edu.tr/case-2022/> and the conference website https://ranlp.org/ranlp2023/.
================================================
1. Data
Gold standard and text input data for the participant systems for the time range 24.02.2022-24.08.2022 have been prepared and will be shared with the applicants on the Task website.
1.1 Training Data
No training data are provided for this Task. The data utilized for CASE 2023 Task 1, which are described in Hürriyetoğlu et al. (2020b, 2022), can be used for training systems for this task (Task 2). Additionally, these data can be used to build systems/models that can detect protest events in tweets and news articles.
1.2 Input Data
The participant systems will be evaluated on raw data collections including Telegram messages, the New York Times and Ukrainian-Russian official news channels. Namely, the data collections comprise:
• English language social media message and news corpus comprising 48,007 Telegram messages and The New York Times news about Ukraine.
• Ukrainian language social media collection comprising 102,135 Telegram messages and Ukraine News Agency news.
• Russian language social media collection comprising 8,534 Telegram messages and Russian News Agency news.
Further details on the text collections and sampling methods are provided in the folders news and Social Media of the GitHub repo for the Task (https://github.com/zavavan/case2023_task2).
1.3 Gold Standard Data
The Russo-Ukrainian Conflict ground truth data primarily consists of data coming from the Armed Conflict Location & Event Data Project (ACLED). We will be adding alternative ground-truth datasets in order to prevent the bias that may be introduced by using a single definition and interpretation of an event.
Full details on the manually curated data used as Gold Standard for the correlation analysis will be disclosed at the end of the evaluation period. Please check the documentation in the folder gold_standard of the Task GitHub repo.
================================================
2. Evaluation
The systems which participate in this shared task will be required to detect news articles and Telegram posts which contain descriptions of ongoing armed clashes. The time of each armed clash should be detected at date level, and its place as precise geographic coordinates (latitude and longitude). The systems should ideally extract event times based on multiple text reports.
In order to evaluate the ability of automatic event-coders to reproduce the gold standard armed clash event dataset, we adapt two correlation methods originally used in micro-level analysis of political violence by Hammond and Weidmann (2014), based on aggregation of event counts over uniform geographical grid cells and 1-day time spans, and apply a number of standard correlation coefficients and error measures.
For each of the input text corpora in 1.2, each participant may submit up to 3 different system responses. Each system response will consist of a csv file with the following naming pattern: “submission.<team-name>.<corpus>.<response-number>.csv” where <corpus> is either “social_media” or “news”. For instance: “submission.MyTeam.news.3.csv” for the 3rd submission of team “MyTeam” on the news corpus.
Each system response file will have one line per event, where each line will have the following format:
<id>,<City>,<Region>,<Country>,<Date>
where <id> is a numerical event identifier, and <City>,<Region>,<Country> are canonical English names of the City, State/Region and Country, respectively, of the detected event location.
While only the <Country> attribute is mandatory, systems are expected to assign a description of the event location at the finest-grained level possible, as otherwise geographical coordinate conversion may penalize the correlation score on geographical cell aggregation. <Date> is the assigned date of the event in the format YYYY-MM-DD.

A sample system response file line:

0,Kharkiv,Kharkiv Oblast,Ukraine,2022-05-02

A sample system output file can be downloaded from the Task repo at: https://github.com/zavavan/case2023_task2/blob/main/submission.myteam.news.…

Important Dates (AoE time)
================================================

It is optional to use Task 1 systems. Participants may also use their own systems, which are developed independently of Task 1.

Task 1 Training data available: May 1, 2023
Task 1 Test data available: May 15, 2023
Task 1 Evaluation period ends: June 30, 2023
Task 2 Sample Text archive is available: May 22, 2023
Task 2 Text archive for evaluation is available: July 1, 2023
Task 2 Evaluation period starts: July 1, 2023
Task 2 Evaluation period ends: July 24, 2023
System Description Paper submissions due: July 31, 2023
Notification to authors after review: August 7, 2023
Camera ready: August 25, 2023
Workshop period @ RANLP: Sep 7-8, 2023

Organization
================================================

- Hristo Tanev (Joint Research Centre (JRC), European Commission, Italy)
- Onur Uca (Sociology, Mersin University, Turkey)
- Vanni Zavarella (University of Cagliari, Italy)
- Ali Hürriyetoğlu (KNAW Humanities Cluster DHLab, the Netherlands)

Please contact the organizers at hristo.tanev(a)ec.europa.eu or onuruca(a)mersin.edu.tr with your questions.

5. References

Hammond, J., & Weidmann, N. B. (2014). Using machine-coded event data for the micro-level study of political violence. Research & Politics, 1(2), 2053168014539924.

Hürriyetoğlu, A., Mutlu, O., Duruşan, F., Uca, O., Gürel, A. S., Radford, B., Dai, Y., Hettiarachchi, H., Stoehr, N., Nomoto, T., Slavcheva, M., Vargas, F., Javid, A., Beyhan, F., & Yörük, E. (2022). Extended Multilingual Protest News Detection Shared Task 1, CASE 2021 and 2022. arXiv preprint arXiv:2211.11360. URL: https://arxiv.org/abs/2211.11360

Hürriyetoğlu, A., Yörük, E., Yüret, D., Mutlu, O., Yoltar, Ç., Duruşan, F., & Gürel, B. (2020b). Cross-context news corpus for protest events related knowledge base construction. arXiv preprint arXiv:2008.00351. In Automated Knowledge Base Construction (AKBC). URL: https://www.akbc.ws/2020/papers/7NZkNhLCjp
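
The grid-based comparison described in the Evaluation section can be illustrated with a minimal Python sketch. It is not the organizers' evaluation script: the 0.5-degree cell size, the function names, and the choice of Pearson correlation and RMSE as representative measures are assumptions made for illustration. Events are assumed to be already geocoded to (latitude, longitude, date) tuples, and only cells that contain at least one event in either dataset are compared, which is a simplification.

```python
from collections import Counter
from datetime import date
from math import floor, sqrt

CELL_DEG = 0.5  # assumed grid cell size in degrees (not specified in the call)


def cell_day_counts(events, cell_deg=CELL_DEG):
    """Count events per (grid row, grid column, day).

    events: iterable of (latitude, longitude, datetime.date) tuples.
    """
    counts = Counter()
    for lat, lon, day in events:
        key = (floor(lat / cell_deg), floor(lon / cell_deg), day)
        counts[key] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def compare(system_events, gold_events, cell_deg=CELL_DEG):
    """Return (Pearson r, RMSE) between system and gold cell-day counts."""
    sys_counts = cell_day_counts(system_events, cell_deg)
    gold_counts = cell_day_counts(gold_events, cell_deg)
    # Simplification: compare only cell-days seen in at least one dataset.
    keys = sorted(set(sys_counts) | set(gold_counts))
    xs = [sys_counts.get(k, 0) for k in keys]
    ys = [gold_counts.get(k, 0) for k in keys]
    rmse = sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(keys))
    return pearson(xs, ys), rmse


if __name__ == "__main__":
    day = date(2022, 5, 2)
    gold = [(49.99, 36.23, day), (49.98, 36.25, day), (46.48, 30.72, day)]
    system = [(49.99, 36.23, day), (46.48, 30.72, day)]
    print(compare(system, gold))
```

A coarser or finer `cell_deg` changes how strictly location errors are penalized, which is why assigning the finest-grained location possible matters for the correlation score.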

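As a hedged illustration of the submission format described above, the following Python sketch builds a file name and the CSV lines of a system response. The helper names are hypothetical; numbering ids from 0 and leaving a missing City or Region field empty are assumptions, since the call only shows a single sample line.

```python
import csv
import io
from datetime import date

VALID_CORPORA = {"social_media", "news"}  # the two corpus labels named in the call
MAX_RESPONSES = 3  # each participant may submit up to 3 responses per corpus


def submission_filename(team, corpus, response_number):
    """Build a name following submission.<team-name>.<corpus>.<response-number>.csv."""
    if corpus not in VALID_CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if not 1 <= response_number <= MAX_RESPONSES:
        raise ValueError("response number must be between 1 and 3")
    return f"submission.{team}.{corpus}.{response_number}.csv"


def format_rows(events):
    """Render events as <id>,<City>,<Region>,<Country>,<Date> lines.

    events: iterable of (city, region, country, datetime.date) tuples.
    City and region may be empty strings (assumed encoding of a missing value);
    country is mandatory.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for event_id, (city, region, country, day) in enumerate(events):
        if not country:
            raise ValueError("the Country attribute is mandatory")
        # date.isoformat() yields the required YYYY-MM-DD format.
        writer.writerow([event_id, city, region, country, day.isoformat()])
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [("Kharkiv", "Kharkiv Oblast", "Ukraine", date(2022, 5, 2))]
    with open(submission_filename("MyTeam", "news", 3), "w", encoding="utf-8") as fh:
        fh.write(format_rows(rows))
```

Using the `csv` module rather than plain string joining means a place name that itself contains a comma would be quoted instead of silently shifting the columns.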

Call for Workshop Papers and Shared Task Participation: Automated Extraction of Socio-political Events from Text - CASE @ RANLP 2023
by ali hürriyetoglu 07 Jun '23

Call for workshop papers and Shared Task participation: the 6th workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text - CASE @ RANLP 2023

************************************************************************************
URL: https://emw.ku.edu.tr/case-2023/
Paper submission deadline: 10 July 2023
Paper acceptance notification: 5 August 2023
Paper camera-ready: 25 August 2023
Workshop dates: 7-8 September 2023
Dates and deadlines for the shared task are below.
Softconf page of the workshop: https://softconf.com/ranlp23/CASE/
************************************************************************************

We invite contributions from researchers in computer science, NLP, ML, DL, AI, socio-political sciences, conflict analysis and forecasting, peace studies, as well as computational social science scholars involved in the collection and utilization of socio-political event data. This includes (but is not limited to) the following topics:

1) Extracting events and their arguments such as time and location in and beyond a sentence or document, event coreference resolution.
2) Research in NLP technologies in relation to event detection: geocoding, temporal reasoning, argument structure detection, syntactic and semantic analysis of event structures, text classification for event type detection, learning event-related lexica, event coreference resolution, fake news analysis, and others with a focus on real or potential event detection applications.
3) New datasets, training data collection, and annotation for event information.
4) Event-event relations, e.g., subevents, main events, spatio-temporal relations, causal relations.
5) Event dataset evaluation in light of reliability and validity metrics.
6) Defining, populating, and facilitating event schemas and ontologies.
7) Automated tools and pipelines for event collection related tasks.
8) Lexical, syntactic, semantic, discursive, and pragmatic aspects of event manifestation.
9) Methodologies for development, evaluation, and analysis of event datasets.
10) Applications of event databases, e.g. early warning, conflict prediction, policymaking.
11) Estimating what is missing in event datasets using internal and external information.
12) Detection of new and emerging socio-political event (SPE) types, e.g. creative protests.
13) Release of new event datasets.
14) Bias and fairness of the sources and event datasets.
15) Ethics, misinformation, privacy, and fairness concerns pertaining to event datasets.
16) Copyright issues on event dataset creation, dissemination, and sharing.
17) Cross-lingual, multilingual and multimodal aspects in event analysis.
18) Resources and approaches related to contentious politics around climate change.

**** Shared tasks ****

Please check the workshop page and GitHub repositories of the respective task for additional details.

Task 1 - Multilingual protest news detection: The performance of an automated system depends on the target event type, as the type may be broad or the event trigger(s) may be ambiguous. The context of the trigger occurrence may need to be handled as well. For instance, the ‘protest’ event type may or may not be synonymous with ‘demonstration’ in a specific context. Moreover, hypothetical cases such as future protest plans may need to be excluded from the results. Finally, the relevance of a protest depends on the actors: for contentious political events, only citizen-led events are in scope. This challenge becomes even harder in a cross-lingual and zero-shot setting when training data are not available in new languages. We tackle the task in four steps and hope state-of-the-art approaches will yield optimal results.
Contact person: Ali Hürriyetoğlu (ali.hurriyetoglu(a)gmail.com)
Github: https://github.com/emerging-welfare/case-2022-multilingual-event

Task 2 - Collecting and Geocoding Armed Clash Events in Russian Ukrainian Conflict: There is a mismatch between the event information collected by automated and manual approaches. We aim at identifying similarities and differences between the results of these paradigms for creating event datasets. The participants of Task 1 will be invited to run the systems they develop for Task 1 on a text archive. Participation in Task 1 is not a precondition to participate in Task 2.
Contact persons: Hristo Tanev (htanev(a)gmail.com) and Onur Uca (onuruca(a)mersin.edu.tr)
Github: https://github.com/zavavan/case2023_task2

Task 3 - Event causality identification: Causality is a core cognitive concept and appears in many natural language processing (NLP) works that aim to tackle inference and understanding. We are interested in studying event causality in news and therefore introduce the Causal News Corpus. The Causal News Corpus consists of 3,767 event sentences, extracted from protest event news, that have been annotated with sequence labels indicating whether they contain causal relations or not. Subsequently, causal sentences are also annotated with Cause, Effect and Signal spans. Our subtasks work on the Causal News Corpus, and we hope that accurate, automated solutions may be proposed for the detection and extraction of causal events in news.
Contact person: Fiona Anting Tan (tan.f(a)u.nus.edu)
Github: https://github.com/tanfiona/CausalNewsCorpus

Task 4 - Multimodal Hate Speech Event Detection: Hate speech detection is one of the most important aspects of event identification during political events like invasions. In the case of hate speech detection, the event is the occurrence of hate speech, the entity is the target of the hate speech, and the relationship is the connection between the two.
Since multimodal content is widely prevalent across the internet, the detection of hate speech in text-embedded images is very important. Given a text-embedded image, this task aims to automatically identify the hate speech and its targets. This task will have two subtasks.
Contact person: Surendrabikram Thapa (surendrabikram(a)vt.edu)
Github: https://github.com/therealthapa/case2023_task4

**** Deadlines for the Shared tasks ****

** Tasks 1, 3, 4:
Training & Validation data available: May 1, 2023
Test data available: Jun 15, 2023
Test start: Jun 15, 2023
Test end: Jun 30, 2023
System Description Paper submissions due: Jul 10, 2023
Notification to authors after review: Aug 5, 2023
Camera ready: Aug 25, 2023

** Task 2:
Sample Text archive is available: May 22, 2023
Text archive for evaluation is available: July 1, 2023
Evaluation period starts: July 1, 2023
Evaluation period ends: July 24, 2023
System Description Paper submissions due: July 31, 2023
Notification to authors after review: August 7, 2023
Camera ready: August 25, 2023

*** Keynotes ***

We will continue our tradition of inviting keynote speakers from both social and computational sciences. The social science keynote will be delivered by Erdem Yörük with the title “Using Automated Text Processing to Understand Social Movements and Human Behaviour”, and the computational science keynotes will be delivered by Ruslan Mitkov and Kiril Simov. Please see the workshop webpage (https://emw.ku.edu.tr/case-2023/) for additional details.

