== 12th NLP4CALL, Tórshavn, Faroe Islands==
The workshop series on Natural Language Processing (NLP) for Computer-Assisted Language Learning (NLP4CALL) is a meeting place for researchers working on the integration of Natural Language Processing and Speech Technologies in CALL systems and exploring the theoretical and methodological issues arising in this connection. These issues include, among others, the incorporation of insights from Second Language Acquisition (SLA) research, on the one hand, and the promotion of “Computational SLA” through setting up Second Language research infrastructure(s), on the other.
The intersection of Natural Language Processing (or Language Technology / Computational Linguistics) and Speech Technology with Computer-Assisted Language Learning (CALL) brings “understanding” of language to CALL tools, thus making CALL intelligent. This has given the area of research its name: Intelligent CALL, or ICALL. As the definition suggests, apart from having excellent knowledge of Natural Language Processing and/or Speech Technology, ICALL researchers need good insights into second language acquisition theories and practices, as well as knowledge of second language pedagogy and didactics. This workshop therefore invites a wide range of ICALL-relevant research, including studies where NLP-enriched tools are used for testing SLA and pedagogical theories, and vice versa, where SLA theories, pedagogical practices or empirical data are modeled in ICALL tools.
The NLP4CALL workshop series is aimed at bringing together competences from these areas for sharing experiences and brainstorming around the future of the field.
We welcome papers:
- that describe research directly aimed at ICALL;
- that demonstrate the actual use, or discuss the potential use, of existing Language and Speech Technologies or resources for language learning;
- that describe the ongoing development of resources and tools with potential usage in ICALL, either directly in interactive applications, or indirectly in materials, application or curriculum development, e.g. learning material generation, assessment of learner texts and responses, individualized learning solutions, provision of feedback;
- that discuss challenges and/or a research agenda for ICALL;
- that describe empirical studies on language learner data.
This year a special focus is given to work done on error detection/correction and feedback generation.
We encourage paper presentations and software demonstrations describing the above-mentioned themes primarily, but not exclusively, for the Nordic languages.
==Shared task==
NEW for this year is the MultiGED shared task on token-level error detection for L2 Czech, English, German, Italian and Swedish, organized by the Computational SLA working group.
For more information, please see the Shared Task website: https://github.com/spraakbanken/multiged-2023
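For illustration only (the authoritative data description and evaluation details are on the shared task website above), token-level error detection data of this kind is often distributed in a CoNLL-style tab-separated format, one token and one binary correct/incorrect label per line, with blank lines between sentences. The sketch below assumes such a two-column layout and shows how files like this could be read and scored with a token-level F0.5 over the error class, a common choice for this kind of task.

```python
# Minimal sketch, assuming a CoNLL-style TSV layout (token<TAB>label per line,
# blank line between sentences) with binary labels "c" (correct) / "i" (incorrect).
# The actual MultiGED-2023 format and metric are specified on the shared task website.

def read_ged_file(path):
    """Return a list of sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:                      # sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, label = line.split("\t")[:2]
            current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def token_level_f05(gold, pred, positive="i"):
    """F0.5 over the 'incorrect' class, weighting precision over recall."""
    tp = fp = fn = 0
    for g_sent, p_sent in zip(gold, pred):
        for (_, g), (_, p) in zip(g_sent, p_sent):
            if p == positive and g == positive:
                tp += 1
            elif p == positive:
                fp += 1
            elif g == positive:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    beta2 = 0.25  # beta = 0.5
    return ((1 + beta2) * precision * recall / (beta2 * precision + recall)
            if precision + recall else 0.0)
```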
==Invited speakers==
This year, we have the pleasure of announcing two invited talks.
The first talk is given by Marije Michel from the University of Amsterdam.
The second talk is given by Pierre Lison from the Norwegian Computing Center.
==Submission information==
Authors are invited to submit long papers (8-12 pages) or short papers (4-7 pages); the page count does not include references.
We will be using the NLP4CALL template for the workshop this year. The author kit can be accessed below, or alternatively on Overleaf:
<https://spraakbanken.gu.se/sites/default/files/2023/NLP4CALL%20workshop%20t…>
<https://spraakbanken.gu.se/sites/default/files/2023/nlp4call%20template.doc>
<https://www.overleaf.com/latex/templates/nlp4call-workshop-template/qqqzqqy…>
Submissions will be managed through the electronic conference management system EasyChair <https://easychair.org/conferences/?conf=nlp4call2023>. Papers must be submitted digitally through the conference management system, in PDF format. Final camera-ready versions of accepted papers will be given an additional page to address reviewer comments.
Papers should describe original unpublished work or work-in-progress. Papers will be peer reviewed by at least two members of the program committee in a double-blind fashion. All accepted papers will be collected into a proceedings volume to be submitted for publication in the NEALT Proceedings Series (Linköping Electronic Conference Proceedings) and, additionally, published through the ACL Anthology, following the practice of previous NLP4CALL editions (<https://www.aclweb.org/anthology/venues/nlp4call/>).
==Important dates==
03 April 2023: paper submission deadline
21 April 2023: notification of acceptance
01 May 2023: camera-ready papers for publication
22 May 2023: workshop date
==Organizers==
David Alfter (1), Elena Volodina (2), Thomas François (3), Arne Jönsson (4), Evelina Rennes (4)
(1) Gothenburg Research Infrastructure for Digital Humanities, Department of Literature, History of Ideas, and Religion, University of Gothenburg, Sweden
(2) Språkbanken, Department of Swedish, Multilingualism, Language Technology, University of Gothenburg, Sweden
(3) CENTAL, Institute for Language and Communication, Université Catholique de Louvain, Belgium
(4) Department of Computer and Information Science, Linköping University, Sweden
==Contact==
For any questions, please contact David Alfter, david.alfter(a)gu.se
For further information, see the workshop website <https://spraakbanken.gu.se/en/research/themes/icall/nlp4call-workshop-serie…>
Follow us on Twitter @NLP4CALL <https://twitter.com/NLP4CALL/>
Hi there,
Could you please distribute the following job offer? Thanks.
Best,
Pascal
-------------------------------------------------------------------------------------
We invite applications for a 3-year PhD position co-funded by Inria,
the French national research institute in Computer Science and Applied
Mathematics, and LexisNexis France, leader of legal information in
France and subsidiary of the RELX Group.
The overall objective of this project is to develop an automated
system for detecting argumentation structures in French legal
decisions, using recent machine learning-based approaches (i.e. deep
learning approaches). In the general case, these structures take the
form of a directed labeled graph, whose nodes are the elements of the
text (propositions or groups of propositions, not necessarily
contiguous) which serve as components of the argument, and edges are
relations that signal the argumentative connection between them (e.g.,
support, attack). By revealing the argumentation structure behind
legal decisions, such a system will provide a crucial milestone
towards their detailed understanding, their use by legal
professionals, and, above all, will contribute to greater transparency of
justice.
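To make the target structure concrete, the following minimal sketch (illustrative only; the component types, relation labels and example texts are invented and not the project's actual schema) shows one way such a directed labeled argument graph could be represented in code.

```python
# Illustrative sketch of a directed labeled argument graph for a legal decision.
# Component types, relation labels and texts are examples, not the project's schema.
from dataclasses import dataclass, field

@dataclass
class ArgumentComponent:
    comp_id: str
    text: str        # proposition(s), possibly drawn from non-contiguous spans
    comp_type: str   # e.g. "premise" or "conclusion"

@dataclass
class ArgumentGraph:
    nodes: dict = field(default_factory=dict)   # comp_id -> ArgumentComponent
    edges: list = field(default_factory=list)   # (source_id, target_id, relation)

    def add_component(self, comp):
        self.nodes[comp.comp_id] = comp

    def add_relation(self, source_id, target_id, relation):
        # relation is a label such as "support" or "attack"
        self.edges.append((source_id, target_id, relation))

# Example usage
graph = ArgumentGraph()
graph.add_component(ArgumentComponent("c1", "The contract was signed under duress.", "premise"))
graph.add_component(ArgumentComponent("c2", "The contract is therefore void.", "conclusion"))
graph.add_relation("c1", "c2", "support")
```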
The main challenges and milestones of this project start with the
creation and release of a large-scale dataset of French legal
decisions annotated with argumentation structures. To minimize the
manual annotation effort, we will resort to semi-supervised and
transfer learning techniques to leverage existing argument mining
corpora, such as the European Court of Human Rights (ECHR) corpus, as
well as annotations already started by LexisNexis. Another promising
research direction, which is likely to improve over state-of-the-art
approaches, is to better model the dependencies between the different
sub-tasks (argument span detection, argument typing, etc.) instead of
learning these tasks independently. A third research avenue is to find
innovative ways to inject the domain knowledge (in particular the rich
legal ontology developed by LexisNexis) to enrich the
representations used in these models. Finally, we would like to take
advantage of other discourse structures, such as coreference and
rhetorical relations, conceived as auxiliary tasks in a multi-tasking
architecture.
The successful candidate will hold a Master's degree in computational
linguistics, natural language processing, or machine learning, ideally
with prior experience in legal document processing and discourse
processing. Furthermore, the candidate should have strong programming
skills and expertise in machine learning approaches, and be eager to work
at the interface between academia and industry.
The position is affiliated with MAGNET [1], a research group at
Inria, Lille, which has expertise in Machine Learning and Natural
Language Processing, in particular Discourse Processing. The PhD
student will also work in close collaboration with the R&D team at
LexisNexis France, who will provide their expertise in the legal
domain and the data they have collected.
Applications will be considered until the position is filled. However,
you are encouraged to apply early as we shall start processing the
applications as and when they are received. Applications, written in
English or French, should include a brief cover letter with research
interests and vision, a CV (including your contact address, work
experience, publications), and contact information for at least 2
referees. Applications (and questions) should be sent to Pascal Denis
(pascal.denis(a)inria.fr).
The starting date of the position is 1 November 2022 or soon
thereafter, for a total of 3 full years.
Best regards,
Pascal Denis
[1] https://team.inria.fr/magnet/
[2] https://www.lexisnexis.fr/
--
Pascal
----
For an independent, transparent and rigorous evaluation!
I support Inria's Evaluation Committee.
----
+++++++++++++++++++++++++++++++++++++++++++++++
Pascal Denis
Equipe MAGNET, INRIA Lille Nord Europe
Bâtiment B, Avenue Heloïse
Parc scientifique de la Haute Borne
59650 Villeneuve d'Ascq
Tel: +33 3 59 35 87 24
Url: http://researchers.lille.inria.fr/~pdenis/
+++++++++++++++++++++++++++++++++++++++++++++++
Dear colleagues,
Last month, we shared the result of our collaborative work on a core metadata scheme for learner corpora with LCR2022 participants. Our proposal builds on Granger and Paquot's (2017) first attempt to design such a scheme; during our presentation, we explained the rationale for expanding on the initial proposal and discussed selected aspects of the revised scheme.
Our proposal is available at https://docs.google.com/spreadsheets/d/1-RbX5iUCUtCBkZU9Rfk-kv-Vzc--F-eUW2O…
We firmly believe that our efforts to develop a core metadata scheme for learner corpora will only be successful to the extent that (1) the LCR community is given the opportunity to engage with our work in various ways (provide feedback on the general structure of the scheme, the list of variables that we identified as core and their operationalization; test the metadata on other learner corpora; use the scheme to start a new corpus compilation, etc.) and (2) the core metadata scheme is the result of truly collaborative work.
As mentioned at LCR2022, we will be collecting feedback on the metadata scheme until the end of October. The online feedback form is available at:
https://docs.google.com/document/d/1NeDUuxGJlPSJI9wHVA1xgGM-aV8jXTa8Qlb45K-…
We'd like to thank all the colleagues who already got back to us (at LCR2022, by email or via the online form). We also thank them for their appreciation and enthusiasm for our work! We'd also like to encourage more colleagues (and particularly those of you who have experience in learner corpus compilation) to provide feedback! We need help in finalizing the core metadata scheme to make sure that it can be applied in all learner corpus compilation contexts. In short, we need you to make sure the scheme meets the needs of the LCR community at large.
With very best wishes,
Magali Paquot (also on behalf of Alexander König, Jennifer-Carmen Frey, and Egon W. Stemle)
Reference
Granger, S. & M. Paquot (2017). Towards standardization of metadata for L2 corpora. Invited talk at the CLARIN workshop on Interoperability of Second Language Resources and Tools, 6-8 December 2017, University of Gothenburg, Sweden.
Dr. Magali Paquot
Centre for English Corpus Linguistics
Institut Langage et Communication
UCLouvain
https://perso.uclouvain.be/magali.paquot/
CoCo4MT has extended its paper submission deadline to July 16th!
The Second Workshop on Corpus Generation and Corpus Augmentation for
Machine Translation (CoCo4MT) @MT-SUMMIT XIX
The 19th Machine Translation Summit
Sep 4-8, 2023, Macau SAR, China
https://sites.google.com/view/coco4mt
SCOPE
It is a well-known fact that machine translation systems, especially
those that use deep learning, require massive amounts of data, and for
many languages such resources are not available in human-created form.
The types of resources that are available include monolingual and
multilingual corpora, translation memories, and lexicons. Parallel
resources are generally created for formal purposes, such as
parliamentary collections, whereas monolingual resources tend to come
from more informal settings. The quality and abundance of corpora used
for formal purposes are generally higher than of those used for informal
purposes. Additionally, corpora for low-resource languages (languages
with fewer digital resources available) tend to be less abundant and of
lower quality.
CoCo4MT is a workshop centered around research that focuses on manual
and automatic corpus creation, cleansing, and augmentation techniques
specifically for machine translation. We accept work that covers any
language (including sign language) but we are specifically interested in
those submissions that explicitly report on work with languages with
limited existing resources (low-resource languages). Since techniques
from high-resource languages are generally statistical in nature and
could be used as generic solutions for any language, we welcome
submissions on high-resource languages also.
CoCo4MT aims to encourage research on new and undiscovered techniques.
We hope that the methods presented at this workshop will lead to the
development of high-quality corpora that in turn lead to high-performing
MT systems and to new dataset creation for multiple languages. We also
hope that submissions will provide publicly downloadable, high-quality
corpora that can be used to increase machine translation performance,
making the workshop a general point of reference for corpora needs in
the future. The workshop’s success will be measured by the following key
performance indicators:
- Promotes the ongoing increase in the quality of machine translation
systems as measured by standard metrics,
- Provides a meeting place for collaboration from several research areas
to increase the availability of commonly used corpora and new corpora,
- Drives innovation to address the need for higher quality and abundance
of low-resource language data.
Topics of interest include:
- Difficulties with using existing corpora (e.g., political
considerations or domain limitations) and their effects on final MT
systems,
- Strategies for collecting new MT datasets (e.g., via crowdsourcing),
- Data augmentation techniques,
- Data cleansing and denoising techniques,
- Quality control strategies for MT data,
- Exploration of datasets for pretraining or auxiliary tasks for
training MT systems.
SHARED TASK
To encourage research on corpus construction for low-resource machine
translation, we introduce a shared task focused on identifying
high-quality instances that should be translated into a target
low-resource language. Participants are provided access to multi-way
corpora in the high-resource languages of English, Spanish, German,
Korean, and Indonesian, and, using these, are required to identify
beneficial instances that, when translated into the low-resource
languages of Cebuano, Gujarati, and Burmese, lead to high-performing MT
systems. More details on data, evaluation and submission can be found on
the website (https://sites.google.com/view/coco4mt/shared-task) or by
emailing coco4mt-shared-task(a)googlegroups.com.
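Purely as an illustration of the kind of selection heuristic a participant might start from (not an official baseline of the shared task, and ignoring the actual data format), the sketch below greedily picks source sentences that maximise vocabulary coverage, so that a limited translation budget covers as many word types as possible.

```python
# Hedged sketch: greedy vocabulary-coverage selection of instances to translate.
# A real submission would likely use stronger signals (difficulty, domain, length, etc.).

def select_instances(sentences, budget):
    """Pick up to `budget` sentence indices maximizing new-word-type coverage."""
    covered, selected = set(), []
    remaining = list(enumerate(sentences))
    for _ in range(min(budget, len(remaining))):
        best_pos, best_gain = None, -1
        for pos, (idx, sent) in enumerate(remaining):
            gain = len(set(sent.lower().split()) - covered)
            if gain > best_gain:
                best_pos, best_gain = pos, gain
        idx, sent = remaining.pop(best_pos)
        covered |= set(sent.lower().split())
        selected.append(idx)
    return selected

# Example usage
corpus = ["the cat sat on the mat", "a dog barked loudly", "the dog sat quietly"]
print(select_instances(corpus, budget=2))   # indices of the chosen sentences
```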
SUBMISSION INFORMATION
CoCo4MT will accept research, review, or position papers. The length of
each paper should be at least four (4) and not exceed ten (10) pages,
plus unlimited pages for references. Submissions should be formatted
according to the official MT Summit 2023 style templates
(https://www.overleaf.com/latex/templates/mt-summit-2023-template/knrrcnxhkq…).
Accepted papers will be published in the MT Summit 2023 proceedings
which are included in the ACL Anthology and will be presented at the
conference either orally or as a poster.
Submissions must be anonymized and should be made to the workshop using
the Softconf conference management system
(https://softconf.com/mtsummit2023/CoCo4MT). Scientific papers that have
been or will be submitted to other venues must be declared as such, and
must be withdrawn from the other venues if accepted and published at
CoCo4MT. The review will be double-blind.
We would like to encourage authors to cite papers written in ANY
language that are related to the topics, as long as both original
bibliographic items and their corresponding English translations are
provided.
Registration will be handled by the main conference. (To be announced)
IMPORTANT DATES
May 18, 2023 - Call for papers released
May 19, 2023 - Shared task release of train, dev and test data
May 25, 2023 - Shared task release of baselines
June 5, 2023 - Second call for papers
June 20, 2023 - Third and final call for papers
July 16, 2023 - Paper submissions due
July 16, 2023 - Shared task deadline to submit results
July 27, 2023 - Notification of acceptance
July 27, 2023 - Shared task system description papers due
August 03, 2023 - Camera-ready due
September 4-5, 2023 - CoCo4MT workshop
CONTACT
CoCo4MT Workshop Organizers:
coco4mt-2023-organizers(a)googlegroups.com
CoCo4MT Shared Task Organizers:
coco4mt-shared-task(a)googlegroups.com
ORGANIZING COMMITTEE (listed alphabetically)
Ananya Ganesh University of Colorado Boulder
Constantine Lignos Brandeis University
John E. Ortega Northeastern University
Jonne Sälevä Brandeis University
Katharina Kann University of Colorado Boulder
Marine Carpuat University of Maryland
Rodolfo Zevallos Universitat Pompeu Fabra
Shabnam Tafreshi University of Maryland
William Chen Carnegie Mellon University
PROGRAM COMMITTEE (listed alphabetically; tentative)
Abteen Ebrahimi University of Colorado Boulder
Adelani David Saarland University
Ananya Ganesh University of Colorado Boulder
Alberto Poncelas ADAPT Centre at Dublin City University
Anna Currey Amazon
Amirhossein Tebbifakhr University of Trento
Atul Kr. Ojha National University of Ireland Galway
Ayush Singh Northeastern University
Barry Haddow University of Edinburgh
Bharathi Raja Chakravarthi National University of Ireland Galway
Beatrice Savoldi University of Trento
Bogdan Babych Heidelberg University
Constantine Lignos Brandeis University
Dossou Bonaventure Mila Quebec AI Institute
Duygu Ataman New York University
Eleftheria Briakou University of Maryland
Eleni Metheniti Université Toulouse - Paul Sabatier
Jasper Kyle Catapang University of Birmingham
John E. Ortega Northeastern University
Jonne Sälevä Brandeis University
Kalika Bali Microsoft
Katharina Kann University of Colorado Boulder
Kochiro Watanabe The University of Tokyo
Koel Dutta Chowdhury Saarland University
Liangyou Li Huawei
Manuel Mager University of Stuttgart
Maria Art Antonette Clariño University of the Philippines Los Baños
Marine Carpuat University of Maryland
Mathias Müller University of Zurich
Nathaniel Oco De La Salle University
Patrick Simianer Lilt
Rico Sennrich University of Zurich
Rodolfo Zevallos Universitat Pompeu Fabra
Sangjee Dondrub Qinghai Normal University
Santanu Pal Saarland University
Sardana Ivanova University of Helsinki
Shantipriya Parida Silo AI
Shiran Dudy Northeastern University
Surafel Melaku Lakew Amazon
Tommi A Pirinen University of Tromsø
Valentin Malykh Moscow Institute of Physics and Technology
Xing Niu Amazon
Xu Weijia University of Maryland
2nd Call for Abstracts: 1st Workshop on Readability for Low Resourced Languages (RLRL 2023)
Free registration is now open https://bit.ly/3pwUwlG - a few tickets are still available.
Please join us for an exciting online workshop where experts in natural language processing will come together to discuss the latest research and innovative approaches to assessing the readability of low-resource languages. The workshop will take place as a free online event on September 5, 2023, and is being hosted jointly by Lancaster University, Sheffield Hallam University and King Saud University.
We welcome researchers and practitioners to submit presentation abstract proposals of up to 500 words for talks related to the development of a Readability Framework for low-resource languages.
The ultimate goal of the workshop is to discuss best practices and state-of-the-art AI-based approaches to create mathematical representations of expected readability levels at different school grade or cognitive ability levels. The workshop will also focus on utilising classifiers that are intuitive for humans to understand and adjust, enabling the analysis and improvement of the decision-making criteria. We welcome abstracts on work that is still in progress or that does not yet have conclusive results. We encourage authors to share their work at various stages of development to facilitate discussions and collaboration during the workshop.
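As one concrete example of such a mathematical representation (the classic, English-centric Flesch-Kincaid grade level, shown purely for illustration; low-resource languages generally require their own calibrated features and coefficients), a grade-level score can be computed from simple surface statistics:

```python
# Illustrative only: the Flesch-Kincaid grade level, a classic English readability
# formula. Low-resource languages typically need re-calibrated coefficients and
# language-specific syllable and sentence handling.
import re

def count_syllables(word):
    """Very rough vowel-group heuristic; adequate only for a demonstration."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_words = max(1, len(words))
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

# Example usage (very simple text can yield a grade below zero)
print(round(flesch_kincaid_grade("The cat sat on the mat. It was happy."), 2))
```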
Important Dates:
- Due date for workshop abstract submission: August 1, 2023 (extended)
- Notification of abstract acceptance to authors: August 10, 2023
- Workshop date: September 5, 2023 (online event<https://bit.ly/3pwUwlG>)
Keynote speakers:
- Professor Laurence Anthony - Faculty of Science and Engineering at Waseda University, Japan.
- Dr Violetta Cavalli-Sforza - School of Science and Engineering at Al Akhawayn University, Morocco.
- Professor Hend Al-Khalifa - College of Computer and Information Sciences at King Saud University, KSA
- Dr Abdel-Karim Al Tamimi- Computer Science and Software Engineering at Sheffield Hallam University, UK
- Dr Mo El-Haj - School of Computing and Communications at Lancaster University, UK
For the list of speakers, talk titles and abstracts, please visit the workshop's website:
https://wp.lancs.ac.uk/acc/rlrl2023/
The main objectives of the workshop are three-fold:
1- Increase awareness of the importance of readability in low-resource languages and its impact on language learning and literacy.
2- Discuss the challenges of readability in low-resource languages, such as limited resources and lack of standardization, and brainstorm strategies for addressing these challenges.
3- Foster a community of practice among participants, allowing them to share their experiences and best practices for addressing readability issues in low-resource languages.
Abstract submission:
The abstract submission page is now open; please submit abstracts of no more than 500 words at https://easychair.org/conferences/?conf=rlrl2023
Alternatively, you can contact the organisers directly with presentation ideas on topics related to readability or low resourced languages.
Topics of interest include, but are not limited to:
- Machine learning for text readability
- Applications of readability assessment
- Readability in low-resource languages
- Comprehensibility measures
- Mathematical representations of readability levels
- Text simplification for low-resource languages
- Readability and comprehensibility in language learning
- The effects of text simplification on readability
- Readability frameworks for indigenous languages
- Updating readability representations
We look forward to your contributions and to a productive and enlightening workshop on September 5, 2023.
RLRL 2023 Organisers:
- Dr Mo El-Haj (SCC/DSI/UCREL, Lancaster University)
- Dr Abdel-Karim Al Tamimi (CSSE, Sheffield Hallam University)
- Prof. Hend Al Khalifa (iWAN, King Saud University)
https://wp.lancs.ac.uk/acc/rlrl2023/
Best wishes,
Mahmoud
---------------------
Dr Mo El-Haj
Senior Lecturer in NLP
Co-Director of UCREL NLP Group
Strategic Lead of Arabic and Financial NLP Research
Advisory Board of the Natural Language Processing Journal
https://benjamins.com/catalog/nlp
School of Computing and Communications, Lancaster University
https://www.lancaster.ac.uk/staff/elhaj
@DocElhaj<https://twitter.com/DocElhaj>
Call for postdoc applications in Natural Language Processing for the
automatic detection of gender stereotypes in the French media (Grenoble
Alps University, France)
Starting date: flexible, November 30, 2023, at the latest
Duration: full-time position for 12 months
Salary: according to experience (up to 4142€/ month)
Application Deadline: Open until filled
Location: The position will be based in Grenoble, France.
This is not a remote position.
Keywords: natural language processing, gender stereotypes
bias, corpus analysis, language models,
transfer learning, deep learning
*Context* The University of Grenoble Alps (UGA) has an open position for
a highly motivated postdoc researcher to join the multidisciplinary
GenderedNews project. Natural Language Processing models trained on
large amounts of online content have quickly opened new perspectives for
processing large amounts of online content to measure gender bias on a
daily basis (see our project https://gendered-news.imag.fr/
<https://gendered-news.imag.fr/> ). Regarding research on stereotypes,
most recent works have studied Language Models (LM) from a stereotype
perspective by providing specific corpora such as StereoSet (Nadeem et
al., 2020) or CrowS-Pairs (Nangia et al., 2020). However, these studies
focus on quantifying bias in the LM predictions rather than bias in the
original data (Choenni et al., 2021). Furthermore, most of these studies
ignore named entities (Deshpande et al., 2022), which account for an
important part of the referents and speakers in news. In this project,
we intend to build corpora, methods and NLP tools to qualify the
differences between the language used to describe groups of people in
French news.
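For illustration only (this is not the GenderedNews methodology, whose measures are documented on the project site), a very crude proxy for gender imbalance in a news text is the ratio of masculine to feminine third-person pronouns; the hypothetical sketch below counts French pronouns in that spirit.

```python
# Hedged sketch: a crude gender-mention proxy based on French third-person pronouns.
# This is NOT the GenderedNews methodology, only an illustration of the kind of
# daily, corpus-level measurement the project description refers to.
import re
from collections import Counter

MASCULINE = {"il", "ils"}
FEMININE = {"elle", "elles"}

def pronoun_gender_counts(text):
    tokens = re.findall(r"\b\w+\b", text.lower())
    counts = Counter(tokens)
    masc = sum(counts[t] for t in MASCULINE)
    fem = sum(counts[t] for t in FEMININE)
    total = masc + fem
    return {"masculine": masc, "feminine": fem,
            "share_feminine": fem / total if total else None}

# Example usage
print(pronoun_gender_counts(
    "Elle a déclaré que le ministre était absent ; il n'a pas commenté."))
```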
*Main Tasks*
The successful postdoc will be responsible for the day-to-day running of
the research project, under the supervision of François Portet (Prof.
UGA at LIG) and Gilles Bastin (Prof. UGA at PACTE). Regular meetings
will take place every two weeks.
- Defining the dimensions of stereotypes to be investigated and the
possible metrics that can be processed from a machine learning perspective.
- Exploring, managing and curating news corpora in French for
stereotypes investigation, with a view to making them widely available
to the community to favor reproducible research and comparison.
- Studying and developing new computational models to process large
numbers of texts to reveal stereotype bias in news, making use of
pretrained models for the task.
- Evaluating the methods on a curated, focused corpus, applying them to
the unseen real longitudinal corpus, and analyzing the results with the team.
- Preparing articles for submission to peer-reviewed conferences and
journals.
- Organizing progress meetings and liaising between members of the team.
The hired person will interact with PhD students, interns and
researchers who are part of the GenderedNews project. Depending on their
background and interests, and in accordance with the project's
objectives, the hired person will have the possibility to orient the
research in different directions.
*Scientific Environment*
The recruited person will be hosted within the GETALP team of the LIG
laboratory (https://lig-getalp.imag.fr/
<https://lig-getalp.imag.fr/>), which offers a dynamic, international,
and stimulating environment for conducting high-level
multidisciplinary research. The person will have access to large
datasets of French news, GPU servers, to support for missions as well as
to the scientific activities of the labs. The team is housed in a modern
building (IMAG) located in a 175-hectare landscaped campus that
was ranked as the eighth most beautiful campus in Europe by the Times
Higher Education magazine in 2018.
The person will also closely work with Gilles Bastin (PACTE, a Sociology
lab in Grenoble) and Ange Richard (PhD at LIG and PACTE). The project
also includes an informal collaboration with "Prenons la une"
(https://prenonslaune.fr/ <https://prenonslaune.fr/>) a journalists’
association which promotes a fair representation of women in the media.
*Requirements*
The candidate must have a PhD degree in Natural Language Processing or
computer science, or be in the process of acquiring one. The successful
candidate should have:
- Good knowledge of Natural Language Processing
- Experience in corpus collection/formatting and manipulation
- Good programming skills in Python
- A publication record in a closely related field of research
- Willingness to work in multidisciplinary and international teams
- Good communication skills
- A good command of French (required)
*Instructions for applying*
Applications will be considered as they arrive and must be addressed to
François Portet (Francois.Portet(a)imag.fr
<mailto:Francois.Portet@imag.fr>). It is therefore advisable to apply as
soon as possible. The application file should contain:
- A curriculum vitae
- References for potential letter(s) of recommendation
- A one-page summary of research background and interests for the position
- Publications demonstrating expertise in the aforementioned areas
- Pre-defense reports and defense minutes, or a summary of the thesis with
the date of defense for those currently in doctoral studies
*References*
Deshpande et al. (2022). StereoKG: Data-Driven Knowledge Graph
Construction for Cultural Knowledge and Stereotypes. arXiv preprint
arXiv:2205.14036.
Choenni et al. (2021). Stepmothers are mean and academics are
pretentious: What do pretrained language models learn about you? arXiv
preprint arXiv:2109.10052.
Nadeem et al. (2020) StereoSet: Measuring stereotypical bias in
pretrained language models. ArXiv.
Nangia et al. (2020) CrowS-Pairs: A Challenge Dataset for Measuring
Social Biases in Masked Language Models. In EMNLP2020.
--
François PORTET
Professeur - Univ Grenoble Alpes
Laboratoire d'Informatique de Grenoble - Équipe GETALP
Bâtiment IMAG - Office 333
700 avenue Centrale
Domaine Universitaire - 38401 St Martin d'Hères
FRANCE
Phone: +33 (0)4 57 42 15 44
Email:francois.portet@imag.fr
www:http://membres-liglab.imag.fr/portet/
*** Second Workshop on Information Extraction from Scientific Publications
(WIESP) at IJCNLP-AACL 2023 ***
*** Website: https://ui.adsabs.harvard.edu/WIESP/2023/ ***
*** Twitter: https://twitter.com/wiesp_nlp ***
Building on the success of the First WIESP at AACL-IJCNLP 2022, the Second
Workshop on Information Extraction from Scientific Publications (WIESP)
will provide a platform for researchers to foster discussion and research on
information extraction, mining, generation, and knowledge discovery from
scientific publications using Natural Language Processing and Machine
Learning techniques. A lot of technological change has happened in the year
since the 1st WIESP, especially in Generative Artificial Intelligence
research, and we are incorporating a few additional topics to stay abreast
of the latest developments and research in the community. The 2nd iteration
of WIESP will focus on the following topics (but is not limited to them):
- Large Language Models (LLMs) for Science
- Application of LLMs on information extraction, generation, mining and
knowledge discovery from scientific publications
- Probing LLMs for scientific fact checking and misinformation
- Scientific document parsing
- Scientific named-entity recognition
- Scientific article summarization
- Question-answering on scientific articles
- Citation context/span extraction
- Structured information extraction from full-text, tables, figures,
bibliography
- Novel datasets curated from scientific publications
- Argument extraction and mining
- Challenges in information extraction from scientific articles
- Building knowledge graphs via mining scientific literature; querying
scientific knowledge graphs
- Novel tools for IE on scientific literature and interaction with users
- Mathematical information extraction
- Scientific concepts, facts extraction
- Visualizing scientific knowledge
- Bibliometric and Altmetric studies via information extraction from
scientific articles and metadata
In addition to research paper presentations, WIESP will also feature
keynote talks, a panel discussion on “Large Language Models and Scientific
Literature Mining”, and shared tasks. We will update the details on our
website as and when they become available. We especially welcome
participation from academic and research institutions, government and
industry labs, publishers, and information service providers. Projects and
organizations using NLP/ML techniques in their text mining and enrichment
efforts are also welcome to participate. We strongly encourage
participation of students, researchers, and science practitioners from
diverse backgrounds, especially from underrepresented groups and
communities, to be a part of WIESP events, and pro-actively make the
workshop a diverse and inclusive one.
***Call for Papers***
We invite papers of the following categories:
***Long papers*** must describe substantial, original, completed, and
unpublished work. Wherever appropriate, concrete evaluation and analysis
should be included. Papers must not exceed eight (8) pages of content, plus
unlimited pages of references. The final versions of long papers will be
given one additional page of content (up to 9 pages) so that reviewers'
comments can be taken into account.
***Short papers*** must describe original and unpublished work. Please note
that a short paper is not a shortened long paper. Instead, short papers
should have a point that can be made in a few pages, such as a small,
focused contribution, a negative result, or an interesting application
nugget. Short papers must not exceed four (4) pages, plus unlimited pages
of references. The final versions of short papers will be given one
additional page of content (up to 5 pages) so that reviewers' comments can
be taken into account.
In addition to papers, WIESP will also host shared tasks. More details on
the WIESP shared tasks will be available on our website shortly. Also, we
will publish separate CfPs on the shared tasks. Shared task authors will be
invited to write their system descriptions and those will be subjected to
peer review.
***Shared Task: Function of Citation in Astrophysics Literature (FOCAL)***
The citation graph is an essential tool for helping researchers find
relevant literature. To further empower discovery, we aim to label the
edges of the graph with the function of the citation: e.g., is the cited
work necessary background knowledge for the citing work, or is it used as a
comparison to it? To start this process, we propose a shared task of
automatically labeling citations with a function based on the textual
context of the citation. A sample dataset and more instructions can be
found at: https://ui.adsabs.harvard.edu/WIESP/2023/SharedTasks
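As a purely illustrative starting point (not an official baseline, and using invented labels such as "background" and "comparison" rather than the task's actual label set or data format), citation-function labeling can be framed as text classification over the citation context, for example with a TF-IDF plus logistic-regression pipeline:

```python
# Hedged sketch of a citation-function classifier over citation contexts.
# Labels and example contexts are invented for illustration; the real label set
# and data format are defined by the FOCAL shared task.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_contexts = [
    "Following the method of [CIT], we adopt the same spectral fitting procedure.",
    "Our results are consistent with the luminosity function reported by [CIT].",
    "Star formation in dwarf galaxies has been studied extensively [CIT].",
    "We compare our photometric redshifts against those of [CIT].",
]
train_labels = ["uses", "comparison", "background", "comparison"]

# TF-IDF features over uni- and bigrams, then a multiclass logistic regression.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(train_contexts, train_labels)

print(clf.predict(["Previous surveys have catalogued these sources [CIT]."]))
```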
All accepted papers will be published in the WIESP proceedings as part of
IJCNLP-AACL 2023 and indexed in the ACL Anthology.
***Important Dates***
- Paper Submission Deadline: August 25, 2023
- Notification of workshop paper/abstract acceptance: October 2, 2023
- Camera-ready Submission Deadline: October 15, 2023
- Workshop: November 2-4, 2023 (online, final date TBD)
***All submission deadlines are 11.59 pm UTC -12h ("Anywhere on Earth")***
***Submission Website and Format***
Submission Link: TBD (please keep an eye on the website)
Submission will be via softconf. Submissions should follow the ACLPUB
formatting guidelines (https://acl-org.github.io/ACLPUB/formatting.html)
and template files (https://github.com/acl-org/acl-style-files/tree/master).
Submissions (Long and Short Papers) will be subject to a double-blind
peer-review process. We follow the same policies as IJCNLP-AACL 2023
regarding preprints and double submissions. The anonymity period for WIESP
2023 is from July 25 to August 25.
***Organizers***
- Tirthankar Ghosal, National Center for Computational Sciences, Oak Ridge
National Laboratory, USA
- Felix Grezes, Center for Astrophysics | Harvard & Smithsonian, USA
- Thomas Allen, Center for Astrophysics | Harvard & Smithsonian, USA
- Kelly Lockhart, Center for Astrophysics | Harvard & Smithsonian, USA
- Alberto Accomazzi, Center for Astrophysics | Harvard & Smithsonian, USA
--
+++++++++++++++++++++++++++++++++++
Tirthankar Ghosal
https://member.acm.org/~tghosal
++++++++++++++++++++++++++++++++++++
*FinCausal 2023: Financial Document Causality Detection*
We are glad to announce that the Training Dataset for both English and
Spanish has been released and is available on CodaLab at this link:
https://codalab.lisn.upsaclay.fr/competitions/14596
Please register on CodaLab and go to the FinCausal 2023 competition.
Under Participate, you will find the Training Datasets together with a
Starting Kit to guide you through the Task.
###### *Task Description and Important Links *#######
The *FinCausal-2023 Shared Task: “Financial Document Causality Detection”* is
organised within the *5th Financial Narrative Processing Workshop (FNP
2023)*, taking place at the 2023 IEEE International Conference on Big Data
(IEEE BigData 2023) <http://bigdataieee.org/BigData2023/>, Sorrento, Italy,
15-18 December 2023. It is a *one-day event*.
Workshop URL: https://wp.lancs.ac.uk/cfie/fincausal2023/
###### *Additional Information *#######
*Shared Task Description:*
Financial analysis requires not only factual data but also an explanation of
why these data vary: the data state facts, but more knowledge is needed
about how these facts materialised. Furthermore, understanding causality is
crucial in studying decision-making processes.
The *Financial Document Causality Detection Task* (FinCausal) aims at
identifying elements of cause and effect in causal sentences extracted from
financial documents. Its goal is to evaluate which events or chains of
events can cause a financial object to be modified or an event to occur,
given a particular context. In the financial landscape, identifying cause
and effect from external documents and sources is crucial to explain why a
transformation occurs.
Two subtasks are organised this year: the *English FinCausal subtask* and
the *Spanish FinCausal subtask*. This is the first year in which we
introduce a subtask in Spanish.
*Objective*: For both tasks, participants are asked to identify, given a
causal sentence, which elements of the sentence relate to the cause, and
which relate to the effect. Participants can use any method they see fit
(regex, corpus linguistics, entity relationship models, deep learning
methods) to identify the causes and effects.
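Since any method is allowed, from regular expressions to deep learning, here is a deliberately naive, hedged sketch of a connective-based baseline: it splits a sentence on a small set of causal connectives and treats the two sides as candidate cause and effect spans. It is illustrative only and makes no claim about the official data format or evaluation.

```python
# Hedged, deliberately naive baseline: split a causal sentence on a connective
# and treat the two sides as candidate cause / effect spans. Real systems would
# typically use sequence labeling or span extraction models instead.
import re

# A few English connectives; the list and the cause/effect orientation are
# simplifications chosen for illustration.
CONNECTIVES = [
    (r"\bbecause of\b", "effect_first"),
    (r"\bdue to\b", "effect_first"),
    (r"\bas a result of\b", "effect_first"),
    (r"\bled to\b", "cause_first"),
    (r"\bresulting in\b", "cause_first"),
]

def split_cause_effect(sentence):
    for pattern, order in CONNECTIVES:
        match = re.search(pattern, sentence, flags=re.IGNORECASE)
        if match:
            left = sentence[:match.start()].strip(" ,.")
            right = sentence[match.end():].strip(" ,.")
            if order == "effect_first":
                return {"cause": right, "effect": left}
            return {"cause": left, "effect": right}
    return {"cause": "", "effect": ""}

# Example usage
print(split_cause_effect("Revenue fell by 12% due to weaker demand in Europe."))
# -> {'cause': 'weaker demand in Europe', 'effect': 'Revenue fell by 12%'}
```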
*English FinCausal subtask*
- *Data Description: *The dataset has been sourced from various 2019
financial news articles provided by Qwam, along with additional SEC data
from the Edgar Database. Additionally, we have augmented the dataset from
FinCausal 2022, adding 500 new segments. Participants will be provided with
a sample of text blocks extracted from financial news and already labelled.
- *Scope:* The *English FinCausal subtask* focuses on detecting causes
and effects when the effects are quantified. The aim is to identify, in
a causal sentence or text block, the causal elements and the consequential
ones. Only one causal element and one effect are expected in each segment.
- *Length of Data fragments:* The *English FinCausal subtask* segments
are made up of up to three sentences.
- *Data format: *CSV files. Datasets for both the English and the
Spanish subtasks will be presented in the same format.
This shared task focuses on determining causality associated with a
quantified fact. An event is defined as the arising or emergence of a new
object or context regarding a previous situation. So, the task will
emphasise the detection of causality associated with the transformation of
financial objects embedded in quantified facts.
*Spanish FinCausal subtask*
- *Data Description: *The dataset has been sourced from a corpus of
Spanish financial annual reports from 2014 to 2018. Participants will be
provided with a sample of text blocks extracted from financial news,
labelled through inter-annotator agreement.
- *Scope:* The *Spanish FinCausal subtask* aims to detect all types of
causes and effects, not necessarily limited to quantified effects. The
aim is to identify, in a paragraph, the causal elements and the
consequential ones. Only one causal element and one effect are expected in
each paragraph.
- *Length of Data fragments:* The *Spanish FinCausal subtask* involves
complete paragraphs.
- *Data format: *CSV files. Datasets for both the English and the
Spanish subtasks will be presented in the same format.
This shared task focuses on determining causality associated with both
events and quantified facts. For this task, a cause can be the justification
for a statement or the reason that explains a result. This task is also a
relation detection task.
Best regards,
FinCausal 2023 Team
The Data Science Chair at JMU Würzburg as a member of the Center for AI and Data Science (CAIDAS) offers two positions for doctoral researchers (m/w/d) in the area of machine learning.
Both positions will work within the BigData@Geo2 project, the follow-up to the successful BigData@Geo project [1], which provides machine-learning-aided decision support for agricultural measures in the light of regional climate change. This includes prediction of crop yields and enabling proactive agricultural strategies.
In the first position, you will build climate models improved by machine and deep learning that provide a basis for the prediction of regional climate change and agricultural risk assessment, allowing agriculture to react in time by applying appropriate policies to deal with the challenge of changing climate-related conditions. This work focuses on the use and extension of state-of-the-art deep learning architectures such as transformers to solve important downstream tasks such as increasing climate model resolution, identifying relevant climate indicators, integrating additional ecosystem information, and transfer functions.
The second position focuses on natural language processing and will allow you to work on data from many small companies in the form of historical yearbooks, as well as general information from local newspapers or social media discussing local climate events. Using this data, you will develop new methods for discovering climate, ecosystem and agriculturally relevant events that assist in the overarching goal of BigData@Geo2 of assessing the economic viability of agricultural decisions, such as which crops to grow in future seasons, or predicting crop yield.
Payment is at the level of E13 according to the German federal wage agreement scheme (TV-L). Candidates are expected to have a strong background in computer science and mathematics, with a specialisation in machine learning and interest in the topic of one of the positions. Prior knowledge in the field of deep learning in one of the subject areas is advantageous.
Please send your application (letter of motivation, curriculum vitae, academic records) at your earliest convenience, but no later than August 25th, 2023, to Prof. Dr. Andreas Hotho (dmir-jobs(a)uni-wuerzburg.de). You are welcome to contact us at the same address for additional details.
[1] https://bigdata-at-geo.eu/