July 2022 - Corpora - ELRA lists

FigLang 2022 (co-located with EMNLP 2022): Shared Task on Euphemism Detection
by Anna Feldman 08 Jul '22

https://sites.google.com/view/figlang2022/shared-tasks?authuser=0

Euphemism Detection Shared Task

Euphemisms are mild or indirect expressions used in place of harsher or more offensive ones. They are often used to mask profanity or to refer politely to taboo topics such as death, disability, sex, religion or personal relationships. Euphemisms are often ambiguous, because whether an expression is read literally or non-literally depends on context:

Asked to choose *between jobs* and the environment, a majority -- at least in our warped, first-past-the-post system -- will pick jobs. [non-euphemistic]

vs.

This summer, the budding talent agent was *between jobs* and free to babysit pretty much any time. [euphemistic]

State-of-the-art language models perform well on many major NLP benchmarks; however, it is unclear how such models perform on euphemisms. Thus, we propose a euphemism detection task: given an input sentence, identify whether the sentence contains a euphemism.

For more information about the shared task and to participate, visit https://codalab.lisn.upsaclay.fr/competitions/5726

*Important dates:*
- July 5, 2022: CodaLab competition is open; training data can be downloaded
- Aug 5, 2022: Test data can be downloaded and results submitted; performance will be tracked on CodaLab dashboard
- Aug 20, 2022: Last day for submitting predictions on test data
- Sept 7, 2022: Papers describing the systems are due
- Oct 9, 2022: Notification of acceptance
- TBD, 2022: Camera-ready papers due
- December 7 or 8, 2022: Workshop

--
Anna Feldman, Ph.D.
Professor of Linguistics and Computer Science
Graduate Program Coordinator & Chair of Linguistics
Montclair State University
http://www.purl.org/net/fa
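
As an illustration of the euphemism detection task described in the announcement above, here is a minimal Python sketch of how the two example sentences could be represented as labelled items. The `Example` record, its field names, and the `parse_example` helper are hypothetical and are not taken from the shared task data or its CodaLab page; the only thing borrowed from the announcement is the convention of marking the candidate phrase with asterisks.

```python
import re
from dataclasses import dataclass

# Hypothetical record type: field names are illustrative only,
# not the format used by the shared task organizers.
@dataclass
class Example:
    text: str           # sentence with the asterisk markers removed
    candidate: str      # phrase that was marked as *...* in the announcement
    is_euphemism: bool  # True if the phrase is used euphemistically here

# Matches a phrase wrapped in single asterisks, e.g. *between jobs*
MARKED = re.compile(r"\*(.+?)\*")

def parse_example(marked_sentence: str, is_euphemism: bool) -> Example:
    """Extract the marked candidate phrase and strip the markers."""
    match = MARKED.search(marked_sentence)
    if match is None:
        raise ValueError("no *marked* phrase found")
    plain_text = MARKED.sub(r"\1", marked_sentence)
    return Example(text=plain_text, candidate=match.group(1),
                   is_euphemism=is_euphemism)

# The two sentences quoted in the announcement
examples = [
    parse_example(
        "Asked to choose *between jobs* and the environment, a majority "
        "-- at least in our warped, first-past-the-post system -- will pick jobs.",
        is_euphemism=False,
    ),
    parse_example(
        "This summer, the budding talent agent was *between jobs* and free "
        "to babysit pretty much any time.",
        is_euphemism=True,
    ),
]
```

Both items share the same candidate phrase and differ only in label, which is the context dependence the task is built around.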

FigLang2022 Shared Task on “Understanding Figurative Language” (competition opens July 10, 2022)
by Smaranda Muresan 08 Jul '22

Shared Task on Understanding Figurative Language at FigLang2022

We are happy to announce a new shared task on Understanding Figurative Language as part of the Figurative Language Workshop (FigLang 2022) at EMNLP 2022.

In recent years, there have been several benchmarks dedicated to figurative language understanding, which generally frame "understanding" as a recognizing textual entailment task -- deciding whether one sentence (premise) entails/contradicts another (hypothesis) (Chakrabarty et al. 2021, Stowe et al. 2022). We introduce a new shared task for figurative language understanding around this textual entailment paradigm, where the hypothesis is a sentence containing the figurative language expression (e.g., metaphor, sarcasm, idiom, simile) and the premise is a literal sentence containing the literal meaning.

There are two important aspects of this task: 1) the task requires generating not only the label (entail/contradict) but also a plausible explanation for the prediction; 2) the entail/contradict label and the explanation are related to the meaning of the figurative language expression.

For more information about the shared task, including links to the datasets, evaluation metrics, scripts and important dates, please visit the shared task website (https://figlang2022sharedtask.github.io/). Participants can use the following CodaLab link (https://codalab.lisn.upsaclay.fr/competitions/5908) to participate in the task and to submit predictions.

Important dates:
- July 10, 2022: CodaLab competition is open; training data can be downloaded
- Aug 15, 2022: Test data (available only to registered participants) can be downloaded and results submitted; performance will be tracked on CodaLab dashboard
- Aug 20, 2022: Last day for submitting predictions on test data
- Sept 7, 2022: Papers describing the systems are due
- Oct 9, 2022: Notification of acceptance
- TBD, 2022: Camera-ready papers due
- December 8, 2022: Workshop at EMNLP 2022

Organizing Team
Tuhin Chakrabarty, Columbia University; tuhin.chakr(a)cs.columbia.edu
Arkadiy Saakyan, Columbia University; as5423(a)columbia.edu
Debanjan Ghosh, Educational Testing Service; dghosh(a)ets.org
Smaranda Muresan, Data Science Institute, Columbia University; smara(a)columbia.edu
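
To make the task setup in the announcement above concrete, here is a minimal Python sketch of a prediction that carries both a label and an explanation. The `Prediction` record, the `validate` helper, the exact label strings, and the example sentences are all assumptions made for illustration; the real data format and labels are defined on the shared task website, not here.

```python
from dataclasses import dataclass

# Label names follow the announcement's wording ("entail/contradict");
# the strings used in the actual dataset may differ (assumption).
LABELS = {"entail", "contradict"}

# Hypothetical record type, not the official submission format.
@dataclass
class Prediction:
    premise: str      # literal sentence
    hypothesis: str   # sentence containing the figurative expression
    label: str        # "entail" or "contradict"
    explanation: str  # free-text justification for the label

def validate(pred: Prediction) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    if pred.label not in LABELS:
        problems.append(f"unknown label: {pred.label!r}")
    if not pred.explanation.strip():
        problems.append("missing explanation")
    return problems

# Invented example (not from the shared task data)
pred = Prediction(
    premise="I was very nervous before the exam.",
    hypothesis="I had butterflies in my stomach before the exam.",
    label="entail",
    explanation="Having butterflies in one's stomach is an idiom for feeling nervous.",
)
problems = validate(pred)
```

Checking for an empty explanation reflects the first aspect named in the announcement: a label on its own is not a complete prediction.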

[CFP] Dialogue & Discourse journal: Fall 2022 Regular Issue
by Ryuichiro Higashinaka 07 Jul '22

Apologies for cross-posting

It is our pleasure to announce the first call for submissions for the next issue of the journal Dialogue and Discourse. Submissions are invited on all topics in the formal, computational, or psycholinguistic study of dialogue and discourse.

Submissions received by August 1, 2022 will be considered for the next regular issue. Later submissions will be slated for the next available issue.

http://www.dialogue-and-discourse.org/cfps-current.shtml

Dialogue and Discourse (D&D, http://www.dialogue-and-discourse.org/) is the first peer-reviewed free open access journal dedicated exclusively to work that deals with language "beyond the sentence". The journal adopts an interdisciplinary perspective, accepting work from Linguistics, Computer Science, Psychology, Sociology, Philosophy, and other associated fields with an interest in formally, technically, empirically or experimentally rigorous approaches. Descriptive papers should make a substantial theoretical contribution to be considered. We are committed to ensuring the highest editorial standards and rigorous peer-review of all submissions, while granting open access to all interested readers.

D&D has published regular issues every year since 2010, and occasionally special issues on common topics. As of June 2022, D&D has published 99 papers, and the journal's h-index is 26. D&D is endorsed by ACL SIGdial, ACL SemDial, and AMLaP. D&D is indexed by Scopus and the European Reference Index for the Humanities and Social Sciences.

Submissions are made via the online submission system at http://www.dialogue-and-discourse.org/submission.shtml

Authors are required to indicate if a submission is an extended version of one or more previously published conference papers (to which we would expect substantial additions); simultaneous submission to another venue is prohibited. Submissions will undergo rigorous peer-review.

Once accepted and finalized, papers will appear online immediately, as part of the current issue. Selected papers will furthermore be offered the opportunity to present a poster at the following SIGDIAL Conference.

Dialogue and Discourse Editors

Issue Editors:
Ryuichiro Higashinaka (Volume 13, Issue 2)
Junyi Jessy Li (Volume 13, Issue 1)

Editor In Chief:
Barbara Di Eugenio, University of Illinois at Chicago, United States

Associate Editors:
Vera Demberg, Saarland University, Germany
Kallirroi Georgila, University of Southern California, United States
Jonathan Ginzburg, Université Paris-Diderot (Paris 7), France
Pat Healey, Queen Mary University London, United Kingdom
Ryuichiro Higashinaka, Nagoya University, Japan
Junyi Jessy Li, University of Texas at Austin, United States
Massimo Poesio, Queen Mary University London, United Kingdom
Manfred Stede, University of Potsdam, Germany
David R. Traum, University of Southern California, United States
Amir Zeldes, Georgetown University, United States

Full editorial board at: http://www.dialogue-and-discourse.org/editors.shtml

Announcement: Release of BabelNet 5.1
by Roberto Navigli 07 Jul '22

We are proud to announce the release of a new version of BabelNet <https://babelnet.org/> and its APIs, *both Java and the brand-new Python version*, developed jointly by the Sapienza NLP Group <http://nlp.uniroma1.it> of the *Sapienza University of Rome* under the supervision of prof. Roberto Navigli <https://www.diag.uniroma1.it/navigli/> and Babelscape <http://babelscape.com/>, *a successful deep-tech multilingual NLP Company* providing innovative solutions for multilingual NLP.

BabelNet -- winner of the *prominent paper award 2017* from the Artificial Intelligence Journal and the META prize 2015, and covered in media such as The Guardian <https://www.theguardian.com/news/2018/feb/23/oxford-english-dictionary-can-…> and Time magazine <http://wwwusers.di.uniroma1.it/~navigli/img/Redefining_the_modern_dictionar…> -- is today’s *most far-reaching multilingual resource* which, according to need, can be used as an *encyclopedic dictionary*, a *semantic network* or a huge *knowledge base/ontology*. It has been used by more than *1000 universities and research institutions*, enabling multilinguality in several fields of AI and NLP, such as semantic search, Word Sense Disambiguation, Semantic Role Labeling and image tagging.

BabelNet was created by means of the seamless integration and interlinking of the largest multilingual Web encyclopedia - i.e., Wikipedia - with the most popular computational lexicon of English - i.e., WordNet, and other lexical resources such as Wiktionary, OmegaWiki, Wikidata, dozens of wordnets, Wikiquote, GeoNames, and ImageNet. BabelNet provides *multilingual synsets*, i.e., concepts and named entities lexicalized in many languages, and connected with large amounts of semantic relations.

*Version 5.1* comes with the following features:
- *500 languages* and *22 million synsets* covered;
- *53 resources* linked and integrated;
- *Wikipedia* and *Wikidata* updated thanks to *BabelNet live*;
- *Open English WordNet* has been updated to version 2021;
- Added *Q-codes* identifiers (e.g. https://www.hetop.eu/hetop/3CGP/?la=en&rr=CGP_QC_QD8);
- Added *string tags* from *Wikipedia labels*;
- *French wordnets cleaned up* by removing most potentially incorrect translations;
- *Italian wordnet definitions* cleaned up;
- *General data cleanup* (glosses, senses, Named Entity vs. concept labels);
- *Lemma casing corrected in 24 languages* (English, Italian, Spanish, German, French, Dutch, Polish, Portuguese, Russian, Bulgarian, Czech, Danish, Greek, Estonian, Finnish, Croatian, Hungarian, Lithuanian, Latvian, Maltese, Romanian, Slovak, Slovenian, Swedish).

More statistics are available at: babelnet.org/statistics.

Kind regards,
The BabelNet team

--
Roberto Navigli - Professor
Department of Computer, Control and Management Engineering
Sapienza University of Rome
Via Ariosto, 25
00185 Roma Italy
Phone: +39 06 77274109
Home Page: https://www.diag.uniroma1.it/navigli/
Sapienza NLP Group: http://nlp.uniroma1.it
Co-founder of Babelscape <https://babelscape.com>

Online workshop for linguists: "Learning to use the Terminal"
by Mario Barcala 07 Jul '22

Dear corpora subscribers:

At NLPgo (https://www.nlpgo.com) we are organizing an online workshop for linguists titled "Learning to use the Terminal", which may be of interest to you.

> What is the Terminal?
> It is the usual black or white background application, available in any operating system, which allows users to run commands. It is also called the command interpreter.

> What is it useful for?
> It is useful for many different things, among them applying different kinds of transformations to files. It therefore allows us to perform some interesting corpus calculations that would otherwise be very difficult.

> Workshop content
> 1. Preparation: installing the necessary tools and setting up the working environment.
> 2. Basic concepts: file and folder structure, file types, character encodings (types, differences and compatibility problems).
> 3. Basic commands: show files available in a folder, change current folder, show file contents, copy and move files, column extraction and reordering, result sorting, etc.
> 4. Advanced commands and transformations: standard input/output/error, command chaining, finding file names, text finding and replacing, applying commands to several files at a time.
> 5. Regular expressions: advanced text search and replacement techniques.
> 6. Corpus-specific tasks: working with data from spreadsheets, texts (orthographic words) and Part-Of-Speech tagged texts (grammatical elements).

More details about the workshop are available on our web page and, specifically, on the workshop page: https://www.nlpgo.com/teaching/terminal

As you can see, it will be held from the 18th to the 22nd of July (in Spanish, Spain timezone). If you want to register for the workshop, you can do so from the workshop web page given above.

In addition, if you are interested in receiving information about the workshops of this kind that we organize, and other useful information, you can subscribe to our newsletter here: https://www.palabrasbinarias.com/subscribe

If you have any questions, don't hesitate to write to us through the contact form available on the workshop web page.

Thank you for your attention.

Best regards,
Mario Barcala

--
Mario Barcala
CEO at NLPgo
http://www.nlpgo.com
GPG key id: F1C15EB7
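
For readers wondering what the "corpus calculations" in the announcement above might look like, here is a small Python sketch of two of the operations it lists: counting orthographic words with sorted results, and extracting one column from a Part-Of-Speech tagged text. The workshop itself teaches terminal commands; Python is used here only for illustration. The function names, the tab-separated word/tag layout, and the sample data are assumptions, not workshop material.

```python
import re
from collections import Counter

def word_frequencies(text: str) -> list[tuple[str, int]]:
    """Count lowercase word tokens, most frequent first (ties alphabetical)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def extract_column(lines: list[str], index: int, sep: str = "\t") -> list[str]:
    """Return one column from separator-delimited lines, skipping blank lines."""
    return [line.rstrip("\n").split(sep)[index] for line in lines if line.strip()]

# Invented sample data
frequencies = word_frequencies("The cat and the dog")

# Assumed layout: one token per line, word and tag separated by a tab
tagged = ["The\tDET", "cat\tNOUN", "sleeps\tVERB"]
tags = extract_column(tagged, 1)
tag_counts = Counter(tags)
```

Each function corresponds to one stage of a chained command: tokenize, count, sort, or select a column.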

PhD Position at University of Klagenfurt
by Wiegand, Michael 07 Jul '22

The University of Klagenfurt is pleased to announce the following open position at the Digital Age Research Center (D!ARC), employment to commence as soon as possible:

*PreDoc Scientist (in German: Universitätsassistent*in) (Doctoral Candidate) (all genders welcome)*

The Digital Age Research Center (D!ARC), founded in 2019 as an inter-faculty university centre, aims to shed light not only on the technological but also on the economic, legal, social, individual, behavioural and cultural aspects of the digital revolution. Over the next few years, it is set to develop a corresponding profile in research with a European claim to excellence as well as modules for the range of courses offered at the University of Klagenfurt.

Level of employment: 75 % (30 hours/week)
Minimum salary: € 32.116,-- per annum (gross); classification according to collective agreement: B1
Limited to: 4 years
Application deadline: July 27, 2022
Reference code: 191/22

Tasks and responsibilities:
• Contributing to research in a project in the area of computational linguistics and digital humanities
• Participation in research activities of D!ARC, which offers unique opportunities for interdisciplinary collaboration
• Teaching courses on computational linguistic topics

Prerequisites for the appointment:
• A Master’s degree completed at a domestic or foreign higher education institution in the field of computational linguistics, linguistics, computer science or similar, graded with success, and corresponding knowledge in the field
• Good programming skills (preferably Python)
• Proven expertise in: natural language processing, digital humanities, linguistics, machine learning (particularly deep learning)
• Fluent in English and German, both spoken and written

Additional desired qualifications:
• Experience with web crawling and processing large amounts of textual data
• Instructing and supervising linguistic annotation
• Publications at scientific conferences and in journals in the field relating to the position
• Profound knowledge of publicly available tools and resources for natural language processing
• Experience in teaching at a university
• Experience in working in interdisciplinary research projects
• Social and communication skills, ability to work independently

Our offer:
The employment contract is concluded with a starting salary of € 2.294,-- gross per month (14 times a year; previous experience deemed relevant to the job can be recognised in accordance with https://jobs.aau.at/en/faq/).

The University of Klagenfurt also offers:
• Personal and professional advanced training courses, management and career coaching
• Numerous attractive additional benefits, see also https://jobs.aau.at/en/the-university-as-employer/
• Diversity- and family-friendly university culture
• The opportunity to live and work in the attractive Alps-Adriatic region with a wide range of leisure activities in the spheres of culture, nature and sports

The application:
If you are interested in this position, please apply in German or English providing the usual documents:
• Letter of application
• Curriculum vitae
• Proof of all completed higher education programmes (certificates, supplements, if applicable)
• Other documentary evidence that may be relevant to this announcement (see prerequisites and desired qualifications)

The position is solely intended for the completion of a Doctorate. Applicants with a Doctorate or Ph.D. already completed in a related discipline are therefore ineligible for this position.

To apply, please select the position with the reference code 191/22 in the category “Scientific Staff” using the link “Apply for this position” in the job portal at http://jobs.aau.at/en/

Candidates must furnish proof that they meet the required qualifications by July 27, 2022 at the latest.

For further information on this specific vacancy, please contact Michael Wiegand (mailto:michael.wiegand@aau.at). General information about the university as an employer can be found at http://www.aau.at/jobs/en/information.

At the University of Klagenfurt, recruitment and staff matters are accompanied not only by the authority responsible for the recruitment procedure but also by the Equal Opportunities Working Group (https://www.aau.at/en/university/organisation/representations-commissioners…) and, if necessary, by the Representative for Disabled Persons (https://www.aau.at/en/university/organisation/administration-and-management…).

CoCo4MT Final CFP - The First Workshop on Corpus Generation and Corpus Augmentation for Machine Translation
by Shabnam Tafreshi 07 Jul '22

!!!!!!!!!! The deadline for submission has been extended to July 20th !!!!!!!!!!

Upload Submissions Now
https://cmt3.research.microsoft.com/AMTA2022

The First Workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT)
https://sites.google.com/view/coco4mt

@ AMTA – 2022
The 15th biennial conference of the Association for Machine Translation in the Americas
12-16 September 2022, Orlando, Florida, USA

INVITED TALKS
Jörg Tiedemann, University of Helsinki
Julia Kreutzer, Google Research
Maria Nadejde, Amazon

SCOPE
It is well known that machine translation systems, especially those that use deep learning, require massive amounts of data. For many languages, resources are not available in human-created form. The types of resources that are available include monolingual and multilingual corpora, translation memories, and lexicons. Parallel resources are generally created for formal purposes, such as parliamentary collections, while monolingual resources tend to come from more informal situations. The quality and abundance of resources, including corpora, created for formal purposes is generally higher than for those created for informal purposes. Additionally, corpora for low-resource languages (languages with fewer digital resources available) tend to be less abundant and of lower quality.

CoCo4MT sets out to be the first workshop centered around research that focuses on corpora creation, cleansing, and augmentation techniques specifically for machine translation. We accept work that covers any spoken language (including high-resource languages), but we are specifically interested in submissions on languages with limited existing resources (low-resource languages).

The goal of this workshop is to begin to close the gap in corpora available for low-resource translation systems. Promoting high-quality data for online systems that can be used by native speakers of low-resource languages is of particular interest. Therefore, it will be beneficial if the techniques presented in research papers include their impact on the quality of MT output and how they can be used in the real world.

CoCo4MT aims to encourage research on new and undiscovered techniques. We hope that submissions will provide high-quality corpora that are publicly available for download and can be used to increase machine translation performance, thus encouraging new dataset creation for multiple languages that will, in turn, provide a general workshop to consult for corpora needs in the future. The workshop’s success will be measured by the following key performance indicators:
- Promotes the ongoing increase in quality of machine translation systems when measured by standard measurements,
- Provides a meeting place for collaboration from several research areas to increase the availability of commonly used corpora and new corpora,
- Drives innovation to address the need for higher quality and abundance of low-resource language data.

TOPICS
We are highly interested in original research papers on the topics below; however, we welcome all novel ideas that cover research on corpora techniques.
- Difficulties with using existing corpora (e.g., political considerations or domain limitations) and their effects on final MT systems,
- Strategies for collecting new MT datasets (e.g., via crowdsourcing),
- Data augmentation techniques,
- Data cleansing and denoising techniques,
- Quality control strategies for MT data,
- Exploration of datasets for pretraining or auxiliary tasks for training MT systems.

SUBMISSION INFORMATION
There is one type of submission in the workshop: Research, review and position paper.
The length of each paper should be at least four (4) and not exceed ten (10) pages, plus unlimited pages for references. Submissions should be formatted according to the official AMTA 2022 style templates (PDF, LaTeX, Word). Accepted papers will be published on-line in the AMTA 2022 proceedings, which includes the ACL Anthology, and will be presented at the conference either orally or as a poster.

Submissions must be anonymized and should be made using the official conference management system (https://cmt3.research.microsoft.com/AMTA2022). Scientific papers that have been or will be submitted to other venues must be declared as such, and must be withdrawn from the other venues if accepted and published at CoCo4MT. The review will be double-blind.

We would like to encourage authors to cite papers written in ANY language that are related to the topics, as long as both original bibliographic items and their corresponding English translations are provided.

Registration will be handled by the main conference. (To be announced)

IMPORTANT DATES
June 1, 2022 – Call for papers released
June 15, 2022 – Second call for papers
June 29, 2022 – Third and final call for papers
July 20, 2022 – Paper submissions due (updated extension!)
July 27, 2022 – Notification of acceptance
August 7, 2022 – Camera-ready due
August 31, 2022 – Video recordings due
September 16, 2022 – CoCo4MT workshop

CONTACT
CoCo4MT Workshop Organizers
coco4mt2022(a)googlegroups.com

ORGANIZING COMMITTEE (listed alphabetically)
Constantine Lignos, Brandeis University
John E. Ortega, New York University and University of Santiago de Compostela (CITIUS)
Katharina Kann, University of Colorado Boulder
Maja Popović, ADAPT Centre at Dublin City University
Marine Carpuat, University of Maryland
Shabnam Tafreshi, University of Maryland
William Chen, Carnegie Mellon University

PROGRAM COMMITTEE (listed alphabetically, tentative)
Abteen Ebrahimi, University of Colorado Boulder
Adelani David, Saarland University
Ananya Ganesh, University of Colorado Boulder
Alberto Poncelas, ADAPT Centre at Dublin City University
Amirhossein Tebbifakhr, University of Trento
Anna Currey, Amazon
Arturo Oncevay, University of Edinburgh
Atul Kr. Ojha, National University of Ireland Galway
Bharathi Raja Chakravarthi, National University of Ireland Galway
Beatrice Savoldi, University of Trento
Bogdan Babych, Heidelberg University
Briakou Eleftheria, University of Maryland
Dossou Bonaventure, Mila Quebec AI Institute
Duygu Ataman, New York University
Eleni Metheniti, Université Toulouse - Paul Sabatier
Francis Tyers, Indiana University
Jasper Kyle Catapang, University of Birmingham
John E. Ortega, New York University and USC - CITIUS
José Ramom Pichel Campos, Universidade de Santiago de Compostela - CITIUS
Kalika Bali, Microsoft
Koel Dutta Chowdhury, Saarland University
Liangyou Li, Huawei
Manuel Mager, University of Stuttgart
Maria Art Antonette Clariño, University of the Philippines Los Baños
Mathias Müller, University of Zurich
Nathaniel Oco, De La Salle University
Niu Xing, Amazon
Pablo Gamallo, Universidade de Santiago de Compostela - CITIUS
Rodolfo Joel Zevallos Salazar, Universitat Pompeu Fabra
Rico Sennrich, University of Zurich
Sangjee Dondrub, Qinghai Normal University
Santanu Pal, Saarland University
Sardana Ivanova, University of Helsinki
Shantipriya Parida, Silo AI
Surafel Melaku Lakew, Amazon
Tommi A Pirinen, University of Tromsø
Valentin Malykh, Moscow Institute of Physics and Technology

--
*Shabnam Tafreshi, PhD*
*Assistant Research Scientist*
*Computational Linguistics, NLP*
*UMD: ARLIS @ College Park*

*"All the problems of the world could be settled easily, if people only willing to think."*
*-Thomas J. Watson*

Fully Funded PhD Position (Belgium)
by ashwin.ittoo＠uliege.be 06 Jul '22

We are offering a fully funded, industry-sponsored PhD scholarship on the topic of Language Models. The selected candidate will have the opportunity to conduct research at the junction of industry and academia. She/he will also be part of an exciting team of data scientists and PhD researchers from both the corporate and the academic world.

For more details, please see:
https://www.akadeus.com/announcement,a7165.html
https://www.digitallab.be/en/

[DEADLINE EXTENSION] [CFP] VarDial 2022 - Ninth Workshop on NLP for Similar Languages, Varieties and Dialects
by Scherrer, Yves 06 Jul '22

Co-located with COLING 2022, at VarDial we anticipate discussion on computational methods and on language resources for closely related languages, language varieties and dialects. We plan to organize VarDial 2022 as a hybrid workshop with options for both on-site and remote participation.

We accept paper submissions until July 22, 2022 (details below).

https://sites.google.com/view/vardial-2022

We welcome papers dealing with one or more of the following topics:
- Language resources and tools for similar languages, varieties and dialects;
- Adaptation of tools (taggers, parsers) for similar languages, varieties and dialects;
- Evaluation of language resources and tools when applied to language varieties;
- Reusability of language resources in NLP applications (e.g., for machine translation, POS tagging, syntactic parsing, etc.);
- Corpus-driven studies in dialectology and language variation;
- Computational approaches to the study of mutual intelligibility between dialects and similar languages;
- Automatic identification of lexical variation;
- Automatic classification of language varieties;
- Text similarity and adaptation between language varieties;
- Linguistic issues in the adaptation of language resources and tools (e.g., semantic discrepancies, lexical gaps, false friends);
- Machine translation between closely related languages, language varieties and dialects.

In addition to the topics listed above, we also welcome papers dealing with diachronic language variation (e.g. phylogenetic methods, historical dialects).

Instructions for Authors
Submissions should be formatted according to the COLING template and submitted in PDF format. The review process will be double-blind. More information on the website.

Important Dates
Submission deadline: EXTENDED TO JULY 22, 2022 (anywhere on earth)
Notification of acceptance: August 22, 2022
Camera-ready papers due: September 5, 2022
VarDial Workshop at COLING 2022: October 16, 2022

Organizers
Yves Scherrer - University of Helsinki (Finland)
Tommi Jauhiainen - University of Helsinki (Finland)
Nikola Ljubešić - Jožef Stefan Institute (Slovenia) and University of Zagreb (Croatia)
Preslav Nakov - Qatar Computing Research Institute, HBKU (Qatar)
Jörg Tiedemann - University of Helsinki (Finland)
Marcos Zampieri - Rochester Institute of Technology (USA)

Contact: yves.scherrer(a)helsinki.fi

Postdoctoral Researcher position in Artificial Intelligence and Natural Language Processing (SCAI-Sorbonne/BnF research program)
by Laure Soulier 06 Jul '22

--- apologies for cross-postings ---

Dear colleagues,

We have an open position for a postdoctoral researcher in natural language processing / information retrieval / machine learning (SCAI/BnF research program).

Starting period: autumn 2022
Duration: 12-month postdoctoral contract (renewable)
Location: Sorbonne University (ISIR lab, MLIA team) / DataLab of the BnF
Supervision:
Laure Soulier, MCF in computer science at Sorbonne University, MLIA team, ISIR.
Emmanuelle Bermès, Scientific and Technical Assistant to the Director of Services and Networks at BnF.
Jean-Philippe Moreux, Scientific expert of Gallica at the BnF.
More info: https://scai.sorbonne-universite.fr/public/news/view/27d72d260c950c8d66c6/1

_*Context*_
Gallica, the digital library of the BnF, contains nearly 10 million digitized documents that are freely accessible online (18.5 million visits per year). However, most users do not know that Gallica contains not only printed documents, but also photographs, sound recordings, videos, and 3D objects. In satisfaction surveys, only a minority of users consider the search engine's answers to be relevant and a majority would like to be better guided in their searches. A recommendation system should be able to help users find their way through the mass of collections and improve the visibility of the least known.

In this project, BnF is committed to adopting a resolutely ethical approach. The exploitation of user logs must respect users' privacy and guarantee both the relevance and transparency of the algorithms, avoiding the risk of filter bubbles. The interface design is also at the heart of the approach: a trustworthy system relies on a good user experience and on the diversity and relevance of the proposed recommendations.

Three lines of thought emerge:
1) based on the available data, including both user logs and collection descriptions, how to develop predictive algorithms?
2) how to integrate diversity in the recommendation algorithm while leaving users the choice to moderate their serendipity threshold?
3) how to build user trust in algorithm design and audit?

_*Main missions*_
This project consists in working on information access in the Gallica library, from the point of view of machine and deep learning techniques. The research axes concern (1) the analysis and indexing of textual documents, (2) the analysis of user traces and (3) recommendation systems. We are particularly interested in multimodal techniques that allow contextualizing a document or a query based on user interactions.

The successful candidate will be responsible for:
● Implementing models to learn the semantics of textual data for the purpose of indexing them.
● Developing algorithms based on representation learning methodologies to effectively blend text and user traces.
● Reporting and presenting development work in a clear and effective manner, both for discussion with BnF experts and for writing machine learning publications.

The printed book collection will be the primary focus of the program described above, but an extension to other collections with textual descriptors (in particular iconographic collections) may be considered.

--
Laure Soulier
Maître de conférences
Equipe MLIA - Laboratoire ISIR - Sorbonne Université
Tour 26, Couloir 26-00, Bureau 515
(+33) 1 44 27 74 91
https://pages.isir.upmc.fr/soulier/
