dear colleagues,
in the context of an upcoming research project (TRAVOLTA – Tracing Attitudes and Variation in Online Luxembourgish Text Archives) we currently have two open positions for doctoral candidates with a background in nlp, computational linguistics, sociolinguistics or variationist linguistics. the candidates will be members of the department of humanities at the university of luxembourg.
the project will be the first to monitor the development of luxembourgish into a fully-fledged written language using data from the rtl.lu news portal. we will develop state-of-the-art tools and pipelines for nlp purposes and use them to trace variation and social stances in written text.
additional information about the project can be found here<https://purschke.info/en/travolta/>.
it would be great if you could share the positions in your respective networks. please also feel free to reach out should you have any questions regarding the positions or be interested in applying.
both positions will be filled for a maximum of 3 + 1 years, with a current yearly salary of roughly 38k €. given the multilingual setup of the university and the special requirements of the project, candidates should have proficiency in english and german (intermediate); luxembourgish will be an asset, of course.
thx & cheers
christoph
Position 1)
Doctoral candidate in Computational Linguistics / Variationist Linguistics
* Full profile: http://emea3.mrted.ly/39skj
* Research in the field of Computational/Variationist Linguistics with a focus on variation in written Luxembourgish
* Background in
* Either: Computational Linguistics, Computer Science, or a related field with prior experience in working with language data
* Or: Linguistics, Variationist Linguistics, or a related field with prior experience in data science applications (NLP, Python, R)
Position 2)
Doctoral candidate in Computational Linguistics / Sociolinguistics
* Full profile: http://emea3.mrted.ly/39snq
* Research in the field of Computational Linguistics/Sociolinguistics with a focus on attitudes & stances in written Luxembourgish
* Background in
* Either: Computational Linguistics, Computer Science, or a related field with prior experience in working with language data
* Or: Sociolinguistics, Linguistics, or a related field with prior experience in data science applications (NLP, Python, R)
—
Christoph Purschke
Associate Professor in Computational Linguistics
Head of the Culture & Computation Lab
Faculty of Humanities, Education and Social Sciences
UNIVERSITÉ DU LUXEMBOURG
BELVAL CAMPUS
11, Porte des Sciences
L-4366 Esch-sur-Alzette
T +352 46 66 44 9226
M +352 661 266 323
W purschke.info<http://www.purschke.info> | cucolab.uni.lu<http://cucolab.uni.lu>
christoph.purschke(a)uni.lu<mailto:christoph.purschke@uni.lu> | @questoph<https://twitter.com/questoph/>
Languages: DEU, ENG, FRA, ITA, LTZ, NDS, PYT
Book a meeting: 30 min<https://fantastical.app/questoph/short-meeting> | 60 min<https://fantastical.app/questoph/long-meeting> | office hours<https://fantastical.app/questoph/office-hours>
Third Call for submission
TAL Journal: regular issue
http://tal-64-1.sciencesconf.org/
2023 Volume 64-1
Deadline for submission: 12/15/2022
Editors : Cécile Fabre, Emmanuel Morin, Sophie Rosset and
Pascale Sébillot
TOPICS
The journal Automatic Language Processing has an open call for papers.
Submissions may concern theoretical and experimental contributions on
all aspects of written, spoken, and signed language processing and
computational linguistics, for example:
- Computational models of language
- Linguistic resources
- Statistical learning and modeling
- Intermodality and multimodality
- Language multiplicity and diversity
- Semantics and comprehension
- Information access and text mining
- Language production and processing/generation/synthesis
- Evaluation
- Explicability and reproducibility
- NLP in interaction with other disciplines, digital humanities
This list is indicative. On all topics, it is essential that the
aspects related to natural language processing are emphasized.
We also welcome position papers and survey papers.
LANGUAGE
Manuscripts may be submitted in English or French. Submissions in
English are accepted only if one of the co-authors is not a
French speaker.
THE JOURNAL
TAL (http://www.atala.org/revuetal - Traitement Automatique des
Langues / Natural Language Processing) is an international journal
published by ATALA (French Association for Natural Language Processing)
since 1960 with the support of CNRS (National Centre for Scientific
Research). It has moved to an electronic mode of publication.
IMPORTANT DATES
Deadline for submission: 12/15/2022
Notification to authors after first review: 02/15/2023
Notification to authors after second review: 04/30/2023
Publication: September, 2023
SUBMISSION FORMAT
Papers must be between 20 and 25 pages long, including references
and appendices (no exceptions to the length limits will be granted).
TAL performs double-blind review: it is thus necessary to anonymise the
manuscript and the name of the PDF file and to avoid self-references.
Style sheets are available for download on the Web site of the journal
(https://www.atala.org/content/instruction-authors-style-files-0).
If you intend to submit a paper, please upload your
contribution via the menu "Paper submission" (PDF format). To do so,
you will need to have an account on the sciencesconf platform.
To create an account, go to the site http://www.sciencesconf.org
and click on "create account" next to the "Connect" button at the top
of the page. To submit, come back to the page (soon available)
http://tal-64-1.sciencesconf.org/, connect to your account and upload
your submission.
*** Apologies for cross-posting ***
Call for papers: Explainable AI in Natural Language Processing
Traditional Natural Language Processing (NLP) models (e.g., decision trees, Markov models) have primarily been based on inherently interpretable techniques, referred to as white-box techniques. In recent years, however, NLP models have employed advanced neural approaches along with language embedding features. Using these approaches, mostly referred to as black-box techniques, NLP models have yielded state-of-the-art performance. Nonetheless, the level of interpretability (e.g., how the model arrives at its results) has decreased significantly. This reduced interpretability not only lowers end users’ trust in NLP models but also makes it harder for developers to debug and improve them. Researchers in the NLP community are therefore giving significant attention to the emerging field of Explainable AI (XAI) to tackle the opacity of AI systems, for the sake of both trust and improvement. Beyond academia, organizations and companies have also launched well-funded projects such as DARPA XAI and People + AI Research (PAIR).
As XAI is still a growing field, there is plenty of room for innovation to improve the explainability of NLP systems. In recent works, explainable NLP models have captured linguistic knowledge of neural networks, explained predictions, stress-tested models via challenge sets or adversarial examples, and interpreted language embeddings.
The goal of this Research Topic is to better understand the present status of XAI in NLP by identifying: new dimensions for a better explanation, evaluation techniques used to measure the quality of explanations, approaches or developments of new software toolkits to explain XAI in NLP, and transparent deep learning models for different NLP tasks.
The scope of this Research Topic covers (but is not restricted to) the following topics:
• Survey of XAI in NLP in general or any particular NLP task such as NER, QA, Sentiment analysis, social media (SocialNLP), etc.
• Explainable Neural models in Machine Translation
• Explainable Neural models in Named Entity Recognition
• Explainable Neural models in Question Answering
• Explainable Neural models in Sentiment Analysis
• Explainable Neural models in Opinion Mining
• Explainable Neural models in SocialNLP
• Evaluation techniques used to measure the quality of explanations
• Tools for explaining explainability
• Resources related to XAI in the context of NLP
The Research Topic welcomes contributions toward interpretable models for efficient solutions to NLP research problems that explain the explainability of the proposed model using suitable explainability technique(s) (e.g., example-driven, provenance, feature importance, induction, surrogate models, etc.), visualization technique(s) (e.g., raw examples, saliency, raw declarative, etc.), and other aspects. Software toolkits or approaches that can help users express explainability to their models and ML pipelines are also welcome.
The full Call for Papers is available at https://www.frontiersin.org/research-topics/48440/explainable-ai-in-natural…
Impact of the publication: https://www.frontiersin.org/about/impact
The current deadlines are:
* Abstract Deadline: 16 December 2022. This is a soft deadline: submitting an abstract is not mandatory, but if you would like feedback on your prospective manuscript’s suitability, I encourage you to submit one around this time.
* Manuscript Deadline: 14 April 2023. This is a mandatory deadline for your full manuscript submission. However, we can accommodate personal extensions on a case-by-case basis.
Guest Associate Editors:
Somnath Banerjee (University of Tartu, somnath.banerjee(a)ut.ee)
David Tomás (University of Alicante, dtomas(a)dlsi.ua.es)
Somnath Banerjee
Lecturer,
Institute of Computer Science,
University of Tartu,
Narva mnt 18, room 3063
51009 Tartu, ESTONIA
webpage: http://www.ut.ee//~somnath/
Dear Researchers,
We are happy to inform you that the Eleventh International Conference on
Frontiers of Intelligent Computing: Theory and Applications (FICTA-2023)
will be organized by Cardiff Metropolitan University, United Kingdom. We
invite you to participate in FICTA-2023: https://ficta.co.uk/ on 11-12
April 2023, being organized in a hybrid mode.
Publication: All FICTA 2023 registered and presented papers will be
published in conference proceedings by Springer-Smart Innovation, Systems
and Technologies (SIST) Series (https://www.springer.com/series/8767).
Topics of interest: Submissions of quality papers are expected in all areas
of research and application in intelligent computing; please refer to the
call for papers at https://ficta.co.uk/call-for-papers.
Call for Special Session Proposals: If interested in floating/organizing a
special session please visit the link and follow the necessary guidelines:
https://ficta.co.uk/call-for-sessions
For any queries related to the conference you may feel free to e-mail:
FICTA2023(a)cardiffmet.ac.uk
Thank you
--
Warm Regards,
*Sandeep Singh Sengar*,
Lecturer in Computer Science
Cluster Leader Computer Vision / Image Processing
Cardiff Metropolitan University, Cardiff, UK CF5 2YB
-------------------------------------------------------------------------------
Email: SSSengar(a)cardiffmet.ac.uk
Web: https://sites.google.com/view/sandeepsengar
Hi all,
We are hiring a Research Associate (post-doc) to work on one of our NLP
projects at the University of Sheffield. The successful candidate will be
expected to lead the design and development of strategies for more
transparent machine learning models to generate accurate cross-lingual
representations for idiomatic language, as well as to contribute to the
design and development of resources and evaluation of downstream tasks,
like machine translation. For both lines of research, you will build on
state-of-the-art approaches based on deep learning.
The applicants should hold a PhD (or be close to completion) or have
equivalent work experience and a strong publication record. Solid knowledge
of Machine Learning models applied to Natural Language Processing and Deep
Learning is required, as are excellent programming skills in Python and deep
learning frameworks (esp. Keras, TensorFlow or PyTorch). Previous
experience developing word embedding models and/or Machine Translation is
also desirable.
Deadline for applications: 07/12/2022
More details, including how to apply, can be found at the following link:
bit.ly/3gnGP4l <https://t.co/KvevJ2LPoS>
Kind regards,
Carol
--
*Carolina Scarton*
Lecturer in Natural Language Processing
Department of Computer Science
University of Sheffield
http://staffwww.dcs.shef.ac.uk/people/C.Scarton/
Dear All,
Please share the CFP of the 7th International Conference on Innovations and
Creativity (ICIC 2023) to be held in Liepaja, Latvia in June 2023 with your
colleagues. ICIC covers a wide range of topics from Mathematics and
Computer Science to Art, Energy and Environmental Science. Check out Call
For Papers and Deadlines at: http://icic.liepu.lv/
Kind regards,
--
Dariush Alimohammadi, PhD,
Professor,
Faculty of Science and Engineering,
Liepaja University, Liepaja, Latvia.
=================================
*IberLEF 2023 -- Call for Task Proposals*
=================================
The goal of IberLEF is to encourage the research community to organize
competitive text processing, understanding and generation tasks, with the
aim of defining new research challenges and advancing the state of the art
in Natural Language Processing challenges involving at least one of the
following Iberian languages: Spanish, Portuguese, Catalan, Basque or
Galician. Researchers and practitioners from all areas of Natural Language
Processing and related communities are invited to submit task proposals that
fit IberLEF goals *by December 5, 2022.*
Proposals must be submitted (as a pdf file) to iberlef(a)googlegroups.com,
and should include the following fields:
● Title of the task.
● Description of the task, highlighting:
○ Relevance and novelty of the task, and the challenges involved.
○ Evaluation measures, and other relevant methodological aspects.
   ○ Expected target community, and actual or potential industrial
   take-up.
○ Related evaluation activities, if any.
○ Previous editions of the task, if any. If it has been organized
previously, what the roadmap is and what the novelties for 2023 are.
○ Linguistic resources to be gathered, created and/or reused. Please
include as many details on data gathering, selection and annotation
procedures as possible: sources and representativity,
training/validation/test sizes, harvesting procedures, profile of
annotators (experts, linguists, crowdworkers, etc.), multiple annotation
policy, IPR issues, baselines, etc.
● Tentative schedule (note that camera ready versions of the
proceedings must be ready *by July 6, 2023*).
● Organization committee: full name and affiliation of the
organizers, with a succinct description of their research interests, areas
of expertise and experience organizing similar events.
● Funding, if available.
● Contact person.
● Any other relevant issues.
*Task organizers' duties*
Note that organizers of accepted tasks are expected to:
● Set up the evaluation exercise according to the submitted proposal.
● Promote the task within the target research community.
● Manage the submission and scientific evaluation of the system
description papers of the corresponding systems submitted by the
participants. The accepted papers will be published in
the IberLEF proceedings.
● Prepare and submit an overview of the evaluation exercise.
● Present the results of the task at IberLEF 2023.
*Task selection procedure*
Each submitted proposal will be reviewed by members of the IberLEF steering
and program committee, and decisions will be sent back to the task
organizers by *January 16, 2023*.
*Proceedings*
IberLEF 2023 Proceedings including the description of the participating
systems will be published at CEUR-WS.org. Task Overviews will be published
in the journal *Natural Language Processing* (
http://www.sepln.org/en/journal, indexed in Clarivate ESCI and Elsevier
SJR) in its September 2023 issue. Task Organizers are expected to send the
camera ready task and system description papers for their task to
IberLEF organizers by *July 6, 2023*.
*Important dates*
● Task proposals due: December 5, 2022.
● Notification of acceptance: January 16, 2023.
● Camera ready submissions due: July 6, 2023.
● IberLEF Workshop: September 2023.
*IberLEF general chairs*:
Salud María Jiménez Zafra, SINAI, Universidad de Jaén (Spain)
Manuel Montes y Gómez, INAOE (Mexico)
Francisco Rangel, Symanto Research (Spain)
*Contact*
E-mail: iberlef(a)googlegroups.com
*Salud María Jiménez Zafra*
sjzafra(a)ujaen.es
Universidad de Jaén
Grupo de Investigación SINAI <http://sinai.ujaen.es/> | Departamento de
Informática
EPS Jaén, Edificio A3, Despacho 219
Campus Las Lagunillas s/n 23071 - Jaén | +34 953212992
Dear Madam/Sir,
Please forward this CFP to the members of corpora-list.
Regards
--Max Silberztein
NLDB 2023
The 28th International Conference on Natural Language & Information Systems
21-23 June 2023, University of Derby, United Kingdom.
https://www.derby.ac.uk/events/latest-events/nldb-2023/
About NLDB
The 28th International Conference on Natural Language & Information
Systems will be held at the University of Derby, United Kingdom and
will be a face to face event.
Since 1995, the NLDB conference has brought together researchers, industry
practitioners, and potential users interested in various applications
of Natural Language in the Database and Information Systems field. The
term "Information Systems" has to be considered in the broader sense
of Information and Communication Systems, including Big Data, Linked
Data and Social Networks.
The field of Natural Language Processing (NLP) has itself recently
experienced several exciting developments. In research, these
developments have been reflected in the emergence of neural language
models (Deep Learning, Word Embeddings, Transformers) and the
importance of aspects such as transparency, bias and fairness, a
(renewed) interest in various linguistic phenomena, such as in
discourse and argumentation mining, and in new problems such as the
detection of disinformation and hate speech in social media, as well
as of mental health disorders that increased during the recent pandemic.
Regarding applications, NLP systems have evolved to the point that
they now offer real-life, tangible benefits to enterprises. Many of
these NLP systems are now considered a de-facto offering in business
intelligence suites, such as algorithms for recommender systems and
opinion mining/sentiment analysis.
It is against this backdrop of recent innovations in NLP and its
applications in information systems that the 28th edition of the NLDB
conference takes place. We welcome research and industrial
contributions, describing novel, previously unpublished works on NLP
and its applications across a plethora of topics as described in the
Call for Papers.
Call for Papers
NLDB 2023 invites authors to submit papers for oral or poster
presentations on unpublished research that addresses theoretical
aspects, algorithms, applications, architectures for applied and
integrated NLP, resources for applied NLP, and other aspects of NLP,
as well as survey and discussion papers. This year's edition of NLDB
also introduces an Industry Track, to foster fruitful interaction
between the industry and the research community.
Topics of interest include but are not limited to:
• Social Media and Web Analytics: Opinion mining/sentiment analysis,
irony/sarcasm detection; detection of fake reviews and deceptive
language; detection of harmful information: fake news and hate speech;
sexism and misogyny; detection of mental health disorders;
identification of stereotypes and social biases; robust NLP methods
for sparse, ill-formed texts; recommendation systems.
• Deep Learning and eXplainable Artificial Intelligence (XAI): Deep
learning architectures, word embeddings, transparency,
interpretability, fairness, debiasing, ethics.
• Argumentation Mining and Applications: Automatic detection of
argumentation components and relationships; creation of resources (e.g.
annotated corpora, treebanks and parsers); Integration of NLP
techniques with formal, abstract argumentation structures;
Argumentation Mining from legal texts and scientific articles.
• Question Answering (QA): Natural language interfaces to databases,
QA using web data, multi-lingual QA, non-factoid QA (how/why/opinion
questions, lists), geographical QA, QA corpora and training sets, QA
over linked data (QALD).
• Corpus Analysis: multi-lingual, multi-cultural and multi-modal
corpora; machine translation, text analysis, text classification and
clustering; language identification; plagiarism detection; information
extraction: named entity, extraction of events, terms and semantic
relationships.
• Semantic Web, Open Linked Data, and Ontologies: Ontology learning
and alignment, ontology population, ontology evaluation, querying
ontologies and linked data, semantic tagging and classification,
ontology-driven NLP, ontology-driven systems integration.
• Natural Language in Conceptual Modelling: Analysis of natural
language descriptions, NLP in requirement engineering, terminological
ontologies, consistency checking, metadata creation and harvesting.
• Natural Language and Ubiquitous Computing: Pervasive computing,
embedded, robotic and mobile applications; conversational agents; NLP
techniques for Internet of Things (IoT); NLP techniques for ambient
intelligence.
• Big Data and Business Intelligence: Identity detection, semantic
data cleaning, summarisation, reporting, and data to text.
Important Dates:
Full paper submission: 14 March, 2023
Paper notification: 10 April, 2023
Camera-ready deadline: 24 April, 2023
Conference: 21-23 June 2023
Submission Guidelines
Authors should follow the LNCS format
(https://www.springer.com/gp/computer-science/lncs/conference-proceedings-gu…
) and submit their manuscripts in pdf via Easychair
(https://easychair.org/conferences/?conf=nldb2023 )
Submissions can be full papers (12 pages maximum including
references), short papers (8 pages including references) or papers for
a poster presentation or system demonstration (6 pages including
references). The programme committee may decide to accept some full
papers as short papers or poster papers.
The reviewing process of NLDB 2023 is double-blind, i.e., submissions
to the main conference and to the industry track must not contain
author names or other identifying information, such as funding
sources or acknowledgments, and must use the third person to refer to
work the authors have previously undertaken. System demonstration
papers may not be anonymous.
Committee
Conference Chairs:
Manning Warren, University of Derby, UK
Métais Elisabeth, Conservatoire des Arts et Métiers, Paris, France
Meziane Farid, University of Derby, UK
Program Chairs:
Reiff-Marganiec Stephan, University of Derby, UK
Sugumaran Vijayan, Oakland University, Rochester, USA
Programme Committee:
Abdi Asad, University of Derby, UK.
Akoka Jacky, CNAM & TEM, France
Anselma Luca, University of Turin, Italy
Bajaj Ahsaas, University of Massachusetts Amherst, USA
Balakrishna Mithun, Limba Corp, USA
Banerjee Somnath, University of Milano-Bicocca, Italy
Bensalem Imene, MISC Lab, Constantine 2 University, Algeria
Bosco Cristina, University of Turin, Italy
Braschler Martin, ZHAW School of Engineering, Switzerland
Buscaldi Davide, University Sorbonne Paris Nord, France
Cabrio Elena, Université Côte d’Azur, Inria, CNRS, I3S, France
Caselli Tommaso, Rijksuniversiteit Groningen, The Netherlands
Chiruzzo Luis, Universidad de la República, Uruguay
Cignarella Alessandra Teresa, University of Turin, Italy
Cimiano Philipp, Bielefeld University, Germany
Croce Danilo, University of Rome "Tor Vergata", Italy
Delgado Agustín, Universidad Nacional de Educación a Distancia, Spain
Dinu Liviu, University of Bucharest, Romania
Doucet Antoine, La Rochelle University, France
Fersini Elisabetta, University of Milano-Bicocca, Italy
Florio Komal, University of Turin, Italy
Fomichov Vladimir, Moscow Aviation Institute, Russia
Fornaciari Tommaso, Bocconi University Milan, Italy
Franco Marc, Symanto, Germany/Spain
Frasincar Flavius, Erasmus University Rotterdam, The Netherlands
Hacohen-Kerner Yaakov, Jerusalem College of Technology, Israel
Horacek Helmut, DFKI, Germany
Ienco Dino, IRSTEA, France
Iglesias Carlos A., Universidad Politécnica de Madrid, Spain
Ittoo Ashwin, HEC, Univ. of Liege, Belgium
Kapetanios Epaminondas, University of Hertfordshire, UK
Kedad Zoubida, UVSQ, France
Kop Christian, University of Klagenfurt, Austria
Koufakou Anna, Florida Gulf Coast University, USA
Laatar Rim, University of Sfax, Tunisia
Lai Mirko, University of Turin, Italy
Lopez Cédric, Emvista, Montpellier, France
Loukachevitch Natalia, Moscow State University, Russia
Makkar Aaisha, University of Derby, UK
Mandl Thomas, University of Hildesheim, Germany
Martínez Paloma, Universidad Carlos III de Madrid, Spain
Martínez Patricio, Universidad de Alicante, Spain
Martínez Unanue Raquel, Universidad Nacional de Educación a Distancia, Spain
Masmoudi Abir, LIUM, University of Le Mans, France
Mazzei Alessandro, University of Turin, Italy
Métais Elisabeth, Conservatoire des Arts et Métiers, France
Meziane Farid, University of Derby, UK
Mich Luisa, University of Trento, Italy
Mimouni Nada, Conservatoire des Arts et Métiers, France
Mitrović Jelena, University of Passau, Germany
Montalvo Soto, Universidad Rey Juan Carlos, Spain
Montes y Gómez Manuel, INAOE Puebla, Mexico
Monti Johanna, L'Orientale University of Naples, Italy
Muñoz Rafael, Universidad de Alicante, Spain
Okky-Ibrohim Muhammad, University of Turin, Italy
Passaro Lucia, University of Pisa, Italy
Patti Viviana, University of Turin, Italy
Picca Davide, Columbia University, USA
Plaza Laura, Universidad Nacional de Educación a Distancia, Spain
Rangel Francisco, Symanto, Germany/Spain
Reyes Antonio, Autonomous University of Baja California, Mexico
Roche Mathieu, Cirad, TETIS, France
Rosso Paolo, Universitat Politècnica de València, Spain
Saggion Horacio, Universitat Pompeu Fabra, Spain
Sakketou Flora, Philipps-Marburg University
Sanguinetti Manuela, Università degli Studi di Cagliari, Italy
Silberztein Max, Université de Franche-Comté, France
Sprugnoli Rachele, Catholic University of the Sacred Heart Milan, Italy
Sugumaran Vijayan, Oakland University, Rochester, USA
Tagarelli Andrea, University of Calabria, Italy
Taulé Mariona, Universitat de Barcelona, Spain
Teisseire Maguelonne, Irstea, TETIS, France
Tomás David, Universidad de Alicante, Spain
Tufis Dan, RACAI, Bucharest, Romania
Uban Ana Sabina, University of Bucharest, Romania
Ureña-López Alfonso, Universidad de Jaén, Spain
Vadera Sunil, University of Salford, UK
Valencia-Garcia Rafael, Universidad de Murcia, Spain
Villatoro Esaú, Idiap Research Institute, Switzerland
Wattiau Isabelle, ESSEC, France
Zaghouani Wajdi, Hamad Bin Khalifa University, Qatar
Zubiaga Arkaitz, Queen Mary University of London, UK
RESEARCH INTERNSHIP
*Quantifying diversity of language phenomena: Case
study of multiword expressions* (LIFAT, Blois, France)
We propose a master's internship position in Blois
(France). To apply, please send an email with a
CV, a transcript of bachelor and master grades,
and a few lines explaining your motivation to
Arnaud Soulet <arnaud.soulet(a)univ-tours.fr>, as well as
Agata Savary and Thomas Lavergne
<first.last(a)universite-paris-saclay.fr>.
Internship proposal description:
https://selexini.lis-lab.fr/jobs/2022/11/26/internship
Application deadline: *December 8*, 2022 (or until
filled)
------
MOTIVATION AND CONTEXT
*Diversity* of naturally occurring phenomena is a
vital heritage to be preserved in the current
progress- and optimization-driven globalization
era. Diversity has been quantified in many
domains: ecology, economy, information science,
etc. but less so in *Natural Language Processing*
(NLP). Recently, we have been addressing this
aspect with respect to a particular linguistic
phenomenon: that of *multiword expressions*
(MWEs).
MWEs, such as (FR) /casser sa pipe/ ‘to die’
(literally to break one’s pipe) or (FR) /sortir du
lot/ ‘to be better than others’ (literally to quit
the batch), are groups of words which exhibit
unexpected properties (Baldwin & Kim, 2010;
Constant et al. 2017). Most prominently, their
meaning does not straightforwardly derive from the
meanings of their components. Language resources
dedicated to MWEs include MWE lexicons and
MWE-annotated corpora (Savary et al., 2017), while
a major computational task is to *automatically
identify MWEs* in running text. The PARSEME
network has been addressing the MWE identification
task via a series of *shared tasks* on automatic
identification of verbal MWEs (Ramisch et al.
2020). Our recent work (Lion-Bouton, 2021;
Lion-Bouton et al. 2022) is explicitly dedicated
to *quantifying diversity in MWE language
resources and MWE identification systems*. We have
adapted measures of *variety* (number of types in
a system), *balance* (equity of items in various
types) and *disparity* (differences between
types), stemming notably from ecology and
information theory (Morales 2021).
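
As a rough illustration of two of these measures, the sketch below
computes variety as the number of distinct MWE types and balance as
normalised Shannon entropy (Pielou's evenness). The choice of evenness
index is an assumption made here for illustration; the measures actually
adapted in the cited work may differ. The function names and the toy
counts are hypothetical.

```python
import math
from collections import Counter


def variety(type_counts):
    """Number of distinct types observed at least once."""
    return sum(1 for c in type_counts.values() if c > 0)


def balance(type_counts):
    """Normalised Shannon entropy in [0, 1] (assumed evenness index).

    1.0 means all types are equally frequent. With fewer than two
    types the index is undefined; 0.0 is returned by convention here.
    """
    k = variety(type_counts)
    if k < 2:
        return 0.0
    total = sum(type_counts.values())
    entropy = 0.0
    for c in type_counts.values():
        if c > 0:
            p = c / total
            entropy -= p * math.log(p)
    return entropy / math.log(k)


# Toy annotated occurrences (illustrative data only).
occurrences = ["casser sa pipe", "casser sa pipe",
               "sortir du lot", "sortir du lot"]
counts = Counter(occurrences)
print(variety(counts), balance(counts))
```

Disparity is left out because it needs a distance between types, which
the text above does not specify.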
------
OBJECTIVE
The objective of this internship is to extend the
formalisation of diversity by drawing on
*Good-Turing frequency estimation*. Successfully
used to estimate the biomass, Good-Turing
frequency estimation is a statistical technique
for estimating the probability of encountering an
object of an unseen species, given a set of past
observations of objects from different species
(Good, 1953). Under this same principle, the idea
would be to *estimate the number of unseen MWEs
from the MWEs observed* in the corpus. Thus, it
will be possible to correct the diversity measures
to take the unseen MWEs into account and to
evaluate the possible selection bias of the corpus.
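
A minimal sketch of the idea, under stated assumptions: the Good-Turing
estimate of the probability that the next observation belongs to an
unseen type is N1 / N, where N1 is the number of types seen exactly once
and N the total number of observations. To turn this into a count of
unseen types, the sketch swaps in the bias-corrected Chao1 estimator, a
related technique from ecology that the proposal does not name. Function
names and the toy counts are hypothetical.

```python
from collections import Counter


def good_turing_missing_mass(type_counts):
    """Good-Turing estimate of the total probability of unseen types: N1 / N."""
    n = sum(type_counts.values())
    n1 = sum(1 for c in type_counts.values() if c == 1)
    return n1 / n if n else 0.0


def chao1_unseen_types(type_counts):
    """Bias-corrected Chao1 lower bound on the number of unseen types.

    Not Good-Turing itself: an assumed stand-in that uses the same
    singleton (f1) and doubleton (f2) statistics.
    """
    f1 = sum(1 for c in type_counts.values() if c == 1)
    f2 = sum(1 for c in type_counts.values() if c == 2)
    return f1 * (f1 - 1) / (2 * (f2 + 1))


# Toy MWE type frequencies (illustrative data only).
observed = Counter({"casser sa pipe": 3, "sortir du lot": 2,
                    "mwe_c": 1, "mwe_d": 1})
print(good_turing_missing_mass(observed))
print(chao1_unseen_types(observed))
```

A corrected variety could then be approximated as the observed number of
types plus the estimated number of unseen types.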