RESEARCH INTERNSHIP
*Quantifying diversity of language phenomena: Case study of multiword expressions* (LIFAT, Blois, France)
We propose a master's internship position in Blois (France). To apply, please send an email with a CV, a transcript of your bachelor's and master's grades, and a few lines explaining your motivation to Arnaud Soulet (arnaud.soulet@univ-tours.fr), as well as to Agata Savary and Thomas Lavergne (first.last@universite-paris-saclay.fr).
Internship proposal description: https://selexini.lis-lab.fr/jobs/2022/11/26/internship
Application deadline: *December 8*, 2022 (or until filled)
------
MOTIVATION AND CONTEXT
*Diversity* of naturally occurring phenomena is a vital heritage to be preserved in the current progress- and optimization-driven globalization era. Diversity has been quantified in many domains (ecology, economics, information science, etc.) but less so in *Natural Language Processing* (NLP). Recently, we have been addressing this aspect with respect to a particular linguistic phenomenon: *multiword expressions* (MWEs).
MWEs, such as (FR) /casser sa pipe/ ‘to die’ (literally ‘to break one’s pipe’) or (FR) /sortir du lot/ ‘to be better than others’ (literally ‘to quit the batch’), are groups of words which exhibit unexpected properties (Baldwin & Kim, 2010; Constant et al., 2017). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. Language resources dedicated to MWEs include MWE lexicons and MWE-annotated corpora (Savary et al., 2017), while a major computational task is to *automatically identify MWEs* in running text. The PARSEME network has been addressing the MWE identification task via a series of *shared tasks* on automatic identification of verbal MWEs (Ramisch et al., 2020). Our recent work (Lion-Bouton, 2021; Lion-Bouton et al., 2022) is explicitly dedicated to *quantifying diversity in MWE language resources and MWE identification systems*. We have adapted measures of *variety* (number of types in a system), *balance* (how evenly items are distributed across types) and *disparity* (differences between types), stemming notably from ecology and information theory (Morales, 2021).
------
OBJECTIVE
The objective of this internship is to extend the formalisation of diversity using *Good-Turing frequency estimation*. Good-Turing frequency estimation, which has been used successfully to estimate biomass, is a statistical technique for estimating the probability of encountering an object of an unseen species, given a set of past observations of objects from different species (Good, 1953). Following the same principle, the idea would be to *estimate the number of unseen MWEs from the MWEs observed* in the corpus. This would make it possible to correct the diversity measures to take unseen MWEs into account and to evaluate the possible selection bias of the corpus.
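As a rough illustration of the principle, the simplest Good-Turing estimate gives the total probability of unseen types as N1 / N, where N1 is the number of types observed exactly once and N is the total number of observations. The minimal Python sketch below applies it to MWE occurrences. The function name and the toy data are hypothetical, and this is not necessarily the formalisation the internship would adopt.

```python
from collections import Counter


def good_turing_missing_mass(observations):
    """Estimate the probability that the next observation is of an unseen type.

    Uses the basic Good-Turing estimate N1 / N, where N1 is the number of
    types seen exactly once and N is the total number of observations.
    """
    counts = Counter(observations)
    n_tokens = sum(counts.values())
    if n_tokens == 0:
        raise ValueError("need at least one observation")
    n_singletons = sum(1 for c in counts.values() if c == 1)
    return n_singletons / n_tokens


# Toy data (hypothetical): one entry per annotated MWE occurrence,
# identified by its MWE type.
toy_mwes = [
    "casser sa pipe",
    "sortir du lot", "sortir du lot",
    "prendre la parole", "prendre la parole", "prendre la parole",
    "faire face",
]

# Two types occur once among seven occurrences, so the estimate is 2/7.
p_unseen = good_turing_missing_mass(toy_mwes)
print(f"Estimated probability of an unseen MWE type: {p_unseen:.3f}")
```

This yields a probability mass for unseen types, not a count of them. Turning it into an estimated number of unseen MWEs requires further assumptions, which is part of what the internship would investigate.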