The Autogramm project (https://autogramm.github.io/en) invites applications for a 3-year PhD position starting between now and October 2023. The position is funded by the ANR (Agence Nationale de la Recherche), France.
Applications and questions can be sent to Sylvain Kahane (sylvain@kahane.fr).
Applications should include:
- Cover letter outlining interest in the position
- Names of two referees
- Curriculum vitae (CV) with publications (if applicable)
- Copy of MA degree
- University transcripts for at least the last two years
Today, databases are available for several dozen languages, including corpora annotated according to shared principles, in particular corpora annotated with interlinear glossed text (IGT; see for example the Pangloss collection, https://pangloss.cnrs.fr) or with the Universal Dependencies annotation scheme (UD, https://universaldependencies.org, and its SUD variant, https://surfacesyntacticud.github.io/). These databases allow typological studies and have several advantages:
- The results are based directly on primary data (corpora) rather than on secondary data (grammars written by linguists). (This is only partially true, since the results still depend on the choices made by a linguist in selecting the corpus and annotating it; nevertheless, these choices are visible and can be discussed.)
- The results are reproducible as long as the data are freely accessible.
- The nature of the data allows for quantitative results: rather than saying that a language is OV or VO, we can say that it has a given percentage of OV constructions, and we can observe directly in the data which factors determine the distribution between OV and VO (Levshina 2019, Gerdes et al. 2019, Futrell et al. 2015). (See also https://typometrics.elizia.net/#/.)
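To make the last point concrete, here is a minimal sketch of how a share of OV orders could be computed from a UD treebank in CoNLL-U format. It is only an illustration under stated assumptions: the two-sentence sample and the function name `ov_share` are invented for this example, objects are identified by the UD `obj` relation attached to a head tagged `VERB`, and treebanks following another scheme would need a different relation label.

```python
# Illustrative two-sentence sample in CoNLL-U format (invented, not from a real treebank).
SAMPLE_CONLLU = """\
# text = She reads books
1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_
2\treads\tread\tVERB\t_\t_\t0\troot\t_\t_
3\tbooks\tbook\tNOUN\t_\t_\t2\tobj\t_\t_

# text = Books she reads
1\tBooks\tbook\tNOUN\t_\t_\t3\tobj\t_\t_
2\tshe\tshe\tPRON\t_\t_\t3\tnsubj\t_\t_
3\treads\tread\tVERB\t_\t_\t0\troot\t_\t_
"""


def ov_share(conllu_text):
    """Return the proportion of objects that precede their verbal head.

    Returns None when no verb-object pair is found.
    """
    ov = vo = 0
    # Sentences are separated by blank lines in CoNLL-U.
    for block in conllu_text.strip().split("\n\n"):
        tokens = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue  # skip comment lines
            cols = line.split("\t")
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1),
            # as well as tokens without a numeric head.
            if not cols[0].isdigit() or not cols[6].isdigit():
                continue
            # Keep UPOS (column 4), HEAD (column 7) and DEPREL (column 8).
            tokens[int(cols[0])] = (cols[3], int(cols[6]), cols[7])
        for idx, (_, head, deprel) in tokens.items():
            is_object = deprel.split(":")[0] == "obj"  # ignore relation subtypes
            if is_object and head in tokens and tokens[head][0] == "VERB":
                if idx < head:
                    ov += 1  # object precedes the verb
                else:
                    vo += 1  # object follows the verb
    total = ov + vo
    return ov / total if total else None


if __name__ == "__main__":
    print(ov_share(SAMPLE_CONLLU))
```

The result is a gradient value per treebank rather than a categorical OV/VO label, which is the kind of feature the project proposes to collect.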
The goal of the thesis is to contribute to the development of quantitative typology by participating in the construction of a quantitative database covering a large number of typologically diverse languages and by focusing on the exploitation of such a dataset (Levshina 2022). The originality of the project lies in the fact that we work with quantitative data rather than with categorical features, as existing typological databases do (see in particular the World Atlas of Language Structures Online, https://wals.info/, which gives access to data on more than 2500 languages).
The following questions can be studied:
- How can we identify cross-linguistic regularities, such as quantitative entailment universals, from a set of corpora of world languages (see for example Gerdes et al. 2021)? How can we make inferences between quantitatively valued features?
- What quantitative information useful for a typological study can be extracted from a corpus? Which features require prior annotation of the data, and what is the nature of the annotations needed (see for example the case of IGT for morphosyntactic features and treebanks for word order)?
- How can we identify the typological signature of a language from an annotated corpus and determine what makes it special within a group of languages (see Bickel & Nichols 2002 and the AutoTyp project)?
- How can we take into account the imbalance of a database that is not representative of the distribution of languages in the world, but includes a higher proportion of languages from certain regions or families (Indo-European languages, Semitic languages, East Asian languages, etc.) to the detriment of other regions or families (Papua New Guinea, Oceania, Sub-Saharan Africa, Amerindian languages, aboriginal languages) (see Guzmán Naranjo & Becker 2022)?
- How can we address the commensurability of the categories used in the description of different languages? How can we check the consistency of the data? This question can be addressed by studying the consistency of treebanks of the same language or language family. How can we detect aberrations in some treebanks (categorization choices that do not conform to the universal scheme, e.g. assignment of the subject relation in ergative languages, use of the ADJ category in languages without real adjectives, etc.)?
- How can we visualize multidimensional quantitative data? Linguistic data pose many challenges.
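As a simple illustration of the imbalance question, the sketch below contrasts a plain average over languages with an average that first aggregates within each family. This is only one naive baseline, not the method of Guzmán Naranjo & Becker 2022, and all language names, family names, and values are invented for the example.

```python
from collections import defaultdict

# Invented toy values for illustration only:
# (language, family, share of OV orders)
TOY_SAMPLE = [
    ("lang_a", "family_1", 0.10),
    ("lang_b", "family_1", 0.20),
    ("lang_c", "family_1", 0.30),
    ("lang_d", "family_2", 0.90),
]


def plain_mean(rows):
    """Average over languages; over-represented families dominate."""
    return sum(value for _, _, value in rows) / len(rows)


def family_balanced_mean(rows):
    """Average the per-family means, so each family counts once."""
    by_family = defaultdict(list)
    for _, family, value in rows:
        by_family[family].append(value)
    family_means = [sum(vals) / len(vals) for vals in by_family.values()]
    return sum(family_means) / len(family_means)


if __name__ == "__main__":
    print(plain_mean(TOY_SAMPLE))
    print(family_balanced_mean(TOY_SAMPLE))
```

With this toy sample, the plain mean is pulled toward the family with three languages, while the family-balanced mean gives both families equal weight. More refined approaches would also need to account for areal effects and for the internal structure of families.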
The work will be conducted in collaboration with the members of the ANR Autogramm project (https://autogramm.github.io/), who are researchers in field linguistics, typology, formal linguistics, and natural language processing. With the help of engineers, it could lead to the creation of a typometric database accompanied by query and data visualization tools.
Bickel & Nichols 2021
Futrell et al. 2015
Gerdes et al. 2019
Gerdes et al. 2021
Guzmán Naranjo & Becker 2022
Levshina 2019
Levshina 2022