[Corpora-List] PhD in NLP - PATRIMALP Materials, Pigments, Lights: the colors of Heritage – Natural Language Processing for cultural heritage

20 Jul 2022


Starting date: October 01, 2022 (flexible)
Application deadline: September 5th, 2022
Interviews (tentative): September 12th, 2022
Salary: 1 975 € gross/month (social security included)
Mission: research oriented (teaching possible but not mandatory)
Keywords: natural language processing, knowledge representation, 
cultural heritage, transfer learning, multilingualism
CONTEXT
The main challenge of the Patrimalp project is the development of an 
integrated and interdisciplinary Heritage Science, in order to ensure 
cultural Heritage sustainability, promotion and dissemination in 
contemporary society. The ambition is to produce the forms of 
intelligibility of a global and moving process which runs from the 
collection of the raw material and its transformation into a primitive 
object, through its different lives as a material (alterations, 
degradations, transformations...) and as a symbol (relegation, 
disinterest, oblivion or rebirth, exaltation...) throughout history, to 
its election as an object of historical and Heritage value and its 
“promotion” into a work of art. This research is applied to 
understanding how inks and pigments have been conceived over several 
centuries, how they have been used in artworks, and how handcrafting 
methods have evolved and been disseminated across centuries and countries.
To make this study possible, the project will gather a large collection 
of textual material made up of alchemical works and collections of 
natural or artificial objects collected between the 16th and 18th 
centuries. To better understand the choice of colors for these 
"wonders", we want to reconstruct the recipes for making colored 
material in their context of thought, whether technical or symbolic. These 
recipes will constitute a new body of research for literary scholars and a 
new data case study for building knowledge about color. This corpus 
indeed offers modes of representation inscribed in complex forms of 
writing and fiction whose modalities and frames of reference remain to 
be analyzed (accounts of technical, medical or physico-chemical 
experiments inscribed in fictional worlds or mythological, symbolic 
descriptions of artifacts, or materials collected in nature, mines). On 
the linguistic level, the inventory of this lexicon in different 
European and Eastern languages will lead to the formalization of the 
knowledge of these various skills across time and cultures. This 
corpus will thus provide complex data on the material and symbolic 
origin of the ingredients of color, on its use, its names and its 
physical or symbolic perception. These data represent a challenge for 
computer scientists, who will have to organize them, for the benefit of 
curators, chemists or physicists, into ontologies representing the state 
of knowledge from the point of view of scholars over the ages.
To systematically explore the corpus of these recipes, we will use NLP 
techniques to uncover the correlations between recipes, physical and 
chemical composition of objects and symbolic references. The final 
objective is to build a knowledge base (objects, components of objects, 
materials, colors, know-how, reference framework), each part of which 
can reference a specific ontology (ontology of pigments, materials, 
colors...), so that researchers can trace, from this specific corpus, 
the trajectory from the writing of color to its technical and artisanal 
practice.
PHD OBJECTIVES
The PhD project will focus on segmenting, extracting and representing 
recipes from a corpus of alchemical works from the 16th to 18th 
centuries to make them accessible to researchers in the humanities. This 
requires the candidate to:
·    identify which excerpts of the text belong to a recipe;
·    supervise an annotation campaign to build an analysis and training 
corpus;
·    build NLP tools to automatically extract the list of elements (raw 
material, tools, quantity, units) and actions (verb, adverb, adjective) 
that make up the recipes;
·    analyze the dependencies between the elements of a recipe (rules);
·    represent these rules in a formal knowledge representation.
The results of this processing will support:
·    the documentation of this unique set of texts, by inserting the 
extracted elements into the document metadata to ease retrieval;
·    the building of a knowledge base of alchemical recipes.
This PhD will need to address several challenges. One of them is to be 
able to process text composed of multiple non-modern languages (French, 
German, English, Latin, Greek) [Coavoux2022, Grobol2022]. One approach 
will be to study how large multilingual pre-trained models 
[Delvin2019, Xue2020] can be leveraged and adapted for the task and how 
disparate collections of corpora of ancient texts can be used to 
fine-tune them. Another challenge will be the paucity of data for the 
downstream tasks (segmentation, parsing, Natural Language Understanding 
[Desot2022]). For this, we will need to identify other related corpora 
(e.g. cooking) to address the problem with multitask learning (such as 
NER and NLU) and transfer learning.
SKILLS
·    Master 2 in Natural Language Processing, computer science or data 
science.
·    Good command of Python programming and deep learning frameworks.
·    Previous experience in text classification, parsing, processing of 
several languages or text retrieval would be a plus.
·    Very good communication skills in English and good command of French.
SCIENTIFIC ENVIRONMENT
The thesis will be conducted within the STEAMER and GETALP teams of the 
LIG laboratory 
(http://steamer.imag.fr/ and https://lig-getalp.imag.fr/). The GETALP 
team has strong expertise and track record in Natural Language 
Processing; the STEAMER team has strong expertise in Knowledge 
representation and reasoning. The recruited person will be welcomed 
within the teams, which offer a stimulating, multinational and pleasant 
working environment. The means to carry out the PhD will be provided 
both in terms of missions in France and abroad and in terms of equipment 
(personal computer, access to the LIG GPU servers).
The PhD candidate will collaborate with the partners involved in the 
PATRIMALP project, in particular with Laurence Rivière from the LUHCIE 
lab (Laboratoire Universitaire Histoire Cultures Italie Europe) and 
Véronique Adam from the LITT&ARTS lab (Littératures et Arts).
INSTRUCTIONS FOR APPLYING
Applications must contain a CV, a letter/message of motivation and 
master's grade transcripts; applicants should also be ready to provide 
letter(s) of recommendation. Applications should be addressed to 
Danielle Ziebelin (Danielle.Ziebelin@univ-grenoble-alpes.fr), François 
Portet (francois.Portet@imag.fr) and Maximin Coavoux 
(Maximin.Coavoux@univ-grenoble-alpes.fr).
REFERENCES
[Coavoux2022]  Maximin Coavoux, Corinne Denoyelle, Olivier Kraif, Julie 
Sorba. Phraséologie du roman médiéval en prose. Diachro X – le français 
en diachronie, Sorbonne Université, May 2022, Paris, France
[Delvin2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina 
Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers 
for language understanding. In Proceedings of NAACL.
[Desot2022] Desot, T., Portet, F., & Vacher, M. (2022). End-to-End 
Spoken Language Understanding: Performance analyses of a voice command 
task in a low resource setting. Computer Speech & Language, 75, 101369.
[Grobol2022] Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez, Benoît 
Sagot, Laurent Romary and Benoit Crabbé. BERTrade: Using Contextual 
Embeddings to Parse Old French. 13th International Conference on 
Language Resources and Evaluation (LREC 2022), May 2022, Marseille, France.
[Xue2020] Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., 
Siddhant, A.,  ... & Raffel, C. (2020). mT5: A massively multilingual 
pre-trained  text-to-text transformer. arXiv preprint arXiv:2010.11934.
