Hi everyone,
I have the pleasure of inviting you to my Ph.D. defense next Thursday at 8:30 AM Paris time. It will take place at the Inria Paris Research Centre (2 rue Simone Iff, 75012 Paris), in the Jacques-Louis Lions room. It will be followed by a reception (pot de thèse) around 11 AM.
The Ph.D. committee will be composed of:
• Yoav Goldberg, Bar Ilan University, Reviewer
• Lilja Øvrelid, University of Oslo, Reviewer
• Benjamin Piwowarski, CNRS, Examiner
• Benoît Sagot, Inria Paris, Advisor
• Natalie Schluter, IT University of Copenhagen, Examiner
• Djamé Seddah, Inria Paris, Supervisor
Abstract:
Deep learning techniques applied to Natural Language Processing (NLP) have led to impressive empirical progress in recent years. In essence, this progress is due to the development of better contextualized representations of textual data that can easily be used, or transferred, across a wide variety of NLP tasks. In their most recent and popular forms, these models are large-scale deep-learning language models, first pretrained on large quantities of raw data and then adapted to specific tasks. These language models are now essential to search engines, question-answering pipelines, machine translation systems, and more.
However, these models usually require substantial computing power and large amounts of raw textual data. This makes the inherent diversity and variability of natural language a pressing challenge in NLP. Indeed, collecting large datasets for low-resource languages is difficult and costly, and training models from scratch for every domain and language is impractical. Additionally, understanding the behavior of deep-learning-based models is inherently difficult, which makes developing more cost-effective techniques all the more challenging.
For these reasons, we focus on the following question: "How can we make language models better at handling the variability and diversity of natural languages?"
As a starting step, we explore the generalizability of language models by building one of the first large-scale replications of a BERT model for a non-English language. We analyze the critical training ingredients and show that such a model can achieve state-of-the-art performance with only a few gigabytes of diverse data.
Our results raise the question of using these language models on highly variable domains such as those found in user-generated content. Focusing on domain-gap reduction via lexical normalization, we show that this task can be addressed accurately with BERT-like models, but that it only partially helps downstream performance. Consequently, we turn to direct adaptation techniques based on what we refer to as representation transfer, and explore challenging settings such as the zero-shot setting, low-resource languages like Bambara and Uyghur, and highly variable, non-standardized code-mixed dialects such as a North-African Arabic dialect written in the Latin script. We show that multilingual language models can be adapted to and used efficiently for low-resource languages, even ones unseen during pretraining, and that the script is a critical component of this adaptation.
NLP technologies are becoming increasingly critical for accessing knowledge, connecting with our friends, and extracting meaningful information from large quantities of text. In this thesis, we present concrete and usable solutions for building accurate NLP systems for as many domains and languages as possible, at a reasonable cost.
For those who cannot make it in person, a video-conference link will be shared in the following Google Doc:
https://docs.google.com/document/d/1YW1sa_x6oTLiXzgXoc2vcJqxaiZmhyu4TUGgl_Dq...
Best, Benjamin Muller