[Corpora-List] PhD in ML/NLP – Efficient, Fair, robust and knowledge informed self-supervised learning for speech processing

22 Aug 2022

PhD in ML/NLP – Efficient, Fair, robust and knowledge informed 
self-supervised learning for speech processing
Starting date: November 1st, 2022 (flexible)
Application deadline: September 5th, 2022
Interviews (tentative): September 19th, 2022
Salary: ~2000€ gross/month (social security included)
Mission: research oriented (teaching possible but not mandatory)
*Keywords:* speech processing, natural language processing, 
self-supervised learning, knowledge-informed learning, robustness, 
fairness
*CONTEXT*
The ANR project E-SSL (Efficient Self-Supervised Learning for 
Inclusive and Innovative Speech Technologies) will start on November 
1st, 2022. Self-supervised learning (SSL) has recently emerged as one 
of the most promising artificial intelligence (AI) methods, as it 
is now feasible to take advantage of the colossal amounts of 
existing unlabeled data to significantly improve performance on 
various speech processing tasks.
*PROJECT OBJECTIVES*
Recent SSL models for speech such as HuBERT or wav2vec 2.0 have shown 
an impressive impact on downstream task performance. This is mainly 
due to their ability to benefit from a large amount of data, at the 
cost of a tremendous carbon footprint, rather than to improvements in 
the efficiency of the learning. Another issue with SSL models is 
their unpredictable results once applied to realistic scenarios, which 
expose their lack of robustness. Furthermore, as for any pre-trained 
model deployed in society, it is important to be able to measure the 
bias of such models, since they can amplify social unfairness.
The goals of this PhD position are threefold:

to design new evaluation metrics for SSL speech models;

to develop knowledge-driven SSL algorithms;

to propose methods for learning robust and unbiased representations.

SSL models are evaluated with downstream task-dependent metrics, e.g. 
word error rate for speech recognition. This couples the evaluation of 
the universality of SSL representations to a potentially biased and 
costly fine-tuning that also hides the efficiency information related 
to the pre-training cost. In practice, we will seek to measure 
training efficiency as the ratio between the resources consumed (data, 
computation and memory) and the gain observed in terms of 
performance on a metric of interest, whether downstream-dependent or 
not. The first step will be to document standard markers that can be 
used to assess these quantities robustly at training 
time. Potential candidates are, for instance, floating point 
operations for computational intensity, number of neural parameters 
coupled with precision for storage, online measurement of memory 
consumption for training and cumulative input sequence length for data.
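As an illustration only, the candidate markers listed above could be gathered and related to a performance gain roughly as follows. This is a minimal sketch and not the project's metric: the names TrainingCost and gain_per_unit_cost, the choice of units (PFLOPs, GiB, millions of frames) and the reading of "efficiency" as gain per unit of cost are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TrainingCost:
    """Hypothetical container for the candidate markers (names are assumptions)."""
    flops: float              # floating point operations spent in training
    num_parameters: int       # number of neural parameters
    bytes_per_parameter: int  # precision, e.g. 4 for float32, 2 for float16
    peak_memory_bytes: int    # online measurement of memory consumption
    input_frames: int         # cumulative input sequence length

    @property
    def storage_bytes(self) -> int:
        # Storage marker: parameter count coupled with precision.
        return self.num_parameters * self.bytes_per_parameter


def gain_per_unit_cost(metric_before, metric_after, cost, lower_is_better=True):
    """Relate a gain on a metric of interest to each cost marker.

    For error-like metrics (e.g. word error rate) a decrease is a gain;
    for accuracy-like metrics set lower_is_better=False.
    """
    if lower_is_better:
        gain = metric_before - metric_after
    else:
        gain = metric_after - metric_before
    return {
        "gain": gain,
        "gain_per_pflop": gain / (cost.flops / 1e15),
        "gain_per_gib_memory": gain / (cost.peak_memory_bytes / 2**30),
        "gain_per_million_frames": gain / (cost.input_frames / 1e6),
    }
```

Reporting the three ratios side by side, rather than collapsing them into a single score, keeps data, computation and memory costs separately visible; how to weight them against each other is left open here.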
Most state-of-the-art SSL models for speech rely on masked prediction, 
e.g. HuBERT and WavLM, or contrastive losses, e.g. wav2vec 2.0. Such 
prevalence in the literature is mostly linked to the size, amount of 
data and computational resources invested by the companies producing 
these models. In fact, vanilla masking approaches and contrastive 
losses may be identified as uninformed solutions, as they do not 
benefit from in-domain expertise. For instance, it has been 
demonstrated that blindly masking frames in the input signal, as in 
HuBERT and WavLM, results in much worse downstream performance than 
applying unsupervised phonetic boundaries [Yue2021] to generate 
informed masks. Recently, some studies have demonstrated the 
superiority of an informed multitask learning strategy, which carefully 
selects self-supervised pretext tasks with respect to a set of 
downstream tasks, over the vanilla wav2vec 2.0 contrastive learning 
loss [Zaiem2022]. In this PhD project, our objectives are: 1. to 
continue developing knowledge-driven SSL algorithms reaching higher 
efficiency ratios and better results in terms of convergence, data 
consumption and downstream performance; and 2. to scale these novel 
approaches to a point enabling comparison with current 
state-of-the-art systems, thereby motivating a paradigm change in SSL 
for the wider speech community.
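To make the contrast between blind and informed masking concrete, here is a toy sketch. It is not the method of [Yue2021] or of any cited model: the function names, the list-of-booleans mask format and the assumption that segment boundaries are already available as frame indices (starting with 0) are all hypothetical.

```python
import random


def random_span_mask(num_frames, span, num_spans, rng):
    """Blind masking: mask fixed-length spans at random positions.

    Spans may overlap, so fewer than span * num_spans frames can be masked.
    """
    mask = [False] * num_frames
    for _ in range(num_spans):
        start = rng.randrange(0, num_frames - span + 1)
        for t in range(start, start + span):
            mask[t] = True
    return mask


def boundary_informed_mask(num_frames, boundaries, num_segments, rng):
    """Informed masking: mask whole segments delimited by given boundaries.

    `boundaries` is a sorted list of segment start frames, beginning with 0
    (assumed to come from some unsupervised segmentation step).
    """
    edges = list(boundaries) + [num_frames]
    segments = list(zip(edges[:-1], edges[1:]))
    mask = [False] * num_frames
    for start, end in rng.sample(segments, num_segments):
        for t in range(start, end):
            mask[t] = True
    return mask


rng = random.Random(0)  # fixed seed so the example is reproducible
blind = random_span_mask(num_frames=10, span=3, num_spans=2, rng=rng)
informed = boundary_informed_mask(10, boundaries=[0, 3, 7], num_segments=1, rng=rng)
```

In the informed variant a masked region always coincides with one complete segment, whereas a blind span can start and stop in the middle of a unit.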
Despite remarkable performance on academic benchmarks, SSL-powered 
technologies, e.g. speech and speaker recognition, speech synthesis 
and many others, may exhibit highly unpredictable results once applied 
to realistic scenarios. This can translate into a global accuracy drop 
due to a lack of robustness to adversarial acoustic conditions, or 
biased and discriminatory behaviors with respect to different pools of 
end users. Documenting and facilitating the control of such aspects 
prior to the deployment of SSL models in real life is necessary 
for the industrial market. To address such aspects, within the 
project, we will create novel robustness regularization and debiasing 
techniques along two axes: 1. debiasing and regularizing speech 
representations at the SSL level; 2. debiasing and regularizing 
downstream-adapted models (e.g. using a pre-trained model).
To ensure the creation of fair and robust SSL pre-trained models, we 
propose to act at both the optimization and data levels, following 
some of our previous work on adversarial protected attribute 
disentanglement and the NLP literature on data sampling and 
augmentation [Noé2021]. Here, we wish to extend this technique to more 
complex SSL architectures and more realistic conditions by increasing 
the disentanglement complexity (the sex attribute studied in 
[Noé2021] is particularly discriminatory). Then, to benefit from 
the expert knowledge induced by the scope of the task of interest, we 
will build on the recently introduced task-dependent counterfactual 
equal odds criteria [Sari2021] to minimize the downstream performance 
gap observed between individuals who differ in certain protected 
attributes while maximizing the overall accuracy. Following this 
multi-objective optimization scheme, we will then inject further 
identified constraints, as inspired by previous NLP work [Zhao2017]. 
Intuitively, constraints are injected so that the predictions are 
calibrated towards a desired, i.e. unbiased, distribution.
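As a simple illustration of the downstream performance gap mentioned above, the sketch below computes a word error rate per group and the spread between the best and worst group. It is only a descriptive measurement under assumed inputs, not the counterfactual criterion of [Sari2021]; the function names and the per-utterance input format are hypothetical.

```python
from collections import defaultdict


def group_error_rates(errors, lengths, groups):
    """Word error rate per group.

    errors[i]  : number of word errors in utterance i
    lengths[i] : number of reference words in utterance i
    groups[i]  : protected-attribute value of the speaker of utterance i
    """
    err = defaultdict(int)
    ref = defaultdict(int)
    for e, n, g in zip(errors, lengths, groups):
        err[g] += e
        ref[g] += n
    return {g: err[g] / ref[g] for g in ref}


def performance_gap(rates):
    """Gap between the worst and best group error rates (0 means parity)."""
    return max(rates.values()) - min(rates.values())


rates = group_error_rates(
    errors=[1, 1, 3, 5],
    lengths=[10, 10, 10, 10],
    groups=["a", "a", "b", "b"],
)
gap = performance_gap(rates)  # group "a": 2/20, group "b": 8/20
```

A debiasing method would aim to shrink this gap without raising the overall error rate, which is why the two quantities are best reported together.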
*SKILLS*

Master 2 in Natural Language Processing, Speech Processing,
 computer science or data science.

Good command of Python programming and deep learning frameworks.

Previous experience in Self-Supervised Learning, acoustic modeling or ASR
 would be a plus

Very good communication skills in English

Good command of French would be a plus but is not mandatory

*SCIENTIFIC ENVIRONMENT*
The thesis will be conducted within the GETALP team of the LIG 
laboratory (https://lig-getalp.imag.fr/) and the LIA laboratory 
(https://lia.univ-avignon.fr/). The GETALP team and the LIA have 
strong expertise and a track record in Natural Language Processing and 
speech processing. The recruited person will be welcomed within the 
teams which offer a stimulating, multinational and pleasant working 
environment.
The means to carry out the PhD will be provided, both in terms of 
missions in France and abroad and in terms of equipment. The candidate 
will have access to the GPU clusters of both the LIG and the LIA. 
Furthermore, access to the national supercomputer Jean Zay will make 
it possible to run large-scale experiments.
The PhD position will be co-supervised by Mickael Rouvier (LIA, 
Avignon) and Benjamin Lecouteux and François Portet (Université 
Grenoble Alpes). Joint meetings are planned on a regular basis and the 
student is expected to spend time in both places. Moreover, the PhD 
student will collaborate with several team members involved in the 
project, in particular the two other PhD candidates who will be 
recruited and the partners from LIA, LIG and Dauphine Université PSL, 
Paris. Furthermore, the project will involve one of the founders of 
SpeechBrain, Titouan Parcollet, with whom the candidate will interact 
closely.
*INSTRUCTIONS FOR APPLYING*
Applications must contain: CV + letter/message of motivation + Master's 
transcripts (grades); applicants should be ready to provide letter(s) 
of recommendation. Applications should be addressed to Mickael Rouvier 
(mickael.rouvier@univ-avignon.fr), Benjamin Lecouteux 
(benjamin.lecouteux@univ-grenoble-alpes.fr) and François Portet 
(francois.Portet@imag.fr). 
We celebrate diversity and are committed to creating an inclusive 
environment for all employees.
*REFERENCES:*
[Noé2021] Noé, P.-G., Mohammadamini, M., Matrouf, D., Parcollet, T., 
Nautsch, A. & Bonastre, J.-F. Adversarial Disentanglement of Speaker 
Representation for Attribute-Driven Privacy Preservation in Proc. 
Interspeech 2021 (2021), 1902–1906.
[Sari2021] Sarı, L., Hasegawa-Johnson, M. & Yoo, C. D. 
Counterfactually Fair Automatic Speech Recognition. IEEE/ACM 
Transactions on Audio, Speech, and Language Processing 29, 3515–3525 
(2021).
[Yue2021] Yue, X. & Li, H. Phonetically Motivated Self-Supervised 
Speech Representation Learning in Proc. Interspeech 2021 (2021), 746–750.
[Zaiem2022] Zaiem, S., Parcollet, T. & Essid, S. Pretext Tasks 
Selection for Multitask Self-Supervised Speech Representation in AAAI, 
The 2nd Workshop on Self-supervised Learning for Audio and Speech 
Processing, 2023 (2022).
[Zhao2017] Zhao, J., Wang, T., Yatskar, M., Ordonez, V. & Chang, K.-W. 
Men Also Like Shopping: Reducing Gender Bias Amplification using 
Corpus-level Constraints in Proceedings of the 2017 Conference on 
Empirical Methods in Natural Language Processing (2017), 2979–2989.
-- 
François PORTET
Professor - Univ Grenoble Alpes
Laboratoire d'Informatique de Grenoble - Équipe GETALP
Bâtiment IMAG - Office 333
700 avenue Centrale
Domaine Universitaire - 38401 St Martin d'Hères
FRANCE

Phone:  +33 (0)4 57 42 15 44
Email: francois.portet@imag.fr
www: http://membres-liglab.imag.fr/portet/
