PhD in ML/NLP – Efficient, fair, robust and knowledge-informed self-supervised learning for speech processing
Starting date: November 1st, 2022 (flexible)
Application deadline: September 5th, 2022
Interviews (tentative): September 19th, 2022
Salary: ~2000€ gross/month (social security included)
Mission: research oriented (teaching possible but not mandatory)
*Keywords:* speech processing, natural language processing, self-supervised learning, knowledge-informed learning, robustness, fairness
*CONTEXT*
The ANR project E-SSL (Efficient Self-Supervised Learning for Inclusive and Innovative Speech Technologies) will start on November 1st 2022. Self-supervised learning (SSL) has recently emerged as one of the most promising artificial intelligence (AI) methods, as it is now feasible to take advantage of the colossal amounts of existing unlabeled data to significantly improve the performance of various speech processing tasks.
*PROJECT OBJECTIVES*
Recent SSL models for speech such as HuBERT or wav2vec 2.0 have shown an impressive impact on downstream task performance. This is mainly due to their ability to benefit from a large amount of data, at the cost of a tremendous carbon footprint, rather than from improved learning efficiency. Another concern with SSL models is their unpredictable behaviour once applied to realistic scenarios, which exposes their lack of robustness. Furthermore, as for any pre-trained models applied in society, it is important to be able to measure the bias of such models, since they can amplify social unfairness.
The goals of this PhD position are threefold:
- to design new evaluation metrics for speech SSL models;
- to develop knowledge-driven SSL algorithms;
- to propose methods for learning robust and unbiased representations.
SSL models are evaluated with downstream task-dependent metrics, e.g., word error rate for speech recognition. This couples the evaluation of the universality of SSL representations to a potentially biased and costly fine-tuning that also hides the efficiency information related to the pre-training cost. In practice, we will seek to measure training efficiency as the ratio between the amount of data, computation and memory needed to observe a certain gain in performance on a metric of interest, i.e., downstream-dependent or not. The first step will be to document standard markers that can be used to assess these quantities robustly at training time. Potential candidates are, for instance, floating point operations for computational intensity, number of neural parameters coupled with precision for storage, online measurement of memory consumption for training, and cumulative input sequence length for data.
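As a purely illustrative sketch of how such markers could be combined into efficiency ratios, the snippet below logs hypothetical cost figures for one pre-training run and reports the gain on a metric of interest per unit of each resource; the field names and numbers are invented for the example and do not describe the project's actual protocol.

```python
# Illustrative only: hypothetical efficiency markers for one pre-training run.
from dataclasses import dataclass

@dataclass
class PretrainingCost:
    flops: float               # estimated floating point operations (compute)
    n_params: int              # number of neural parameters
    bytes_per_param: int       # numerical precision, for storage cost
    peak_memory_gb: float      # measured peak memory during training
    cum_input_seconds: float   # cumulative input sequence length (data)

def efficiency_ratios(metric_gain: float, cost: PretrainingCost) -> dict:
    """Gain on a metric of interest (downstream-dependent or not) per unit of cost."""
    return {
        "gain_per_exaflop": metric_gain / (cost.flops / 1e18),
        "gain_per_gb_storage": metric_gain / (cost.n_params * cost.bytes_per_param / 1e9),
        "gain_per_gb_peak_memory": metric_gain / cost.peak_memory_gb,
        "gain_per_khour_audio": metric_gain / (cost.cum_input_seconds / 3600 / 1000),
    }

# Example: +2.5 absolute points on some evaluation metric for an invented run.
cost = PretrainingCost(flops=3e19, n_params=95_000_000, bytes_per_param=4,
                       peak_memory_gb=320.0, cum_input_seconds=960 * 3600)
print(efficiency_ratios(2.5, cost))
```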
Most state-of-the-art SSL models for speech rely on masked prediction (e.g., HuBERT and WavLM) or contrastive losses (e.g., wav2vec 2.0). Such prevalence in the literature is mostly linked to the size, amount of data and computational resources injected by the companies producing these models. In fact, vanilla masking approaches and contrastive losses may be identified as uninformed solutions, as they do not benefit from in-domain expertise. For instance, it has been demonstrated that blindly masking frames in the input signal (as in HuBERT and WavLM) results in much worse downstream performance than using unsupervised phonetic boundaries [Yue2021] to generate informed masks. Recently, some studies have demonstrated the superiority of an informed multitask learning strategy, carefully selecting self-supervised pretext tasks with respect to a set of downstream tasks, over the vanilla wav2vec 2.0 contrastive learning loss [Zaiem2022]. In this PhD project, our objective is: 1. to continue developing knowledge-driven SSL algorithms that reach higher efficiency ratios and better results in terms of convergence, data consumption and downstream performance; and 2. to scale these novel approaches to a point enabling comparison with current state-of-the-art systems, thereby motivating a paradigm change in SSL for the wider speech community.
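To make the contrast between uninformed and informed masking concrete, here is a toy sketch that masks whole segments delimited by phone-like boundaries instead of independent frames; it assumes boundaries produced by some unsupervised phone segmenter (not shown) and is not the actual HuBERT/WavLM or [Yue2021] implementation.

```python
# Toy comparison of uninformed (random frame) vs. informed (segment-level) masking.
import numpy as np

def random_frame_mask(n_frames: int, mask_prob: float = 0.15, rng=None) -> np.ndarray:
    """Vanilla uninformed masking: each frame is masked independently."""
    rng = rng or np.random.default_rng()
    return rng.random(n_frames) < mask_prob

def phone_informed_mask(n_frames: int, boundaries: list, mask_prob: float = 0.15,
                        rng=None) -> np.ndarray:
    """Mask entire phone-like segments given by frame-index boundaries."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(n_frames, dtype=bool)
    edges = [0] + sorted(boundaries) + [n_frames]
    for start, end in zip(edges[:-1], edges[1:]):
        if rng.random() < mask_prob:
            mask[start:end] = True   # mask the whole segment at once
    return mask

# Example: 100 frames with hypothetical boundaries from an unsupervised segmenter.
print(phone_informed_mask(100, boundaries=[12, 30, 47, 66, 88]).sum(), "frames masked")
```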
Despite remarkable performance on academic benchmarks, SSL-powered technologies (e.g., speech and speaker recognition, speech synthesis and many others) may exhibit highly unpredictable results once applied to realistic scenarios. This can translate into a global accuracy drop due to a lack of robustness to adverse acoustic conditions, or into biased and discriminatory behaviors with respect to different pools of end users. Documenting and facilitating the control of such aspects prior to the deployment of SSL models into real life is necessary for the industrial market. To address these aspects, within the project, we will create novel robustness regularization and debiasing techniques along two axes: 1. debiasing and regularizing speech representations at the SSL level; 2. debiasing and regularizing downstream-adapted models (e.g., using a pre-trained model).
To ensure the creation of fair and robust SSL pre-trained models, we propose to act both at the optimization and data levels, following some of our previous work on adversarial protected-attribute disentanglement and the NLP literature on data sampling and augmentation [Noé2021]. Here, we wish to extend this technique to more complex SSL architectures and more realistic conditions by increasing the disentanglement complexity (the sex attribute studied in [Noé2021] is particularly easy to discriminate). Then, to benefit from the expert knowledge induced by the scope of the task of interest, we will build on the recently introduced task-dependent counterfactual equal odds criteria [Sari2021] to minimize the downstream performance gap observed between individuals with different protected attributes and to maximize the overall accuracy. Following this multi-objective optimization scheme, we will then inject further identified constraints, as inspired by previous NLP work [Zhao2017]. Intuitively, constraints are injected so that the predictions are calibrated towards a desired, i.e. unbiased, distribution.
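As one simplified instance of adversarial protected-attribute disentanglement, the sketch below uses a gradient-reversal layer in PyTorch; the module and variable names are hypothetical, and this is an illustration in the spirit of [Noé2021], not its implementation.

```python
# Sketch: an adversary predicts the protected attribute from the representation,
# while the reversed gradient pushes the encoder to discard that information.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None

class AttributeAdversary(nn.Module):
    def __init__(self, dim: int, n_classes: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_classes))

    def forward(self, representation: torch.Tensor) -> torch.Tensor:
        return self.head(GradReverse.apply(representation, self.lambd))

# Hypothetical usage during pre-training:
#   total_loss = ssl_loss + nn.functional.cross_entropy(adversary(pooled_repr), attribute_labels)
```

The scalar `lambd` controls the trade-off between the SSL objective and how aggressively attribute information is removed from the shared representation.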
*SKILLS*
* Master 2 in Natural Language Processing, Speech Processing, computer science or data science.
* Good command of Python programming and deep learning frameworks.
* Previous experience in self-supervised learning, acoustic modeling or ASR would be a plus.
* Very good communication skills in English.
* Good command of French would be a plus but is not mandatory.
*SCIENTIFIC ENVIRONMENT*
The thesis will be conducted within the GETALP team of the LIG laboratory (https://lig-getalp.imag.fr/) and the LIA laboratory (https://lia.univ-avignon.fr/). The GETALP team and the LIA have strong expertise and a solid track record in Natural Language Processing and speech processing. The recruited person will be welcomed within the teams, which offer a stimulating, multinational and pleasant working environment.
The means to carry out the PhD will be provided, both in terms of missions in France and abroad and in terms of equipment. The candidate will have access to the GPU clusters of both the LIG and LIA. Furthermore, access to the national supercomputer Jean-Zay will make it possible to run large-scale experiments.
The PhD position will be co-supervised by Mickael Rouvier (LIA, Avignon) and Benjamin Lecouteux and François Portet (Université Grenoble Alpes). Joint meetings are planned on a regular basis and the student is expected to spend time in both places. Moreover, the PhD student will collaborate with several team members involved in the project, in particular the two other PhD candidates who will be recruited and the partners from LIA, LIG and Dauphine Université PSL, Paris. Furthermore, the project will involve one of the founders of SpeechBrain, Titouan Parcollet, with whom the candidate will interact closely.
*INSTRUCTIONS FOR APPLYING*
Applications must contain: CV + letter/message of motivation + master's transcripts + be ready to provide letter(s) of recommendation; and be addressed to Mickael Rouvier (mickael.rouvier@univ-avignon.fr), Benjamin Lecouteux (benjamin.lecouteux@univ-grenoble-alpes.fr) and François Portet (francois.Portet@imag.fr). We celebrate diversity and are committed to creating an inclusive environment for all employees.
*REFERENCES:*
[Noé2021] Noé, P.-G., Mohammadamini, M., Matrouf, D., Parcollet, T., Nautsch, A. & Bonastre, J.-F. Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation in Proc. Interspeech 2021 (2021), 1902–1906.
[Sari2021] Sarı, L., Hasegawa-Johnson, M. & Yoo, C. D. Counterfactually Fair Automatic Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3515–3525 (2021).
[Yue2021] Yue, X. & Li, H. Phonetically Motivated Self-Supervised Speech Representation Learning in Proc. Interspeech 2021 (2021), 746–750.
[Zaiem2022] Zaiem, S., Parcollet, T. & Essid, S. Pretext Tasks Selection for Multitask Self-Supervised Speech Representation in AAAI, The 2nd Workshop on Self-supervised Learning for Audio and Speech Processing, 2023 (2022).
[Zhao2017] Zhao, J., Wang, T., Yatskar, M., Ordonez, V. & Chang, K.-W. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017), 2979–2989.
PhD in ML/NLP – Fairness and self-supervised learning for speech processing
Starting date: November 1st, 2022 (flexible)
Application deadline: September 5th, 2022
Interviews (tentative): September 19th, 2022
Salary: ~2000€ gross/month (social security included)
Mission: research oriented (teaching possible but not mandatory)
*Keywords:* speech processing, fairness, bias, self-supervised learning, evaluation metrics
*CONTEXT*
The ANR project E-SSL (Efficient Self-Supervised Learning for Inclusive and Innovative Speech Technologies) will start on November 1st 2022. Self-supervised learning (SSL) has recently emerged as one of the most promising artificial intelligence (AI) methods, as it is now feasible to take advantage of the colossal amounts of existing unlabeled data to significantly improve the performance of various speech processing tasks.
*PROJECT OBJECTIVES*
Speech technologies are widely used in our daily life and are expanding the scope of our actions, with decision-making systems, including in critical areas such as health or legal matters. In these societal applications, the use of these tools raises the issue of possible discrimination against people according to criteria for which society requires equal treatment, such as gender, origin, religion or disability. Recently, the machine learning community has been confronted with the need to work on the possible biases of algorithms, and many works have shown that the search for the best performance is not the only goal to pursue [1]. For instance, recent evaluations of ASR systems have shown that performance can vary according to gender, but that these variations depend both on the data used for learning and on the models [2]. Such systems are therefore increasingly scrutinized for being biased, while trustworthy speech technologies definitely represent a crucial expectation.
Both the question of bias and the concept of fairness have now become important aspects of AI, and we now have to find the right balance between accuracy and fairness. Unfortunately, these notions of fairness and bias are challenging to define, and their meanings can differ greatly [3].
The goals of this PhD position are threefold:
- First, make a survey of the many definitions of robustness, fairness and bias, with the aim of coming up with definitions and metrics fit for speech SSL models;
- Then, gather speech datasets with a high amount of well-described metadata;
- Set up an evaluation protocol for SSL models and analyze the results (a toy sketch of such a per-group evaluation follows this list).
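As a toy illustration of the kind of per-group evaluation such a protocol could include, the sketch below computes WER separately for each value of a metadata field and reports the gap between groups; the data, group labels and helper functions are invented for the example.

```python
# Toy per-group evaluation: WER per metadata group and the gap between groups.
from collections import defaultdict

def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def per_group_wer(samples):
    """samples: iterable of (group, reference, hypothesis) triples."""
    scores = defaultdict(list)
    for group, ref, hyp in samples:
        scores[group].append(wer(ref, hyp))
    return {g: sum(v) / len(v) for g, v in scores.items()}

# Invented outputs of an SSL-based ASR system on a metadata-rich test set.
samples = [("F", "turn the lights on", "turn the light on"),
           ("M", "turn the lights on", "turn the lights on")]
by_group = per_group_wer(samples)
print(by_group, "gap:", max(by_group.values()) - min(by_group.values()))
```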
*SKILLS*
* Master 2 in Natural Language Processing, Speech Processing, computer science or data science.
* Good command of Python programming and deep learning frameworks.
* Previous experience in bias in machine learning would be a plus.
* Very good communication skills in English.
* Good command of French would be a plus but is not mandatory.
*SCIENTIFIC ENVIRONMENT*
The PhD position will be co-supervised by Alexandre Allauzen (Dauphine Université PSL, Paris) and Solange Rossato and François Portet (Université Grenoble Alpes). Joint meetings are planned on a regular basis and the student is expected to spend time in both places. Moreover, two other PhD positions are open in this project, and the students, along with the partners, will collaborate closely. For instance, specific SSL models along with evaluation criteria will be developed by the other PhD students. The PhD student will also collaborate with several team members involved in the project, in particular the two other PhD candidates who will be recruited and the partners from LIA, LIG and Dauphine Université PSL, Paris. The means to carry out the PhD will be provided, both in terms of missions in France and abroad and in terms of equipment. The candidate will have access to the GPU clusters of both the LIG and Dauphine Université PSL. Furthermore, access to the national supercomputer Jean-Zay will make it possible to run large-scale experiments.
*INSTRUCTIONS FOR APPLYING*
Applications must contain: CV + letter/message of motivation + master's transcripts + be ready to provide letter(s) of recommendation; and be addressed to Alexandre Allauzen (alexandre.allauzen@espci.psl.eu), Solange Rossato (Solange.Rossato@imag.fr) and François Portet (francois.Portet@imag.fr). We celebrate diversity and are committed to creating an inclusive environment for all employees.
*REFERENCES:*
[1] Mengesha, Z., Heldreth, C., Lahav, M., Sublewski, J. & Tuennerman, E. “I don’t Think These Devices are Very Culturally Sensitive.”—Impact of Automated Speech Recognition Errors on African Americans. Frontiers in Artificial Intelligence 4 (2021). issn: 2624-8212. https://www.frontiersin.org/article/10.3389/frai.2021.725911
[2] Garnerin, M., Rossato, S. & Besacier, L. Investigating the Impact of Gender Representation in ASR Training Data: a Case Study on Librispeech in Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing (2021), 86–92.
[3] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A Survey on Bias and Fairness in Machine Learning. ACM Comput. Surv. 54 (July 2021). issn: 0360-0300. https://doi.org/10.1145/3457607
PhD in NLP - PATRIMALP Materials, Pigments, Lights: the colors of Heritage – Natural Language Processing for cultural heritage
Starting date: October 01, 2022 (flexible)
Application deadline: September 5th, 2022
Interviews (tentative): September 12th, 2022
Salary: 1 975€ gross/month (social security included)
Mission: research oriented (teaching possible but not mandatory)
Keywords: natural language processing, knowledge representation, cultural heritage, transfer learning, multilingualism
CONTEXT
The main challenge of the Patrimalp project is the development of an integrated and interdisciplinary Heritage Science, in order to ensure cultural heritage sustainability, promotion and dissemination in contemporary society. The ambition is to produce forms of intelligibility of a global and moving process which starts from the collection of the raw material, its transformation into a primitive object, its different lives as a material (alterations, degradations, transformations...) and as a symbol (relegation, disinterest, oblivion or rebirth, exaltation...) throughout history, and finally its election as an object of historical and heritage value and its “promotion” into a work of art. This research is applied to understanding how inks and pigments have been conceived over several centuries, how they have been used in artworks, and how the handcrafting methods have evolved and been disseminated across centuries and countries.
To make this study possible, the project will gather a large collection of textual material made up of alchemical works and collections of natural or artificial objects collected between the 16th and 18th centuries. To better understand the choice of colors for these "wonders", we want to reconstruct the recipes for making colored material in their context of thought, whether technical or symbolic. These recipes will constitute a new body of research for literary scholars and a new case study for building knowledge about color. This corpus indeed offers modes of representation inscribed in complex forms of writing and fiction whose modalities and frames of reference remain to be analyzed (accounts of technical, medical or physico-chemical experiments inscribed in fictional or mythological worlds, symbolic descriptions of artifacts, or materials collected in nature or in mines). On the linguistic level, the inventory of this lexicon in different European and Eastern languages will lead to the formalization of the knowledge of these various skills across time and cultures. This corpus will thus provide complex data on the material and symbolic origin of the ingredients of color, on its use, its names and its physical or symbolic perception: these data represent a challenge for computer science researchers, who will have to organize them into ontologies representing the state of knowledge from the point of view of scholars over the ages, for the benefit of curators, chemists or physicists. To systematically explore the corpus of these recipes, we will use NLP techniques to uncover the correlations between recipes, the physical and chemical composition of objects, and symbolic references. The final objective is to build a knowledge base (objects, components of objects, materials, colors, know-how, reference framework), each part of which can reference a specific ontology (ontology of pigments, materials, colors...), to make it possible for researchers to observe the trajectory from the writing of color to its technical and artisanal practice from this specific corpus.
PHD OBJECTIVES
The PhD project will focus on segmenting, extracting and representing recipes from a corpus of alchemical works from the 16th to the 18th century to make them accessible to researchers in the humanities. This entails the following tasks:
- identify which excerpts of the text belong to a recipe;
- supervise an annotation campaign to build an analysis and training corpus;
- build NLP tools to automatically extract the list of elements (raw material, tools, quantity, units) and actions (verb, adverb, adjective) that make up the recipes;
- analyze the dependencies between the elements of a recipe;
- represent these rules in a formal knowledge representation (a toy illustration of such a representation follows this list).
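As a toy illustration of what such a formal representation could look like, the sketch below encodes one extracted recipe as RDF triples with the rdflib library; the namespace, classes and properties are hypothetical placeholders, not the project's ontology.

```python
# Toy knowledge-representation sketch: one extracted recipe as RDF triples.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/patrimalp/")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

recipe = EX["recipe_042"]                          # hypothetical identifier
g.add((recipe, RDF.type, EX.Recipe))
g.add((recipe, EX.producesColor, EX.Vermilion))
g.add((recipe, EX.usesMaterial, EX.Cinnabar))
g.add((recipe, EX.usesAction, EX.Grinding))
g.add((EX.Cinnabar, RDFS.label, Literal("cinabre", lang="fr")))

print(g.serialize(format="turtle"))
```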
The results of this processing will support:
- the documentation of this unique set of texts, by inserting the extracted elements into the document metadata to ease retrieval;
- the building of a knowledge base of alchemical recipes.
This PhD will need to address several challenges. One of them is to be able to process text composed of multiple non-modern languages (French, German, English, Latin, Greek) [Coavoux2022, Grobol2022]. One approach will be to study how large multilingual pre-trained models [Delvin2019, Xue2020] can be leveraged and adapted for the task, and how disparate collections of corpora of ancient texts can be used to fine-tune them (a minimal sketch of such an adaptation is given below). Another challenge will be the paucity of data for the downstream tasks (segmentation, parsing, Natural Language Understanding [Desot2022]); to address this, we will need to identify other related corpora (e.g., cooking recipes) and tackle the problem through multitask learning (e.g., joint NER and NLU) and transfer learning.
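A minimal sketch of the multilingual-adaptation idea mentioned above, assuming the Hugging Face transformers library, the public xlm-roberta-base checkpoint and a hypothetical BIO tag set for recipe elements; fine-tuning on the annotated corpus is not shown, so the predictions of the randomly initialized head are meaningless here.

```python
# Sketch: recipe-element extraction cast as token classification with a
# multilingual pre-trained encoder (to be fine-tuned on the annotated corpus).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-MATERIAL", "I-MATERIAL", "B-TOOL", "I-TOOL",
          "B-QUANTITY", "I-QUANTITY", "B-ACTION", "I-ACTION"]   # hypothetical tag set

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# A recipe-like sentence in modern French; the real corpus mixes French, German,
# English, Latin and Greek, hence the multilingual encoder.
text = "Broyez une once de cinabre avec de l'huile de lin dans un mortier."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, n_subwords, n_labels)
pred = [labels[i] for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()), pred)))
```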
SKILLS
- Master 2 in Natural Language Processing, computer science or data science.
- Good command of Python programming and deep learning frameworks.
- Previous experience in text classification, parsing, processing of several languages or text retrieval would be a plus.
- Very good communication skills in English and a good command of French.
SCIENTIFIC ENVIRONMENT
The thesis will be conducted within the STEAMER and GETALP teams of the LIG laboratory (http://steamer.imag.fr/ and https://lig-getalp.imag.fr/). The GETALP team has strong expertise and a solid track record in Natural Language Processing, and the STEAMER team has strong expertise in knowledge representation and reasoning. The recruited person will be welcomed within the teams, which offer a stimulating, multinational and pleasant working environment. The means to carry out the PhD will be provided both in terms of missions in France and abroad and in terms of equipment (personal computer, access to the LIG GPU servers). The PhD candidate will collaborate with the partners involved in the PATRIMALP project, in particular with Laurence Rivière from the LUHCIE lab (Laboratoire Universitaire Histoire Cultures Italie Europe) and Véronique Adam from the LITT&ARTS lab (Littératures et Arts).
INSTRUCTIONS FOR APPLYING
Applications must contain: CV + letter/message of motivation + master's transcripts + be ready to provide letter(s) of recommendation; and be addressed to Danielle Ziebelin (Danielle.Ziebelin@univ-grenoble-alpes.fr), François Portet (francois.Portet@imag.fr) and Maximin Coavoux (Maximin.Coavoux@univ-grenoble-alpes.fr).
REFERENCES
[Coavoux2022] Maximin Coavoux, Corinne Denoyelle, Olivier Kraif, Julie Sorba. Phraséologie du roman médiéval en prose. Diachro X – le français en diachronie, Sorbonne Université, May 2022, Paris, France.
[Delvin2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.
[Desot2022] Desot, T., Portet, F., & Vacher, M. (2022). End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting. Computer Speech & Language, 75, 101369.
[Grobol2022] Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary and Benoit Crabbé. BERTrade: Using Contextual Embeddings to Parse Old French. 13th International Conference on Language Resources and Evaluation (LREC 2022), May 2022, Marseille, France.
[Xue2020] Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., ... & Raffel, C. (2020). mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
PhD in ML/NLP – Fairness and self-supervised learning for speech processing
Starting date: October 1st, 2023 (flexible)
Application deadline: June 9th, 2023
Interviews (tentative): June 14th, 2023
Salary: ~2000€ gross/month (social security included)
Mission: research oriented (teaching possible but not mandatory)
*Keywords:* speech processing, fairness, bias, self-supervised learning, evaluation metrics
*CONTEXT*
This thesis is in the context of the ANR project E-SSL (Efficient Self-Supervised Learning for Inclusive and Innovative Speech Technologies). Self-supervised learning (SSL) has recently emerged as one of the most promising artificial intelligence (AI) methods, as it is now feasible to take advantage of the colossal amounts of existing unlabeled data to significantly improve the performance of various speech processing tasks.
*PROJECT OBJECTIVES*
Speech technologies are widely used in our daily life and are expanding the scope of our actions, with decision-making systems, including in critical areas such as health or legal matters. In these societal applications, the use of these tools raises the issue of possible discrimination against people according to criteria for which society requires equal treatment, such as gender, origin, religion or disability. Recently, the machine learning community has been confronted with the need to work on the possible biases of algorithms, and many works have shown that the search for the best performance is not the only goal to pursue [1]. For instance, recent evaluations of ASR systems have shown that performance can vary according to gender, but that these variations depend both on the data used for learning and on the models [2]. Such systems are therefore increasingly scrutinized for being biased, while trustworthy speech technologies definitely represent a crucial expectation.
Both the question of bias and the concept of fairness have now become important aspects of AI, and we now have to find the right balance between accuracy and fairness. Unfortunately, these notions of fairness and bias are challenging to define, and their meanings can differ greatly [3].
The goals of this PhD position are threefold:
- First, make a survey of the many definitions of robustness, fairness and bias, with the aim of coming up with definitions and metrics fit for speech SSL models;
- Then, gather speech datasets with a high amount of well-described metadata;
- Set up an evaluation protocol for SSL models and analyze the results.
*SKILLS*
* Master 2 in Natural Language Processing, Speech Processing, computer science or data science.
* Good command of Python programming and deep learning frameworks.
* Previous experience in bias in machine learning would be a plus.
* Very good communication skills in English.
* Good command of French would be a plus but is not mandatory.
*SCIENTIFIC ENVIRONMENT*
The PhD position will be co-supervised by Alexandre Allauzen (Dauphine Université PSL, Paris) and Solange Rossato and François Portet (Université Grenoble Alpes). Joint meetings are planned on a regular basis and the student is expected to spend time in both places. Moreover, two other PhD positions are open in this project, and the students, along with the partners, will collaborate closely. For instance, specific SSL models along with evaluation criteria will be developed by the other PhD students. The PhD student will also collaborate with several team members involved in the project, in particular the two other PhD candidates who will be recruited and the partners from LIA, LIG and Dauphine Université PSL, Paris. The means to carry out the PhD will be provided, both in terms of missions in France and abroad and in terms of equipment. The candidate will have access to the GPU clusters of both the LIG and Dauphine Université PSL. Furthermore, access to the national supercomputer Jean-Zay will make it possible to run large-scale experiments.
*INSTRUCTIONS FOR APPLYING*
Applications must contain: CV + letter/message of motivation + master's transcripts + be ready to provide letter(s) of recommendation; and be addressed to Alexandre Allauzen (alexandre.allauzen@espci.psl.eu), Solange Rossato (Solange.Rossato@imag.fr) and François Portet (francois.Portet@imag.fr). We celebrate diversity and are committed to creating an inclusive environment for all employees.
*REFERENCES:*
[1] Mengesha, Z., Heldreth, C., Lahav, M., Sublewski, J. & Tuennerman, E. “I don’t Think These Devices are Very Culturally Sensitive.”—Impact of Automated Speech Recognition Errors on African Americans. Frontiers in Artificial Intelligence 4 (2021). issn: 2624-8212. https://www.frontiersin.org/article/10.3389/frai.2021.725911
[2] Garnerin, M., Rossato, S. & Besacier, L. Investigating the Impact of Gender Representation in ASR Training Data: a Case Study on Librispeech in Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing (2021), 86–92.
[3] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A Survey on Bias and Fairness in Machine Learning. ACM Comput. Surv. 54 (July 2021). issn: 0360-0300. https://doi.org/10.1145/3457607