Utrecht University, The Netherlands
In NLP, there is a growing recognition that data quality is key to better language models, yet we still know very little about the link between data and model behavior. In this project, we will develop methods to measure the diversity of NLP datasets, assess the impact of diversity on NLP models, and improve data collection and model training.
As a PhD student, you will develop innovative methods to measure the diversity of NLP datasets. A major focus will be on measuring dataset diversity from a sociolinguistic perspective, considering language variation, such as styles and dialects, and combining (socio)linguistic insights with neural language modeling. You will also draw from relevant disciplines, particularly the social sciences, that have developed measurement approaches for diversity. Furthermore, you will carry out experiments to assess the impact of data diversity on NLP models, with a focus on fairness and robustness, and investigate ways to leverage data diversity to improve NLP models.
You will join the NLP & Society Lab, headed by Dong Nguyen, where we work on a variety of topics, including computational sociolinguistics, analysis of online conversations, data-centered NLP, and evaluation of NLP models. We are part of the wider NLP group within the Department of Information and Computing Sciences of Utrecht University (UU), the Netherlands.
For more details and to apply, visit: https://www.uu.nl/en/organisation/working-at-utrecht-university/jobs/phd-pos... (Deadline: Jan 5)
Contact: Dong Nguyen (d.p.nguyen@uu.nl)