The 1st Workshop on Computational Terminology in NLP and Translation
Studies (ConTeNTs)
Varna, 7th-8th September, 2023
In conjunction with RANLP 2023 - International Conference "Recent
Advances in Natural Language Processing"
Third call for papers
Computational Terminology and new technologies applied to translation
studies have attracted the interest of researchers from very different
multidisciplinary backgrounds and with different motivations. These fields
cover a range of areas in Natural Language Processing (NLP), such as
information retrieval, terminology extraction, question-answering systems,
ontology building, machine translation, computer-aided translation,
automatic or semi-automatic abstracting, text generation, etc.
Terminological identification, extraction and the coinage of new terms are
essential for knowledge mining from texts, in both high- and low-resource
languages. Rapid evolution and new developments in specialised domains
require efficient and systematic automatic term management. New terms need
to be coined and translated to ensure the equitable development of domains
in all languages.
During the last decade, deep learning and neural methods have become the
state of the art for most NLP applications. These methods have been shown
to outperform previous approaches on various tasks, including automatic
term extraction, language mining, quality assessment in machine
translation, accessibility of terminology, etc. On the one hand, NLP and
computational linguistics try to improve the work of translators and
interpreters by developing Computer-Assisted Translation (CAT) tools,
Translation Memories (TMs), terminological databases, terminology
extraction tools, etc. On the other hand, the NLP field still needs the
efforts and knowledge of translators, interpreters and linguists in order
to provide better services and tools based on the real needs of those
language professionals.
The aim of this workshop is to promote new insights into the ongoing and
forthcoming developments in computational terminology by bringing
together NLP experts, as well as terminologists and translators. By
uniting researchers with such diverse profiles, we hope to bridge some
of the gaps between these disciplines and inspire a dialogue between
various parties, thus paving the way to more artificial intelligence
applications based on mutual collaboration between language and
technology.
Topics of Interest
The ConTeNTs workshop invites the submission of papers reporting on
original and unpublished research on topics related to Computational
Terminology in NLP and Translation Studies, including but not limited
to:
* Automatic term extraction: monolingual and multilingual extraction
of terms from parallel and comparable corpora, including single and
multiword expressions;
* Extraction and acquisition of semantic relations between terms;
* Extraction and generation of domain-specific definitions and
disambiguation of terms;
* Representation of terms, management of term variation and the
discovery of synonymous terms or term clusters, and their relation to NLP
applications;
* Extraction of terminological contexts through the use of comparable
and parallel corpora;
* Accessibility of terminology in certain domains, relevant to
non-experts or laypersons, and its relevance to NLP applications such as
chatbots, automatic email generation or spoken language interfaces;
* The impact of terminology on MT (applying terminology constraints,
evaluation of MT in domain-specific settings, etc.);
* The creation of domain ontologies, thesauri and terminological
resources in specialised domains;
* The use of new technologies in translation studies and research and
the use of terminological resources in specialised translation;
* Identification of key problems in terminology and new technologies
used in translation studies;
* Evaluation of terminological resources in various NLP applications
and the impact these resources have on the performance of automatic
systems;
* Emerging language technologies: how the increased reliance on
real-time language technologies would change the structure of language;
* Corpus-based studies applied to translation and interpreting: the
use of parallel and comparable corpora for translating phraseological
units;
* Phraseology and multiword expressions in cross-linguistic studies;
* Translation and interpreting tools, such as translation memories,
machine translation and alignment tools;
* User requirements for interpreting and translation tools.
SUBMISSION GUIDELINES
Submissions must consist of full-text papers between 5 and 7 pages long,
excluding references. Accepted papers will be published in the ConTeNTs
workshop e-proceedings with an ISBN, will be assigned a DOI, and will also
be available at the time of the conference. Papers should be written in
English.
Authors of accepted papers will receive guidelines regarding how to
produce camera-ready versions of their papers for inclusion in the
proceedings.
Each submission will be reviewed by at least two programme committee
members. Accepted papers will be presented orally as part of the
programme of the workshop.
Submissions
Link to START system: https://softconf.com/ranlp23/ConTeNTS
Website of the workshop: https://contents2023.kulak.kuleuven.be/
Should you require any assistance with the submission, please do not
hesitate to contact us at amalhaddad@ugr.es and
ayla.rigoutsterryn@kuleuven.be.
Important Dates
Deadline for paper submission: 10 July 2023
Acceptance notification: 5 August 2023
Final camera-ready version: 25 August 2023
Workshop camera-ready proceedings ready: 31 August 2023
ConTeNTs workshop: 7-8 September 2023
Workshop Chairs & Organising Committee
Ayla Rigouts Terryn, Katholieke Universiteit Leuven, Belgium
Amal Haddad Haddad, Universidad de Granada, Spain
Ruslan Mitkov, University of Wolverhampton, United Kingdom
Programme Committee
* Sophia Ananiadou (University of Manchester)
* Maria Andreeva Todorova (Bulgarian Academy of Sciences)
* Silvia Bernardini (University of Bologna)
* Melania Cabezas García (Universidad de Granada)
* Rute Costa (Universidade Nova de Lisboa)
* Esther Castillo Pérez (Universidad de Granada)
* Patrick Drouin (Université de Montréal)
* Pamela Faber (Universidad de Granada)
* Mercedes García de Quesada (Universidad de Granada)
* Dagmar Gromann (Centre for Translation Studies - University of
Vienna)
* Tran Thi Hong Hanh (L3i Laboratory, University of La Rochelle)
* Rejwanul Haque (National College of Ireland)
* Amir Hazem (Nantes University)
* Kyo Kageura (University of Tokyo)
* Barbara Karsch (BIK Terminology - USA)
* Dorothy Kenny (Dublin City University)
* Miloš Jakubíček (Sketch Engine)
* Hendrik Kockaert (KU Leuven)
* Philipp Koehn (Johns Hopkins University)
* Maria Kunilovskaya (Saarland University)
* Marie-Claude L'Homme (Université de Montréal)
* Hélène Ledouble (Université de Toulon)
* Pilar León-Araúz (Universidad de Granada)
* Rodolfo Maslias (former Head of TermCoord, European Parliament)
* Silvia Montero Martínez (Universidad de Granada)
* Emmanuel Morin (LS2N-TALN)
* Rogelio Nazar (Pontificia Universidad Católica de Valparaíso)
* Sandrine Peraldi (University College Dublin)
* Silvia Piccini (Italian National Research Council)
* Thierry Poibeau (CNRS)
* Senja Pollak (Jožef Stefan Institute)
* Maria Pozzi Pardo (El Colegio de México)
* Tharindu Ranasinghe (Aston University)
* Arianne Reimerink (Universidad de Granada)
* Andres Repar (Jožef Stefan Institute)
* Christophe Roche (Université Savoie Mont-Blanc)
* Antonio San Martín Pizarro (Université du Québec à Trois-Rivières)
* Beatriz Sánchez Cárdenas (Universidad de Granada)
* Vilelmini Sosoni (Ionian University)
* Irena Spasic (Cardiff University)
* Elena Isabelle Tamba (Romanian Academy, Iași Branch)
* Rita Temmerman (Vrije Universiteit Brussel)
* Jorge Vivaldi Palatresi (Universitat Pompeu Fabra)
PhD in ML/NLP – Efficient, Fair, Robust and Knowledge-Informed
Self-Supervised Learning for Speech Processing
Starting date: November 1st, 2022 (flexible)
Application deadline: September 5th, 2022
Interviews (tentative): September 19th, 2022
Salary: ~2000€ gross/month (social security included)
Mission: research-oriented (teaching possible but not mandatory)
*Keywords:* speech processing, natural language processing,
self-supervised learning, knowledge-informed learning, robustness, fairness
*CONTEXT*
The ANR project E-SSL (Efficient Self-Supervised Learning for Inclusive
and Innovative Speech Technologies) will start on November 1st, 2022.
Self-supervised learning (SSL) has recently emerged as one of the most
promising artificial intelligence (AI) methods, as it is now feasible to
take advantage of the colossal amounts of existing unlabeled data to
significantly improve the performance of various speech processing tasks.
*PROJECT OBJECTIVES*
Recent SSL models for speech, such as HuBERT or wav2vec 2.0, have shown an
impressive impact on downstream task performance. This impact, however,
mainly stems from their ability to exploit large amounts of data at the
cost of a tremendous carbon footprint, rather than from improvements in
learning efficiency. Another open question related to SSL models is their
unpredictable behaviour once applied to realistic scenarios, which exposes
their lack of robustness. Furthermore, as for any pre-trained model
deployed in society, it is important to be able to measure the bias of
such models, since they can amplify social unfairness.
The goals of this PhD position are threefold:
- to design new evaluation metrics for SSL speech models;
- to develop knowledge-driven SSL algorithms;
- to propose methods for learning robust and unbiased representations.
SSL models are evaluated with downstream task-dependent metrics, e.g.,
word error rate for speech recognition. This couples the evaluation of the
universality of SSL representations to a potentially biased and costly
fine-tuning step that also hides the efficiency information related to the
pre-training cost. In practice, we will seek to measure training
efficiency as the ratio between the amount of data, computation and memory
needed and the gain observed on a metric of interest, whether
downstream-dependent or not. The first step will be to document standard
markers that can be used to assess these values robustly at training time.
Potential candidates are, for instance, floating-point operations for
computational intensity, the number of neural parameters coupled with
their precision for storage, online measurement of memory consumption for
training, and cumulative input sequence length for data.
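A minimal sketch of how such markers could be logged at each training
step, assuming PyTorch and a CUDA device; the function name and interface
are hypothetical illustrations, not part of the project description:

    import time
    import torch

    def training_step_markers(model, batch, loss_fn):
        # Parameter count, coupled with precision, approximates storage cost.
        n_params = sum(p.numel() for p in model.parameters())
        torch.cuda.reset_peak_memory_stats()
        start = time.time()
        loss = loss_fn(model(batch))
        loss.backward()
        return loss, {
            "params": n_params,
            # Cumulative input sequence length as a data-consumption marker.
            "frames_seen": batch.shape[0] * batch.shape[-1],
            # Online measurement of memory consumption during the step.
            "peak_mem_bytes": torch.cuda.max_memory_allocated(),
            "step_seconds": time.time() - start,
            # FLOPs could be added via torch.profiler.profile(with_flops=True).
        }

Training efficiency can then be reported as the gain on the metric of
interest divided by any of these accumulated costs.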
Most state-of-the-art SSL models for speech rely on masked prediction
(e.g., HuBERT and WavLM) or contrastive losses (e.g., wav2vec 2.0). Such
prevalence in the literature is mostly linked to the size, amount of data
and computational resources invested by the companies producing these
models. In fact, vanilla masking approaches and contrastive losses may be
regarded as uninformed solutions, as they do not benefit from in-domain
expertise. For instance, it has been demonstrated that blindly masking
frames in the input signal (as in HuBERT and WavLM) results in much worse
downstream performance than using unsupervised phonetic boundaries
[Yue2021] to generate informed masks. Recently, some studies have
demonstrated the superiority of an informed multitask learning strategy,
which carefully selects self-supervised pretext tasks with respect to a
set of downstream tasks, over the vanilla wav2vec 2.0 contrastive learning
loss [Zaiem2022]. In this PhD project, our objectives are: 1. to continue
developing knowledge-driven SSL algorithms that reach higher efficiency
ratios and better results in terms of convergence, data consumption and
downstream performance; and 2. to scale these novel approaches to a point
that enables comparison with current state-of-the-art systems, thereby
motivating a paradigm change in SSL for the wider speech community.
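For illustration, a minimal sketch of boundary-informed masking in the
spirit of [Yue2021], assuming NumPy and a hypothetical interface in which
unsupervised phone boundaries are given as frame indices:

    import numpy as np

    def informed_mask(num_frames, boundaries, mask_ratio=0.4, rng=None):
        # `boundaries` lists the frame indices of (unsupervised) phone
        # boundaries; whole segments are masked instead of random frames.
        rng = rng or np.random.default_rng()
        segments = list(zip([0] + boundaries, boundaries + [num_frames]))
        mask = np.zeros(num_frames, dtype=bool)
        for idx in rng.permutation(len(segments)):
            start, end = segments[idx]
            mask[start:end] = True
            if mask.mean() >= mask_ratio:
                break
        return mask  # True marks frames to mask for the pretext task

Masking whole phone-like segments injects in-domain knowledge that blind
frame masking lacks.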
Despite remarkable performance on academic benchmarks, SSL-powered
technologies (e.g., speech and speaker recognition, speech synthesis and
many others) may exhibit highly unpredictable results once applied to
realistic scenarios. This can translate into a global accuracy drop due to
a lack of robustness to adversarial acoustic conditions, or into biased
and discriminatory behaviour with respect to different pools of end users.
Documenting and facilitating the control of such aspects prior to the
deployment of SSL models in real life is necessary for the industrial
market. To evaluate such aspects within the project, we will create novel
robustness regularization and debiasing techniques along two axes:
1. debiasing and regularizing speech representations at the SSL level;
2. debiasing and regularizing downstream-adapted models (e.g., models
using a pre-trained encoder).
To ensure the creation of fair and robust SSL pre-trained models, we
propose to act at both the optimization and data levels, following some of
our previous work on adversarial protected-attribute disentanglement and
the NLP literature on data sampling and augmentation [Noé2021]. Here, we
wish to extend this technique to more complex SSL architectures and more
realistic conditions by increasing the disentanglement complexity; the sex
attribute studied in [Noé2021], for instance, is particularly
discriminatory. Then, to benefit from the expert knowledge induced by the
scope of the task of interest, we will build on a recently introduced
task-dependent counterfactual equal-odds criterion [Sari2021] to minimize
the downstream performance gap observed between individuals with different
protected attributes and to maximize overall accuracy. Following this
multi-objective optimization scheme, we will then inject further
identified constraints, as inspired by previous NLP work [Zhao2017].
Intuitively, constraints are injected so that predictions are calibrated
towards a desired, i.e., unbiased, distribution.
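As a simplified sketch of the kind of gap measurement involved, assuming
NumPy; this is a rough proxy for the counterfactual equal-odds criterion
of [Sari2021], not the criterion itself, and the interface is
hypothetical:

    import numpy as np

    def equal_odds_gap(y_true, y_pred, groups):
        # For each true label, compare per-group accuracy and report the
        # largest gap; zero would mean equal odds across all groups.
        y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
        gaps = []
        for label in np.unique(y_true):
            sel = y_true == label
            rates = [(y_pred[sel & (groups == g)] == label).mean()
                     for g in np.unique(groups)
                     if (sel & (groups == g)).any()]
            if len(rates) > 1:
                gaps.append(max(rates) - min(rates))
        return max(gaps) if gaps else 0.0

Such a gap could serve as one term of the multi-objective optimization
alongside overall accuracy.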
*SKILLS*
* Master 2 in Natural Language Processing, Speech Processing, Computer
Science or Data Science
* Good command of Python programming and of a deep learning framework
* Previous experience in self-supervised learning, acoustic modelling or
ASR would be a plus
* Very good communication skills in English
* Good command of French would be a plus but is not mandatory
*SCIENTIFIC ENVIRONMENT*
The thesis will be conducted within the GETALP team of the LIG laboratory
(https://lig-getalp.imag.fr/) and the LIA laboratory
(https://lia.univ-avignon.fr/). The GETALP team and the LIA have strong
expertise and a solid track record in natural language processing and
speech processing. The recruited person will be welcomed within teams that
offer a stimulating, multinational and pleasant working environment.
The means to carry out the PhD will be provided, both in terms of missions
in France and abroad and in terms of equipment. The candidate will have
access to the GPU clusters of both the LIG and the LIA. Furthermore,
access to the Jean-Zay national supercomputer will make it possible to run
large-scale experiments.
The PhD position will be co-supervised by Mickael Rouvier (LIA, Avignon)
and by Benjamin Lecouteux and François Portet (Université Grenoble Alpes).
Joint meetings are planned on a regular basis, and the student is expected
to spend time in both places. Moreover, the PhD student will collaborate
with several team members involved in the project, in particular the two
other PhD candidates who will be recruited and the partners from LIA, LIG
and Dauphine Université PSL, Paris. Furthermore, the project will involve
one of the founders of SpeechBrain, Titouan Parcollet, with whom the
candidate will interact closely.
*INSTRUCTIONS FOR APPLYING*
Applications must contain: a CV + a letter/message of motivation +
master's grade transcripts + readiness to provide letter(s) of
recommendation; and be addressed to Mickael Rouvier
(mickael.rouvier@univ-avignon.fr), Benjamin Lecouteux
(benjamin.lecouteux@univ-grenoble-alpes.fr) and François Portet
(francois.portet@imag.fr). We celebrate diversity and are committed to
creating an inclusive environment for all employees.
*REFERENCES:*
[Noé2021] Noé, P.-G., Mohammadamini, M., Matrouf, D., Parcollet, T.,
Nautsch, A. & Bonastre, J.-F. Adversarial Disentanglement of Speaker
Representation for Attribute-Driven Privacy Preservation. In Proc.
Interspeech 2021 (2021), 1902–1906.
[Sari2021] Sarı, L., Hasegawa-Johnson, M. & Yoo, C. D. Counterfactually
Fair Automatic Speech Recognition. IEEE/ACM Transactions on Audio,
Speech, and Language Processing 29 (2021), 3515–3525.
[Yue2021] Yue, X. & Li, H. Phonetically Motivated Self-Supervised Speech
Representation Learning. In Proc. Interspeech 2021 (2021), 746–750.
[Zaiem2022] Zaiem, S., Parcollet, T. & Essid, S. Pretext Tasks Selection
for Multitask Self-Supervised Speech Representation. In The 2nd Workshop
on Self-supervised Learning for Audio and Speech Processing, AAAI (2022).
[Zhao2017] Zhao, J., Wang, T., Yatskar, M., Ordonez, V. & Chang, K.-W.
Men Also Like Shopping: Reducing Gender Bias Amplification using
Corpus-level Constraints. In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing (2017), 2979–2989.
--
François PORTET
Professeur - Univ Grenoble Alpes
Laboratoire d'Informatique de Grenoble - Équipe GETALP
Bâtiment IMAG - Office 333
700 avenue Centrale
Domaine Universitaire - 38401 St Martin d'Hères
FRANCE
Phone: +33 (0)4 57 42 15 44
Email: francois.portet@imag.fr
www: http://membres-liglab.imag.fr/portet/