== 12th NLP4CALL, Tórshavn, Faroe Islands==
The workshop series on Natural Language Processing (NLP) for Computer-Assisted Language Learning (NLP4CALL) is a meeting place for researchers working on the integration of Natural Language Processing and Speech Technologies in CALL systems and exploring the theoretical and methodological issues arising in this connection. The latter include, among others, insights from Second Language Acquisition (SLA) research on the one hand, and the promotion of “Computational SLA” through the setup of Second Language research infrastructure(s) on the other.
The intersection of Natural Language Processing (or Language Technology / Computational Linguistics) and Speech Technology with Computer-Assisted Language Learning (CALL) brings “understanding” of language to CALL tools, thus making CALL intelligent. This has given the area of research its name: Intelligent CALL (ICALL). As the definition suggests, apart from having excellent knowledge of Natural Language Processing and/or Speech Technology, ICALL researchers need good insights into second language acquisition theories and practices, as well as knowledge of second language pedagogy and didactics. This workshop therefore invites a wide range of ICALL-relevant research, including studies where NLP-enriched tools are used for testing SLA and pedagogical theories, and vice versa, where SLA theories, pedagogical practices or empirical data are modeled in ICALL tools.
The NLP4CALL workshop series is aimed at bringing together competences from these areas for sharing experiences and brainstorming around the future of the field.
We welcome papers:
- that describe research directly aimed at ICALL;
- that demonstrate the actual use, or discuss the potential use, of existing Language and Speech Technologies or resources for language learning;
- that describe the ongoing development of resources and tools with potential usage in ICALL, either directly in interactive applications, or indirectly in materials, application or curriculum development, e.g. learning material generation, assessment of learner texts and responses, individualized learning solutions, provision of feedback;
- that discuss challenges and/or a research agenda for ICALL;
- that describe empirical studies on language learner data.
This year a special focus is given to work done on error detection/correction and feedback generation.
We encourage paper presentations and software demonstrations describing the above-mentioned themes primarily, but not exclusively, for the Nordic languages.
==Shared task==
NEW for this year is the MultiGED shared task on token-level error detection for L2 Czech, English, German, Italian and Swedish, organized by the Computational SLA working group.
For more information, please see the Shared Task website: https://github.com/spraakbanken/multiged-2023
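Token-level error detection of this kind is commonly scored with precision, recall, and F0.5, which weights precision higher than recall. A minimal scoring sketch (the actual label set and file format are specified on the Shared Task website; the `i`-for-incorrect / `c`-for-correct labels here are an assumption):

```python
def f05_token_detection(gold, pred, error_label="i"):
    """Precision, recall, and F0.5 for token-level error detection.

    gold, pred: equal-length sequences of per-token labels, where
    `error_label` marks a token flagged as incorrect.
    """
    assert len(gold) == len(pred)
    tp = sum(1 for g, p in zip(gold, pred) if g == p == error_label)
    fp = sum(1 for g, p in zip(gold, pred) if g != error_label and p == error_label)
    fn = sum(1 for g, p in zip(gold, pred) if g == error_label and p != error_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    beta2 = 0.5 ** 2  # beta = 0.5 emphasizes precision over recall
    f05 = ((1 + beta2) * precision * recall / (beta2 * precision + recall)
           if precision + recall else 0.0)
    return precision, recall, f05
```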
==Invited speakers==
This year, we have the pleasure of announcing two invited talks.
The first talk is given by Marije Michel from the University of Amsterdam.
The second talk is given by Pierre Lison from the Norwegian Computing Center.
==Submission information==
Authors are invited to submit long papers (8-12 pages) or, alternatively, short papers (4-7 pages); the page count does not include references.
We will be using the NLP4CALL template for the workshop this year. The author kit can be accessed here or, alternatively, on Overleaf:
<https://spraakbanken.gu.se/sites/default/files/2023/NLP4CALL%20workshop%20t…>
<https://spraakbanken.gu.se/sites/default/files/2023/nlp4call%20template.doc>
<https://www.overleaf.com/latex/templates/nlp4call-workshop-template/qqqzqqy…>
Submissions will be managed through the electronic conference management system EasyChair <https://easychair.org/conferences/?conf=nlp4call2023>. Papers must be submitted digitally through the conference management system, in PDF format. Final camera-ready versions of accepted papers will be given an additional page to address reviewer comments.
Papers should describe original unpublished work or work in progress. Papers will be peer reviewed by at least two members of the program committee in a double-blind fashion. All accepted papers will be collected into a proceedings volume to be submitted for publication in the NEALT Proceedings Series (Linköping Electronic Conference Proceedings) and, additionally, published through the ACL Anthology, following the previous NLP4CALL editions (<https://www.aclweb.org/anthology/venues/nlp4call/>).
==Important dates==
03 April 2023: paper submission deadline
21 April 2023: notification of acceptance
01 May 2023: camera-ready papers for publication
22 May 2023: workshop date
==Organizers==
David Alfter (1), Elena Volodina (2), Thomas François (3), Arne Jönsson (4), Evelina Rennes (4)
(1) Gothenburg Research Infrastructure for Digital Humanities, Department of Literature, History of Ideas, and Religion, University of Gothenburg, Sweden
(2) Språkbanken, Department of Swedish, Multilingualism, Language Technology, University of Gothenburg, Sweden
(3) CENTAL, Institute for Language and Communication, Université Catholique de Louvain, Belgium
(4) Department of Computer and Information Science, Linköping University, Sweden
==Contact==
For any questions, please contact David Alfter, david.alfter(a)gu.se
For further information, see the workshop website <https://spraakbanken.gu.se/en/research/themes/icall/nlp4call-workshop-serie…>
Follow us on Twitter @NLP4CALL <https://twitter.com/NLP4CALL/>
Hi there,
Could you please distribute the following job offer? Thanks.
Best,
Pascal
-------------------------------------------------------------------------------------
We invite applications for a 3-year PhD position co-funded by Inria,
the French national research institute in Computer Science and Applied
Mathematics, and LexisNexis France, the leader in legal information in
France and a subsidiary of the RELX Group.
The overall objective of this project is to develop an automated
system for detecting argumentation structures in French legal
decisions, using recent machine learning-based approaches (i.e. deep
learning approaches). In the general case, these structures take the
form of a directed labeled graph, whose nodes are the elements of the
text (propositions or groups of propositions, not necessarily
contiguous) which serve as components of the argument, and edges are
relations that signal the argumentative connection between them (e.g.,
support, attack). By revealing the argumentation structure behind
legal decisions, such a system will provide a crucial milestone
towards their detailed understanding and their use by legal
professionals, and above all will contribute to greater transparency of
justice.
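As a rough illustration of such a structure, here is a minimal sketch in Python (the component types, span encoding, and `support`/`attack` relation labels are illustrative assumptions, not the project's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class ArgumentComponent:
    """A (possibly non-contiguous) group of propositions acting as one argument unit."""
    id: str
    spans: list          # list of (start, end) character offsets in the decision text
    comp_type: str       # e.g. "premise" or "conclusion" (illustrative labels)

@dataclass
class ArgumentGraph:
    """Directed labeled graph over argument components."""
    components: dict = field(default_factory=dict)   # id -> ArgumentComponent
    edges: list = field(default_factory=list)        # (source_id, target_id, relation)

    def add_component(self, comp):
        self.components[comp.id] = comp

    def add_relation(self, source_id, target_id, relation):
        # relation labels the argumentative connection, e.g. "support" or "attack"
        self.edges.append((source_id, target_id, relation))

    def supporters_of(self, target_id):
        return [s for s, t, r in self.edges if t == target_id and r == "support"]
```

Sub-tasks such as argument span detection and argument typing then amount to predicting the components and their labels, while relation classification predicts the edges.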
The main challenges and milestones of this project start with the
creation and release of a large-scale dataset of French legal
decisions annotated with argumentation structures. To minimize the
manual annotation effort, we will resort to semi-supervised and
transfer learning techniques to leverage existing argument mining
corpora, such as the European Court of Human Rights (ECHR) corpus, as
well as annotations already started by LexisNexis. Another promising
research direction, which is likely to improve over state-of-the-art
approaches, is to better model the dependencies between the different
sub-tasks (argument span detection, argument typing, etc.) instead of
learning these tasks independently. A third research avenue is to find
innovative ways to inject domain knowledge (in particular the rich
legal ontology developed by LexisNexis) to enrich the
representations used in these models. Finally, we would like to take
advantage of other discourse structures, such as coreference and
rhetorical relations, conceived as auxiliary tasks in a multi-tasking
architecture.
The successful candidate will hold a Master's degree in computational
linguistics, natural language processing, or machine learning, ideally
with prior experience in legal document processing and discourse
processing. Furthermore, the candidate should have strong programming
skills and expertise in machine learning approaches, and be eager to work
at the interplay between academia and industry.
The position is affiliated with MAGNET [1], a research group at
Inria, Lille, which has expertise in Machine Learning and Natural
Language Processing, in particular Discourse Processing. The PhD
student will also work in close collaboration with the R&D team at
LexisNexis France, who will provide their expertise in the legal
domain and the data they have collected.
Applications will be considered until the position is filled. However,
you are encouraged to apply early as we shall start processing the
applications as and when they are received. Applications, written in
English or French, should include a brief cover letter with research
interests and vision, a CV (including your contact address, work
experience, publications), and contact information for at least 2
referees. Applications (and questions) should be sent to Pascal Denis
(pascal.denis(a)inria.fr).
The starting date of the position is 1 November 2022 or soon
thereafter, for a total of 3 full years.
Best regards,
Pascal Denis
[1] https://team.inria.fr/magnet/
[2] https://www.lexisnexis.fr/
+++++++++++++++++++++++++++++++++++++++++++++++
Pascal Denis
Equipe MAGNET, INRIA Lille Nord Europe
Bâtiment B, Avenue Heloïse
Parc scientifique de la Haute Borne
59650 Villeneuve d'Ascq
Tel: +33 3 59 35 87 24
Url: http://researchers.lille.inria.fr/~pdenis/
+++++++++++++++++++++++++++++++++++++++++++++++
Dear colleagues,
Last month, we shared the results of our collaborative work on a core metadata scheme for learner corpora with LCR2022 participants. Our proposal builds on Granger and Paquot's (2017) first attempt to design such a scheme; during our presentation, we explained the rationale for expanding on the initial proposal and discussed selected aspects of the revised scheme.
Our proposal is available at https://docs.google.com/spreadsheets/d/1-RbX5iUCUtCBkZU9Rfk-kv-Vzc--F-eUW2O…<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.goog…>
We firmly believe that our efforts to develop a core metadata scheme for learner corpora will only be successful to the extent that (1) the LCR community is given the opportunity to engage with our work in various ways (provide feedback on the general structure of the scheme, the list of variables that we identified as core and their operationalization; test the metadata on other learner corpora; use the scheme to start a new corpus compilation, etc.) and (2) the core metadata scheme is the result of truly collaborative work.
As mentioned at LCR2022, we will be collecting feedback on the metadata scheme until the end of October. The online feedback form is available at:
https://docs.google.com/document/d/1NeDUuxGJlPSJI9wHVA1xgGM-aV8jXTa8Qlb45K-…<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.goog…>
We'd like to thank all the colleagues who already got back to us (at LCR2022, by email or via the online form). We also thank them for their appreciation of and enthusiasm for our work! We'd also like to encourage more colleagues (and particularly those of you who have experience in learner corpus compilation) to provide feedback! We need help in finalizing the core metadata scheme to make sure that it can be applied in all learner corpus compilation contexts. In short, we need you to make sure the scheme meets the needs of the LCR community at large.
With very best wishes,
Magali Paquot (also on behalf of Alexander König, Jennifer-Carmen Frey, and Egon W. Stemle)
Reference
Granger, S. & M. Paquot (2017). Towards standardization of metadata for L2 corpora. Invited talk at the CLARIN workshop on Interoperability of Second Language Resources and Tools, 6-8 December 2017, University of Gothenburg, Sweden.
Dr. Magali Paquot
Centre for English Corpus Linguistics
Institut Langage et Communication
UCLouvain
https://perso.uclouvain.be/magali.paquot/
We are pleased to announce the inaugural offering of the Plain Language Adaptation of Biomedical Abstracts (PLABA) track, as part of the 2023 Text Analysis Conference (TAC) hosted by the U.S. National Institute of Standards and Technology (NIST). This track is an opportunity to showcase your cutting-edge research on an important topic, and to take advantage of large amounts of expert annotated data and manual evaluation.
Background: Deficits in health literacy are linked to worse outcomes and drive health disparities. Though unprecedented amounts of biomedical knowledge are available online, patients and caregivers face a kind of “language barrier” when confronted with jargon and academic writing. Advances in language modeling have improved plain language generation, but the task of automatically and accurately adapting biomedical text for a general audience has thus far lacked high-quality, standardized benchmarks.
Task: Systems will adapt biomedical abstracts to plain language. This includes substituting medical jargon, providing explanations for necessary terms, simplifying sentences, and other modifications. The training set is the publicly available PLABA dataset<https://doi.org/10.1038%2Fs41597-022-01920-3>, which contains 750 abstracts with manual, sentence-aligned adaptations for each, totaling more than 7k sentence pairs with document context.
Evaluation: Participating systems will be evaluated on 400 held-out abstracts, each manually adapted four times by different annotators for robust automatic metrics. Additionally, a subset of system outputs will be manually evaluated along several axes to ensure they are accurate and faithful to the original, which is crucial for the biomedical domain.
URL: https://bionlp.nlm.nih.gov/plaba2023/
Mailing list: https://groups.google.com/g/plaba2023
Key dates:
Jul 19 – Evaluation data released
Aug 16 – Submissions due
Oct 18 – Results posted
We look forward to your submissions.
The Research Training Group 2853 “Neuroexplicit Models of Language, Vision, and Action” is looking for
6 PhD students and 1 postdoc
starting October 2023 or later.
Neuroexplicit models combine neural and human-interpretable (“explicit”) models in order to overcome the limitations that each model class has separately. They include neurosymbolic models, which combine neural and symbolic models, but also e.g. combinations of neural and physics-based models. In the RTG, we will improve the state of the art in natural language processing (“Language”), computer vision (“Vision”), and planning and reinforcement learning (“Action”) through the use of neuroexplicit models and investigate the cross-cutting design principles of effective neuroexplicit models (“Foundations”).
The RTG is scheduled to grow to a total of 24 PhD students and one postdoc by 2025. Through the inclusion of ~20 further PhD students and postdocs funded from other sources, it will be one of the largest research centers on neuroexplicit or neurosymbolic models in the world. The RTG brings together researchers at Saarland University, the Max Planck Institute for Informatics, the Max Planck Institute for Software Systems, the CISPA Helmholtz Center for Information Security, and the German Research Center for Artificial Intelligence (DFKI). All of these institutions are colocated on the same campus in Saarbrücken, Germany.
The positions are funded as follows:
• PhD students will be funded for up to four years at the TV-L E13 100% pay scale. You should have or be about to complete an MSc degree in computer science or a related field and have demonstrated expertise in one of the research areas of the RTG, e.g. through an excellent Master’s thesis or relevant publications.
• The postdoc will initially be funded for three years, with the possibility of extension up to five years, at the TV-L E13 100% pay scale. As the RTG postdoc, you will pursue your own research agenda in the field of neuroexplicit models and work with the PhD students to identify and pursue opportunities for collaborative research. You should have or be about to complete a PhD in computer science or a related field and have demonstrated your expertise in one or more of the RTG’s research areas through publications in top venues.
The RTG is part of the Saarland Informatics Campus, one of the leading centers for research in computer science, artificial intelligence, and natural language processing in Europe. The Saarland Informatics Campus brings together 900 researchers and 2500 students from 81 countries. The CISPA Helmholtz Center, located on the same campus, is home to an additional 350 researchers and on track to grow to 800 by 2026. Researchers at SIC and CISPA are part of the ELLIS network and have been awarded more than 35 ERC grants.
Each PhD student in the RTG will be jointly supervised by two PhD advisors from the list of Principal Investigators below. Each student will freely define their own research topic; we encourage the choice of topics that cross the traditional boundaries of research fields. Students may be affiliated with Saarland University or with one of the participating institutes.
Vera Demberg, Saarland University - Computational Linguistics
Jörg Hoffmann, Saarland University - AI Planning
Eddy Ilg, Saarland University - Computer Vision, Machine Learning
Dietrich Klakow, Saarland University - Natural Language Processing
Alexander Koller, Saarland University - Computational Linguistics
Bernt Schiele, MPI for Informatics - Computer Vision, Machine Learning
Philipp Slusallek, DFKI and Saarland University - Computer Graphics, Artificial Intelligence
Christian Theobalt, MPI for Informatics - Visual Computing, Machine Learning
Mariya Toneva, MPI for Software Systems - Computational Neuroscience, Machine Learning
Isabel Valera, Saarland University - Machine Learning
Jilles Vreeken, CISPA - Machine Learning, Causality
Joachim Weickert, Saarland University - Mathematical Data Analysis
Verena Wolf, DFKI and Saarland University - Modeling and Simulation, Reinforcement Learning
Ellie Pavlick, Brown University and Google AI, will join us regularly as a Mercator Fellow.
Please send your application by 31 May 2023 to bewerbung(a)uni-saarland.de. Include the reference number W2298 for the postdoc position and the reference number W2299 for the PhD positions. We aim to conduct job interviews in July (for a start in October) and September (for a later start). The legally binding version of this job ad is at https://www.uni-saarland.de/fileadmin/upload/verwaltung/stellen/Wissenschaf… (postdoc) and https://www.uni-saarland.de/fileadmin/upload/verwaltung/stellen/Wissenschaf… (PhD), respectively.
For details on what materials to submit with your application and all other information about the RTG, please see our website: https://www.neuroexplicit.org/jobs/#phd-2023
Dear colleagues and friends,
This year, we are organizing the MedVidQA challenge
<https://medvidqa.github.io/>
with TRECVID 2023 <https://www-nlpir.nist.gov/projects/tv2023/index.html>.
This challenge aims at developing models for (1) retrieving the relevant
videos and locating the visual answer in those videos for a medical or
health-related question and (2) generating medical instructional
questions from video segments. Following the success of the 1st
MedVidQA shared task <https://aclanthology.org/2022.bionlp-1.25/>, MedVidQA
at TRECVID 2023 expands the tasks and introduces a new track on
language-video understanding and generation. This track comprises two
main tasks: Video Corpus Visual Answer Localization (VCVAL) and Medical
Instructional Question Generation (MIQG).
For more details, please visit the challenge website (
https://medvidqa.github.io/) and TRECVID 2023 website (
https://www-nlpir.nist.gov/projects/tv2023/index.html).
The link for submission:
- Task 1 (VCVAL): https://codalab.lisn.upsaclay.fr/competitions/13445
- Task 2 (MIQG): https://codalab.lisn.upsaclay.fr/competitions/13546
*Important Dates*
- *Release of the training and validation datasets:* April 30, 2023
- *Release of the video corpus:* May 12, 2023
- *Release of the test sets:* July 14, 2023
- *Run submission deadline:* August 4, 2023
- *Release of the official results:* September 29, 2023
We look forward to your participation in MedVidQA at TRECVID 2023.
Join our Google Group <https://groups.google.com/g/trecvid-medvidqa2023> for
important updates! If you have any questions, ask in the Google Group
or email us at deepak.gupta(a)nih.gov.
Thank you,
MedVidQA 2023 Organizers
Dear all
Just wanted to let you know that APJCR Vol. 3, No. 1 is now available to
view online.
http://icr.or.kr/ejournals-apjcr
CK
---
*CK Jung BEng(Hons) Birmingham MSc Warwick EdD Warwick Cert Oxford*
Department of English Language and Literature, Incheon National University, *South Korea*
Vice President | The Korea Association of Primary English Education (KAPEE), *South Korea*
Vice President | The Korea Association of Secondary English Education (KASEE), *South Korea*
Director | Institute for Corpus Research, Incheon National University, *South Korea* (http://icr.or.kr)
Editor | Asia Pacific Journal of Corpus Research, ICR, *International* (http://icr.or.kr/apjcr)
Deputy Editor | Korean Journal of English Language and Linguistics, KASELL, *South Korea*
Editorial Board | Corpora, Edinburgh University Press, *UK*
Editorial Board | English Today, Cambridge University Press, *UK*
E: ckjung(a)inu.ac.kr / T: +82 (0)32 835 8129
H(EN): http://ckjung.org
H(KR): http://prof1.inu.ac.kr/user/ckjung
CASE-2023 Shared Task - Task 2: Collecting and Geocoding Armed Clash Events
in Russo-Ukrainian Conflict
================================================
The unprecedented quantity of easily accessible data on social, political,
and economic processes offers ground-breaking potential for guiding
data-driven analysis of socio-political phenomena: armed conflicts,
political movements, fights for economic and social rights, and various
related socio-political happenings are reported in news articles and social
media posts and recorded in curated databases. On the other hand, automatic
event detection from texts and event geocoding have long been a challenge
for the natural language processing (NLP) community. They require
sophisticated methods and resources, such as machine learning models,
linguistic rules and dictionaries, and geographic gazetteers.
Task definition
The task Collecting and Geocoding Armed Clash Events in Russo-Ukrainian
Conflict is being held as a sub-task of the 6th Workshop on Challenges and
Applications of Automated Extraction of Socio-political Events from Text
(CASE 2023). The task will use data from the Russo-Ukrainian Conflict to
test the capabilities of event detection systems to extract, geocode and
de-duplicate armed clashes in news and social media posts. Evaluation will
be based on the correlation between the spatio-temporal distribution and
number of the extracted events and those which are in the ground truth data
set.
We invite contributions from researchers in NLP, ML, Deep Learning, and
AI. The call is directed also towards socio-political scientists,
researchers in conflict analysis and forecasting, peace studies, and
computational social science.
All participating teams will be able to publish their system description
paper in the workshop proceedings published by ACL. For more information on
the workshop,
please visit the Workshop website https://emw.ku.edu.tr/case-2023/
and the conference website https://ranlp.org/ranlp2023/.
================================================
1. Data
Gold Standard and Text Input Data for the participant systems for the time
range 24.02.2022-24.08.2022 have been prepared and will be shared with the
applicants on the Task website.
1.1 Training Data
No training data are provided for this Task. The data used for CASE
2023 Task 1, which are described in Hürriyetoğlu et al. (2022, 2020b),
can be used for training systems for this task (Task 2). Additionally, these
data can be used to build systems/models that detect protest events in
tweets and news articles.
1.2 Input Data
The participant systems will be evaluated on raw data collections including
Telegram messages, the New York Times, and Ukrainian and Russian official news
channels.
Namely, the data collections comprise:
• an English-language social media and news corpus comprising
48,007 Telegram messages and New York Times news about Ukraine;
• a Ukrainian-language social media collection comprising
102,135 Telegram messages and Ukraine News Agency news;
• a Russian-language social media collection comprising
8,534 Telegram messages and Russian News Agency news.
Further details on the text collections and sampling methods are provided
in the news and Social Media folders of the GitHub repo for the Task
(https://github.com/zavavan/case2023_task2).
1.3 Gold Standard Data
The Russo-Ukrainian Conflict ground truth data primarily consists of data
coming from the Armed Conflict Location & Event Data Project (ACLED). We
will be adding alternative ground-truth datasets in order to prevent the
bias that may be introduced by using a single definition and interpretation
of an event. Full details on the manually curated data used as Gold
Standard for the correlation analysis will be disclosed at the end of the
evaluation period. Please check the documentation in the gold_standard
folder of the Task GitHub repo.
================================================
2. Evaluation
The systems participating in this shared task will be required to
detect news articles and Telegram posts which contain descriptions of
ongoing armed clashes. The time and place of each armed clash should be
detected at the date level (for the time) and as precise geographic
coordinates (latitude and longitude). The systems should ideally extract
event times based on multiple text reports.
In order to evaluate the ability of automatic event coders to reproduce the
gold standard armed clash event dataset, we adapt two correlation methods
originally used in micro-level analysis of political violence by Hammond
and Weidmann (2014): we aggregate event counts into uniform geographical
grid cells and 1-day time spans, and apply a number of standard
correlation coefficients and error measures.
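A minimal sketch of that aggregation, assuming events are given as (latitude, longitude, date) tuples and using plain Pearson correlation over the flattened cell-day counts (the actual grid resolution and the full set of coefficients and error measures are the organizers' choice):

```python
import numpy as np

def cell_day_counts(events, lat_range, lon_range, dates, cell_deg=0.5):
    """Aggregate (lat, lon, date) events into a (grid cell, day) count matrix."""
    n_lat = int((lat_range[1] - lat_range[0]) / cell_deg)
    n_lon = int((lon_range[1] - lon_range[0]) / cell_deg)
    day_index = {d: i for i, d in enumerate(dates)}
    counts = np.zeros((n_lat * n_lon, len(dates)))
    for lat, lon, date in events:
        # clamp to the last cell so boundary coordinates stay in range
        i = min(int((lat - lat_range[0]) / cell_deg), n_lat - 1)
        j = min(int((lon - lon_range[0]) / cell_deg), n_lon - 1)
        counts[i * n_lon + j, day_index[date]] += 1
    return counts

def grid_correlation(gold_events, system_events, **kw):
    """Pearson r between gold and system cell-day event counts."""
    g = cell_day_counts(gold_events, **kw).ravel()
    s = cell_day_counts(system_events, **kw).ravel()
    return np.corrcoef(g, s)[0, 1]
```

A perfect system reproduces the gold spatio-temporal distribution and scores r = 1; systems that place events in the wrong cells or on the wrong days score lower.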
For each of the input text corpora in 1.2, each participant may submit up to
3 different system responses. Each system response will consist of a csv
file with the following naming pattern:
“submission.<team-name>.<corpus>.<response-number>.csv”
where <corpus> is either “social_media” or “news”.
For instance: “submission.MyTeam.news.3.csv” for the 3rd submission of team
“MyTeam” on the news corpus.
Each system response file will have one line per event, where each line
will have the following format:
<id>,<City>,<Region>,<Country>,<Date>
where <id> is a numerical event identifier, <City>,<Region>,<Country> are
canonical English names of the City,State/Region and Country, respectively,
of the detected event location. While only the <Country> attribute is
mandatory, systems are expected to describe the event location at the
finest-grained level possible, as otherwise geographical coordinate
conversion may penalize the correlation score on geographical cell
aggregation. <Date> is the assigned date of the event in the format
YYYY-MM-DD.
A sample system response file line:
0,Kharkiv,Kharkiv Oblast,Ukraine,2022-05-02
A sample system output file can be downloaded from the Task repo at:
https://github.com/zavavan/case2023_task2/blob/main/submission.myteam.news.…
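A minimal sketch of producing such a response file (the helper name and event tuple layout are our own; only the file naming pattern and line format follow the specification above):

```python
import csv
import os

def write_submission(out_dir, team, corpus, run, events):
    """Write detected events in the required submission format.

    events: iterable of (city, region, country, date) tuples; city and
    region may be empty strings, since only the country is mandatory.
    """
    assert corpus in ("social_media", "news")
    # naming pattern: submission.<team-name>.<corpus>.<response-number>.csv
    fname = os.path.join(out_dir, f"submission.{team}.{corpus}.{run}.csv")
    with open(fname, "w", newline="") as f:
        writer = csv.writer(f)
        for event_id, (city, region, country, date) in enumerate(events):
            # one line per event: <id>,<City>,<Region>,<Country>,<Date>
            writer.writerow([event_id, city, region, country, date])
    return fname
```

For instance, write_submission(".", "MyTeam", "news", 3, [("Kharkiv", "Kharkiv Oblast", "Ukraine", "2022-05-02")]) yields submission.MyTeam.news.3.csv containing the sample line shown above.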
Important Dates (AoE time)
================================================
It is optional to use Task 1 systems. Participants may also use their own
systems, which are developed independently of Task 1.
Task 1 Training data available: May 1, 2023
Task 1 Test data available: May 15, 2023
Task 1 Evaluation period ends: June 30, 2023
Task 2 Sample Text archive is available: May 22, 2023
Task 2 Text archive for evaluation is available: July 1, 2023
Task 2 Evaluation period starts: July 1, 2023
Task 2 Evaluation period ends: July 24, 2023
System Description Paper submissions due: July 31, 2023
Notification to authors after review: August 7, 2023
Camera ready: August 25, 2023
Workshop period @ RANLP: Sep 7-8, 2023
Organization
================================================
- Hristo Tanev (Joint Research Centre (JRC), European Commission, Italy)
- Onur Uca (Sociology, Mersin University, Turkey)
- Vanni Zavarella (University of Cagliari, Italy)
- Ali Hürriyetoğlu (KNAW Humanities Cluster DHLab, the Netherlands)
Please contact the organizers at hristo.tanev(a)ec.europa.eu or
onuruca(a)mersin.edu.tr for your questions.
5. References
Jesse Hammond and Nils B Weidmann. Using machine-coded event data for the
micro-level study of political violence. Research & Politics,
1(2):2053168014539924, 2014.
Hürriyetoğlu, A., Mutlu, O., Duruşan, F., Uca, O., Gürel, A. S.,
Radford, B., Dai, Y., Hettiarachchi, H., Stoehr, N., Nomoto, T., Slavcheva,
M., Vargas, F., Javid, A., Beyhan, F., Yörük, E. (2022). Extended
Multilingual Protest News Detection - Shared Task 1, CASE 2021 and 2022. arXiv
preprint arXiv:2211.11360. URL: https://arxiv.org/abs/2211.11360
Hürriyetoğlu, A., Yörük, E., Yüret, D., Mutlu, O., Yoltar, Ç., Duruşan, F.,
& Gürel, B. (2020b). Cross-context news corpus for protest event-related
knowledge base construction. arXiv preprint arXiv:2008.00351. In Automated
Knowledge Base Construction (AKBC). URL:
https://www.akbc.ws/2020/papers/7NZkNhLCjp
Call for workshop papers and Shared Task participation: the 6th workshop on
Challenges and Applications of Automated Extraction of Socio-political
Events from Text - CASE @ RANLP 2023
************************************************************************************
URL: https://emw.ku.edu.tr/case-2023/
Paper submission deadline: 10 July 2023
Paper acceptance notification: 5 August 2023
Paper camera-ready: 25 August 2023
Workshop dates: 7-8 September 2023
Dates and deadlines for the shared task are below.
Softconf page of the workshop: https://softconf.com/ranlp23/CASE/
************************************************************************************
We invite contributions from researchers in computer science, NLP, ML, DL,
AI, socio-political sciences, conflict analysis and forecasting, peace
studies, as well as computational social science scholars involved in the
collection and utilization of socio-political event data. This includes
(but is not limited to) the following topics:
1) Extracting events and their arguments such as time and location in and
beyond a sentence or document, event coreference resolution.
2) Research in NLP technologies in relation to event detection: geocoding,
temporal reasoning, argument structure detection, syntactic and semantic
analysis of event structures, text classification for event type
detection, learning event-related lexica, event co-reference resolution,
fake news analysis, and others with a focus on real or potential event
detection applications.
3) New datasets, training data collection, and annotation for event
information.
4) Event-event relations, e.g., subevents, main events, spatio-temporal
relations, causal relations.
5) Event dataset evaluation in light of reliability and validity metrics.
6) Defining, populating, and facilitating event schemas and ontologies.
7) Automated tools and pipelines for event collection related tasks.
8) Lexical, syntactic, semantic, discursive, and pragmatic aspects of event
manifestation.
9) Methodologies for development, evaluation, and analysis of event
datasets.
10) Applications of event databases, e.g. early warning, conflict
prediction, policymaking.
11) Estimating what is missing in event datasets using internal and
external information.
12) Detection of new and emerging SPE types, e.g. creative protests.
13) Release of new event datasets.
14) Bias and fairness of the sources and event datasets.
15) Ethics, misinformation, privacy, and fairness concerns pertaining to
event datasets.
16) Copyright issues on event dataset creation, dissemination, and sharing.
17) Cross-lingual, multilingual and multimodal aspects in event analysis.
18) Resources and approaches related to contentious politics around climate
change.
**** Shared tasks ****
Please check the workshop page and Github repositories of the respective
task for additional details.
Task 1 - Multilingual protest news detection:
The performance of an automated system depends on the target event type,
which may be broad, and on the event trigger(s), which can be ambiguous.
The context in which a trigger occurs may need to be handled as well. For
instance, the ‘protest’ event type may or may not be synonymous with
‘demonstration’ in a specific context. Moreover, hypothetical cases such
as future protest plans may need to be excluded from the results.
Finally, the relevance of a protest depends on the actors: in a
contentious political event, only citizen-led events are in scope. This
challenge becomes even harder in a cross-lingual, zero-shot setting,
where training data are not available in new languages. We tackle the
task in four steps and hope state-of-the-art approaches will yield
optimal results.
Contact person: Ali Hürriyetoğlu (ali.hurriyetoglu(a)gmail.com)
Github: https://github.com/emerging-welfare/case-2022-multilingual-event
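To make the filtering considerations above concrete (ambiguous triggers, hypothetical future events, actor relevance), here is a deliberately naive rule-based sketch. This is not a task baseline; the trigger, tense-marker and actor lists are invented purely for illustration:

```python
# Illustrative, hypothetical rules for the relevance criteria described
# above. NOT the official task baseline; all word lists are invented.

PROTEST_TRIGGERS = {"protest", "demonstration", "rally", "march"}
FUTURE_MARKERS = {"will", "plans to", "is expected to"}
CITIZEN_ACTORS = {"residents", "workers", "students", "citizens"}

def is_relevant_protest(sentence: str) -> bool:
    s = sentence.lower()
    has_trigger = any(t in s for t in PROTEST_TRIGGERS)   # event trigger present
    is_future = any(m in s for m in FUTURE_MARKERS)       # hypothetical/planned
    citizen_led = any(a in s for a in CITIZEN_ACTORS)     # actor in scope
    return has_trigger and not is_future and citizen_led

print(is_relevant_protest("Workers staged a protest outside the factory."))   # True
print(is_relevant_protest("The union will hold a demonstration next week."))  # False
```

A real system would of course replace these rules with learned multilingual classifiers, but the sketch shows why trigger ambiguity, tense and actors each need explicit handling.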
Task 2 - Collecting and Geocoding Armed Clash Events in the
Russian-Ukrainian Conflict:
There is a mismatch between the event information collected by automated
and by manual approaches. We aim to identify similarities and differences
between the results of these paradigms for creating event datasets. The
participants of Task 1 will be invited to run the systems they develop
for Task 1 on a text archive. Participation in Task 1 is not a
precondition for participating in Task 2.
Contact person: Hristo Tanev (htanev(a)gmail.com) and Onur Uca (
onuruca(a)mersin.edu.tr)
Github: https://github.com/zavavan/case2023_task2
Task 3 - Event causality identification:
Causality is a core cognitive concept and appears in many natural language
processing (NLP) works that aim to tackle inference and understanding. We
are interested in studying event causality in news and therefore
introduce the Causal News Corpus. The Causal News Corpus consists of
3,767 event sentences, extracted from protest event news, annotated with
labels indicating whether or not they contain causal relations. Causal
sentences are additionally annotated with Cause, Effect and Signal spans.
Our subtasks build on the Causal News Corpus, and we hope that accurate,
automated solutions will be proposed for the detection and extraction of
causal events in news.
Contact person: Fiona Anting Tan (tan.f(a)u.nus.edu)
Github: https://github.com/tanfiona/CausalNewsCorpus
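As an illustration of the span annotations described above, Cause, Effect and Signal spans can be encoded as BIO-style sequence labels over tokens. This is a common encoding for span annotation in general; the corpus's actual release format may differ:

```python
# Sketch: encoding Cause/Effect/Signal spans as BIO tags over tokens.
# Illustrates the span-annotation idea only; the actual Causal News
# Corpus file format may differ.

def spans_to_bio(tokens, spans):
    """spans: list of (start, end_exclusive, label) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"           # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # continuation tokens
    return tags

tokens = ["The", "strike", "caused", "major", "delays"]
spans = [(0, 2, "Cause"), (2, 3, "Signal"), (3, 5, "Effect")]
print(spans_to_bio(tokens, spans))
# ['B-Cause', 'I-Cause', 'B-Signal', 'B-Effect', 'I-Effect']
```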
Task 4 - Multimodal Hate Speech Event Detection:
Hate speech detection is one of the most important aspects of event
identification during political events like invasions. In the case of hate
speech detection, the event is the occurrence of hate speech, the entity is
the target of the hate speech, and the relationship is the connection
between the two. Since multimodal content is widely prevalent across the
internet, the detection of hate speech in text-embedded images is very
important. Given a text-embedded image, this task aims to automatically
identify the hate speech and its targets. This task will have two subtasks.
Contact person: Surendrabikram Thapa (surendrabikram(a)vt.edu)
Github: https://github.com/therealthapa/case2023_task4
**** Deadlines for the Shared tasks ****
** Task 1, 3, 4:
Training & Validation data available: May 1, 2023
Test data available: Jun 15, 2023
Test start: Jun 15, 2023
Test end: Jun 30, 2023
System Description Paper submissions due: Jul 10, 2023
Notification to authors after review: Aug 5, 2023
Camera ready: Aug 25, 2023
** Task 2:
Sample Text archive is available: May 22, 2023
Text archive for evaluation is available: July 1, 2023
Evaluation period starts: July 1, 2023
Evaluation period ends: July 24, 2023
System Description Paper submissions due: July 31, 2023
Notification to authors after review: August 7, 2023
Camera ready: August 25, 2023
*** Keynotes ***
We will continue our tradition of inviting keynote speakers from both
social and computational sciences. The social science keynote will be
delivered by Erdem Yörük with the title “Using Automated Text Processing to
Understand Social Movements and Human Behaviour”, and the computational
science keynotes will be delivered by Ruslan Mitkov and Kiril Simov.
Please see the workshop webpage (https://emw.ku.edu.tr/case-2023/) for
additional details.
PhD in ML/NLP – Efficient, fair, robust and knowledge-informed
self-supervised learning for speech processing
Starting date: November 1st, 2022 (flexible)
Application deadline: September 5th, 2022
Interviews (tentative): September 19th, 2022
Salary: ~2000€ gross/month (social security included)
Mission: research oriented (teaching possible but not mandatory)
*Keywords:* speech processing, natural language processing,
self-supervised learning, knowledge-informed learning, robustness, fairness
*CONTEXT*
The ANR project E-SSL (Efficient Self-Supervised Learning for Inclusive
and Innovative Speech Technologies) will start on November 1st, 2022.
Self-supervised learning (SSL) has recently emerged as one of the most
promising artificial intelligence (AI) methods, as it is now feasible to
take advantage of the colossal amounts of existing unlabeled data to
significantly improve the performance of various speech processing tasks.
*PROJECT OBJECTIVES*
Recent SSL models for speech such as HuBERT or wav2vec 2.0 have shown an
impressive impact on downstream task performance. This is mainly due to
their ability to benefit from large amounts of data, at the cost of a
tremendous carbon footprint, rather than to improvements in the
efficiency of the learning itself. Another open question is the
unpredictable behavior of SSL models once applied to realistic scenarios,
which exposes their lack of robustness. Furthermore, as for any
pre-trained model deployed in society, it is important to be able to
measure the bias of such models, since they can amplify social unfairness.
The goals of this PhD position are threefold:
- to design new evaluation metrics for SSL of speech models ;
- to develop knowledge-driven SSL algorithms ;
- to propose methods for learning robust and unbiased representations.
SSL models are evaluated with downstream task-dependent metrics, e.g.,
word error rate for speech recognition. This couples the evaluation of
the universality of SSL representations to a potentially biased and
costly fine-tuning step that also hides the efficiency information
related to the pre-training cost. In practice, we will seek to measure
training efficiency as the ratio between the amount of data, computation
and memory needed and the observed gain on a metric of interest, i.e.,
downstream-dependent or not. The first step will be to document standard
markers that can be used to assess these values robustly at training
time. Potential candidates are, for instance, floating-point operations
for computational intensity, number of neural parameters coupled with
precision for storage, online measurement of memory consumption for
training, and cumulative input sequence length for data.
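A minimal sketch of such an efficiency measure, here inverted as gain per unit of resource so that larger values are better, could look as follows. The resource markers and units are illustrative assumptions, not the project's final metric:

```python
# Sketch of a training-efficiency ratio in the spirit described above:
# performance gain per unit of resource spent. All names and units are
# illustrative placeholders for the markers the project will document.

def training_efficiency(perf_gain, hours_of_data, flops, peak_memory_gb):
    """Gain on a metric of interest divided by an aggregate resource cost.

    Larger is better: more gain for fewer resources. The cost term here
    simply multiplies the data, computation and memory markers.
    """
    resource_cost = hours_of_data * flops * peak_memory_gb
    return perf_gain / resource_cost

# Example: a 2-point gain using 1000 h of audio, 1e18 FLOPs, 40 GB peak memory.
eff = training_efficiency(2.0, 1000, 1e18, 40)
print(f"{eff:.3e}")  # gain per (hour * FLOP * GB)
```

How the individual markers are aggregated (product, weighted sum, Pareto front) is exactly the kind of design question the first step of the PhD would settle.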
Most state-of-the-art SSL models for speech rely on masked prediction,
e.g. HuBERT and WavLM, or contrastive losses, e.g. wav2vec 2.0. Their
prevalence in the literature is mostly linked to the size, amount of data
and computational resources injected by the companies producing these
models. In fact, vanilla masking approaches and contrastive losses may be
identified as uninformed solutions, as they do not benefit from in-domain
expertise. For instance, it has been demonstrated that blindly masking
frames in the input signal, as in HuBERT and WavLM, results in much worse
downstream performance than applying unsupervised phonetic boundaries
[Yue2021] to generate informed masks. Recently, some studies have
demonstrated the superiority of an informed multitask learning strategy,
carefully selecting self-supervised pretext tasks with respect to a set
of downstream tasks, over the vanilla wav2vec 2.0 contrastive learning
loss [Zaiem2022]. In this PhD project, our objectives are: 1. to continue
developing knowledge-driven SSL algorithms reaching higher efficiency
ratios and better results at the convergence, data consumption and
downstream performance levels; and 2. to scale these novel approaches to
a point enabling comparison with current state-of-the-art systems,
thereby motivating a paradigm change in SSL for the wider speech
community.
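The contrast between blind and boundary-informed masking can be sketched as follows. Boundary detection itself is assumed given (e.g. from an unsupervised segmenter); all frame indices are illustrative:

```python
import random

# Sketch contrasting blind frame masking (HuBERT/WavLM style) with
# masking informed by phonetic boundaries, as discussed above. The
# boundaries are assumed to come from an unsupervised segmenter.

def random_mask(num_frames, num_masks, span):
    """Blind masking: fixed-length spans at random start frames."""
    starts = random.sample(range(num_frames - span), num_masks)
    return sorted((s, s + span) for s in starts)

def boundary_informed_mask(boundaries, num_masks):
    """Informed masking: mask whole segments between phonetic boundaries."""
    segments = list(zip(boundaries[:-1], boundaries[1:]))
    return sorted(random.sample(segments, num_masks))

boundaries = [0, 12, 30, 47, 60]              # assumed boundaries (frame indices)
print(boundary_informed_mask(boundaries, 2))  # e.g. [(0, 12), (30, 47)]
print(random_mask(60, 2, 10))                 # e.g. [(3, 13), (41, 51)]
```

The informed variant never splits a phone-like unit in half, which is one intuition behind the gains reported in [Yue2021].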
Despite remarkable performance on academic benchmarks, SSL-powered
technologies, e.g. speech and speaker recognition, speech synthesis and
many others, may exhibit highly unpredictable results once applied to
realistic scenarios. This can translate into a global accuracy drop due
to a lack of robustness to adversarial acoustic conditions, or into
biased and discriminatory behaviors with respect to different pools of
end users. Documenting and facilitating the control of such aspects prior
to the deployment of SSL models in real life is necessary for the
industrial market. To evaluate these aspects, within the project we will
create novel robustness regularization and debiasing techniques along two
axes: 1. debiasing and regularizing speech representations at the SSL
level; 2. debiasing and regularizing downstream-adapted models (e.g.
using a pre-trained model).
To ensure the creation of fair and robust SSL pre-trained models, we
propose to act at both the optimization and data levels, following some
of our previous work on adversarial protected-attribute disentanglement
and the NLP literature on data sampling and augmentation [Noé2021]. Here,
we wish to extend this technique to more complex SSL architectures and
more realistic conditions by increasing the disentanglement complexity,
i.e. the sex attribute studied in [Noé2021] is particularly
discriminatory. Then, to benefit from the expert knowledge induced by the
scope of the task of interest, we will build on a recently introduced
task-dependent counterfactual equal-odds criterion [Sari2021] to minimize
the downstream performance gap observed between individuals with certain
protected attributes and to maximize overall accuracy. Following this
multi-objective optimization scheme, we will then inject further
identified constraints, as inspired by previous NLP work [Zhao2017].
Intuitively, constraints are injected so that predictions are calibrated
towards a desired, i.e. unbiased, distribution.
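As a toy illustration of this multi-objective view, one can track the per-group performance gap (to be minimized) alongside the overall accuracy (to be maximized). Group names and counts below are invented:

```python
# Sketch of the fairness objective described above: measure the downstream
# performance gap between groups defined by a protected attribute while
# tracking overall accuracy. All numbers and group names are illustrative.

def group_gap_and_accuracy(results):
    """results: {group: (num_correct, num_total)} per protected-attribute group.

    Returns (max accuracy gap between groups, overall accuracy).
    """
    accs = {g: c / n for g, (c, n) in results.items()}
    gap = max(accs.values()) - min(accs.values())
    overall = (sum(c for c, _ in results.values())
               / sum(n for _, n in results.values()))
    return gap, overall

gap, overall = group_gap_and_accuracy({"group_a": (90, 100),
                                       "group_b": (80, 100)})
print(gap, overall)  # gap ≈ 0.1 (minimize), overall = 0.85 (maximize)
```

An equal-odds-style criterion refines this by conditioning the gap on the true label; the sketch only shows the simpler accuracy-gap case.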
*SKILLS*
* Master 2 in Natural Language Processing, Speech Processing, computer
science or data science.
* Good command of Python programming and of a deep learning framework.
* Previous experience in self-supervised learning, acoustic modeling or
ASR would be a plus.
* Very good communication skills in English.
* Good command of French would be a plus but is not mandatory.
*SCIENTIFIC ENVIRONMENT*
The thesis will be conducted within the GETALP team of the LIG
laboratory (https://lig-getalp.imag.fr/) and the LIA laboratory
(https://lia.univ-avignon.fr/). The GETALP team and the LIA have strong
expertise and a solid track record in Natural Language Processing and
speech processing. The recruited person will be welcomed within teams
that offer a stimulating, multinational and pleasant working environment.
The means to carry out the PhD will be provided both in terms of missions
in France and abroad and in terms of equipment. The candidate will have
access to the GPU clusters of both the LIG and the LIA. Furthermore,
access to the national supercomputer Jean Zay will enable large-scale
experiments.
The PhD position will be co-supervised by Mickael Rouvier (LIA, Avignon)
and Benjamin Lecouteux and François Portet (Université Grenoble Alpes).
Joint meetings are planned on a regular basis and the student is
expected to spend time in both places. Moreover, the PhD student will
collaborate with several team members involved in the project in
particular the two other PhD candidates who will be recruited and the
partners from LIA, LIG and Dauphine Université PSL, Paris. Furthermore,
the project will involve one of the founders of SpeechBrain, Titouan
Parcollet with whom the candidate will interact closely.
*INSTRUCTIONS FOR APPLYING*
Applications must contain: a CV, a letter/message of motivation, and
master's grade transcripts; candidates should also be ready to provide
letter(s) of recommendation. Applications should be addressed to Mickael
Rouvier (mickael.rouvier(a)univ-avignon.fr), Benjamin Lecouteux
(benjamin.lecouteux(a)univ-grenoble-alpes.fr) and François Portet
(francois.portet(a)imag.fr). We celebrate diversity and are committed to
creating an inclusive environment for all employees.
*REFERENCES:*
[Noé2021] Noé, P.- G., Mohammadamini, M., Matrouf, D., Parcollet, T.,
Nautsch, A. & Bonastre, J.- F. Adversarial Disentanglement of Speaker
Representation for Attribute-Driven Privacy Preservation in Proc.
Interspeech 2021 (2021), 1902–1906.
[Sari2021] Sarı, L., Hasegawa-Johnson, M. & Yoo, C. D. Counterfactually
Fair Automatic Speech Recognition. IEEE/ACM Transactions on Audio,
Speech, and Language Processing 29, 3515–3525 (2021)
[Yue2021] Yue, X. & Li, H. Phonetically Motivated Self-Supervised Speech
Representation Learning in Proc. Interspeech 2021 (2021), 746–750.
[Zaiem2022] Zaiem, S., Parcollet, T. & Essid, S. Pretext Tasks Selection
for Multitask Self-Supervised Speech Representation in AAAI, The 2nd
Workshop on Self-supervised Learning for Audio and Speech Processing,
2023 (2022).
[Zhao2017] Zhao, J., Wang, T., Yatskar, M., Ordonez, V. & Chang, K. - W.
Men Also Like Shopping: Reducing Gender Bias Amplification using
Corpus-level Constraints in Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing (2017), 2979–2989.
--
François PORTET
Professeur - Univ Grenoble Alpes
Laboratoire d'Informatique de Grenoble - Équipe GETALP
Bâtiment IMAG - Office 333
700 avenue Centrale
Domaine Universitaire - 38401 St Martin d'Hères
FRANCE
Phone: +33 (0)4 57 42 15 44
Email: francois.portet@imag.fr
www: http://membres-liglab.imag.fr/portet/