(Sorry for cross-posting)
Dear ML members
We are delighted to announce the release of the ICNALE Global Rating
Archives V2.0, which is the first public release version.
The ICNALE GRA includes analytic/holistic ratings of L2 English
learner speeches/essays by 160 raters with varied L1 and occupational
backgrounds.
It also includes fully edited versions of learner essays.
Please download the data from the link below:
https://language.sakura.ne.jp/icnale/download.html
Additional info is available from the link below:
https://language.sakura.ne.jp/icnale/modules.html#5
Thank you.
Shin
_______________________________
The ICNALE Development Team
Dr. Shin ISHIKAWA (he/his)
Professor of Applied Linguistics at Kobe University, Japan
iskwshin(a)gmail.com
Dear colleagues,
We are pleased to announce the release of PxCorpus, 4 hours of
transcribed and annotated dialogues of drug prescriptions in French,
acquired through an experiment with 55 participants, both experts and
non-experts in drug prescription. This corpus was built in
collaboration between the Laboratoire d'Informatique de Grenoble (LIG),
the University Hospital of Grenoble (CHU Grenoble) and the company
Calystene through a CIFRE project financed by the ANRT (Association
Nationale de la Recherche et de la Technologie).
PxCorpus is, to the best of our knowledge, the first spoken medical
drug prescription corpus to be distributed. The automatic
transcriptions were verified by humans and aligned with semantic
labels to allow training of NLP models. The data acquisition protocol
was reviewed by medical experts and permits free distribution without
breach of privacy or regulation.
## Overview of the Corpus
The experiment was performed in uncontrolled, real-world conditions with
naive participants and medical experts.
In total, the dataset includes 2067 recordings of 55 participants (38%
non-experts, 25% doctors, 36% medical practitioners), manually
transcribed and semantically annotated.
| Category        | Sessions | Recordings | Time (min) |
|-----------------|----------|------------|------------|
| Medical experts | 258      | 434        | 94.83      |
| Doctors         | 230      | 570        | 105.21     |
| Non-experts     | 415      | 977        | 62.13      |
| Total           | 903      | 1981       | 262.27     |
## License
We hope that the community will be able to benefit from the dataset,
which is distributed under a Creative Commons Attribution 4.0
International (CC BY 4.0) licence.
## How to cite this corpus
If you use the corpus or need more details, please refer to the following
paper: A spoken drug prescription dataset in French for spoken Language
Understanding
@InProceedings{Kocabiyikoglu2022,
author = "Alican Kocabiyikoglu and Fran{\c c}ois Portet and
Prudence Gibert and Hervé Blanchon and Jean-Marc Babouchkine and Gaëtan
Gavazzi",
title = "A spoken drug prescription dataset in French for spoken
Language Understanding",
booktitle = "13th Language Resources and Evaluation Conference
(LREC 2022)",
year = "2022",
location = "Marseille, France"
}
A more complete description of the corpus acquisition is available on arXiv:
@misc{kocabiyikoglu2023spoken,
title={Spoken Dialogue System for Medical Prescription Acquisition
on Smartphone: Development, Corpus and Evaluation},
author={Ali Can Kocabiyikoglu and François Portet and Jean-Marc
Babouchkine and Prudence Gibert and Hervé Blanchon and Gaëtan Gavazzi},
year={2023},
eprint={2311.03510},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
## Download
The corpus can be found in the Zenodo catalogue under the following
link and reference:
*PxCorpus : A Spoken Drug Prescription Dataset in French for Spoken
Language Understanding and Dialogue*
https://zenodo.org/doi/10.5281/zenodo.6482586
--
François PORTET
Professeur - Univ Grenoble Alpes
Laboratoire d'Informatique de Grenoble - Équipe GETALP
Bâtiment IMAG - Office 333
700 avenue Centrale
Domaine Universitaire - 38401 St Martin d'Hères
FRANCE
Phone: +33 (0)4 57 42 15 44
Email:francois.portet@imag.fr
www:http://membres-liglab.imag.fr/portet/
Dear colleagues,
We are inviting contributions to the Special Issue of AI Communications on "Human-Aware AI".
Contributions are sought that report on mature and highly interdisciplinary research with a focus on human involvement in the development of meaningful paradigms of AI-enabled human-human interactions, human-AI interactions, and human-centered AI-AI interactions. An indicative list of disciplines and sub-disciplines that we expect to be relevant is: Autonomous Agents and Multi-Agent Systems, Ethics, Human-Computer Interaction, Knowledge Representation and Reasoning, Machine Learning, Ontologies, Privacy, Social Computing, Social Psychology, Social Sciences.
Any contribution relating to the general theme is welcome. The following is a non-exhaustive list of suggested topics:
• Models of human diversity and human awareness
• Models of AI diversity and human-aware AI
• Perception of diversity versus models of diversity
• Models of diverse human-AI societies and interactions
• Experimental studies on human and social diversity
• Experimental studies on hybrid human-AI diversity
• Representation and visualization of diversity
• Incentive models for Human-AI collaboration
• Human-aware machine learning technologies
• Interpretability and explainability of human-aware machine learning
• Diversification and unbiasing of machine learning
• Metrics for diversity-aware machine learning
• Diversity-aware and diversity-preserving inference and reasoning
• Ethical and privacy considerations on diversity
• Ethical and legal considerations on diversity-misuse scenarios
• Data economics, business models, and/or non-profit use
• Insights from Critical Diversity Studies
• Diversity-sensitive communication
• Content moderation for diversity-aware social interaction
More information about the Special Issue can be found here:
https://www.iospress.com/sites/default/files/media/files/2023-09/AIC_Human-…
The submission deadline is November 30, 2023. However, we would appreciate it if you could register your interest to submit a paper by completing the following form at your earliest convenience: https://forms.gle/vsVjJrCwXE8Nh9YU6
We look forward to your contributions!
Regards,
Loizos
Dear colleagues,
I am happy to announce that the first Artificial Intelligence for
Brain Encoding and Decoding (AIBED) workshop will be held in conjunction
with AAAI on February 26, 2024 in New Orleans, USA. We welcome paper
submissions and participation in this workshop. Details follow.
This workshop aims to explore the intersection of AI and neuroscience,
focusing on how AI, particularly deep artificial neural networks, can
facilitate the encoding and decoding of brain activities. We will first
delve into the principles of brain encoding and decoding, examining how the
brain processes and encodes information into neural signals, and how these
signals can be decoded to understand cognition. Next, we will discuss the
challenges in encoding and decoding high-dimensional neural imaging data,
including but not limited to the complexity of brain signal
representations, scarcity of data annotations, and the need for model
generalizability. Finally, we will consider the implications of these
AI-driven advances in brain encoding and decoding for neuroscience,
including understanding cognitive functions, diagnosing neurological
disorders, and developing brain-computer interfaces.
Topics
1. Understanding Brain Encoding and Decoding:
- Analyzing the processes of brain information processing and neural
signal encoding
- Utilizing AI to model complex neural processes and facilitate
cognition understanding
- Decoding from brain activities to reconstruct perceived or imagined
linguistic, visual and audio information with AI
2. Addressing Challenges in Processing Neural Imaging Data:
- Proposing AI solutions to process neural images, such as denoising,
registration, and slicing
- Leveraging AI’s proficiency in managing high-dimensional data to
develop new ways of representing brain signals
3. Implications in Neuroscience:
- Considering the impact of AI developments on cognitive neuroscience
- Aiding in diagnosing neurological disorders with AI
Format and Attendance
This will be a 1-day workshop with keynotes, poster presentations, and
panel discussions.
We will invite keynote speakers and all the authors who get papers
accepted. Other AAAI attendees who are interested can also attend following
AAAI’s related policy.
Submission Requirements
We accept one-page abstracts with posters, as well as short papers of no
more than 4 pages and long papers of no more than 7 pages.
*Submission Site Information*:
https://openreview.net/group?id=AAAI.org/2024/Workshop/AIBED
Official website of the workshop: https://sites.google.com/view/aibed2024/
For more questions about this workshop please contact aibed2024(a)outlook.com
<aibed(a)outlook.com> .
Workshop Chairs:
Prof. Dr. Marie-Francine Moens, sien.moens(a)kuleuven.be
Prof. Dr. Shaonan Wang, shaonan.wang(a)nlpr.ia.ac.cn
Dr. Jingyuan Sun, jingyuan.sun(a)kuleuven.be
Workshop Committee:
Mingxiao Li, mingxiao.li(a)kuleuven.be
Zijiao Chen, zijiao.chen(a)u.nus.edu
Jiaxin Qing, jqing(a)ie.cuhk.edu.hk
Xinpei Zhao, zhaoxinpei17(a)mails.ucas.ac.cn
Tiedong Liu, tiedong.liu(a)u.nus.edu
Prof. Dr. Wei Huang, lembert1990(a)163.com
Kind regards
Dr. Jingyuan Sun
tl;dr:
- submission deadline for research track papers via Softconf: December 18th 2023
- submission deadline for research track submissions already reviewed via ARR: January 17th 2024
- submission deadline for shared task systems: January 20th 2024
- submission deadline for shared task system descriptions via SoftConf: January 26th 2024
https://sites.google.com/view/dialogue-evaluation/
Call for Papers
The aim of this workshop is to bring together experts working in the area
of open-domain dialogue. In this rapidly advancing research area many
challenges still exist, such as learning information from conversations,
engaging in realistic and convincing simulation of human intelligence,
reasoning, and so on.
SCI-CHAT follows previous workshops on open-domain dialogue, but with a
focus on the simulation of intelligent conversation, including the ability
to follow a challenging topic over a multi-turn conversation and the ability
to pose questions, refute and reason, with live human evaluation
employed as the primary mechanism for evaluating models. The workshop will
include a research track and a shared task:
SCI-CHAT's research track aims to explore recent advances and challenges in
open-domain dialogue research. Researchers working on all aspects of
open-domain dialogue are invited to submit papers on recent advances,
resources, tools, analysis, evaluation, and challenges on the broad theme
of open-domain dialogues.
The topics of the workshop include but are not limited to the following:
- Intelligent conversation, chit-chat, open-domain dialogue;
- Automatic and human evaluation of open-domain dialogue;
- Limitations, risks and safety in open-domain dialogue;
- Instruction-tuned and instruction-enabled models;
- Any other topic of interest to the dialogue community.
SCI-CHAT's shared task will focus on simulating intelligent conversations;
participants will be asked to submit (access to the APIs of) automated
dialogue agents with the aim of carrying out nuanced conversations over
multiple dialogue turns. Participating systems will be interactively
evaluated in a live human evaluation. All data acquired within the context
of the shared task will be made public, providing an important resource for
improving metrics and systems in this research area.
Submission guidelines:
Authors are invited to submit their unpublished work that represents novel
research through either direct submission or ARR commitment. Papers should
consist of up to 8 pages of content, plus unlimited pages for references
and appendix. Authors should make use of the EACL LaTeX Template
<https://2023.eacl.org/calls/styles/> alongside supplementary materials,
including technical appendices, links to source code, datasets, and
multimedia appendices.
Papers can also be submitted as non-archival, so that their content can be
reused for other venues by adding "(NON-ARCHIVAL)" to the title of their
submission. Previously published work can also be submitted as non-archival
in the same way, with the additional requirement to state such on the first
page.
- Direct paper submissions must be submitted through the SoftConf submission
link: https://softconf.com/eacl2024/SCI-CHAT-2024/
Submitting the same paper to multiple EACL workshops is forbidden.
All papers will be double-blind peer-reviewed, by at least 2 program
committee members. As such, all submissions, including the main paper and
its supplementary materials, should be fully anonymized. For more
information on formatting and anonymity guidelines, please refer to EACL
guidelines <https://eacl.org/index.html>.
Organizers
- Yvette Graham (Trinity College Dublin, Ireland)
- Qun Liu (Huawei Noah's Ark Lab, China)
- Gerasimos Lampouras (Huawei Noah's Ark Lab, UK)
- Ignacio Iacobacci (Huawei Noah's Ark Lab, UK)
- Sinead Madden (Trinity College Dublin, Ireland)
- Haider Khalid (Trinity College Dublin, Ireland)
- Rameez Qureshi (Trinity College Dublin, Ireland)
Important Dates
Regarding Research Track:
- Research paper via Softconf: December 18th 2023
- Pre-reviewed ARR commitment deadline: January 17th 2024
- Notification of research paper acceptance: January 20th 2024
- Camera-ready papers due: January 30th 2024
Regarding Shared Task:
- Release of training and development data: November 9th 2023
- Release of baseline systems: November 9th 2023
- Preliminary system submission deadline: January 13th 2024 (optional - if
you want help testing your API, please submit early)
- System submission (API) deadline: January 20th 2024
- System description paper via SoftConf: January 26th 2024
- Camera-ready papers due: January 30th 2024
Overview of results at one-day workshop: March 21 or 22, 2024
CONTACT: sci-chat(a)adaptcentre.ie
The Natural Language Processing Section at the Department of Computer Science, Faculty of Science at University of Copenhagen is offering a PhD position in Explainable Natural Language Understanding with a start date of 1 September 2024. The application deadline is 1 February 2024.
Applications for the position can be submitted via UCPH's job portal<https://candidate.hr-manager.net/ApplicationInit.aspx/?cid=1307&departmentI…>.
The Natural Language Processing Section<https://di.ku.dk/english/research/nlp/> provides a strong, international and diverse environment for research within core as well as emerging topics in natural language processing, natural language understanding, computational linguistics and multi-modal language processing. It is housed within the main Science Campus, which is centrally located in Copenhagen. The successful candidate will join Isabelle Augenstein’s Natural Language Understanding research group<http://www.copenlu.com/>. The Natural Language Processing research environment at the University of Copenhagen is internationally leading, as e.g. evidenced by it being ranked 2nd in Europe according to CSRankings.
The position is offered in the context of an ERC Starting Grant held by Isabelle Augenstein on ‘Explainable and Robust Automatic Fact Checking (ExplainYourself)’. ERC Starting Grant is a highly competitive funding program by the European Research Council to support the most talented early-career scientists in Europe with funding for a period of 5 years for blue-skies research to build up or expand their research groups.
The project team will consist of the principal investigator, three PhD students and two postdocs, collaborators from CopeNLU as well as external collaborators. The role of the PhD student to be recruited in this call will be to research methods for generating faithful free-text explanations of NLU models in collaboration with the larger project team.
More information about the project can also be found here<http://www.copenlu.com/talk/2022_11_erc/>.
Informal enquiries about the position can be made to Professor Isabelle Augenstein, Department of Computer Science, University of Copenhagen, e-mail: augenstein(a)di.ku.dk<mailto:augenstein@di.ku.dk?subject=PhD%20position%20on%20Explainable%20Natural%20Language%20Understanding>.
Isabelle Augenstein, Dr. Scient., Ph.D.
Professor and Head of the NLP Section, Department of Computer Science (DIKU)
Co-Lead, Pioneer Centre for Artificial Intelligence
University of Copenhagen
Østervold Observatory
Øster Voldgade 3
1350 Copenhagen
augenstein(a)di.ku.dk<mailto:augenstein@di.ku.dk>
http://isabelleaugenstein.github.io/
The School of Electronic Engineering and Computer Science at Queen Mary
University of London is inviting applications for several PhD Studentships
in specific areas in Electronic Engineering and Computer Science --
including several in natural language processing -- co-funded by the China
Scholarship Council (CSC). CSC is offering a monthly stipend to cover
living expenses and QMUL is waiving fees and hosting the student. These
scholarships are available only for Chinese candidates. Projects available
include:
- Emerging methods for analysing Pretrained Foundation Models
- The shape of words
- Interpretable language models for healthcare
- Personalisation and temporal reasoning in LLMs with applications in
healthcare
- Coreference resolution in the era of large language models
- Interactive adaptation for large language models
- Towards personalising and debiasing online content moderation
For details on the available projects and supervising faculty please visit:
http://eecs.qmul.ac.uk/phd/phd-studentships/csc-phd-studentships/csc-phd-st…
--
Matthew Purver - http://www.eecs.qmul.ac.uk/~mpurver/
Computational Linguistics Lab - http://compling.eecs.qmul.ac.uk/
Cognitive Science Research Group - http://cogsci.eecs.qmul.ac.uk/
School of Electronic Engineering and Computer Science
Queen Mary University of London, London E1 4NS, UK
*My working days for QMUL are **Tuesday-Thursday**; responses to mail on
other days may be delayed.*
*** First Call for Tutorial Proposals ***
36th International Conference on Advanced Information Systems Engineering
(CAiSE'24)
June 3-7, 2024, 5* St. Raphael Resort and Marina, Limassol, Cyprus
https://cyprusconferences.org/caise2024/
(*** Submission Deadline: February 28, 2024 AoE ***)
CAiSE'24 invites proposals for tutorials on advanced topics in the field of Information Systems
Engineering. Tutorials should aim at offering new insights, knowledge, and skills to
professionals, educators, researchers, and students seeking to gain a better understanding
either about methods of broad interest in the field, or emergent paradigms that are ripe for
practical adoption or that require further research to reach maturity.
Proposals emphasizing the special theme of the CAiSE'24 conference “Information Systems in
the Age of Artificial Intelligence” are encouraged, but proposals on other new or long-standing
topics in information systems engineering are also welcome.
Tutorials should be focused on principles, concepts, and methods. Commercial or
sales-oriented presentations are not allowed and will not be accepted.
Tutorials are intended to provide a pedagogic introduction to or overview of a topic of
relevance. Potential presenters should keep in mind that there may be a heterogeneous
audience, including novice graduate students, experienced practitioners, and specialized
researchers. Tutorial speakers should be prepared to cope with this diversity in the audience.
Tutorials will be 90 minutes long and organized in parallel with the technical sessions of the
main conference. Conference participants will have free access to all of them.
Potential proposers are free to contact the tutorial chairs via e-mail to validate their idea prior
to the submission.
SELECTION CRITERIA
The tutorial chairs will review each proposal and select a subset of them based on the
following criteria:
1. relevance to the field of IS engineering;
2. anticipated appeal to the conference audience;
3. timeliness and importance for the conference audience;
4. past experience and qualifications of the instructor(s).
The tutorial chairs will also consider the complementarity of the proposal w.r.t. the conference
program and other tutorial proposals.
SUBMISSION GUIDELINES
Tutorial proposals should be submitted to EasyChair using the conference submission site
(https://easychair.org/conferences/?conf=caise2024) and then selecting the “CAiSE 2024
Tutorials” track.
The proposal (length up to 1500 words) should cover the following points:
• Title
• Presenters and affiliation
• Goal: The overall goal of the tutorial.
• Scope: Intended audience, level (basic or advanced), and prerequisites.
• Topic relevance and novelty: Specifically indicate the relevance to the scope of CAiSE,
the relevance to practice, and the novel aspects that would make this tutorial beneficial and
appealing to CAiSE participants.
• Structure of contents: Here you should provide a structured overview of your planned
tutorial, organized into numbered sections and subsections. For each subsection, you
should sketch its contents in a few sentences or bullet points.
• References: Provide references to papers, books, etc. that your tutorial builds on. Please
specify previous venues at which similar tutorials have been presented by you and
indicate the difference between the proposed tutorial and previous ones. CAiSE usually
does not accept tutorials that have been presented in other venues.
• Sample Slides: Include at least 5 sample slides of the presentation you plan to give if
your tutorial is accepted. Select slides that are typical of your presentation style. These
slides have to be submitted in a separate PDF file.
Services provided to tutorialists
• A 2-page tutorial abstract will be published in the CAiSE LNCS proceedings
• Tutorials will benefit from the local organizational infrastructure (registration, badges,
refreshments, beamers, screens, etc.).
• Advertisement of the tutorial on the CAiSE 2024 homepage and in mailings.
• The conference fee will be waived for tutorial presenters (one fee per tutorial).
IMPORTANT DATES
• Submission of Tutorial Proposals: 28th February 2024 (AoE)
• Notification of Acceptance: 15th March 2024
• Camera-ready Abstracts: 5th April 2024
TUTORIAL CHAIRS
• Adela del Rio Ortega, University of Seville, Spain (adeladelrio(a)us.es)
• Tiago Prince Sales, University of Twente, The Netherlands (t.princesales(a)utwente.nl)
Other Committee Members
https://cyprusconferences.org/caise2024/committees/
The University of Texas at Austin's School of Information (iSchool) is
seeking talented and committed students to join our Ph.D. Program in Fall
2024! We seek passionate and driven applicants who are excited to undertake
cutting-edge research to benefit the world and its people.
December 1st is our deadline to apply for fall 2024 admissions
<https://www.ischool.utexas.edu/programs/phd-admissions>. Late applications
may be submitted but are not guaranteed equal consideration. For additional
information, please refer to our admissions website
<https://www.ischool.utexas.edu/programs/phd-admissions> and recorded
information session <https://youtu.be/_9OJ-uGo-Cg>.
We will host a live information session online on Wednesday, November 15th
(12pm-1pm US CDT). Event link: https://utexas.zoom.us/j/94848553117.
Our iSchool is a top-ranked PhD program committed to making a difference in
the world through high-impact research
<https://www.ischool.utexas.edu/research>. Our work seeks to enhance human
lives and communities by understanding the impact and potential of
information, in all its forms. We work to harness the massive scale and
value of information, discover the principles and processes to manage it,
and design solutions that are novel, creative, accessible, useful, usable,
and sustainable. To increase understanding of the role and impact of
information in all human endeavors, we study problems and develop solutions
and critiques for better information design, management, organization,
access, preservation, and retrieval. The study of information can encompass,
and often extends beyond, any existing field. Our world-class,
interdisciplinary faculty
<https://www.ischool.utexas.edu/people/ischool-faculty-staff-students>
spans a wide range of expertise: anthropology, communications, computer
science, industrial engineering, library and information science,
psychology, science & technology studies, and more.
Our school has a strong tradition of fully financially supporting our
doctoral students through a combination of Fellowships, Research
Assistantships, and Teaching Assistantships. Our support packages are
competitive, including tuition, health insurance, and a stipend sufficient
to live comfortably in Austin.
Beyond our top-ranked international graduate program, UT Austin is one of
the world's premier research universities and is located in one of the
USA's sunniest and most vibrant cities in which to live and work:
- http://www.utexas.edu/about/overview
- http://www.utexas.edu/campus-life/life-in-austin
- https://www.austintexas.org/things-to-do/
Our university’s motto is "What Starts Here Changes the World"
<https://www.utexas.edu/>, and we invite our community to "Make it Y/OUR
Texas." Everyone who joins the iSchool community helps contribute to and
amplifies our shared sense of belonging and purpose. We hope that you will
embrace the opportunity and challenge that our PhD program offers and join
us in conducting high-impact research that truly changes the world.
We look forward to your application! For any questions about admissions to
our doctoral program, please contact us at graduateadmit(a)ischool.utexas.edu.
--
Matt Lease
Professor
School of Information
University of Texas at Austin
Voice: (512) 471-9350 · Fax: (512) 471-3971 · Office: UTA 5.536
http://www.ischool.utexas.edu/~ml
AAC/CFP Corpus 26 - 2025 - https://journals.openedition.org/corpus/
Background noise or added value? Managing noise during computer processing of linguistic corpora
Elisa Gugliotta, Luca Pallanti, Olivier Kraif, Iris Fabry and Martina Barletta (eds.)
-------FRENCH VERSION BELOW-----
The increasing influence of NLP-related methodologies on corpus linguistics has compelled researchers to reassess their practices for managing noise and its impact on research results (Fuchs & Habert, 2004; Léon, 2018; Zalmout et al., 2018). Whether working with long-diachronic corpora (e.g., medieval French), dialectal corpora with limited resources (e.g., oral or written texts in dialectal Arabic, cf. Arabizi), or corpora of texts deviating from the norm (e.g., learner corpora), conducting noise analysis becomes an essential step in drawing linguistic conclusions from the available data (Molinelli & Putzu, 2015; Scaglione, 2018; Litosseliti, 2018). This special issue of Corpus builds upon a workshop held in April 2023 (https://je-bruit-corpus.sciencesconf.org/) and offers an opportunity to examine noise management methods in the fields of NLP and corpus linguistics, as well as their impact on the quality of linguistic data (Kraif & Ponton, 2007; Goutte et al., 2012; Zeroual, 2018).
The fundamental inquiries in any linguistic study revolve around defining the research object, understanding the nature of the data, and determining ways to preserve its inherent characteristics throughout the various processing steps (such as lemmatisation, normalisation, labelling, etc.) (Sarrica et al., 2016). Hence, selecting appropriate methods for identifying and controlling noise becomes crucial throughout the entire process, from data collection to the archiving phase, and from data preparation to annotation (Egbert & Baker, 2019). The definition of noise itself is diverse and far from self-evident. In the field of NLP alone, this term encompasses a wide range of highly heterogeneous phenomena, including web peritexts - such as hyperlinks, menus and computer codes - as well as code switching and instances of spelling or grammatical errors that punctuate productions (Al Sharou et al., 2021).
This special issue aims to delve into the definition of noise, from a linguistic perspective, and the practices employed by researchers to mitigate the biases that can arise from it. These practices are implemented during collection, recording, and annotation of data. The question of noise inevitably emerges at each stage of the empirical process involved in data construction and analysis:
1. Noise during data collection and recording
If one accepts the postulate that "linguistic data is a result" (Benveniste, 1966), decoding the noise stemming from data collection and recording becomes crucial. Depending on the research object, various factors may contribute to data alteration, including the researcher's preconceptions or the biases introduced by an OCR system (Jentsch & Porada, 2020). The key challenge lies in predicting or identifying the potential biases induced by these factors during the selection and formatting of data. This enables better control over subsequent research stages and ensures greater accuracy in the analysis process.
2. Data preparation and pre-processing
The methods employed to refine raw data and prepare it for advanced manipulation can give rise to a significant source of noise (or, conversely, of silence, if noise elimination filters are applied). This is particularly evident during the data normalization process (Al Sharou et al., 2021). When transcribing data or correcting errors, researchers must make choices that inevitably influence the nature of the data, either by reducing or enriching its content. As a result, it becomes essential to anticipate the consequences of the transformations introduced by data processing methods (Tanguy, 2012).
3. The annotation process and metadata
Initially, corpus annotation aims to enrich the data by categorizing units through a labelling process, depending on the analysis model developed (Péry-Woodley et al., 2011). However, this process can itself introduce noise, and it can also result in detrimental silence (when missing or erroneous labels lead to incomplete results during data analysis or querying). The concept of metadata also raises questions: does categorizing data transform it into something different? Furthermore, does the absence of agreement or low agreement in annotations produced by humans reflect inter-individual variations akin to noise, or does it stem from the inherent vagueness of the categorizations themselves?
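As an illustration of how agreement between human annotators is commonly quantified, the sketch below computes Cohen's kappa for two annotators. This measure is not named in the call itself; it is given here only as one familiar example, and the function name, label set and toy annotations are all hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical part-of-speech labels assigned by two annotators to six tokens.
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A low kappa alone does not settle the question raised above: it cannot tell whether the disagreement comes from the annotators or from vague categories, which is why it is usually read alongside a qualitative look at the disputed items.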
***
At every step of the process, key methodological questions arise: What threshold of noise can be considered acceptable? How can we differentiate between noise and methodological bias? Is it possible to estimate noise without a ground truth? Which statistical tools are specific to corpus studies and enable the definition of confidence intervals? How can we strike a balance that prevents the resulting noise from compromising research outcomes?
***
Proposals for articles may address these topics from a general point of view, offering a theoretical and methodological perspective. Alternatively, they can be based on one or more case studies that focus on specific observations, while highlighting the noise management methods employed throughout the study.
References
Al Sharou, K., Li, Z., & Specia, L. (2021). Towards a Better Understanding of Noise in Natural Language Processing. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 5362. https://aclanthology.org/2021.ranlp-1.7
Benveniste, É. (1966). Problèmes de linguistique générale. Gallimard.
Egbert, J., & Baker, P. (Eds.). (2019). Using corpus methods to triangulate linguistic analysis. Routledge.
Fuchs, C., & Habert, B. (2004). Le traitement automatique des langues : Des modèles aux ressources. Le Français Moderne - Revue de linguistique Française, CILF (conseil international de la langue française), LXXII: 1, online.
Goutte, C., Carpuat, M., & Foster, G. (2012). The impact of sentence alignment errors on phrase-based machine translation performance. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers.
Jentsch, P., & Porada, S. (2020). From Text to Data: Digitization, Text Analysis and Corpus Linguistics. In S. Schwandt (Ed.), Digital Humanities Research (1st ed., Vol. 1, pp. 89-128). transcript Verlag / Bielefeld University Press. https://doi.org/10.14361/9783839454190-004
Kraif, O., & Ponton, C. (2007). Du bruit, du silence et des ambiguïtés : Que faire du TAL pour l'apprentissage des langues ? TALN 2007, 143-152. https://hal.archives-ouvertes.fr/hal-01073706
Léon, J. (2018). Tal et linguistique : Application, expérimentation, instrumentalisation. ELA. Études de linguistique appliquée, 2(190), 195-203.
Litosseliti, L. (Ed.). (2018). Research methods in linguistics. Bloomsbury Publishing.
Molinelli, P., & Putzu, I. (2015). Modelli epistemologici, metodologie della ricerca e qualità del dato. Dalla linguistica storica alla sociolinguistica storica. Franco Angeli.
Péry-Woodley, M.-P., Afantenos, S. D., Ho-Dac, L.-M., & Asher, N. (2011). La ressource ANNODIS, un corpus enrichi d'annotations discursives. TAL, 52(3), 71-101.
Sarrica, M., Mingo, I., Mazzara, B., & Leone, G. (2016). The effects of lemmatization on textual analysis conducted with IRaMuTeQ: results in comparison. JADT2016: 13ème Journées Internationales d'Analyse Statistique des Données Textuelles.
Scaglione, F. (2018). "Lavorare" il dato linguistico: Prospettive e limiti. Alcune considerazioni dall'esperienza dell'Atlante Linguistico della Sicilia (ALS). In G. Sampino (Ed.), Atti del convegno internazionale dei dottorandi (pp. 101-122).
Tanguy, L. (2012). Complexification des données et des techniques en linguistique : contribution du TAL aux solutions et aux problèmes. HDR dissertation, Université de Toulouse 2 - le Mirail.
Zalmout, N., Erdmann, A., & Habash, N. (2018). Noise-robust morphological disambiguation for dialectal Arabic. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 953-964).
Zeroual, I. (2018). Building Arabic Corpora: Concepts, Methodologies, Tools, and Experiments (Doctoral dissertation, University of Maryland, USA).
Retro-planning
* July 2023: publication of the call for papers.
* 17 November 2023: pre-selection based on abstracts.
* March 2024: article submission deadline.
* June 2024: response to the authors.
* June-October 2024: revision exchanges with the authors, leading to submission of the final version of the article.
* November-December 2024: editing process.
* January 2025: publication.
Please note that this retro-planning outlines a general timeline and may vary depending on the specific publication requirements.
Abstract submission
* Your abstract should be no longer than 1,500 words, including bibliographical references.
* Please submit your abstracts by November 10, 2023 to elisa.gugliotta(a)ilc.cnr.it and luca.pallanti(a)univ-lyon2.fr.
----- FRENCH VERSION (in English translation) ------
Background Noise or Added Value? Managing Noise in the Computational Processing of Linguistic Corpora
Edited by Elisa Gugliotta, Luca Pallanti, Olivier Kraif, Iris Fabry and Martina Barletta
The growing influence of NLP-related methodologies on corpus linguistics obliges researchers to re-examine noise management practices and the impact of noise on research results (Fuchs & Habert, 2004; Léon, 2018; Zalmout et al., 2018). Whether dealing with long-diachrony corpora (e.g. Medieval French), dialectal corpora with limited resources (e.g. oral or written texts in dialectal Arabic, cf. Arabizi), or corpora of texts that depart from the norm (e.g. learner corpora), noise analysis is a necessary step for drawing linguistic conclusions from the data thus evaluated (Molinelli & Putzu, 2015; Scaglione, 2018; Litosseliti, 2018). This thematic issue of the journal Corpus, which follows a study day on the same theme held in April 2023 (https://je-bruit-corpus.sciencesconf.org/), will be an opportunity to reflect on noise management methods in NLP and tool-assisted corpus linguistics, and on the impact of noise on the quality of linguistic data (Kraif & Ponton, 2007; Goutte et al., 2012; Zeroual, 2018).
The questions underlying any linguistic study concern the definition of the research object, the nature of the data themselves, and how to preserve their characteristics as far as possible through the various processing steps (lemmatization, normalization, tagging, etc.) (Sarrica et al., 2016). The choice of methods for identifying and controlling noise, from collection to archiving and from data preparation to annotation, therefore plays a fundamental role (Egbert & Baker, 2019). The very definition of noise is multiple and far from self-evident: within NLP alone, this term, often left unquestioned, covers variable and highly heterogeneous phenomena, ranging from Web peritexts (hyperlinks, menus and computer code) to code switching, by way of the spelling and grammar errors scattered through the texts produced (Al Sharou et al., 2021).
This thematic issue proposes to reflect on the definition of noise from a linguistic perspective, and on researchers' practices aimed at reducing the biases that result from it, whether during data collection, recording or annotation. In concrete research practice, the question of noise arises at every stage of the empirical process of constructing and analysing data:
1. Noise during data collection and recording
If we accept the premise that "linguistic data is a result" (Benveniste, 1966), how can we decode the noise caused by data collection and recording? Depending on the research object, there are potential factors that alter the data, such as the researcher's preconceptions or the biases introduced by a given OCR system (Jentsch & Porada, 2020). The challenge is then to predict or determine the potential biases induced by these factors when selecting and formatting the data, so as to better control the subsequent research phases.
2. Data preparation and preprocessing
The methods chosen to refine raw data and make them available for advanced manipulation can be a major source of noise (or, conversely, of silence if a filter is applied to eliminate noise): this is notably the case for data normalization (Al Sharou et al., 2021). Whether transcribing data or correcting errors, the researcher makes choices that necessarily affect the nature of the data, either reducing or enriching them. The aim is therefore to anticipate the consequences of the transformations produced by data processing methods (Tanguy, 2012).
3. The annotation process and metadata
At its core, corpus annotation is a step aimed at enriching the data: depending on the analysis model developed, the researcher attempts to categorize units through a labelling process (Péry-Woodley et al., 2011). However, while this process can generate noise, it can also cause silence that is highly detrimental to research results and their interpretation (missing or erroneous labels may produce incomplete results when the data are analysed or queried). The notion of metadata can also be called into question: does categorizing a piece of data mean transforming it into something else? Moreover, does absent or low agreement among human annotations reflect inter-individual variation comparable to noise, or the excessive vagueness of the categories involved?
***
At every step, central methodological questions arise: above what threshold does noise stop being acceptable? How can we differentiate between noise and methodological bias? How can noise be estimated without a ground truth? Which statistical tools specific to corpus studies make it possible to define confidence intervals? How can we reach the balance needed so that the noise caused by data processing does not compromise research results?
***
Proposals for articles may address these questions from a general point of view, from a theoretical and methodological angle, or may draw on one or more case studies focusing on specific observations, taking care to highlight the noise management methods used throughout the study.
Retro-planning
* July 2023: publication of the call.
* 17 November 2023: pre-selection based on abstracts.
* March 2024: submission of articles.
* June 2024: response to the authors.
* June-October 2024: revision exchanges with the authors, leading to submission of the final version of the article.
* November-December 2024: editing.
* January 2025: publication.
Abstract submission
* Your abstract should be no longer than 1,500 words, including bibliographical references.
* Please submit your abstracts by 10 November 2023 to elisa.gugliotta(a)ilc.cnr.it and luca.pallanti(a)univ-lyon2.fr
Références
Al Sharou, K., Li, Z., & Specia, L. (2021). Towards a Better Understanding of Noise in Natural Language Processing. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 5362. https://aclanthology.org/2021.ranlp-1.7
Benveniste, É. (1966). Problèmes de linguistique générale. Gallimard.
Egbert, J., & Baker, P. (Eds.). (2019). Using corpus methods to triangulate linguistic analysis. Routledge. Fuchs, C., & Habert, B. (2004). Le traitement automatique des langues : Des modèles aux ressources.
Le Français Moderne - Revue de linguistique Française, CILF (conseil international de la langue française), LXXII: 1, online.
Goutte, C., Carpuat, M., & Foster, G. (2012). The impact of sentence alignment errors on phrase-based machine translation performance. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers.
Jentsch, P., & Porada, S. (2020). From Text to Data : Digitization, Text Analysis and Corpus Linguistics. In S. Schwandt (Éd.), Digital Humanities Research (1re éd., Vol. 1, p. 89128). transcript Verlag / Bielefeld University Press. https://doi.org/10.14361/9783839454190-004
Kraif, O., & Ponton, C. (2007). Du bruit, du silence et des ambiguïtés : Que faire du TAL pour
l'apprentissage des langues ? TALN 2007, 143152. https://hal.archives-ouvertes.fr/hal-01073706
Léon, J. (2018). Tal et linguistique : Application, expérimentation, instrumentalisation. ELA. Etudes de linguistique appliquee, 2(190), 195203.
Litosseliti, L. (Ed.). (2018). Research methods in linguistics. Bloomsbury Publishing.
Molinelli, P., & Putzu, I. (2015). Modelli epistemologici, metodologie della ricerca e qualità del dato. Dalla linguistica storica alla sociolinguistica storica. Franco Angeli.
Péry-Woodley, M.-P., Afantenos, S. D., Ho-Dac, L.-M., & Asher, N. (2011). La ressource ANNODIS, un
corpus enrichi d'annotations discursives. TAL, 52(3), 71101.
Sarrica, M., Mingo, I., Mazzara, B., & Leone, G. (2016). The effects of lemmatization on textual analysis conducted with IRaMuTeQ: results in comparison. JADT2016: 13ème Journées Internacionales d'Analyse Statistique de Données Textuelles.
Scaglione, F. (2018). "Lavorare"; il dato linguistico: Prospettive e limiti. Alcune considerazioni dall'esperienza dell'Atlante Linguistico della Sicilia (ALS). In G. Sampino (Éd.), Atti del convegno internazionale dei dottorandi (p. 101122).
Tanguy, L. (2012). Complexification des données et des techniques en linguistique : contribution du TAL aux solutions et aux problèmes. HDR dissertation, Université de Toulouse 2 - le Mirail.
Zalmout, N., Erdmann, A., & Habash, N. (2018). Noise-robust morphological disambiguation for dialectal Arabic. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 953-964).
Zeroual, I. (2018). Building Arabic Corpora: Concepts, Methodologies, Tools, and Experiments (Doctoral dissertation, University of Maryland, USA).