AAC/CFP Corpus 26 - 2025 - https://journals.openedition.org/corpus/
Background noise or added value? Managing noise during computer processing of linguistic corpora
Elisa Gugliotta, Luca Pallanti, Olivier Kraif, Iris Fabry et Martina Barletta (eds.)
-------FRENCH VERSION BELOW-----
The increasing influence of NLP-related methodologies on corpus linguistics has compelled researchers to reassess their practices for managing noise and its impact on research results (Fuchs & Habert, 2004; Léon, 2018; Zalmout et al., 2018). Whether working with long-diachronic corpora (e.g., medieval French), dialectal corpora with limited resources (e.g., oral or written texts in dialectal Arabic, cf. Arabizi), or corpora of texts deviating from the norm (e.g., learner corpora), conducting noise analysis becomes an essential step in drawing linguistic conclusions from the available data (Molinelli & Putzu, 2015; Scaglione, 2018; Litosseliti, 2018). This special issue of Corpus builds upon a workshop held in April 2023 (https://je-bruit-corpus.sciencesconf.org/) and offers an opportunity to examine noise management methods in the fields of NLP and corpus linguistics, as well as their impact on the quality of linguistic data (Kraif & Ponton, 2007; Goutte et al., 2012; Zeroual, 2018).
The fundamental inquiries in any linguistic study revolve around defining the research object, understanding the nature of the data, and determining ways to preserve its inherent characteristics throughout the various processing steps (such as lemmatisation, normalisation, labelling, etc.) (Sarrica et al., 2016). Hence, selecting appropriate methods for identifying and controlling noise becomes crucial throughout the entire process, from data collection to the archiving phase, and from data preparation to annotation (Egbert & Baker, 2019). The definition of noise itself is diverse and far from self-evident. In the field of NLP alone, this term encompasses a wide range of highly heterogeneous phenomena, including web peritexts - such as hyperlinks, menus and computer codes - as well as code switching and instances of spelling or grammatical errors that punctuate productions (Al Sharou et al., 2021).
This special issue aims to delve into the definition of noise, from a linguistic perspective, and the practices employed by researchers to mitigate the biases that can arise from it. These practices are implemented during collection, recording, and annotation of data. The question of noise inevitably emerges at each stage of the empirical process involved in data construction and analysis:
1. Noise during data collection and recording
If one accepts the postulate that "linguistic data is a result" (Benveniste, 1966), decoding the noise stemming from data collection and recording becomes crucial. Depending on the research object, various factors may contribute to data alteration, including the researcher's preconceptions or the biases introduced by an OCR system (Jentsch & Porada, 2020). The key challenge lies in predicting or identifying the potential biases induced by these factors during the selection and formatting of data. This enables better control over subsequent research stages and ensures greater accuracy in the analysis process.
2. Data preparation and pre-processing
The methods employed to refine raw data and prepare it for advanced manipulation can give rise to a significant source of noise (or, conversely, of silence, if noise elimination filters are applied). This is particularly evident during the data normalization process (Al Sharou et al., 2021). When transcribing data or correcting errors, researchers must make choices that inevitably influence the nature of the data, either by reducing or enriching its content. As a result, it becomes essential to anticipate the consequences of the transformations introduced by data processing methods (Tanguy, 2012).
3. The annotation process and metadata
At its core, corpus annotation aims to enrich the data by categorizing units through a labelling process, depending on the analysis model that has been developed (Péry-Woodley et al., 2011). However, while this process has the potential to introduce noise, it can also result in detrimental silence (when missing or erroneous labels lead to incomplete results during data analysis or querying). The concept of metadata also raises questions: does categorizing data transform it into something different? Furthermore, does the absence of agreement or low agreement in annotations produced by humans reflect inter-individual variations akin to noise, or does it stem from the inherent vagueness of the categorizations themselves?
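As an illustration of how agreement between human annotators is commonly quantified, the following minimal sketch computes Cohen's kappa for two annotators. It is not a method prescribed by this call: the function name, the tag set and the toy labels are all assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical part-of-speech labels from two annotators (toy data).
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A low kappa does not say by itself whether the disagreement comes from annotator variation or from vague categories; it only measures how far agreement exceeds chance.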
***
At every step of the process, key methodological questions arise: what threshold can be considered acceptable for noise? How can we differentiate between noise and methodological bias? Is it possible to estimate noise without a ground truth? Which statistical tools are specific to corpus studies and enable the definition of confidence intervals? How can we strike a balance so that the noise resulting from data processing does not compromise research outcomes?
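One possible way to attach a confidence interval to a noise estimate is a percentile bootstrap over a manually checked sample. The sketch below assumes such a sample exists (each item marked 1 if noisy, 0 otherwise); the sample itself, the function name and the parameter values are hypothetical and chosen only for illustration.

```python
import random


def bootstrap_ci(errors, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a 0/1 error list."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(errors)
    # Resample the checked items with replacement and record each noise rate.
    rates = sorted(
        sum(rng.choices(errors, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical sample: 100 manually checked tokens, 12 of them noisy.
sample = [1] * 12 + [0] * 88
point_estimate = sum(sample) / len(sample)
low, high = bootstrap_ci(sample)
print(f"noise rate = {point_estimate:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

The interval only reflects sampling variability in the checked items; it says nothing about systematic bias in how those items were selected or judged.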
***
Proposals for articles may address these topics from a general point of view, offering a theoretical and methodological perspective. Alternatively, they can be based on one or more case studies that focus on specific observations, while highlighting the noise management methods employed throughout the study.
References
Al Sharou, K., Li, Z., & Specia, L. (2021). Towards a Better Understanding of Noise in Natural Language Processing. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 53–62. https://aclanthology.org/2021.ranlp-1.7
Benveniste, É. (1966). Problèmes de linguistique générale. Gallimard.
Egbert, J., & Baker, P. (Eds.). (2019). Using corpus methods to triangulate linguistic analysis. Routledge.
Fuchs, C., & Habert, B. (2004). Le traitement automatique des langues : Des modèles aux ressources. Le Français Moderne - Revue de linguistique Française, CILF (conseil international de la langue française), LXXII: 1, online.
Goutte, C., Carpuat, M., & Foster, G. (2012). The impact of sentence alignment errors on phrase-based machine translation performance. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers.
Jentsch, P., & Porada, S. (2020). From Text to Data: Digitization, Text Analysis and Corpus Linguistics. In S. Schwandt (Ed.), Digital Humanities Research (1st ed., Vol. 1, pp. 89–128). transcript Verlag / Bielefeld University Press. https://doi.org/10.14361/9783839454190-004
Kraif, O., & Ponton, C. (2007). Du bruit, du silence et des ambiguïtés : Que faire du TAL pour l'apprentissage des langues ? TALN 2007, 143–152. https://hal.archives-ouvertes.fr/hal-01073706
Léon, J. (2018). Tal et linguistique : Application, expérimentation, instrumentalisation. ELA. Études de linguistique appliquée, 2(190), 195–203.
Litosseliti, L. (Ed.). (2018). Research methods in linguistics. Bloomsbury Publishing.
Molinelli, P., & Putzu, I. (2015). Modelli epistemologici, metodologie della ricerca e qualità del dato. Dalla linguistica storica alla sociolinguistica storica. Franco Angeli.
Péry-Woodley, M.-P., Afantenos, S. D., Ho-Dac, L.-M., & Asher, N. (2011). La ressource ANNODIS, un corpus enrichi d'annotations discursives. TAL, 52(3), 71–101.
Sarrica, M., Mingo, I., Mazzara, B., & Leone, G. (2016). The effects of lemmatization on textual analysis conducted with IRaMuTeQ: results in comparison. JADT2016: 13ème Journées Internationales d'Analyse Statistique de Données Textuelles.
Scaglione, F. (2018). "Lavorare" il dato linguistico: Prospettive e limiti. Alcune considerazioni dall'esperienza dell'Atlante Linguistico della Sicilia (ALS). In G. Sampino (Ed.), Atti del convegno internazionale dei dottorandi (pp. 101–122).
Tanguy, L. (2012). Complexification des données et des techniques en linguistique : contribution du TAL aux solutions et aux problèmes. HDR dissertation, Université de Toulouse 2 - le Mirail.
Zalmout, N., Erdmann, A., & Habash, N. (2018). Noise-robust morphological disambiguation for dialectal Arabic. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 953-964).
Zeroual, I. (2018). Building Arabic Corpora: Concepts, Methodologies, Tools, and Experiments (Doctoral dissertation, University of Maryland, USA).
Retro-planning
* July 2023: call for publications.
* 10 November 2023: pre-selection based on article summaries.
* March 2024: article submission deadline.
* June 2024: response to the authors.
* June-October 2024: review process with authors to submit the final version of the article.
* November-December 2024: editing process.
* January 2025: publication.
Please note that this retro-planning outlines a general timeline and may vary depending on the specific publication requirements.
Abstract submission
* Your abstract should be no longer than 1,500 words, including bibliographical references.
* Please submit your abstracts by November 10, 2023 to elisa.gugliotta(a)ilc.cnr.it and luca.pallanti(a)univ-lyon2.fr.
----- FRENCH VERSION------
Bruit de fond ou valeur ajoutée ? Gérer le bruit lors des traitements informatiques des corpus linguistiques
Sous la direction de Elisa Gugliotta, Luca Pallanti, Olivier Kraif, Iris Fabry et Martina Barletta
English version above
L'influence croissante des méthodologies liées au TAL sur la linguistique de corpus oblige les chercheurs à réinterroger les pratiques de gestion du bruit et son impact dans les résultats de recherche (Fuchs & Habert, 2004 ; Léon, 2018 ; Zalmout et al., 2018). Qu'il s'agisse de corpus en diachronie longue (ex. français médiéval), de corpus dialectaux aux ressources limitées (ex. textes oraux ou écrits en arabe dialectal, cf. arabizi), ou encore de corpus de textes éloignés de la norme (ex. corpus d'apprenants), l'analyse du bruit est une étape nécessaire pour tirer des conclusions linguistiques des données ainsi évaluées (Molinelli & Putzu, 2015 ; Scaglione, 2018 ; Litosseliti, 2018). Ce numéro thématique de la revue Corpus, qui fait suite à une journée d'étude sur le même thème organisée en avril 2023 (https://je-bruit-corpus.sciencesconf.org/), sera l'occasion de réfléchir sur les méthodes de gestion du bruit dans les domaines du TAL et de la linguistique de corpus outillée, et à son impact sur la qualité des données linguistiques (Kraif et Ponton, 2007 ; Goutte et al., 2012 ; Zeroual, 2018).
Les questions sous-jacentes à toute étude linguistique concernent la définition de l'objet de recherche, la nature des données elles-mêmes, et la manière de préserver autant que possible leurs caractéristiques dans les différents traitements (lemmatisation, normalisation, étiquetage, etc.) (Sarrica et al., 2016). Ainsi, le choix des méthodes d'identification et de contrôle du bruit, de la phase de collecte à celle d'archivage, de la préparation des données à l'annotation, joue un rôle fondamental (Egbert & Baker, 2019). La définition même du bruit est multiple, et ne va pas de soi : dans le seul champ du TAL, ce terme, souvent peu interrogé, désigne des phénomènes variables et très hétérogènes, allant des péritextes du Web - hyperliens, menus et codes informatiques - au code switching, en passant par les erreurs d'orthographe ou de grammaire qui émaillent les productions (Al Sharou et al., 2021).
Ce numéro thématique propose de mener une réflexion sur la définition du bruit, dans une perspective linguistique, et sur les pratiques des chercheurs visant à réduire la portée des biais qui en découlent, que ce soit durant la collecte, l'enregistrement ou l'annotation des données. Dans le concret de la recherche, la question du bruit se pose à chaque étape de la démarche empirique de construction et d'analyse des données :
1. Le bruit pendant la collecte et l'enregistrement des données
Si l'on accepte le postulat selon lequel " la donnée linguistique est un résultat " (Benveniste, 1966), comment décoder le bruit causé par le recueil des données et leur enregistrement ? En effet, en fonction des objets de recherche, il existe des facteurs potentiels d'altération des données, comme par exemple les préconceptions du chercheur, ou les biais introduits par un système OCR donné (Jentsch & Porada, 2020). L'enjeu consiste alors à prédire ou à déterminer les biais potentiels induits par ces facteurs lors de la sélection et la mise en forme des données pour mieux contrôler les phases de recherche successives.
2. La préparation et le prétraitement des données
Les méthodes choisies pour affiner les données brutes et les rendre disponibles pour des manipulations avancées peuvent représenter une importante source de bruit (ou, au contraire, de silence si on applique un filtre pour éliminer le bruit) : c'est notamment le cas du processus de normalisation des données (Al Sharou et al., 2021). Qu'il s'agisse de transcrire des données ou de corriger des erreurs, le chercheur fait des choix qui impactent nécessairement la nature des données, soit en les réduisant, soit en les enrichissant. Il s'agit donc d'anticiper les conséquences des transformations produites par les méthodes de traitement des données (Tanguy, 2012).
3. Le processus d'annotation et les métadonnées
À la base, l'annotation des corpus est une étape visant l'enrichissement des données : en fonction du modèle d'analyse mis au point, le chercheur tente de catégoriser des unités à travers un processus d'étiquetage (Péry-Woodley et al., 2011). Cependant, si d'un côté ce processus peut générer du bruit, de l'autre, il peut être une cause de silence fort préjudiciable aux résultats des recherches et à leur interprétation (des étiquettes absentes ou erronées pouvant générer des résultats lacunaires lors de l'analyse ou du requêtage des données). La notion de métadonnée peut également être mise en cause : catégoriser une donnée signifie-t-il la transformer en quelque chose d'autre ? Par ailleurs, l'absence d'accord ou un faible accord dans les annotations produites par l'humain manifeste-t-il des variations interindividuelles assimilables à du bruit, ou au caractère trop vague des catégorisations en jeu ?
***
À chaque étape se posent des questions méthodologiques centrales : à partir de quel seuil peut-on considérer le bruit comme acceptable ? Comment différencier bruit et biais méthodologique ? Comment estimer le bruit sans vérité de terrain ? Quels outils statistiques spécifiques à l'étude des corpus permettent de délimiter des intervalles de confiance ? Comment atteindre l'équilibre nécessaire pour que le bruit causé par les traitements des données ne compromette pas les résultats des recherches ?
***
Les propositions d'article pourront aborder ces questions d'un point de vue général, sous un angle théorique et méthodologique, ou s'appuyer sur une ou plusieurs études de cas portant sur des observations particulières, en prenant soin de mettre en lumière les méthodes de gestion du bruit tout au long de l'étude.
Retro-planning
* Juillet 2023 : publication de l'appel.
* 10 novembre 2023 : pré-sélection sur résumé.
* Mars 2024 : remise des articles.
* Juin 2024 : réponse aux auteurs.
* Juin-octobre 2024 : navette avec les auteurs pour remise de l'article en forme définitive.
* Novembre-décembre 2024 : édition.
* Janvier 2025 : publication.
Soumission des résumés
* Votre résumé comptera 1.500 mots au maximum, références bibliographiques incluses.
* Merci de soumettre vos résumés pour le 10 novembre 2023 aux adresses elisa.gugliotta(a)ilc.cnr.it et luca.pallanti(a)univ-lyon2.fr
Dear corpora-list members,
We are glad to announce the first SemEval shared task on Semantic Textual
Relatedness (STR): A shared task on automatically detecting the degree of
semantic relatedness (closeness in meaning) between pairs of sentences.
The semantic relatedness of two language units has long been considered
fundamental to understanding meaning (Halliday and Hasan, 1976; Miller and
Charles, 1991), and automatically determining relatedness has many
applications such as evaluating sentence representation methods, question
answering, and summarization.
Two sentences are considered semantically similar when they have a
paraphrasal or entailment relation. On the other hand, relatedness is a
much broader concept that accounts for all the commonalities between two
sentences: whether they are on the same topic, express the same view,
originate from the same time period, one elaborates on (or follows from)
the other, etc. For instance, for the following sentence pairs:
- Pair 1: a. There was a lemon tree next to the house. b. The boy enjoyed
  reading under the lemon tree.
- Pair 2: a. There was a lemon tree next to the house. b. The boy was an
  excellent football player.
Most people will agree that the sentences in pair 1 are more related than
the sentences in pair 2.
In this task, new textual datasets will be provided for Afrikaans
<https://en.wikipedia.org/wiki/Afrikaans>, Algerian Arabic
<https://en.wikipedia.org/wiki/Algerian_Arabic>, Amharic
<https://en.wikipedia.org/wiki/Amharic>, English, Hausa
<https://en.wikipedia.org/wiki/Hausa_language>, Hindi
<https://en.wikipedia.org/wiki/Hindi>, Indonesian
<https://en.wikipedia.org/wiki/Indonesian_language>, Kinyarwanda
<https://en.wikipedia.org/wiki/Kinyarwanda>, Marathi
<https://en.wikipedia.org/wiki/Marathi_language>, Moroccan Arabic
<https://en.wikipedia.org/wiki/Moroccan_Arabic>, Modern Standard Arabic
<https://en.wikipedia.org/wiki/Modern_Standard_Arabic>, Punjabi
<https://en.wikipedia.org/wiki/Punjabi_language>, Spanish
<https://en.wikipedia.org/wiki/Spanish_language>, and Telugu
<https://en.wikipedia.org/wiki/Telugu_language>.
Data
Each instance in the training, development, and test sets is a sentence
pair. The instance is labeled with a score representing the degree of
semantic textual relatedness between the two sentences. The scores can
range from 0 (maximally unrelated) to 1 (maximally related). These gold
label scores have been determined through manual annotation. Specifically,
a comparative annotation approach was used, which avoids several known
biases of traditional rating scale annotation methods and led to a high
reliability of the final relatedness rankings.
Further details about the task, the method of data annotation, how STR is
different from semantic textual similarity, applications of semantic
textual relatedness, etc. can be found in this paper:
https://aclanthology.org/2023.eacl-main.55.pdf
Tracks
Each team can provide submissions for one, two or all of the tracks shown
below:
Track A: Supervised
Participants are to submit systems that have been trained using the labeled
training datasets provided. Participating teams are allowed to use any
publicly available datasets (e.g., other relatedness and similarity
datasets or datasets in any other languages). However, they must report
additional data they used, and ideally report how impactful each resource
was on the final results.
Track B: Unsupervised
Participants are to submit systems that have been developed without the use
of any labeled datasets pertaining to semantic relatedness or semantic
similarity between units of text more than two words long in any language.
The use of unigram or bigram relatedness datasets (from any language) is
permitted.
Track C: Cross-lingual
Participants are to submit systems that have been developed without the use
of any labeled semantic similarity or semantic relatedness datasets in the
target language and with the use of labeled dataset(s) from at least one
other language. Note: Using labeled data from another track is mandatory
for submission to this track.
Deciding which track a submission should go to:
- If a submission uses labeled data in the target language: submit to
  Track A
- If a submission does not use labeled data in the target language but
  uses labeled data from another language: submit to Track C
- If a submission does not use labeled data in any language: submit to
  Track B
** Here ‘labeled data’ refers to labeled datasets pertaining to semantic
relatedness or semantic similarity between units of text more than two
words long.
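The decision rule above can be written as a small helper. This is only an illustrative sketch: the function and parameter names are hypothetical, and "labeled data" is understood in the sense of the note above (relatedness or similarity datasets for units of text more than two words long).

```python
def choose_track(uses_target_language_labels, uses_other_language_labels):
    """Return the track letter implied by the labeled data a system used."""
    if uses_target_language_labels:
        return "A"  # supervised: labeled data in the target language
    if uses_other_language_labels:
        return "C"  # cross-lingual: labeled data only from other languages
    return "B"      # unsupervised: no labeled data in any language


print(choose_track(True, False))   # A
print(choose_track(False, True))   # C
print(choose_track(False, False))  # B
```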
Evaluation
The official evaluation metric for this task is the Spearman rank
correlation coefficient, which captures how well the system-predicted
rankings of test instances align with human judgments. You can find the
evaluation script for this shared task on our GitHub page
<https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEva…>.
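For readers unfamiliar with the metric, here is a minimal, self-contained sketch of the Spearman rank correlation, computed as the Pearson correlation of average ranks. It is not the official evaluation script, which remains the reference for scoring; the function names and the toy scores are assumptions made for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(gold, predicted):
    """Spearman correlation: Pearson correlation between the two rank lists."""
    if len(gold) != len(predicted) or len(gold) < 2:
        raise ValueError("need two score lists of equal length (at least 2 items)")
    rg, rp = average_ranks(gold), average_ranks(predicted)
    mg, mp = mean(rg), mean(rp)
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_g * var_p) ** 0.5


# Hypothetical gold and system scores for four sentence pairs.
gold = [0.9, 0.1, 0.5, 0.7]
pred = [0.8, 0.2, 0.6, 0.4]
rho = spearman(gold, pred)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks matter, a system does not need to reproduce the gold scores themselves, only their ordering.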
Helpful Links
- Competition Website: https://codalab.lisn.upsaclay.fr/competitions/15704
- Task Website: https://semantic-textual-relatedness.github.io
- Twitter/X: https://twitter.com/SemRel2024
- Contact organisers: semrel-semeval-organisers(a)googlegroups.com
- Google group for participants:
  semrel-semeval-participants(a)googlegroups.com
Important Dates
- Training data ready: 11 September 2023
- Evaluation starts: 10 January 2024
- Evaluation ends: 31 January 2024
- System description paper due: February 2024
- SemEval workshop: Summer 2024 (co-located with a major NLP conference)
References
- Shima Asaadi, Saif Mohammad, Svetlana Kiritchenko. 2019. Big BiRD: A
  Large, Fine-Grained, Bigram Relatedness Dataset for Examining Semantic
  Composition. Proceedings of the 2019 Conference of the North American
  Chapter of the Association for Computational Linguistics: Human Language
  Technologies.
- M. A. K. Halliday and R. Hasan. 1976. Cohesion in English. London:
  Longman.
- George A Miller and Walter G Charles. 1991. Contextual Correlates of
  Semantic Similarity. Language and Cognitive Processes, 6(1):1–28.
- Mohamed Abdalla, Krishnapriya Vishnubhotla, and Saif Mohammad. 2023.
  What Makes Sentences Semantically Related? A Textual Relatedness Dataset
  and Empirical Study. In Proceedings of the 17th Conference of the European
  Chapter of the Association for Computational Linguistics, pages 782–796,
  Dubrovnik, Croatia. Association for Computational Linguistics.
Task Organizers
Nedjma Ousidhoum
Shamsuddeen Hassan Muhammad
Mohamed Abdalla
Krishnapriya Vishnubhotla
Vladimir Araujo
Meriem Beloucif
Idris Abdulmumin
Seid Muhie Yimam
Nirmal Surange
Christine De Kock
Sanchit Ahuja
Oumaima Hourrane
Manish Shrivastava
Alham Fikri Aji
Thamar Solorio
Saif M. Mohammad
Dear all,
I would like to share the information about the VLSP 2023 evaluation
campaign on Vietnamese text and speech processing:
https://vlsp.org.vn/vlsp2023/eval
Registration is still open for several tasks. Thank you for your attention.
Best regards,
Huyen
--
Nguyen Thi Minh Huyen
Faculty of Mathematics, Mechanics and Informatics
VNU University of Science
334 Nguyễn Trãi, Thanh Xuân, Hà Nội
Tel: +84-24-38581530
Hello!
We're looking forward to the 8th annual Law and Corpus Linguistics Conference at Brigham Young University on Friday, October 13th. Information about the conference can be found HERE<https://corpusconference.byu.edu/2023-home/>. Please make note of the following information in preparation for the conference:
Registration
You can now register for the conference HERE<https://corpusconference.byu.edu/2023-home/registration/>. Please register early so we can make appropriate plans for meeting spaces and meals. There is no registration fee for this conference.
Pre-conference Workshop
There will be a workshop, titled Introduction to Corpus Linguistic Applications to Law, on Thursday, October 12, from 9:00 to 12:00 for linguists who are interested in learning more about legal applications of corpus linguistics. The workshop is free of charge. If you are interested, please register HERE<https://corpusconference.byu.edu/2023-home/pre-conference/>. Lunch will be provided immediately following the workshop.
Meals
There will be a dinner for registered attendees on Thursday, October 12 at 6:00 pm. Please RSVP on the registration form if you plan to attend. Breakfast, lunch, and snacks will be provided on Friday.
Location and accommodation
The conference will be held in the J. Reuben Clark Law Building at Brigham Young University. We recommend the Provo Marriott Hotel or the Hyatt Place Provo for nearby accommodation.
Travel scholarships
We are offering $500 travel scholarships for law students, graduate students in linguistics, or early career professionals. If you are in one of these categories and need assistance with travel funding, please email byulawcorpus(a)law.byu.edu with a brief letter describing how you will benefit from this scholarship and attendance at LCL 2023.
Please don't hesitate to reach out if you have additional questions. We hope to see you in October!
LCL 2023 Conference Organizing Team
CALL FOR ABSTRACTS
"SUSTAINABLE ARCHIVING, EXPLOITATION, AND DISTRIBUTION OF DYNAMIC DATA FROM SOCIAL MEDIA - TWITTER AND BEYOND"
Conference from 19 - 20 March 2024 at the German National Library (Frankfurt am Main)
Website: https://www.dnb.de/EN/twittertagung
Social media is both a source of data for and the focus of a range of research approaches in the humanities, social sciences, IT, sciences and life sciences. The development of social media over time makes it a part of our digital cultural heritage, but the processes by which institutions archive and document it in ways that fully reflect its detail and complexity are still only rudimentary. One key reason for this is the unique characteristics of the data in terms of media technology, economics, social factors and aesthetics. These confront researchers, research institutions and cultural heritage institutions with many different challenges in terms of how to archive, catalogue and provide the data for later use. One example of this is Twitter (now known as “X”). The monetisation of the platform’s internal archive (part of the ongoing restructuring of the platform) has had a radical impact on research and archiving. While flexible APIs and access opportunities before early 2023 led to a boom in research activity and the creation of comprehensive collections, access for research and archiving has become increasingly difficult since then.
Archiving, cataloguing and providing dynamic data from social media present challenges which affect researchers, research institutions, libraries and archives in equal measure, and the best way to solve these problems is through collaboration and partnership. This requires wide-ranging efforts which would be impossible for a single data community or discipline. The aim of the conference is to facilitate networking between libraries, archives, research institutes and researchers in German-speaking countries who are involved in archiving and long-term use of data and digital objects from social media.
Conference presentations should focus on the following topics:
• The interaction between research and archiving
• Research data problems in Tweet-based research caused by the loss of Twitter as a data provider
• The status and maintenance of social media from an archival and cultural-historical perspective, e.g. posts, interactions and platform elements
• The consolidation of collections, corpora, and holdings such as metadata
• Initiatives to encourage archiving and cataloguing of social media data
• Concepts for providing derivative datasets from social media and how these can be used
• Ethical questions
• Legal issues
• The possibility of creating a social media data registry
Please submit your abstract (max. 1 page / 500 words) and no more than 1 page of biobibliographic data as a PDF in German or English. Up to 20 minutes are available for each presentation, plus 10 minutes for discussion.
Submission to: twarchiv(a)dnb.de
Deadline for submission of abstracts: 31 October 2023
Feedback on acceptance of papers: 30 November 2023
Conference
The conference will take place from about noon, 19 March 2024 and end in the early evening of 20 March. It will be followed by a data sprint working with a long-term corpus of German Twitter data on 21 & 22 March 2024.
More information coming soon.
Conference: 19 - 20 March 2024
Data Sprint: 21 - 22 March 2024
Venue: German National Library, Frankfurt am Main
Speakers who do not have their own resources for travel may apply for up to €300 to cover travel and accommodation costs.
Organisation: Dr. Britta Woldering, Letitia Mölck, German National Library, twarchiv(a)dnb.de
Programme Committee (in alphabetical order)
Stefan Dietze (Heinrich Heine University Düsseldorf)
Dimitar Dimitrov (GESIS)
Christoph Eggersglüß (Philipps University Marburg, NFDI4Culture)
Philippe Genêt (German National Library, Text+)
Tatjana Scheffler (Ruhr University Bochum)
Claus-Michael Schlesinger (University Library, Humboldt University Berlin)
Britta Woldering (German National Library)
Cooperation partners:
Deutsche Nationalbibliothek
BERD@NFDI
KonsortSWD
NFDI4Culture
NFDI4Data Science
Text+
---
Jun.-Prof. Dr. Tatjana Scheffler (she/her)
GB 5/157
Ruhr-Universität Bochum
Fakultät für Philologie, Germanistik
Universitätsstraße 150
44801 Bochum
Germany
Mail: tatjana.scheffler(a)rub.de
Web: http://staff.germanistik.rub.de/digitale-forensische-linguistik/
Tel.: +49 234 32-21471
Saarland University is a campus-based university with a strong
international focus and a research-oriented profile. Numerous research
institutes on campus and the systematic promotion of collaborative
projects make Saarland University an ideal environment for innovation
and technology transfer. To further strengthen this excellence in
research and teaching, the Department of Language Science and Technology
seeks to hire a
Professor (W2 with Tenure Track to W3) for Speech Science (m|f|x)
Reference n° W2283
Six-year tenure track position, starting April 2025, with the
possibility of promotion to a permanent professorship (W3).
We are looking for a highly motivated researcher in the field of
phonetics, speech science, and speech technology, with extensive
knowledge of speech production, perception and acoustics. The successful
candidate is expected to have expertise in experimental and
computational approaches to research on spoken language. A focus on
spoken dialog and conversational speech and/or multimodal aspects of
communication is particularly welcome.
The Department of Language Science and Technology is internationally
recognized for collaborative and interdisciplinary research, and the
successful candidate is expected to contribute to relevant joint
research initiatives. A demonstrated ability to attract external funding
of research projects is therefore highly desired. Phonetics, speech
science and speech technology are core elements of our study programs on
the M.Sc. and B.Sc./B.A. level, and the successful candidate is expected
to teach the associated courses within these programs.
What we can offer you:
Tenure track professors (W2) have faculty status at Saarland University,
including the right to supervise Bachelor’s, Master’s and PhD students.
The successful candidate will focus on carrying out world-class
research, will lead their own research group, and will undertake
teaching and supervision responsibilities. Tenure track professors (W2)
with outstanding performance will receive tenure as a full professor
(W3) provided a positive tenure evaluation is made. Decisions regarding
tenure are made no later than six years after taking up the tenure track
position.
The position offers excellent working conditions in a lively and
international scientific community. Saarland University is one of the
leading centers for language science and computational linguistics in
Europe, and offers a dynamic and stimulating research environment. The
Department of Language Science and Technology organizes about 100
research staff in nine research groups in the fields of computational
linguistics, psycholinguistics, phonetics and speech science, speech
processing, and corpus linguistics
(https://www.uni-saarland.de/en/department/lst.html). The department
serves as the focal point of the Collaborative Research Center 1102
"Information Density and Linguistic Encoding"
(http://www.sfb1102.uni-saarland.de). It is part of the Saarland
Informatics Campus (https://saarland-informatics-campus.de/en), which
brings together 800 researchers and 2,000 students from 81 countries and
collaborates closely with world-class research institutions on campus,
such as the Max Planck Institute for Informatics, the Max Planck
Institute for Software Systems, and the German Research Center for
Artificial Intelligence (DFKI).
Qualifications:
The appointment will be made in accordance with the general provisions
of German public sector employment law. Applicants will have a PhD or
doctorate in an appropriate subject and will have demonstrated a proven
track record of independent academic research (e.g. as a junior or
assistant professor, or by having completed an advanced, post-doctoral
research degree (Habilitation) or equivalent academic activity at a
university or research institution). They will typically have completed
a period of postdoctoral research and have teaching experience at the
university level. They must have demonstrated outstanding research
capabilities and have the potential to successfully lead their own
research group.
The successful candidate will be expected to actively contribute to
departmental research and teaching, including introductory lectures in
phonetics and phonology, speech science, as well as more advanced
lectures. The teaching language is English (in the MSc programs) and
German (in the BSc/BA programs). We expect that the successful candidate
has, or is willing to acquire within an appropriate period, sufficient
proficiency to teach in both languages.
Your Application:
Applications should be submitted online at
www.uni-saarland.de/berufungen. No additional paper copy is required.
The application must contain:
• a cover letter and curriculum vitae (including phone number and
email address)
• a full list of publications
• a full list of third-party funding (own shares shown)
• your proposed research plan (2-5 pages)
• a teaching statement (1 page)
• copies of your degree certificates
• full-text copies of your 5 most important publications
• a list of 3 academic references (including email addresses), at
least one of whom must be a person who is outside the group of your
current or former supervisors or colleagues.
Applications must be received no later than 12 October 2023. Please
include the job reference number W2283 when you apply. Please contact
crocker(a)lst.uni-saarland.de if you have any questions.
Saarland University regards internationalization as an institution-wide
process spanning all aspects of university life and it therefore
encourages applications that align with its internationalization
strategy. Members of the university's professorial staff are therefore
expected to engage in activities that promote and foster further
internationalization. Special support will be provided for projects that
continue with or expand on collaborative interactions within existing
international cooperative networks, e.g. projects with partners in the
European University Alliance Transform4Europe (www.transform4europe.eu)
or the University of the Greater Region (www.uni-gr.eu).
Saarland University is an equal opportunity employer. In accordance with
its affirmative action policy, Saarland University is actively seeking
to increase the proportion of women in this field. Qualified women
candidates are therefore strongly encouraged to apply. Preferential
consideration will be given to applications from disabled candidates of
equal eligibility.
When you submit a job application to Saarland University you will be
transmitting personal data. Please refer to our privacy notice
(https://www.uni-saarland.de/verwaltung/datenschutz/) for information on
how we collect and process personal data in accordance with Art. 13 of
the General Data Protection Regulation (GDPR). By submitting your
application, you confirm that you have taken note of the information in
the Saarland University privacy notice.
Dear all,
We are happy to announce the release of our LLMeBench framework. The
framework is designed to accelerate and simplify evaluation and
benchmarking of large language models. It is modular, language-agnostic and
simple to extend. It currently supports interactions with LLMs through
APIs. The framework also features zero- and few-shot learning settings. It
will be open-sourced to encourage improvements and extensions from the
community.
The framework currently hosts recipes for a diverse set of Arabic NLP tasks
using OpenAI’s GPT and BLOOMZ models. Specifically, it currently serves 31
unique NLP tasks (from word-level to sentence pairs tasks) with specific
focus on Arabic tasks, using 53 publicly available datasets. It also comes
equipped with 200 prompts for these setups. It has recipes for 12 languages
including Arabic, Bangla, Bulgarian, Dutch, English, French, German,
Italian, Polish, Russian, Spanish, Turkish, and more to come.
We hope this will encourage experimentation with LLMs for multilingual
studies.
We extend an invitation to the research community to participate and
improve the framework. We are excited to hear all your feedback and
suggestions and we thank you for your contribution.
For further details please take a look at the repository and the paper
below.
Code: https://github.com/qcri/LLMeBench
Paper: https://arxiv.org/pdf/2308.04945.pdf
Regards
Firoj
................
Firoj Alam, PhD
http://sites.google.com/site/firojalam/
Praxiling and the CORLI consortium are pleased to announce the conference "Untangling Associations - Advances in collocation and keyword analysis", which will take place on Friday 22 September 2023 at Paul-Valéry University Montpellier and online. Detailed information regarding the conference's description and program can be found below.
Conference description
For the past few decades, assessing the adequacy of association measures (AMs) – be it in the domain of keywords, collocations or collostructional analysis – has been one of the most important strands of corpus linguistic research. Most of the work in this area has focussed on finding the one best measure as a trade-off between statistical appropriateness and ease of computational implementation, thus reflecting the practice of corpus linguists predominantly favouring data sets based on one single AM for the sake of simplicity. This one-AM-fits-all approach, however, suffers from the fact that the most widely accepted and widespread AMs, such as the log-likelihood ratio (G2), conflate different strands of information (viz. frequency and strength of attraction), whereas data sets used to investigate association phenomena (keywords of target corpora as well as lexical or lexico-grammatical cooccurrence) should be designed to integrate several distinct dimensions, as has been recently pointed out by Stefan Th. Gries (2019, 2021).
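As a minimal illustration of this conflation (a sketch only, not taken from the conference materials), the Python snippet below computes G2 for two hypothetical 2x2 co-occurrence tables that differ only in corpus size. The function names and the counts are invented for the example, and the log odds ratio is used here simply as one possible frequency-independent measure of attraction; it is an assumption of this sketch, not necessarily the measure advocated by the speakers.

```python
import math


def g2(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table of observed counts."""
    observed = [o11, o12, o21, o22]
    n = sum(observed)
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    # Expected counts under independence: row total * column total / n
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    # Cells with an observed count of 0 contribute nothing to the sum
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


def log_odds_ratio(o11, o12, o21, o22):
    """Log odds ratio: a measure of attraction that ignores sample size."""
    return math.log((o11 * o22) / (o12 * o21))


# Hypothetical counts: the second table is the first one multiplied by 10,
# i.e. the same strength of attraction observed in a ten times larger corpus.
small = (20, 80, 80, 820)
large = (200, 800, 800, 8200)

for name, table in (("small", small), ("large", large)):
    print(name, round(g2(*table), 2), round(log_odds_ratio(*table), 4))
```

Both tables yield the same log odds ratio, while G2 for the larger table is ten times that of the smaller one: the single G2 score mixes how strongly the two items attract each other with how much data is available. Reporting the two values side by side, rather than one score, is the kind of multi-dimensional description the paragraph above calls for.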
Starting from Gries’s proposal of the approach called "tupleization", this conference will be the occasion to discuss the present state of the art and possible innovations within the realm of methodological frameworks and studies encompassing keyword, collocation or collostruction analysis. It gathers scholars from different areas of research ranging from corpus linguistics and NLP to Digital Humanities and Textometrics as practised in the French tradition of Discourse Analysis.
The conference will be held on Friday 22 September 2023 at Paul-Valéry University Montpellier (campus Saint-Charles 2, Salle des Actes 009) and online via Zoom (access link: https://univ-montp3-fr.zoom.us/j/95367941544). Attendance is free of charge.
References
Gries, Stefan Th. (2019). 15 years of collostructions: Some long overdue additions/corrections (to/of actually all sorts of corpus-linguistics measures). International Journal of Corpus Linguistics 24 (3), 385–412.
Gries, Stefan Th. (2021). A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics 9 (2), 1–33.
Program
10:00 - 11:00 Stefan Th. Gries: Tupleization in corpus linguistics: how and why
11:05 - 11:45 Martin Hilpert: Why are grammatical elements more evenly dispersed than lexical elements? A reanalysis with a new dispersion measure
11:45 - 12:25 Ludovic Lebart: Dealing with low frequencies or high discrepancies of lexical frequencies: How to adapt the tools of textual data analysis to corpora of poems and lyrics
12:30 - 14:15 Lunch
14:20 - 15:00 Bénédicte Pincemin: The Specificity Measure in Textometry: a Hermeneutic Use of the Fisher's Exact Test
15:00 - 15:40 Christof Schöch: Evaluating Measures of Keyness: A Perspective from Computational Literary Studies
15:40 - 16:20 Ludovic Tanguy & Filip Miletic: Measuring semantic specificities across corpora: looking for semantic shifts in Quebec English
16:20 - 17:00 Coffee break
17:00 - 18:00 Panel discussion
Contact: Sascha Diwersy (sascha.diwersy(a)univ-montp3.fr) and Céline Poudat (celine.poudat(a)univ-cotedazur.fr)
Anticipated Start Date: Mid-August 2024
DETAILED JOB DESCRIPTION:
The Department of Psychology at the Rochester Institute of Technology (RIT; www.rit.edu/psychology) invites candidates to apply for a tenure-track Assistant Professor position starting in August 2024. We are seeking an energetic and enthusiastic psychologist who will serve as an instructor, researcher, and mentor to students in our undergraduate (Psychology, Neuroscience) and graduate programs (Masters in Experimental Psychology, Ph.D. in Cognitive Science). We are particularly looking to build a cohort of faculty who can contribute to the interdisciplinary Ph.D. program in Cognitive Science and contribute to research, mentoring, and teaching using computational and laboratory methods. Candidates should have expertise in an area of Cognitive Science such as cognitive or behavioral neuroscience, AI, computational/psycho-linguistics, cognitive psychology, comparative psychology, or related areas. We are particularly interested in individuals whose area of research expertise expands the current expertise of the faculty. Candidates who can teach courses in natural language processing or computational modeling are especially encouraged to apply. The Department of Psychology at RIT serves a rapidly expanding student population at a technical university. The position requires a strong commitment to teaching and mentoring, active research and publication, and a strong potential to attract external funding. Teaching and research are priorities for faculty at RIT, and all faculty are expected to mentor students through advising, research and in-class experiences. The successful candidate will be able to teach courses in our undergraduate cognitive psychology track (Memory & Attention, Language & Thought, Decision Making, Judgement & Problem Solving), will be expected to teach research methods/statistics courses at the undergraduate and graduate level, and will teach and mentor students in our graduate programs.
In addition, candidates must be able to do research and work effectively within the department’s existing lab space. RIT provides many opportunities for collaborative research across the institute in many diverse disciplines such as AI, Digital Humanities, Human-Centered Computing, and Cybersecurity.
We are seeking individuals who have the ability and interest in contributing to a community committed to student-centeredness; professional development and scholarship; integrity and ethics; respect, diversity and pluralism; innovation and flexibility; and teamwork and collaboration. Select to view links to RIT’s core values<http://www.rit.edu/academicaffairs/policiesmanual/p040>, honor code<http://www.rit.edu/academicaffairs/policiesmanual/p030>, and statement of diversity.<http://www.rit.edu/academicaffairs/policiesmanual/p050>
THE COLLEGE/ DEPARTMENT:
The Department of Psychology at RIT offers B.S., M.S. degrees, Advanced Certificates, minors, immersions, electives, and a new interdisciplinary Ph.D. degree program in Cognitive Science. The B.S. degree provides a general foundation in psychology with specialized training in one of five tracks: biopsychology, clinical psychology, cognitive psychology, social psychology, and developmental psychology. The M.S. degree is in Experimental Psychology, with an Advanced Certificate offered in Engineering Psychology. We offer accelerated BS/MS programs with AI, Sustainability, and Experimental Psychology. The Ph.D. degree is in Cognitive Science and the program is broadly interdisciplinary with several partner units across the university. We also offer joint B.S. degrees in Human Centered Computing and Neuroscience.
The College of Liberal Arts is one of nine colleges within Rochester Institute of Technology. The College has over 150 faculty in 13 departments in the arts, humanities and social sciences. The College currently offers fourteen undergraduate degree programs and five Master degrees, serving over 800 students.
THE UNIVERSITY:
Founded in 1829, Rochester Institute of Technology is a diverse and collaborative community of engaged, socially conscious, and intellectually curious minds. Through creativity and innovation, and an intentional blending of technology, the arts and design, we provide exceptional individuals with a wide range of academic opportunities, including a leading research program and an internationally recognized education for deaf and hard-of-hearing students. Beyond our main campus in Rochester, New York, RIT has international campuses in China, Croatia, Dubai, and Kosovo. And with more than 19,000 students and more than 125,000 graduates from all 50 states and over 100 nations, RIT is driving progress in industries and communities around the world. Find out more at www.rit.edu.
REQUIRED MINIMUM QUALIFICATIONS:
* Have a Ph.D., or a Ph.D. expected by July 1, 2024, in cognitive psychology or a cognitive science-related specialty (e.g. linguistics);
* Have demonstrated ability to conduct independent research in psychology or closely related fields;
* Have consistently and recently published;
* Have demonstrated teaching ability and have taught college courses independently beyond TA;
* Have demonstrated ability to supervise student research;
* Demonstrate external research grant attainment potential;
* Demonstrate expertise in research and teaching in cognitive science;
* Show a career trajectory that emphasizes a balance between teaching and research;
* Show a fit with the Department of Psychology’s general mission, teaching, research, and resources;
* Have the ability to contribute in meaningful ways to the college’s continuing commitment to cultural diversity, pluralism, and individual differences.
HOW TO APPLY:
Apply online at http://careers.rit.edu/faculty; search openings, then Keyword Search 8262BR. Please submit your application, curriculum vitae, cover letter addressing the listed qualifications and upload the following attachments:
• A brief teaching philosophy
• A research statement that includes information about previous grant work, the potential for future grants, and information about one-on-one supervision of student research
• The names, addresses and phone numbers for three references
• Contribution to Diversity Statement<https://www.rit.edu/academicaffairs/facultyrecruitment/forms/Diversity_Stat…>
You can contact the chair of the search committee, Caroline DeLong, Ph.D. with questions on the position at: cmdgsh(a)rit.edu<mailto:cmdgsh@rit.edu>.
Review of applications will begin October 1, 2023 and will continue until an acceptable candidate is found.
Cecilia O. Alm, Ph.D. She/Her
Joint Program Director, MS in AI, School of Information - rit.edu/study/artificial-intelligence-ms
Director, AWARE-AI NSF Research Traineeship Program - rit.edu/nrtai/
Associate Director, Center for Human-aware AI - rit.edu/chai/
Director, Computational Linguistics and Speech Processing lab - rit.edu/clasp
Professor, Department of Psychology, CLA, Rochester Institute of Technology
Affiliated with the School of Information, Department of Computer Science, Ph.D. Program in Computing and Information Sciences, MS Program in Data Science
92 Lomb Memorial Drive
Rochester NY 14623
rit.edu/directory/coagla-cecilia-alm
W: 585-475-7327
M: 585-536-7539
coagla(a)rit.edu<mailto:coagla@rit.edu>