January 2024 - Corpora

CfP: 11th Workshop on the Representation and Processing of Sign Languages (sign-lang@LREC 2024)
by Schulder, Dr. Marc 17 Jan '24

Event: 11th Workshop on the Representation and Processing of Sign Languages (sign-lang@LREC 2024)
Deadline: 22 February 2024
Website: https://www.sign-lang.uni-hamburg.de/lrec2024/
Submission page: https://softconf.com/lrec-coling2024/signlang2024/

CALL FOR PAPERS

Submissions are invited for a full-day workshop on sign language resources, to take place on 20 May 2024 as a satellite event of LREC-COLING 2024 in Turin, Italy.

During the past years, a number of large-scale sign language corpus projects have started. Some have already been completed, but many more projects are about to start. At the same time, sign language technologies are maturing and are promising to support the time-consuming basic annotation. The workshop aims at bringing together those researchers who already work with multimodal sign language corpora (and those who see the need for empirical underpinnings of their current research) with those who develop sign language technologies. It provides the platform to compare competing approaches.

As sign language resource technologies build to a large extent on methodologies and tools used in the language resource community in general, but add very specific perspectives (e.g. no writing system established, use of video as data source) and work with a different modality of human language, sign language research is able to feed back to the language resource community at large. At the same time, as the raw data are in the visual domain, the field naturally bridges into Computer Vision. Thus, researchers use Machine Learning methods on both visual and linguistic data.

We invite submissions of papers to be presented either on stage (20 minutes plus 10 minutes discussion) or as posters (with or without demonstrations) on the following topics:

2024 SPECIAL TOPIC: EVALUATION OF SIGN LANGUAGE RESOURCES

With the field maturing, it becomes an urgent issue to assess the quality of sign language resources for a large variety of tasks. We invite contributions on both automatic and human-based evaluation procedures for all kinds of sign language resources and tools.

GENERAL ISSUES ON SIGN LANGUAGE CORPORA AND TOOLS

• Avatar technology as a tool in sign language corpora and corpus data feeding into advances in avatar technology
• Experiences in building sign language corpora
• Elicitation methodology appropriate for corpus collection
• Proposals for standards for linguistic annotation or for metadata descriptions
• Experiences from linguistic research using corpora
• Use of (parallel) corpora and lexicons in translation studies
• Language documentation and long-term accessibility for sign language data
• Annotation and visualization tools
• Linking corpora and lexicons and integrated presentation of corpus and dictionary contents
• “Internet as a corpus” for sign languages
• Sign language corpus mining
• Crowd and community sourcing for corpus work
• Multi-lingual sign language resources and connecting sign language resources to language resources for spoken languages
• FAIR, CARE and OpenScience for sign language data

In the tradition of LREC, oral/signed presentations and poster presentations (with or without demonstrations) have equal status, and authors are encouraged to suggest the presentation format best suited to communicate their ideas.

Papers (4-8 pages) of all accepted submissions to this workshop will be published as workshop proceedings on the conference website, regardless of whether you give a poster or an oral/signed presentation. The workshop does not differentiate between long, short, or position papers.

Please submit your paper through the LREC START system (https://softconf.com/lrec-coling2024/signlang2024/) no later than 22 February 2024, indicating whether you prefer an oral/signed presentation, a poster presentation or a poster presentation with demo. Unlike the main conference, the workshop will be reviewed single-blind, so submissions SHOULD NOT BE ANONYMOUS.

ATTENTION

Please note that you are expected to submit the full paper, not an extended abstract as in previous years!

IMPORTANT DATES

• Deadline for submissions: 22 February 2024 (11:59PM UTC-12:00 “anywhere on Earth”)
• Notification of acceptance: 22 March 2024
• Early bird registration ends: tbd
• Camera ready version of the paper (for both oral/signed presentations and posters): 8 April 2024
• Submission of slides for interpreters' preparation (oral/signed presentations only): 10 May 2024
• This workshop: 20 May 2024
• LREC main conference: 22–24 May 2024
• LREC workshops: 20, 21 & 25 May 2024
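For authors working out what the "anywhere on Earth" submission deadline means in their own time zone, the conversion can be sketched in a few lines of Python. This is an illustrative sketch only and not part of the original call; the variable names are made up for the example, and it assumes AoE is treated as the fixed offset UTC-12:00, as the call states.

```python
from datetime import datetime, timedelta, timezone

# "Anywhere on Earth" (AoE) is the fixed offset UTC-12:00.
AOE = timezone(timedelta(hours=-12), name="AoE")

# Submission deadline as stated in the call: 22 February 2024, 11:59 PM AoE.
deadline_aoe = datetime(2024, 2, 22, 23, 59, tzinfo=AOE)

# The same instant expressed in UTC.
deadline_utc = deadline_aoe.astimezone(timezone.utc)

print(deadline_utc.isoformat())  # 2024-02-23T11:59:00+00:00
```

In other words, the deadline falls at 11:59 UTC on 23 February 2024; passing a different `timezone` (or `zoneinfo.ZoneInfo`) object to `astimezone` gives the local equivalent.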

Deadline extension for ARR commitments for UncertaiNLP@EACL2024
by Tiedemann, Jörg 17 Jan '24

Due to the delay in EACL notifications, we are extending the deadline for ARR commitments to the First Workshop on Uncertainty-Aware NLP (UncertaiNLP) to January 20.

More info on the workshop: https://uncertainlp.github.io/
ARR commitment link: https://openreview.net/group?id=eacl.org/EACL/2024/Workshop/UncertaiNLP_ARR…

We welcome submissions of papers that did not receive a positive decision from the main EACL conference, as well as accepted Findings papers whose authors would like to apply for a presentation slot in the workshop.

--
Jörg Tiedemann
University of Helsinki
https://blogs.helsinki.fi/language-technology/

Deadline extension for ARR commitments for MOOMIN@EACL2024
by Tiedemann, Jörg 17 Jan '24

Due to the delay in EACL notifications, we are extending the deadline for ARR commitments to the First Workshop on Modular and Open Multilingual NLP (MOOMIN) to January 20.

More info on the workshop: https://moomin-workshop.github.io/
ARR commitment link: https://openreview.net/group?id=eacl.org/EACL/2024/Workshop/MOOMIN_ARR_Comm…

We welcome submissions of papers that did not receive a positive decision from the main EACL conference, as well as accepted Findings papers whose authors would like to apply for a presentation slot in the workshop.

--
Jörg Tiedemann
University of Helsinki
https://blogs.helsinki.fi/language-technology/

FreeTxt app launch
by Rayson, Paul 17 Jan '24

We’re excited to launch FreeTxt (TestunRhydd in Welsh), a free bilingual online toolkit for analysing and visualising free-text data (from surveys, questionnaires etc.) in English and Welsh.

FreeTxt draws on some of the corpus-based utilities and methodologies from CorCenCC and ACC (Welsh Automatic Text Summarisation), repackaging these to enable new audiences and user groups to analyse their own feedback data. Co-designed in collaboration with National Trust Wales, Museum Wales, Cadw, WJEC, and the National Centre for Learning Welsh, FreeTxt is accessible to anyone in any sector in Wales and beyond.

FreeTxt:

* indicates if your data is positive and/or negative (sentiment analysis) and provides downloadable visualisations of results.
* allows you to explore/visualise common words, phrases and themes in your data (in tables, word clouds etc.).
* enables you to summarise free-text data, and examine word use and relationships.

FreeTxt is available as open source under an Apache 2.0 licence (https://github.com/UCREL/FreeTxt-Flask), and via a hosted web demo interface at www.freetxt.app <http://www.freetxt.app/>. It incorporates other open source tools from our previous projects such as CyTag (Welsh POS tagger), a Welsh summariser, and PyMUSAS (for English and Welsh); see https://www.freetxt.app/about for more details.

FreeTxt was developed as part of the AHRC-funded collaborative research project 'FreeTxt supporting bilingual free-text survey and questionnaire data analysis', involving colleagues from Cardiff University and Lancaster University (Grant Number AH/W004844/1). The team included PI - Dawn Knight; CIs - Paul Rayson, Mo El-Haj; RAs - Ignatius Ezeani, Nouran Khallaf and Steve Morris. The Project Advisory Group included representatives from: National Trust Wales, Cadw, Museum Wales, CBAC | WJEC and the National Centre for Learning Welsh.

--
Paul Rayson
Director of UCREL and Professor of Natural Language Processing
SCC Data Theme Lead
School of Computing and Communications, InfoLab21, Lancaster University, Lancaster, LA1 4WA, UK.
Web: https://www.research.lancs.ac.uk/portal/en/people/Paul-Rayson/
Tel: +44 1524 510357

leave
by Imadedine Jerbi 16 Jan '24

leave -- *Imadedine Jerbi* CTO at we.CONECT Global Leaders GmbH <http://www.we-conect.com/en/> m: +49 015214121744 <+4915214121744> | w: it-scaleup.com <https://www.it-scaleup.com/>

CODI workshop: deadline extension for direct submissions
by Chloé Braud 16 Jan '24

CODI, 5th Workshop on Computational Approaches to Discourse
2024-03-21 - EACL 2024 - Malta

** Direct submission deadline: January 22nd, 2024 **

Direct submission: Submissions are now open for papers rejected at another main conference. The deadline has been updated to account for the delay in EACL notifications. Note that notifications will be sent on January 25 for direct submissions, and camera-ready versions will be due on January 30.

Website link: https://sites.google.com/view/codi2024

CODI considers for publication papers rejected at one of the main conferences. Authors will have to submit both the paper and the reviews as a supplementary PDF file. If modifications have been made since the original submission, please submit an additional file briefly describing the modifications made. The organizers will decide on the acceptance of the papers based on the quality of the paper and its fit with the workshop.

As a reminder, CODI also invites presentations of papers accepted at another main conference. They will be included in the workshop program and handbook, but will not appear in the workshop proceedings.

Please submit your workshop papers (category: "direct submission") at https://softconf.com/eacl2024/CODI-2024/

January 2023 Newsletter - LDC
by Penn LDC 16 Jan '24

In this newsletter:

Renew your LDC membership today

New publications:
KASET - Kurmanji and Sorani Kurdish Speech and Transcripts <https://catalog.ldc.upenn.edu/LDC2024S01>
LORELEI Farsi Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2024T01>

________________________________

Renew your LDC membership today

The importance of curated resources for language-related education, research, and technology development drives LDC's mission to create them, to accept data contributions from researchers across the globe, and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 950+ holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 1, 2024, 2023 members receive a 10% discount on 2024 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC <https://www.ldc.upenn.edu/communications/newsletter/january-2022-newsletter> for more details on membership options and benefits.

________________________________

New publications:

* KASET - Kurmanji and Sorani Kurdish Speech and Transcripts <https://catalog.ldc.upenn.edu/LDC2024S01> consists of 147 hours of telephone conversations (289 recordings) and broadcast news (410 recordings) in two Kurdish dialects, Kurmanji Kurdish and Sorani Kurdish, along with transcripts covering 60 hours of those recordings.

Kurdish is spoken primarily in Turkey, Iran, Iraq, and Syria. Sorani and Kurmanji are the two widely spoken dialects of the Kurdish language.

The telephone speech was generated from calls by native Kurdish speakers in the United States to North American acquaintances in their social network. The broadcast news audio was collected from multiple streaming radio and television broadcast programs (narrowband and wideband audio), many of which contained a mix of Kurmanji and Sorani Kurdish.

Native speaker auditors identified a 5-10 minute span from each broadcast recording for transcription. Full telephone recordings that passed the native speaker audit were transcribed. This release includes speaker information, such as gender, year of birth, and language.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

* LORELEI Farsi Representative Language Pack <https://catalog.ldc.upenn.edu/LDC2024T01> was developed by LDC and is comprised of approximately 250 million words of Farsi monolingual text, 120,000 Farsi words translated from English data, and 751,000 words of found Farsi-English parallel text. Approximately 75,000 words were annotated for named entities and up to 22,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs, and issues). Data was collected from discussion forum, news, reference, social network, and weblog sources.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10) <https://catalog.ldc.upenn.edu/LDC2020T10>.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC account <https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium <ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc(a)ldc.upenn.edu
M: 3600 Market St. Suite 810 Philadelphia, PA 19104

Shared Task on Fine-Tuning LLMs for Ukrainian (UNLP @ LREC-Coling 2024)
by Mariana Romanyshyn 16 Jan '24

*The Third Ukrainian Natural Language Processing Workshop (UNLP 2024)* <https://unlp.org.ua/>

UNLP 2024 features the first *Shared Task on Fine-Tuning Large Language Models for Ukrainian*. This Shared Task aims to challenge and assess LLMs' capabilities to understand and generate Ukrainian, paving the way for LLM development in Slavic languages.

*Task Description*

In this shared task, your goal is to instruction-tune a large language model that can answer questions and perform tasks in Ukrainian. The model should possess knowledge of Ukrainian history, language, and literature, as well as common knowledge, and should be capable of generating fluent and factually accurate responses. The evaluation will be two-fold: accuracy of answers to multiple-choice questions and human evaluation on a selection of text generation tasks.

You can find the instructions, sample data, and scripts at https://github.com/unlp-workshop/unlp-2024-shared-task.

*Registration*

Teams that intend to participate should register by filling in this form <https://forms.gle/MiC7pWsWbwBdSmoX9>.

*Publication*

Participants in the shared task are invited to submit a paper to the UNLP 2024 <https://unlp.org.ua/call-for-papers/> workshop. Submitting a paper is *not mandatory* for participating in the Shared Task. Papers must follow the workshop submission instructions and will undergo regular peer review. Their acceptance will not depend on the results obtained in the shared task, but on the quality of the paper. Accepted papers will appear in the ACL Anthology and will be presented at a session of UNLP 2024 specially dedicated to the Shared Task.

*Important Dates*

January 15, 2024: Shared task announcement
February 15, 2024: Release of test data to registered participants
February 22, 2024: Registration deadline
February 23, 2024: Submission of system responses
March 1, 2024: Shared Task paper due
March 7, 2024: Results of the Shared Task announced
March 29, 2024: Notification of acceptance
TBD (mid-April): Camera-ready Shared Task papers due
May 25, 2024: Workshop date

*Contact*

Discord for the shared task: https://discord.gg/kCc6xgWbCJ
Email: info(a)unlp.org.ua
Website: https://unlp.org.ua/
Twitter: https://twitter.com/UNLP_workshop
Telegram: https://t.me/UNLP_workshop
Facebook: https://www.facebook.com/UNLPworkshop

TEICAI workshop-direct submission from EACL
by Nina Hosseini-Kivanani 16 Jan '24

TEICAI, Towards Ethical and Inclusive Conversational AI: Language Attitudes, Linguistic Diversity, and Language Rights, at EACL 2024 in Malta, March 17-22, 2024.

Workshop website: https://sites.google.com/view/teicai2024
Submission link: https://softconf.com/eacl2024/TEICAI-2024/
Submission deadline: 17 January 2024 (anywhere on Earth)

Submissions are now being accepted for papers that were previously rejected at a major conference. Authors are required to submit their paper alongside the reviews it received, provided as a supplementary PDF file. In cases where the paper has undergone revisions since its original submission, authors should also include a separate file briefly outlining the changes made. The acceptance of these papers for the workshop will be determined by the organizers, based on the paper's quality and relevance to the workshop's theme.

We are also pleased to announce that our sponsor, e-COST ACTION Language in the Human-Machine Era (LITHME), is offering two to three travel grants for authors of selected accepted papers. More information about LITHME can be found at https://lithme.eu/.

Workshop Organizers:

Sviatlana Höhn, LuxAI, Luxembourg
Nina Hosseini-Kivanani, Faculty of Science, Technology and Medicine (FSTM), University of Luxembourg, Luxembourg
Dimitra Anastasiou, Luxembourg Institute of Science and Technology, Luxembourg
Angela Soltan, State University of Moldova, Moldova
Bettina Migge, University College Dublin, Ireland
Doris Dippold, University of Surrey, UK
Fred Philippy, Zortify, Luxembourg
Ekaterina Kamlovskaya, Translatables

Program Committee: A list of program committee members is available on the workshop website.

For any preliminary questions, you're welcome to reach out to teicai2024(a)gmail.com. You can follow us on LinkedIn (TEICAI) and Twitter (teicai2024) to get more updates about the workshop.

On behalf of the organizers,
Nina Hosseini-Kivanani
University of Luxembourg

Second Call for Participation- IWSLT 2024
by Atul K. Ojha 16 Jan '24

Apologies for cross-posting.

----------------------------------------

*The International Conference on Spoken Language Translation*
*21st IWSLT 2024: Second Call for Participation*
*August 15-16, 2024, Bangkok, Thailand*
http://iwslt.org

The International Conference on Spoken Language Translation (IWSLT) is the premier annual conference for all aspects of Spoken Language Translation. Every year, the conference organizes and sponsors open evaluation campaigns around key challenges in simultaneous and consecutive translation, under real-time/low latency or offline conditions and under low-resource or multilingual constraints. System descriptions and results from participants’ systems and scientific papers related to key algorithmic advances and best practices are presented. IWSLT is the venue of SIGSLT, the Special Interest Group on Spoken Language Translation of ACL, ISCA and ELRA. With a track record of 20 years, IWSLT benchmarks and proceedings serve as a reference for all researchers and practitioners working on speech translation and related fields.

The 21st edition of IWSLT <https://iwslt.org/2024/> will be run as an *ELRA/ACL* event and co-located with ACL 2024 <https://2024.aclweb.org/> on August 15-16, 2024. It will be run as a hybrid event.

Important Dates

January 15, 2024: Release of shared task training and dev data
April 01-15, 2024: Evaluation period
April 29, 2024: Paper submission due (all papers)
June 4, 2024: Notification of acceptance
June 24, 2024: Camera-ready paper due
July 22, 2024: Pre-recorded video due
August 15-16, 2024: Conference

Evaluation

IWSLT 2024 features shared tasks <https://iwslt.org/2024/#shared-tasks> that address the following focus areas:

- Speech-to-speech track
- Simultaneous track
- Subtitling track
- Offline track
- Dubbing track
- Low-resource track
- Indic track

Training, development and test data for each shared task will be prepared and released by the respective organizers (for further information on this initiative, please refer to the website <https://iwslt.org/2024/>). Participants will receive instructions about how to submit their runs. In addition, participants have the opportunity to present their work through a system paper that will be published in the ACL Proceedings.

Conference

IWSLT also invites submissions of scientific papers to be published in the ACL Proceedings and presented either in oral or poster format. The conference selects high-quality, original contributions on theoretical and practical issues of spoken language translation research, technologies and applications. For further information on this initiative, please refer to the website <https://iwslt.org/2024/#paper-submission>.

Contact

Please send an email to iwslt-evaluation-campaign(a)googlegroups.com if you have any questions related to the shared tasks.

Thanks,
Marine, Marcello, Alex, Jan, Sebastian, Elizabeth, Atul
(IWSLT organisers)
