October is back and so are HPLT datasets (we've been doing this for
three consecutive years now!)
This time, we announce the release of the massive HPLT v3.0 multilingual
dataset, which can be considered a major upgrade for large-scale
multilingual corpora.
Accounting for 29 billion documents, 198 language-script combinations
and 112 trillion characters, v3.0 shows significant gains over v2,
driven by several improvements, including a new global deduplication
process:
- Unique content boosted from 52% to 73% on average.
- Data substance and robustness remain high, with better extraction and
improved language identification.
- Increased variety and better representativeness of natural web
content.
This release provides a cleaner, more robust dataset for building
powerful LLMs and machine translation systems, including a myriad of
low- to medium-resourced languages. And we have not said our last word:
wait for more data soon because we are already working on it.
Special thanks to all the collaborators and funding bodies, including
the European Union's Horizon Europe program and UK Research and Innovation.
Explore the data and see the full analysis and evaluation highlights on
our website:
https://hplt-project.org/datasets/v3.0
--
Andrey
Language Technology Group (LTG)
University of Oslo
Dear colleagues,
We are sharing the following details about the online event Corpus linguistics & applied linguistics research 2025, hosted by the University of Murcia from 3 to 27 November 2025.
This year’s talks will focus on the impact of AI on corpus linguistics.
Speakers:
* Dr Lisa Cheung, The University of Hong Kong, & Dr Peter Crosthwaite, The University of Queensland — 3 Nov
* Prof Qing Ma, The Education University of Hong Kong — 11 Nov
* Prof Laurence Anthony, Waseda University — 25 Nov
* Prof Atsushi Mizumoto, Kansai University, Osaka — 27 Nov
Info: https://www.um.es/languagecorpora/clresearch2025/
The talks will take place via Zoom at 11:00 a.m. (Madrid) / 5:00 p.m. (Hong Kong).
Registration link: https://umurcia.zoom.us/webinar/register/WN_NAQ8nFTgSCO2obOFNNgc1A#/registr…
For this edition, attendees who request it will receive a certificate of participation.
Corpus linguistics & applied linguistics research 2025 is organized by the University of Murcia research group E020-07 “Lenguajes de especialidad, corpus lingüísticos y lingüística inglesa aplicada a la ingeniería del conocimiento”, with support from the Faculty of Arts, the Department of English Philology at the University of Murcia, and The Education University of Hong Kong.
Follow updates on X: @languagecorpora
Watch talks from previous editions: https://www.youtube.com/@corporaappliedlinguistics8358/playlists
Kind regards from the organizing committee,
Pascual Pérez-Paredes
https://webs.um.es/pascualf
***Apologies for possible cross-posting ***
First Call for Papers: 6th International Workshop on Computational
Approaches to Historical Language Change (LChange’26)
Co-located with EACL 2026, Rabat, Morocco & Online | March 24–29, 2026
📌 Website: https://www.changeiskey.org/event/2026-eacl-lchange/
📧 Contact: lchange(a)changeiskey.org
== About the Workshop ==
The LChange workshop brings together researchers interested in
computational modeling of language change — both historical and
synchronic. Following the success of LChange in 2019, 2021, 2022, 2023
and 2024, this sixth edition will be held as a hybrid half-day workshop
at the EACL 2026 conference in Rabat.
We welcome contributions addressing all aspects of computational
approaches to language change. Our goal is to foster dialogue on
state-of-the-art computational methodologies, resources, and theories
that explore the dynamic, time-varying nature of language.
In addition to paper presentations and keynotes, we offer a mentorship
program for students to engage with experienced researchers, regardless
of whether they are submitting a paper or not.
== Important Dates (tentative) ==
Direct Submission deadline: December 19, 2025
Pre-reviewed (ARR) submission deadline: January 2, 2026
Notification of acceptance: January 23, 2026
Camera-ready paper due: February 3, 2026
Workshop dates: March 24-29, 2026
== Submission Information ==
We accept the following types of submissions:
- Long papers: up to 8 pages (+ references)
- Short papers: up to 4 pages (+ references). Dataset and model
release papers should be submitted as short papers.
Final versions will be given one additional page of content so that
reviewers' comments can be taken into account.
== Review Process ==
Papers must be submitted anonymously.
All submissions will undergo double-blind peer review by at least three
reviewers, with final acceptance decisions made by the workshop organizers.
Accepted papers will be published in the workshop proceedings and
presented orally or as posters.
Call for reviewers: If you have published in the field previously and
are interested in helping out in the program committee to review papers,
please email us at lchange(a)changeiskey.org!
== Topics of Interest ==
We invite original research papers on (but not limited to):
- Novel methods for detecting diachronic semantic change and lexical
replacement
- Automatic discovery and quantitative evaluation of laws of language
change
- Computational theories and generative models of language change
- Sense-aware (semantic) change analysis
- Diachronic word sense disambiguation
- Novel methods for diachronic analysis of low-resource languages
- Novel methods for diachronic linguistic data visualization
- Novel applications and implications of language change detection
- Quantification of sociocultural influences on language change
- Cross-linguistic, phylogenetic, and developmental approaches to
language change
- Novel datasets for cross-linguistic and diachronic analyses of language
== Organizers ==
Nina Tahmasebi, University of Gothenburg
Pierluigi Cassotti, University of Gothenburg
Syrielle Montariol, UC Berkeley, École polytechnique fédérale de Lausanne
Andrey Kutuzov, University of Oslo
Netta Huebscher, University of Gothenburg
Elena Spaziani, Sapienza University of Rome
Naomi Baes, University of Melbourne
--
Andrey
Language Technology Group (LTG)
University of Oslo
Dear Colleagues,
We are looking for a PhD student in Multimodal AI for Proactive Herd Health and Dairy Farm Management (CLÁR Project) at the School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, UK. This fully funded PhD studentship is supported by SUSTAIN (https://www.sustain-cdt.ai/), the UKRI Centre for Doctoral Training in Sustainable Understandable agri-food Systems Transformed by Artificial Intelligence. SUSTAIN empowers the next generation of AI scientists to invent, develop, and deploy technologies co-created with growers and agri-food practitioners, facilitating meaningful partnerships between academia and industry.
The successful candidate will work with our multidisciplinary supervisory team and our industry partner, CattleEye, gaining access to global real-world data and industry expertise. This project offers a unique opportunity to develop a skill set in AI applied to globally significant sustainability challenges in dairy farming.
If you have a background in NLP, Data Science, ML/DL, or Computer Vision and are motivated to apply your expertise to real-world sustainability challenges in agriculture, this could be a fantastic opportunity for you.
Location: Queen’s University Belfast, UK
Eligibility: Open to students worldwide
Funding: Fully funded PhD studentship (tuition fees, tax-free stipend, Research Training Support Grant, and additional development support)
Deadline: Friday, October 17, 2025
More details: https://www.findaphd.com/phds/project/project-q3138-cl-r-multimodal-ai-for-…
For enquiries, contact m.hasanuzzaman(a)qub.ac.uk with your CV.
Best,
M
Dear colleagues,
We are looking for a PhD candidate in Emotionally and Socially Aware Natural Language Processing at LIACS, Leiden University<https://liacs.leidenuniv.nl/>. The PhD will be supervised by myself, Prof. Suzan Verberne<https://www.universiteitleiden.nl/en/staffmembers/suzan-verberne#tab-1>, and Prof. Joost Broekens<https://www.universiteitleiden.nl/en/staffmembers/joost-broekens#tab-1>. The position is part of the Human-AI cluster<https://www.universiteitleiden.nl/en/science/computer-science/research/huma…>, a great environment for interdisciplinary research where AI and machine learning meet philosophy, cognitive science, and the creative arts.
This PhD position focuses on advancing AI models that don’t just optimize for accuracy, but also recognize and respond to emotions responsibly and adapt to social context. Current systems often reproduce or amplify social biases, generate toxic content, or fail to respond safely to emotional cues. The goal of this PhD is to design AI systems that promote inclusivity, fairness, and emotional intelligence in human-AI interaction, with a particular focus on applications in mental well-being, education, and other socially sensitive contexts where how AI interacts with people has a big impact.
The deadline to apply is November 17, 2025.
Details on the position and the application procedure can be found in the job ad: https://www.universiteitleiden.nl/en/vacancies/2025/q4/16038-phd-candidate-…
Please send me an email if you have any questions regarding the position.
Best regards,
Flor Plaza
------------------------------------------------
Flor Miriam Plaza del Arco, Ph.D.
Assistant Professor
Leiden Institute of Advanced Computer Science (LIACS), Human AI
Leiden University
Office: BM 3.03, Gorlaeus Gebouw – BE-vleugel
Einsteinweg 55, 2333 CC Leiden, Netherlands
🌐 Website<https://fmplaza.github.io/> | 🦋 BlueSky<http://florplaza.bsky.social/>
The TALN team (Natural Language Processing, https://taln.ls2n.fr/) of the LS2N laboratory (https://www.ls2n.fr/, Nantes Université) is offering 5 research internships at the Master 2 level (duration: 5 to 6 months), starting in February 2026.
These internships will take place within the research themes of the team on Natural Language Processing (NLP) for specialized domains, in particular healthcare and science, with a focus on the study and adaptation of large language models (LLMs).
Possible topics include:
- Study and evaluation of LLMs in limited contexts (specialized data, constrained domains);
- Recommendation and navigation systems to explore the scientific literature;
- Text revision and scientific writing assistance tools;
- Verification of claims in scientific articles, including text/image multimodality (in collaboration with the IPI team);
- Clinical text analysis for estimating patient autonomy scores.
Candidate profile: Master’s students in computer science, specializing in Artificial Intelligence (AI), Natural Language Processing (NLP), machine learning, or related fields, with solid programming skills (Python) and a strong interest in research.
Contacts:
- Florian Boudin - florian.boudin(a)inria.fr
- Richard Dufour - richard.dufour(a)univ-nantes.fr
WoLaLa 2025 (1st International Workshop on Language and Language Models), an international, English-language workshop, is approaching.
The event is organized by the Research Centre for Linguistics, Eötvös Loránd University (ELTE) and will take place in Budapest on November 20–21, 2025.
Invited keynote speakers:
· Erhard Hinrichs (University of Tübingen, Germany)
· Alessandro Lenci (University of Pisa, Italy)
· András Kornai (HUN-REN SZTAKI & BME, Hungary)
The goal of the workshop is to provide a forum for presenting and discussing research at the intersection of linguistics and artificial intelligence, highlighting the latest results and challenges.
Registration fees:
· Early registration (until October 6, 2025): 70,000 HUF
· Standard registration (after October 6): 85,000 HUF
· Non-academic / industry participants: 100,000 HUF
For the workshop website, detailed program, and registration form, please visit:
👉 https://wolala.nytud.hu
If you have any questions, feel free to contact us at info(a)wolala.nytud.hu
🚨 Call for Papers: WSLP 2025 – Workshop on Sign Language Processing 🚨
We are excited to announce the Workshop on Sign Language Processing (WSLP 2025), co-located with IJCNLP–AACL 2025, to be held at Victor Menezes Convention Centre, IIT Bombay, Mumbai, India on December 24th, 2025. 🌏
This one-day event will provide a dedicated forum to present and discuss advances in sign language processing and translation, with a focus on low-resource and underrepresented sign languages. Alongside paper presentations, WSLP 2025 will feature shared tasks and invited talks by leading researchers.
🔑 Important Links
📌 Workshop Website: https://exploration-lab.github.io/WSLP
📄 CFP & Submissions (OpenReview): https://openreview.net/group?id=aclweb.org/AACL-IJCNLP/2025/Workshop/WSLP
📝 ARR Commitment: https://openreview.net/group?id=aclweb.org/AACL-IJCNLP/2025/Workshop/WSLP_A…
📚 Shared Tasks: https://exploration-lab.github.io/WSLP/task/
💬 Discord (Workshop): https://discord.gg/jP7j4NmUE4
💬 Discord (Shared Task): https://discord.gg/su2rRxSjkY
🗓️ Key Dates (AoE)
Paper Submission Deadline (Extended): October 28, 2025
ARR Commitment Deadline: Oct 27, 2025
Notification of Acceptance: Nov 3, 2025
Camera-Ready Papers Due: Nov 11, 2025
Workshop: Dec 24, 2025
🔍 Topics of Interest (but not limited to)
Low-resource & underrepresented sign languages
Continuous & gloss-free SLT
Multilingual, cross-modal & zero-shot SLT
Cultural & dialectal variation in sign language
Corpus creation, fairness, ethics & real-world applications
🧑‍🤝‍🧑 Shared Tasks at WSLP 2025
Indian Sign Language to English Translation
Isolated Sign (Gloss) Recognition
Word Presence Prediction in Sign Videos
📢 We strongly encourage submissions that focus on underrepresented sign languages and communities.
Call for Papers | ACM WebSci’26
May 21-24, 2026 | TU Braunschweig | https://websci26.org/
Important Dates
* December 10, 2025: Paper submission
* February 4, 2026: Notification
* February 28, 2026: Camera-ready versions due
* May 26-29, 2026: Conference dates
About the Web Science Conference
Web Science is an interdisciplinary field dedicated to understanding the complex and multiple impacts of the Web on society and vice versa. The interdisciplinary field is well situated to address pressing issues of our time by incorporating various scientific approaches. We welcome quantitative, qualitative, and mixed methods research, including techniques from the social sciences and computer science. In addition, we are interested in work exploring Web-based data collection, research ethics, and emerging methods. We also encourage studies that combine analyses of Web data and other types of data (e.g., from surveys or interviews) to help better understand user behavior online and offline.
Theme for Web Science 2026: Managing Risks in the Era of Generative AI - How 20 Years of Web Science Research Can Help
Web content is influencing human experiences more than ever before. The rapid deployment of artificial intelligence (including large language models) has created new risks for humans in the digital environment. These risks include custom-crafted misinformation at scale, realistic AI-generated harmful content and deepfakes, as well as fraudulent activities and scams becoming more effective thanks to AI. Trust and community have been eroded during this current era of the Web, and researching means to manage these risks on the Web is as essential as ever. The Web Science community has looked at this complex socio-technical system for 20 years, exploring its structure, dynamics, and impact on society. This year’s conference especially encourages contributions investigating the risks for society on the Web in the presence of artificial intelligence. Additionally, we welcome papers on a wide range of topics at the heart of Web Science.
In 2026, we will also be able to allocate a limited amount of funding for student travel provided by SIGWEB and WebIST.
Possible topics include, but are not limited to:
Understanding the Web
Trends in globalization and fragmentation of the Web
The architecture, philosophy, and evolution of the Web
Automation and AI in all its manifestations relevant to the Web
The interrelationship between the structure of the web and social behavior
Critical analyses of the Web and Web technologies
The spread of large models on the web
Making the Web Inclusive
Issues of discrimination and fairness
Intersectionality and design justice in questions of marginalization and inequality
Ethical challenges of technologies, data, algorithms, platforms, and people on the Web
Safeguarding and governance of the Web, including anonymity, security, and trust
Inclusion, literacy, and the digital divide
Human-centered security and robustness on the Web
The Web and Everyday Life
Social machines, crowd computing, and collective intelligence
Web economics, social entrepreneurship, and innovation
Legal and policy issues, including rights and accountability for the AI industry
The creator economy: Humanities, arts, and culture on the Web
Politics and social activism on the Web
Relationships, organization, and social interaction on the Web
Online education and remote learning
Health and well-being online
Social presence in online professional event spaces
The Web as a source of news and information
Doing Web Science
Data curation, Web archives, and stewardship in Web Science
Temporal and spatial dimensions of the Web as a repository of information
Analysis and modeling of human and automatic behavior (e.g., bots)
Analysis of online social and information networks
Detecting, preventing, and predicting anomalies in Web data (e.g., fake content, spam)
Novel analysis techniques for Web and social network analysis
Recommendation engines and contextual adaptation for Web tasks
Web-based information retrieval and information generation
Supporting heterogeneity across modalities, sensors, and channels on the Web
User modeling and personalization approaches on the Web
Format of the submissions
Please upload your submissions via EasyChair: https://easychair.org/conferences/?conf=websci26
There are two submission formats:
Full papers should be between 6 and 10 pages (including references, appendices, etc.). Full papers typically report on mature and completed projects.
Short papers should be up to 5 pages (including references, appendices, etc.) and primarily report on high-quality ongoing work that is not mature enough for a full-length publication.
All papers should adopt the current ACM SIG Conference proceedings template (acmart.cls). Please submit papers as PDF files using the ACM template, either in Microsoft Word format (available at https://www.acm.org/publications/proceedings-template under “Word Authors”) or with the ACM LaTeX template on the Overleaf platform, which is available at https://www.overleaf.com/latex/templates/association-for-computing-machiner…. In particular, please ensure that you are using the two-column version of the appropriate template.
All contributions will be judged by at least three Program Committee referees, based on rigorous peer review standards for quality and fit for the conference. Additionally, each paper will be assigned to a Senior Program Committee member to ensure review quality.
WebSci-2026 review is double-blind. Therefore, please anonymize your submission: do not put the author(s)' names or affiliation(s) at the start of the paper, and do not include funding or other acknowledgments in papers submitted for review. References to the authors’ own prior relevant work should be included, but should not specify that this is the authors’ own work. It is up to the authors’ discretion how much to further modify the body of the paper to preserve anonymity. The requirement for anonymity does not extend outside the review process, e.g., the authors can decide how widely to distribute their papers over the Internet. Even in cases where the author’s identity is known to a reviewer, the double-blind process will serve as a symbolic reminder of the importance of evaluating the submitted work on its own merits without regard to the author’s reputation.
Authors who wish to opt out of publication proceedings will be given this option upon acceptance. This will encourage the participation of researchers from the social sciences who prefer to publish their work as journal articles. All authors of accepted papers (including those who opt out of proceedings) are expected to present their work at the conference.
ACM Publication Policies
1. By submitting your article to an ACM Publication, you acknowledge that you and your co-authors are subject to all ACM Publications Policies, including ACM’s new Policy on Research Involving Human Participants and Subjects<https://www.acm.org/publications/policies/research-involving-human-particip…>. Alleged violations of this policy or any ACM Publications Policy will be investigated by ACM and may result in a full retraction of your paper, in addition to other potential penalties, as per ACM Publications Policy.
2. Please ensure that you and your co-authors obtain an ORCID ID<https://orcid.org/register> to complete the publishing process for your accepted paper. ACM has been involved in ORCID from the start, and we have recently made a commitment to collect ORCID IDs from all of our published authors<https://authors.acm.org/author-resources/orcid-faqs>. The collection process started in 2022 and will be a requirement. We are committed to improving author discoverability, ensuring proper attribution, and contributing to ongoing community efforts around name normalization; your ORCID ID will help in these efforts.
3. For guidelines on the use of generative AI tools, please refer to https://www.acm.org/publications/policies/frequently-asked-questions
Important update on ACM's new open access publishing model for 2026 ACM Conferences!
Starting January 1, 2026, ACM will fully transition to Open Access. All ACM publications, including those from ACM-sponsored conferences, will be 100% Open Access (https://www.acm.org/publications/openaccess).
Authors will have two primary options for publishing Open Access articles with ACM: the ACM Open institutional model or by paying Article Processing Charges (APCs). With over 1,800 institutions already part of ACM Open, the majority of ACM-sponsored conference papers will not require APCs from authors or conferences (currently, around 70-75%).
Authors from institutions not participating in ACM Open must pay an APC to publish their papers, unless they qualify for a financial or discretionary waiver. To find out whether an APC applies to your article, please consult the list of participating institutions<https://libraries.acm.org/acmopen/open-participants> in ACM Open and review the APC Waivers and Discounts Policy<https://www.acm.org/publications/policies/policy-on-open-access-apc-waivers…>. Remember that waivers are rare and are granted based on specific criteria set by ACM.
Understanding that this change could present financial challenges, ACM has approved a temporary subsidy for 2026 to ease the transition and allow more time for institutions to join ACM Open. The subsidy will offer:
$250 APC for ACM/SIG members
$350 for non-members
This represents a 65% discount<https://www.acm.org/publications/openaccess>, funded directly by ACM. Authors are encouraged to help advocate for their institutions to join ACM Open during this transition period.
You can find an FAQ here: Open Access Model for ACM and SIG Sponsored Conferences: Frequently Asked Questions<https://www.acm.org/publications/open-access-model-for-acm-and-sig-sponsore…>, and more information here: Open Access Publication & ACM<https://www.acm.org/publications/openaccess>
Program Committee Chairs:
Gianluca Demartini (The University of Queensland, Australia)
Stefan Dietze (Heinrich-Heine-University Düsseldorf & GESIS, Germany)
Jen Golbeck (University of Maryland, USA)
For any questions and queries regarding the paper submission, please contact the chairs at websci26(a)easychair.org.
The Corpling lab <https://gucorpling.org/corpling/> in the Department of Linguistics <https://linguistics.georgetown.edu/> at Georgetown University (Washington, DC), directed by Amir Zeldes <https://gucorpling.org/amir/>, is looking for qualified candidates for a PhD in Computational Linguistics, to start in Fall 2026. Applicants will be most competitive for the PhD program if they have prior experience in Linguistics, Computer Science, or a related field; are proficient in basic programming; and articulate a clear area of research specialization that they wish to pursue, which aligns with research at the lab. Georgetown also offers a 2-year MS in Computational Linguistics for students wishing to gain a foundation in the field. The application deadline for the PhD program is December 1, 2025 – see details here:
https://linguistics.georgetown.edu/programs/apply/
The lab focuses on research about computational models of discourse, annotated corpus resource creation and Natural Language Processing, especially for low-resource languages and applications in the Digital Humanities. Some recent research topics have included:
* Understanding relative salience in discourse using LLMs
* Modeling discourse relations expressing causality, argumentation and temporality
* The study of different types of anaphoric relations between entities and events
* Creating and benchmarking models for multilingual discourse understanding
* Building datasets and tools for under-resourced and historical languages (especially for Coptic <https://copticscriptorium.org/>)
For more on our research see the lab’s website and Amir’s homepage linked above.
Students in the Computational Linguistics program benefit from a range of courses in NLP and computational modeling techniques, as well as foundational courses in Linguistics. Applicants must hold at least a bachelor’s degree by Fall 2026. PhD students in the Department (domestic as well as international) benefit from a 5-year funding package including a stipend, tuition scholarship and health insurance. For questions please contact amir.zeldes(a)georgetown.edu.