October is back, and so are HPLT datasets (we've been doing this for three consecutive years now!). This time we announce the release of HPLT v3.0, a massive multilingual dataset and a major upgrade for large-scale multilingual corpora.
Comprising 29 billion documents, 198 language-script combinations and 112 trillion characters, v3.0 shows significant gains over v2, driven by several improvements, including a new global deduplication process:
- Unique content boosted from 52% to 73% on average (illustrated by the sketch below).
- Data substance and robustness remain high, with better extraction and improved language identification.
- Increased variety and better representativeness of natural web content.
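To make the "unique content" figure concrete, here is a toy sketch of how a share of unique documents could be estimated with plain exact-hash deduplication. It is purely illustrative: the function name, the normalization and the hashing choice are our own assumptions, not HPLT's actual global deduplication pipeline.

```python
import hashlib

def unique_ratio(texts):
    """Fraction of documents whose lightly normalized text is seen for the first time.

    Toy exact-hash deduplication only; HPLT's 52% -> 73% figures come from
    their own, more elaborate global deduplication, not from this function.
    """
    seen, unique = set(), 0
    for text in texts:
        # Collapse whitespace and case so trivially reformatted copies collide.
        digest = hashlib.sha1(" ".join(text.split()).lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique += 1
    return unique / len(texts) if texts else 0.0

docs = ["Hello world!", "hello   WORLD!", "Something else entirely."]
print(f"{unique_ratio(docs):.0%} of these documents are unique")  # -> 67%
```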
This release provides a cleaner, more robust dataset for building powerful LLMs and machine translation systems, covering a myriad of low- to medium-resourced languages. And we have not said our last word: expect more data soon, because we are already working on it.
Special thanks to all the collaborators and funding bodies, including the European Union's Horizon Europe program and UK Research and Innovation.
Explore the data and see the full analysis and evaluation highlights on our website: https://hplt-project.org/datasets/v3.0