October is back, and so are HPLT datasets (we've been doing this for three consecutive years now!). This time we announce the release of HPLT v3.0, a massive multilingual dataset and a major upgrade for large-scale multilingual corpora.
Comprising 29 billion documents, 198 language-script combinations and 112 trillion characters, v3.0 shows significant gains over v2, driven by several improvements, including a new global deduplication process:
- Unique content boosted from 52% to 73% on average (illustrated by the sketch below).
- Data substance and robustness remain high, with better extraction and improved language identification.
- Increased variety and better representativeness of natural web content.
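To make the "unique content" figure concrete, here is a toy sketch of how a share of unique documents could be estimated with plain exact-hash deduplication. It is purely illustrative: the function name, the normalization and the hashing choice are our own assumptions, not HPLT's actual global deduplication pipeline.

```python
import hashlib

def unique_ratio(texts):
    """Fraction of documents whose lightly normalized text is seen for the first time.

    Toy exact-hash deduplication only; HPLT's 52% -> 73% figures come from
    their own, more elaborate global deduplication, not from this function.
    """
    seen, unique = set(), 0
    for text in texts:
        # Collapse whitespace and case so trivially reformatted copies collide.
        digest = hashlib.sha1(" ".join(text.split()).lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique += 1
    return unique / len(texts) if texts else 0.0

docs = ["Hello world!", "hello   WORLD!", "Something else entirely."]
print(f"{unique_ratio(docs):.0%} of these documents are unique")  # -> 67%
```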
This release provides a cleaner, more robust dataset for building powerful LLMs and machine translation systems, covering a myriad of low- to medium-resourced languages. And we have not said our last word: expect more data soon, because we are already working on it.
Special thanks to all the collaborators and funding bodies, including the European Union's Horizon Europe program and UK Research and Innovation.
Explore the data and see the full analysis and evaluation highlights on our website: https://hplt-project.org/datasets/v3.0