(Apologies for cross-posting)
Version 2.0 of the HPLT Datasets is now published, with web-derived corpora in 193 languages.
These collections are available under the Creative Commons CC0 license and bring significant improvements over the previous release (version 1.2). As in 1.2, the release comes in two variants: de-duplicated (21 TB in size) and cleaned (15 TB in size). The cleaned variant contains the same documents as the de-duplicated one, minus those filtered out by our cleaning heuristics. The cleaned variant is the recommended one, unless you want to try your own cleaning pipelines.
Download the corpora here: https://hplt-project.org/datasets/v2.0
As with previous releases, the version 2.0 datasets are hosted on the Sigma2 NIRD Data Lake (https://www.sigma2.no/service/nird-data-lake), and the text extraction pipeline was run on the LUMI supercomputer (https://lumi-supercomputer.eu/).
*What's new*
- The size of the source web collections has increased 2.5x: 4.5 petabytes of compressed web data in total, mostly from the Internet Archive (https://archive.org/), but also from Common Crawl (https://commoncrawl.org/).
- The text extraction pipeline now uses Trafilatura (https://trafilatura.readthedocs.io/), which results in more efficient boilerplate removal and thus less noise in the data.
- Language identification now uses a refined version of OpenLID (https://aclanthology.org/2023.acl-short.75/).
- This, in turn, allowed us to publish data in 193 languages, compared to 75 languages in version 1.2.
- We switched from two-letter ISO 639-1 language codes to three-letter ISO 639-3 language codes, augmented with a postfix denoting the writing system. For example, `pol_Latn` is Polish written in Latin script (see the code sketch after this list). The mapping from the old to the new codes is available at https://github.com/hplt-project/warc2text-runner/blob/main/stats/_langs/lang....
- Documents are now annotated with their compliance with the robots.txt file of the original website. This metadata field can be used to filter out documents explicitly forbidden for crawling by website owners, making the resulting corpora somewhat less prone to copyright violations. The cleaned variant contains only robots.txt-compliant documents. More details at https://github.com/hplt-project/monotextor-slurm/blob/main/README.md#robotst...
- De-duplication is now done at collection level, not at dataset level.
- Documents have also been annotated for PII with the multilingual-pii-tool (https://github.com/mmanteli/multilingual-PII-tool). Matches are recorded as Unicode character offsets (see the filtering sketch after this list).
- Segment-level language-model-based scores have been replaced by document-level quality scores computed with web-docs-scorer.
- Filtering and cleaning criteria have been simplified (https://github.com/hplt-project/monotextor-slurm?tab=readme-ov-file#filters).
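The new labels follow an <ISO 639-3>_<script> pattern. As a minimal sketch of moving from the old v1.2 codes to the v2.0 scheme (the entries below are examples only; the authoritative table is the lang file linked above):

```python
# Illustrative only: a few example entries of the old two-letter -> new
# <iso639-3>_<script> mapping; the full table lives in the warc2text-runner repo.
OLD_TO_NEW = {
    "pl": "pol_Latn",  # Polish, Latin script
    "en": "eng_Latn",  # English, Latin script
    "ru": "rus_Cyrl",  # Russian, Cyrillic script
    "el": "ell_Grek",  # Greek, Greek script
}

def to_v2_code(old_code: str) -> str:
    """Map a v1.2 two-letter code to the v2.0 <iso639-3>_<script> label."""
    try:
        return OLD_TO_NEW[old_code]
    except KeyError:
        raise ValueError(f"No v2.0 mapping known for {old_code!r}") from None

print(to_v2_code("pl"))  # -> pol_Latn
```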
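And here is a rough sketch of how the new per-document metadata could be used downstream: keep only robots.txt-compliant documents above a chosen quality threshold and mask PII spans via their character offsets. The field names (`text`, `robotstxt`, `doc_scores`, `pii`), the `"disallowed"` value, and the score scale are assumptions made for illustration; consult the monotextor-slurm README linked above for the actual schema.

```python
import json

# Assumed field names for illustration: "text", "robotstxt", "doc_scores", "pii".
# Check the dataset documentation for the real schema before relying on these.
def filter_documents(jsonl_path, min_score=5.0):
    """Yield robots.txt-compliant documents above a quality threshold, with PII masked."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            doc = json.loads(line)
            if doc.get("robotstxt") == "disallowed":          # assumed compliance flag
                continue
            if (doc.get("doc_scores") or [0])[0] < min_score:  # assumed: first entry is the document score
                continue
            text = doc["text"]
            # PII matches are Unicode character offsets; mask right-to-left so
            # earlier offsets stay valid while the string is being edited.
            for start, end in sorted(doc.get("pii", []), reverse=True):
                text = text[:start] + "[PII]" + text[end:]
            yield text

# Usage (hypothetical shard name):
# for text in filter_documents("eng_Latn_1.jsonl"):
#     ...
```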
HPLT Monolingual Datasets version 2.0 (the de-duplicated variant) feature about 7.6 trillion whitespace-separated words and about 52 trillion characters extracted from 21 billion documents, compared to 5.6 trillion words and 42 trillion characters extracted from 5 billion documents in version 1.2. All in all, you can expect less noise and boilerplate, fewer duplicates, more unique documents, and generally better-quality texts to train language models on, or to use for other NLP tasks.
*How was this dataset produced*
You may want to read section 3 of our deliverable HPLT pipeline and tools (https://hplt-project.org/HPLT_D7_2___HPLT_pipelines_and_tools.pdf) for a full description of how we produced this dataset. If you don't have much time for reading, this chart may be enough for your purposes: https://hplt-project.org/_next/static/media/dataset-pipeline-light.c2521ee1....
Each language is accompanied by an HPLT Analytics report. These automated reports provide useful information and statistics about the cleaned variant of the HPLT v2.0 datasets, and are produced by running the HPLT Analytics Tool (https://github.com/hplt-project/data-analytics-tool) on them. They are helpful for inspecting the datasets even before downloading them.
*What is HPLT?*
HPLT (High Performance Language Technologies) is an EU Horizon Europe funded project which aims at collecting large quantities of data in many languages and training powerful and efficient language and translation models. An important feature of HPLT is openness and transparency: all the artifacts of the project are publicly available under permissive licenses.