(Apologies for cross-posting)
Version 2.0 of the HPLT Datasets is now published, with web-derived corpora in 193 languages.
These collections are available under the Creative Commons CC0 license and bring significant improvements over the previous release (version 1.2). As in 1.2, the release comes in two variants: de-duplicated (21 TB in size) and cleaned (15 TB in size). The cleaned variant contains the same documents as the de-duplicated one, minus those filtered out by our cleaning heuristics. The cleaned variant is the recommended one, unless you want to try your own cleaning pipelines.
Download the corpora here: https://hplt-project.org/datasets/v2.0
As with previous releases, the version 2.0 datasets are hosted on the Sigma2 NIRD Data Lake (https://www.sigma2.no/service/nird-data-lake), and the text extraction pipeline was run on the LUMI supercomputer (https://lumi-supercomputer.eu/).
*What's new*
- The size of the source web collections has increased 2.5x: 4.5 petabytes of compressed web data in total, mostly from the Internet Archive (https://archive.org/), but also from Common Crawl (https://commoncrawl.org/).
- The text extraction pipeline now uses Trafilatura (https://trafilatura.readthedocs.io/), which results in more efficient boilerplate removal and thus less noise in the data.
- Language identification now uses a refined version of OpenLID (https://aclanthology.org/2023.acl-short.75/).
- This, in turn, allowed us to publish data in 193 languages, compared to 75 languages in version 1.2.
- We switched from two-letter ISO 639-1 language codes to three-letter ISO 639-3 language codes, augmented with a postfix denoting the writing system. For example, `pol_Latn` is Polish written in Latin script (see the code sketch after this list). The mapping from the old to the new codes is available at https://github.com/hplt-project/warc2text-runner/blob/main/stats/_langs/lang....
- Documents are now annotated with their compliance with the robots.txt file of the original website. This metadata field can be used to filter out documents explicitly forbidden for crawling by website owners, making the resulting corpora somewhat less prone to copyright violations. The cleaned variant contains only robots.txt-compliant documents. More details at https://github.com/hplt-project/monotextor-slurm/blob/main/README.md#robotst...
- De-duplication is now done at collection level, not at dataset level.
- Documents have also been annotated for PII with the multilingual-pii-tool (https://github.com/mmanteli/multilingual-PII-tool). Matches are recorded as Unicode character offsets (see the filtering sketch after this list).
- Segment-level language-model-based scores have been replaced by document-level quality scores computed with web-docs-scorer.
- Filtering and cleaning criteria have been simplified (https://github.com/hplt-project/monotextor-slurm?tab=readme-ov-file#filters).
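The new labels follow an <ISO 639-3>_<script> pattern. As a minimal sketch of moving from the old v1.2 codes to the v2.0 scheme (the entries below are examples only; the authoritative table is the lang file linked above):

```python
# Illustrative only: a few example entries of the old two-letter -> new
# <iso639-3>_<script> mapping; the full table lives in the warc2text-runner repo.
OLD_TO_NEW = {
    "pl": "pol_Latn",  # Polish, Latin script
    "en": "eng_Latn",  # English, Latin script
    "ru": "rus_Cyrl",  # Russian, Cyrillic script
    "el": "ell_Grek",  # Greek, Greek script
}

def to_v2_code(old_code: str) -> str:
    """Map a v1.2 two-letter code to the v2.0 <iso639-3>_<script> label."""
    try:
        return OLD_TO_NEW[old_code]
    except KeyError:
        raise ValueError(f"No v2.0 mapping known for {old_code!r}") from None

print(to_v2_code("pl"))  # -> pol_Latn
```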
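And here is a rough sketch of how the new per-document metadata could be used downstream: keep only robots.txt-compliant documents above a chosen quality threshold and mask PII spans via their character offsets. The field names (`text`, `robotstxt`, `doc_scores`, `pii`), the `"disallowed"` value, and the score scale are assumptions made for illustration; consult the monotextor-slurm README linked above for the actual schema.

```python
import json

# Assumed field names for illustration: "text", "robotstxt", "doc_scores", "pii".
# Check the dataset documentation for the real schema before relying on these.
def filter_documents(jsonl_path, min_score=5.0):
    """Yield robots.txt-compliant documents above a quality threshold, with PII masked."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            doc = json.loads(line)
            if doc.get("robotstxt") == "disallowed":          # assumed compliance flag
                continue
            if (doc.get("doc_scores") or [0])[0] < min_score:  # assumed: first entry is the document score
                continue
            text = doc["text"]
            # PII matches are Unicode character offsets; mask right-to-left so
            # earlier offsets stay valid while the string is being edited.
            for start, end in sorted(doc.get("pii", []), reverse=True):
                text = text[:start] + "[PII]" + text[end:]
            yield text

# Usage (hypothetical shard name):
# for text in filter_documents("eng_Latn_1.jsonl"):
#     ...
```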
HPLT Monolingual Datasets version 2.0 (the de-duplicated variant) feature about 7.6 trillion whitespace-separated words and about 52 trillion characters extracted from 21 billion documents, compared to 5.6 trillion words and 42 trillion characters extracted from 5 billion documents in version 1.2. All in all, you can expect less noise and boilerplate, fewer duplicates, more unique documents, and generally better-quality texts to train language models on, or to use for other NLP tasks.
*How was this dataset produced*
You may want to read section 3 of our deliverable HPLT pipeline and tools (https://hplt-project.org/HPLT_D7_2___HPLT_pipelines_and_tools.pdf) for a full description of how we produced this dataset. If you don't have much time for reading, this chart may be enough for your purposes: https://hplt-project.org/_next/static/media/dataset-pipeline-light.c2521ee1....
Each language is accompanied by an HPLT Analytics report. These automated reports provide useful information and statistics about the cleaned variant of the HPLT v2.0 datasets, and are produced by running the HPLT Analytics Tool (https://github.com/hplt-project/data-analytics-tool) on them. They are helpful for inspecting the datasets even before downloading them.
*What is HPLT?*
HPLT (High Performance Language Technologies) is an EU Horizon Europe funded project which aims at collecting large quantities of data in many languages and training powerful and efficient language and translation models. An important feature of HPLT is openness and transparency: all the artifacts of the project are publicly available under permissive licenses.