This is a schematic step-by-step description of the data processing pipeline (text extraction and cleaning) used to create the HPLT v2 datasets.
Each step is accompanied by a link to the corresponding code base.
See more details in the Deliverable Report 7.2.
- Internet Archive downloader
- Helper scripts for CommonCrawl downloading
- LUMI-specific scripts for CommonCrawl downloading directly to LUMIO
The output of this stage consists of WARC files.
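As a quick illustration, the resulting WARC files can be inspected with the warcio library (a minimal sketch; warcio is an assumption here, not necessarily what the downstream stages use, and the file name is a placeholder):

```python
# Minimal sketch: count record types in a downloaded WARC file.
# Assumes the warcio package; the file name is a placeholder.
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

counts = Counter()
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        counts[record.rec_type] += 1  # e.g. 'request', 'response', 'metadata'
print(counts)
```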
Load the required LUMI modules:
source preplumicpu.sh
Install with pip in a virtual environment. Use --system-site-packages to reuse packages already installed in cray-python where possible, since they may be better optimized for LUMI. Install only the extra dependencies from two/requirements_LUMIextra.txt:
python -m venv --system-site-packages venv
source venv/bin/activate
pip install -r requirements_LUMIextra.txt
pip install .
Download the language identification model weights:
stage2download.sh
You might want to install on your local machine or on a cluster other than LUMI. In that case, use pip to install all the requirements, including those that the cray-python module provides on LUMI:
python -m venv venv
source venv/bin/activate
pip install -r requirements_LUMIall.txt
pip install .
stage2download.sh
This stage extracts HTML documents, PDFs, and various metadata from WARC files.
TBD: instructions for running stage 1 on LUMI.
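Conceptually, the HTML extraction step can be sketched as follows (an illustrative sketch only, not the actual warc2html implementation; the use of warcio and the metadata fields shown are assumptions):

```python
# Illustrative sketch of stage 1: pull HTML payloads and basic metadata out of a WARC file.
# Not the actual warc2html code; library choice and metadata fields are assumptions.
from warcio.archiveiterator import ArchiveIterator

def extract_html_records(warc_path):
    """Yield (metadata, html_bytes) for each HTML response in the WARC."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue  # PDFs and other payload types are handled separately in the real pipeline
            metadata = {
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "timestamp": record.rec_headers.get_header("WARC-Date"),
            }
            yield metadata, record.content_stream().read()
```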
Stage 2 does text extraction with boilerplate removal (Trafilatura) and language identification (fastText with the OpenLID model). It is executed on 100 LUMI compute nodes, with 250 parallel processes on each.
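For a single document, the core of this step can be sketched as follows (illustrative only; the model file name is a placeholder, and the real pipeline adds batching, metadata handling and error recovery):

```python
# Illustrative sketch of the stage 2 core: text extraction + language identification.
# Not the actual html2text code; the model path is a placeholder.
import fasttext
import trafilatura

# OpenLID is distributed as a fastText model; the local file name here is a placeholder.
lid_model = fasttext.load_model("openlid.bin")

def html_to_text_and_lang(html: str):
    # Trafilatura extracts the running text and removes boilerplate (menus, ads, footers).
    text = trafilatura.extract(html)
    if not text:
        return None, None
    # fastText predicts over a single line, so newlines are collapsed for LID.
    labels, scores = lid_model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].removeprefix("__label__")
    return text, lang
```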
Prepare LUMI environment:
source preplumicpu.sh
source venv/bin/activate
Prepare a list of HTML files to process from LUMIO:
rclone ls --include="html.zst" lumio:htmlsample | sed -r 's!( *[0-9]+\s+)!\1 lumio:htmlsample/!' >lumio.paths
NB! If you want to process local files, create an rclone endpoint of type 'alias' for the parent folder of all these files and provide a list of files in the format endpoint:path. The code supports only paths in this format: it strips the endpoint: prefix and reconstructs the path under the specified OUTPUT directory.
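To illustrate the endpoint:path convention, here is roughly how a listed path maps to an output location (a sketch of the behaviour described above, not the actual code; the example subdirectory is a placeholder):

```python
# Sketch of the endpoint:path convention described above (not the actual implementation).
# Example line in lumio.paths after the rclone | sed step:  "  12345  lumio:htmlsample/1/html.zst"
import os

def output_path_for(listed_path: str, output_dir: str) -> str:
    # Strip the "endpoint:" prefix and rebuild the remaining path under the OUTPUT directory.
    _, relative = listed_path.split(":", 1)
    return os.path.join(output_dir, relative)

print(output_path_for("lumio:htmlsample/1/html.zst", "~/hplt/three/html_test"))
# -> ~/hplt/three/html_test/htmlsample/1/html.zst
```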
Specify the account and the partition SLURM should use:
For the standard partition:
export SLURM_ACCOUNT=project_465001890
export SLURM_MEM_PER_NODE=0 # same as --mem=0, requests all memory since standard nodes are allocated fully
export SLURM_PARTITION=standard
export SLURM_TIMELIMIT=0-48:00:00
Alternatively, for the small partition:
export SLURM_ACCOUNT=project_465001890
export SLURM_MEM_PER_CPU=1750M # same as --mem-per-cpu=1750M, recommended for the small partition in the LUMI docs to avoid extra billing for larger memory nodes
export SLURM_PARTITION=small
export SLURM_TIMELIMIT=0-72:00:00
Run processing on at most 100 nodes in parallel, with 50 GB of input HTML per SLURM job:
stage2nodeparallel_batched.sh 100 lumio.paths 50 ~/hplt/three/html_test
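To illustrate what the 50 GB-per-job argument means, input files from lumio.paths can be grouped into batches by cumulative size roughly as follows (an illustrative sketch only; the actual batching is done inside stage2nodeparallel_batched.sh and may differ):

```python
# Sketch: group listed files into roughly 50 GB batches, one batch per SLURM job.
# Illustrative only; not the script's actual logic.
def batch_by_size(paths_file: str, max_gb: float):
    limit = max_gb * 1024**3
    batch, batch_bytes = [], 0
    with open(paths_file) as f:
        for line in f:
            if not line.strip():
                continue
            size, path = line.split(maxsplit=1)
            size = int(size)
            if batch and batch_bytes + size > limit:
                yield batch
                batch, batch_bytes = [], 0
            batch.append(path.strip())
            batch_bytes += size
    if batch:
        yield batch

for i, batch in enumerate(batch_by_size("lumio.paths", 50)):
    print(f"job {i}: {len(batch)} files")
```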
The code for text extraction in this repository is based on:
- Stage 1: Extracting HTML and metadata from WARC files (warc2html)
- Stage 2: Extracting raw text (html2text)
- Trafilatura (running text extraction and boilerplate removal)
- Document language identification with OpenLID
The output of this stage is plain text and metadata (separately) in JSONL format.
The output of the deduplication and cleaning stage is plain text merged with metadata in JSONL format. It comes in two varieties: deduplicated and cleaned.
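The resulting JSONL files can be consumed with standard tools, for example (the file name and field names below are placeholders and may not match the released schema):

```python
# Minimal sketch: iterate over records in a JSONL output file.
# The file name and field names are placeholders; see the release documentation for the actual schema.
import json

with open("output.jsonl", encoding="utf-8") as fh:
    for line in fh:
        doc = json.loads(line)
        # Each record holds the extracted plain text together with its metadata.
        print(doc.get("lang"), len(doc.get("text", "")))
```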