HPLT TextPipes

This is a schematic step-by-step description of the data processing pipeline (text extraction and cleaning) used to create the HPLT v2 datasets.

Each step is accompanied by a link to the corresponding code base.

See more details in the Deliverable Report 7.2.

Data ingestion

The output of this stage consists of WARC files.
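
These archives can be inspected with standard WARC tooling. Below is a minimal sketch, assuming the warcio Python library (an assumption for illustration, not necessarily what the pipeline itself uses), that iterates over the response records of one archive:

from warcio.archiveiterator import ArchiveIterator

# Print the target URL and payload size of every HTTP response record
# in a single WARC file (the file name is illustrative).
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()
            print(url, len(payload))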

Text extraction

Installation on LUMI

Load the required LUMI modules:

source preplumicpu.sh

Install with pip in a virtual environment. Use --system-site-packages to reuse, where possible, the packages that come with cray-python, which may be better optimized for LUMI. Install only the extra dependencies from two/requirements_LUMIextra.txt:

python -m venv --system-site-packages venv
source venv/bin/activate
pip install -r requirements_LUMIextra.txt
pip install .  

Download the language identification model weights:

stage2download.sh

Installation on other systems (not tested!)

You might want to install on your local machine or on a cluster other than LUMI. In that case, install all the requirements with pip, including those that come from the cray-python module on LUMI:

python -m venv venv
source venv/bin/activate
pip install -r  requirements_LUMIall.txt
pip install .
stage2download.sh

Stage1 (a.k.a. warc2html)

This stage extracts HTML documents, PDFs and various metadata from WARC files.

TBD: instructions for running stage1 on LUMI

Stage2 (a.k.a. html2text)

Stage2 performs text extraction with boilerplate removal (Trafilatura) and language identification (fastText with the OpenLID model). It is executed on 100 LUMI compute nodes, with 250 parallel processes on each.
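
Conceptually, the per-document work of this stage looks roughly like the sketch below. This is only an illustration, not the actual stage2 code: the model file name and the returned fields are placeholders, and the OpenLID weights are assumed to have been fetched with stage2download.sh.

import fasttext
import trafilatura

# Placeholder path to the OpenLID fastText weights downloaded earlier.
lid_model = fasttext.load_model("openlid.bin")

def process_document(html: str):
    # Main-content extraction with boilerplate removal via Trafilatura.
    text = trafilatura.extract(html)
    if not text:
        return None
    # fastText expects a single line of text for prediction.
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return {"text": text, "lang": lang, "lang_prob": float(probs[0])}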

Prepare LUMI environment:

source preplumicpu.sh
source venv/bin/activate

Prepare a list of HTML files to process from LUMIO:

rclone ls --include="html.zst" lumio:htmlsample | sed -r 's!( *[0-9]+\s+)!\1 lumio:htmlsample/!' >lumio.paths

NB! If you want to process local files, create an rclone endpoint of type 'alias' for the parent folder of all these files and provide a list of files in the format endpoint:path. The code supports only paths in this format: it strips the endpoint: prefix and reconstructs the path under the specified OUTPUT directory.
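
For example, an alias endpoint pointing at a local parent folder can be defined in ~/.config/rclone/rclone.conf along these lines (the endpoint name and path are placeholders):

[localhtml]
type = alias
remote = /path/to/parent/folder

After that, the same rclone ls | sed listing as above can be run with localhtml: in place of lumio:htmlsample to build the path list.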

Specify the account, partition and resource limits SLURM should use. For the standard partition:

export SLURM_ACCOUNT=project_465001890
export SLURM_MEM_PER_NODE=0  # same as --mem=0; requests all memory, since standard nodes are allocated in full
export SLURM_PARTITION=standard
export SLURM_TIMELIMIT=0-48:00:00

Or, alternatively, for the small partition:

export SLURM_ACCOUNT=project_465001890
export SLURM_MEM_PER_CPU=1750M  # same as --mem-per-cpu=1750M; recommended for the small partition in the LUMI docs to avoid extra billing for larger-memory nodes
export SLURM_PARTITION=small
export SLURM_TIMELIMIT=0-72:00:00

Run processing on at most 100 nodes in parallel, with 50 GB of input HTML per SLURM job:

stage2nodeparallel_batched.sh 100 lumio.paths 50 ~/hplt/three/html_test

Older versions

The code for text extraction in this repository is based on

The output of this stage is plain text and metadata (separately) in JSONL format.

Deduplication, cleaning and filtering

The output of this stage is plain text merged with its metadata, in JSONL format. It comes in deduplicated and cleaned varieties.
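
For a quick look at the result, the JSONL files can be streamed record by record. The sketch below assumes zstandard-compressed output and a top-level "text" field; treat the file name, compression and field name as assumptions and check the Deliverable Report for the actual schema.

import io
import json
import zstandard

# Stream a zstandard-compressed JSONL file and print the first few documents.
with open("part.jsonl.zst", "rb") as fh:
    reader = io.TextIOWrapper(
        zstandard.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
    for i, line in enumerate(reader):
        doc = json.loads(line)
        print(doc.get("text", "")[:80])  # the "text" field name is an assumption
        if i >= 4:
            break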
