Releases · hplt-project/warc2text-runner

04 May 15:29

nvanva

v3.1.3 - Stage2 for HPLT3

Latest

Stage2 version used to process almost all of the data for HPLT3.

Staging html.zst files from LUMIO to the local disk before processing.
The first file of each batch is staged in a separate SLURM task that requests only 1 CPU; each remaining file is staged while the previous one is being processed. Files that fail to stage are processed directly from LUMIO. As a result, almost all core-hours on LUMI are spent on actual stage2 processing, which reduces overall core-hour consumption.
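The staging scheme described above can be sketched roughly as follows. This is only an illustration, not the runner's actual code: `stage`, `run_batch` and the `process` callback are hypothetical names, a plain file copy stands in for the download from LUMIO, and the first file is staged inline here rather than in a separate SLURM task.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Optional


def stage(src: str, stage_dir: Path) -> Optional[Path]:
    """Copy src into stage_dir; return the local path, or None if staging failed.

    Assumption: a local copy stands in for the real download from object storage.
    """
    try:
        dst = stage_dir / Path(src).name
        shutil.copyfile(src, dst)
        return dst
    except OSError:
        return None


def run_batch(
    files: List[str],
    stage_dir: Path,
    process: Callable[[str], None],
    stage_fn: Callable[[str, Path], Optional[Path]] = stage,
) -> None:
    """Process files in order, staging file i+1 in the background during file i."""
    if not files:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Stage the first file up front (the release does this in a separate 1-CPU task).
        pending = pool.submit(stage_fn, files[0], stage_dir)
        for i, src in enumerate(files):
            local = pending.result()  # wait until the current file has been staged
            if i + 1 < len(files):
                # Start staging the next file before processing the current one.
                pending = pool.submit(stage_fn, files[i + 1], stage_dir)
            # Fall back to the original location if staging failed.
            process(str(local) if local is not None else src)
            if local is not None:
                local.unlink(missing_ok=True)  # free local disk space
```

With a single background worker, at most one extra file occupies the local disk at any time, and the compute cores wait on staging only when a transfer takes longer than processing the previous file.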

Full Changelog: v3.1.2...v3.1.3


25 Apr 20:36

nvanva

v3.1.2

More detailed information in the logs.
Staging the next html.zst to the output folder in the background while processing the current html.zst.

Full Changelog: v3.1.1...v3.1.2


22 Apr 08:24

nvanva

v3.1.1

Fixed the timeout for processing with Trafilatura, which did not work because the TimeoutError was caught inside Trafilatura; increased the timeout to 10 s.
QC: comparing the number of lines in metadata.zst, text.zst and lang.zst after stage2. Writing .done files next to the html.zst files that have been processed, and checking for them to avoid reprocessing those files.
Streaming with s3cmd instead of rclone to avoid crashes on long html.zst files.
More balanced batches with specialized bin packing algorithms (first fit decreasing).
Support for both local paths (through rclone alias endpoints) and lumio: paths. Any rclone endpoint should work now.
cputime2.sh improved: now prints the distribution of processing time over batches.
Code for automatic staging of html.zst and metadata.zst to the local FS on the login/compute node right before processing. In the end, metadata.zst files are downloaded, but html.zst files are streamed.
More details in logs.
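For readers unfamiliar with the first fit decreasing heuristic mentioned in the batching item above, a minimal sketch follows. It assumes that file size is used as a proxy for processing cost and that each batch has a fixed size budget; the function name, the `sizes` mapping and `capacity` are hypothetical and not taken from the runner's code.

```python
from typing import Dict, List


def first_fit_decreasing(sizes: Dict[str, int], capacity: int) -> List[List[str]]:
    """Pack named items into bins of at most `capacity` using first fit decreasing."""
    bins: List[List[str]] = []
    free: List[int] = []  # remaining capacity of each bin
    # Visit items from largest to smallest; ties are broken by name for determinism.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        for i, room in enumerate(free):
            if size <= room:
                # Place the item into the first bin that still has room for it.
                bins[i].append(name)
                free[i] -= size
                break
        else:
            # No existing bin fits: open a new one (an oversized item gets its own bin).
            bins.append([name])
            free.append(capacity - size)
    return bins


batches = first_fit_decreasing({"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}, capacity=10)
print(batches)  # [['a', 'd'], ['b', 'c', 'e']]
```

Placing the largest items first lets the small ones fill the remaining gaps, which tends to give batches of similar total size.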

Full Changelog: v3.0.0...v3.1.1
