Releases: hplt-project/warc2text-runner
v3.1.3 - Stage2 for HPLT3
Stage2 version used to process almost all of the data for HPLT3
- staging html.zst files from LUMIO to the local disk before processing
- the first file of each batch is staged in a separate SLURM task that requests only 1 CPU; each remaining file is staged while the previous one is being processed; files that fail to stage are processed directly from LUMIO. As a result, almost all core-hours on LUMI are spent on actual stage2 processing, which reduces core-hour consumption
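The staging scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code from this repository: `stage`, `run_batch` and the `process` callback are hypothetical names, a plain file copy stands in for the download from LUMIO, and the first file is staged in the same process rather than in a separate SLURM task.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Optional


def stage(src: Path, local_dir: Path) -> Optional[Path]:
    """Copy src into local_dir; return the local path, or None on failure."""
    try:
        return Path(shutil.copy(src, local_dir))
    except OSError:
        return None


def run_batch(files: List[Path], local_dir: Path,
              process: Callable[[Path], object]) -> list:
    """Process files in order, staging file i+1 while file i is processed."""
    results = []
    if not files:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Stage the first file up front (hypothetical simplification).
        pending = pool.submit(stage, files[0], local_dir)
        for i, src in enumerate(files):
            staged = pending.result()
            if i + 1 < len(files):
                # Start staging the next file in the background.
                pending = pool.submit(stage, files[i + 1], local_dir)
            # If staging failed, fall back to the original location.
            results.append(process(staged if staged is not None else src))
            if staged is not None:
                staged.unlink()  # free local disk space
    return results
```

With a single background worker, at most one file is being staged at any time, so local disk usage stays bounded at roughly two files per batch.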
Full Changelog: v3.1.2...v3.1.3
v3.1.2
- More detailed info in the logs
- Staging the next html.zst to the output folder in the background while processing the current html.zst.
Full Changelog: v3.1.1...v3.1.2
v3.1.1
- Fixed the Trafilatura processing timeout, which did not work because the TimeoutError was caught inside Trafilatura; increased the timeout to 10s.
- QC: the number of lines in metadata.zst, text.zst and lang.zst is compared after stage2. .done files are written next to html.zst files that have been processed, and are checked to avoid reprocessing them.
- Streaming with s3cmd instead of rclone to avoid crashes on long html.zst files.
- More balanced batches with specialized bin packing algorithms (first fit decreasing).
- Support for both local (through rclone alias endpoints) and lumio: paths. Any rclone endpoints should work now.
- cputime2.sh improved: now prints the distribution of processing time over batches.
- Code for automatic staging of html.zst and metadata.zst to the local FS on the login/compute node right before processing. In the end, metadata.zst files are downloaded, while html.zst files are streamed.
- More details in logs.
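For readers unfamiliar with first fit decreasing, the batching heuristic mentioned above can be sketched as follows. This is a generic textbook version, not the implementation from this repository: the function name, the use of file sizes as weights and the `capacity` parameter are all assumptions made for illustration.

```python
from typing import Dict, List


def first_fit_decreasing(sizes: Dict[str, int], capacity: int) -> List[List[str]]:
    """Pack named items into bins of the given capacity (first fit decreasing)."""
    bins: List[List[str]] = []  # item names per bin
    free: List[int] = []        # remaining capacity per bin
    # Consider the largest items first.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        for i, room in enumerate(free):
            if size <= room:
                # Put the item into the first bin where it fits.
                bins[i].append(name)
                free[i] -= size
                break
        else:
            # No existing bin fits (or the item exceeds capacity): open a new bin.
            bins.append([name])
            free.append(capacity - size)
    return bins


# Hypothetical file sizes; each inner list is one batch.
batches = first_fit_decreasing({"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}, capacity=10)
print(batches)  # [['a', 'd'], ['b', 'c', 'e']]
```

Placing large items first tends to leave small items to fill the remaining gaps, which is why the resulting batches are usually closer in total size than batches built in arrival order.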
Full Changelog: v3.0.0...v3.1.1
Stage2 (html2text) for the 3rd HPLT data release
v3.0.0 minor changes
v3.0.0-alpha.4
A number of issues fixed after testing on the 1% sample
v3.0.0-alpha.2
What's Changed
- The language identification model updated by @laurieburchell in #21
- Hyperparameters of Trafilatura selected
v3.0.0-alpha.1
HTML2text updates:
- Moved to Trafilatura 2.0.0
- Additional extraction of text with markup using xml outputs from Trafilatura
- Extraction of HTML language tags
- Streaming input HTMLs directly from LUMIO
Code running stage2 on LUMI for the second data release
See two/README.MD to learn how to reproduce stage2 for the second data release
updated langid
langid update: preprocessing, new model
better selected blocksize for trafilatura
v2.0.0-alpha.2
Full Changelog: v2.0.0-alpha.1...v2.0.0-alpha.2
Now pip installable.