Releases: hplt-project/warc2text-runner

v3.1.3 - Stage2 for HPLT3

04 May 15:29

The Stage2 version used to process almost all data for HPLT3.

  • html.zst files are staged from LUMIO to the local disk before processing.
  • The first file of each batch is staged in a separate SLURM task that requests only 1 CPU; the remaining files are staged while the previous file is being processed. Files that fail to stage are processed directly from LUMIO. As a result, almost all core-hours on LUMI are spent on actual stage2 processing, which reduces core-hour consumption.
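
The overlap between staging and processing described above can be illustrated with a minimal Python sketch. It is not the runner's actual code: `run_batch`, `stage` and `process` are hypothetical names, and the first file is simply staged up front rather than in a separate SLURM task.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def run_batch(
    remote_paths: Iterable[str],
    stage: Callable[[str], str],     # hypothetical: copies one file, returns its local path, raises on failure
    process: Callable[[str], None],  # hypothetical: runs stage2 on one html.zst
) -> List[Tuple[str, bool]]:
    """Process files in order, staging file i+1 while file i is processed.

    Returns (remote_path, was_staged) pairs, e.g. for logging.
    """
    paths = list(remote_paths)
    report: List[Tuple[str, bool]] = []
    if not paths:
        return report

    with ThreadPoolExecutor(max_workers=1) as pool:
        # Nothing to overlap with yet, so the first file is staged up front.
        pending = pool.submit(stage, paths[0])
        for i, remote in enumerate(paths):
            try:
                source, staged = pending.result(), True
            except Exception:
                # Staging failed: fall back to reading directly from remote storage.
                source, staged = remote, False
            if i + 1 < len(paths):
                # Start staging the next file in the background.
                pending = pool.submit(stage, paths[i + 1])
            process(source)
            report.append((remote, staged))
    return report
```

With a single background worker, at most one file is staged ahead of the one being processed, which keeps local disk usage bounded.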

Full Changelog: v3.1.2...v3.1.3

v3.1.2

25 Apr 20:36
  • More detailed info in the logs
  • Staging the next html.zst to the output folder in the background while processing the current html.zst.

Full Changelog: v3.1.1...v3.1.2

v3.1.1

22 Apr 08:24
  • Fixed the timeout for processing with Trafilatura, which did not work because the TimeoutError was caught inside Trafilatura; increased the timeout to 10s.
  • QC: the number of lines in metadata.zst, text.zst and lang.zst is compared after stage2. .done files are written next to html.zst files that were processed and are checked to avoid reprocessing already processed files.
  • Streaming with s3cmd instead of rclone to avoid crashes on long html.zst files.
  • More balanced batches with specialized bin packing algorithms (first fit decreasing).
  • Support for both local (through rclone alias endpoints) and lumio: paths. Any rclone endpoints should work now.
  • cputime2.sh improved: now prints the distribution of processing time over batches.
  • Added code for automatically staging html.zst and metadata.zst to the local FS on the login/compute node right before processing. In the end, metadata.zst files are downloaded, while html.zst files are streamed.
  • More details in logs.
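
For readers unfamiliar with first fit decreasing, the batching idea mentioned above can be sketched as follows. This is a generic textbook version under assumed inputs, not the project's implementation: the function name, the use of file sizes as weights, and the `capacity` parameter are all illustrative assumptions.

```python
from typing import Dict, List


def first_fit_decreasing(sizes: Dict[str, int], capacity: int) -> List[List[str]]:
    """Pack files into batches whose total size is at most `capacity`."""
    batches: List[List[str]] = []
    loads: List[int] = []
    # Largest files first; ties are broken by name so the result is deterministic.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        for i, load in enumerate(loads):
            if load + size <= capacity:
                # First batch with enough room takes the file.
                batches[i].append(name)
                loads[i] += size
                break
        else:
            # No existing batch has room (or the file alone exceeds capacity).
            batches.append([name])
            loads.append(size)
    return batches
```

Placing the largest files first leaves the small ones to fill the remaining gaps, which tends to give batches of similar total size.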

Full Changelog: v3.0.0...v3.1.1

Stage2 (html2text) for the 3rd HPLT data release

14 Apr 16:28

v3.0.0-alpha.4

08 Apr 12:27

Fixed a number of issues found while testing on the 1% sample.

v3.0.0-alpha.2

03 Apr 20:35

What's Changed

  • The language identification model updated by @laurieburchell in #21
  • Hyperparameters of Trafilatura selected

v3.0.0-alpha.1

21 Mar 00:00
Pre-release

HTML2text updates:

  1. Moved to Trafilatura 2.0.0
  2. Additional extraction of text with markup using xml outputs from Trafilatura
  3. Extraction of HTML language tags
  4. Streaming input HTMLs directly from LUMIO

Code running stage2 on LUMI for the second data release

13 May 16:26

See two/README.MD to learn how to reproduce stage2 for the second data release.

updated langid

12 May 20:05
fb881ca
Pre-release

langid update: preprocessing, new model.
Better-chosen blocksize for Trafilatura.

v2.0.0-alpha.2

06 May 22:13
Pre-release

Full Changelog: v2.0.0-alpha.1...v2.0.0-alpha.2

Now pip installable.