
[feat] Add data cleaning in fast-llm prepare #112


Open · 3 of 4 tasks
tscholak opened this issue Jan 14, 2025 · 3 comments
Labels: enhancement (New feature or request), need update

tscholak (Collaborator) commented Jan 14, 2025

🎯 Goal (What & Why)

Add a comprehensive data cleaning stage to the fast-llm prepare command.

fast-llm prepare currently downloads and tokenizes HuggingFace datasets into Fast-LLM's .bin / .idx format using a distributed torchrun setup. However, it performs no data cleaning, which limits training quality and poses risks around PII and malicious content.

This ticket adds a required data cleaning phase that is configurable, fast, and integrated into the distributed preprocessing loop. The goal is to improve model quality, reduce noise, follow best practices (see OLMo-2), and meet responsible AI standards by removing PII and malware from the training corpus.

🚀 Execution Plan

Step 1: What is the smallest working version?

  • Extend fast-llm prepare to apply a modular and configurable cleaning pipeline during preprocessing.
  • All cleaning steps must be integrated into the existing torchrun CPU-only distributed setup, preserving parallelism.
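
For instance, each torchrun worker could keep only the documents assigned to its rank and run the cleaning predicates locally, so the existing CPU/Gloo parallelism is preserved. A minimal, fast-llm-agnostic sketch (the document layout and the filter interface are assumptions, not the actual prepare internals):

```python
import logging
import os
from collections import Counter
from typing import Callable, Iterable, Iterator

def iter_local_shard(documents: Iterable[dict]) -> Iterator[dict]:
    """Yield only the documents assigned to this torchrun worker."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    for i, doc in enumerate(documents):
        if i % world_size == rank:
            yield doc

def clean_shard(documents: Iterable[dict],
                filters: list[Callable[[str], bool]]) -> Iterator[dict]:
    """Drop a document as soon as any filter rejects it; count removals per filter."""
    removed = Counter()
    for doc in iter_local_shard(documents):
        for keep in filters:
            if not keep(doc["text"]):
                removed[keep.__name__] += 1
                break
        else:
            yield doc
    logging.info("documents removed per filter: %s", dict(removed))
```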

Step 2: Required cleaning filters (all must be implemented):

  • Length filtering
    • Remove documents exceeding a configurable max length (in characters or tokens).
  • n-gram repetition
    • Remove documents with ≥N repeated n-grams (default: 32), as in OLMo-2.
  • Frequency-based filtering
    • Remove documents where:
      • The most frequent word exceeds X% of total tokens (default: 30%).
      • The top-2 most frequent words together exceed Y% of total tokens (default: 50%).
  • Binary content filtering
    • Remove documents that contain mostly binary data.
  • Numerical content filtering
    • Remove documents where the fraction of numeric tokens exceeds a configurable threshold (default: 50%); see the sketch after this list.
  • PII redaction
    • Integrate Microsoft Presidio to detect sensitive personal information and either redact it or drop the affected documents.
  • Malware removal
    • Integrate ClamAV to scan documents and remove any that trigger detections.
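
A minimal sketch of how these filters might look as per-document predicates. Function names, thresholds, whitespace tokenization, and the exact n-gram repetition rule are illustrative assumptions rather than the actual fast-llm or OLMo-2 recipe; the PII and malware helpers assume the presidio-analyzer / presidio-anonymizer packages and a running clamd daemon reachable through the third-party clamd Python package:

```python
import io
from collections import Counter

import clamd
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

def passes_length(text: str, max_chars: int = 100_000) -> bool:
    """Length filter: drop documents longer than a configurable limit (placeholder default)."""
    return len(text) <= max_chars

def passes_ngram_repetition(text: str, n: int = 3, max_repeats: int = 32) -> bool:
    """Drop a document if any single n-gram repeats at least `max_repeats` times
    (one plausible reading of the OLMo-2-style rule)."""
    words = text.split()
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return not counts or max(counts.values()) < max_repeats

def passes_word_frequency(text: str, top1_max: float = 0.30, top2_max: float = 0.50) -> bool:
    """Drop documents dominated by one or two words."""
    words = text.split()
    if not words:
        return True
    top = Counter(words).most_common(2)
    if top[0][1] / len(words) > top1_max:
        return False
    return sum(count for _, count in top) / len(words) <= top2_max

def passes_numeric_ratio(text: str, max_ratio: float = 0.50) -> bool:
    """Drop documents that are mostly numeric tokens."""
    tokens = text.split()
    if not tokens:
        return True
    numeric = sum(tok.replace(".", "", 1).isdigit() for tok in tokens)
    return numeric / len(tokens) <= max_ratio

def passes_binary_check(text: str, max_ratio: float = 0.20) -> bool:
    """Drop documents with a high share of non-printable characters (placeholder threshold)."""
    if not text:
        return True
    nonprintable = sum(not (ch.isprintable() or ch in "\n\t\r") for ch in text)
    return nonprintable / len(text) <= max_ratio

# PII redaction and malware scanning via Presidio and ClamAV.
analyzer = AnalyzerEngine()        # loads Presidio's default NLP models
anonymizer = AnonymizerEngine()
scanner = clamd.ClamdUnixSocket()  # or clamd.ClamdNetworkSocket(host, port)

def redact_pii(text: str) -> str:
    """Replace detected PII spans with Presidio's default placeholder tags."""
    findings = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

def passes_malware_scan(text: str) -> bool:
    """True if clamd reports no detection for the document bytes."""
    status, _signature = scanner.instream(io.BytesIO(text.encode("utf-8")))["stream"]
    return status != "FOUND"
```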

All thresholds and filter behaviors must be exposed via the CLI and config files. Document-level logs or counters should be maintained for each filter to aid debugging and analysis.
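
For example, the cleaning section of the prepare config could be surfaced roughly like this (key names and structure are hypothetical and would need to follow Fast-LLM's existing config conventions):

```yaml
cleaning:
  length:
    max_characters: 100000
  ngram_repetition:
    max_repeats: 32
  word_frequency:
    top1_max_fraction: 0.30
    top2_max_fraction: 0.50
  numeric:
    max_fraction: 0.50
  binary:
    enabled: true
  pii:
    backend: presidio
    action: redact        # or: drop
  malware:
    backend: clamav
    action: drop
  log_filter_counts: true
```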

📌 Acceptance Criteria

  • All listed filters are implemented and integrated into fast-llm prepare.
  • Cleaning is fully configurable, both via CLI and YAML config files.
  • The implementation works with the existing distributed CPU setup (torchrun + Gloo).
  • Preprocessing throughput remains acceptable with cleaning enabled.
  • Logs report how many documents are removed by each filter.
  • Code is tested and documented.
  • PR includes a performance/impact summary and example CLI usage.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
tscholak added the enhancement (New feature or request) label on Jan 14, 2025
jlamypoirier (Collaborator) commented

@tscholak Is this still relevant? Let's describe or close.

tscholak changed the title from [feat] Adopt OLMo-2 data cleaning in fast-llm prepare to [feat] Add data cleaning in fast-llm prepare on Mar 23, 2025
tscholak (Collaborator, Author) commented

@jlamypoirier it's more relevant than ever.

bigximik (Contributor) commented Apr 1, 2025

How Should We Manage Model Downloading and Loading for Presidio and Virus Database Handling for ClamAV?

More details are available here.

The likely approach is to specify a cache folder and download the necessary files if they are not already present. However, we need to determine whether this should happen per run or persist across multiple runs. If the latter, we must address potential conflicts between different parallel executions.
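
One possible shape for a persistent, run-shared cache, using an exclusive POSIX file lock so that concurrent workers or parallel runs don't race on the same download (paths, names, and the download callback are hypothetical):

```python
import fcntl
from pathlib import Path
from typing import Callable

def ensure_cached(cache_dir: str, name: str, download: Callable[[Path], None]) -> Path:
    """Download `name` into `cache_dir` once; other processes wait on the lock and reuse it."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    target = cache / name
    with open(cache / f"{name}.lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # block until we hold the exclusive lock
        try:
            if not target.exists():
                download(target)           # e.g. fetch the ClamAV signature DB or a Presidio model
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return target
```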
