🎯 Goal (What & Why)
Add a comprehensive data cleaning stage to the fast-llm prepare command.
fast-llm prepare currently downloads and tokenizes HuggingFace datasets into Fast-LLM's .bin / .idx format using a distributed torchrun setup. However, it performs no data cleaning, which limits training quality and poses risks around PII and malicious content.
This ticket adds a required data cleaning phase that is configurable, fast, and integrated into the distributed preprocessing loop. The goal is to improve model quality, reduce noise, follow best practices (see OLMo-2), and meet responsible AI standards by removing PII and malware from the training corpus.
🚀 Execution Plan
Step 1: What is the smallest working version?
Extend fast-llm prepare to apply a modular and configurable cleaning pipeline during preprocessing.
All cleaning steps must be integrated into the existing torchrun CPU-only distributed setup, preserving parallelism.
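A minimal sketch of what such a pipeline interface could look like (DocumentFilter and CleaningPipeline are hypothetical names, not existing Fast-LLM classes; the real integration would live inside the per-rank tokenization loop):

```python
# Hypothetical sketch only; class and method names are illustrative.
import abc
import collections


class DocumentFilter(abc.ABC):
    """One cleaning step: return the (possibly modified) text, or None to drop the document."""

    name: str = "filter"

    @abc.abstractmethod
    def __call__(self, text: str) -> str | None:
        ...


class CleaningPipeline:
    """Applies filters in order and counts how many documents each filter removes."""

    def __init__(self, filters: list[DocumentFilter]):
        self.filters = filters
        self.removed = collections.Counter()

    def __call__(self, text: str) -> str | None:
        for document_filter in self.filters:
            text = document_filter(text)
            if text is None:
                self.removed[document_filter.name] += 1
                return None
        return text
```

Each rank would run its own pipeline instance inside the existing tokenization loop, and the per-filter counters can be aggregated across ranks at the end of the run (see the counter-aggregation sketch under the acceptance criteria).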
Step 2: Required cleaning filters (all must be implemented; sketches of the heuristic filters and of the Presidio/ClamAV integration follow this list):
Length filtering
Remove documents exceeding a configurable max length (in characters or tokens).
n-gram repetition
Remove documents in which any n-gram is repeated at least N times (default: N = 32), as in OLMo-2.
Frequency-based filtering
Remove documents where:
The most frequent word exceeds X% of total tokens (default: 30%).
The top-2 most frequent words together exceed Y% of total tokens (default: 50%).
Binary content filtering
Remove documents that contain mostly binary data.
Numerical content filtering
Remove documents whose fraction of numeric tokens exceeds a configurable threshold (default: 50%).
PII redaction
Integrate Microsoft Presidio to detect sensitive personal information and either redact it or remove the affected documents.
Malware removal
Integrate ClamAV to scan documents and remove any that trigger detections.
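As referenced above, a minimal sketch of the heuristic filters, assuming whitespace tokenization and illustrative default thresholds (a real implementation could reuse the tokenizer already loaded for preprocessing and must read every threshold from the config):

```python
import collections
import re


def too_long(text: str, max_chars: int = 1_000_000) -> bool:
    # Length filtering: drop documents exceeding a configurable maximum length.
    return len(text) > max_chars


def has_repeated_ngrams(text: str, n: int = 3, max_repeats: int = 32) -> bool:
    # n-gram repetition: drop documents where any n-gram occurs at least `max_repeats` times.
    words = text.split()
    counts = collections.Counter(tuple(words[i : i + n]) for i in range(len(words) - n + 1))
    return bool(counts) and max(counts.values()) >= max_repeats


def too_frequent_words(text: str, top1: float = 0.30, top2: float = 0.50) -> bool:
    # Frequency-based filtering: drop documents dominated by one or two words.
    words = text.split()
    if not words:
        return False
    counts = collections.Counter(words).most_common(2)
    first = counts[0][1] / len(words)
    second = (counts[1][1] / len(words)) if len(counts) > 1 else 0.0
    return first > top1 or (first + second) > top2


def mostly_binary(text: str, max_non_printable: float = 0.2) -> bool:
    # Binary content filtering: drop documents with a large share of non-printable characters.
    if not text:
        return False
    non_printable = sum(1 for c in text if not (c.isprintable() or c.isspace()))
    return non_printable / len(text) > max_non_printable


def mostly_numeric(text: str, max_numeric: float = 0.5) -> bool:
    # Numerical content filtering: drop documents where numeric tokens dominate.
    words = text.split()
    if not words:
        return False
    numeric = sum(1 for w in words if re.fullmatch(r"[\d.,%+-]+", w))
    return numeric / len(words) > max_numeric
```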
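For the PII and malware steps, a hedged sketch of how Presidio and a locally running clamd daemon could be invoked per document (the default clamd socket, English-only analysis, and the absence of error handling are simplifying assumptions; ClamAV's daemon and signature database must be available on every node):

```python
import io

import clamd  # Python binding for the ClamAV daemon
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

_analyzer = AnalyzerEngine()
_anonymizer = AnonymizerEngine()
_scanner = clamd.ClamdUnixSocket()  # assumes clamd is running on the default local socket


def redact_pii(text: str) -> str:
    # Detect PII spans with Presidio and replace them with placeholder tokens.
    results = _analyzer.analyze(text=text, language="en")
    return _anonymizer.anonymize(text=text, analyzer_results=results).text


def is_malware(text: str) -> bool:
    # Stream the document through clamd; any signature match drops the document.
    verdict = _scanner.instream(io.BytesIO(text.encode("utf-8", errors="ignore")))
    status, _signature = verdict["stream"]
    return status == "FOUND"
```

Both steps are much slower than the heuristic filters, so they should probably run last, after the cheap filters have already discarded obvious junk.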
All thresholds and filter behaviors must be exposed via the CLI and config files. Document-level logs or counters should be maintained for each filter to aid debugging and analysis.
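As an illustration only, the cleaning section of a prepare YAML config might look like the sketch below; every key name is hypothetical and would need to match whatever config schema is actually adopted:

```yaml
cleaning:
  length:
    max_characters: 1000000
  ngram_repetition:
    ngram_size: 3
    max_repeats: 32
  word_frequency:
    top1_max_fraction: 0.30
    top2_max_fraction: 0.50
  binary:
    enabled: true
  numeric:
    max_fraction: 0.50
  pii:
    backend: presidio
    action: redact   # or: drop
  malware:
    backend: clamav
    action: drop
  report_counters: true
```

The same keys would be overridable from the CLI, with the per-filter removal counts logged at the end of the run.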
📌 Acceptance Criteria
All listed filters are implemented and integrated into fast-llm prepare.
Cleaning is fully configurable, both via CLI and YAML config files.
The implementation works with the existing distributed CPU setup (torchrun + Gloo).
Preprocessing performance remains acceptable (cleaning should not substantially slow down fast-llm prepare).
Logs report how many documents are removed by each filter (see the counter-aggregation sketch after this list).
Code is tested and documented.
PR includes a performance/impact summary and example CLI usage.
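For the distributed-setup and logging criteria above, the per-rank removal counters can be summed over the existing Gloo process group at the end of preprocessing; a minimal sketch, assuming torch.distributed is already initialized with the gloo backend and every rank uses the same filter names:

```python
import torch
import torch.distributed as dist


def aggregate_filter_counts(local_counts: dict[str, int]) -> dict[str, int]:
    # Sum the per-filter removal counters across all ranks.
    # Gloo supports CPU-only all_reduce, so no GPU is needed.
    names = sorted(local_counts)
    totals = torch.tensor([local_counts[n] for n in names], dtype=torch.int64)
    dist.all_reduce(totals, op=dist.ReduceOp.SUM)
    return dict(zip(names, totals.tolist()))
```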
🛠️ Project Management
Assign the issue to the Fast-LLM project.
Set the Estimate field (in days) in the GitHub project.
Use the Size field to categorize the PR size (Small/Medium/Large).
Assign an owner when opening the issue.
The likely approach is to specify a cache folder and download the necessary files if they are not already present. However, we need to determine whether this should happen per run or persist across multiple runs. If the latter, we must address potential conflicts between different parallel executions.
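One way to make a persistent, shared cache safe is to let exactly one process download while the others wait on a file lock; a minimal sketch, where the filelock dependency and the download_assets helper are assumptions for illustration:

```python
from pathlib import Path

from filelock import FileLock  # third-party; assumed acceptable as a dependency


def ensure_cleaning_assets(cache_dir: Path) -> Path:
    """Download Presidio models / ClamAV signatures into a shared cache exactly once.

    Safe across concurrent ranks and across separate runs sharing the same cache:
    the first process to acquire the lock downloads, everyone else just waits.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    marker = cache_dir / ".complete"
    with FileLock(str(cache_dir / ".lock")):
        if not marker.exists():
            download_assets(cache_dir)  # hypothetical helper that fetches the files
            marker.touch()
    return cache_dir
```

Within a single torchrun job, an alternative is to download on rank 0 only and have the other ranks wait on a barrier, but a file lock also covers separate jobs that share the same cache directory.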