
[feat] Add data cleaning in fast-llm prepare #112


Open · 3 of 4 tasks
tscholak opened this issue Jan 14, 2025 · 3 comments
Labels: enhancement (New feature or request), need update

tscholak (Collaborator) commented Jan 14, 2025

🎯 Goal (What & Why)

Add a comprehensive data cleaning stage to the fast-llm prepare command.

fast-llm prepare currently downloads and tokenizes HuggingFace datasets into Fast-LLM's .bin / .idx format using a distributed torchrun setup. However, it performs no data cleaning, which limits training quality and poses risks around PII and malicious content.

This ticket adds a required data cleaning phase that is configurable, fast, and integrated into the distributed preprocessing loop. The goal is to improve model quality, reduce noise, follow best practices (see OLMo-2), and meet responsible AI standards by removing PII and malware from the training corpus.

🚀 Execution Plan

Step 1: What is the smallest working version?

  • Extend fast-llm prepare to apply a modular and configurable cleaning pipeline during preprocessing.
  • All cleaning steps must be integrated into the existing torchrun CPU-only distributed setup, preserving parallelism.
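
For instance, each torchrun worker could keep only the documents assigned to its rank and run the cleaning predicates locally, so the existing CPU/Gloo parallelism is preserved. A minimal, fast-llm-agnostic sketch (the document layout and the filter interface are assumptions, not the actual prepare internals):

```python
import logging
import os
from collections import Counter
from typing import Callable, Iterable, Iterator

def iter_local_shard(documents: Iterable[dict]) -> Iterator[dict]:
    """Yield only the documents assigned to this torchrun worker."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    for i, doc in enumerate(documents):
        if i % world_size == rank:
            yield doc

def clean_shard(documents: Iterable[dict],
                filters: list[Callable[[str], bool]]) -> Iterator[dict]:
    """Drop a document as soon as any filter rejects it; count removals per filter."""
    removed = Counter()
    for doc in iter_local_shard(documents):
        for keep in filters:
            if not keep(doc["text"]):
                removed[keep.__name__] += 1
                break
        else:
            yield doc
    logging.info("documents removed per filter: %s", dict(removed))
```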

Step 2: Required cleaning filters (all must be implemented):

  • Length filtering
    • Remove documents exceeding a configurable max length (in characters or tokens).
  • n-gram repetition
    • Remove documents with ≥N repeated n-grams (default: 32), as in OLMo-2.
  • Frequency-based filtering
    • Remove documents where:
      • The most frequent word exceeds X% of total tokens (default: 30%).
      • The top-2 most frequent words together exceed Y% of total tokens (default: 50%).
  • Binary content filtering
    • Remove documents that contain mostly binary data.
  • Numerical content filtering
    • Remove documents where the fraction of numeric tokens exceeds a configurable threshold (default: 50%); see the sketch after this list.
  • PII redaction
    • Integrate Microsoft Presidio to detect sensitive personal information and either redact it or drop the affected documents.
  • Malware removal
    • Integrate ClamAV to scan documents and remove any that trigger detections.
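
A minimal sketch of how these filters might look as per-document predicates. Function names, thresholds, whitespace tokenization, and the exact n-gram repetition rule are illustrative assumptions rather than the actual fast-llm or OLMo-2 recipe; the PII and malware helpers assume the presidio-analyzer / presidio-anonymizer packages and a running clamd daemon reachable through the third-party clamd Python package:

```python
import io
from collections import Counter

import clamd
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

def passes_length(text: str, max_chars: int = 100_000) -> bool:
    """Length filter: drop documents longer than a configurable limit (placeholder default)."""
    return len(text) <= max_chars

def passes_ngram_repetition(text: str, n: int = 3, max_repeats: int = 32) -> bool:
    """Drop a document if any single n-gram repeats at least `max_repeats` times
    (one plausible reading of the OLMo-2-style rule)."""
    words = text.split()
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return not counts or max(counts.values()) < max_repeats

def passes_word_frequency(text: str, top1_max: float = 0.30, top2_max: float = 0.50) -> bool:
    """Drop documents dominated by one or two words."""
    words = text.split()
    if not words:
        return True
    top = Counter(words).most_common(2)
    if top[0][1] / len(words) > top1_max:
        return False
    return sum(count for _, count in top) / len(words) <= top2_max

def passes_numeric_ratio(text: str, max_ratio: float = 0.50) -> bool:
    """Drop documents that are mostly numeric tokens."""
    tokens = text.split()
    if not tokens:
        return True
    numeric = sum(tok.replace(".", "", 1).isdigit() for tok in tokens)
    return numeric / len(tokens) <= max_ratio

def passes_binary_check(text: str, max_ratio: float = 0.20) -> bool:
    """Drop documents with a high share of non-printable characters (placeholder threshold)."""
    if not text:
        return True
    nonprintable = sum(not (ch.isprintable() or ch in "\n\t\r") for ch in text)
    return nonprintable / len(text) <= max_ratio

# PII redaction and malware scanning via Presidio and ClamAV.
analyzer = AnalyzerEngine()        # loads Presidio's default NLP models
anonymizer = AnonymizerEngine()
scanner = clamd.ClamdUnixSocket()  # or clamd.ClamdNetworkSocket(host, port)

def redact_pii(text: str) -> str:
    """Replace detected PII spans with Presidio's default placeholder tags."""
    findings = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

def passes_malware_scan(text: str) -> bool:
    """True if clamd reports no detection for the document bytes."""
    status, _signature = scanner.instream(io.BytesIO(text.encode("utf-8")))["stream"]
    return status != "FOUND"
```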

All thresholds and filter behaviors must be exposed via the CLI and config files. Document-level logs or counters should be maintained for each filter to aid debugging and analysis.
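
For example, the cleaning section of the prepare config could be surfaced roughly like this (key names and structure are hypothetical and would need to follow Fast-LLM's existing config conventions):

```yaml
cleaning:
  length:
    max_characters: 100000
  ngram_repetition:
    max_repeats: 32
  word_frequency:
    top1_max_fraction: 0.30
    top2_max_fraction: 0.50
  numeric:
    max_fraction: 0.50
  binary:
    enabled: true
  pii:
    backend: presidio
    action: redact        # or: drop
  malware:
    backend: clamav
    action: drop
  log_filter_counts: true
```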

📌 Acceptance Criteria

  • All listed filters are implemented and integrated into fast-llm prepare.
  • Cleaning is fully configurable, both via CLI and YAML config files.
  • The implementation works with the existing distributed CPU setup (torchrun + Gloo).
  • Preprocessing throughput remains acceptable with cleaning enabled.
  • Logs report how many documents are removed by each filter.
  • Code is tested and documented.
  • PR includes a performance/impact summary and example CLI usage.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
tscholak added the enhancement (New feature or request) label on Jan 14, 2025
jlamypoirier (Collaborator) commented

@tscholak Is this still relevant? Let's describe or close.

tscholak changed the title from [feat] Adopt OLMo-2 data cleaning in fast-llm prepare to [feat] Add data cleaning in fast-llm prepare on Mar 23, 2025
tscholak (Collaborator, Author) commented

@jlamypoirier it's more relevant than ever.

bigximik (Contributor) commented Apr 1, 2025

How Should We Manage Model Downloading and Loading for Presidio and Virus Database Handling for ClamAV?

More details are available here.

The likely approach is to specify a cache folder and download the necessary files if they are not already present. However, we need to determine whether this should happen per run or persist across multiple runs. If the latter, we must address potential conflicts between different parallel executions.
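
One possible shape for a persistent, run-shared cache, using an exclusive POSIX file lock so that concurrent workers or parallel runs don't race on the same download (paths, names, and the download callback are hypothetical):

```python
import fcntl
from pathlib import Path
from typing import Callable

def ensure_cached(cache_dir: str, name: str, download: Callable[[Path], None]) -> Path:
    """Download `name` into `cache_dir` once; other processes wait on the lock and reuse it."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    target = cache / name
    with open(cache / f"{name}.lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # block until we hold the exclusive lock
        try:
            if not target.exists():
                download(target)           # e.g. fetch the ClamAV signature DB or a Presidio model
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return target
```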
