Ciemia - Climate & Nature Content Filter for FineWeb-Edu

A streaming pipeline to filter climate and nature-related content from the HuggingFace FineWeb-Edu dataset using FastText classifiers and GPT-based weak supervision.

Overview

This tool:

Streams FineWeb-Edu from HuggingFace (no full download required)
Filters for English content using FastText language detection
Applies keyword filtering for climate/nature topics
Uses a trained FastText classifier for high-precision filtering
Uploads filtered data to your HuggingFace dataset repository in shards

Installation

# Clone or navigate to the project
cd Ciemia

# Install dependencies
pip install -r requirements.txt

# Download the FastText language identification model
python cli.py download-lid

Configuration

Create a .env file with your API keys:

OPENAI_API_KEY=your_openai_api_key_here
HF_TOKEN=your_huggingface_write_token_here

OPENAI_API_KEY: Required for GPT labeling (Step 2)
HF_TOKEN: Required for uploading to HuggingFace (Step 5). Must have write permissions.

Usage

Quick Start (If models are already trained)

If you already have a trained classifier (models/fasttext_climate.bin), you can directly run:

python cli.py stream-filter-upload --repo-id your-username/your-dataset-name

Full Pipeline

Step 1: Sample and Label with GPT

Sample candidates from FineWeb-Edu and label them with GPT for weak supervision:

python cli.py sample-label --num-samples 10000

Options:

--num-samples, -n: Number of samples to label (default: 10000)
--output, -o: Output path (default: data/gpt_labels_10k.jsonl)
--model, -m: OpenAI model (default: gpt-4o-mini)
--max-chars: Max characters per sample (default: 2000)
--resume/--no-resume: Resume from existing labels (default: resume)

Step 2: Build Training Files

Convert GPT labels to FastText training format:

python cli.py build-training

Options:

--labels, -l: Input labels file (default: data/gpt_labels_10k.jsonl)
--train-output: Training file path (default: data/fasttext_train.txt)
--valid-output: Validation file path (default: data/fasttext_valid.txt)
--min-chars: Minimum text length (default: 50)
--valid-ratio: Validation split ratio (default: 0.1)

Step 3: Train FastText Classifier

Train a binary classifier on the labeled data:

python cli.py train-fasttext

Options:

--train, -t: Training file (default: data/fasttext_train.txt)
--valid, -v: Validation file (default: data/fasttext_valid.txt)
--output, -o: Model output path (default: models/fasttext_climate.bin)
--lr: Learning rate (default: 0.5)
--epoch: Number of epochs (default: 25)
--word-ngrams: Max word n-grams (default: 2)
--dim: Word vector dimension (default: 100)
--threshold: Classification threshold for evaluation (default: 0.5)

Step 4: Stream Filter and Upload

Filter the full dataset and upload to HuggingFace:

python cli.py stream-filter-upload --repo-id your-username/your-dataset-name

Options:

--repo-id, -r: HuggingFace dataset repo ID (required)
--classifier, -c: Classifier model path (default: models/fasttext_climate.bin)
--threshold: Classification probability threshold (default: 0.5)
--buffer-size: Records per shard before upload (default: 5000)
--max-records: Maximum records to upload (default: unlimited)
--resume/--no-resume: Resume from previous state (default: resume)

Utility Commands

Download the FastText language ID model:

python cli.py download-lid

Project Structure

Ciemia/
├── cli.py                      # Main CLI entry point
├── requirements.txt            # Python dependencies
├── .env                        # API keys (create this)
├── data/
│   ├── keywords.txt            # Climate/nature keywords for filtering
│   ├── gpt_labels_10k.jsonl    # GPT-labeled samples
│   ├── fasttext_train.txt      # Training data
│   ├── fasttext_valid.txt      # Validation data
│   ├── state.json              # Upload resume state
│   └── temp_shards/            # Temporary shard files
├── models/
│   ├── lid.176.bin             # FastText language ID model
│   └── fasttext_climate.bin    # Trained climate classifier
├── prompts/
│   └── climate_yesno.txt       # GPT labeling prompt
└── src/
    ├── __init__.py
    ├── streaming_filters.py    # English + keyword filtering
    ├── gpt_labeler.py          # GPT labeling logic
    ├── fasttext_trainer.py     # Classifier training + prediction
    └── uploader.py             # HuggingFace upload logic

Pipeline Architecture

FineWeb-Edu (streaming)
        │
        ▼
┌───────────────────┐
│  Stage A: English │  FastText lid.176.bin (prob >= 0.9)
│  Language Filter  │
└───────────────────┘
        │
        ▼
┌───────────────────┐
│  Stage B: Keyword │  112 climate/nature keywords
│  Filter           │
└───────────────────┘
        │
        ▼
┌───────────────────┐
│  Stage C: Climate │  Trained FastText classifier
│  Classifier       │  (prob >= threshold)
└───────────────────┘
        │
        ▼
┌───────────────────┐
│  Upload to        │  Sharded JSONL.gz files
│  HuggingFace      │
└───────────────────┘

Keywords

The pipeline uses 112 keywords organized into categories:

Climate Strong (52): climate change, greenhouse gas, IPCC, Paris Agreement, etc.
Climate Weak (23): adaptation, climate policy, food security, etc.
Nature Strong (34): biodiversity, ecosystem, conservation, deforestation, etc.
Nature Weak (3): ecological, ecosystem health, biodiversity loss

Edit data/keywords.txt to customize the keyword filter.

Resume Support

The upload process supports resuming from interruptions:

State is saved to data/state.json after each shard upload
Run with --resume (default) to continue from where you left off
Run with --no-resume to start fresh

Output Format

Each uploaded record contains:

{
  "text": "The full document text...",
  "id": "Original document ID",
  "url": "Source URL",
  "climate_prob": 0.9234,
  "source": "fineweb"
}

Example Workflow

# 1. Setup
pip install -r requirements.txt
python cli.py download-lid

# 2. Create .env with your keys
echo "OPENAI_API_KEY=sk-..." > .env
echo "HF_TOKEN=hf_..." >> .env

# 3. Sample and label (takes ~1-2 hours for 10k samples)
python cli.py sample-label

# 4. Build training files
python cli.py build-training

# 5. Train classifier
python cli.py train-fasttext

# 6. Upload to HuggingFace (can run indefinitely)
python cli.py stream-filter-upload --repo-id username/fineweb-climate-filtered

# Or with limits for testing
python cli.py stream-filter-upload --repo-id username/dataset --max-records 1000 --buffer-size 500

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data		data
models		models
prompts		prompts
src		src
.DS_Store		.DS_Store
LICENSE		LICENSE
README.md		README.md
ciemia.log		ciemia.log
cli.py		cli.py
objective.md		objective.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ciemia - Climate & Nature Content Filter for FineWeb-Edu

Overview

Installation

Configuration

Usage

Quick Start (If models are already trained)

Full Pipeline

Step 1: Sample and Label with GPT

Step 2: Build Training Files

Step 3: Train FastText Classifier

Step 4: Stream Filter and Upload

Utility Commands

Project Structure

Pipeline Architecture

Keywords

Resume Support

Output Format

Example Workflow

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Ciemia - Climate & Nature Content Filter for FineWeb-Edu

Overview

Installation

Configuration

Usage

Quick Start (If models are already trained)

Full Pipeline

Step 1: Sample and Label with GPT

Step 2: Build Training Files

Step 3: Train FastText Classifier

Step 4: Stream Filter and Upload

Utility Commands

Project Structure

Pipeline Architecture

Keywords

Resume Support

Output Format

Example Workflow

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages