RedTeamML

RedTeamML is a modular toolkit for stress-testing large language models (LLMs) with structured adversarial evaluations. It provides multi-level perturbation generators, shared benchmarking workflows, rich logging and reporting, and adaptive adversarial search that can plug into defence strategies.

Why another toolkit?

Existing projects such as IntelLabs/LLMart, confident-ai/DeepTeam, and caspiankeyes/AART focus on specific attack modalities or orchestration patterns. RedTeamML aims to be a batteries-included framework that unifies:

  • Multi-level attacks – character, token, phrase, semantic, and goal-oriented prompt mutations.
  • Hybrid strategies – configurable black-box and white-box attack loops with curriculum scheduling and latent-space perturbations.
  • Extensible APIs – pluggable attack and defence interfaces plus registry-based discovery.
  • Shared benchmarking – reusable adversarial suites to compare models or versions consistently.
  • Actionable analytics – structured failure logging, robustness curves, tabular/JSON reports.
  • Scalability hooks – batching, parallel execution, and adaptive loop feedback mechanisms.

Key Features

  • Generate adversarial inputs across character-, token-, phrase-, semantic-, and prompt-level operators.
  • Toggle black-box sampling/heuristics or white-box latent FGSM strategies with budget control.
  • Log every interaction in JSONL with tagged failure modes (e.g. hallucination, policy_breach) and metadata.
  • Compute and plot robustness curves (failure rate vs. perturbation budget), comparing multiple models or releases.
  • Benchmark models on shared adversarial suites and export results as JSON/CSV/PNG.
  • Run adaptive adversarial loops that escalate attack strength based on feedback signals.
  • Register defences (input filtering, detectors, mitigation policies) via hook interfaces for online evaluation.
  • Expose an extensible plugin API for third-party attack/defence packages.
  • Trigger meta-adversarial fine-tuning using logged failures and uncertainty scores to iteratively harden models.
  • Combine rule-based tagging, keyword heuristics, transformer toxicity classification, and entropy-based uncertainty thresholds into ensemble detection.

Limitations to Keep in Mind

  • Discrete text perturbations can be brittle or unrealistic; semantic validation is an ongoing challenge.
  • Black-box access constrains attack strength and increases query cost; rate limiting must be respected.
  • Failure definitions for generative tasks remain ambiguous, often requiring task- or domain-specific scoring.
  • Defence overfitting is common; point fixes may not generalise beyond curated attack suites.
  • Latency and cost of ensemble scorers or external detectors need budgeting when operating at scale.

Proposed Hybrid Approach

RedTeamML adopts a hybrid methodology to mitigate these challenges:

  1. Latent-space perturbations guided by semantic similarity hooks, keeping outputs realistic.
  2. Curriculum scheduling across attack families to discover vulnerabilities at increasing intensities (sketched in the config fragment below).
  3. Ensemble detection combining heuristic filters, classifier confidence, and uncertainty scoring derived from model entropy.
  4. Meta-adversarial training hooks that recycle discovered failures into fine-tuning loops, targeting general robustness rather than attack-specific patches.
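
The curriculum idea can be made concrete with a purely illustrative config fragment. Every key below is an assumption rather than the toolkit's actual schema (the shipped examples/ configs are authoritative); it only shows the intent of escalating budgets across attack families:

curriculum:
  - level: character
    budget: 1          # start with cheap, low-intensity perturbations
  - level: semantic
    budget: 3
  - level: prompt
    budget: 5          # escalate only once earlier stages are exhausted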

Parallel evaluation and adaptation

Set batch_size, parallelism (thread or process), and max_workers in your runner config to fan out prompts. The runner keeps curriculum feedback intact while distributing batches.
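
A minimal sketch of how these keys might sit together (the runner block name and nesting are assumptions; check examples/ for the exact schema):

runner:
  batch_size: 4        # prompts dispatched per batch
  parallelism: thread  # thread or process workers
  max_workers: 8       # upper bound on concurrent workers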

White-box latent FGSM attack

Add the latent_fgsm attack to configs targeting models that expose white-box features (e.g. Hugging Face causal LMs). The attack perturbs embeddings via FGSM and decodes nearest neighbours, nudging prompts toward high-risk directives.
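
As a hedged illustration, an attack entry could look like the fragment below; aside from the latent_fgsm name, the list layout and parameter names are placeholders, not the documented schema:

attacks:
  - name: latent_fgsm  # white-box FGSM in embedding space
    epsilon: 0.05      # hypothetical step-size knob
    budget: 5          # hypothetical perturbation budget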

Meta-adversarial fine-tuning

After an evaluation run, you can recycle failures into a safety fine-tune:

rtml meta-train runs/<log-name>.jsonl --model-id distilgpt2 --epochs 1 --output-dir runs/distilgpt2-adapted

The routine performs quick gradient updates that push the model toward a templated safe response for flagged prompts, which is handy for smoke-testing guardrail refreshes.

Quick Start

pip install -e .
rtml attack examples/basic_attack.yaml

See examples/ for configuration templates and docs/architecture.md for a deeper technical overview.
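
For orientation before opening examples/basic_attack.yaml, the sketch below shows the general shape of a config: a target model, a set of attack levels, and an output location. All field names here are illustrative assumptions, not the real schema:

target:
  provider: huggingface  # hypothetical field names throughout
  model_id: distilgpt2
attacks:
  - character
  - token
  - prompt
output:
  log_dir: runs/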

Installation & Environment

  • Python 3.9+
  • Install a compatible PyTorch build before pip install -e . (see pytorch.org for platform-specific wheels).
  • Hugging Face Transformers is installed automatically with the rest of the toolkit.
  • Local Hugging Face models download on first use; no API keys are required unless you swap in hosted providers.
  • Optional analytics commands rely on Polars + PyArrow (python -m pip install polars pyarrow).

python3 -m venv .venv
source .venv/bin/activate
# Example CPU wheel; adjust to match your environment
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -e .

Running Evaluations

Run a full adversarial sweep (character → prompt, including white-box latent FGSM) against the Hugging Face distilgpt2 baseline:

rtml attack examples/hf_attack.yaml

This writes a JSONL log to runs/ containing every probe, model response, detector metadata, and uncertainty scores.

For provider matrices:

# Cerebras model suite (batch size 2, threaded)
rtml attack examples/cerebras_suite.yaml

# Groq LPU target
rtml attack examples/groq_attack.yaml

# Google Gemini API
rtml attack examples/gemini_attack.yaml

# Aggregate comparison
rtml compare runs/basic_suite-*.jsonl --csv runs/plots/compare_summary.csv --plot runs/plots/compare_summary.png

Additional tooling:

  • rtml export-parquet runs/*.jsonl --out-dir parquet_logs – flatten logs into Parquet for downstream analytics (Polars/PyArrow backend).
  • rtml throughput examples/groq_attack.yaml --batch-size 1 --batch-size 2 --batch-size 4 --plot runs/plots/throughput.png – benchmark QPS vs. failure rate across batch sizes; add --stream for Groq and Gemini streaming benchmarks.
  • rtml reinforce examples/hf_attack.yaml --meta-model-id distilgpt2 --plot runs/plots/reinforce.png – attack → meta-train → re-attack loop with before/after failure deltas.

Reports and Visuals

rtml report runs/<log-name>.jsonl \
  --csv runs/plots/latest.csv \
  --json-path runs/plots/latest.json \
  --plot runs/plots/latest.png

Outputs:

  • Terminal summary grouped by attack level (failure rate vs mean budget)
  • CSV + JSON summaries for downstream dashboards
  • Matplotlib robustness curve at runs/plots/latest.png

Meta-Adversarial Adaptation

Recycle the logged failures into a quick safety fine-tune:

rtml meta-train runs/<log-name>.jsonl \
  --model-id distilgpt2 \
  --epochs 1 \
  --learning-rate 5e-5 \
  --output-dir runs/distilgpt2-adapted

The trainer pulls flagged prompts, pairs them with templated safe completions, and runs a lightweight gradient update. Adapted weights drop into runs/distilgpt2-adapted.

Using Cerebras Inference API

  1. Export your Cerebras key (replace the placeholder with your token):

    export CEREBRAS_API_KEY=csx-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  2. Launch an evaluation with a hosted model (defaults to llama3.1-8b):

    rtml attack examples/cerebras_attack.yaml

The adapter hits https://api.cerebras.ai/v1/chat/completions, so no extra plumbing is required beyond the API key.

Using Groq LPU Inference

  1. Export your Groq key:

    export GROQ_API_KEY=gsk_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  2. Run the Groq-targeted evaluation (OpenAI-compatible REST):

    rtml attack examples/groq_attack.yaml

Pair the resulting logs with rtml compare --csv runs/plots/groq_compare.csv <logs...> to contrast Groq latency/failure profiles against other providers.

Using Google Gemini API

  1. Export your Gemini API key:

    export GEMINI_API_KEY=AIzaSyXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  2. Run the Gemini evaluation:

    rtml attack examples/gemini_attack.yaml

Gemini applies its own safety guardrails and supports streaming; pass the --stream flag to include it in throughput benchmarks.

Automation

The previous nightly GitHub Action has been retired. For now, run the CLI workflows locally (or wire them into your own CI) with commands such as:

rtml attack examples/hf_attack.yaml
rtml compare runs/*.jsonl --csv runs/plots/latest.csv
rtml export-parquet runs/*.jsonl

Extended provider summaries are captured in docs/provider_comparison.md and the accompanying plot runs/plots/compare_summary.png.

Latest Regression Check

Example workflow:

  1. rtml attack examples/hf_attack.yaml
  2. rtml report runs/<log-name>.jsonl --plot runs/plots/distilgpt2_latest.png --csv runs/plots/distilgpt2_latest.csv --json-path runs/plots/distilgpt2_latest.json
  3. rtml meta-train runs/<log-name>.jsonl --model-id distilgpt2 --epochs 1 --output-dir runs/distilgpt2-adapted

Typical outcomes:

  • 30+ adversarial interactions covering character/token/semantic/prompt levels (including white-box latent FGSM).
  • Failure rates vary by attack family; character-level attacks typically show 0–40% failure rates.
  • Robustness curves + CSV/JSON exports saved to runs/plots/.
  • The meta-adversarial trainer typically converges with a loss around 3.0, saving adapted weights to the output directory.
  • Hosted provider runs (Cerebras, Groq, Gemini) show varying failure patterns; prompt-level attacks are typically most successful.
  • Use rtml compare to aggregate failure/latency metrics across multiple logs for provider benchmarking.
  • Provider comparisons are summarized in docs/provider_comparison.md.

Known warnings: datetime.utcnow deprecation and transformers' return_all_scores notice; both are non-blocking and slated for cleanup.

Roadmap

  • Integrate RL-based prompt optimisation for adaptive loops.
  • Add hosted evaluation notebook templates.
  • Provide defence zoo (toxicity filters, jailbreak detectors, uncertainty wrappers).
  • Support distributed execution backends (Ray, Celery) and multi-node logging.

License

MIT
