RedTeamML is a modular toolkit for stress-testing large language models (LLMs) with structured adversarial evaluations. It covers multi-level perturbation generators, shared benchmarking workflows, rich logging and reporting, as well as adaptive adversarial search that can plug into defence strategies.
Existing projects such as IntelLabs/LLMart, confident-ai/DeepTeam, and caspiankeyes/AART focus on specific attack modalities or orchestration patterns. RedTeamML aims to be a batteries-included framework that unifies:
- Multi-level attacks – character, token, phrase, semantic, and goal-oriented prompt mutations.
- Hybrid strategies – configurable black-box and white-box attack loops with curriculum scheduling and latent-space perturbations.
- Extensible APIs – pluggable attack and defence interfaces plus registry-based discovery.
- Shared benchmarking – reusable adversarial suites to compare models or versions consistently.
- Actionable analytics – structured failure logging, robustness curves, tabular/JSON reports.
- Scalability hooks – batching, parallel execution, and adaptive loop feedback mechanisms.
- Generate adversarial inputs across character-, token-, phrase-, semantic-, and prompt-level operators.
- Toggle black-box sampling/heuristics or white-box latent FGSM strategies with budget control.
- Log every interaction in JSONL with tagged failure modes (e.g. hallucination, policy_breach) and metadata.
- Compute and plot robustness curves (failure rate vs. perturbation budget), comparing multiple models or releases.
- Benchmark models on shared adversarial suites and export results as JSON/CSV/PNG.
- Run adaptive adversarial loops that escalate attack strength based on feedback signals.
- Register defences (input filtering, detectors, mitigation policies) via hook interfaces for online evaluation.
- Publish an extensible plugin API for third-party attack/defence packages.
- Trigger meta-adversarial fine-tuning using logged failures and uncertainty scores to iteratively harden models.
- Ensemble detection mixes rule-based tagging, keyword heuristics, transformer toxicity classification, and entropy-based uncertainty thresholds.
- Discrete text perturbations can be brittle or unrealistic; semantic validation is an ongoing challenge.
- Black-box access constrains attack strength and increases query cost; rate limiting must be respected.
- Failure definitions for generative tasks remain ambiguous, often requiring task- or domain-specific scoring.
- Defence overfitting is common; point fixes may not generalise beyond curated attack suites.
- Latency and cost of ensemble scorers or external detectors need budgeting when operating at scale.
RedTeamML adopts a hybrid methodology to mitigate these challenges:
- Latent-space perturbations guided by semantic similarity hooks, keeping outputs realistic.
- Curriculum scheduling across attack families to discover vulnerabilities at increasing intensities.
- Ensemble detection combining heuristic filters, classifier confidence, and uncertainty scoring derived from model entropy (see the configuration sketch after this list).
- Meta-adversarial training hooks that recycle discovered failures into fine-tuning loops, targeting general robustness rather than attack-specific patches.
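To make this concrete, the sketch below shows what a combined curriculum-plus-detection configuration might look like. It is a hypothetical layout: the field names (`evaluation`, `curriculum`, `detectors`, `entropy_threshold`, and friends) are illustrative assumptions, not the toolkit's exact schema.

```yaml
# Hypothetical layout; field names are illustrative, not the exact RedTeamML schema.
evaluation:
  curriculum:
    stages:                     # escalate from cheap edits to goal-oriented prompts
      - {level: character, budget: 0.1}
      - {level: token, budget: 0.2}
      - {level: semantic, budget: 0.4}
      - {level: prompt, budget: 0.8}
  detection:
    detectors:                  # ensemble of rule tags, heuristics, and a classifier
      - rule_tags
      - keyword_heuristics
      - transformer_toxicity
    entropy_threshold: 2.5      # flag responses whose token entropy exceeds this value
```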
Set batch_size, parallelism (thread or process), and max_workers in your runner config to fan out prompts. The runner keeps curriculum feedback intact while distributing batches.
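For example, a block along these lines would fan prompts out across four worker threads in batches of eight. The knobs batch_size, parallelism, and max_workers come from the description above; the surrounding nesting is an assumption.

```yaml
# Sketch only: batch_size, parallelism, and max_workers are the documented knobs;
# the surrounding "runner" nesting is an assumption.
runner:
  batch_size: 8
  parallelism: thread    # or "process"
  max_workers: 4
```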
Add the latent_fgsm attack to configs targeting models that expose white-box features (e.g. Hugging Face causal LMs). The attack perturbs embeddings via FGSM and decodes nearest neighbours, nudging prompts toward high-risk directives.
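A config entry for the white-box attack might look roughly like the following. The attack name latent_fgsm comes from the toolkit; the parameters shown are illustrative placeholders.

```yaml
# Sketch: latent_fgsm is the documented attack name; the parameters are assumed knobs.
attacks:
  - name: latent_fgsm
    epsilon: 0.05    # size of the embedding-space FGSM step (assumed parameter)
    steps: 3         # perturb/decode rounds (assumed parameter)
```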
After an evaluation run, you can recycle failures into a safety fine-tune:
rtml meta-train runs/<log-name>.jsonl --model-id distilgpt2 --epochs 1 --output-dir runs/distilgpt2-adapted
The routine performs quick gradient updates that push the model toward a templated safe response for flagged prompts; handy for smoke-testing guardrail refreshes.
pip install -e .
rtml attack examples/basic_attack.yaml
See examples/ for configuration templates and docs/architecture.md for a deeper technical overview.
- Python 3.9+
- Install a compatible PyTorch build before running pip install -e . (see pytorch.org for platform-specific wheels).
- Transformers is installed automatically along with the rest of the toolkit.
- Local Hugging Face models download on first use; no API keys required unless you swap in hosted providers.
- Optional analytics commands rely on Polars + PyArrow (python -m pip install polars pyarrow).
python3 -m venv .venv
source .venv/bin/activate
# Example CPU wheel; adjust to match your environment
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -e .
Run a full adversarial sweep (character → prompt, including white-box latent FGSM) against the Hugging Face distilgpt2 baseline:
rtml attack examples/hf_attack.yaml
This writes a JSONL log to runs/ containing every probe, model response, detector metadata, and uncertainty scores.
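For orientation, a config like examples/hf_attack.yaml amounts to a target plus a list of attack levels. The sketch below is a hedged approximation rather than the shipped file, and the exact keys may differ.

```yaml
# Hedged approximation of a Hugging Face attack config; keys are assumptions.
target:
  provider: huggingface
  model: distilgpt2
attacks:
  - character
  - token
  - phrase
  - semantic
  - prompt
  - latent_fgsm      # white-box embedding perturbation
output:
  log_dir: runs/     # JSONL logs with probes, responses, detector metadata, uncertainty
```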
For provider matrices:
# Cerebras model suite (batch size 2, threaded)
rtml attack examples/cerebras_suite.yaml
# Groq LPU target
rtml attack examples/groq_attack.yaml
# Google Gemini API
rtml attack examples/gemini_attack.yaml
# Aggregate comparison
rtml compare runs/basic_suite-*.jsonl --csv runs/plots/compare_summary.csv --plot runs/plots/compare_summary.png
Additional tooling:
- rtml export-parquet runs/*.jsonl --out-dir parquet_logs – flatten logs into Parquet for downstream analytics (Polars/PyArrow backend).
- rtml throughput examples/groq_attack.yaml --batch-size 1 --batch-size 2 --batch-size 4 --plot runs/plots/throughput.png – benchmark QPS vs. failure rate across batch sizes; add --stream for Groq and Gemini streaming benchmarks.
- rtml reinforce examples/hf_attack.yaml --meta-model-id distilgpt2 --plot runs/plots/reinforce.png – attack → meta-train → re-attack loop with before/after failure deltas.
rtml report runs/<log-name>.jsonl \
--csv runs/plots/latest.csv \
--json-path runs/plots/latest.json \
--plot runs/plots/latest.png
Outputs:
- Terminal summary grouped by attack level (failure rate vs mean budget)
- CSV + JSON summaries for downstream dashboards
- Matplotlib robustness curve at runs/plots/latest.png
Recycle the logged failures into a quick safety fine-tune:
rtml meta-train runs/<log-name>.jsonl \
--model-id distilgpt2 \
--epochs 1 \
--learning-rate 5e-5 \
--output-dir runs/distilgpt2-adapted
The trainer pulls flagged prompts, pairs them with templated safe completions, and runs a lightweight gradient update. Adapted weights drop into runs/distilgpt2-adapted.
- Export your Cerebras key (replace the placeholder with your token):
  export CEREBRAS_API_KEY=csx-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
- Launch an evaluation with a hosted model (defaults to llama3.1-8b):
  rtml attack examples/cerebras_attack.yaml
The adapter hits https://api.cerebras.ai/v1/chat/completions, so no extra plumbing is required beyond the API key.
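Conceptually, the hosted-provider configs just swap the target block; a Cerebras entry might look something like the sketch below, and the Groq and Gemini configs follow the same pattern. The field names are illustrative assumptions, while the model, endpoint, and environment variable come from the steps above.

```yaml
# Illustrative target block; field names are assumptions, values are from the docs above.
target:
  provider: cerebras
  model: llama3.1-8b                                     # hosted default
  endpoint: https://api.cerebras.ai/v1/chat/completions
  api_key_env: CEREBRAS_API_KEY                          # read from the environment
```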
- Export your Groq key:
  export GROQ_API_KEY=gsk_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
- Run the Groq-targeted evaluation (OpenAI-compatible REST):
  rtml attack examples/groq_attack.yaml
Pair the resulting logs with rtml compare --csv runs/plots/groq_compare.csv <logs...> to contrast Groq latency/failure profiles against other providers.
- Export your Gemini API key:
  export GEMINI_API_KEY=AIzaSyXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
- Run the Gemini evaluation:
  rtml attack examples/gemini_attack.yaml
Gemini applies its own safety guardrails and supports streaming; pass the --stream flag for throughput benchmarking.
The previous nightly GitHub Action has been retired. For now run the CLI workflows locally (or wire them into your own CI) with commands such as:
rtml attack examples/hf_attack.yaml
rtml compare runs/*.jsonl --csv runs/plots/latest.csv
rtml export-parquet runs/*.jsonl
Extended provider summaries are captured in docs/provider_comparison.md and the accompanying plot runs/plots/compare_summary.png.
Example workflow:
rtml attack examples/hf_attack.yaml
rtml report runs/<log-name>.jsonl --plot runs/plots/distilgpt2_latest.png --csv runs/plots/distilgpt2_latest.csv --json-path runs/plots/distilgpt2_latest.json
rtml meta-train runs/<log-name>.jsonl --model-id distilgpt2 --epochs 1 --output-dir runs/distilgpt2-adapted
Typical outcomes:
- 30+ adversarial interactions covering character/token/semantic/prompt levels (including white-box latent FGSM).
- Failure rates vary by attack family; character-level attacks typically show 0–40% failure rates.
- Robustness curves + CSV/JSON exports saved to runs/plots/.
- Meta-adversarial trainer typically converges with loss ~3.0, saving adapted weights to the output directory.
- Hosted provider runs (Cerebras, Groq, Gemini) show varying failure patterns; prompt-level attacks are typically most successful.
- Use rtml compare to aggregate failure/latency metrics across multiple logs for provider benchmarking.
- Provider comparisons are summarized in docs/provider_comparison.md.
Known warnings: datetime.utcnow deprecation and transformers' return_all_scores notice; both are non-blocking and slated for cleanup.
- Integrate RL-based prompt optimisation for adaptive loops.
- Add hosted evaluation notebook templates.
- Provide defence zoo (toxicity filters, jailbreak detectors, uncertainty wrappers).
- Support distributed execution backends (Ray, Celery) and multi-node logging.
MIT