# Mutation Testing
Mutation testing answers: How sensitive are your tests?
If you perturb the agent — change a prompt, remove a tool, swap a model — do your tests catch it?
You have 30 tests, all passing at 90%+. You feel confident. But then you accidentally delete a critical tool from the agent's configuration, and... all 30 tests still pass.
Your test suite was testing the happy path but never verified that the agent actually uses that tool.
Mutation testing would have caught this. The "tool removal" mutant would survive, telling you that no test depends on that tool.
Mutation testing follows four steps:

1. Generate mutants: apply mutation operators to create perturbed versions of the agent
2. Run tests: run your test suite against each mutant
3. Check outcomes:
   - If tests FAIL → the mutant is killed ✅ (good, the test suite is sensitive)
   - If tests still PASS → the mutant survives ❌ (bad, the test suite has a blind spot)
4. Compute the mutation score: `killed / total`
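The loop above can be sketched in a few lines. The operator and test-harness interfaces here are hypothetical stand-ins, not AgentAssay's internals: each operator is a callable that returns a perturbed copy of the agent config, and `run_suite` returns `True` when the test suite passes.

```python
def mutation_score(agent_config, operators, run_suite):
    """Apply each operator, re-run the suite, and count kills.

    Illustrative sketch: `operators` are callables returning a
    perturbed copy of the config; `run_suite(config) -> bool`
    reports whether the test suite still passes.
    """
    killed = 0
    survivors = []
    mutants = [op(agent_config) for op in operators]
    for op, mutant in zip(operators, mutants):
        if run_suite(mutant):              # tests still pass -> blind spot
            survivors.append(op.__name__)
        else:                              # tests fail -> mutant killed
            killed += 1
    return killed / len(mutants), survivors
```

A surviving operator name tells you exactly which perturbation your suite failed to notice.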
AgentAssay includes twelve operators organized into four categories:

### Prompt Operators

| Operator | What it does |
|---|---|
| Synonym Substitution | Replaces key terms with synonyms (e.g., "analyze" → "examine") |
| Instruction Order | Shuffles the order of instructions |
| Noise Injection | Adds irrelevant or distracting text |
| Instruction Drop | Removes one instruction |

### Tool Operators

| Operator | What it does |
|---|---|
| Tool Removal | Removes one tool from the available set |
| Tool Reorder | Changes the order of tools in the list |
| Tool Noise | Alters tool descriptions or parameter schemas |

### Model Operators

| Operator | What it does |
|---|---|
| Model Swap | Replaces the model (e.g., large → small) |
| Model Version | Changes the model version (stable → preview) |

### Context Operators

| Operator | What it does |
|---|---|
| Context Truncation | Truncates conversation history or context window |
| Context Noise | Injects irrelevant information into the context |
| Context Permutation | Reorders items in the context |
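To make the operator idea concrete, here is what a Tool Removal operator might look like against a plain-dict agent config. This is an illustrative sketch only; AgentAssay's actual operator interface may differ.

```python
import copy

def tool_removal_mutants(agent_config):
    """Yield one mutant per tool, each missing exactly that tool.

    Sketch against a hypothetical config shape:
    {"tools": [{"name": ...}, ...], ...}
    """
    for i, tool in enumerate(agent_config["tools"]):
        mutant = copy.deepcopy(agent_config)  # never touch the original
        del mutant["tools"][i]
        mutant["description"] = f"Tool Removal ({tool['name']})"
        yield mutant
```

If a test suite passes against every mutant this yields, no test actually depends on any individual tool.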
Mutation Score = killed mutants / total mutants
| Score | Interpretation |
|---|---|
| >= 80% | 🟢 Strong — Test suite detects most perturbations |
| 50-79% | 🟡 Moderate — Some blind spots exist |
| < 50% | 🔴 Weak — Tests pass regardless of significant changes |
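The thresholds above translate directly into a small reporting helper. The cut-offs come from the table; the function itself is illustrative, not part of AgentAssay's API.

```python
def interpret_score(score: float) -> str:
    """Map a mutation score in [0, 1] to the interpretation above."""
    if score >= 0.80:
        return "🟢 Strong"
    if score >= 0.50:
        return "🟡 Moderate"
    return "🔴 Weak"
```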
Each surviving mutant tells you something specific:
| Surviving Mutant | What It Means | Fix |
|---|---|---|
| Synonym substitution | Tests check outputs, not reasoning | Add contract-based tests |
| Tool removal | No test requires that tool | Add scenario that needs the tool |
| Model swap | Tests not sensitive to quality differences | Add harder scenarios |
| Context truncation | Tests use short contexts | Add long-context scenarios |
```shell
# Run all mutation operators
agentassay mutate --config agent.yaml --scenario qa.yaml

# Run only specific categories
agentassay mutate -c agent.yaml -s qa.yaml --operators prompt,tool

# Save results
agentassay mutate -c agent.yaml -s qa.yaml -o mutations.json
```

Or from Python:

```python
from agentassay.mutation import MutationRunner, DEFAULT_OPERATORS

runner = MutationRunner(
    agent_callable=my_agent,
    operators=DEFAULT_OPERATORS,
    trials_per_mutant=30,
)

result = runner.run_suite(scenario)
print(f"Mutation score: {result.mutation_score:.2%}")
print(f"Killed: {result.killed} / {result.total}")

# Inspect surviving mutants
for mutant in result.survivors:
    print(f"  SURVIVED: {mutant.operator_name}")
    print(f"  Description: {mutant.description}")
    print(f"  Pass rate: {mutant.pass_rate:.1%}")
```

Since agents are non-deterministic, killing a mutant is a statistical question.
A mutant is killed only if the mutated agent's pass rate is statistically significantly lower than the original's, as judged by Fisher's exact test at α = 0.05. This avoids false kills caused by random run-to-run variation.
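Such a decision rule can be sketched with SciPy's `fisher_exact`. The helper name and counts below are illustrative, not AgentAssay's internals; the one-sided alternative asks whether the original's pass odds exceed the mutant's.

```python
from scipy.stats import fisher_exact

def is_killed(orig_pass: int, orig_total: int,
              mut_pass: int, mut_total: int,
              alpha: float = 0.05) -> bool:
    """Kill decision sketch: is the mutant's pass rate
    significantly lower than the original's?"""
    # 2x2 contingency table: [passes, failures] per agent
    table = [[orig_pass, orig_total - orig_pass],
             [mut_pass, mut_total - mut_pass]]
    # one-sided test: original pass odds greater than mutant's
    _, p_value = fisher_exact(table, alternative="greater")
    return p_value < alpha
```

With 30 trials each, an original passing 28/30 against a mutant passing 15/30 is a clear kill, while a mutant passing 27/30 is indistinguishable from noise and survives.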
Initial run:

```shell
agentassay mutate -c agent.yaml -s suite.yaml
```

Output:

```
Mutation Score: 45% (9/20 killed)

SURVIVING MUTANTS:
- Tool Removal (search): No test requires search tool
- Tool Removal (rank): No test requires rank tool
- Context Truncation: Tests use short contexts only
...
```

Fix: add scenarios that require the search and rank tools, and add a long-context scenario.

Re-run:

```shell
agentassay mutate -c agent.yaml -s suite-v2.yaml
```

Output:

```
Mutation Score: 85% (17/20 killed)

SURVIVING MUTANTS:
- Context Permutation: Tests not sensitive to context order
...
```

Much better!
Mutation testing is expensive — it runs N mutations × M trials per mutation. Use adaptive budgeting:
```python
from agentassay.efficiency import AdaptiveBudgetOptimizer

optimizer = AdaptiveBudgetOptimizer()

# Calibrate once on the original agent
estimate = optimizer.calibrate(calibration_traces)

# Use the recommended N for all mutants
runner = MutationRunner(
    agent_callable=my_agent,
    operators=DEFAULT_OPERATORS,
    trials_per_mutant=estimate.recommended_n,  # Adaptive!
)
```

Savings: 40-60% compared to fixed-N mutation testing.
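To see where those savings come from, compare total agent runs under illustrative numbers: 20 mutants at a fixed N of 30 trials versus a hypothetical calibrated recommendation of 12 trials.

```python
mutants = 20
fixed_n = 30           # fixed trials per mutant
adaptive_n = 12        # hypothetical recommended_n from calibration

fixed_runs = mutants * fixed_n         # 600 agent runs
adaptive_runs = mutants * adaptive_n   # 240 agent runs
savings = 1 - adaptive_runs / fixed_runs

print(f"{savings:.0%} fewer runs")     # prints "60% fewer runs"
```

Since every run invokes the agent end to end, the saved runs translate directly into saved tokens and wall-clock time.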
- Coverage Model — Combine mutation testing with coverage metrics
- Token-Efficient Testing — Reduce mutation testing costs
- Statistical Methods — Understand stochastic mutant killing
Part of Qualixar | Author: Varun Pratap Bhardwaj