
Mutation Testing

Varun Pratap Bhardwaj edited this page Mar 6, 2026 · 1 revision


Mutation testing answers: How sensitive are your tests?

If you perturb the agent — change a prompt, remove a tool, swap a model — do your tests catch it?

The Core Question

You have 30 tests, all passing at 90%+. You feel confident. But then you accidentally delete a critical tool from the agent's configuration, and... all 30 tests still pass.

Your test suite was testing the happy path but never verified that the agent actually uses that tool.

Mutation testing would have caught this. The "tool removal" mutant would survive, telling you that no test depends on that tool.

How It Works

  1. Generate mutants: Apply mutation operators to create perturbed versions of the agent
  2. Run tests: Run your test suite against each mutant
  3. Check outcomes:
    • If tests FAIL → mutant is killed ✅ (good, test suite is sensitive)
    • If tests still PASS → mutant survives ❌ (bad, test suite has a blind spot)
  4. Compute mutation score: killed / total
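
The loop above can be sketched in a few lines of Python. Everything here is illustrative (agents as plain values, operators as functions that return a perturbed copy, tests as predicates), not the AgentAssay API:

```python
def mutation_test(agent, operators, tests):
    """Sketch of the core loop: mutate, re-run the test suite, score."""
    killed = 0
    survivors = []
    for operator in operators:
        mutant = operator(agent)  # step 1: perturbed copy of the agent
        if any(not test(mutant) for test in tests):  # step 2: run the suite
            killed += 1  # step 3: a test failed -> mutant is killed
        else:
            survivors.append(operator.__name__)  # all passed -> blind spot
    return killed / len(operators), survivors  # step 4: mutation score
```

The surviving operator names point directly at the tests you are missing.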

The Twelve Mutation Operators

AgentAssay includes twelve operators organized into four categories:

Category 1: Prompt Mutations

| Operator | What it does |
| --- | --- |
| Synonym Substitution | Replaces key terms with synonyms (e.g., "analyze" → "examine") |
| Instruction Order | Shuffles the order of instructions |
| Noise Injection | Adds irrelevant or distracting text |
| Instruction Drop | Removes one instruction |
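
As a concrete illustration, a minimal synonym-substitution operator could look like this (a sketch with a toy synonym map, not the AgentAssay implementation):

```python
import re

# Toy synonym map for illustration; a real operator would use a larger lexicon.
SYNONYMS = {"analyze": "examine", "summarize": "condense", "verify": "check"}

def synonym_substitution(prompt: str) -> str:
    """Replace known key terms with synonyms at word boundaries."""
    pattern = re.compile(r"\b(" + "|".join(SYNONYMS) + r")\b", re.IGNORECASE)
    # Note: replacements are lowercase; capitalization is not preserved here.
    return pattern.sub(lambda m: SYNONYMS[m.group(0).lower()], prompt)
```

If this mutant survives, your tests likely depend on exact wording rather than on behavior.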

Category 2: Tool Mutations

| Operator | What it does |
| --- | --- |
| Tool Removal | Removes one tool from available set |
| Tool Reorder | Changes the order of tools in the list |
| Tool Noise | Alters tool descriptions or parameter schemas |
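
Tool Removal is the simplest to sketch, assuming the agent config carries a list of tool specs (an illustrative structure, not the AgentAssay schema):

```python
import copy

def tool_removal(config: dict, tool_name: str) -> dict:
    """Return a deep copy of the config with one tool removed.

    The original config is left untouched so other mutants can reuse it.
    """
    mutant = copy.deepcopy(config)
    mutant["tools"] = [t for t in mutant["tools"] if t["name"] != tool_name]
    return mutant
```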

Category 3: Model Mutations

| Operator | What it does |
| --- | --- |
| Model Swap | Replaces the model (e.g., large → small) |
| Model Version | Changes model version (stable → preview) |

Category 4: Context Mutations

| Operator | What it does |
| --- | --- |
| Context Truncation | Truncates conversation history or context window |
| Context Noise | Injects irrelevant information into context |
| Context Permutation | Reorders items in the context |
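
Context Truncation, for instance, can be as simple as dropping the oldest turns (a sketch that assumes context is a list of messages, oldest first):

```python
def context_truncation(messages: list, keep_last: int = 4) -> list:
    """Simulate a shorter context window by keeping only the most recent turns."""
    return messages[-keep_last:] if keep_last > 0 else []
```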

Interpreting the Mutation Score

Mutation Score = killed mutants / total mutants

| Score | Interpretation |
| --- | --- |
| >= 80% | 🟢 Strong — Test suite detects most perturbations |
| 50-79% | 🟡 Moderate — Some blind spots exist |
| < 50% | 🔴 Weak — Tests pass regardless of significant changes |
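
These bands can be encoded in a small helper (thresholds taken from the table above; the function itself is illustrative):

```python
def interpret_mutation_score(killed: int, total: int) -> str:
    """Map killed/total to a strength band: strong / moderate / weak."""
    score = killed / total
    if score >= 0.80:
        return "strong"
    if score >= 0.50:
        return "moderate"
    return "weak"
```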

Analyzing Surviving Mutants

Each surviving mutant tells you something specific:

| Surviving Mutant | What It Means | Fix |
| --- | --- | --- |
| Synonym substitution | Tests check outputs, not reasoning | Add contract-based tests |
| Tool removal | No test requires that tool | Add scenario that needs the tool |
| Model swap | Tests not sensitive to quality differences | Add harder scenarios |
| Context truncation | Tests use short contexts | Add long-context scenarios |

Running Mutation Testing

From CLI

```shell
# Run all mutation operators
agentassay mutate --config agent.yaml --scenario qa.yaml

# Run only specific categories
agentassay mutate -c agent.yaml -s qa.yaml --operators prompt,tool

# Save results
agentassay mutate -c agent.yaml -s qa.yaml -o mutations.json
```

From Python

```python
from agentassay.mutation import MutationRunner, DEFAULT_OPERATORS

runner = MutationRunner(
    agent_callable=my_agent,
    operators=DEFAULT_OPERATORS,
    trials_per_mutant=30,
)

result = runner.run_suite(scenario)

print(f"Mutation score: {result.mutation_score:.2%}")
print(f"Killed: {result.killed} / {result.total}")

# Inspect surviving mutants
for mutant in result.survivors:
    print(f"  SURVIVED: {mutant.operator_name}")
    print(f"    Description: {mutant.description}")
    print(f"    Pass rate: {mutant.pass_rate:.1%}")
```

Stochastic Mutation Killing

Since agents are non-deterministic, killing is a statistical question:

A mutant is killed only if:

  • The mutated agent's pass rate is statistically significantly lower than the original's
  • Significance is assessed with Fisher's exact test at α = 0.05

This avoids false kills from random variation.
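
The kill decision described above can be sketched with a one-sided Fisher's exact test, computed here from the hypergeometric tail using only the standard library (function names are illustrative, not the AgentAssay API):

```python
from math import comb

def fisher_one_sided(orig_pass: int, orig_n: int, mut_pass: int, mut_n: int) -> float:
    """P-value for the mutant's pass count being this low or lower,
    conditioning on the margins of the 2x2 table (hypergeometric tail)."""
    successes = orig_pass + mut_pass
    total = orig_n + mut_n
    denom = comb(total, mut_n)
    # Sum P(X = k) for k = 0..mut_pass; comb() returns 0 for infeasible splits.
    return sum(
        comb(successes, k) * comb(total - successes, mut_n - k)
        for k in range(mut_pass + 1)
    ) / denom

def is_killed(orig_pass: int, orig_n: int, mut_pass: int, mut_n: int,
              alpha: float = 0.05) -> bool:
    """Kill only if the mutant's pass rate is significantly lower than the original's."""
    return fisher_one_sided(orig_pass, orig_n, mut_pass, mut_n) < alpha
```

With 30 trials per side, a drop from 28/30 to 27/30 is within random variation and the mutant survives, while a drop to 10/30 is killed.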

Example: Improving a Test Suite

Initial run:

```shell
agentassay mutate -c agent.yaml -s suite.yaml
```

Output:

```
Mutation Score: 45% (9/20 killed)

SURVIVING MUTANTS:
  - Tool Removal (search): No test requires search tool
  - Tool Removal (rank): No test requires rank tool
  - Context Truncation: Tests use short contexts only
  ...
```

Fix: Add scenarios that require search and rank, add long-context scenario.

Re-run:

```shell
agentassay mutate -c agent.yaml -s suite-v2.yaml
```

Output:

```
Mutation Score: 85% (17/20 killed)

SURVIVING MUTANTS:
  - Context Permutation: Tests not sensitive to context order
  ...
```

Much better!

Cost Considerations

Mutation testing is expensive — it runs N mutants × M trials per mutant. Use adaptive budgeting:

```python
from agentassay.efficiency import AdaptiveBudgetOptimizer

optimizer = AdaptiveBudgetOptimizer()

# Calibrate once on the original agent
estimate = optimizer.calibrate(calibration_traces)

# Use the recommended N for all mutants
runner = MutationRunner(
    agent_callable=my_agent,
    operators=DEFAULT_OPERATORS,
    trials_per_mutant=estimate.recommended_n,  # Adaptive!
)
```

Savings: 40-60% compared to fixed-N mutation testing.


Next Steps


Part of Qualixar | Author: Varun Pratap Bhardwaj
