
Mutation Testing

Varun Pratap Bhardwaj edited this page Mar 6, 2026 · 1 revision


Mutation testing answers: How sensitive are your tests?

If you perturb the agent — change a prompt, remove a tool, swap a model — do your tests catch it?

The Core Question

You have 30 tests, all passing at 90%+. You feel confident. But then you accidentally delete a critical tool from the agent's configuration, and... all 30 tests still pass.

Your test suite was testing the happy path but never verified that the agent actually uses that tool.

Mutation testing would have caught this. The "tool removal" mutant would survive, telling you that no test depends on that tool.

How It Works

  1. Generate mutants: Apply mutation operators to create perturbed versions of the agent
  2. Run tests: Run your test suite against each mutant
  3. Check outcomes:
    • If tests FAIL → mutant is killed ✅ (good, test suite is sensitive)
    • If tests still PASS → mutant survives ❌ (bad, test suite has a blind spot)
  4. Compute mutation score: killed / total
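
The loop above can be sketched in a few lines of Python. Everything here is illustrative (agents as plain values, operators as functions that return a perturbed copy, tests as predicates), not the AgentAssay API:

```python
def mutation_test(agent, operators, tests):
    """Sketch of the core loop: mutate, re-run the test suite, score."""
    killed = 0
    survivors = []
    for operator in operators:
        mutant = operator(agent)  # step 1: perturbed copy of the agent
        if any(not test(mutant) for test in tests):  # step 2: run the suite
            killed += 1  # step 3: a test failed -> mutant is killed
        else:
            survivors.append(operator.__name__)  # all passed -> blind spot
    return killed / len(operators), survivors  # step 4: mutation score
```

The surviving operator names point directly at the tests you are missing.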

The Twelve Mutation Operators

AgentAssay includes twelve operators organized into four categories:

Category 1: Prompt Mutations

| Operator | What it does |
| --- | --- |
| Synonym Substitution | Replaces key terms with synonyms (e.g., "analyze" → "examine") |
| Instruction Order | Shuffles the order of instructions |
| Noise Injection | Adds irrelevant or distracting text |
| Instruction Drop | Removes one instruction |
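
As a concrete illustration, a minimal synonym-substitution operator could look like this (a sketch with a toy synonym map, not the AgentAssay implementation):

```python
import re

# Toy synonym map for illustration; a real operator would use a larger lexicon.
SYNONYMS = {"analyze": "examine", "summarize": "condense", "verify": "check"}

def synonym_substitution(prompt: str) -> str:
    """Replace known key terms with synonyms at word boundaries."""
    pattern = re.compile(r"\b(" + "|".join(SYNONYMS) + r")\b", re.IGNORECASE)
    # Note: replacements are lowercase; capitalization is not preserved here.
    return pattern.sub(lambda m: SYNONYMS[m.group(0).lower()], prompt)
```

If this mutant survives, your tests likely depend on exact wording rather than on behavior.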

Category 2: Tool Mutations

| Operator | What it does |
| --- | --- |
| Tool Removal | Removes one tool from available set |
| Tool Reorder | Changes the order of tools in the list |
| Tool Noise | Alters tool descriptions or parameter schemas |
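
Tool Removal is the simplest to sketch, assuming the agent config carries a list of tool specs (an illustrative structure, not the AgentAssay schema):

```python
import copy

def tool_removal(config: dict, tool_name: str) -> dict:
    """Return a deep copy of the config with one tool removed.

    The original config is left untouched so other mutants can reuse it.
    """
    mutant = copy.deepcopy(config)
    mutant["tools"] = [t for t in mutant["tools"] if t["name"] != tool_name]
    return mutant
```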

Category 3: Model Mutations

| Operator | What it does |
| --- | --- |
| Model Swap | Replaces the model (e.g., large → small) |
| Model Version | Changes model version (stable → preview) |

Category 4: Context Mutations

| Operator | What it does |
| --- | --- |
| Context Truncation | Truncates conversation history or context window |
| Context Noise | Injects irrelevant information into context |
| Context Permutation | Reorders items in the context |
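
Context Truncation, for instance, can be as simple as dropping the oldest turns (a sketch that assumes context is a list of messages, oldest first):

```python
def context_truncation(messages: list, keep_last: int = 4) -> list:
    """Simulate a shorter context window by keeping only the most recent turns."""
    return messages[-keep_last:] if keep_last > 0 else []
```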

Interpreting the Mutation Score

Mutation Score = killed mutants / total mutants

| Score | Interpretation |
| --- | --- |
| >= 80% | 🟢 Strong — Test suite detects most perturbations |
| 50-79% | 🟡 Moderate — Some blind spots exist |
| < 50% | 🔴 Weak — Tests pass regardless of significant changes |
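
These bands can be encoded in a small helper (thresholds taken from the table above; the function itself is illustrative):

```python
def interpret_mutation_score(killed: int, total: int) -> str:
    """Map killed/total to a strength band: strong / moderate / weak."""
    score = killed / total
    if score >= 0.80:
        return "strong"
    if score >= 0.50:
        return "moderate"
    return "weak"
```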

Analyzing Surviving Mutants

Each surviving mutant tells you something specific:

| Surviving Mutant | What It Means | Fix |
| --- | --- | --- |
| Synonym substitution | Tests check outputs, not reasoning | Add contract-based tests |
| Tool removal | No test requires that tool | Add scenario that needs the tool |
| Model swap | Tests not sensitive to quality differences | Add harder scenarios |
| Context truncation | Tests use short contexts | Add long-context scenarios |

Running Mutation Testing

From CLI

```shell
# Run all mutation operators
agentassay mutate --config agent.yaml --scenario qa.yaml

# Run only specific categories
agentassay mutate -c agent.yaml -s qa.yaml --operators prompt,tool

# Save results
agentassay mutate -c agent.yaml -s qa.yaml -o mutations.json
```

From Python

```python
from agentassay.mutation import MutationRunner, DEFAULT_OPERATORS

runner = MutationRunner(
    agent_callable=my_agent,
    operators=DEFAULT_OPERATORS,
    trials_per_mutant=30,
)

result = runner.run_suite(scenario)

print(f"Mutation score: {result.mutation_score:.2%}")
print(f"Killed: {result.killed} / {result.total}")

# Inspect surviving mutants
for mutant in result.survivors:
    print(f"  SURVIVED: {mutant.operator_name}")
    print(f"    Description: {mutant.description}")
    print(f"    Pass rate: {mutant.pass_rate:.1%}")
```

Stochastic Mutation Killing

Since agents are non-deterministic, killing is a statistical question:

A mutant is killed only if:

  • The mutated agent's pass rate is statistically significantly lower than the original's
  • Significance is assessed with Fisher's exact test at α = 0.05

This avoids false kills from random variation.
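
The kill decision described above can be sketched with a one-sided Fisher's exact test, computed here from the hypergeometric tail using only the standard library (function names are illustrative, not the AgentAssay API):

```python
from math import comb

def fisher_one_sided(orig_pass: int, orig_n: int, mut_pass: int, mut_n: int) -> float:
    """P-value for the mutant's pass count being this low or lower,
    conditioning on the margins of the 2x2 table (hypergeometric tail)."""
    successes = orig_pass + mut_pass
    total = orig_n + mut_n
    denom = comb(total, mut_n)
    # Sum P(X = k) for k = 0..mut_pass; comb() returns 0 for infeasible splits.
    return sum(
        comb(successes, k) * comb(total - successes, mut_n - k)
        for k in range(mut_pass + 1)
    ) / denom

def is_killed(orig_pass: int, orig_n: int, mut_pass: int, mut_n: int,
              alpha: float = 0.05) -> bool:
    """Kill only if the mutant's pass rate is significantly lower than the original's."""
    return fisher_one_sided(orig_pass, orig_n, mut_pass, mut_n) < alpha
```

With 30 trials per side, a drop from 28/30 to 27/30 is within random variation and the mutant survives, while a drop to 10/30 is killed.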

Example: Improving a Test Suite

Initial run:

```shell
agentassay mutate -c agent.yaml -s suite.yaml
```

Output:

```
Mutation Score: 45% (9/20 killed)

SURVIVING MUTANTS:
  - Tool Removal (search): No test requires search tool
  - Tool Removal (rank): No test requires rank tool
  - Context Truncation: Tests use short contexts only
  ...
```

Fix: Add scenarios that require search and rank, add long-context scenario.

Re-run:

```shell
agentassay mutate -c agent.yaml -s suite-v2.yaml
```

Output:

```
Mutation Score: 85% (17/20 killed)

SURVIVING MUTANTS:
  - Context Permutation: Tests not sensitive to context order
  ...
```

Much better!

Cost Considerations

Mutation testing is expensive — it runs N mutants × M trials per mutant. Use adaptive budgeting:

```python
from agentassay.efficiency import AdaptiveBudgetOptimizer

optimizer = AdaptiveBudgetOptimizer()

# Calibrate once on the original agent
estimate = optimizer.calibrate(calibration_traces)

# Use the recommended N for all mutants
runner = MutationRunner(
    agent_callable=my_agent,
    operators=DEFAULT_OPERATORS,
    trials_per_mutant=estimate.recommended_n,  # Adaptive!
)
```

Savings: 40-60% compared to fixed-N mutation testing.


Next Steps


Part of Qualixar | Author: Varun Pratap Bhardwaj
