# Statistical Methods
AgentAssay uses rigorous statistical methods to deliver three-valued verdicts (PASS/FAIL/INCONCLUSIVE) with confidence intervals.
## Three-Valued Verdicts

Unlike traditional binary pass/fail testing, AgentAssay returns one of three verdicts:
| Verdict | Meaning |
|---|---|
| PASS | The agent meets the quality threshold with statistical confidence |
| FAIL | The agent fails to meet the threshold with statistical confidence |
| INCONCLUSIVE | Not enough evidence to determine PASS or FAIL |
## Why Not a Binary Verdict?

AI agents are non-deterministic: running the same input twice can produce different outputs. A single pass/fail verdict ignores this fundamental uncertainty.
Example: Your agent passes 18 out of 20 trials (90% pass rate). Is that:
- PASS if threshold is 80%?
- FAIL if threshold is 95%?
- INCONCLUSIVE if the confidence interval is [74%, 97%]?
The answer depends on the confidence interval, not just the point estimate.
## Wilson Score Intervals

AgentAssay uses Wilson score intervals for proportions (pass rates). Wilson intervals have better coverage properties than the normal approximation, especially at small sample sizes.

Example: 20 trials, 18 passed:
- Point estimate: 90%
- 95% Wilson CI: [69.6%, 97.2%]
If your threshold is 80%:
- Lower bound (69.6%) < threshold → INCONCLUSIVE
If your threshold is 70%:
- Lower bound (69.6%) sits just below the threshold, while the point estimate (90%) and upper bound (97.2%) are well above it → PASS (borderline)
The verdict function uses the confidence interval to decide.
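The decision rule can be sketched in a few lines (an illustrative sketch, not AgentAssay's internal implementation; `wilson_ci` and `verdict` are hypothetical names, and the exact lower-bound digit depends on rounding conventions):

```python
from math import sqrt

def wilson_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate of passes/n."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def verdict(passes: int, n: int, threshold: float) -> str:
    lo, hi = wilson_ci(passes, n)
    if lo >= threshold:
        return "PASS"          # even the pessimistic estimate clears the bar
    if hi < threshold:
        return "FAIL"          # even the optimistic estimate falls short
    return "INCONCLUSIVE"      # the interval straddles the threshold

lo, hi = wilson_ci(18, 20)
print(f"18/20 -> 95% CI [{lo:.1%}, {hi:.1%}]")
print(verdict(18, 20, 0.80))   # prints "INCONCLUSIVE"
```

The asymmetry is deliberate: PASS and FAIL each require the whole interval to sit on one side of the threshold, so the verdict never overstates what the data supports.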
## Fisher's Exact Test

Used to compare two pass rates (baseline vs. current).

- Null hypothesis: the two pass rates are the same.
- Test statistic: exact hypergeometric distribution (not an approximation).
- When to use: comparing baseline and current agent versions.
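Internally, a two-sided Fisher p-value can be computed by summing hypergeometric probabilities no larger than that of the observed table. A stdlib-only sketch (the helper name `fisher_exact_p` is illustrative, not part of AgentAssay's API):

```python
from math import comb

def fisher_exact_p(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, b = baseline passes/fails; c, d = current passes/fails.
    Sums the probabilities of all tables with the same margins that
    are no more likely than the observed one.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c          # baseline row total, total passes
    denom = comb(n, row1)

    def p_table(x: int) -> float:
        # Probability of x baseline passes, given fixed margins
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_obs = p_table(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # Tiny tolerance guards against float round-off at exact ties
    return sum(p for p in (p_table(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Same counts as the library example: 28/30 baseline vs 20/30 current
p = fisher_exact_p(28, 2, 20, 10)
print(f"p = {p:.4f}")  # small p suggests the pass rates genuinely differ
```

The summation over "tables at least as extreme" is what makes the test exact rather than an approximation.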
Example:

```python
from agentassay.statistics import fishers_exact_test

baseline_results = [True] * 28 + [False] * 2   # 93% (28/30)
current_results = [True] * 20 + [False] * 10   # 67% (20/30)

p_value, effect_size = fishers_exact_test(baseline_results, current_results)
if p_value < 0.05:
    print("REGRESSION DETECTED")
else:
    print("NO REGRESSION")
```

## Sequential Probability Ratio Test (SPRT)

Allows early stopping when the evidence is strong enough.
How it works:
- Run trials one at a time
- After each trial, compute the likelihood ratio
- If ratio exceeds upper threshold → PASS, stop
- If ratio falls below lower threshold → FAIL, stop
- Otherwise, run another trial
Savings: SPRT typically requires 30-50% fewer samples than fixed-sample testing while maintaining the same error guarantees.
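The loop described above can be sketched from scratch with Wald's classic thresholds (an illustrative sketch, not AgentAssay's `SPRT` class; `make_sprt` is a hypothetical name):

```python
from math import log

def make_sprt(alpha=0.05, beta=0.10, p0=0.70, p1=0.85):
    """Bernoulli SPRT: test H1 (pass rate >= p1) against H0 (pass rate <= p0)."""
    upper = log((1 - beta) / alpha)   # crossing above accepts H1 -> PASS
    lower = log(beta / (1 - alpha))   # crossing below accepts H0 -> FAIL
    llr = 0.0                         # running log-likelihood ratio

    def update(passed: bool) -> str:
        nonlocal llr
        llr += log(p1 / p0) if passed else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "PASS"
        if llr <= lower:
            return "FAIL"
        return "INCONCLUSIVE"

    return update

update = make_sprt()
for i, result in enumerate([True] * 20, start=1):
    decision = update(result)
    if decision != "INCONCLUSIVE":
        print(f"{decision} after {i} trials")  # prints "PASS after 15 trials"
        break
```

With these parameters, a perfect streak decides PASS after 15 trials, well below a typical fixed sample size for the same error rates.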
Example:

```python
from agentassay.statistics import SPRT

sprt = SPRT(alpha=0.05, beta=0.10, p0=0.70, p1=0.85)

for result in trial_results:
    decision = sprt.update(result)
    if decision == "PASS":
        print(f"PASS after {sprt.num_samples} trials")
        break
    elif decision == "FAIL":
        print(f"FAIL after {sprt.num_samples} trials")
        break
    # else: INCONCLUSIVE, continue
```

## Cohen's h (Effect Size)

Measures how different two proportions are, in standardized units:
h = 2 * (arcsin(sqrt(p1)) - arcsin(sqrt(p2)))
| h | Interpretation |
|---|---|
| < 0.2 | Small effect |
| 0.2 - 0.5 | Medium effect |
| > 0.5 | Large effect |
Use case: Decide if a detected regression is practically significant (not just statistically significant).
Example: p1 = 0.95, p2 = 0.92 → h ≈ 0.12 (small effect, may not matter in practice)
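The formula above translates directly into code (a minimal sketch; `cohens_h` is an illustrative name, not AgentAssay's API):

```python
from math import asin, sqrt

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: standardized effect size for two proportions."""
    return 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))

h = cohens_h(0.95, 0.92)
print(f"h = {h:.2f}")   # prints "h = 0.12" -- a small effect
```

The arcsine transform stabilizes the variance, so the same h means roughly the same detectability whether the rates sit near 50% or near 95%.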
## Power Analysis

Statistical power is the probability of detecting a true regression when one exists.
Power = 1 - β
Where β is the Type II error rate (false negative).
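For a single pass rate tested against a lower alternative, the standard normal-approximation sample size can be sketched as follows (illustrative only; `sample_size` is a hypothetical helper, and AgentAssay's exact formula may differ):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(p0: float, p1: float, alpha: float = 0.05,
                power: float = 0.80) -> int:
    """Smallest n that detects a drop from p0 to p1 (one-sided test)."""
    z_a = NormalDist().inv_cdf(1 - alpha)   # critical value under H0
    z_b = NormalDist().inv_cdf(power)       # quantile giving the target power
    num = z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))
    return ceil((num / (p1 - p0)) ** 2)

print(sample_size(0.90, 0.80))   # prints 69: trials needed for a 10-point drop
```

Note how quickly the requirement grows as the detectable drop shrinks: halving the delta roughly quadruples the sample size.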
AgentAssay computes the minimum sample size needed to achieve a target power:

```python
from agentassay.statistics import compute_sample_size

n = compute_sample_size(
    baseline_rate=0.90,
    delta=0.10,   # Detect 10% drop
    alpha=0.05,
    power=0.80,
)
print(f"Need {n} samples to detect 10% drop with 80% power")
```

## Adaptive Budget Optimization

Combines variance estimation and sample size calculation:
```python
from agentassay.efficiency import AdaptiveBudgetOptimizer

optimizer = AdaptiveBudgetOptimizer(alpha=0.05, beta=0.10)

# Run small calibration set
estimate = optimizer.calibrate(calibration_traces)
print(f"Recommended N: {estimate.recommended_n}")
print(f"Expected power: {estimate.power:.2f}")
```

## Error Types

| Error | Description | Controlled By |
|---|---|---|
| Type I (α) | False positive: detecting regression when there isn't one | Significance level (default: 0.05) |
| Type II (β) | False negative: missing a real regression | Power = 1 - β (default: 0.80, so β = 0.20) |
AgentAssay lets you tune both:
```python
AssayConfig(
    significance_level=0.01,  # Stricter (fewer false positives)
    power=0.90,               # Higher power (fewer false negatives)
)
```

## Recommended Settings

| Scenario | α | Power | Why |
|---|---|---|---|
| CI/CD gate (production) | 0.01 | 0.90 | Very strict, don't deploy broken code |
| Nightly regression | 0.05 | 0.80 | Balanced |
| Exploratory testing | 0.10 | 0.70 | Fast feedback, less rigor |
## Statistical Guarantees

AgentAssay provides formal guarantees:
- If you set α = 0.05, then on an agent that has not actually regressed, at most 5% of runs will falsely report a regression (in the long run).
- If you set power = 0.80, you will detect at least 80% of real regressions (those at or above the minimum effect size).
- Confidence intervals have the stated coverage (e.g., 95% CIs contain the true value 95% of the time).
These guarantees hold regardless of agent complexity — as long as the trials are independent.
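The coverage guarantee can be checked empirically. A quick Monte Carlo sketch (seeded for reproducibility; `wilson_ci` is an illustrative helper, not AgentAssay's internal code) simulates many experiments and counts how often a 95% Wilson interval contains the true rate:

```python
import random
from math import sqrt

def wilson_ci(passes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a pass rate of passes/n."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

rng = random.Random(0)
true_rate, n_trials, n_experiments = 0.85, 30, 4000
covered = 0
for _ in range(n_experiments):
    passes = sum(rng.random() < true_rate for _ in range(n_trials))
    lo, hi = wilson_ci(passes, n_trials)
    covered += lo <= true_rate <= hi
print(f"Empirical coverage: {covered / n_experiments:.1%}")  # close to 95%
```

Because the trials inside each simulated experiment are independent, the empirical coverage lands near the nominal 95%, which is exactly the independence condition the guarantee relies on.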
## Next Steps

- Token-Efficient Testing — See how these methods combine for cost savings
- Coverage Model — Understand the five coverage dimensions
- CI/CD Integration — Use deployment gates with statistical verdicts
Part of Qualixar | Author: Varun Pratap Bhardwaj