# Statistical Methods
AgentAssay uses rigorous statistical methods to deliver three-valued verdicts (PASS/FAIL/INCONCLUSIVE) with confidence intervals.
## Three-Valued Verdicts

Unlike traditional binary pass/fail testing, AgentAssay returns one of three verdicts:
| Verdict | Meaning |
|---|---|
| PASS | The agent meets the quality threshold with statistical confidence |
| FAIL | The agent fails to meet the threshold with statistical confidence |
| INCONCLUSIVE | Not enough evidence to determine PASS or FAIL |
## Why Not a Binary Verdict?

AI agents are non-deterministic: running the same input twice can produce different outputs. A single pass/fail verdict ignores this fundamental uncertainty.
Example: Your agent passes 18 out of 20 trials (90% pass rate). Is that:
- PASS if threshold is 80%?
- FAIL if threshold is 95%?
- INCONCLUSIVE if the confidence interval is [74%, 97%]?
The answer depends on the confidence interval, not just the point estimate.
## Wilson Score Intervals

AgentAssay uses Wilson score intervals for proportions (pass rates). Wilson intervals have better coverage properties than the normal approximation, especially at small sample sizes.

Example: 20 trials, 18 passed:
- Point estimate: 90%
- 95% Wilson CI: [69.6%, 97.2%]
If your threshold is 80%:
- Lower bound (69.6%) < threshold → INCONCLUSIVE
If your threshold is 70%:
- Lower bound (69.6%) sits just below the threshold, while the point estimate (90%) and upper bound (97.2%) are well above it → PASS (borderline)
The verdict function uses the confidence interval to decide.
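The decision rule can be sketched in a few lines (an illustrative sketch, not AgentAssay's internal implementation; `wilson_ci` and `verdict` are hypothetical names, and the exact lower-bound digit depends on rounding conventions):

```python
from math import sqrt

def wilson_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate of passes/n."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def verdict(passes: int, n: int, threshold: float) -> str:
    lo, hi = wilson_ci(passes, n)
    if lo >= threshold:
        return "PASS"          # even the pessimistic estimate clears the bar
    if hi < threshold:
        return "FAIL"          # even the optimistic estimate falls short
    return "INCONCLUSIVE"      # the interval straddles the threshold

lo, hi = wilson_ci(18, 20)
print(f"18/20 -> 95% CI [{lo:.1%}, {hi:.1%}]")
print(verdict(18, 20, 0.80))   # prints "INCONCLUSIVE"
```

The asymmetry is deliberate: PASS and FAIL each require the whole interval to sit on one side of the threshold, so the verdict never overstates what the data supports.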
## Fisher's Exact Test

Used to compare two pass rates (baseline vs. current).

- Null hypothesis: the two pass rates are the same.
- Test statistic: exact hypergeometric distribution (not an approximation).
- When to use: comparing baseline and current agent versions.
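Internally, a two-sided Fisher p-value can be computed by summing hypergeometric probabilities no larger than that of the observed table. A stdlib-only sketch (the helper name `fisher_exact_p` is illustrative, not part of AgentAssay's API):

```python
from math import comb

def fisher_exact_p(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, b = baseline passes/fails; c, d = current passes/fails.
    Sums the probabilities of all tables with the same margins that
    are no more likely than the observed one.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c          # baseline row total, total passes
    denom = comb(n, row1)

    def p_table(x: int) -> float:
        # Probability of x baseline passes, given fixed margins
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_obs = p_table(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # Tiny tolerance guards against float round-off at exact ties
    return sum(p for p in (p_table(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Same counts as the library example: 28/30 baseline vs 20/30 current
p = fisher_exact_p(28, 2, 20, 10)
print(f"p = {p:.4f}")  # small p suggests the pass rates genuinely differ
```

The summation over "tables at least as extreme" is what makes the test exact rather than an approximation.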
Example:

```python
from agentassay.statistics import fishers_exact_test

baseline_results = [True] * 28 + [False] * 2   # 93% (28/30)
current_results = [True] * 20 + [False] * 10   # 67% (20/30)

p_value, effect_size = fishers_exact_test(baseline_results, current_results)
if p_value < 0.05:
    print("REGRESSION DETECTED")
else:
    print("NO REGRESSION")
```

## Sequential Probability Ratio Test (SPRT)

Allows early stopping when the evidence is strong enough.
How it works:
- Run trials one at a time
- After each trial, compute the likelihood ratio
- If ratio exceeds upper threshold → PASS, stop
- If ratio falls below lower threshold → FAIL, stop
- Otherwise, run another trial
Savings: SPRT typically requires 30-50% fewer samples than fixed-sample testing while maintaining the same error guarantees.
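The loop described above can be sketched from scratch with Wald's classic thresholds (an illustrative sketch, not AgentAssay's `SPRT` class; `make_sprt` is a hypothetical name):

```python
from math import log

def make_sprt(alpha=0.05, beta=0.10, p0=0.70, p1=0.85):
    """Bernoulli SPRT: test H1 (pass rate >= p1) against H0 (pass rate <= p0)."""
    upper = log((1 - beta) / alpha)   # crossing above accepts H1 -> PASS
    lower = log(beta / (1 - alpha))   # crossing below accepts H0 -> FAIL
    llr = 0.0                         # running log-likelihood ratio

    def update(passed: bool) -> str:
        nonlocal llr
        llr += log(p1 / p0) if passed else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "PASS"
        if llr <= lower:
            return "FAIL"
        return "INCONCLUSIVE"

    return update

update = make_sprt()
for i, result in enumerate([True] * 20, start=1):
    decision = update(result)
    if decision != "INCONCLUSIVE":
        print(f"{decision} after {i} trials")  # prints "PASS after 15 trials"
        break
```

With these parameters, a perfect streak decides PASS after 15 trials, well below a typical fixed sample size for the same error rates.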
Example:

```python
from agentassay.statistics import SPRT

sprt = SPRT(alpha=0.05, beta=0.10, p0=0.70, p1=0.85)

for result in trial_results:
    decision = sprt.update(result)
    if decision == "PASS":
        print(f"PASS after {sprt.num_samples} trials")
        break
    elif decision == "FAIL":
        print(f"FAIL after {sprt.num_samples} trials")
        break
    # else: INCONCLUSIVE, continue
```

## Cohen's h (Effect Size)

Measures how different two proportions are, in standardized units:
h = 2 * (arcsin(sqrt(p1)) - arcsin(sqrt(p2)))
| h | Interpretation |
|---|---|
| < 0.2 | Small effect |
| 0.2 - 0.5 | Medium effect |
| > 0.5 | Large effect |
Use case: Decide if a detected regression is practically significant (not just statistically significant).
Example: p1 = 0.95, p2 = 0.92 → h ≈ 0.12 (small effect, may not matter in practice)
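The formula above translates directly into code (a minimal sketch; `cohens_h` is an illustrative name, not AgentAssay's API):

```python
from math import asin, sqrt

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: standardized effect size for two proportions."""
    return 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))

h = cohens_h(0.95, 0.92)
print(f"h = {h:.2f}")   # prints "h = 0.12" -- a small effect
```

The arcsine transform stabilizes the variance, so the same h means roughly the same detectability whether the rates sit near 50% or near 95%.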
## Power Analysis

Statistical power is the probability of detecting a true regression when one exists.
Power = 1 - β
Where β is the Type II error rate (false negative).
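For a single pass rate tested against a lower alternative, the standard normal-approximation sample size can be sketched as follows (illustrative only; `sample_size` is a hypothetical helper, and AgentAssay's exact formula may differ):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(p0: float, p1: float, alpha: float = 0.05,
                power: float = 0.80) -> int:
    """Smallest n that detects a drop from p0 to p1 (one-sided test)."""
    z_a = NormalDist().inv_cdf(1 - alpha)   # critical value under H0
    z_b = NormalDist().inv_cdf(power)       # quantile giving the target power
    num = z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))
    return ceil((num / (p1 - p0)) ** 2)

print(sample_size(0.90, 0.80))   # prints 69: trials needed for a 10-point drop
```

Note how quickly the requirement grows as the detectable drop shrinks: halving the delta roughly quadruples the sample size.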
AgentAssay computes the minimum sample size needed to achieve a target power:

```python
from agentassay.statistics import compute_sample_size

n = compute_sample_size(
    baseline_rate=0.90,
    delta=0.10,   # Detect 10% drop
    alpha=0.05,
    power=0.80,
)
print(f"Need {n} samples to detect 10% drop with 80% power")
```

## Adaptive Budget Optimization

Combines variance estimation and sample size calculation:
```python
from agentassay.efficiency import AdaptiveBudgetOptimizer

optimizer = AdaptiveBudgetOptimizer(alpha=0.05, beta=0.10)

# Run small calibration set
estimate = optimizer.calibrate(calibration_traces)
print(f"Recommended N: {estimate.recommended_n}")
print(f"Expected power: {estimate.power:.2f}")
```

## Error Types

| Error | Description | Controlled By |
|---|---|---|
| Type I (α) | False positive: detecting regression when there isn't one | Significance level (default: 0.05) |
| Type II (β) | False negative: missing a real regression | Power = 1 - β (default: 0.80, so β = 0.20) |
AgentAssay lets you tune both:
```python
AssayConfig(
    significance_level=0.01,  # Stricter (fewer false positives)
    power=0.90,               # Higher power (fewer false negatives)
)
```

## Recommended Settings

| Scenario | α | Power | Why |
|---|---|---|---|
| CI/CD gate (production) | 0.01 | 0.90 | Very strict, don't deploy broken code |
| Nightly regression | 0.05 | 0.80 | Balanced |
| Exploratory testing | 0.10 | 0.70 | Fast feedback, less rigor |
## Statistical Guarantees

AgentAssay provides formal guarantees:
- If you set α = 0.05, then on an agent that has not actually regressed, at most 5% of runs will falsely report a regression (in the long run).
- If you set power = 0.80, you will detect at least 80% of real regressions (those at or above the minimum effect size).
- Confidence intervals have the stated coverage (e.g., 95% CIs contain the true value 95% of the time).
These guarantees hold regardless of agent complexity — as long as the trials are independent.
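The coverage guarantee can be checked empirically. A quick Monte Carlo sketch (seeded for reproducibility; `wilson_ci` is an illustrative helper, not AgentAssay's internal code) simulates many experiments and counts how often a 95% Wilson interval contains the true rate:

```python
import random
from math import sqrt

def wilson_ci(passes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a pass rate of passes/n."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

rng = random.Random(0)
true_rate, n_trials, n_experiments = 0.85, 30, 4000
covered = 0
for _ in range(n_experiments):
    passes = sum(rng.random() < true_rate for _ in range(n_trials))
    lo, hi = wilson_ci(passes, n_trials)
    covered += lo <= true_rate <= hi
print(f"Empirical coverage: {covered / n_experiments:.1%}")  # close to 95%
```

Because the trials inside each simulated experiment are independent, the empirical coverage lands near the nominal 95%, which is exactly the independence condition the guarantee relies on.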
## Next Steps

- Token-Efficient Testing — See how these methods combine for cost savings
- Coverage Model — Understand the five coverage dimensions
- CI/CD Integration — Use deployment gates with statistical verdicts
Part of Qualixar | Author: Varun Pratap Bhardwaj