
Statistical Methods

Varun Pratap Bhardwaj edited this page Mar 6, 2026 · 1 revision


AgentAssay uses rigorous statistical methods to deliver three-valued verdicts (PASS/FAIL/INCONCLUSIVE) with confidence intervals.

Three-Valued Verdicts

Unlike traditional binary pass/fail testing, AgentAssay returns one of three verdicts:

| Verdict | Meaning |
|---|---|
| PASS | The agent meets the quality threshold with statistical confidence |
| FAIL | The agent fails to meet the threshold with statistical confidence |
| INCONCLUSIVE | Not enough evidence to determine PASS or FAIL |

Why Three Values?

AI agents are non-deterministic. Running the same input twice can produce different outputs. A single pass/fail verdict ignores this fundamental uncertainty.

Example: Your agent passes 18 out of 20 trials (90% pass rate). Is that:

  • PASS if threshold is 80%?
  • FAIL if threshold is 95%?
  • INCONCLUSIVE if the confidence interval is [69.9%, 97.2%]?

The answer depends on the confidence interval, not just the point estimate.

Confidence Intervals

AgentAssay uses Wilson score intervals for proportions (pass rates). Wilson intervals have better coverage properties than normal approximations, especially for small sample sizes.
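For reference, the interval can be computed directly from the Wilson formula using only the standard library (a sketch; AgentAssay's own helper may differ in details):

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

low, high = wilson_ci(18, 20)
print(f"[{low:.1%}, {high:.1%}]")  # [69.9%, 97.2%]
```

Note how far the lower bound sits below the 90% point estimate at n = 20 — this is exactly the uncertainty a single pass/fail verdict would hide.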

Example

20 trials, 18 passed:

  • Point estimate: 90%
  • 95% Wilson CI: [69.9%, 97.2%]

If your threshold is 80%:

  • Lower bound (69.9%) < 80% < upper bound (97.2%) → the interval straddles the threshold → INCONCLUSIVE

If your threshold is 65%:

  • Lower bound (69.9%) ≥ 65% → even the worst-case estimate clears the threshold → PASS

The verdict function uses the confidence interval to decide.
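A minimal sketch of such a decision rule (illustrative names, not necessarily AgentAssay's actual API):

```python
def verdict(ci_low: float, ci_high: float, threshold: float) -> str:
    """Three-valued verdict from a confidence interval and a quality threshold."""
    if ci_low >= threshold:
        return "PASS"          # even the worst-case estimate clears the bar
    if ci_high < threshold:
        return "FAIL"          # even the best-case estimate falls short
    return "INCONCLUSIVE"      # the interval straddles the threshold

print(verdict(0.699, 0.972, 0.80))  # INCONCLUSIVE
print(verdict(0.699, 0.972, 0.65))  # PASS
print(verdict(0.699, 0.972, 0.98))  # FAIL
```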

Hypothesis Tests

Fisher's Exact Test (Default for Regression)

Used to compare two pass rates (baseline vs. current).

Null hypothesis: The two pass rates are the same.

Test statistic: Exact hypergeometric distribution (not an approximation).

When to use: Comparing baseline and current agent versions.

Example:

```python
from agentassay.statistics import fishers_exact_test

baseline_results = [True] * 28 + [False] * 2   # 93% (28/30)
current_results = [True] * 20 + [False] * 10   # 67% (20/30)

p_value, effect_size = fishers_exact_test(baseline_results, current_results)

if p_value < 0.05:
    print("REGRESSION DETECTED")
else:
    print("NO REGRESSION")
```
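For readers who want to check the arithmetic without AgentAssay installed, the two-sided test can be reproduced from first principles by summing the hypergeometric probabilities of all tables at least as extreme as the observed one (a standard-library sketch, not the library's internals):

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, row1)

    def pmf(k: int) -> float:
        # P(first row contains k of the col1 successes), hypergeometric
        return comb(col1, k) * comb(n - col1, row1 - k) / total

    p_obs = pmf(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # Sum every table whose probability is <= the observed table's
    return sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-9))

p = fisher_exact_two_sided(28, 2, 20, 10)
print(f"p = {p:.4f}")  # ~0.021 — the drop from 93% to 67% is significant at 0.05
```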

Sequential Probability Ratio Test (SPRT)

Allows early stopping when the evidence is strong enough.

How it works:

  1. Run trials one at a time
  2. After each trial, compute the likelihood ratio
  3. If ratio exceeds upper threshold → PASS, stop
  4. If ratio falls below lower threshold → FAIL, stop
  5. Otherwise, run another trial

Savings: SPRT typically requires 30-50% fewer samples than fixed-sample testing while maintaining the same error guarantees.

Example:

```python
from agentassay.statistics import SPRT

sprt = SPRT(alpha=0.05, beta=0.10, p0=0.70, p1=0.85)

for result in trial_results:
    decision = sprt.update(result)
    if decision == "PASS":
        print(f"PASS after {sprt.num_samples} trials")
        break
    elif decision == "FAIL":
        print(f"FAIL after {sprt.num_samples} trials")
        break
    # else: INCONCLUSIVE, continue
```
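The algorithm above is Wald's classic SPRT, which fits in a few lines (an illustrative implementation, not AgentAssay's `SPRT` class):

```python
from math import log

class WaldSPRT:
    """Minimal SPRT for a Bernoulli pass rate: H0 p = p0 vs H1 p = p1 > p0."""
    def __init__(self, alpha: float, beta: float, p0: float, p1: float):
        self.upper = log((1 - beta) / alpha)     # cross this -> accept H1 (PASS)
        self.lower = log(beta / (1 - alpha))     # cross this -> accept H0 (FAIL)
        self.llr_pass = log(p1 / p0)             # log-likelihood-ratio step for a pass
        self.llr_fail = log((1 - p1) / (1 - p0)) # step for a failure (negative)
        self.llr = 0.0
        self.num_samples = 0

    def update(self, passed: bool) -> str:
        self.num_samples += 1
        self.llr += self.llr_pass if passed else self.llr_fail
        if self.llr >= self.upper:
            return "PASS"
        if self.llr <= self.lower:
            return "FAIL"
        return "INCONCLUSIVE"

sprt = WaldSPRT(alpha=0.05, beta=0.10, p0=0.70, p1=0.85)
for result in [True] * 20:           # a streak of passing trials
    if sprt.update(result) == "PASS":
        print(f"PASS after {sprt.num_samples} trials")  # stops at 15, not 20
        break
```

With these parameters an unbroken run of passes crosses the upper boundary after 15 trials, illustrating the early-stopping savings.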

Cohen's h (Effect Size)

Measures how different two proportions are, in standardized units.

h = 2 * (arcsin(sqrt(p1)) - arcsin(sqrt(p2)))

| \|h\| | Interpretation |
|---|---|
| < 0.2 | Negligible effect |
| 0.2 – 0.5 | Small effect |
| 0.5 – 0.8 | Medium effect |
| ≥ 0.8 | Large effect |

Use case: Decide if a detected regression is practically significant (not just statistically significant).

Example: p1 = 0.95, p2 = 0.92 → h ≈ 0.12 (below 0.2 — likely too small to matter in practice)
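The formula is easy to check directly (stdlib sketch):

```python
from math import asin, sqrt

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: standardized difference between two proportions."""
    return 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))

print(f"{cohens_h(0.95, 0.92):.2f}")  # 0.12
```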

Power Analysis

Statistical power is the probability of detecting a true regression when one exists.

Power = 1 - β

Where β is the Type II error rate (false negative).

AgentAssay computes the minimum sample size needed to achieve a target power:

```python
from agentassay.statistics import compute_sample_size

n = compute_sample_size(
    baseline_rate=0.90,
    delta=0.10,          # Detect a 10-percentage-point drop
    alpha=0.05,
    power=0.80,
)

print(f"Need {n} samples to detect a 10-point drop with 80% power")
```
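Under a standard normal approximation for a one-proportion test, the same calculation looks roughly like this (a sketch; AgentAssay's `compute_sample_size` may use a different method):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(p0: float, p1: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Minimum n to distinguish pass rate p1 from p0 (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)           # target power
    num = z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))
    return ceil((num / (p1 - p0)) ** 2)

print(sample_size(0.90, 0.80))  # 86 trials
```

Note how quickly the required n grows as the detectable drop shrinks: halving the delta roughly quadruples the sample size.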

Adaptive Budget Optimization

Combines variance estimation + sample size calculation:

```python
from agentassay.efficiency import AdaptiveBudgetOptimizer

optimizer = AdaptiveBudgetOptimizer(alpha=0.05, beta=0.10)

# Run small calibration set
estimate = optimizer.calibrate(calibration_traces)

print(f"Recommended N: {estimate.recommended_n}")
print(f"Expected power: {estimate.power:.2f}")
```

Error Types

| Error | Description | Controlled by |
|---|---|---|
| Type I (α) | False positive: flagging a regression when there isn't one | Significance level (default: 0.05) |
| Type II (β) | False negative: missing a real regression | Power = 1 − β (default power: 0.80, so β = 0.20) |

AgentAssay lets you tune both:

```python
AssayConfig(
    significance_level=0.01,  # Stricter (fewer false positives)
    power=0.90,               # Higher power (fewer false negatives)
)
```

Choosing Parameters

| Scenario | α | Power | Why |
|---|---|---|---|
| CI/CD gate (production) | 0.01 | 0.90 | Very strict; don't deploy broken code |
| Nightly regression | 0.05 | 0.80 | Balanced |
| Exploratory testing | 0.10 | 0.70 | Fast feedback, less rigor |

Statistical Guarantees

AgentAssay provides formal guarantees:

  1. If you set α = 0.05, then at most 5% of runs on an unchanged agent will falsely report a regression (in the long run).
  2. If you set power = 0.80, you will detect at least 80% of real regressions at or above the minimum detectable effect size.
  3. Confidence intervals have the stated coverage (e.g., 95% CIs contain the true value 95% of the time).

These guarantees hold regardless of agent complexity — as long as the trials are independent.


Part of Qualixar | Author: Varun Pratap Bhardwaj
