
llm-eval-gate

Evidence-based deployment quality gate for LLM systems. Evaluates telemetry against latency, error, and cost policies to produce auditable go/no-go release decisions.

Why this exists

Production LLM deployments require operational decisions grounded in observable evidence, not vibes. This tool reads structured telemetry (JSONL), applies configurable threshold policies, and returns a binary decision with per-check evidence. Designed for CI/CD pipelines where a release must be blocked or approved based on latency, reliability, and cost metrics.

What this repository demonstrates

A minimal, testable quality gate that:

  • Reads JSONL telemetry produced by llmscope or compatible instrumentation
  • Computes three aggregate metrics: p95 latency, error rate, average cost per request
  • Evaluates against configurable thresholds (global or per-route)
  • Returns structured pass/fail with evidence for each check
  • Exits with code 0 (go), 1 (no-go), or 2 (config error) for pipeline integration
  • Supports both CLI and library API usage

This is not a semantic evaluation framework, observability backend, or dashboard. It is a policy evaluator that turns telemetry into a release decision.
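The three aggregate metrics can be sketched in a few lines of plain Python. This is an illustrative sketch, not the gate's actual implementation, and the telemetry field names (`latency_ms`, `error`, `cost_usd`) are assumptions about the JSONL schema:

```python
import math

def p95(values):
    """95th percentile via nearest-rank on sorted values."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]

def aggregate(records):
    """Compute the three gate metrics from a list of telemetry dicts.
    Field names are hypothetical; adjust to your instrumentation."""
    latencies = [r["latency_ms"] for r in records]
    errors = sum(1 for r in records if r.get("error"))
    costs = [r["cost_usd"] for r in records]
    return {
        "p95_latency_ms": p95(latencies),
        "error_rate": errors / len(records),
        "avg_cost_usd": sum(costs) / len(costs),
    }
```

Each metric is a pure function of the telemetry file, which is what makes the resulting decision auditable: the same artifact always yields the same numbers.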

Architecture

llmscope (instrumentation) → telemetry.jsonl → llm-eval-gate (policy) → go / no-go

The gate operates post-hoc on telemetry artifacts. It does not instrument requests, route traffic, or collect metrics. It reads evidence and applies policy.

Core components

QualityGate

Loads telemetry, computes metrics, evaluates thresholds. Returns GateResult with structured checks and failures.

Policy loader

Parses YAML policy files with global defaults and per-route overrides. Routes inherit from global — only override what you need.

CLI

Wraps the library API for CI/CD integration. Supports text and JSON output, route filtering, and multi-route evaluation.

Repository structure

src/llm_eval_gate/
  __init__.py         # Public API exports
  __main__.py         # CLI entry point
  gate.py             # QualityGate, Thresholds, evaluation logic
  policy_loader.py    # YAML policy parsing and threshold resolution

tests/
  test_gate.py        # Gate evaluation tests
  test_policy_cli.py  # Policy loading and CLI integration tests

gate.yaml             # Example policy configuration
pyproject.toml        # Package metadata and dependencies

Local setup

python3 -m venv .venv && source .venv/bin/activate

# With llmscope (for envelope type validation)
pip3 install -e ../llmscope -e ".[dev]"

# Without llmscope (gate works with raw JSONL dicts)
pip3 install -e ".[dev]"

Dependencies: PyYAML (required), llmscope (optional), pytest (dev).

CLI usage

# Evaluate with default thresholds
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl

# Evaluate with policy YAML
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl \
    --policy gate.yaml

# Evaluate a specific route
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl \
    --policy gate.yaml \
    --route /answer-routed

# JSON output for CI/CD pipelines
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl \
    --policy gate.yaml \
    --format json

Exit codes: 0 = go, 1 = no-go, 2 = configuration error.
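A pipeline step can branch on these codes directly. A minimal sketch of a wrapper (the subprocess invocation mirrors the CLI usage above and assumes the package is installed; the helper names are illustrative):

```python
import subprocess

EXIT_DECISIONS = {0: "go", 1: "no-go", 2: "config-error"}

def decision_from_exit(code: int) -> str:
    # Any unexpected exit code is treated as a configuration problem.
    return EXIT_DECISIONS.get(code, "config-error")

def run_gate(telemetry: str, policy: str) -> str:
    """Invoke the gate CLI and translate its exit code to a decision."""
    proc = subprocess.run(
        ["python3", "-m", "llm_eval_gate", "evaluate",
         "--telemetry", telemetry, "--policy", policy],
    )
    return decision_from_exit(proc.returncode)
```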

Policy YAML

Thresholds are configurable via YAML. Per-route sections inherit from global — only override what you need.

global:
  max_p95_latency_ms: 2000.0
  max_error_rate: 0.05
  max_avg_cost_usd: 0.01
  min_sample_size: 10

routes:
  /answer-routed:
    max_p95_latency_ms: 3000.0
    max_avg_cost_usd: 0.02
  /conversation-turn:
    max_p95_latency_ms: 5000.0

When routes are defined, the CLI evaluates each route with its own thresholds plus a global evaluation. Any failure blocks the release.
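The inheritance rule amounts to a shallow merge of the global defaults with the route's overrides. A sketch of the idea (not the actual policy_loader implementation):

```python
# Global defaults taken from the example policy above.
GLOBAL = {
    "max_p95_latency_ms": 2000.0,
    "max_error_rate": 0.05,
    "max_avg_cost_usd": 0.01,
    "min_sample_size": 10,
}

def resolve_thresholds(route_overrides: dict) -> dict:
    """Route sections start from the global defaults and override
    only the keys they declare."""
    return {**GLOBAL, **route_overrides}
```

Resolving /answer-routed from the example YAML keeps max_error_rate (0.05) and min_sample_size (10) from global while overriding latency and cost.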

Library API

from llm_eval_gate import QualityGate, Thresholds

gate = QualityGate(
    telemetry_path="artifacts/logs/telemetry.jsonl",
    thresholds=Thresholds(
        max_p95_latency_ms=2000.0,
        max_error_rate=0.05,
        max_avg_cost_usd=0.01,
    ),
)
result = gate.evaluate()
print(result.summary())

Output:

Quality Gate: PASS (150 samples)
  [ok] p95_latency: 834.5000 ms (threshold: 2000.0000)
  [ok] error_rate: 0.0133 ratio (threshold: 0.0500)
  [ok] avg_cost: 0.0042 USD (threshold: 0.0100)

Or load thresholds from YAML:

from llm_eval_gate import load_gate_policy, QualityGate

policy = load_gate_policy("gate.yaml")
for route in policy.routes:
    gate = QualityGate(
        telemetry_path="telemetry.jsonl",
        thresholds=policy.get_thresholds(route),
        route_filter=route,
    )
    result = gate.evaluate()
    print(f"{route}: {'go' if result.passed else 'no-go'}")

Validation

python3 -m pytest tests/ -q

Tests cover:

  • Gate evaluation (pass/fail scenarios for latency, error rate, cost)
  • Policy loading and threshold inheritance
  • CLI integration (exit codes, output formats, route filtering)
  • Multi-route evaluation
  • Edge cases (missing files, insufficient samples)

CI/CD integration

# GitHub Actions example
- name: Quality gate
  run: |
    python3 -m llm_eval_gate evaluate \
      --telemetry artifacts/logs/telemetry.jsonl \
      --policy gate.yaml \
      --format json

The gate exits with code 1 on failure, blocking the pipeline.

What this evaluates

Check         Metric                          Why
p95_latency   95th percentile latency in ms   User experience degradation
error_rate    Error count / total requests    System reliability
avg_cost      Mean cost per request in USD    FinOps budget control

These three metrics answer the questions that dominate LLM deployment decisions: is it fast enough, reliable enough, and cheap enough?

Current limitations

  • No baseline comparison: Evaluates absolute thresholds only. Cannot detect relative regression against a previous build.
  • No policy composition: Single pass/fail decision. No support for warnings, soft failures, or composite rules.
  • No cohort segmentation: Evaluates aggregates. Cannot segment by model, provider, tenant, or experiment arm.
  • No statistical testing: Uses simple percentile and average calculations. No confidence intervals or significance tests.
  • No real-time evaluation: Operates on static telemetry files. Not designed for streaming or live traffic.

These are intentional scope boundaries for v0.1.0. See roadmap for planned expansions.

Near-term roadmap

v0.2.0 — Baseline comparison

Compare candidate telemetry against a baseline run. Block if error rate increases by >N% even if still below absolute threshold. Detect regressions that aggregated metrics miss.
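One hypothetical shape for that check, where both the function name and the 10% default tolerance are assumptions about the planned feature:

```python
def regressed(baseline_error_rate: float, candidate_error_rate: float,
              max_relative_increase: float = 0.10) -> bool:
    """Block when the candidate's error rate grows by more than
    max_relative_increase over the baseline, even if both remain
    below the absolute threshold."""
    if baseline_error_rate == 0:
        # Any new errors against a clean baseline count as a regression.
        return candidate_error_rate > 0
    increase = (candidate_error_rate - baseline_error_rate) / baseline_error_rate
    return increase > max_relative_increase
```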

v0.3.0 — Policy composition

Composite policies: "global pass if all routes pass, but hard-block if checkout error_rate > X, warn if avg_cost increases >N% vs baseline." Moves from threshold checking to release governance.

v0.4.0 — Cohort segmentation

Segment evaluation by model, provider, tenant, experiment arm. Prevent approval when the global average passes but a critical segment fails.
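A sketch of how segmentation could surface that case, assuming telemetry records carry a hypothetical segment field such as "model":

```python
from collections import defaultdict

def error_rate_by_segment(records, key="model"):
    """Group records by a segment key and compute per-segment error
    rates, so a failing segment is visible even when the global
    average would pass the gate."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(bool(r.get("error")))
    return {seg: sum(errs) / len(errs) for seg, errs in buckets.items()}
```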

Related projects

  • llmscope — Observable cost control for production LLMs. The instrumentation layer that produces the telemetry this gate evaluates.
  • caseledger — Policy-bounded decision traces for AI-assisted financial operations. Demonstrates llmscope + llm-eval-gate in a regulated finance workflow.
