Evidence-based deployment quality gate for LLM systems. Evaluates telemetry against latency, error, and cost policies to produce auditable go/no-go release decisions.
Production LLM deployments require operational decisions grounded in observable evidence, not vibes. This tool reads structured telemetry (JSONL), applies configurable threshold policies, and returns a binary decision with per-check evidence. Designed for CI/CD pipelines where a release must be blocked or approved based on latency, reliability, and cost metrics.
A minimal, testable quality gate that:
- Reads JSONL telemetry produced by llmscope or compatible instrumentation
- Computes three aggregate metrics: p95 latency, error rate, average cost per request
- Evaluates against configurable thresholds (global or per-route)
- Returns structured pass/fail with evidence for each check
- Exits with code 0 (go), 1 (no-go), or 2 (config error) for pipeline integration
- Supports both CLI and library API usage
This is not a semantic evaluation framework, observability backend, or dashboard. It is a policy evaluator that turns telemetry into a release decision.
llmscope (instrumentation) → telemetry.jsonl → llm-eval-gate (policy) → go / no-go
The gate operates post-hoc on telemetry artifacts. It does not instrument requests, route traffic, or collect metrics. It reads evidence and applies policy.
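Each line of the telemetry file is one request record. A hypothetical example (field names and values are illustrative; the actual llmscope schema may differ):

```json
{"route": "/answer-routed", "latency_ms": 812.4, "status": "ok", "cost_usd": 0.0039}
```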
`gate.py`: Loads telemetry, computes metrics, and evaluates thresholds. Returns a `GateResult` with structured checks and failures.
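A minimal sketch of the three aggregate computations, assuming hypothetical field names (`latency_ms`, `status`, `cost_usd`); the actual schema and implementation live in llmscope and `gate.py`:

```python
def compute_metrics(records: list[dict]) -> dict:
    """Illustrative only: nearest-rank p95, error ratio, and mean cost."""
    latencies = sorted(float(r["latency_ms"]) for r in records)
    # Nearest-rank p95: the value at the 95th-percentile position.
    p95 = latencies[max(0, round(0.95 * len(latencies)) - 1)]
    error_rate = sum(r.get("status") == "error" for r in records) / len(records)
    avg_cost = sum(float(r["cost_usd"]) for r in records) / len(records)
    return {"p95_latency_ms": p95, "error_rate": error_rate, "avg_cost_usd": avg_cost}
```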
`policy_loader.py`: Parses YAML policy files with global defaults and per-route overrides. Routes inherit from global — only override what you need.
`__main__.py`: Wraps the library API for CI/CD integration. Supports text and JSON output, route filtering, and multi-route evaluation.
```
src/llm_eval_gate/
    __init__.py        # Public API exports
    __main__.py        # CLI entry point
    gate.py            # QualityGate, Thresholds, evaluation logic
    policy_loader.py   # YAML policy parsing and threshold resolution
tests/
    test_gate.py       # Gate evaluation tests
    test_policy_cli.py # Policy loading and CLI integration tests
gate.yaml              # Example policy configuration
pyproject.toml         # Package metadata and dependencies
```
```shell
python3 -m venv .venv && source .venv/bin/activate

# With llmscope (for envelope type validation)
pip3 install -e ../llmscope -e ".[dev]"

# Without llmscope (gate works with raw JSONL dicts)
pip3 install -e ".[dev]"
```

Dependencies: PyYAML (required), llmscope (optional), pytest (dev).
```shell
# Evaluate with default thresholds
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl

# Evaluate with policy YAML
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl \
    --policy gate.yaml

# Evaluate a specific route
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl \
    --policy gate.yaml \
    --route /answer-routed

# JSON output for CI/CD pipelines
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl \
    --policy gate.yaml \
    --format json
```

Exit codes: 0 = go, 1 = no-go, 2 = configuration error.
Thresholds are configurable via YAML. Per-route sections inherit from global — only override what you need.
```yaml
global:
  max_p95_latency_ms: 2000.0
  max_error_rate: 0.05
  max_avg_cost_usd: 0.01
  min_sample_size: 10

routes:
  /answer-routed:
    max_p95_latency_ms: 3000.0
    max_avg_cost_usd: 0.02
  /conversation-turn:
    max_p95_latency_ms: 5000.0
```

When routes are defined, the CLI evaluates each route with its own thresholds plus a global evaluation. Any failure blocks the release.
```python
from llm_eval_gate import QualityGate, Thresholds

gate = QualityGate(
    telemetry_path="artifacts/logs/telemetry.jsonl",
    thresholds=Thresholds(
        max_p95_latency_ms=2000.0,
        max_error_rate=0.05,
        max_avg_cost_usd=0.01,
    ),
)
result = gate.evaluate()
print(result.summary())
```

Output:
```
Quality Gate: PASS (150 samples)
[ok] p95_latency: 834.5000 ms (threshold: 2000.0000)
[ok] error_rate: 0.0133 ratio (threshold: 0.0500)
[ok] avg_cost: 0.0042 USD (threshold: 0.0100)
```
Or load thresholds from YAML:
```python
from llm_eval_gate import load_gate_policy, QualityGate

policy = load_gate_policy("gate.yaml")
for route in policy.routes:
    gate = QualityGate(
        telemetry_path="telemetry.jsonl",
        thresholds=policy.get_thresholds(route),
        route_filter=route,
    )
    result = gate.evaluate()
    print(f"{route}: {'go' if result.passed else 'no-go'}")
```

Run the tests:

```shell
python3 -m pytest tests/ -q
```

Tests cover:
- Gate evaluation (pass/fail scenarios for latency, error rate, cost)
- Policy loading and threshold inheritance
- CLI integration (exit codes, output formats, route filtering)
- Multi-route evaluation
- Edge cases (missing files, insufficient samples)
```yaml
# GitHub Actions example
- name: Quality gate
  run: |
    python3 -m llm_eval_gate evaluate \
      --telemetry artifacts/logs/telemetry.jsonl \
      --policy gate.yaml \
      --format json
```

The gate exits with code 1 on failure, blocking the pipeline.
| Check | Metric | Why |
|---|---|---|
| `p95_latency` | 95th percentile latency in ms | User experience degradation |
| `error_rate` | Error count / total requests | System reliability |
| `avg_cost` | Mean cost per request in USD | FinOps budget control |
Together, these three metrics answer the core go/no-go questions for an LLM deployment: is it fast enough, reliable enough, and cheap enough?
- No baseline comparison: Evaluates absolute thresholds only. Cannot detect relative regression against a previous build.
- No policy composition: Single pass/fail decision. No support for warnings, soft failures, or composite rules.
- No cohort segmentation: Evaluates aggregates. Cannot segment by model, provider, tenant, or experiment arm.
- No statistical testing: Uses simple percentile and average calculations. No confidence intervals or significance tests.
- No real-time evaluation: Operates on static telemetry files. Not designed for streaming or live traffic.
These are intentional scope boundaries for v0.1.0. See roadmap for planned expansions.
Baseline comparison: Compare candidate telemetry against a baseline run. Block if the error rate increases by more than N% even if it is still below the absolute threshold. Detect regressions that absolute thresholds miss.
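One possible shape for such a check, purely as a roadmap illustration (`regressed` and its margin parameter are not part of the current API):

```python
def regressed(baseline_error_rate: float, candidate_error_rate: float,
              max_relative_increase: float = 0.20) -> bool:
    """Flag a relative regression even when the candidate still clears
    the absolute threshold. Hypothetical sketch, not shipped code."""
    if baseline_error_rate == 0:
        # Any errors at all are a regression against a clean baseline.
        return candidate_error_rate > 0
    delta = (candidate_error_rate - baseline_error_rate) / baseline_error_rate
    return delta > max_relative_increase
```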
Composite policies: "global pass if all routes pass, but hard-block if checkout error_rate > X, warn if avg_cost increases >N% vs baseline." Moves from threshold checking to release governance.
Cohort segmentation: Segment evaluation by model, provider, tenant, or experiment arm. Prevent approval when the global average passes but a critical segment fails.
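A sketch of what cohort segmentation could look like; the grouping field (`model` here) is an assumption about the telemetry schema:

```python
from collections import defaultdict

def segment(records: list[dict], key: str = "model") -> dict[str, list[dict]]:
    # Group telemetry records into cohorts; the gate would then evaluate
    # each cohort against its own thresholds instead of one global aggregate.
    cohorts: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        cohorts[record.get(key, "unknown")].append(record)
    return dict(cohorts)
```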
- llmscope — Observable cost control for production LLMs. The instrumentation layer that produces the telemetry this gate evaluates.
- caseledger — Policy-bounded decision traces for AI-assisted financial operations. Demonstrates llmscope + llm-eval-gate in a regulated finance workflow.