
llm-eval-gate

Evidence-based deployment quality gate for LLM systems. Evaluates telemetry against latency, error, and cost policies to produce auditable go/no-go release decisions.

Why this exists

Production LLM deployments require operational decisions grounded in observable evidence, not vibes. This tool reads structured telemetry (JSONL), applies configurable threshold policies, and returns a binary decision with per-check evidence. Designed for CI/CD pipelines where a release must be blocked or approved based on latency, reliability, and cost metrics.

What this repository demonstrates

A minimal, testable quality gate that:

  • Reads JSONL telemetry produced by llmscope or compatible instrumentation
  • Computes three aggregate metrics: p95 latency, error rate, average cost per request
  • Evaluates against configurable thresholds (global or per-route)
  • Returns structured pass/fail with evidence for each check
  • Exits with code 0 (go), 1 (no-go), or 2 (config error) for pipeline integration
  • Supports both CLI and library API usage

This is not a semantic evaluation framework, observability backend, or dashboard. It is a policy evaluator that turns telemetry into a release decision.
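The three aggregate metrics can be sketched in a few lines of plain Python. This is an illustrative sketch, not the gate's actual implementation, and the telemetry field names (`latency_ms`, `error`, `cost_usd`) are assumptions about the JSONL schema:

```python
import math

def p95(values):
    """95th percentile via nearest-rank on sorted values."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]

def aggregate(records):
    """Compute the three gate metrics from a list of telemetry dicts.
    Field names are hypothetical; adjust to your instrumentation."""
    latencies = [r["latency_ms"] for r in records]
    errors = sum(1 for r in records if r.get("error"))
    costs = [r["cost_usd"] for r in records]
    return {
        "p95_latency_ms": p95(latencies),
        "error_rate": errors / len(records),
        "avg_cost_usd": sum(costs) / len(costs),
    }
```

Each metric is a pure function of the telemetry file, which is what makes the resulting decision auditable: the same artifact always yields the same numbers.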

Architecture

llmscope (instrumentation) → telemetry.jsonl → llm-eval-gate (policy) → go / no-go

The gate operates post-hoc on telemetry artifacts. It does not instrument requests, route traffic, or collect metrics. It reads evidence and applies policy.

Core components

QualityGate

Loads telemetry, computes metrics, evaluates thresholds. Returns GateResult with structured checks and failures.

Policy loader

Parses YAML policy files with global defaults and per-route overrides. Routes inherit from global — only override what you need.

CLI

Wraps the library API for CI/CD integration. Supports text and JSON output, route filtering, and multi-route evaluation.

Repository structure

src/llm_eval_gate/
  __init__.py         # Public API exports
  __main__.py         # CLI entry point
  gate.py             # QualityGate, Thresholds, evaluation logic
  policy_loader.py    # YAML policy parsing and threshold resolution

tests/
  test_gate.py        # Gate evaluation tests
  test_policy_cli.py  # Policy loading and CLI integration tests

gate.yaml             # Example policy configuration
pyproject.toml        # Package metadata and dependencies

Local setup

python3 -m venv .venv && source .venv/bin/activate

# With llmscope (for envelope type validation)
pip3 install -e ../llmscope -e ".[dev]"

# Without llmscope (gate works with raw JSONL dicts)
pip3 install -e ".[dev]"

Dependencies: PyYAML (required), llmscope (optional), pytest (dev).

CLI usage

# Evaluate with default thresholds
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl

# Evaluate with policy YAML
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl \
    --policy gate.yaml

# Evaluate a specific route
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl \
    --policy gate.yaml \
    --route /answer-routed

# JSON output for CI/CD pipelines
python3 -m llm_eval_gate evaluate \
    --telemetry artifacts/logs/telemetry.jsonl \
    --policy gate.yaml \
    --format json

Exit codes: 0 = go, 1 = no-go, 2 = configuration error.
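A pipeline step can branch on these codes directly. A minimal sketch of a wrapper (the subprocess invocation mirrors the CLI usage above and assumes the package is installed; the helper names are illustrative):

```python
import subprocess

EXIT_DECISIONS = {0: "go", 1: "no-go", 2: "config-error"}

def decision_from_exit(code: int) -> str:
    # Any unexpected exit code is treated as a configuration problem.
    return EXIT_DECISIONS.get(code, "config-error")

def run_gate(telemetry: str, policy: str) -> str:
    """Invoke the gate CLI and translate its exit code to a decision."""
    proc = subprocess.run(
        ["python3", "-m", "llm_eval_gate", "evaluate",
         "--telemetry", telemetry, "--policy", policy],
    )
    return decision_from_exit(proc.returncode)
```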

Policy YAML

Thresholds are configurable via YAML. Per-route sections inherit from global — only override what you need.

global:
  max_p95_latency_ms: 2000.0
  max_error_rate: 0.05
  max_avg_cost_usd: 0.01
  min_sample_size: 10

routes:
  /answer-routed:
    max_p95_latency_ms: 3000.0
    max_avg_cost_usd: 0.02
  /conversation-turn:
    max_p95_latency_ms: 5000.0

When routes are defined, the CLI evaluates each route with its own thresholds plus a global evaluation. Any failure blocks the release.
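The inheritance rule amounts to a shallow merge of the global defaults with the route's overrides. A sketch of the idea (not the actual policy_loader implementation):

```python
# Global defaults taken from the example policy above.
GLOBAL = {
    "max_p95_latency_ms": 2000.0,
    "max_error_rate": 0.05,
    "max_avg_cost_usd": 0.01,
    "min_sample_size": 10,
}

def resolve_thresholds(route_overrides: dict) -> dict:
    """Route sections start from the global defaults and override
    only the keys they declare."""
    return {**GLOBAL, **route_overrides}
```

Resolving /answer-routed from the example YAML keeps max_error_rate (0.05) and min_sample_size (10) from global while overriding latency and cost.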

Library API

from llm_eval_gate import QualityGate, Thresholds

gate = QualityGate(
    telemetry_path="artifacts/logs/telemetry.jsonl",
    thresholds=Thresholds(
        max_p95_latency_ms=2000.0,
        max_error_rate=0.05,
        max_avg_cost_usd=0.01,
    ),
)
result = gate.evaluate()
print(result.summary())

Output:

Quality Gate: PASS (150 samples)
  [ok] p95_latency: 834.5000 ms (threshold: 2000.0000)
  [ok] error_rate: 0.0133 ratio (threshold: 0.0500)
  [ok] avg_cost: 0.0042 USD (threshold: 0.0100)

Or load thresholds from YAML:

from llm_eval_gate import load_gate_policy, QualityGate

policy = load_gate_policy("gate.yaml")
for route in policy.routes:
    gate = QualityGate(
        telemetry_path="telemetry.jsonl",
        thresholds=policy.get_thresholds(route),
        route_filter=route,
    )
    result = gate.evaluate()
    print(f"{route}: {'go' if result.passed else 'no-go'}")

Validation

python3 -m pytest tests/ -q

Tests cover:

  • Gate evaluation (pass/fail scenarios for latency, error rate, cost)
  • Policy loading and threshold inheritance
  • CLI integration (exit codes, output formats, route filtering)
  • Multi-route evaluation
  • Edge cases (missing files, insufficient samples)

CI/CD integration

# GitHub Actions example
- name: Quality gate
  run: |
    python3 -m llm_eval_gate evaluate \
      --telemetry artifacts/logs/telemetry.jsonl \
      --policy gate.yaml \
      --format json

The gate exits with code 1 on failure, blocking the pipeline.

What this evaluates

Check         Metric                          Why
p95_latency   95th percentile latency in ms   User experience degradation
error_rate    Error count / total requests    System reliability
avg_cost      Mean cost per request in USD    FinOps budget control

These three metrics answer the questions that dominate LLM deployment decisions: is it fast enough, reliable enough, and cheap enough?

Current limitations

  • No baseline comparison: Evaluates absolute thresholds only. Cannot detect relative regression against a previous build.
  • No policy composition: Single pass/fail decision. No support for warnings, soft failures, or composite rules.
  • No cohort segmentation: Evaluates aggregates. Cannot segment by model, provider, tenant, or experiment arm.
  • No statistical testing: Uses simple percentile and average calculations. No confidence intervals or significance tests.
  • No real-time evaluation: Operates on static telemetry files. Not designed for streaming or live traffic.

These are intentional scope boundaries for v0.1.0. See roadmap for planned expansions.

Near-term roadmap

v0.2.0 — Baseline comparison

Compare candidate telemetry against a baseline run. Block if error rate increases by >N% even if still below absolute threshold. Detect regressions that aggregated metrics miss.
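One hypothetical shape for that check, where both the function name and the 10% default tolerance are assumptions about the planned feature:

```python
def regressed(baseline_error_rate: float, candidate_error_rate: float,
              max_relative_increase: float = 0.10) -> bool:
    """Block when the candidate's error rate grows by more than
    max_relative_increase over the baseline, even if both remain
    below the absolute threshold."""
    if baseline_error_rate == 0:
        # Any new errors against a clean baseline count as a regression.
        return candidate_error_rate > 0
    increase = (candidate_error_rate - baseline_error_rate) / baseline_error_rate
    return increase > max_relative_increase
```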

v0.3.0 — Policy composition

Composite policies: "global pass if all routes pass, but hard-block if checkout error_rate > X, warn if avg_cost increases >N% vs baseline." Moves from threshold checking to release governance.

v0.4.0 — Cohort segmentation

Segment evaluation by model, provider, tenant, experiment arm. Prevent approval when the global average passes but a critical segment fails.
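A sketch of how segmentation could surface that case, assuming telemetry records carry a hypothetical segment field such as "model":

```python
from collections import defaultdict

def error_rate_by_segment(records, key="model"):
    """Group records by a segment key and compute per-segment error
    rates, so a failing segment is visible even when the global
    average would pass the gate."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(bool(r.get("error")))
    return {seg: sum(errs) / len(errs) for seg, errs in buckets.items()}
```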

Related projects

  • llmscope — Observable cost control for production LLMs. The instrumentation layer that produces the telemetry this gate evaluates.
  • caseledger — Policy-bounded decision traces for AI-assisted financial operations. Demonstrates llmscope + llm-eval-gate in a regulated finance workflow.
