
llm-eval-gate Roadmap

llm-eval-gate is being built in layers. Phase 1 focuses on what makes the project immediately useful in a real deployment pipeline: CLI execution, YAML-based policy loading, and route-segmented thresholds. The goal of this phase is not to maximize feature surface area, but to deliver a small, legible, and reliable core that turns telemetry into a reproducible operational decision.
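To make the Phase 1 shape concrete, a route-segmented threshold policy could be sketched roughly as follows. All field names here are illustrative assumptions, not the project's actual schema:

```yaml
# Hypothetical policy file -- field names are assumptions, not the real schema.
policy:
  default:
    max_error_rate: 0.02       # absolute ceiling over aggregated telemetry
    max_p95_latency_ms: 1200
  routes:
    /v1/chat:                  # route-segmented override of the defaults
      max_error_rate: 0.01
      max_p95_latency_ms: 800
```

A CLI run would then load this YAML, evaluate the aggregated metrics per route, and emit a single reproducible pass/fail decision.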

The next phases expand the project toward a policy engine for evidence-based release decisions. The priority is not to add abstract complexity, but to improve decision quality without sacrificing auditability or compatibility with today’s simple workflow.

Phase 2 — More expressive policy semantics

The next step is to enrich policy semantics beyond absolute thresholds over aggregates. This includes simple rule composition, differentiation between blocking failures and warnings, and more precise scoping by operational segment such as route, model, provider, or environment. The goal here is to support decisions that are more useful than a single global pass/fail outcome, without turning the project into a generic rules framework.
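Under these semantics, a policy might compose named rules with an explicit severity and scope. The sketch below is a hypothetical shape for illustration only:

```yaml
# Hypothetical Phase 2 policy sketch -- rule names and fields are assumptions.
rules:
  - name: error-rate-ceiling
    metric: error_rate
    max: 0.02
    severity: block            # failing this rule rejects the candidate
    scope:
      route: /v1/chat
      environment: production
  - name: latency-drift
    metric: p95_latency_ms
    max: 900
    severity: warn             # failing this rule annotates but does not block
```

The `severity` field is what separates a blocking failure from a warning, and `scope` is what keeps a rule from silently applying to every route, model, provider, or environment.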

This phase also includes an important improvement in operational legibility: making it clearer why a decision failed and which segment caused the failure. The focus is not on renaming structures that already exist, but on making the output more actionable for CI/CD, canary promotion, and human review.
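One way to make failure attribution legible is a machine-readable decision record that names the failing rule and the segment that triggered it. This is a hypothetical shape, not a published output format:

```json
{
  "decision": "block",
  "failures": [
    {
      "rule": "error-rate-ceiling",
      "segment": { "route": "/v1/chat", "environment": "production" },
      "observed": 0.034,
      "limit": 0.02,
      "severity": "block"
    }
  ],
  "warnings": []
}
```

A record like this can be consumed directly by CI/CD or a canary controller, and still read at a glance by a human reviewer.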

Phase 3 — Baseline comparison and relative regression

Absolute thresholds are sufficient for an initial gate, but they do not capture relative regression against a known baseline. The product direction includes comparing a candidate against a baseline to detect material degradation even when the absolute value is still below the allowed ceiling. This makes it possible to answer questions that are closer to real release decisions, such as whether a new version degraded error rate, latency, or cost relative to the previous version.
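The core of a relative-regression check is small. The sketch below is illustrative and assumes a "lower is better" metric such as error rate, latency, or cost; it is not the project's API:

```python
def regressed(candidate: float, baseline: float, tolerance: float) -> bool:
    """Return True when the candidate degraded beyond a relative tolerance.

    Illustrative sketch for a "lower is better" metric (error rate,
    latency, cost); not the project's actual API.
    """
    if baseline == 0:
        # Any nonzero value counts as a regression against a zero baseline.
        return candidate > 0
    return (candidate - baseline) / baseline > tolerance

# A candidate can pass an absolute ceiling yet still regress against baseline:
# an error rate of 0.018 is under a 0.02 ceiling, but 50% worse than 0.012.
print(regressed(candidate=0.018, baseline=0.012, tolerance=0.10))  # True
```

This is exactly the case an absolute threshold misses: the build stays under the ceiling while quietly degrading relative to the previous version.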

This phase may include comparison by cohort, time window, or reference build, but it should only be implemented when there is a real consumer asking for this kind of decision. The intent is to avoid introducing statistical complexity and configuration surface area before there is enough operational pressure to justify them.

Phase 4 — Economic provenance and richer segmentation

Today the gate operates on aggregated telemetry metrics. The stronger product direction is to incorporate economic provenance and richer segmentation in order to reduce causal confusion in cost and quality decisions. This includes distinguishing different economic regimes, separating relevant segments, and avoiding situations where a global average hides the behavior of a critical route or cohort.
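The failure mode that motivates this phase is easy to demonstrate with invented numbers: a global aggregate passes while a critical route fails. Route names and figures below are made up for the example:

```python
# Illustrative only: how a global average can hide a failing route.
# Route names and numbers are invented for this example.
telemetry = {
    "/v1/chat":     {"requests": 9000, "errors": 90},   # 1.0% error rate
    "/v1/critical": {"requests": 1000, "errors": 80},   # 8.0% error rate
}

ceiling = 0.02

total_requests = sum(s["requests"] for s in telemetry.values())
total_errors = sum(s["errors"] for s in telemetry.values())
global_rate = total_errors / total_requests  # 170 / 10000 = 1.7% -> passes

print(f"global: {global_rate:.3f} -> {'pass' if global_rate <= ceiling else 'fail'}")
for route, s in telemetry.items():
    rate = s["errors"] / s["requests"]
    print(f"{route}: {rate:.3f} -> {'pass' if rate <= ceiling else 'fail'}")
```

Evaluating per segment rather than globally is what lets the gate say not just that the build passed, but where it passed and which route would have failed on its own.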

In practice, this means evolving the project to answer not only whether a build passed, but in what context it passed, where it failed, and which operational or economic dimension is driving the decision. This layer is especially important for connecting observability, FinOps, and release governance without making a dashboard a required intermediary.

Phase 5 — Stronger integration with release automation

Once the decision logic is consolidated, the next natural step is to expand the integration surface with automation. This includes more stable output formats for pipeline consumers, more direct integration wrappers for CI, and more explicit conventions for use in promotion, blocking, and automated rollback flows.
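Conventions of this kind are often expressed as stable exit codes plus a machine-readable report on stdout. The sketch below is an assumption about shape, not a published contract, and the exit-code values are invented for illustration:

```python
import json
import sys

# Hypothetical exit-code convention for pipeline consumers (invented values).
EXIT_PASS, EXIT_BLOCK, EXIT_POLICY_ERROR = 0, 1, 2

def emit_decision(decision: str, failures: list[dict]) -> int:
    """Write a machine-readable report to stdout and return a stable exit code."""
    json.dump({"decision": decision, "failures": failures}, sys.stdout, indent=2)
    sys.stdout.write("\n")
    return EXIT_PASS if decision == "pass" else EXIT_BLOCK

code = emit_decision("block", [{"rule": "error-rate-ceiling", "route": "/v1/chat"}])
```

A CI wrapper can then branch on the exit code for promotion, blocking, or rollback, while humans and downstream tooling read the JSON report for the justification.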

The intent here is not to turn the project into a delivery platform, but to make it a reliable component within existing pipelines. The center remains a small, evidence-oriented evaluation engine, not a full orchestration system.

Phase 6 — Policy scope beyond aggregated thresholds

In the long term, llm-eval-gate may evolve from a gate based on aggregated metrics into a more complete policy evaluation component for LLM workloads. That opens the door to more sophisticated criteria, richer segmentation, and release contracts that are closer to real operations. Even so, the direction remains deliberately narrow: turning observable evidence into a promotion-or-block decision, with an auditable trail and low cognitive overhead.

What is not part of the direction

This project is not intended to become a semantic evaluation framework, a prompt benchmarking suite, an observability dashboard, or a generic rules engine. Better tools already exist for those problems. The product direction remains focused on a narrower and more useful question: is the available telemetry sufficient to approve or reject a change with a clear justification?

Evolution principle

The project’s evolution strategy is to add capability only when it materially improves release decision quality. Complexity without clear decision-making value will be treated as debt, not progress. The goal is to preserve a small, auditable, PyPI-ready core, and expand only when there is evidence of real usage demanding the next layer.