feat(benchmarks): scaling, scenarios, CI quality gate, and release trend #727
dgenio wants to merge 2 commits into
Coordinated maturation of the benchmark subsystem (issues #369, #418, #491, #554, #687, #688). All additions are deterministic and offline; new scripts reuse the installed package only.

- #369 large-catalog benchmark: benchmarks/large_catalog.py routes 300+ tools across 8 namespaces with near-duplicate distractors and destructive tools; reports recall@1/3/5, MRR, ChoiceCard-vs-naive token reduction, and deny filtering. Writes a latency-free committed scorecard (+ gitignored JSON); --check gates drift, --strict gates regression-guard thresholds.
- #418 scenario benchmark: benchmarks/scenario_routing.py contrasts naive all-tools prompting vs bounded ChoiceCard routing over a curated dataset, emitting a committed comparison report.
- #491 quality gate: scripts/benchmark_gate.py + benchmarks/gating.yaml turn the informational delta into a gating CI job; recall@k / MRR / precision@k / token-savings regressions beyond their bands fail CI. Latency never gates; the benchmark-accepted label downgrades a failure to a warning.
- #554 release trend: scripts/render_trend.py captures a deterministic, latency-free snapshot per release (benchmarks/results/history/<v>.json) and renders benchmarks/trend.md (make trend / trend-check).
- #687 scaling docs: docs/benchmarks/scaling-matrix.md documents the 10k methodology, reproducible commands, and interpretation.
- #688 scheduled smoke: .github/workflows/benchmark-scale.yml runs the routing-scale profiler weekly and uploads a per-run trend artifact.

Wiring: Makefile targets, ci.yml benchmark-gate job, mkdocs nav, AGENTS.md commands, docs/benchmarks.md, CHANGELOG. New tests cover the gate, trend, large-catalog, and scenario modules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01H79GcrysRhSctnSYJEkuGk
Pull request overview
This PR significantly expands and hardens the benchmarks/ subsystem by adding new deterministic/offline benchmark scenarios (large catalog + scenario routing), introducing a CI quality-regression gate with tolerance bands, and publishing a release-over-release trend page and scaling documentation. It also wires these benchmarks into docs navigation, Makefile targets, and CI/scheduled workflows to make benchmark drift and regressions visible and enforceable.
Changes:
- Add new deterministic benchmarks: a 300+ tool “large catalog” quality benchmark and a scenario routing benchmark (naive vs ChoiceCards), with committed markdown artifacts and `--check` drift gates.
- Introduce a new benchmark quality CI gate (`scripts/benchmark_gate.py` + `benchmarks/gating.yaml`) and a weekly non-gating routing-scale smoke workflow.
- Add release history snapshots + trend renderer (`scripts/render_trend.py` → `benchmarks/trend.md`) and update docs/nav + Makefile/AGENTS/CHANGELOG wiring.
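A minimal sketch of how a `--check` drift gate of this kind can work, assuming the benchmark renders its report deterministically to a string and compares it with the committed file. The names `render_report` and `check_drift` and the file name are hypothetical illustrations, not the functions added in this PR.

```python
import tempfile
from pathlib import Path


def render_report(metrics: dict[str, float]) -> str:
    """Render metrics deterministically: sorted keys, fixed precision, no timestamps."""
    lines = ["| metric | value |", "|---|---|"]
    for name in sorted(metrics):
        lines.append(f"| {name} | {metrics[name]:.4f} |")
    return "\n".join(lines) + "\n"


def check_drift(committed: Path, metrics: dict[str, float]) -> int:
    """Return 0 when the committed file matches a fresh render, 1 otherwise."""
    fresh = render_report(metrics)
    if not committed.exists() or committed.read_text(encoding="utf-8") != fresh:
        return 1
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "scorecard.md"  # hypothetical artifact name
        metrics = {"recall@1": 0.5, "mrr": 0.625}
        report.write_text(render_report(metrics), encoding="utf-8")
        print(check_drift(report, metrics))  # unchanged metrics: no drift
        print(check_drift(report, {**metrics, "mrr": 0.5}))  # changed metric: drift
```

Keeping latency and timestamps out of the rendered text is what makes a byte-for-byte comparison usable as a gate.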
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/test_scenario_routing.py` | Determinism and drift-gate tests for the scenario routing benchmark. |
| `tests/test_render_trend.py` | Unit tests for release snapshot extraction, ordering, deterministic rendering, and committed drift check. |
| `tests/test_large_catalog_benchmark.py` | Structural/determinism tests for the large-catalog benchmark (reduced size in CI). |
| `tests/test_benchmark_gate.py` | Coverage of gating semantics (bands, skipped/new cells, override behavior, CLI exit codes). |
| `scripts/render_trend.py` | New script to snapshot deterministic metrics per release and render `benchmarks/trend.md` (+ `--check`). |
| `scripts/benchmark_gate.py` | New script to fail CI when deterministic quality metrics regress beyond configured bands. |
| `mkdocs.yml` | Adds benchmarks sub-nav (overview + scaling matrix + routing-scale profile). |
| `Makefile` | Adds targets for large-catalog, scenario benchmark, and trend render/check. |
| `docs/benchmarks/scaling-matrix.md` | New scaling methodology page tying together scale-related benchmarks. |
| `docs/benchmarks.md` | Adds “Scaling and trend” section and links to new benchmark outputs. |
| `CHANGELOG.md` | Documents the benchmark subsystem maturation and new gates/artifacts. |
| `benchmarks/trend.md` | New committed, generated trend page (from release history snapshots). |
| `benchmarks/scenarios/routing_choicecard.json` | New scenario dataset for naive vs ChoiceCard routing comparisons. |
| `benchmarks/scenario_routing.py` | New deterministic scenario benchmark script + `--check` drift behavior. |
| `benchmarks/scenario_routing.md` | New committed, generated scenario report. |
| `benchmarks/results/history/0.16.0.json` | First per-release deterministic snapshot for trend history. |
| `benchmarks/large_catalog.py` | New deterministic 300+ tool benchmark with distractors/destructive tools + scorecard + JSON output. |
| `benchmarks/large_catalog_scorecard.md` | New committed, generated large-catalog scorecard. |
| `benchmarks/gating.yaml` | New tolerance-band config for quality gating and override label name. |
| `AGENTS.md` | Documents new benchmark/trend Makefile commands. |
| `.github/workflows/ci.yml` | Adds the new benchmark-gate CI job (quality-regression enforcement). |
| `.github/workflows/benchmark-scale.yml` | Adds weekly scheduled non-gating routing-scale smoke benchmark with artifacts. |
.github/workflows/ci.yml

    --base benchmarks/results/base.json \
    --head benchmarks/results/head.json \
    --gating-config benchmarks/gating.yaml $OVERRIDE
    edit-mode: replace

scripts/benchmark_gate.py

    def load_gating_config(path: Path | None) -> GatingConfig:
        """Load bands from *path*; fall back to :data:`DEFAULT_BANDS` when absent.

        Only the ``quality`` metrics whose band is a positive number are gated;
        a metric set to ``gating: false`` (or omitted) is treated as informational.
        """
        if path is None or not path.exists():
            return GatingConfig(bands=dict(DEFAULT_BANDS))
        import yaml  # lazy: keeps the import off the no-config path

        raw = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
        quality = raw.get("quality", {}) if isinstance(raw, dict) else {}
        bands: dict[str, float] = {}
        for metric, spec in (quality or {}).items():
            if not isinstance(spec, dict):
                continue
            band = spec.get("max_regression_pp")
            if isinstance(band, (int, float)) and band >= 0:
                bands[str(metric)] = float(band)
        override = str(raw.get("override_label", "benchmark-accepted")) if isinstance(raw, dict) else ""
        return GatingConfig(bands=bands or dict(DEFAULT_BANDS), override_label=override)

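The band semantics loaded above can be illustrated with a small sketch. It assumes metrics are fractions in [0, 1], that a regression is measured in percentage points as (base - head) × 100, and that a metric fails only when its regression exceeds its band, so a band of 0 tolerates no regression at all. `gate` and `GateResult` are hypothetical names, not the actual `scripts/benchmark_gate.py` API, and the handling of missing cells is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    failures: list[str]
    exit_code: int


def gate(
    base: dict[str, float],
    head: dict[str, float],
    bands: dict[str, float],
    override: bool = False,
) -> GateResult:
    """Fail when a gated metric drops by more than its band (percentage points)."""
    failures = []
    for metric, band in sorted(bands.items()):
        if metric not in base or metric not in head:
            continue  # assumption: new or skipped cells are informational only
        regression_pp = (base[metric] - head[metric]) * 100.0
        if regression_pp > band + 1e-9:  # small epsilon absorbs float noise
            failures.append(f"{metric}: -{regression_pp:.2f}pp (band {band:.2f}pp)")
    # An override (e.g. the `benchmark-accepted` label) downgrades failure to a warning.
    exit_code = 1 if failures and not override else 0
    return GateResult(failures=failures, exit_code=exit_code)
```

Metrics without a band, such as latency, never appear in `bands` and therefore cannot fail the gate in this sketch.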
benchmarks/scenario_routing.py

    "- `reduction` is how much smaller the ChoiceCard prompt is than dumping",
    "  every tool schema — the headline routing benefit at scale.",

docs/benchmarks/scaling-matrix.md

    - **Token reduction is the headline benefit.** Bounded `ChoiceCard`s shrink
      the routing prompt by ~95–97% versus exposing every tool schema — the gap
      widens as the catalog grows, which is exactly when naive all-tools prompting
      becomes untenable.

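As a worked illustration of a reduction figure like this one, the sketch below assumes it is defined as 1 - (bounded prompt tokens / naive prompt tokens). The catalog is made up, the shortlist size of five is an arbitrary choice, and whitespace word counts stand in for a real tokenizer, so the result is not a measurement from this PR.

```python
def token_reduction(naive_tokens: int, bounded_tokens: int) -> float:
    """Fraction by which the bounded prompt is smaller than the naive one."""
    if naive_tokens <= 0:
        raise ValueError("naive_tokens must be positive")
    return 1.0 - bounded_tokens / naive_tokens


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


# Hypothetical catalog: the naive prompt lists every tool's name + description,
# while the bounded prompt lists only a shortlist of five candidates.
catalog = [f"tool_{i}: does thing number {i}" for i in range(100)]
naive_prompt = "\n".join(catalog)
bounded_prompt = "\n".join(catalog[:5])

reduction = token_reduction(approx_tokens(naive_prompt), approx_tokens(bounded_prompt))
print(f"{reduction:.0%}")  # 95%
```

With a fixed shortlist, the bounded prompt stays the same size while the naive prompt grows linearly with the catalog, which is why the gap widens at scale.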
…ine wording

- ci.yml: remove a stray `edit-mode: replace` line that had been orphaned inside the benchmark-gate job's bash `run` block (would have executed as a command and failed the job); restore it as the intended input of the benchmark-comment sticky-comment step.
- benchmark_gate.py: clarify load_gating_config docstring: bands are gated when non-negative (0pp means "no regression tolerated"), matching the `>= 0` check.
- scenario_routing.py + scaling-matrix.md: the naive baseline is each tool's name + description, not full JSON schemas; reword the report/docs to match what is measured (and regenerate scenario_routing.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01H79GcrysRhSctnSYJEkuGk
Benchmark delta (vs
| size | recall@k (head Δ vs base) | MRR (head Δ vs base) | p99 (ms) |
|---|---|---|---|
| 50 | ✅ 0.5649 (+0.0000) | ✅ 0.4978 (+0.0000) | ✅ 0.546 (base 0.759) |
| 83 | ✅ 0.3825 (+0.0000) | ✅ 0.3242 (+0.0000) | ✅ 0.738 (base 1.134) |
| 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 39.174 (base 41.711) |
Per-backend × per-size matrix
| backend | size | recall@k (Δ) | MRR (Δ) | p99 (ms) |
|---|---|---|---|---|
| bm25 | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3399 (+0.0000) | ✅ 6.510 (base 8.140) |
| bm25 | 500 | ✅ 0.2250 (+0.0000) | ✅ 0.2165 (+0.0000) | ✅ 29.479 (base 38.989) |
| bm25 | 1000 | ✅ 0.1575 (+0.0000) | ✅ 0.1525 (+0.0000) | ✅ 86.777 (base 111.716) |
| embedding_hashing | 100 | ✅ 0.5175 (+0.0000) | ✅ 0.4360 (+0.0000) | ✅ 7.567 (base 7.225) |
| embedding_hashing | 500 | ✅ 0.2700 (+0.0000) | ✅ 0.2674 (+0.0000) | ✅ 41.610 (base 44.182) |
| embedding_hashing | 1000 | ✅ 0.2000 (+0.0000) | ✅ 0.1931 (+0.0000) | ✅ 101.199 (base 98.277) |
| embedding_st | 100 | skipped (skipped: missing sentence-transformers) | — | — |
| embedding_st | 500 | skipped (skipped: missing sentence-transformers) | — | — |
| embedding_st | 1000 | skipped (skipped: missing sentence-transformers) | — | — |
| fuzzy | 100 | skipped (skipped: missing rapidfuzz) | — | — |
| fuzzy | 500 | skipped (skipped: missing rapidfuzz) | — | — |
| fuzzy | 1000 | skipped (skipped: missing rapidfuzz) | — | — |
| tfidf | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3220 (+0.0000) | ✅ 1.054 (base 1.102) |
| tfidf | 500 | ✅ 0.2325 (+0.0000) | ✅ 0.2314 (+0.0000) | ✅ 10.083 (base 11.492) |
| tfidf | 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 36.813 (base 50.755) |
Context pipeline (per scenario)
| scenario | tokens | dropped | dedup |
|---|---|---|---|
| large_catalog | 1480 (base 1514, Δ-34) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| long_conversation | 2500 (base 2548, Δ-48) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| mixed_payload | 488 (base 497, Δ-9) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| short_conversation | 487 (base 496, Δ-9) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| stress_conversation | 6590 (base 6651, Δ-61) | 11 (base 7, Δ+4) | 4 (base 4, Δ+0) |
| tiny_payload | 256 (base 267, Δ-11) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
Numbers come from `make benchmark` / `make benchmark-matrix`.
Latency is hardware-dependent, so treat the markers as a rough guide.
See `benchmarks/scorecard.md` for the full picture.
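For readers unfamiliar with the two quality columns, here is a minimal sketch of recall@k and MRR under the common one-expected-tool-per-query definitions. This is the textbook formulation, assumed rather than taken from the benchmark's code; the function names and sample rankings are hypothetical.

```python
def recall_at_k(rankings: list[list[str]], expected: list[str], k: int) -> float:
    """Share of queries whose expected tool appears in the top-k results."""
    hits = sum(1 for ranked, want in zip(rankings, expected) if want in ranked[:k])
    return hits / len(expected)


def mean_reciprocal_rank(rankings: list[list[str]], expected: list[str]) -> float:
    """Average of 1/rank of the expected tool; a miss contributes 0."""
    total = 0.0
    for ranked, want in zip(rankings, expected):
        if want in ranked:
            total += 1.0 / (ranked.index(want) + 1)
    return total / len(expected)


rankings = [
    ["search", "fetch", "delete"],  # expected tool at rank 1
    ["fetch", "search", "delete"],  # expected tool at rank 2
    ["fetch", "delete", "list"],    # expected tool missing
    ["list", "fetch", "search"],    # expected tool at rank 3
]
expected = ["search", "search", "search", "search"]

print(recall_at_k(rankings, expected, k=1))  # 0.25
print(recall_at_k(rankings, expected, k=3))  # 0.75
print(round(mean_reciprocal_rank(rankings, expected), 4))  # 0.4583
```

Both metrics depend only on the ranking order, not on timing, which is consistent with them being treated as deterministic while p99 latency is only a rough guide.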
Summary
Coordinated maturation of the benchmark subsystem, the most coherent group from the open backlog (same code area `benchmarks/` + `scripts/`, shared metric-extraction layer, single review surface). All additions are deterministic and offline; new scripts reuse the installed package only (mirroring `benchmarks/smoke_eval.py`).
Closes #369
Closes #418
Closes #491
Closes #554
Closes #687
Closes #688
Changes
- `benchmarks/large_catalog.py` routes 300+ tools across 8 namespaces with near-duplicate distractor variants and destructive (side-effecting) tools; reports recall@1/3/5, MRR, ChoiceCard-vs-naive token reduction, and allow/deny filtering of destructive tools. Writes a latency-free committed scorecard (`benchmarks/large_catalog_scorecard.md`) plus a gitignored JSON. `--check` gates scorecard drift, `--strict` gates regression-guard thresholds. `make benchmark-large-catalog[-check]`.
- `benchmarks/scenario_routing.py` contrasts naive all-tools prompting vs bounded `ChoiceCard` routing across a curated dataset (`benchmarks/scenarios/routing_choicecard.json`); emits `benchmarks/scenario_routing.md`. `make benchmark-scenario[-check]`.
- `scripts/benchmark_gate.py` + `benchmarks/gating.yaml` turn the informational delta into a gating CI job (`benchmark-gate`): recall@k / MRR / precision@k / token-savings regressions beyond their tolerance bands fail CI. Latency never gates; the `benchmark-accepted` label downgrades a failure to a warning.
- `scripts/render_trend.py` captures a deterministic, latency-free metric snapshot per release (`benchmarks/results/history/<version>.json`) and renders `benchmarks/trend.md`. `make trend[-check]`.
- `docs/benchmarks/scaling-matrix.md` documents the 10k-tool methodology, reproducible commands, and result interpretation; wired into mkdocs nav and `docs/benchmarks.md`.
- `.github/workflows/benchmark-scale.yml` runs the routing-scale profiler weekly and uploads its JSON + report as a per-run trend artifact (non-gating).
- Wiring: `ci.yml` `benchmark-gate` job, mkdocs nav, `AGENTS.md` commands, `CHANGELOG.md`. New tests cover all four new modules.
How verified
- `ruff format --check` + `ruff check scripts/ tests/ benchmarks/{large_catalog,scenario_routing}.py`: all checks passed.
- `mypy scripts/benchmark_gate.py scripts/render_trend.py`: no issues.
- `pytest tests/test_benchmark_gate.py tests/test_render_trend.py tests/test_large_catalog_benchmark.py tests/test_scenario_routing.py`: 27 passed.
- The one failure in the full test run, `tests/test_mcp_serve_cli.py::test_serve_dry_run_writes_catalog_diagnostic_event` (`assert result.stderr == ""`), is pre-existing and environmental: this sandbox has no network, so `tiktoken` cannot fetch `cl100k_base` and prints a one-line fallback warning to stderr. It is unrelated to this PR (no `src/` code changed) and passes on CI where the runner has network.
- `make module-size-check`: OK (174 modules).
- `make drift-check`: all 7 existing generated artifacts untouched/up to date. New `--check` targets (`trend`, `large-catalog`, `scenario`) all clean. YAML validated for the new/edited workflow + config files.
Checklist
- `make ci` passes locally: ran fmt/lint/type + full test (one environmental offline-tiktoken failure, see above) + drift-check + module-size-check; `example`/`demo` legs not re-run (no `src/`/example changes)
- `CHANGELOG.md` updated under `## [Unreleased]`
- No `src/` changes; `api/public_api.txt` unaffected (verified via `make drift-check`)
- No changes in `src/`; new `scripts/` modules are ≤ 277 lines. `benchmarks/large_catalog.py` is 335 lines, consistent with the existing un-gated `benchmarks/` scripts (e.g. `benchmark.py`); `benchmarks/` is outside `make module-size-check` scope.
- Docs updated (`AGENTS.md` commands, `docs/benchmarks*`)
Notes for reviewers
- No `src/` changes, which keeps the public-API, schema, and `llms.txt` drift gates untouched.
- The `benchmark-gate` job compares this PR's `--matrix` head against the committed `latest.json` base. The gated quality metrics are deterministic/environment-independent, so on a PR that doesn't move routing/scoring/context they are byte-identical → the gate is green. Token-savings comparison relies on `tiktoken` parity between base and head; CI runners have network so this holds. The `benchmark-accepted` override label is wired in for intentional trade-offs.
- Committed artifacts (`trend.md`, `large_catalog_scorecard.md`, `scenario_routing.md`, `results/history/0.16.0.json`) are deterministic and latency-free; each has a `--check`/test guarding drift. Host-specific latency JSON stays gitignored (matching the repo's existing `benchmarks/results/*.json` convention).
🤖 Generated with Claude Code