
feat(benchmarks): scaling, scenarios, CI quality gate, and release trend #727

Open
dgenio wants to merge 2 commits into main from claude/issue-triage-grouping-34z06s

Conversation

@dgenio (Owner) commented Jun 24, 2026

Summary

Coordinated maturation of the benchmark subsystem — the most coherent group from the open backlog (same code area benchmarks/ + scripts/, shared metric-extraction layer, single review surface). All additions are deterministic and offline; new scripts reuse the installed package only (mirroring benchmarks/smoke_eval.py).

Closes #369
Closes #418
Closes #491
Closes #554
Closes #687
Closes #688

Changes

  • Add large-catalog benchmark for 300+ MCP tools and routing quality regression checks #369 — Large-catalog benchmark. benchmarks/large_catalog.py routes 300+ tools across 8 namespaces with near-duplicate distractor variants and destructive (side-effecting) tools; reports recall@1/3/5, MRR, ChoiceCard-vs-naive token reduction, and allow/deny filtering of destructive tools. Writes a latency-free committed scorecard (benchmarks/large_catalog_scorecard.md) plus a gitignored JSON. --check gates scorecard drift, --strict gates regression-guard thresholds. make benchmark-large-catalog[-check].
  • Add Scenario benchmark: all-tools prompt vs ChoiceCard routing #418 — Scenario benchmark. benchmarks/scenario_routing.py contrasts naive all-tools prompting vs bounded ChoiceCard routing across a curated dataset (benchmarks/scenarios/routing_choicecard.json); emits benchmarks/scenario_routing.md. make benchmark-scenario[-check].
  • Promote benchmark quality deltas to gating CI checks with tolerance bands #491 — Quality-regression gate. scripts/benchmark_gate.py + benchmarks/gating.yaml turn the informational delta into a gating CI job (benchmark-gate): recall@k / MRR / precision@k / token-savings regressions beyond their tolerance bands fail CI. Latency never gates; the benchmark-accepted label downgrades a failure to a warning.
  • Persist benchmark history across releases and publish a trend page #554 — Release trend. scripts/render_trend.py captures a deterministic, latency-free metric snapshot per release (benchmarks/results/history/<version>.json) and renders benchmarks/trend.md. make trend[-check].
  • Publish scaling benchmark matrix to docs #687 — Scaling matrix docs. docs/benchmarks/scaling-matrix.md documents the 10k-tool methodology, reproducible commands, and result interpretation; wired into mkdocs nav and docs/benchmarks.md.
  • CI non-gating routing-scale smoke benchmark #688 — Scheduled routing-scale smoke. .github/workflows/benchmark-scale.yml runs the routing-scale profiler weekly and uploads its JSON + report as a per-run trend artifact (non-gating).
  • Wiring: Makefile targets, ci.yml benchmark-gate job, mkdocs nav, AGENTS.md commands, CHANGELOG.md. New tests cover all four new modules.
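For readers unfamiliar with the ranking metrics named above, here is a minimal sketch of how recall@k and MRR are conventionally computed. It is not taken from benchmarks/large_catalog.py; the function names and the tool names in the example are hypothetical, and each ranked list is assumed to contain no duplicates.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant tools that appear in the top-k ranked results."""
    if not relevant:
        return 0.0
    hits = sum(1 for tool in ranked[:k] if tool in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant tool, or 0.0 if none is retrieved."""
    for position, tool in enumerate(ranked, start=1):
        if tool in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries: Sequence[tuple[Sequence[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Hypothetical query: a near-duplicate distractor outranks the expected tool.
ranked = ["billing.refund", "billing.refund_v2", "billing.charge"]
relevant = {"billing.refund_v2"}
print(recall_at_k(ranked, relevant, 1), recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant))
```

In this example the distractor takes rank 1, so recall@1 is 0 while recall@3 is 1 and the reciprocal rank is 0.5, which is the kind of gap near-duplicate variants are meant to expose.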

How verified

  • ruff format --check + ruff check scripts/ tests/ benchmarks/{large_catalog,scenario_routing}.py — all checks passed.
  • mypy scripts/benchmark_gate.py scripts/render_trend.py — no issues.
  • New tests: pytest tests/test_benchmark_gate.py tests/test_render_trend.py tests/test_large_catalog_benchmark.py tests/test_scenario_routing.py → 27 passed.
  • Full suite: 2863 passed, 33 skipped, 1 xfailed, 1 failed. The single failure — tests/test_mcp_serve_cli.py::test_serve_dry_run_writes_catalog_diagnostic_event (assert result.stderr == "") — is pre-existing and environmental: this sandbox has no network, so tiktoken cannot fetch cl100k_base and prints a one-line fallback warning to stderr. It is unrelated to this PR (no src/ code changed) and passes on CI where the runner has network.
  • make module-size-check — OK (174 modules). make drift-check — all 7 existing generated artifacts untouched/up to date. New --check targets (trend, large-catalog, scenario) all clean. YAML validated for the new/edited workflow + config files.
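The --check targets mentioned above follow a common drift-gate pattern: re-render the artifact deterministically and compare it with the committed file. The sketch below illustrates that pattern only; `render_scorecard` and `check_drift` are hypothetical names, not the functions in this PR.

```python
import tempfile
from pathlib import Path


def render_scorecard(metrics: dict[str, float]) -> str:
    """Render metrics in sorted order with fixed precision so output is stable."""
    lines = ["| metric | value |", "| --- | --- |"]
    for name in sorted(metrics):
        lines.append(f"| {name} | {metrics[name]:.4f} |")
    return "\n".join(lines) + "\n"


def check_drift(rendered: str, committed_path: Path) -> bool:
    """Return True when the committed artifact matches the fresh render."""
    if not committed_path.exists():
        return False
    return committed_path.read_text(encoding="utf-8") == rendered


# Usage: write an artifact, then confirm that a changed metric is detected.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "scorecard.md"
    artifact.write_text(render_scorecard({"mrr": 0.5, "recall@1": 0.4}), encoding="utf-8")
    print(check_drift(render_scorecard({"recall@1": 0.4, "mrr": 0.5}), artifact))  # same content
    print(check_drift(render_scorecard({"recall@1": 0.3, "mrr": 0.5}), artifact))  # drifted
```

Sorting keys and fixing the float precision are what make the comparison meaningful; leaving latency out of the rendered file is what keeps it stable across hosts.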

Checklist

  • Tests added or updated for every new/changed public function
  • make ci passes locally: ran fmt/lint/type + full test (one environmental offline-tiktoken failure, see above) + drift-check + module-size-check; example/demo legs not re-run (no src/ or example changes)
  • CHANGELOG.md updated under ## [Unreleased]
  • Docstrings added for all new public APIs (Google-style)
  • Public-API change? N/A — no src/ changes; api/public_api.txt unaffected (verified via make drift-check)
  • Every modified module stays ≤ 300 lines — N/A for src/; new scripts/ modules are ≤ 277 lines. benchmarks/large_catalog.py is 335 lines, consistent with the existing un-gated benchmarks/ scripts (e.g. benchmark.py); benchmarks/ is outside make module-size-check scope.
  • Related issues linked in the summary above
  • Agent-facing docs updated (AGENTS.md commands, docs/benchmarks*)

Notes for reviewers

  • Scope. This implements the recommended grouping from the issue-triage report; no source/runtime code under src/ changes, which keeps the public-API, schema, and llms.txt drift gates untouched.
  • Gate behaviour (Promote benchmark quality deltas to gating CI checks with tolerance bands #491). The benchmark-gate job compares this PR's --matrix head against the committed latest.json base. The gated quality metrics are deterministic/environment-independent, so on a PR that doesn't move routing/scoring/context they are byte-identical → the gate is green. Token-savings comparison relies on tiktoken parity between base and head; CI runners have network so this holds. The benchmark-accepted override label is wired in for intentional trade-offs.
  • Committed artifacts (trend.md, large_catalog_scorecard.md, scenario_routing.md, results/history/0.16.0.json) are deterministic and latency-free; each has a --check/test guarding drift. Host-specific latency JSON stays gitignored (matching the repo's existing benchmarks/results/*.json convention).
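The gate behaviour described in these notes can be sketched as a tolerance-band comparison. This is an illustration under stated assumptions, not the code in scripts/benchmark_gate.py: `evaluate_gate` and `GateResult` are hypothetical names, metrics are assumed to be fractions in [0, 1], bands are assumed to be in percentage points, and latency is excluded simply by never appearing in the band table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    failures: list[str]
    exit_code: int


def evaluate_gate(
    base: dict[str, float],
    head: dict[str, float],
    bands_pp: dict[str, float],
    override: bool = False,
) -> GateResult:
    """Fail when a gated metric drops by more than its band (percentage points)."""
    failures = []
    for metric, band in sorted(bands_pp.items()):
        if metric not in base or metric not in head:
            continue  # new or skipped cells are not gated in this sketch
        drop_pp = (base[metric] - head[metric]) * 100.0
        if drop_pp > band + 1e-9:  # small epsilon absorbs float rounding
            failures.append(f"{metric}: -{drop_pp:.2f}pp exceeds {band:.2f}pp band")
    # An override (e.g. an accepted-trade-off label) downgrades failure to warning.
    exit_code = 1 if failures and not override else 0
    return GateResult(failures, exit_code)


base = {"recall@k": 0.5649, "mrr": 0.4978}
head = {"recall@k": 0.5449, "mrr": 0.4978}
result = evaluate_gate(base, head, {"recall@k": 1.0, "mrr": 1.0})
print(result.exit_code, result.failures)
```

With identical base and head metrics the drop is zero for every band, so the gate passes; passing `override=True` keeps the failure messages for reporting but returns exit code 0.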

🤖 Generated with Claude Code


Coordinated maturation of the benchmark subsystem (issues #369, #418, #491,
#554, #687, #688). All additions are deterministic and offline; new scripts
reuse the installed package only.

- #369 large-catalog benchmark: benchmarks/large_catalog.py routes 300+ tools
  across 8 namespaces with near-duplicate distractors and destructive tools;
  reports recall@1/3/5, MRR, ChoiceCard-vs-naive token reduction, and deny
  filtering. Writes a latency-free committed scorecard (+ gitignored JSON);
  --check gates drift, --strict gates regression-guard thresholds.
- #418 scenario benchmark: benchmarks/scenario_routing.py contrasts naive
  all-tools prompting vs bounded ChoiceCard routing over a curated dataset,
  emitting a committed comparison report.
- #491 quality gate: scripts/benchmark_gate.py + benchmarks/gating.yaml turn
  the informational delta into a gating CI job; recall@k / MRR / precision@k /
  token-savings regressions beyond their bands fail CI. Latency never gates;
  the benchmark-accepted label downgrades a failure to a warning.
- #554 release trend: scripts/render_trend.py captures a deterministic,
  latency-free snapshot per release (benchmarks/results/history/<v>.json) and
  renders benchmarks/trend.md (make trend / trend-check).
- #687 scaling docs: docs/benchmarks/scaling-matrix.md documents the 10k
  methodology, reproducible commands, and interpretation.
- #688 scheduled smoke: .github/workflows/benchmark-scale.yml runs the
  routing-scale profiler weekly and uploads a per-run trend artifact.

Wiring: Makefile targets, ci.yml benchmark-gate job, mkdocs nav, AGENTS.md
commands, docs/benchmarks.md, CHANGELOG. New tests cover the gate, trend,
large-catalog, and scenario modules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01H79GcrysRhSctnSYJEkuGk
Copilot AI review requested due to automatic review settings June 24, 2026 12:19

Copilot AI left a comment


Pull request overview

This PR significantly expands and hardens the benchmarks/ subsystem by adding new deterministic/offline benchmark scenarios (large catalog + scenario routing), introducing a CI quality-regression gate with tolerance bands, and publishing a release-over-release trend page and scaling documentation. It also wires these benchmarks into docs navigation, Makefile targets, and CI/scheduled workflows to make benchmark drift and regressions visible and enforceable.

Changes:

  • Add new deterministic benchmarks: a 300+ tool “large catalog” quality benchmark and a scenario routing benchmark (naive vs ChoiceCards), with committed markdown artifacts and --check drift gates.
  • Introduce a new benchmark quality CI gate (scripts/benchmark_gate.py + benchmarks/gating.yaml) and a weekly non-gating routing-scale smoke workflow.
  • Add release history snapshots + trend renderer (scripts/render_trend.py → benchmarks/trend.md) and update docs/nav + Makefile/AGENTS/CHANGELOG wiring.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_scenario_routing.py | Determinism and drift-gate tests for the scenario routing benchmark. |
| tests/test_render_trend.py | Unit tests for release snapshot extraction, ordering, deterministic rendering, and committed drift check. |
| tests/test_large_catalog_benchmark.py | Structural/determinism tests for the large-catalog benchmark (reduced size in CI). |
| tests/test_benchmark_gate.py | Coverage of gating semantics (bands, skipped/new cells, override behavior, CLI exit codes). |
| scripts/render_trend.py | New script to snapshot deterministic metrics per release and render benchmarks/trend.md (+ --check). |
| scripts/benchmark_gate.py | New script to fail CI when deterministic quality metrics regress beyond configured bands. |
| mkdocs.yml | Adds benchmarks sub-nav (overview + scaling matrix + routing-scale profile). |
| Makefile | Adds targets for large-catalog, scenario benchmark, and trend render/check. |
| docs/benchmarks/scaling-matrix.md | New scaling methodology page tying together scale-related benchmarks. |
| docs/benchmarks.md | Adds “Scaling and trend” section and links to new benchmark outputs. |
| CHANGELOG.md | Documents the benchmark subsystem maturation and new gates/artifacts. |
| benchmarks/trend.md | New committed, generated trend page (from release history snapshots). |
| benchmarks/scenarios/routing_choicecard.json | New scenario dataset for naive vs ChoiceCard routing comparisons. |
| benchmarks/scenario_routing.py | New deterministic scenario benchmark script + --check drift behavior. |
| benchmarks/scenario_routing.md | New committed, generated scenario report. |
| benchmarks/results/history/0.16.0.json | First per-release deterministic snapshot for trend history. |
| benchmarks/large_catalog.py | New deterministic 300+ tool benchmark with distractors/destructive tools + scorecard + JSON output. |
| benchmarks/large_catalog_scorecard.md | New committed, generated large-catalog scorecard. |
| benchmarks/gating.yaml | New tolerance-band config for quality gating and override label name. |
| AGENTS.md | Documents new benchmark/trend Makefile commands. |
| .github/workflows/ci.yml | Adds the new benchmark-gate CI job (quality-regression enforcement). |
| .github/workflows/benchmark-scale.yml | Adds weekly scheduled non-gating routing-scale smoke benchmark with artifacts. |

Comment thread .github/workflows/ci.yml Outdated
--base benchmarks/results/base.json \
--head benchmarks/results/head.json \
--gating-config benchmarks/gating.yaml $OVERRIDE
edit-mode: replace
Comment thread scripts/benchmark_gate.py
Comment on lines +90 to +110
def load_gating_config(path: Path | None) -> GatingConfig:
    """Load bands from *path*; fall back to :data:`DEFAULT_BANDS` when absent.

    Only the ``quality`` metrics whose band is a positive number are gated;
    a metric set to ``gating: false`` (or omitted) is treated as informational.
    """
    if path is None or not path.exists():
        return GatingConfig(bands=dict(DEFAULT_BANDS))
    import yaml  # lazy: keeps the import off the no-config path

    raw = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
    quality = raw.get("quality", {}) if isinstance(raw, dict) else {}
    bands: dict[str, float] = {}
    for metric, spec in (quality or {}).items():
        if not isinstance(spec, dict):
            continue
        band = spec.get("max_regression_pp")
        if isinstance(band, (int, float)) and band >= 0:
            bands[str(metric)] = float(band)
    override = str(raw.get("override_label", "benchmark-accepted")) if isinstance(raw, dict) else ""
    return GatingConfig(bands=bands or dict(DEFAULT_BANDS), override_label=override)
Comment thread benchmarks/scenario_routing.py Outdated
Comment on lines +167 to +168
"- `reduction` is how much smaller the ChoiceCard prompt is than dumping",
" every tool schema — the headline routing benefit at scale.",
Comment thread docs/benchmarks/scaling-matrix.md Outdated
Comment on lines +53 to +56
- **Token reduction is the headline benefit.** Bounded `ChoiceCard`s shrink
the routing prompt by ~95–97% versus exposing every tool schema — the gap
widens as the catalog grows, which is exactly when naive all-tools prompting
becomes untenable.
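To make the token-reduction figure concrete, the sketch below computes reduction as 1 minus the ratio of bounded to naive prompt size. It is an illustration only: the naive baseline is modelled as every tool's name plus description, a whitespace word count stands in for a real tokenizer such as tiktoken, and `token_reduction` and the synthetic catalog are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def render(entries: list[dict[str, str]]) -> str:
    """One 'name: description' line per tool."""
    return "\n".join(f"{t['name']}: {t['description']}" for t in entries)


def token_reduction(tools: list[dict[str, str]], shortlisted: set[str]) -> float:
    """Return 1 - bounded/naive, where naive lists every tool in the catalog."""
    naive = count_tokens(render(tools))
    bounded = count_tokens(render([t for t in tools if t["name"] in shortlisted]))
    return 1.0 - bounded / naive if naive else 0.0


# Synthetic catalog: 100 tools, of which 5 are shortlisted for the prompt.
catalog = [
    {"name": f"tool_{i}", "description": f"does thing number {i}"} for i in range(100)
]
shortlist = {f"tool_{i}" for i in range(5)}
print(f"{token_reduction(catalog, shortlist):.2%}")
```

Because the bounded prompt stays roughly constant while the naive prompt grows with the catalog, the ratio shrinks as more tools are added, which is why the reduction widens at scale.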
…ine wording

- ci.yml: remove a stray `edit-mode: replace` line that had been orphaned
  inside the benchmark-gate job's bash `run` block (would have executed as a
  command and failed the job); restore it as the intended input of the
  benchmark-comment sticky-comment step.
- benchmark_gate.py: clarify load_gating_config docstring — bands are gated when
  non-negative (0pp means "no regression tolerated"), matching the `>= 0` check.
- scenario_routing.py + scaling-matrix.md: the naive baseline is each tool's
  name + description, not full JSON schemas; reword the report/docs to match
  what is measured (and regenerate scenario_routing.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01H79GcrysRhSctnSYJEkuGk
@github-actions


Benchmark delta (vs main)

Soft regression feedback only — this comment never blocks the PR.
Latency budget: ⚠️ when head > base × 1.3. Accuracy budget: ⚠️ when head < base - 1pp.
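The two budget rules above can be written out as a small sketch. The function names are hypothetical, and it assumes accuracy metrics are fractions in [0, 1], so that 1pp corresponds to 0.01; the bot's actual implementation may differ.

```python
def latency_marker(head_ms: float, base_ms: float, factor: float = 1.3) -> str:
    """Warn when head latency exceeds base latency by more than the factor."""
    return "⚠️" if head_ms > base_ms * factor else "✅"


def accuracy_marker(head: float, base: float, budget_pp: float = 1.0) -> str:
    """Warn when head accuracy falls more than budget_pp points below base."""
    return "⚠️" if head < base - budget_pp / 100.0 else "✅"


# A p99 of 101.199 ms against a base of 98.277 ms is slower, but within the 1.3x budget.
print(latency_marker(101.199, 98.277))
# An unchanged recall@k stays within the 1pp accuracy budget.
print(accuracy_marker(0.5649, 0.5649))
```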

Routing summary (single backend × catalog sizes)

| size | recall@k (head Δ vs base) | MRR (head Δ vs base) | p99 (ms) |
| --- | --- | --- | --- |
| 50 | ✅ 0.5649 (+0.0000) | ✅ 0.4978 (+0.0000) | ✅ 0.546 (base 0.759) |
| 83 | ✅ 0.3825 (+0.0000) | ✅ 0.3242 (+0.0000) | ✅ 0.738 (base 1.134) |
| 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 39.174 (base 41.711) |

Per-backend × per-size matrix

| backend | size | recall@k (Δ) | MRR (Δ) | p99 (ms) |
| --- | --- | --- | --- | --- |
| bm25 | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3399 (+0.0000) | ✅ 6.510 (base 8.140) |
| bm25 | 500 | ✅ 0.2250 (+0.0000) | ✅ 0.2165 (+0.0000) | ✅ 29.479 (base 38.989) |
| bm25 | 1000 | ✅ 0.1575 (+0.0000) | ✅ 0.1525 (+0.0000) | ✅ 86.777 (base 111.716) |
| embedding_hashing | 100 | ✅ 0.5175 (+0.0000) | ✅ 0.4360 (+0.0000) | ✅ 7.567 (base 7.225) |
| embedding_hashing | 500 | ✅ 0.2700 (+0.0000) | ✅ 0.2674 (+0.0000) | ✅ 41.610 (base 44.182) |
| embedding_hashing | 1000 | ✅ 0.2000 (+0.0000) | ✅ 0.1931 (+0.0000) | ✅ 101.199 (base 98.277) |
| embedding_st | 100 | skipped (skipped: missing sentence-transformers) | | |
| embedding_st | 500 | skipped (skipped: missing sentence-transformers) | | |
| embedding_st | 1000 | skipped (skipped: missing sentence-transformers) | | |
| fuzzy | 100 | skipped (skipped: missing rapidfuzz) | | |
| fuzzy | 500 | skipped (skipped: missing rapidfuzz) | | |
| fuzzy | 1000 | skipped (skipped: missing rapidfuzz) | | |
| tfidf | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3220 (+0.0000) | ✅ 1.054 (base 1.102) |
| tfidf | 500 | ✅ 0.2325 (+0.0000) | ✅ 0.2314 (+0.0000) | ✅ 10.083 (base 11.492) |
| tfidf | 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 36.813 (base 50.755) |

Context pipeline (per scenario)

| scenario | tokens | dropped | dedup |
| --- | --- | --- | --- |
| large_catalog | 1480 (base 1514, Δ-34) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| long_conversation | 2500 (base 2548, Δ-48) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| mixed_payload | 488 (base 497, Δ-9) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| short_conversation | 487 (base 496, Δ-9) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| stress_conversation | 6590 (base 6651, Δ-61) | 11 (base 7, Δ+4) | 4 (base 4, Δ+0) |
| tiny_payload | 256 (base 267, Δ-11) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |

Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.

