feat(benchmarks): scaling, scenarios, CI quality gate, and release trend by dgenio · Pull Request #727 · dgenio/contextweaver

dgenio · 2026-06-24T12:19:06Z

Summary

Coordinated maturation of the benchmark subsystem — the most coherent group from the open backlog (same code area benchmarks/ + scripts/, shared metric-extraction layer, single review surface). All additions are deterministic and offline; new scripts reuse the installed package only (mirroring benchmarks/smoke_eval.py).

Closes #369
Closes #418
Closes #491
Closes #554
Closes #687
Closes #688

Changes

#369 (Add large-catalog benchmark for 300+ MCP tools and routing quality regression checks): Large-catalog benchmark. benchmarks/large_catalog.py routes 300+ tools across 8 namespaces with near-duplicate distractor variants and destructive (side-effecting) tools; it reports recall@1/3/5, MRR, ChoiceCard-vs-naive token reduction, and allow/deny filtering of destructive tools. It writes a latency-free committed scorecard (benchmarks/large_catalog_scorecard.md) plus a gitignored JSON. --check gates scorecard drift; --strict gates regression-guard thresholds. Targets: make benchmark-large-catalog[-check].
#418 (Add Scenario benchmark: all-tools prompt vs ChoiceCard routing): Scenario benchmark. benchmarks/scenario_routing.py contrasts naive all-tools prompting with bounded ChoiceCard routing across a curated dataset (benchmarks/scenarios/routing_choicecard.json) and emits benchmarks/scenario_routing.md. Targets: make benchmark-scenario[-check].
#491 (Promote benchmark quality deltas to gating CI checks with tolerance bands): Quality-regression gate. scripts/benchmark_gate.py + benchmarks/gating.yaml turn the informational delta into a gating CI job (benchmark-gate): recall@k / MRR / precision@k / token-savings regressions beyond their tolerance bands fail CI. Latency never gates; the benchmark-accepted label downgrades a failure to a warning.
#554 (Persist benchmark history across releases and publish a trend page): Release trend. scripts/render_trend.py captures a deterministic, latency-free metric snapshot per release (benchmarks/results/history/<version>.json) and renders benchmarks/trend.md. Targets: make trend[-check].
#687 (Publish scaling benchmark matrix to docs): Scaling matrix docs. docs/benchmarks/scaling-matrix.md documents the 10k-tool methodology, reproducible commands, and result interpretation; it is wired into the mkdocs nav and docs/benchmarks.md.
#688 (CI non-gating routing-scale smoke benchmark): Scheduled routing-scale smoke. .github/workflows/benchmark-scale.yml runs the routing-scale profiler weekly and uploads its JSON + report as a per-run trend artifact (non-gating).
Wiring: Makefile targets, ci.yml benchmark-gate job, mkdocs nav, AGENTS.md commands, CHANGELOG.md. New tests cover all four new modules.
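
For readers less familiar with the ranking metrics named in the list above, here is a minimal sketch of recall@k and MRR using their standard textbook definitions. It is an illustration only: the function names and tool ids are hypothetical, and benchmarks/large_catalog.py may compute these differently.

```python
from collections.abc import Sequence


def recall_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant tool ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for tool_id in ranked[:k] if tool_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is ranked."""
    for position, tool_id in enumerate(ranked, start=1):
        if tool_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Sequence[tuple[Sequence[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical query: two near-duplicate distractors outrank the intended tool.
ranked_tools = ["billing.refund", "billing.refund_v2", "crm.lookup"]
expected = {"crm.lookup"}
recall_1 = recall_at_k(ranked_tools, expected, k=1)
recall_3 = recall_at_k(ranked_tools, expected, k=3)
```

Near-duplicate distractors push the intended tool down the ranking, which is why recall@1 and recall@3 can diverge sharply on a large catalog.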

How verified

ruff format --check + ruff check scripts/ tests/ benchmarks/{large_catalog,scenario_routing}.py — all checks passed.
mypy scripts/benchmark_gate.py scripts/render_trend.py — no issues.
New tests: pytest tests/test_benchmark_gate.py tests/test_render_trend.py tests/test_large_catalog_benchmark.py tests/test_scenario_routing.py — 27 passed.
Full suite: 2863 passed, 33 skipped, 1 xfailed, 1 failed. The single failure — tests/test_mcp_serve_cli.py::test_serve_dry_run_writes_catalog_diagnostic_event (assert result.stderr == "") — is pre-existing and environmental: this sandbox has no network, so tiktoken cannot fetch cl100k_base and prints a one-line fallback warning to stderr. It is unrelated to this PR (no src/ code changed) and passes on CI where the runner has network.
make module-size-check — OK (174 modules). make drift-check — all 7 existing generated artifacts untouched/up to date. New --check targets (trend, large-catalog, scenario) all clean. YAML validated for the new/edited workflow + config files.
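
The --check targets mentioned above follow the usual "re-render and compare" drift-gate pattern. A minimal sketch of that pattern, assuming the artifact is rendered to a string deterministically; check_drift and its messages are hypothetical and are not taken from the scripts in this PR:

```python
import sys
from pathlib import Path


def check_drift(rendered: str, committed_path: Path) -> int:
    """Return 0 when the committed artifact matches a fresh render, else 1.

    The return value is meant to be used as a process exit code, so a CI
    step fails whenever the committed file is missing or stale.
    """
    if not committed_path.exists():
        print(f"{committed_path} is missing; regenerate it", file=sys.stderr)
        return 1
    if committed_path.read_text(encoding="utf-8") != rendered:
        print(f"{committed_path} is stale; regenerate it", file=sys.stderr)
        return 1
    return 0
```

This only works when the rendered output is free of host-specific values such as latency, which matches the PR's choice to keep committed artifacts latency-free.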

Checklist

Tests added or updated for every new/changed public function
make ci passes locally: ran fmt/lint/type + full test (one environmental offline-tiktoken failure, see above) + drift-check + module-size-check; example/demo legs not re-run (no src/ or example changes)
CHANGELOG.md updated under ## [Unreleased]
Docstrings added for all new public APIs (Google-style)
Public-API change? N/A — no src/ changes; api/public_api.txt unaffected (verified via make drift-check)
Every modified module stays ≤ 300 lines — N/A for src/; new scripts/ modules are ≤ 277 lines. benchmarks/large_catalog.py is 335 lines, consistent with the existing un-gated benchmarks/ scripts (e.g. benchmark.py); benchmarks/ is outside make module-size-check scope.
Related issues linked in the summary above
Agent-facing docs updated (AGENTS.md commands, docs/benchmarks*)

Notes for reviewers

Scope. This implements the recommended grouping from the issue-triage report; no source/runtime code under src/ changes, which keeps the public-API, schema, and llms.txt drift gates untouched.
Gate behaviour (#491, Promote benchmark quality deltas to gating CI checks with tolerance bands). The benchmark-gate job compares this PR's --matrix head against the committed latest.json base. The gated quality metrics are deterministic and environment-independent, so on a PR that doesn't touch routing, scoring, or context they are byte-identical and the gate is green. The token-savings comparison relies on tiktoken parity between base and head; CI runners have network access, so this holds. The benchmark-accepted override label is wired in for intentional trade-offs.
Committed artifacts (trend.md, large_catalog_scorecard.md, scenario_routing.md, results/history/0.16.0.json) are deterministic and latency-free; each has a --check/test guarding drift. Host-specific latency JSON stays gitignored (matching the repo's existing benchmarks/results/*.json convention).
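
To make the gate semantics described in these notes concrete, here is a minimal sketch of a tolerance-band comparison. It assumes metrics are stored as fractions in [0, 1] and bands are expressed in percentage points; evaluate_gate and the metric names are hypothetical and do not reproduce scripts/benchmark_gate.py.

```python
def evaluate_gate(
    base: dict[str, float],
    head: dict[str, float],
    bands_pp: dict[str, float],
    override: bool = False,
) -> tuple[str, list[str]]:
    """Compare head against base and return (status, failure messages).

    Only metrics listed in ``bands_pp`` can fail the gate, so latency is
    informational simply by not having a band. Metrics missing on either
    side (new or skipped cells) are ignored.
    """
    failures: list[str] = []
    for metric, band in bands_pp.items():
        if metric not in base or metric not in head:
            continue
        regression_pp = (base[metric] - head[metric]) * 100.0
        if regression_pp > band:
            failures.append(f"{metric}: -{regression_pp:.2f}pp exceeds {band:.2f}pp band")
    if not failures:
        return "pass", failures
    # The override (e.g. an accepted-trade-off label) downgrades fail to warn.
    return ("warn" if override else "fail"), failures


bands = {"recall_at_k": 1.0, "mrr": 0.0}  # hypothetical bands
base_metrics = {"recall_at_k": 0.50, "mrr": 0.40, "p99_ms": 40.0}
head_metrics = {"recall_at_k": 0.47, "mrr": 0.40, "p99_ms": 90.0}
status, messages = evaluate_gate(base_metrics, head_metrics, bands)
```

In this example recall@k drops by 3pp against a 1pp band, so the gate fails; the p99 latency more than doubles but has no band and is ignored. Passing override=True would report the same failure as a warning.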

🤖 Generated with Claude Code

Coordinated maturation of the benchmark subsystem (issues #369, #418, #491, #554, #687, #688). All additions are deterministic and offline; new scripts reuse the installed package only.

- #369 large-catalog benchmark: benchmarks/large_catalog.py routes 300+ tools across 8 namespaces with near-duplicate distractors and destructive tools; reports recall@1/3/5, MRR, ChoiceCard-vs-naive token reduction, and deny filtering. Writes a latency-free committed scorecard (+ gitignored JSON); --check gates drift, --strict gates regression-guard thresholds.
- #418 scenario benchmark: benchmarks/scenario_routing.py contrasts naive all-tools prompting vs bounded ChoiceCard routing over a curated dataset, emitting a committed comparison report.
- #491 quality gate: scripts/benchmark_gate.py + benchmarks/gating.yaml turn the informational delta into a gating CI job; recall@k / MRR / precision@k / token-savings regressions beyond their bands fail CI. Latency never gates; the benchmark-accepted label downgrades a failure to a warning.
- #554 release trend: scripts/render_trend.py captures a deterministic, latency-free snapshot per release (benchmarks/results/history/<v>.json) and renders benchmarks/trend.md (make trend / trend-check).
- #687 scaling docs: docs/benchmarks/scaling-matrix.md documents the 10k methodology, reproducible commands, and interpretation.
- #688 scheduled smoke: .github/workflows/benchmark-scale.yml runs the routing-scale profiler weekly and uploads a per-run trend artifact.

Wiring: Makefile targets, ci.yml benchmark-gate job, mkdocs nav, AGENTS.md commands, docs/benchmarks.md, CHANGELOG. New tests cover the gate, trend, large-catalog, and scenario modules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01H79GcrysRhSctnSYJEkuGk

Copilot

Pull request overview

This PR significantly expands and hardens the benchmarks/ subsystem by adding new deterministic/offline benchmark scenarios (large catalog + scenario routing), introducing a CI quality-regression gate with tolerance bands, and publishing a release-over-release trend page and scaling documentation. It also wires these benchmarks into docs navigation, Makefile targets, and CI/scheduled workflows to make benchmark drift and regressions visible and enforceable.

Changes:

Add new deterministic benchmarks: a 300+ tool “large catalog” quality benchmark and a scenario routing benchmark (naive vs ChoiceCards), with committed markdown artifacts and --check drift gates.
Introduce a new benchmark quality CI gate (scripts/benchmark_gate.py + benchmarks/gating.yaml) and a weekly non-gating routing-scale smoke workflow.
Add release history snapshots + trend renderer (scripts/render_trend.py → benchmarks/trend.md) and update docs/nav + Makefile/AGENTS/CHANGELOG wiring.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`tests/test_scenario_routing.py`	Determinism and drift-gate tests for the scenario routing benchmark.
`tests/test_render_trend.py`	Unit tests for release snapshot extraction, ordering, deterministic rendering, and committed drift check.
`tests/test_large_catalog_benchmark.py`	Structural/determinism tests for the large-catalog benchmark (reduced size in CI).
`tests/test_benchmark_gate.py`	Coverage of gating semantics (bands, skipped/new cells, override behavior, CLI exit codes).
`scripts/render_trend.py`	New script to snapshot deterministic metrics per release and render `benchmarks/trend.md` (+ `--check`).
`scripts/benchmark_gate.py`	New script to fail CI when deterministic quality metrics regress beyond configured bands.
`mkdocs.yml`	Adds benchmarks sub-nav (overview + scaling matrix + routing-scale profile).
`Makefile`	Adds targets for large-catalog, scenario benchmark, and trend render/check.
`docs/benchmarks/scaling-matrix.md`	New scaling methodology page tying together scale-related benchmarks.
`docs/benchmarks.md`	Adds “Scaling and trend” section and links to new benchmark outputs.
`CHANGELOG.md`	Documents the benchmark subsystem maturation and new gates/artifacts.
`benchmarks/trend.md`	New committed, generated trend page (from release history snapshots).
`benchmarks/scenarios/routing_choicecard.json`	New scenario dataset for naive vs ChoiceCard routing comparisons.
`benchmarks/scenario_routing.py`	New deterministic scenario benchmark script + `--check` drift behavior.
`benchmarks/scenario_routing.md`	New committed, generated scenario report.
`benchmarks/results/history/0.16.0.json`	First per-release deterministic snapshot for trend history.
`benchmarks/large_catalog.py`	New deterministic 300+ tool benchmark with distractors/destructive tools + scorecard + JSON output.
`benchmarks/large_catalog_scorecard.md`	New committed, generated large-catalog scorecard.
`benchmarks/gating.yaml`	New tolerance-band config for quality gating and override label name.
`AGENTS.md`	Documents new benchmark/trend Makefile commands.
`.github/workflows/ci.yml`	Adds the new `benchmark-gate` CI job (quality-regression enforcement).
`.github/workflows/benchmark-scale.yml`	Adds weekly scheduled non-gating routing-scale smoke benchmark with artifacts.

+            --base benchmarks/results/base.json \
+            --head benchmarks/results/head.json \
+            --gating-config benchmarks/gating.yaml $OVERRIDE
          edit-mode: replace


+def load_gating_config(path: Path | None) -> GatingConfig:
+    """Load bands from *path*; fall back to :data:`DEFAULT_BANDS` when absent.
+
+    Only the ``quality`` metrics whose band is a positive number are gated;
+    a metric set to ``gating: false`` (or omitted) is treated as informational.
+    """
+    if path is None or not path.exists():
+        return GatingConfig(bands=dict(DEFAULT_BANDS))
+    import yaml  # lazy: keeps the import off the no-config path
+
+    raw = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
+    quality = raw.get("quality", {}) if isinstance(raw, dict) else {}
+    bands: dict[str, float] = {}
+    for metric, spec in (quality or {}).items():
+        if not isinstance(spec, dict):
+            continue
+        band = spec.get("max_regression_pp")
+        if isinstance(band, (int, float)) and band >= 0:
+            bands[str(metric)] = float(band)
+    override = str(raw.get("override_label", "benchmark-accepted")) if isinstance(raw, dict) else ""
+    return GatingConfig(bands=bands or dict(DEFAULT_BANDS), override_label=override)
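
To see what the band-extraction loop in this hunk accepts, the sketch below mirrors it on an already-parsed mapping so it runs without PyYAML. The metric names and DEFAULT_BANDS values are hypothetical placeholders; only the key names (quality, max_regression_pp, override_label) come from the hunk.

```python
DEFAULT_BANDS = {"recall_at_k": 1.0, "mrr": 1.0}  # hypothetical defaults


def extract_bands(raw: object) -> dict[str, float]:
    """Mirror of the hunk's loop: keep metrics with a non-negative numeric band."""
    quality = raw.get("quality", {}) if isinstance(raw, dict) else {}
    bands: dict[str, float] = {}
    for metric, spec in (quality or {}).items():
        if not isinstance(spec, dict):
            continue
        band = spec.get("max_regression_pp")
        if isinstance(band, (int, float)) and band >= 0:
            bands[str(metric)] = float(band)
    # An empty result falls back to the defaults, as in the hunk.
    return bands or dict(DEFAULT_BANDS)


# What yaml.safe_load might return for a small gating file (hypothetical).
parsed = {
    "quality": {
        "recall_at_k": {"max_regression_pp": 0},      # gated: no regression tolerated
        "mrr": {"max_regression_pp": 1.5},            # gated with a 1.5pp band
        "precision_at_k": {"max_regression_pp": -1},  # negative band: ignored
        "latency_p99": {"gating": False},             # no band: informational
    },
    "override_label": "benchmark-accepted",
}
bands = extract_bands(parsed)
```

Two details are visible from the hunk as shown: a band of 0 is kept (so `>= 0` means "gated when non-negative"), and only max_regression_pp is inspected, so an entry carrying both `gating: false` and a numeric band would still be gated by this loop. Since bool is a subclass of int in Python, a value of `true` would also pass the isinstance check and become a 1.0 band.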


+            "- `reduction` is how much smaller the ChoiceCard prompt is than dumping",
+            "  every tool schema — the headline routing benefit at scale.",


+- **Token reduction is the headline benefit.** Bounded `ChoiceCard`s shrink
+  the routing prompt by ~95–97% versus exposing every tool schema — the gap
+  widens as the catalog grows, which is exactly when naive all-tools prompting
+  becomes untenable.
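
As a rough illustration of how a token-reduction figure like the one quoted in this hunk can be derived, the sketch below compares a naive prompt listing every tool's name and description with a bounded prompt listing only a few candidates. Whitespace splitting stands in for a real tokenizer such as tiktoken, and the catalog is synthetic, so the numbers are illustrative only.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def token_reduction(naive_prompt: str, bounded_prompt: str) -> float:
    """Fraction by which the bounded prompt is smaller than the naive one."""
    naive = count_tokens(naive_prompt)
    if naive == 0:
        return 0.0
    return 1.0 - count_tokens(bounded_prompt) / naive


# Synthetic catalog: 100 hypothetical tools, each rendered as "name: description".
catalog = [f"tool_{i}: does thing {i}" for i in range(100)]
naive_prompt = "\n".join(catalog)        # expose every tool
bounded_prompt = "\n".join(catalog[:3])  # expose only 3 shortlisted tools
reduction = token_reduction(naive_prompt, bounded_prompt)
```

Because the bounded prompt stays a fixed size while the naive prompt grows with the catalog, the reduction approaches 100% as the catalog gets larger.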


…ine wording

- ci.yml: remove a stray `edit-mode: replace` line that had been orphaned inside the benchmark-gate job's bash `run` block (would have executed as a command and failed the job); restore it as the intended input of the benchmark-comment sticky-comment step.
- benchmark_gate.py: clarify load_gating_config docstring: bands are gated when non-negative (0pp means "no regression tolerated"), matching the `>= 0` check.
- scenario_routing.py + scaling-matrix.md: the naive baseline is each tool's name + description, not full JSON schemas; reword the report/docs to match what is measured (and regenerate scenario_routing.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01H79GcrysRhSctnSYJEkuGk

github-actions · 2026-06-24T12:33:51Z

Benchmark delta (vs `main`)

Soft regression feedback only — this comment never blocks the PR.
Latency budget: ⚠️ when head > base × 1.3. Accuracy budget: ⚠️ when head < base - 1pp.
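
The two soft budgets stated above can be read as the following rule. This is a sketch of the stated thresholds only, assuming accuracy metrics are fractions so that 1pp equals 0.01; the function names are hypothetical and not taken from the bot's implementation.

```python
def latency_marker(head_ms: float, base_ms: float, factor: float = 1.3) -> str:
    """Warn when head latency exceeds base latency by more than the factor."""
    return "⚠️" if head_ms > base_ms * factor else "✅"


def accuracy_marker(head: float, base: float, tolerance_pp: float = 1.0) -> str:
    """Warn when head accuracy falls more than tolerance_pp points below base."""
    return "⚠️" if head < base - tolerance_pp / 100.0 else "✅"
```

For example, a p99 of 101.199 ms against a base of 98.277 ms stays within the 1.3× budget, so it is marked ✅ even though head is slower than base.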

Routing summary (single backend × catalog sizes)

size	recall@k (head Δ vs base)	MRR (head Δ vs base)	p99 (ms)
50	✅ 0.5649 (+0.0000)	✅ 0.4978 (+0.0000)	✅ 0.546 (base 0.759)
83	✅ 0.3825 (+0.0000)	✅ 0.3242 (+0.0000)	✅ 0.738 (base 1.134)
1000	✅ 0.1475 (+0.0000)	✅ 0.1456 (+0.0000)	✅ 39.174 (base 41.711)

Per-backend × per-size matrix

backend	size	recall@k (Δ)	MRR (Δ)	p99 (ms)
bm25	100	✅ 0.3825 (+0.0000)	✅ 0.3399 (+0.0000)	✅ 6.510 (base 8.140)
bm25	500	✅ 0.2250 (+0.0000)	✅ 0.2165 (+0.0000)	✅ 29.479 (base 38.989)
bm25	1000	✅ 0.1575 (+0.0000)	✅ 0.1525 (+0.0000)	✅ 86.777 (base 111.716)
embedding_hashing	100	✅ 0.5175 (+0.0000)	✅ 0.4360 (+0.0000)	✅ 7.567 (base 7.225)
embedding_hashing	500	✅ 0.2700 (+0.0000)	✅ 0.2674 (+0.0000)	✅ 41.610 (base 44.182)
embedding_hashing	1000	✅ 0.2000 (+0.0000)	✅ 0.1931 (+0.0000)	✅ 101.199 (base 98.277)
embedding_st	100	skipped (skipped: missing sentence-transformers)	—	—
embedding_st	500	skipped (skipped: missing sentence-transformers)	—	—
embedding_st	1000	skipped (skipped: missing sentence-transformers)	—	—
fuzzy	100	skipped (skipped: missing rapidfuzz)	—	—
fuzzy	500	skipped (skipped: missing rapidfuzz)	—	—
fuzzy	1000	skipped (skipped: missing rapidfuzz)	—	—
tfidf	100	✅ 0.3825 (+0.0000)	✅ 0.3220 (+0.0000)	✅ 1.054 (base 1.102)
tfidf	500	✅ 0.2325 (+0.0000)	✅ 0.2314 (+0.0000)	✅ 10.083 (base 11.492)
tfidf	1000	✅ 0.1475 (+0.0000)	✅ 0.1456 (+0.0000)	✅ 36.813 (base 50.755)

Context pipeline (per scenario)

scenario	tokens	dropped	dedup
large_catalog	1480 (base 1514, Δ-34)	0 (base 0, Δ+0)	0 (base 0, Δ+0)
long_conversation	2500 (base 2548, Δ-48)	0 (base 0, Δ+0)	0 (base 0, Δ+0)
mixed_payload	488 (base 497, Δ-9)	0 (base 0, Δ+0)	0 (base 0, Δ+0)
short_conversation	487 (base 496, Δ-9)	0 (base 0, Δ+0)	0 (base 0, Δ+0)
stress_conversation	6590 (base 6651, Δ-61)	11 (base 7, Δ+4)	4 (base 4, Δ+0)
tiny_payload	256 (base 267, Δ-11)	0 (base 0, Δ+0)	0 (base 0, Δ+0)

Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.

Copilot AI review requested due to automatic review settings June 24, 2026 12:19

Copilot started reviewing on behalf of dgenio June 24, 2026 12:19

Copilot AI reviewed Jun 24, 2026

dgenio wants to merge 2 commits into main from claude/issue-triage-grouping-34z06s
