methods/: peer-review-ready backing for benchmark/ (formalization + CIs + preregistration + repro) by g-shevchenko · Pull Request #56 · g-shevchenko/mcp-token-savers

g-shevchenko · 2026-05-24T18:10:04Z

Summary

Adds a new top-level methods/ directory with the peer-review-ready statistical and methodological framework backing the benchmark/ harness. If benchmark/ is the measurement primitive, methods/ is the defensibility layer.

| File | Purpose |
| --- | --- |
| `formalization.md` | Formal definitions of `B`, `K`, `Q`, `C_prod` + cost-model theorem + Cook-Campbell threats to validity |
| `statistical_analysis.md` | Real sophon@0.5.4 measurements on the public 15-fixture corpus with proper CIs |
| `preregistration_n60.md` | OSF-style frozen protocol for the next-round N≥60 expansion |
| `datasheet.md` | Gebru et al. corpus disclosure |
| `provenance.md` | Pinned env + seeds for bytes-identical reproduction |
| `Dockerfile` + `repro-entrypoint.sh` | Turnkey reproduction container |
| `lib/statistical_analysis.py` | Pure-stdlib library: Wilson CI, cluster-bootstrap, two-way random-effects ANOVA, Welch t, Cohen's d, Holm-Bonferroni |
| `lib/run_statistical_analysis.py` | CLI consuming c2_bench JSON outputs |
| `lib/tests/test_statistical_analysis.py` | 16 unit tests, all pass; Wilson boundary behaviour pinned with textbook reference values |

The headline finding the CI-driven re-analysis surfaced

Sophon@0.5.4 on `long_realistic_v3` (N=5 runs × M=15 fixtures):

| Metric | Point estimate | 95% CI | Source |
| --- | --- | --- | --- |
| Mean byte-saving | 0.828 | [0.721, 0.900] | cluster-bootstrap |
| Cache-friendly κ | 1.000 (15/15) | [0.796, 1.000] | Wilson |
| Pass rate at `B >= 0.30` bar | 14/15 = 0.933 | [0.701, 0.988] | Wilson |

The Wilson lower bound on κ is 79.6%, not 100%. The point estimate "100% cache-friendly" is a sample statistic: with only 15 fixtures, the 95% Wilson interval supports a population proportion of at least 79.6% and nothing stronger. Anyone publishing a "100% cache-friendly" claim without that caveat is over-claiming on small N. This is the kind of disclosure the methods directory exists to make easy.
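To make the 79.6% figure concrete, here is a minimal stdlib sketch of the Wilson score interval. The function name `wilson_interval` and its signature are illustrative assumptions, not the API of `lib/statistical_analysis.py`; only the formula itself is the standard one.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (hypothetical helper).

    z defaults to the two-sided 95% normal quantile.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    for k in (15, 14, 0):
        lo, hi = wilson_interval(k, 15)
        print(f"{k}/15 -> [{lo:.4f}, {hi:.4f}]")
```

At the boundary k = n the lower bound reduces to 1 / (1 + z²/n), which is why a perfect 15/15 still leaves roughly a fifth of the unit interval uncovered.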

Test plan

`cd methods/lib && python3 -m unittest tests.test_statistical_analysis` -> 16/16 PASS
Wilson 15/15 -> [0.7961, 1.0] (matches Wilson 1927 reference)
Cluster-bootstrap reproducible with seed=42
Moat-guard preflight: 0 banned markers, 0 measured-internal-ratio leaks
Identity: Gregory Shevchenko g@humanswith.ai, no Claude co-author
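
The seeded cluster-bootstrap check in the test plan can be illustrated with a small stdlib sketch. It assumes that fixtures are the clusters, that whole fixtures are resampled with replacement, and that a percentile interval is taken over the resampled means; the PR text does not spell out these details, and `cluster_bootstrap_ci` is a hypothetical name rather than the library's function.

```python
import random
from statistics import fmean


def cluster_bootstrap_ci(clusters, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile CI for the pooled mean, resampling whole clusters (hypothetical helper).

    clusters: list of lists, one inner list of run-level values per fixture.
    """
    if not clusters or any(len(c) == 0 for c in clusters):
        raise ValueError("need at least one non-empty cluster")
    rng = random.Random(seed)  # private RNG, so the result depends only on the seed
    k = len(clusters)
    means = []
    for _ in range(n_resamples):
        resampled = rng.choices(clusters, k=k)  # draw fixtures with replacement
        means.append(fmean(x for cluster in resampled for x in cluster))
    means.sort()
    lo_idx = max(0, int(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Made-up byte-saving values: 3 fixtures x 2 runs, for illustration only.
    demo = [[0.90, 0.91], [0.75, 0.74], [0.28, 0.30]]
    print(cluster_bootstrap_ci(demo))
    print(cluster_bootstrap_ci(demo))  # identical: same seed, same resamples
```

Resampling fixtures rather than individual runs keeps the runs of one fixture together, so the interval reflects between-fixture variation instead of treating repeated runs as independent observations.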

The benchmark/ directory ships a working harness with point-estimate results: enough for a smoke test, not enough to defend a published claim. This directory adds the statistical and methodological framework that converts "I measured a number" to "I measured a number with a defensible confidence bound":

`formalization.md`: B/K/Q/C_prod formal definitions + cost-model theorem (derives the exact condition under which a high-byte-saving / low-cache-stable compressor loses to a moderate-saving / high-cache-stable one in steady state) + Cook-Campbell threats to validity.
`statistical_analysis.md`: Real sophon@0.5.4 measurements on the public 15-fixture corpus with proper CIs: B̄=0.828, cluster-bootstrap CI [0.721, 0.900]; κ=1.000 (15/15), Wilson CI [0.796, 1.000]. The Wilson lower bound of 79.6% is the lesson: "100% cache-friendly" is a sample statistic, not a population estimate at N=15.
`preregistration_n60.md`: OSF-style frozen protocol for the next-round expansion (genre-stratified N≥60). Converts future analyses from exploratory to confirmatory.
`datasheet.md`: Gebru et al. format corpus disclosure for long_realistic_v3.jsonl.
`provenance.md`: Pinned env + RNG seeds for bytes-identical reproduction.
`Dockerfile` + `repro-entrypoint.sh`: Turnkey reproduction container.
`lib/statistical_analysis.py`: Pure-stdlib library: Wilson CI, cluster-bootstrap, two-way random-effects ANOVA, Welch t, Cohen's d, Holm-Bonferroni.
`lib/run_statistical_analysis.py`: CLI consuming c2_bench JSON outputs.
`lib/tests/test_statistical_analysis.py`: 16 tests, all pass. Wilson boundary behaviour pinned against textbook reference values (15/15 → [0.7961, 1.0], 0/15 → [0, 0.2039]).

This is the input to peer review, not yet peer-reviewed. § "What's missing" in README.md is explicit about that: no independent evaluator has run this harness and reported their own numbers yet.

Honest disclosure on the upstream article's "99% byte-saver vs 60% saver" rhetorical hook: the cost-model derivation in formalization.md § 3.1 shows the hook holds only for b_A < 1 − (1 − b_B) · ρ. For (b_B = 0.60, ρ = 0.10) that's b_A < 0.96. Literal b_A = 0.99 does NOT exhibit the cache crossover at any finite n; the byte-saving advantage dominates. The hook IS correct for measured-compressor values (typically b ≈ 0.7 – 0.95), but it's rhetorical, not mathematical, at the boundary value.

Refs: Wilson 1927; Efron & Tibshirani 1993; Welch 1947; Holm 1979; Cohen 1988; Cook & Campbell 1979; Gebru et al. 2018.
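
The crossover condition quoted from formalization.md § 3.1 can be checked numerically. The sketch below encodes only the inequality b_A < 1 − (1 − b_B) · ρ as stated above; the function names are hypothetical, and the reading of ρ as the relative cost paid by the cache-stable compressor is an assumption, since the full derivation lives in the formalization document.

```python
def crossover_threshold(b_b: float, rho: float) -> float:
    """Largest byte-saving b_A below which the hook holds (hypothetical helper).

    Implements the right-hand side of b_A < 1 - (1 - b_B) * rho.
    """
    if not (0.0 <= b_b <= 1.0 and 0.0 <= rho <= 1.0):
        raise ValueError("b_b and rho must lie in [0, 1]")
    return 1.0 - (1.0 - b_b) * rho


def hook_holds(b_a: float, b_b: float, rho: float) -> bool:
    """True when the high-saving compressor A still loses to B in steady state."""
    return b_a < crossover_threshold(b_b, rho)


if __name__ == "__main__":
    # Values quoted in the PR text: b_B = 0.60, rho = 0.10.
    print(round(crossover_threshold(0.60, 0.10), 4))
    print(hook_holds(0.99, 0.60, 0.10))  # literal "99% byte-saver"
    print(hook_holds(0.90, 0.60, 0.10))  # a value inside the typical measured range
```

With these inputs the threshold evaluates to 0.96, so b_A = 0.99 falls outside the region where the hook holds while b_A = 0.90 falls inside it, matching the disclosure above.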

g-shevchenko merged commit b6dc152 into main May 24, 2026
6 checks passed

g-shevchenko deleted the claude/methods-v3-public-mirror branch May 24, 2026 18:12
