methods/: peer-review-ready backing for benchmark/ (formalization + CIs + preregistration + repro) #56

Merged: g-shevchenko merged 1 commit into main from claude/methods-v3-public-mirror on May 24, 2026
@g-shevchenko (Owner)

Summary

Adds a new top-level methods/ directory with the peer-review-ready statistical and methodological framework backing the benchmark/ harness. If benchmark/ is the measurement primitive, methods/ is the defensibility layer.

| File | Purpose |
| --- | --- |
| formalization.md | Formal definitions of B, K, Q, C_prod + cost-model theorem + Cook-Campbell threats to validity |
| statistical_analysis.md | Real sophon@0.5.4 measurements on the public 15-fixture corpus with proper CIs |
| preregistration_n60.md | OSF-style frozen protocol for the next-round N≥60 expansion |
| datasheet.md | Gebru et al. corpus disclosure |
| provenance.md | Pinned env + seeds for bytes-identical reproduction |
| Dockerfile + repro-entrypoint.sh | Turnkey reproduction container |
| lib/statistical_analysis.py | Pure-stdlib library: Wilson CI, cluster-bootstrap, two-way random-effects ANOVA, Welch t, Cohen's d, Holm-Bonferroni |
| lib/run_statistical_analysis.py | CLI consuming c2_bench JSON outputs |
| lib/tests/test_statistical_analysis.py | 16 unit tests, all pass; Wilson boundary behaviour pinned with textbook reference values |

The headline finding the CI-driven re-analysis surfaced

Sophon@0.5.4 on `long_realistic_v3` (N=5 runs × M=15 fixtures):

| metric | point | 95% CI | source |
| --- | --- | --- | --- |
| mean byte-saving | 0.828 | [0.721, 0.900] | cluster-bootstrap |
| cache-friendly κ | 1.000 (15/15) | [0.796, 1.000] | Wilson |
| pass rate at `B >= 0.30` bar | 14/15 = 0.933 | [0.701, 0.988] | Wilson |

The Wilson lower bound on κ is 79.6%, not 100%. A point estimate of "100% cache-friendly" is a sample statistic: at 95% confidence, 15 successes out of 15 support only a population proportion of at least 79.6%. Anyone publishing a "100% cache-friendly" claim without that caveat is over-claiming on small N. This is the kind of disclosure the methods directory exists to make easy.
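
To make the κ bound concrete, here is a minimal sketch of the standard Wilson score interval in pure-stdlib Python. The function name `wilson_interval` and its signature are illustrative assumptions, not necessarily what `lib/statistical_analysis.py` exposes.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Two-sided Wilson score interval for a binomial proportion.

    `z` defaults to the normal quantile for 95% confidence.
    Hypothetical helper for illustration; not the repository's API.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    # Centre of the interval is pulled from p_hat toward 0.5.
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

Calling `wilson_interval(15, 15)` gives a lower bound of about 0.7961 and an upper bound of 1.0, and `wilson_interval(0, 15)` gives the mirror image with an upper bound of about 0.2039. Unlike the normal-approximation (Wald) interval, the Wilson interval does not collapse to zero width when the observed proportion is exactly 0 or 1, which is why it is the appropriate choice here.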

Test plan

  • `cd methods/lib && python3 -m unittest tests.test_statistical_analysis` -> 16/16 PASS
  • Wilson 15/15 -> [0.7961, 1.0] (matches Wilson 1927 reference)
  • Cluster-bootstrap reproducible with seed=42
  • Moat-guard preflight: 0 banned markers, 0 measured-internal-ratio leaks
  • Identity: Gregory Shevchenko g@humanswith.ai, no Claude co-author
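
The seeded cluster-bootstrap item in the test plan can be illustrated with a short sketch. It assumes a percentile bootstrap that resamples whole fixtures (the clusters) with replacement and that every fixture has the same number of runs; the function name, the default of 2000 resamples, and the demo data are all hypothetical and are not taken from the repository or from the sophon measurements.

```python
import random
import statistics


def cluster_bootstrap_ci(runs_by_fixture, n_boot=2000, alpha=0.05, seed=42):
    """Percentile cluster-bootstrap CI for the mean across fixtures.

    `runs_by_fixture` is a list of per-fixture lists of run values.
    Fixtures are the resampling unit, so within-fixture correlation
    between repeated runs does not shrink the interval artificially.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # private RNG: same seed, same interval
    # With equal runs per fixture, resampling a fixture's mean is
    # equivalent to resampling the whole cluster of its runs.
    fixture_means = [statistics.fmean(runs) for runs in runs_by_fixture]
    k = len(fixture_means)
    boot_means = []
    for _ in range(n_boot):
        resample = [fixture_means[rng.randrange(k)] for _ in range(k)]
        boot_means.append(statistics.fmean(resample))
    boot_means.sort()
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return statistics.fmean(fixture_means), lower, upper


# Synthetic stand-in data: 15 fixtures x 5 runs, NOT real measurements.
demo = [[0.5 + 0.03 * i + 0.001 * j for j in range(5)] for i in range(15)]
point, lower, upper = cluster_bootstrap_ci(demo)
```

Passing the seed to a private `random.Random` instance, rather than seeding the module-level generator, keeps the result reproducible even if other code draws random numbers in between.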

The benchmark/ directory ships a working harness with point-estimate
results — enough for a smoke test, not enough to defend a published
claim. This directory adds the statistical and methodological framework
that converts "I measured a number" to "I measured a number with a
defensible confidence bound":

  formalization.md         B/K/Q/C_prod formal definitions + cost-model
                           theorem (derives the exact condition under
                           which a high-byte-saving / low-cache-stable
                           compressor loses to a moderate-saving /
                           high-cache-stable one in steady state) +
                           Cook-Campbell threats to validity
  statistical_analysis.md  Real sophon@0.5.4 measurements on the public
                           15-fixture corpus with proper CIs:
                             B̄=0.828, cluster-bootstrap CI [0.721, 0.900]
                             κ=1.000 (15/15), Wilson CI [0.796, 1.000]
                           The Wilson lower bound 79.6% is the lesson —
                           "100% cache-friendly" is a sample statistic,
                           not a population estimate at N=15.
  preregistration_n60.md   OSF-style frozen protocol for the next-round
                           expansion (genre-stratified N≥60). Converts
                           future analyses from exploratory to
                           confirmatory.
  datasheet.md             Gebru et al. format corpus disclosure for
                           long_realistic_v3.jsonl.
  provenance.md            Pinned env + RNG seeds for bytes-identical
                           reproduction.
  Dockerfile + repro-entrypoint.sh
                           Turnkey reproduction container.

  lib/statistical_analysis.py       Pure-stdlib library: Wilson CI,
                                    cluster-bootstrap, two-way random-
                                    effects ANOVA, Welch t, Cohen's d,
                                    Holm-Bonferroni.
  lib/run_statistical_analysis.py   CLI consuming c2_bench JSON outputs.
  lib/tests/test_statistical_analysis.py
                                    16 tests, all pass. Wilson boundary
                                    behaviour pinned against textbook
                                    reference values (15/15 → [0.7961,
                                    1.0], 0/15 → [0, 0.2039]).

This is the input to peer review, not yet peer-reviewed. § "What's
missing" in README.md is explicit about that: no independent
evaluator has run this harness and reported their own numbers yet.

Honest disclosure about the upstream article's "99% byte-saver vs 60%
saver" rhetorical hook: the cost-model derivation in formalization.md
§ 3.1 shows the hook holds only for b_A < 1 − (1 − b_B) · ρ. For
(b_B = 0.60, ρ = 0.10) that gives b_A < 0.96. A literal b_A = 0.99
does NOT exhibit the cache crossover at any finite n; the byte-saving
advantage dominates. The hook IS correct for measured-compressor
values (typically b ≈ 0.7 – 0.95), but at the literal 0.99 value it
is rhetorical, not mathematical.
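
The arithmetic behind that bound can be checked with a small sketch. It
only evaluates the inequality quoted above; the function names are
hypothetical, and the full derivation and the precise definition of ρ
remain in formalization.md § 3.1.

```python
def crossover_threshold(b_b, rho):
    """Upper bound on b_A for the hook to hold: 1 - (1 - b_B) * rho.

    Hypothetical helper that only restates the inequality quoted in
    the commit message; it does not reproduce the derivation.
    """
    return 1.0 - (1.0 - b_b) * rho


def hook_holds(b_a, b_b, rho):
    """True when b_A is strictly below the crossover threshold."""
    return b_a < crossover_threshold(b_b, rho)


# Values quoted in the text: b_B = 0.60, rho = 0.10.
threshold = crossover_threshold(0.60, 0.10)   # approximately 0.96
literal_hook = hook_holds(0.99, 0.60, 0.10)   # False: 0.99 is above 0.96
typical_case = hook_holds(0.90, 0.60, 0.10)   # True: 0.90 is below 0.96
```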

Refs:
  Wilson 1927; Efron & Tibshirani 1993; Welch 1947; Holm 1979;
  Cohen 1988; Cook & Campbell 1979; Gebru et al. 2018.
@g-shevchenko g-shevchenko merged commit b6dc152 into main May 24, 2026
6 checks passed
@g-shevchenko g-shevchenko deleted the claude/methods-v3-public-mirror branch May 24, 2026 18:12