methods/: peer-review-ready backing for benchmark/ (formalization + CIs + preregistration + repro)#56
Merged
The benchmark/ directory ships a working harness with point-estimate
results — enough for a smoke test, not enough to defend a published
claim. This directory adds the statistical and methodological framework
that converts "I measured a number" to "I measured a number with a
defensible confidence bound":
  formalization.md
      B/K/Q/C_prod formal definitions + cost-model theorem (derives the
      exact condition under which a high-byte-saving / low-cache-stable
      compressor loses to a moderate-saving / high-cache-stable one in
      steady state) + Cook-Campbell threats to validity.
  statistical_analysis.md
      Real sophon@0.5.4 measurements on the public 15-fixture corpus
      with proper CIs:
        B̄=0.828, cluster-bootstrap CI [0.721, 0.900]
        κ=1.000 (15/15), Wilson CI [0.796, 1.000]
      The Wilson lower bound of 79.6% is the lesson: "100%
      cache-friendly" is a sample statistic, not a population estimate
      at N=15.
  preregistration_n60.md
      OSF-style frozen protocol for the next-round expansion
      (genre-stratified N≥60). Converts future analyses from exploratory
      to confirmatory.
  datasheet.md
      Gebru et al. format corpus disclosure for long_realistic_v3.jsonl.
  provenance.md
      Pinned env + RNG seeds for byte-identical reproduction.
  Dockerfile + repro-entrypoint.sh
      Turnkey reproduction container.
  lib/statistical_analysis.py
      Pure-stdlib library: Wilson CI, cluster-bootstrap, two-way
      random-effects ANOVA, Welch t, Cohen's d, Holm-Bonferroni.
  lib/run_statistical_analysis.py
      CLI consuming c2_bench JSON outputs.
  lib/tests/test_statistical_analysis.py
      16 tests, all pass. Wilson boundary behaviour pinned against
      textbook reference values (15/15 → [0.7961, 1.0],
      0/15 → [0, 0.2039]).
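The pinned Wilson boundary values can be reproduced with a few lines of standard-library Python. This is a minimal sketch of the textbook Wilson score interval, not the code in lib/statistical_analysis.py; the function name `wilson_interval` and its signature are assumptions made for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z defaults to the two-sided
    95% normal quantile.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    for k in (15, 0):
        lo, hi = wilson_interval(k, 15)
        print(f"{k}/15 -> [{lo:.4f}, {hi:.4f}]")
    # 15/15 -> [0.7961, 1.0000]
    # 0/15 -> [0.0000, 0.2039]
```

At the boundaries the interval stays informative where a plain normal-approximation interval would collapse to zero width, which is why 15/15 still yields a lower bound near 0.80.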
This is the input to peer review, not yet peer-reviewed. § "What's
missing" in README.md is explicit about that: no independent
evaluator has run this harness and reported their own numbers yet.
Honest disclosure about the upstream article's "99% byte-saver vs 60%
saver" rhetorical hook: the cost-model derivation in formalization.md
§ 3.1 shows the hook holds only for b_A < 1 − (1 − b_B) · ρ. For
(b_B = 0.60, ρ = 0.10) that is b_A < 1 − 0.40 · 0.10 = 0.96. A literal
b_A = 0.99 does NOT exhibit the cache crossover at any finite n,
because the byte-saving advantage dominates. The hook IS correct for
measured-compressor values (typically b ≈ 0.7 – 0.95), but it is
rhetorical, not mathematical, at the boundary value.
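The boundary arithmetic is easy to check numerically. The sketch below only evaluates the inequality quoted above; it does not re-derive the cost model, and ρ is taken as the parameter defined in formalization.md § 3.1. The function names are hypothetical.

```python
def crossover_threshold(b_b: float, rho: float) -> float:
    """Upper bound on b_A under the quoted condition b_A < 1 - (1 - b_B) * rho."""
    return 1.0 - (1.0 - b_b) * rho


def hook_holds(b_a: float, b_b: float, rho: float) -> bool:
    """True when b_A is strictly below the crossover threshold."""
    return b_a < crossover_threshold(b_b, rho)


if __name__ == "__main__":
    threshold = crossover_threshold(b_b=0.60, rho=0.10)
    print(f"threshold = {threshold:.2f}")        # threshold = 0.96
    print(hook_holds(0.99, b_b=0.60, rho=0.10))  # False: the literal 99% saver
    print(hook_holds(0.90, b_b=0.60, rho=0.10))  # True: inside the measured range
```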
Refs:
Wilson 1927; Efron & Tibshirani 1993; Welch 1947; Holm 1979;
Cohen 1988; Cook & Campbell 1979; Gebru et al. 2018.
Summary
Adds a new top-level `methods/` directory with the peer-review-ready statistical and methodological framework backing the `benchmark/` harness. If `benchmark/` is the measurement primitive, `methods/` is the defensibility layer.

- `formalization.md`: `B`, `K`, `Q`, `C_prod` + cost-model theorem + Cook-Campbell threats to validity
- `statistical_analysis.md`
- `preregistration_n60.md`
- `datasheet.md`
- `provenance.md`
- `Dockerfile` + `repro-entrypoint.sh`
- `lib/statistical_analysis.py`
- `lib/run_statistical_analysis.py`
- `lib/tests/test_statistical_analysis.py`

The headline finding the CI-driven re-analysis surfaced
Sophon@0.5.4 on `long_realistic_v3` (N=5 runs × M=15 fixtures):
Wilson lower bound on κ = 79.6%, not 100%. A point-estimate "100% cache-friendly" is a sample statistic. On a corpus of 15 fixtures, the 95% Wilson interval supports only the claim that the population proportion is at least 79.6%. Anyone publishing a "100% cache-friendly" claim without that caveat is over-claiming on small N. This is the kind of disclosure the methods directory exists to make easy.
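The runs × fixtures layout is also why a cluster bootstrap fits the byte-saving mean: repeated runs on one fixture are not independent, so whole fixtures are resampled, not individual runs. The following is a minimal percentile-bootstrap sketch under that assumption. It is not the implementation in `lib/statistical_analysis.py`, the function name is hypothetical, and the demo data is synthetic, not the measured corpus.

```python
import random
from statistics import fmean


def cluster_bootstrap_ci(clusters, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean of per-cluster means.

    `clusters` is a list of lists: one inner list of run values per fixture.
    With equal runs per fixture, resampling cluster means is equivalent to
    resampling whole clusters when the statistic is the grand mean.
    """
    if not clusters or any(len(c) == 0 for c in clusters):
        raise ValueError("need at least one non-empty cluster")
    rng = random.Random(seed)  # fixed seed for reproducible resampling
    cluster_means = [fmean(c) for c in clusters]
    m = len(cluster_means)
    resampled = sorted(
        fmean(rng.choices(cluster_means, k=m)) for _ in range(n_resamples)
    )
    lo = resampled[int((alpha / 2) * (n_resamples - 1))]
    hi = resampled[int((1 - alpha / 2) * (n_resamples - 1))]
    return fmean(cluster_means), lo, hi


if __name__ == "__main__":
    # Synthetic stand-in: 15 fixtures x 5 runs, NOT the real measurements.
    demo_rng = random.Random(1)
    bases = [demo_rng.uniform(0.6, 0.95) for _ in range(15)]
    synthetic = [
        [min(1.0, max(0.0, b + demo_rng.gauss(0, 0.01))) for _ in range(5)]
        for b in bases
    ]
    point, lo, hi = cluster_bootstrap_ci(synthetic)
    print(f"mean={point:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

Because between-fixture variation dominates run-to-run noise in this synthetic setup, treating all 75 values as independent would produce a misleadingly narrow interval.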
Test plan