PolybrainBench

A living benchmark for cross-model consensus verification of natural-language claims.

Paper v16 · Zenodo DOI 10.5281/zenodo.19546460 · N=10,452 cycles · 9-model generator fleet · 6-model partially disjoint reviewer fleet · Honest composite 72 · CC-BY-4.0


What the benchmark measures

Most LLM benchmarks measure a single model against a fixed answer key. PolybrainBench measures inter-model disagreement on the same declarative claim, dispatched in parallel to a standardized generator fleet drawn from 5 independent training families. The disagreement pattern across models is the signal: which models agree, which dissent, where the boundary of cross-model consensus actually lies, and on what kinds of claims.

The benchmark is designed as a living artifact. A daemon runs continuously, adds verification cycles to the dataset, regenerates the paper from the current ledger, validates the paper using a reviewer fleet deliberately kept partially independent from the generator, and republishes the dataset and paper with a versioned identifier. Every cycle compounds. A concept DOI (10.5281/zenodo.19546459) always resolves to the latest version for long-term citation stability.

The two fleets

PolybrainBench is built on a deliberate separation between two distinct fleets. Which model responded to a claim, and which model graded the paper about those responses, are now two different questions.

Generator fleet (9 models)

Every cycle dispatches the same 9 models in parallel against a single declarative claim. All 9 responses are captured in full, stamped with a SHA-256 provenance hash, and written to the cycle directory. The generator fleet is unchanged since Sprint 2 (N~1,100 cycles) and spans 5 independent training families:

| Slug | Provider | Family | Role in the fleet |
| --- | --- | --- | --- |
| kimi-k2-groq | Groq | Moonshot | Frontier-quality Groq reasoning |
| gpt-4.1-mini | OpenAI | OpenAI | Reliable mid-tier |
| gpt-4.1-nano | OpenAI | OpenAI | Fast, cheap, stable |
| grok-3-mini | xAI | xAI | Reasoning, slowest typical response |
| grok-4-fast | xAI | xAI | Strict reviewer (when in the reviewer seat) |
| qwen3-32b | Groq | Alibaba | Has logprobs |
| gpt-oss-120b | Groq | OpenAI (open weights) | Strictest in-fleet reviewer (when in that seat) |
| llama-4-scout | Groq | Meta | Fastest cycle response |
| llama-3.3-70b | Groq | Meta | Most generous reviewer (when in that seat) |

Why five training families matter. Models from the same provider share training data, alignment procedures, and failure modes. Measuring cross-model disagreement across models from the same family understates the actual variance that matters in production. PolybrainBench draws its generator fleet from OpenAI, xAI, Moonshot, Meta, and Alibaba specifically to maximize independence of training lineages.

Reviewer fleet (6 models, partially disjoint)

Every validator run dispatches the regenerated paper as a topic to a different 6-model reviewer fleet. This is the central change introduced in Sprint 7. Until v8 the paper was validated by the exact same 9 models that had written the responses the paper describes. That is peer review by the author. The composite produced by that arrangement was systematically inflated relative to an independent reading of the same paper.

The Sprint 7 reviewer fleet consists of four in-fleet reviewers plus two external anchors:

| Reviewer | Provider | Family | Independence | Role |
| --- | --- | --- | --- | --- |
| gpt-4.1-nano | OpenAI | OpenAI | In-fleet (also a generator) | Stable mid-tier |
| grok-4-fast | xAI | xAI | In-fleet (also a generator) | Strict voice #1 |
| gpt-oss-120b | Groq | OpenAI weights | In-fleet (also a generator) | Strictest in-fleet voice |
| llama-3.3-70b | Groq | Meta | In-fleet (also a generator) | Most generous voice |
| claude-sonnet-4-5 | Anthropic | Anthropic | External anchor, zero corpus contribution | Moderate voice |
| gemini-2.5-pro | Google | Google | External anchor, zero corpus contribution | Strictest reviewer overall on first reading |

Partial reviewer independence, not absolute. The four in-fleet reviewers also contribute per-claim responses to the dataset. They do not grade their own individual responses; they grade the paper's aggregate analysis of ten thousand cycles. They span the observed reviewer personality range from strictest (gpt-oss-120b) to most generous (llama-3.3-70b), which is why they are retained in the reviewer seat rather than swapped out entirely. The two external anchors are fully independent: Anthropic and Google are not in the generator fleet at all, and neither model has written a single response into the corpus they are reading. This is the honest framing. It is a partial disjointness, and the paper header spells it out in the same block that shows the composite.

The full fleet catalog (both fleets, all 11 unique models, 7 training families) is documented separately in docs/fleet.md.
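The model and family counts quoted above can be checked directly from the two fleet tables. The sketch below transcribes both tables into dictionaries and derives the totals; the dictionary layout and the `fleet_summary` helper are illustrative assumptions, not the format of docs/fleet.md.

```python
# Fleet membership transcribed from the two tables in this document.
# The dict layout is an assumption made for illustration only.
GENERATORS = {
    "kimi-k2-groq": "Moonshot",
    "gpt-4.1-mini": "OpenAI",
    "gpt-4.1-nano": "OpenAI",
    "grok-3-mini": "xAI",
    "grok-4-fast": "xAI",
    "qwen3-32b": "Alibaba",
    "gpt-oss-120b": "OpenAI",  # open weights, served via Groq
    "llama-4-scout": "Meta",
    "llama-3.3-70b": "Meta",
}
REVIEWERS = {
    "gpt-4.1-nano": "OpenAI",
    "grok-4-fast": "xAI",
    "gpt-oss-120b": "OpenAI",
    "llama-3.3-70b": "Meta",
    "claude-sonnet-4-5": "Anthropic",
    "gemini-2.5-pro": "Google",
}


def fleet_summary(generators, reviewers):
    """Count unique models and families, and split reviewers by independence."""
    all_models = {**generators, **reviewers}
    return {
        "unique_models": len(all_models),
        "families": len(set(all_models.values())),
        "generator_families": len(set(generators.values())),
        "in_fleet_reviewers": sorted(set(reviewers) & set(generators)),
        "external_anchors": sorted(set(reviewers) - set(generators)),
    }


summary = fleet_summary(GENERATORS, REVIEWERS)
print(summary)
```

Under these tables the summary reports 11 unique models, 7 families overall, 5 generator families, 4 in-fleet reviewers, and 2 external anchors.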

The cycle

Each verification cycle is a single atomic unit:

  1. A declarative claim is drawn from the topic queue (divergence-targeted, see §topic generator)
  2. The claim is dispatched in parallel to all 9 generator-fleet models
  3. Each model's full response text is captured and stamped with a SHA-256 provenance hash
  4. Per-model response time is recorded in milliseconds
  5. A grounding verification step confirms all expected response files exist on disk
  6. A manifest summarizing the cycle is written as the atomic transaction record
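The six steps above can be sketched in a few dozen lines. This is a minimal illustration, not the engine's code (which is proprietary): `run_cycle`, `fake_call_model`, and the manifest field names are hypothetical, and the provider call is replaced by a stub so the example runs offline.

```python
import hashlib
import json
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_cycle(claim, models, call_model, cycle_dir):
    """Dispatch one claim to every model, then hash, time, check, and record."""
    responses_dir = Path(cycle_dir) / "responses"
    responses_dir.mkdir(parents=True, exist_ok=True)

    def dispatch(slug):
        start = time.perf_counter()
        text = call_model(slug, claim)  # hypothetical provider call
        elapsed_ms = round((time.perf_counter() - start) * 1000)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        (responses_dir / f"{slug}.md").write_text(text, encoding="utf-8")
        return {"model": slug, "sha256": digest, "response_ms": elapsed_ms}

    # Same claim, all models, in parallel; map() preserves the input order.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        records = list(pool.map(dispatch, models))

    # Grounding check: every expected response file must exist on disk.
    missing = [m for m in models if not (responses_dir / f"{m}.md").is_file()]

    # The manifest is written last so that its presence marks a complete cycle.
    manifest = {
        "claim": claim,
        "responses": records,
        "grounded": not missing,
        "missing": missing,
    }
    (Path(cycle_dir) / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest


def fake_call_model(slug, claim):
    """Stand-in for a real API call; returns deterministic text."""
    return f"[{slug}] assessment of: {claim}"


with tempfile.TemporaryDirectory() as tmp:
    manifest = run_cycle(
        "Example claim.",
        ["model-a", "model-b", "model-c"],
        fake_call_model,
        Path(tmp) / "cycles" / "001",
    )
    print(manifest["grounded"], len(manifest["responses"]))
```

Writing the manifest only after the grounding check is one way to make it behave as a transaction record: a cycle directory without a manifest can be treated as incomplete.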

Cycle directories are stored as cycles/NNN/ with manifest.json, responses/<model-slug>.md, traces/<model-slug>-trace.json, and provenance.json. A harvest script unifies the cycle directories into the public ledger (public-ledger.jsonl), one row per cycle.
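A harvest step of this kind might look like the following sketch. The `harvest` function, the row layout, and the rule of skipping directories without a manifest are assumptions for illustration; the actual harvest script is not published.

```python
import json
import tempfile
from pathlib import Path


def harvest(cycles_root, ledger_path):
    """Write one JSON line per cycle directory that contains a manifest."""
    rows = 0
    with open(ledger_path, "w", encoding="utf-8") as ledger:
        for cycle_dir in sorted(p for p in Path(cycles_root).iterdir() if p.is_dir()):
            manifest_path = cycle_dir / "manifest.json"
            if not manifest_path.is_file():
                continue  # no manifest: treat the cycle as incomplete and skip it
            manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
            row = {"cycle": cycle_dir.name, **manifest}
            ledger.write(json.dumps(row, ensure_ascii=False) + "\n")
            rows += 1
    return rows


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "cycles"
    (root / "001").mkdir(parents=True)
    (root / "002").mkdir()  # deliberately left without a manifest
    (root / "001" / "manifest.json").write_text(
        json.dumps({"claim": "Example claim."}), encoding="utf-8"
    )
    ledger_file = Path(tmp) / "public-ledger.jsonl"
    rows_written = harvest(root, ledger_file)
    ledger_rows = [
        json.loads(line)
        for line in ledger_file.read_text(encoding="utf-8").splitlines()
    ]
    print(rows_written, ledger_rows)
```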

The topic generator (divergence-targeted)

A claim that every model trivially agrees on yields zero disagreement signal. A claim where independent models actually diverge on numbers, dates, attributions, or interpretations is where the measurement lives.

The topic generator scores each candidate claim on two independent 0-to-0.5 factors:

Factor A: well-formed declarative shell

  • Length in the 10 to 30 word range
  • Single sentence with terminal punctuation
  • No first-person pronouns
  • Declarative rather than interrogative

Factor B: signal density

  • Specific numbers (2+ digits, decimals, or percentages)
  • Specific years in the 1500 to 2029 range
  • Two or more mid-sentence capitalized words (named entities)
  • Two or more long words (technical vocabulary)
  • Attribution or action verbs from a curated list

Every claim must pass an acceptance threshold on the sum of factors A and B. A separate upstream prompt to the generator LLM explicitly requires each candidate to contain at least one of eight divergence-generating features: non-round specific numbers, contested dates, attribution claims, criterion-dependent superlatives, cross-field technical terms, near-training-cutoff events (2020 through 2025), named standards with specific effective dates, or flat-stated corrections of common misconceptions.
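A rough sketch of the two-factor scorer follows. Several details are assumptions, because the document does not state them: each listed check is weighted equally within its factor, "long word" means nine or more characters, the verb list is a made-up stand-in for the curated list, and the acceptance threshold of 0.6 is hypothetical.

```python
import re

# Hypothetical stand-ins: the real curated verb list and threshold are not published here.
ATTRIBUTION_VERBS = {"discovered", "invented", "published", "founded", "proposed", "ratified"}
FIRST_PERSON = {"i", "we", "me", "us", "my", "our"}
ACCEPT_THRESHOLD = 0.6
PUNCT = ".,;:!?\"'()"


def factor_a(claim):
    """Well-formed declarative shell, scored 0 to 0.5."""
    body = claim.strip()
    words = body.split()
    lowered = {w.strip(PUNCT).lower() for w in words}
    checks = [
        10 <= len(words) <= 30,
        # Ends with terminal punctuation and has no earlier sentence break.
        body[-1:] in ".!?" and not re.search(r"[.!?]\s+\S", body),
        not (lowered & FIRST_PERSON),
        not body.endswith("?"),
    ]
    return 0.5 * sum(checks) / len(checks)


def factor_b(claim):
    """Signal density, scored 0 to 0.5."""
    stripped = [w.strip(PUNCT) for w in claim.split()]
    checks = [
        bool(re.search(r"\d{2,}|\d+\.\d+|\d+%", claim)),
        any(re.fullmatch(r"\d{4}", w) and 1500 <= int(w) <= 2029 for w in stripped),
        sum(1 for w in stripped[1:] if w[:1].isupper()) >= 2,
        sum(1 for w in stripped if len(w) >= 9) >= 2,
        any(w.lower() in ATTRIBUTION_VERBS for w in stripped),
    ]
    return 0.5 * sum(checks) / len(checks)


def score(claim):
    return factor_a(claim) + factor_b(claim)


def accept(claim):
    return score(claim) >= ACCEPT_THRESHOLD


dense = ("The Treaty of Westphalia was ratified in 1648 and ended the "
         "Thirty Years War across central European territories.")
trivia = "Paris is the capital of France."
print(score(dense), accept(dense))
print(score(trivia), accept(trivia))
```

The sentence-break check is deliberately crude and would misfire on abbreviations such as "U.S. Army"; a production scorer would need a more careful sentence splitter.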

Trivia like "Paris is the capital of France", "The element with atomic number 79 is gold", or "The chemical formula for glucose is C6H12O6" are listed as forbidden examples in the generator prompt. They yield zero inter-model disagreement signal and therefore zero scientific value to this benchmark.

The validator

Every N cycles, the daemon regenerates the paper from the current ledger and validates the paper using the 6-model reviewer fleet as an independent panel. Each reviewer reports two dimensions: quality (Q, 0 to 100) and adversarial (A, 0 to 100). The honest composite is computed as:

composite = round(0.6 × mean(Q across reviewers) + 0.4 × mean(A across reviewers))

The weights are fixed. Each reviewer carries equal weight within each dimension's mean, so no single reviewer can move the composite by more than roughly two points per dimension in expectation at this fleet size. Real token cost is captured from each model's API response usage fields and summed into a per-run real cost. A validator run on the disjoint reviewer fleet costs approximately $0.040, up from $0.0073 on the previous self-review fleet, because Claude Sonnet 4.5 and Gemini 2.5 Pro are paid APIs whereas five of the nine original reviewers were free via Groq's developer tier. The cost multiplier is roughly 5.5x. The total is still only about four cents per validator run.
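The formula and the per-reviewer influence can be made concrete with a short sketch. The function names are illustrative. One caveat: Python's built-in `round` rounds exact halves to the nearest even integer, and the document does not say which tie-breaking rule the validator uses.

```python
from statistics import mean

Q_WEIGHT, A_WEIGHT = 0.6, 0.4


def raw_composite(q_scores, a_scores):
    """Weighted blend of the two dimension means, before rounding."""
    return Q_WEIGHT * mean(q_scores) + A_WEIGHT * mean(a_scores)


def composite(q_scores, a_scores):
    """Rounded composite; note that round() sends exact .5 ties to the even integer."""
    return round(raw_composite(q_scores, a_scores))


def per_point_influence(n_reviewers):
    """Composite movement when one reviewer changes one score by one point."""
    return {
        "quality": Q_WEIGHT / n_reviewers,
        "adversarial": A_WEIGHT / n_reviewers,
    }


# Toy panel of six reviewers (made-up scores, not benchmark data).
baseline = raw_composite([70] * 6, [60] * 6)
shifted = raw_composite([90] + [70] * 5, [60] * 6)  # one reviewer raises Q by 20
shift = shifted - baseline

influence = per_point_influence(6)
cost_ratio = 0.040 / 0.0073  # per-run cost figures quoted above

print(baseline, shifted, shift)
print(influence, round(cost_ratio, 2))
```

With six reviewers, one reviewer moving their quality score by 20 points shifts the unrounded composite by 2 points, and the same swing on the adversarial score shifts it by about 1.3 points.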

If a reviewer call returns an HTTP 429 (rate limit) or 503 (transient overload), the validator retries up to three times with exponential backoff (3s / 6s / 12s). The 503 retry was added after Sprint 7's first reviewer run hit a Gemini transient overload and dropped to 5-of-6 valid responses; with the retry, Gemini recovered on the second attempt and the run went 6-of-6.
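The retry policy can be sketched as below. `ReviewerHTTPError` and `call_with_retry` are hypothetical names, and the reviewer call is stubbed; only the retryable status codes and the 3s / 6s / 12s delays come from the description above. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time

RETRYABLE_STATUS = {429, 503}   # rate limit, transient overload
BACKOFF_SECONDS = (3, 6, 12)    # one delay per retry, doubling each time


class ReviewerHTTPError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed reviewer call."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(call, sleep=time.sleep):
    """Try once, then retry up to three times on retryable statuses."""
    for delay in BACKOFF_SECONDS:
        try:
            return call()
        except ReviewerHTTPError as err:
            if err.status not in RETRYABLE_STATUS:
                raise  # non-retryable errors surface immediately
            sleep(delay)
    return call()  # final attempt; any error propagates to the caller


# Demo: a reviewer that returns 503 once and then succeeds.
attempts = {"count": 0}
waits = []


def flaky_reviewer():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ReviewerHTTPError(503)
    return "Q: 52\nA: 35"


result = call_with_retry(flaky_reviewer, sleep=waits.append)
print(result, waits)
```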

The self-review bias measurement

Sprint 7's highest-leverage change was the disjoint reviewer split. The change moved the composite by a measurable amount, which is itself the first honest measurement of self-review bias in this benchmark. The same paper snapshot (v8, N=10,452 cycles, frontier corpus) was validated back-to-back under both arrangements:

| Reading | Fleet | Mean Q | Mean A | Composite |
| --- | --- | --- | --- | --- |
| v8 self-reviewed | 9-model generator = reviewer | 80.6 | 68.4 | 76 |
| v13 disjoint 6-of-6 (run 1) | 4 in-fleet + Claude + Gemini | 71.17 | 60.33 | 67 |
| v15 disjoint 6-of-6 (run 2) | same | 76.83 | 65.00 | 72 |
| v16 disjoint 6-of-6 (run 3) | same | 75.00 | 67.00 | 72 |

Median of the three disjoint readings: 72. Mean: 70.3. Direct self-review inflation: 4 points on the central tendency, 4 to 9 points across individual readings. The honest disjoint composite lives in the 67-to-72 band. The prior 76 was the fleet grading its own corpus.
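The summary statistics follow directly from the four composites in the table. A short check, using only the numbers reported above:

```python
from statistics import mean, median

SELF_REVIEW_COMPOSITE = 76           # v8, 9-model self-review
DISJOINT_COMPOSITES = [67, 72, 72]   # v13, v15, v16

disjoint_median = median(DISJOINT_COMPOSITES)
disjoint_mean = round(mean(DISJOINT_COMPOSITES), 1)

# Inflation on the central tendency, and the range across individual readings.
inflation_central = SELF_REVIEW_COMPOSITE - disjoint_median
inflation_range = (
    SELF_REVIEW_COMPOSITE - max(DISJOINT_COMPOSITES),
    SELF_REVIEW_COMPOSITE - min(DISJOINT_COMPOSITES),
)

print(disjoint_median, disjoint_mean, inflation_central, inflation_range)
```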

Per-model scores on the first full disjoint reading (v13):

| Reviewer | Provider | Quality | Adversarial |
| --- | --- | --- | --- |
| gpt-4.1-nano | OpenAI | 85 | 80 |
| grok-4-fast | xAI | 65 | 52 |
| gpt-oss-120b | OpenAI weights / Groq | 58 | 45 |
| llama-3.3-70b | Meta / Groq | 95 | 92 |
| claude-sonnet-4-5 | Anthropic | 72 | 58 |
| gemini-2.5-pro | Google | 52 | 35 |

On its first reading of the benchmark, Gemini 2.5 Pro produced the single strictest score in the entire fleet, stricter than either in-fleet strict reviewer. Claude Sonnet 4.5 came in as a moderate mid-tier voice. Together, the two external anchors are responsible for most of the composite drop from the old 9-model self-review reading to the new 6-model disjoint reading.
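The v13 row of the readings table can be recomputed from the per-model scores. The sketch below also runs a what-if that drops the two external anchors; that second figure is a back-of-the-envelope illustration derived here, not a reading the benchmark reported.

```python
from statistics import mean

# (quality, adversarial) per reviewer on the v13 disjoint reading.
V13_SCORES = {
    "gpt-4.1-nano": (85, 80),
    "grok-4-fast": (65, 52),
    "gpt-oss-120b": (58, 45),
    "llama-3.3-70b": (95, 92),
    "claude-sonnet-4-5": (72, 58),
    "gemini-2.5-pro": (52, 35),
}
EXTERNAL_ANCHORS = {"claude-sonnet-4-5", "gemini-2.5-pro"}


def panel_stats(scores):
    """Return (mean Q, mean A, rounded composite) for a panel of reviewers."""
    mean_q = mean(q for q, _ in scores.values())
    mean_a = mean(a for _, a in scores.values())
    return round(mean_q, 2), round(mean_a, 2), round(0.6 * mean_q + 0.4 * mean_a)


full_panel = panel_stats(V13_SCORES)
in_fleet_only = panel_stats(
    {slug: s for slug, s in V13_SCORES.items() if slug not in EXTERNAL_ANCHORS}
)

print(full_panel)      # all six reviewers
print(in_fleet_only)   # what-if: the four in-fleet reviewers alone
```

On these scores the full panel reproduces the reported 71.17 / 60.33 / 67, while the four in-fleet reviewers alone would give a composite of 72.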

Publication rule: Matthew Effect, no threshold

The published paper is the current canonical artifact, always. There is no threshold gate on the composite. The only condition that blocks publication is if the validator returned a null composite because one of the two dimensions was unparseable for all reviewers at once, which has not occurred since the parser bug in the earliest paper era was resolved.

The reason for the no-threshold rule: a 72-composite paper in the world compounds citations faster than a hypothetical 95-composite paper in a private directory. Every minute of threshold-waiting is a minute of lost discovery indexing. The truth of whatever the composite actually is, prominently displayed in the paper's header blockquote, is more credible than a hidden number.

Sprint history and trajectory

| Sprint | Cycles added | Cumulative N | Reviewer fleet | Composite | Corpus era |
| --- | --- | --- | --- | --- | --- |
| Sprint 2 | 749 | 1,103 | 9/9 self-review | 74 | trivia |
| Sprint 5 | 3,000 | 5,816 | 9/9 self-review | 76 | trivia (mode-collapsed) |
| Sprint 6 | 5,000 | 10,449 | 9/9 self-review | 75 | frontier (divergence-targeted) |
| Manual regen v8 | 0 (re-validate) | 10,452 | 9/9 self-review | 76 | frontier |
| Sprint 7 v13 | 0 (re-validate) | 10,452 | 6/6 disjoint | 67 | frontier |
| Sprint 7 v15 | 0 (re-validate) | 10,452 | 6/6 disjoint | 72 | frontier |
| Sprint 7 v16 | 0 (re-validate) | 10,452 | 6/6 disjoint | 72 | frontier |

The three Sprint 7 readings are three back-to-back validator runs against the same paper on the same disjoint reviewer fleet. The variance across the three readings (67, 72, 72) reflects sampling temperature on the reviewer side, not a change in the paper itself. The median is the published number.

Repeated 9-of-9 self-review runs on an identical paper vary by approximately ±1 composite point under the old arrangement. The 6-of-6 disjoint fleet appears to have slightly higher sampling variance (approximately ±3 points, based on three readings). More readings will tighten the band.

Reproducing the benchmark

Every step is reproducible from the published artifacts:

  1. Dataset: download the unified ledger from the Zenodo deposit (DOI 10.5281/zenodo.19546460) or from the Hugging Face mirror at andysalvo/polybrainbench-v8.
  2. Methodology: this document is the full specification.
  3. Canonical pages: each verification cycle has a stable URL at polylogicai.com/trust/claim/<slug> with schema.org Dataset and FAQPage JSON-LD pointing back to the paper DOI.
  4. Generator fleet reproduction: any researcher with API access to OpenAI, xAI, and Groq can implement the same 9-model dispatch against the same claims from the ledger and verify the responses match within model sampling variance.
  5. Reviewer fleet reproduction: add API access to Anthropic and Google for the two external anchors. All six reviewer endpoints are public APIs. No gated access is required.
  6. Validator reproduction: the composite formula is simple enough to implement in approximately 20 lines: dispatch the paper as a topic to each reviewer, parse their Q and A scores, compute round(0.6 × mean(Q) + 0.4 × mean(A)).
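For step 6, a validator along these lines fits in roughly twenty lines. The reply format (`Q: <n>` and `A: <n>`), the `parse_scores` and `validate` helpers, and the stubbed `ask` function are all assumptions made for illustration; the actual reviewer prompt and parser are not published. Dropping a reviewer's unparseable dimension from the mean, and returning `None` when a dimension is unparseable for every reviewer, is this sketch's reading of the null-composite rule.

```python
import re
from statistics import mean

# Assumed reply format: a line such as "Q: 72" and a line such as "A: 58".
SCORE_PATTERNS = {
    "Q": re.compile(r"\bQ\s*[:=]\s*(\d{1,3})\b"),
    "A": re.compile(r"\bA\s*[:=]\s*(\d{1,3})\b"),
}


def parse_scores(reply):
    """Extract Q and A as integers in 0..100, or None when missing or out of range."""
    parsed = {}
    for dim, pattern in SCORE_PATTERNS.items():
        match = pattern.search(reply)
        value = int(match.group(1)) if match else None
        parsed[dim] = value if value is not None and 0 <= value <= 100 else None
    return parsed


def validate(paper, reviewers, ask):
    """Dispatch the paper to each reviewer and blend the parsed scores."""
    parsed = [parse_scores(ask(slug, paper)) for slug in reviewers]
    q = [p["Q"] for p in parsed if p["Q"] is not None]
    a = [p["A"] for p in parsed if p["A"] is not None]
    if not q or not a:
        return None  # a dimension was unparseable for every reviewer
    return round(0.6 * mean(q) + 0.4 * mean(a))


# Stubbed reviewer replies built from the v13 per-model scores in this document.
CANNED = {
    "gpt-4.1-nano": "Q: 85\nA: 80",
    "grok-4-fast": "Q: 65\nA: 52",
    "gpt-oss-120b": "Q: 58\nA: 45",
    "llama-3.3-70b": "Q: 95\nA: 92",
    "claude-sonnet-4-5": "Q: 72\nA: 58",
    "gemini-2.5-pro": "Q: 52\nA: 35",
}

result = validate("paper text", list(CANNED), lambda slug, paper: CANNED[slug])
print(result)
```

Replacing the `ask` stub with real API calls to the six reviewer endpoints is the only change needed to run this against a live panel.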

Complementary, not competitive

PolybrainBench is deliberately not a capability benchmark. It does not claim that any model in the fleet is better than any other model, and it does not claim any aggregate "Polybrain" system outscores any frontier model on any task. The question it answers is one that cannot be answered by a single model: where do independently-trained models disagree on the same claim?

That question becomes more valuable, not less, as frontier models improve. If every model agrees on a claim, the claim is high-consensus and the fleet adds no information beyond a single model. If models disagree, the disagreement is a known-unknown marker and the fleet has measured something no single model can detect. Better frontier models shift the boundary, but the measurement of where the boundary currently sits remains informative.

Contributing

The benchmark grows via the generator daemon on a scheduled cadence. External contributions are welcome via issues for:

  • Topic-generator improvements (new divergence-generating claim features)
  • Alternative validator formulations (trim-mean, reviewer calibration, weighted voting)
  • Additional reviewer fleet anchors from families not yet represented
  • Sample canonical page improvements
  • Analysis of the existing ledger (the dataset is open for secondary analysis)

Report issues at https://github.com/andysalvo/polybrain/issues.

Citation

@dataset{salvo_polybrainbench_2026,
  author       = {Salvo, Andy},
  title        = {PolybrainBench v16: A Living Benchmark for Cross-Model
                  Consensus Verification of Natural-Language Claims},
  year         = 2026,
  publisher    = {Zenodo},
  version      = {v16},
  doi          = {10.5281/zenodo.19546460},
  url          = {https://doi.org/10.5281/zenodo.19546460}
}

License

Paper, dataset summary, methodology, and canonical pages: CC-BY-4.0. Engine source code: proprietary.