PolybrainBench

A living benchmark for cross-model consensus verification of natural-language claims.

Paper v16 · Zenodo DOI 10.5281/zenodo.19546460 · N=10,452 cycles · 9-model generator fleet · 6-model partially disjoint reviewer fleet · Honest composite 72 · CC-BY-4.0

What the benchmark measures

Most LLM benchmarks measure a single model against a fixed answer key. PolybrainBench measures inter-model disagreement on the same declarative claim, dispatched in parallel to a standardized generator fleet drawn from 5 independent training families. The disagreement pattern across models is the signal: which models agree, which dissent, where the boundary of cross-model consensus actually lies, and on what kinds of claims.

The benchmark is designed as a living artifact. A daemon runs continuously, adds verification cycles to the dataset, regenerates the paper from the current ledger, validates the paper using a reviewer fleet deliberately kept partially independent from the generator, and republishes the dataset and paper with a versioned identifier. Every cycle compounds. A concept DOI (10.5281/zenodo.19546459) always resolves to the latest version for long-term citation stability.

The two fleets

PolybrainBench is built on a deliberate separation between two distinct fleets. Which model responded to a claim, and which model graded the paper about those responses, are now two different questions.

Generator fleet (9 models)

Every cycle dispatches the same 9 models in parallel against a single declarative claim. All 9 responses are captured in full, stamped with a SHA-256 provenance hash, and written to the cycle directory. The generator fleet is unchanged since Sprint 2 (N~1,100 cycles) and spans 5 independent training families:

| Slug | Provider | Family | Role in the fleet |
| --- | --- | --- | --- |
| `kimi-k2-groq` | Groq | Moonshot | Frontier-quality Groq reasoning |
| `gpt-4.1-mini` | OpenAI | OpenAI | Reliable mid-tier |
| `gpt-4.1-nano` | OpenAI | OpenAI | Fast, cheap, stable |
| `grok-3-mini` | xAI | xAI | Reasoning, slowest typical response |
| `grok-4-fast` | xAI | xAI | Strict reviewer (when in the reviewer seat) |
| `qwen3-32b` | Groq | Alibaba | Has logprobs |
| `gpt-oss-120b` | Groq | OpenAI (open weights) | Strictest in-fleet reviewer (when in that seat) |
| `llama-4-scout` | Groq | Meta | Fastest cycle response |
| `llama-3.3-70b` | Groq | Meta | Most generous reviewer (when in that seat) |

Why five training families matter. Models from the same provider share training data, alignment procedures, and failure modes. Measuring cross-model disagreement across models from the same family understates the actual variance that matters in production. PolybrainBench draws its generator fleet from OpenAI, xAI, Moonshot, Meta, and Alibaba specifically to maximize independence of training lineages.

Reviewer fleet (6 models, partially disjoint)

Every validator run dispatches the regenerated paper as a topic to a different 6-model reviewer fleet. This is the central change introduced in Sprint 7. Through v8, the paper was validated by the exact same 9 models that had written the responses the paper describes. That is peer review by the author. The composite produced by that arrangement was systematically inflated relative to an independent reading of the same paper.

The Sprint 7 reviewer fleet consists of four in-fleet reviewers plus two external anchors:

| Reviewer | Provider | Family | Independence | Role |
| --- | --- | --- | --- | --- |
| `gpt-4.1-nano` | OpenAI | OpenAI | In-fleet (also a generator) | Stable mid-tier |
| `grok-4-fast` | xAI | xAI | In-fleet (also a generator) | Strict voice #1 |
| `gpt-oss-120b` | Groq | OpenAI weights | In-fleet (also a generator) | Strictest in-fleet voice |
| `llama-3.3-70b` | Groq | Meta | In-fleet (also a generator) | Most generous voice |
| `claude-sonnet-4-5` | Anthropic | Anthropic | External anchor, zero corpus contribution | Moderate voice |
| `gemini-2.5-pro` | Google | Google | External anchor, zero corpus contribution | Strictest reviewer overall on first reading |

Partial reviewer independence, not absolute. The four in-fleet reviewers also contribute per-claim responses to the dataset. They do not grade their own individual responses; they grade the paper's aggregate analysis of ten thousand cycles. They span the observed reviewer personality range from strictest (gpt-oss-120b) to most generous (llama-3.3-70b), which is why they are retained in the reviewer seat rather than swapped out entirely. The two external anchors are fully independent: Anthropic and Google are not in the generator fleet at all, and neither model has written a single response into the corpus they are reading. This is the honest framing. It is a partial disjointness, and the paper header spells it out in the same block that shows the composite.

The full fleet catalog (both fleets, all 11 unique models, 7 training families) is documented separately in docs/fleet.md.
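The counts in that sentence can be checked directly against the two fleet tables. The following is a minimal sketch, assuming the slug-to-family mapping is exactly what the tables list, with `gpt-oss-120b` counted under the OpenAI family as the generator table does. The dictionary names are illustrative and are not taken from docs/fleet.md.

```python
# Slug -> training family, transcribed from the generator and reviewer tables.
GENERATORS = {
    "kimi-k2-groq": "Moonshot",
    "gpt-4.1-mini": "OpenAI",
    "gpt-4.1-nano": "OpenAI",
    "grok-3-mini": "xAI",
    "grok-4-fast": "xAI",
    "qwen3-32b": "Alibaba",
    "gpt-oss-120b": "OpenAI",  # open weights, counted under the OpenAI family
    "llama-4-scout": "Meta",
    "llama-3.3-70b": "Meta",
}
REVIEWERS = {
    "gpt-4.1-nano": "OpenAI",
    "grok-4-fast": "xAI",
    "gpt-oss-120b": "OpenAI",
    "llama-3.3-70b": "Meta",
    "claude-sonnet-4-5": "Anthropic",
    "gemini-2.5-pro": "Google",
}

all_models = {**GENERATORS, **REVIEWERS}
in_fleet_reviewers = sorted(set(REVIEWERS) & set(GENERATORS))
external_anchors = sorted(set(REVIEWERS) - set(GENERATORS))

print(len(all_models))                 # 11 unique models
print(len(set(GENERATORS.values())))   # 5 generator families
print(len(set(all_models.values())))   # 7 families across both fleets
print(external_anchors)                # ['claude-sonnet-4-5', 'gemini-2.5-pro']
```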

The cycle

Each verification cycle is a single atomic unit:

1. A declarative claim is drawn from the topic queue (divergence-targeted; see "The topic generator" below).
2. The claim is dispatched in parallel to all 9 generator-fleet models.
3. Each model's full response text is captured and stamped with a SHA-256 provenance hash.
4. Per-model response time is recorded in milliseconds.
5. A grounding verification step confirms all expected response files exist on disk.
6. A manifest summarizing the cycle is written as the atomic transaction record.
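The dispatch, hashing, and timing steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's code (which is proprietary): `call_model` is a hypothetical stand-in for the provider API calls, the hash is assumed to cover the UTF-8 response text only, and the returned dictionary layout is invented for the example.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

GENERATOR_FLEET = [
    "kimi-k2-groq", "gpt-4.1-mini", "gpt-4.1-nano", "grok-3-mini", "grok-4-fast",
    "qwen3-32b", "gpt-oss-120b", "llama-4-scout", "llama-3.3-70b",
]


def call_model(slug: str, claim: str) -> str:
    """Hypothetical stand-in for a provider API call; returns canned text."""
    return f"[{slug}] assessment of: {claim}"


def run_one(slug: str, claim: str) -> dict:
    """Query one model, timing the call and hashing the full response text."""
    start = time.perf_counter()
    text = call_model(slug, claim)
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return {
        "model": slug,
        "response": text,
        # Assumption: the provenance hash covers the UTF-8 response text only.
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "response_ms": elapsed_ms,
    }


def run_cycle(claim: str) -> dict:
    """Dispatch one claim to every generator in parallel; results keep fleet order."""
    with ThreadPoolExecutor(max_workers=len(GENERATOR_FLEET)) as pool:
        results = list(pool.map(lambda slug: run_one(slug, claim), GENERATOR_FLEET))
    return {"claim": claim, "n_models": len(results), "results": results}
```

A thread pool is used here because the work is I/O-bound API calls; an async client would serve equally well.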

Cycle directories are stored as cycles/NNN/ with manifest.json, responses/<model-slug>.md, traces/<model-slug>-trace.json, and provenance.json. A harvest script unifies the cycle directories into the public ledger (public-ledger.jsonl), one row per cycle.
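As a rough sketch of that layout and the harvest step: the example below writes `manifest.json` and `responses/<model-slug>.md` for each cycle, runs the grounding check before the manifest is written, and then concatenates the manifests into `public-ledger.jsonl`. The manifest fields (`cycle`, `claim`, `models`) and the three-digit zero padding are assumptions made for illustration; traces and `provenance.json` are omitted.

```python
import json
from pathlib import Path
from typing import Dict


def write_cycle(root: Path, number: int, claim: str, responses: Dict[str, str]) -> Path:
    """Write one cycle directory; the manifest is written last, after the grounding check."""
    cycle_dir = root / "cycles" / f"{number:03d}"  # assumed zero-padded directory name
    responses_dir = cycle_dir / "responses"
    responses_dir.mkdir(parents=True, exist_ok=True)
    for slug, text in responses.items():
        (responses_dir / f"{slug}.md").write_text(text, encoding="utf-8")

    # Grounding verification: every expected response file must exist on disk.
    missing = [s for s in responses if not (responses_dir / f"{s}.md").is_file()]
    if missing:
        raise FileNotFoundError(f"cycle {number}: missing responses for {missing}")

    manifest = {"cycle": number, "claim": claim, "models": sorted(responses)}  # hypothetical schema
    (cycle_dir / "manifest.json").write_text(json.dumps(manifest), encoding="utf-8")
    return cycle_dir


def harvest(root: Path) -> Path:
    """Unify all cycle manifests into a JSONL ledger, one row per cycle."""
    ledger = root / "public-ledger.jsonl"
    with ledger.open("w", encoding="utf-8") as out:
        for manifest_path in sorted((root / "cycles").glob("*/manifest.json")):
            row = json.loads(manifest_path.read_text(encoding="utf-8"))
            out.write(json.dumps(row) + "\n")
    return ledger
```

Writing the manifest last means a directory without a manifest can be treated as an incomplete cycle and skipped by the harvest.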

The topic generator (divergence-targeted)

A claim that every model trivially agrees on yields zero disagreement signal. A claim where independent models actually diverge on numbers, dates, attributions, or interpretations is where the measurement lives.

The topic generator scores each candidate claim on two independent 0-to-0.5 factors:

Factor A: well-formed declarative shell

- Length in the 10 to 30 word range
- Single sentence with terminal punctuation
- No first-person pronouns
- Declarative rather than interrogative

Factor B: signal density

- Specific numbers (2+ digits, decimals, or percentages)
- Specific years in the 1500 to 2029 range
- Two or more mid-sentence capitalized words (named entities)
- Two or more long words (technical vocabulary)
- Attribution or action verbs from a curated list

Every claim must pass an acceptance threshold on the sum of factors A and B. A separate upstream prompt to the generator LLM explicitly requires each candidate to contain at least one of eight divergence-generating features: non-round specific numbers, contested dates, attribution claims, criterion-dependent superlatives, cross-field technical terms, near-training-cutoff events (2020 through 2025), named standards with specific effective dates, or flat-stated corrections of common misconceptions.

Trivial claims such as "Paris is the capital of France", "The element with atomic number 79 is gold", or "The chemical formula for glucose is C6H12O6" are listed as forbidden examples in the generator prompt. They yield zero inter-model disagreement signal and therefore no scientific value to this benchmark.
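The two-factor scoring described above can be approximated as follows. This is a sketch, not the benchmark's scorer. The source does not give the per-criterion weights, the acceptance threshold, the long-word length cutoff, or the curated verb list, so the equal weighting, `ACCEPT_THRESHOLD = 0.6`, the 10-character cutoff, and `CURATED_VERBS` are all assumptions, and the sentence-boundary test is a crude heuristic.

```python
import re
import string

FIRST_PERSON = {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours", "ourselves"}
# Hypothetical stand-in for the curated attribution/action verb list.
CURATED_VERBS = {"signed", "established", "discovered", "invented", "published", "ratified"}
ACCEPT_THRESHOLD = 0.6   # hypothetical; the source does not state the threshold
LONG_WORD_LEN = 10       # hypothetical cutoff for "long words"


def _tokens(claim: str) -> list:
    return [w.strip(string.punctuation) for w in claim.split()]


def factor_a(claim: str) -> float:
    """Well-formed declarative shell, 0 to 0.5 (equal weights assumed)."""
    text = claim.strip()
    tokens = _tokens(text)
    checks = [
        10 <= len(tokens) <= 30,
        # One sentence: ends in terminal punctuation, no earlier terminator followed by a space.
        bool(re.search(r"[.!?]$", text)) and not re.search(r"[.!?]\s", text),
        not any(t.lower() in FIRST_PERSON for t in tokens),
        not text.endswith("?"),
    ]
    return 0.5 * sum(checks) / len(checks)


def factor_b(claim: str) -> float:
    """Signal density, 0 to 0.5 (equal weights assumed)."""
    tokens = _tokens(claim)
    checks = [
        bool(re.search(r"\d{2,}|\d+\.\d+|\d+%", claim)),           # specific numbers
        bool(re.search(r"\b(?:1[5-9]\d{2}|20[0-2]\d)\b", claim)),  # years 1500 to 2029
        sum(t[:1].isupper() for t in tokens[1:]) >= 2,             # mid-sentence capitals
        sum(len(t) >= LONG_WORD_LEN for t in tokens) >= 2,         # long words
        any(t.lower() in CURATED_VERBS for t in tokens),           # curated verbs
    ]
    return 0.5 * sum(checks) / len(checks)


def score(claim: str) -> float:
    return factor_a(claim) + factor_b(claim)


def accept(claim: str) -> bool:
    return score(claim) >= ACCEPT_THRESHOLD


dense = ("The Treaty of Westphalia was signed in 1648 and formally established "
         "modern principles of sovereignty across Europe.")
print(score(dense), accept(dense))      # 1.0 True
trivia = "Paris is the capital of France."
print(score(trivia), accept(trivia))    # 0.375 False
```

Under these assumed weights the forbidden trivia example fails on length and on every signal-density check, which is the behavior the section describes.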

The validator

Every N cycles, the daemon regenerates the paper from the current ledger and validates the paper using the 6-model reviewer fleet as an independent panel. Each reviewer reports two dimensions: quality (Q, 0 to 100) and adversarial (A, 0 to 100). The honest composite is computed as:

composite = round(0.6 × mean(Q across reviewers) + 0.4 × mean(A across reviewers))

The weights are fixed. No single reviewer can move the composite by more than roughly two points per dimension in expectation at this fleet size. Real token cost is captured from each model's API response usage fields and summed into a per-run real cost. A validator run on the disjoint reviewer fleet costs approximately $0.040, up from $0.0073 on the previous self-review fleet, because Claude Sonnet 4.5 and Gemini 2.5 Pro are paid APIs whereas five of the nine original reviewers were free via Groq's developer tier. The cost multiplier is roughly 5.5x. The total is still only about four cents per validator run.
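To make the weighting concrete, the sketch below computes the composite for a 6-reviewer panel and shows how far a single reviewer can move it. The scores are hypothetical and are not taken from any real run. With six reviewers, each reviewer's Q score carries a weight of 0.6 / 6 = 0.1 and each A score 0.4 / 6 ≈ 0.067, so a 20-point swing in one reviewer's Q shifts the unrounded composite by 2 points. The last line recomputes the cost ratio from the two per-run figures quoted above.

```python
from statistics import mean

Q_WEIGHT, A_WEIGHT = 0.6, 0.4


def raw_composite(q_scores, a_scores):
    """Unrounded weighted mean of the two dimensions."""
    return Q_WEIGHT * mean(q_scores) + A_WEIGHT * mean(a_scores)


def composite(q_scores, a_scores):
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    return round(raw_composite(q_scores, a_scores))


# Hypothetical scores for a 6-reviewer panel (not taken from any real run).
q = [80, 70, 60, 90, 75, 55]
a = [70, 60, 50, 80, 65, 45]

baseline = raw_composite(q, a)
# One reviewer raises its Q score by 20 points; everything else is unchanged.
shifted = raw_composite([q[0] + 20] + q[1:], a)

print(composite(q, a))                 # 68
print(round(shifted - baseline, 6))    # 2.0
print(round(0.040 / 0.0073, 1))        # 5.5 (cost ratio between the two arrangements)
```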

If a reviewer call returns an HTTP 429 (rate limit) or 503 (transient overload), the validator retries up to three times with exponential backoff (3s / 6s / 12s). The 503 retry was added after Sprint 7's first reviewer run hit a Gemini transient overload and dropped to 5-of-6 valid responses; with the retry, Gemini recovered on the second attempt and the run went 6-of-6.
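A retry policy of this shape might look like the sketch below. It assumes that "up to three times" means three retries after the initial attempt (four calls in total) and that a reviewer call can be modeled as a function returning a `(status, body)` pair. The `call` signature and the injectable `sleep` parameter are illustrative choices, not the validator's actual interface.

```python
import time
from typing import Callable, Optional, Tuple

RETRYABLE_STATUSES = {429, 503}   # rate limit, transient overload
BACKOFF_SECONDS = (3, 6, 12)      # doubling delays, one per retry


def call_with_retry(
    call: Callable[[], Tuple[int, str]],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the response body on HTTP 200, or None if the call never succeeds."""
    status, body = call()
    for delay in BACKOFF_SECONDS:
        if status not in RETRYABLE_STATUSES:
            break  # success, or an error that retrying will not fix
        sleep(delay)
        status, body = call()
    return body if status == 200 else None
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A reviewer that still fails after the last retry simply contributes no score to that run.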

The self-review bias measurement

Sprint 7's highest-leverage change was the disjoint reviewer split. The change moved the composite by a measurable amount, which is itself the first honest measurement of self-review bias in this benchmark. The same paper snapshot (v8, N=10,452 cycles, frontier corpus) was validated back-to-back under both arrangements:

| Reading | Fleet | Mean Q | Mean A | Composite |
| --- | --- | --- | --- | --- |
| v8 self-reviewed | 9-model generator = reviewer | 80.6 | 68.4 | 76 |
| v13 disjoint 6-of-6 (run 1) | 4 in-fleet + Claude + Gemini | 71.17 | 60.33 | 67 |
| v15 disjoint 6-of-6 (run 2) | same | 76.83 | 65.00 | 72 |
| v16 disjoint 6-of-6 (run 3) | same | 75.00 | 67.00 | 72 |

Median of the three disjoint readings: 72. Mean: 70.3. Direct self-review inflation: 4 points on the central tendency, 4 to 9 points across individual readings. The honest disjoint composite lives in the 67-to-72 band. The prior 76 was the fleet grading its own corpus.
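The summary figures follow from the table by straightforward arithmetic. A short sketch that recomputes each composite from its mean Q and mean A, then derives the median, mean, and inflation numbers (the variable names are illustrative):

```python
from statistics import mean, median


def composite(mean_q: float, mean_a: float) -> int:
    return round(0.6 * mean_q + 0.4 * mean_a)


# (mean Q, mean A) per reading, copied from the table above.
self_review = (80.6, 68.4)
disjoint = {"v13": (71.17, 60.33), "v15": (76.83, 65.00), "v16": (75.00, 67.00)}

self_composite = composite(*self_review)
disjoint_composites = [composite(*qa) for qa in disjoint.values()]

print(self_composite)                                      # 76
print(disjoint_composites)                                 # [67, 72, 72]
print(median(disjoint_composites))                         # 72
print(round(mean(disjoint_composites), 1))                 # 70.3
print(self_composite - median(disjoint_composites))        # 4
print([self_composite - c for c in disjoint_composites])   # [9, 4, 4]
```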

Per-model scores on the first full disjoint reading (v13):

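| Reviewer | Provider | Quality (Q) | Adversarial (A) |
| --- | --- | --- | --- |
| `gpt-4.1-nano` | OpenAI | 85 | 80 |
| `grok-4-fast` | xAI | 65 | 52 |
| `gpt-oss-120b` | OpenAI weights / Groq | 58 | 45 |
| `llama-3.3-70b` | Meta / Groq | 95 | 92 |
| `claude-sonnet-4-5` | Anthropic | 72 | 58 |
| `gemini-2.5-pro` | Google | 52 | 35 |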

On its first reading of the benchmark, Gemini 2.5 Pro produced the single strictest score in the entire fleet, stricter than either in-fleet strict reviewer. Claude Sonnet 4.5 came in as a moderate mid-tier voice. Together, the two external anchors are responsible for most of the composite drop from the old 9-model self-review reading to the new 6-model disjoint reading.
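The per-model scores reproduce the v13 row of the comparison table, and the effect of the two external anchors can be illustrated by recomputing the composite without them. The second figure is a what-if calculation on the v13 scores only, not a published reading, and the `summarize` helper is illustrative.

```python
from statistics import mean

# (Quality, Adversarial) per reviewer on the v13 reading, from the table above.
V13 = {
    "gpt-4.1-nano": (85, 80),
    "grok-4-fast": (65, 52),
    "gpt-oss-120b": (58, 45),
    "llama-3.3-70b": (95, 92),
    "claude-sonnet-4-5": (72, 58),
    "gemini-2.5-pro": (52, 35),
}
EXTERNAL_ANCHORS = {"claude-sonnet-4-5", "gemini-2.5-pro"}


def summarize(scores: dict) -> tuple:
    """Return (mean Q, mean A, composite) for a set of reviewer scores."""
    mean_q = mean(q for q, _ in scores.values())
    mean_a = mean(a for _, a in scores.values())
    return round(mean_q, 2), round(mean_a, 2), round(0.6 * mean_q + 0.4 * mean_a)


in_fleet_only = {k: v for k, v in V13.items() if k not in EXTERNAL_ANCHORS}

print(summarize(V13))            # (71.17, 60.33, 67)
print(summarize(in_fleet_only))  # (75.75, 67.25, 72)
```

On this single reading, the four in-fleet reviewers alone would have produced 72, so the two anchors account for 5 of the 9 points between the self-reviewed 76 and the disjoint 67.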

Publication rule: Matthew Effect, no threshold

The published paper is the current canonical artifact, always. There is no threshold gate on the composite. The only condition that blocks publication is if the validator returned a null composite because one of the two dimensions was unparseable for all reviewers at once, which has not occurred since the parser bug in the earliest paper era was resolved.

The reason for the no-threshold rule: a 72-composite paper in the world compounds citations faster than a hypothetical 95-composite paper in a private directory. Every minute of threshold-waiting is a minute of lost discovery indexing. The truth of whatever the composite actually is, prominently displayed in the paper's header blockquote, is more credible than a hidden number.

Sprint history and trajectory

| Sprint | Cycles added | Cumulative N | Reviewer fleet | Composite | Corpus era |
| --- | --- | --- | --- | --- | --- |
| Sprint 2 | 749 | 1,103 | 9/9 self-review | 74 | trivia |
| Sprint 5 | 3,000 | 5,816 | 9/9 self-review | 76 | trivia (mode-collapsed) |
| Sprint 6 | 5,000 | 10,449 | 9/9 self-review | 75 | frontier (divergence-targeted) |
| Manual regen v8 | 0 (re-validate) | 10,452 | 9/9 self-review | 76 | frontier |
| Sprint 7 v13 | 0 (re-validate) | 10,452 | 6/6 disjoint | 67 | frontier |
| Sprint 7 v15 | 0 (re-validate) | 10,452 | 6/6 disjoint | 72 | frontier |
| Sprint 7 v16 | 0 (re-validate) | 10,452 | 6/6 disjoint | 72 | frontier |

The three Sprint 7 readings are three back-to-back validator runs against the same paper on the same disjoint reviewer fleet. The variance across the three readings (67, 72, 72) comes from temperature sampling on the reviewer side, not from a change in the paper itself. The median is the published number.

Repeated 9-of-9 self-review runs on an identical paper vary by approximately ±1 composite point under the old arrangement. The 6-of-6 disjoint fleet appears to have slightly higher sampling variance (approximately ±3 points, based on three readings). More readings will tighten the band.

Reproducing the benchmark

Every step is reproducible from the published artifacts:

- Dataset: download the unified ledger from the Zenodo deposit (DOI 10.5281/zenodo.19546460) or from the Hugging Face mirror at andysalvo/polybrainbench-v8.
- Methodology: this document is the full specification.
- Canonical pages: each verification cycle has a stable URL at polylogicai.com/trust/claim/<slug> with schema.org Dataset and FAQPage JSON-LD pointing back to the paper DOI.
- Generator fleet reproduction: any researcher with API access to OpenAI, xAI, and Groq can implement the same 9-model dispatch against the same claims from the ledger and verify the responses match within model sampling variance.
- Reviewer fleet reproduction: add API access to Anthropic and Google for the two external anchors. All six reviewer endpoints are public APIs. No gated access is required.
- Validator reproduction: the composite formula is simple enough to implement in approximately 20 lines: dispatch the paper as a topic to each reviewer, parse their Q and A scores, compute round(0.6 × mean(Q) + 0.4 × mean(A)).
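The parse-and-aggregate half of the validator can be sketched as follows. The source does not specify how reviewers format their scores, so the reply format here (a line such as `Q: 85` and a line such as `A: 80`) is a hypothetical convention, as are the function names. Dispatching the paper to each reviewer is left out. The null-composite case mirrors the publication rule described above: if one dimension is unparseable for every reviewer, no composite is produced.

```python
import re
from statistics import mean
from typing import List, Optional

# Hypothetical reply format: one line like "Q: 85" and one line like "A: 80".
SCORE_RE = {
    dim: re.compile(rf"^\s*{dim}\s*[:=]\s*(\d{{1,3}})\b", re.MULTILINE)
    for dim in ("Q", "A")
}


def parse_score(reply: str, dim: str) -> Optional[int]:
    """Extract one 0-to-100 score from a reviewer reply, or None if unparseable."""
    match = SCORE_RE[dim].search(reply)
    if not match:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 100 else None


def validate(replies: List[str]) -> Optional[int]:
    """Compute the composite from reviewer replies; None means a null composite."""
    q = [s for s in (parse_score(r, "Q") for r in replies) if s is not None]
    a = [s for s in (parse_score(r, "A") for r in replies) if s is not None]
    if not q or not a:
        return None  # one dimension was unparseable for every reviewer
    return round(0.6 * mean(q) + 0.4 * mean(a))
```

Fed the six v13 score pairs in that format, `validate` returns 67, matching the published reading.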

Complementary, not competitive

PolybrainBench is deliberately not a capability benchmark. It does not claim that any model in the fleet is better than any other model, and it does not claim any aggregate "Polybrain" system outscores any frontier model on any task. The question it answers is one that cannot be answered by a single model: where do independently-trained models disagree on the same claim?

That question becomes more valuable, not less, as frontier models improve. If every model agrees on a claim, the claim is high-consensus and the fleet adds no information beyond a single model. If models disagree, the disagreement is a known-unknown marker and the fleet has measured something no single model can detect. Better frontier models shift the boundary, but the measurement of where the boundary currently sits remains informative.

Contributing

The benchmark grows via the generator daemon on a scheduled cadence. External contributions are welcome via issues for:

- Topic-generator improvements (new divergence-generating claim features)
- Alternative validator formulations (trim-mean, reviewer calibration, weighted voting)
- Additional reviewer fleet anchors from families not yet represented
- Sample canonical page improvements
- Analysis of the existing ledger (the dataset is open for secondary analysis)

Report issues at https://github.com/andysalvo/polybrain/issues.

Citation

@dataset{salvo_polybrainbench_2026,
  author       = {Salvo, Andy},
  title        = {PolybrainBench v16: A Living Benchmark for Cross-Model
                  Consensus Verification of Natural-Language Claims},
  year         = 2026,
  publisher    = {Zenodo},
  version      = {v16},
  doi          = {10.5281/zenodo.19546460},
  url          = {https://doi.org/10.5281/zenodo.19546460}
}

License

Paper, dataset summary, methodology, and canonical pages: CC-BY-4.0. Engine source code: proprietary.
