A living benchmark for cross-model consensus verification of natural-language claims.
Paper v16 · Zenodo DOI 10.5281/zenodo.19546460 · N=10,452 cycles · 9-model generator fleet · 6-model disjoint reviewer fleet · Honest composite 72 · CC-BY-4.0
Most LLM benchmarks measure a single model against a fixed answer key. PolybrainBench measures inter-model disagreement on the same declarative claim, dispatched in parallel to a standardized generator fleet drawn from 5 independent training families. The disagreement pattern across models is the signal: which models agree, which dissent, where the boundary of cross-model consensus actually lies, and on what kinds of claims.
The benchmark is designed as a living artifact. A daemon runs continuously, adds verification cycles to the dataset, regenerates the paper from the current ledger, validates the paper using a reviewer fleet deliberately kept partially independent from the generator, and republishes the dataset and paper with a versioned identifier. Every cycle compounds. A concept DOI (10.5281/zenodo.19546459) always resolves to the latest version for long-term citation stability.
PolybrainBench is built on a deliberate separation between two distinct fleets. Which model responded to a claim, and which model graded the paper about those responses, are now two different questions.
Every cycle dispatches the same 9 models in parallel against a single declarative claim. All 9 responses are captured in full, stamped with a SHA-256 provenance hash, and written to the cycle directory. The generator fleet is unchanged since Sprint 2 (N~1,100 cycles) and spans 5 independent training families:
| Slug | Provider | Family | Role in the fleet |
|---|---|---|---|
| kimi-k2-groq | Groq | Moonshot | Frontier-quality Groq reasoning |
| gpt-4.1-mini | OpenAI | OpenAI | Reliable mid-tier |
| gpt-4.1-nano | OpenAI | OpenAI | Fast, cheap, stable |
| grok-3-mini | xAI | xAI | Reasoning, slowest typical response |
| grok-4-fast | xAI | xAI | Strict reviewer (when in the reviewer seat) |
| qwen3-32b | Groq | Alibaba | Has logprobs |
| gpt-oss-120b | Groq | OpenAI (open weights) | Strictest in-fleet reviewer (when in that seat) |
| llama-4-scout | Groq | Meta | Fastest cycle response |
| llama-3.3-70b | Groq | Meta | Most generous reviewer (when in that seat) |
Why five training families matter. Models from the same provider share training data, alignment procedures, and failure modes. Measuring cross-model disagreement across models from the same family understates the actual variance that matters in production. PolybrainBench draws its generator fleet from OpenAI, xAI, Moonshot, Meta, and Alibaba specifically to maximize independence of training lineages.
Every validator run dispatches the regenerated paper as a topic to a different 6-model reviewer fleet. This is the central change introduced in Sprint 7. Through v8, the paper was validated by the exact same 9 models that had written the responses the paper describes. That is peer review by the author. The composite produced by that arrangement was systematically inflated relative to an independent reading of the same paper.
The Sprint 7 reviewer fleet consists of four in-fleet reviewers plus two external anchors:
| Reviewer | Provider | Family | Independence | Role |
|---|---|---|---|---|
| gpt-4.1-nano | OpenAI | OpenAI | In-fleet (also a generator) | Stable mid-tier |
| grok-4-fast | xAI | xAI | In-fleet (also a generator) | Strict voice #1 |
| gpt-oss-120b | Groq | OpenAI weights | In-fleet (also a generator) | Strictest in-fleet voice |
| llama-3.3-70b | Groq | Meta | In-fleet (also a generator) | Most generous voice |
| claude-sonnet-4-5 | Anthropic | Anthropic | External anchor, zero corpus contribution | Moderate voice |
| gemini-2.5-pro | Google | Google | External anchor, zero corpus contribution | Strictest reviewer overall on first reading |
Partial reviewer independence, not absolute. The four in-fleet reviewers also contribute per-claim responses to the dataset. They do not grade their own individual responses; they grade the paper's aggregate analysis of ten thousand cycles. They span the observed reviewer personality range from strictest (gpt-oss-120b) to most generous (llama-3.3-70b), which is why they are retained in the reviewer seat rather than swapped out entirely. The two external anchors are fully independent: Anthropic and Google are not in the generator fleet at all, and neither model has written a single response into the corpus they are reading. This is the honest framing. It is a partial disjointness, and the paper header spells it out in the same block that shows the composite.
The full fleet catalog (both fleets, all 11 unique models, 7 training families) is documented separately in docs/fleet.md.
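The fleet counts quoted above can be checked with a small sketch. The mapping below is transcribed from the two fleet tables; the dictionary names are illustrative only, and the Google family for gemini-2.5-pro is taken from the prose ("Anthropic and Google are not in the generator fleet") rather than from the catalog in docs/fleet.md.

```python
# Slug -> training family, transcribed from the generator and reviewer tables.
# Variable names are illustrative; they are not taken from the engine source.
GENERATOR_FLEET = {
    "kimi-k2-groq": "Moonshot",
    "gpt-4.1-mini": "OpenAI",
    "gpt-4.1-nano": "OpenAI",
    "grok-3-mini": "xAI",
    "grok-4-fast": "xAI",
    "qwen3-32b": "Alibaba",
    "gpt-oss-120b": "OpenAI",  # open weights, listed under the OpenAI family
    "llama-4-scout": "Meta",
    "llama-3.3-70b": "Meta",
}
REVIEWER_FLEET = {
    "gpt-4.1-nano": "OpenAI",
    "grok-4-fast": "xAI",
    "gpt-oss-120b": "OpenAI",
    "llama-3.3-70b": "Meta",
    "claude-sonnet-4-5": "Anthropic",
    "gemini-2.5-pro": "Google",  # family inferred from the prose, see lead-in
}

# Union of both fleets, keyed by slug so shared models are counted once.
all_models = {**GENERATOR_FLEET, **REVIEWER_FLEET}

# Reviewers that also generate responses versus fully external anchors.
in_fleet_reviewers = sorted(set(REVIEWER_FLEET) & set(GENERATOR_FLEET))
external_anchors = sorted(set(REVIEWER_FLEET) - set(GENERATOR_FLEET))

print(len(all_models), "unique models")
print(len(set(all_models.values())), "training families")
print("external anchors:", external_anchors)
```

The set intersection makes the "partial disjointness" explicit: four reviewer slugs overlap with the generator fleet and two do not.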
Each verification cycle is a single atomic unit:
- A declarative claim is drawn from the topic queue (divergence-targeted, see §topic generator)
- The claim is dispatched in parallel to all 9 generator-fleet models
- Each model's full response text is captured and stamped with a SHA-256 provenance hash
- Per-model response time is recorded in milliseconds
- A grounding verification step confirms all expected response files exist on disk
- A manifest summarizing the cycle is written as the atomic transaction record
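
The steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the engine's implementation (the engine source is proprietary): `run_cycle`, `call_model`, and every manifest field name are hypothetical, and the grounding check here inspects the in-memory result rather than files on disk.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def run_cycle(claim, models, call_model):
    """Dispatch one claim to every model in parallel and build a manifest dict.

    call_model(slug, claim) -> str is a hypothetical stand-in for an API client.
    """

    def one(slug):
        start = time.perf_counter()
        text = call_model(slug, claim)
        elapsed_ms = round((time.perf_counter() - start) * 1000)
        # SHA-256 provenance hash over the full response text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return slug, {"text": text, "sha256": digest, "response_ms": elapsed_ms}

    # One worker per model so all requests are in flight at the same time.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        responses = dict(pool.map(one, models))

    # Grounding check (in-memory variant): every expected model must have
    # produced a non-empty response.
    missing = [m for m in models if m not in responses or not responses[m]["text"]]
    return {
        "claim": claim,
        "models": list(models),
        "responses": responses,
        "grounded": not missing,
        "missing": missing,
    }


# Usage with a fake client so the sketch runs without network access.
def fake_client(slug, claim):
    return f"{slug} says: {claim}"


manifest = run_cycle("Example claim.", ["model-a", "model-b"], fake_client)
print(manifest["grounded"], sorted(manifest["responses"]))
```

Injecting `call_model` keeps the dispatch logic testable without credentials; a real client would replace `fake_client`.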
Cycle directories are stored as cycles/NNN/ with manifest.json, responses/<model-slug>.md, traces/<model-slug>-trace.json, and provenance.json. A harvest script unifies the cycle directories into the public ledger (public-ledger.jsonl), one row per cycle.
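A harvest step of this kind might look like the sketch below. It assumes only what the paragraph states (a cycles/NNN/manifest.json layout and a one-row-per-cycle JSONL ledger); the `harvest` function, its behavior on incomplete cycles, and the manifest contents used in the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def harvest(cycles_dir, ledger_path):
    """Write one JSON line per cycle directory that contains a manifest.json."""
    rows = 0
    with open(ledger_path, "w", encoding="utf-8") as ledger:
        for cycle_dir in sorted(Path(cycles_dir).iterdir()):
            manifest_file = cycle_dir / "manifest.json"
            if not manifest_file.is_file():
                continue  # assumption: incomplete cycles are skipped, not guessed at
            manifest = json.loads(manifest_file.read_text(encoding="utf-8"))
            ledger.write(json.dumps(manifest, ensure_ascii=False) + "\n")
            rows += 1
    return rows


# Usage against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("001", "002"):
        (root / "cycles" / name).mkdir(parents=True)
        (root / "cycles" / name / "manifest.json").write_text(
            json.dumps({"cycle": name}), encoding="utf-8"
        )
    (root / "cycles" / "003").mkdir()  # no manifest, so it is skipped
    ledger_file = root / "public-ledger.jsonl"
    row_count = harvest(root / "cycles", ledger_file)
    ledger_lines = ledger_file.read_text(encoding="utf-8").splitlines()

print(row_count, "rows written")
```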
A claim that every model trivially agrees on yields zero disagreement signal. A claim where independent models actually diverge on numbers, dates, attributions, or interpretations is where the measurement lives.
The topic generator scores each candidate claim on two independent 0-to-0.5 factors:
Factor A: well-formed declarative shell
- Length in the 10 to 30 word range
- Single sentence with terminal punctuation
- No first-person pronouns
- Declarative rather than interrogative
Factor B: signal density
- Specific numbers (2+ digits, decimals, or percentages)
- Specific years in the 1500 to 2029 range
- Two or more mid-sentence capitalized words (named entities)
- Two or more long words (technical vocabulary)
- Attribution or action verbs from a curated list
Every claim must pass an acceptance threshold on the sum of factors A and B. A separate upstream prompt to the generator LLM explicitly requires each candidate to contain at least one of eight divergence-generating features: non-round specific numbers, contested dates, attribution claims, criterion-dependent superlatives, cross-field technical terms, near-training-cutoff events (2020 through 2025), named standards with specific effective dates, or flat-stated corrections of common misconceptions.
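A rough sketch of the two-factor scoring follows. The document does not state how the individual checks are weighted, what the acceptance threshold is, how long a "long word" must be, or which verbs are on the curated list, so the equal weighting, `ACCEPT_THRESHOLD`, `LONG_WORD_LEN`, and `ATTRIBUTION_VERBS` below are all assumptions chosen for illustration.

```python
import re

# All three constants are hypothetical; the source does not publish them.
ATTRIBUTION_VERBS = {"published", "invented", "discovered", "founded", "proposed"}
ACCEPT_THRESHOLD = 0.6
LONG_WORD_LEN = 9


def factor_a(claim):
    """Well-formed declarative shell, scored 0 to 0.5 (equal weights assumed)."""
    text = claim.strip()
    checks = [
        10 <= len(text.split()) <= 30,
        # Ends with terminal punctuation and has no sentence break inside.
        bool(re.search(r"[.!]$", text)) and not re.search(r"[.!?]\s+\S", text),
        not re.search(r"\b(?:I|[Mm]e|[Mm]y|[Ww]e|[Oo]ur|[Uu]s)\b", text),
        not text.endswith("?"),
    ]
    return 0.5 * sum(checks) / len(checks)


def factor_b(claim):
    """Signal density, scored 0 to 0.5 (equal weights assumed)."""
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", claim)
    mid_caps = [w for w in words[1:] if w[0].isupper()]
    long_words = [w for w in words if len(w) >= LONG_WORD_LEN]
    checks = [
        bool(re.search(r"\d{2,}|\d+\.\d+|\d+%", claim)),
        bool(re.search(r"\b(?:1[5-9]\d{2}|20[0-2]\d)\b", claim)),
        len(mid_caps) >= 2,
        len(long_words) >= 2,
        any(w.lower() in ATTRIBUTION_VERBS for w in words),
    ]
    return 0.5 * sum(checks) / len(checks)


def accept(claim):
    return factor_a(claim) + factor_b(claim) >= ACCEPT_THRESHOLD


rich = ("The ISO 8601 standard was first published by the International "
        "Organization for Standardization in 1988.")
trivia = "Paris is the capital of France."
print(accept(rich), accept(trivia))
```

Under these assumed settings the trivia example fails on length and on every signal-density check, which matches the intent of the threshold even if the real weights differ.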
Trivia claims such as "Paris is the capital of France", "The element with atomic number 79 is gold", or "The chemical formula for glucose is C6H12O6" are listed as forbidden examples in the generator prompt. They yield zero inter-model disagreement signal and therefore zero scientific value to this benchmark.
Every N cycles, the daemon regenerates the paper from the current ledger and validates the paper using the 6-model reviewer fleet as an independent panel. Each reviewer reports two dimensions: quality (Q, 0 to 100) and adversarial (A, 0 to 100). The honest composite is computed as:
composite = round(0.6 × mean(Q across reviewers) + 0.4 × mean(A across reviewers))
The weights are fixed. No single reviewer can move the composite by more than roughly two points per dimension in expectation at this fleet size. Real token cost is captured from each model's API response usage fields and summed into a per-run real cost. A validator run on the disjoint reviewer fleet costs approximately $0.040, up from $0.0073 on the previous self-review fleet, because Claude Sonnet 4.5 and Gemini 2.5 Pro are paid APIs whereas five of the nine original reviewers were free via Groq's developer tier. The cost multiplier is roughly 5x. At about four cents, a validator run is no longer sub-penny, but it remains cheap enough to repeat on every regeneration.
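As a worked check of the formula, the sketch below recomputes the composite from the per-reviewer scores reported for the v13 disjoint reading. The function name, the treatment of unparseable scores as `None`, and the round-half-up convention are assumptions; the source only writes `round(...)` without specifying tie-breaking.

```python
import math
from statistics import mean


def composite(scores):
    """scores: list of (quality, adversarial) pairs; None marks an unparseable value.

    Returns None when a whole dimension is unparseable for every reviewer.
    """
    q = [s[0] for s in scores if s[0] is not None]
    a = [s[1] for s in scores if s[1] is not None]
    if not q or not a:
        return None
    value = 0.6 * mean(q) + 0.4 * mean(a)
    # Assumption: round half up. Python's built-in round() rounds halves to even.
    return math.floor(value + 0.5)


# Per-reviewer (Q, A) scores from the v13 disjoint reading.
v13_scores = [(85, 80), (65, 52), (58, 45), (95, 92), (72, 58), (52, 35)]
v13_composite = composite(v13_scores)

# Cost ratio between the disjoint and self-review validator runs.
cost_ratio = 0.040 / 0.0073

print(v13_composite, round(cost_ratio, 1))
```

The mean Q is 71.17 and the mean A is 60.33, so the weighted sum is about 66.83, which rounds to the published 67.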
If a reviewer call returns an HTTP 429 (rate limit) or 503 (transient overload), the validator retries up to three times with exponential backoff (3s / 6s / 12s). The 503 retry was added after Sprint 7's first reviewer run hit a Gemini transient overload and dropped to 5-of-6 valid responses; with the retry, Gemini recovered on the second attempt and the run went 6-of-6.
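The retry policy can be sketched as below. The status codes and the 3s / 6s / 12s schedule come from the paragraph above; `call_with_retry`, the `send` callable, and the injectable `sleep` are hypothetical names used so the example runs without a network or real delays.

```python
import time

RETRYABLE_STATUSES = {429, 503}  # rate limit, transient overload
BACKOFF_SECONDS = (3, 6, 12)     # up to three retries, doubling each time


def call_with_retry(send, sleep=time.sleep):
    """Call send() -> (status_code, body), retrying on 429/503 with backoff."""
    status, body = send()
    for delay in BACKOFF_SECONDS:
        if status not in RETRYABLE_STATUSES:
            break
        sleep(delay)
        status, body = send()
    return status, body


# Usage: a fake reviewer that is overloaded once, then recovers.
attempts = iter([(503, "overloaded"), (200, "ok")])
waits = []
result = call_with_retry(lambda: next(attempts), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter lets a test record the delays instead of actually waiting 21 seconds in the worst case.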
Sprint 7's leveraged change was the disjoint reviewer split. The change moved the composite by a measurable amount, which is itself the first honest measurement of self-review bias in this benchmark. The same paper snapshot (v8, N=10,452 cycles, frontier corpus) was validated back-to-back under both arrangements:
| Reading | Fleet | Mean Q | Mean A | Composite |
|---|---|---|---|---|
| v8 self-reviewed | 9-model generator = reviewer | 80.6 | 68.4 | 76 |
| v13 disjoint 6-of-6 (run 1) | 4 in-fleet + Claude + Gemini | 71.17 | 60.33 | 67 |
| v15 disjoint 6-of-6 (run 2) | same | 76.83 | 65.00 | 72 |
| v16 disjoint 6-of-6 (run 3) | same | 75.00 | 67.00 | 72 |
Median of the three disjoint readings: 72. Mean: 70.3. Direct self-review inflation: 4 points on the central tendency, 4 to 9 points across individual readings. The honest disjoint composite lives in the 67-to-72 band. The prior 76 was the fleet grading its own corpus.
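The summary figures in the preceding paragraph follow directly from the table; a short sketch of the arithmetic, with illustrative variable names:

```python
from statistics import mean, median

self_review = 76          # v8 composite under 9-model self-review
disjoint = [67, 72, 72]   # v13, v15, v16 composites under the disjoint fleet

disjoint_median = median(disjoint)
disjoint_mean = round(mean(disjoint), 1)

# Inflation relative to the median, and the spread across individual readings.
inflation_central = self_review - disjoint_median
inflation_range = (self_review - max(disjoint), self_review - min(disjoint))

print(disjoint_median, disjoint_mean, inflation_central, inflation_range)
```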
Per-model scores on the first full disjoint reading (v13):
| Reviewer | Provider | Quality | Adversarial |
|---|---|---|---|
| gpt-4.1-nano | OpenAI | 85 | 80 |
| grok-4-fast | xAI | 65 | 52 |
| gpt-oss-120b | OpenAI weights / Groq | 58 | 45 |
| llama-3.3-70b | Meta / Groq | 95 | 92 |
| claude-sonnet-4-5 | Anthropic | 72 | 58 |
| gemini-2.5-pro | Google | 52 | 35 |
On its first reading of the benchmark, Gemini 2.5 Pro produced the single strictest score in the entire fleet, stricter than either in-fleet strict reviewer. Claude Sonnet 4.5 came in as a moderate mid-tier voice. Together, the two external anchors are responsible for most of the composite drop from the old 9-model self-review reading to the new 6-model disjoint reading.
The published paper is the current canonical artifact, always. There is no threshold gate on the composite. The only condition that blocks publication is if the validator returned a null composite because one of the two dimensions was unparseable for all reviewers at once, which has not occurred since the parser bug in the earliest paper era was resolved.
The reason for the no-threshold rule: a 72-composite paper in the world compounds citations faster than a hypothetical 95-composite paper in a private directory. Every minute of threshold-waiting is a minute of lost discovery indexing. The truth of whatever the composite actually is, prominently displayed in the paper's header blockquote, is more credible than a hidden number.
| Sprint | Cycles added | Cumulative N | Reviewer fleet | Composite | Corpus era |
|---|---|---|---|---|---|
| Sprint 2 | 749 | 1,103 | 9/9 self-review | 74 | trivia |
| Sprint 5 | 3,000 | 5,816 | 9/9 self-review | 76 | trivia (mode-collapsed) |
| Sprint 6 | 5,000 | 10,449 | 9/9 self-review | 75 | frontier (divergence-targeted) |
| Manual regen v8 | 0 (re-validate) | 10,452 | 9/9 self-review | 76 | frontier |
| Sprint 7 v13 | 0 (re-validate) | 10,452 | 6/6 disjoint | 67 | frontier |
| Sprint 7 v15 | 0 (re-validate) | 10,452 | 6/6 disjoint | 72 | frontier |
| Sprint 7 v16 | 0 (re-validate) | 10,452 | 6/6 disjoint | 72 | frontier |
The three Sprint 7 readings are back-to-back validator runs against the same paper on the same disjoint reviewer fleet. The variance across the three readings (67, 72, 72) comes from temperature sampling across the reviewer ensemble, not from any change in the paper itself. The median is the published number.
Repeated 9-of-9 self-review runs on an identical paper vary by approximately ±1 composite point under the old arrangement. The 6-of-6 disjoint fleet appears to have slightly higher sampling variance (approximately ±3 points, based on three readings). More readings will tighten the band.
Every step is reproducible from the published artifacts:
- Dataset: download the unified ledger from the Zenodo deposit (DOI 10.5281/zenodo.19546460) or from the Hugging Face mirror at andysalvo/polybrainbench-v8.
- Methodology: this document is the full specification.
- Canonical pages: each verification cycle has a stable URL at polylogicai.com/trust/claim/<slug> with schema.org Dataset and FAQPage JSON-LD pointing back to the paper DOI.
- Generator fleet reproduction: any researcher with API access to OpenAI, xAI, and Groq can implement the same 9-model dispatch against the same claims from the ledger and verify the responses match within model sampling variance.
- Reviewer fleet reproduction: add API access to Anthropic and Google for the two external anchors. All six reviewer endpoints are public APIs. No gated access is required.
- Validator reproduction: the composite formula is simple enough to implement in approximately 20 lines: dispatch the paper as a topic to each reviewer, parse their Q and A scores, compute round(0.6 × mean(Q) + 0.4 × mean(A)).
PolybrainBench is deliberately not a capability benchmark. It does not claim that any model in the fleet is better than any other model, and it does not claim any aggregate "Polybrain" system outscores any frontier model on any task. The question it answers is one that cannot be answered by a single model: where do independently-trained models disagree on the same claim?
That question becomes more valuable, not less, as frontier models improve. If every model agrees on a claim, the claim is high-consensus and the fleet adds no information beyond a single model. If models disagree, the disagreement is a known-unknown marker and the fleet has measured something no single model can detect. Better frontier models shift the boundary, but the measurement of where the boundary currently sits remains informative.
The benchmark grows via the generator daemon on a scheduled cadence. External contributions are welcome via issues for:
- Topic-generator improvements (new divergence-generating claim features)
- Alternative validator formulations (trim-mean, reviewer calibration, weighted voting)
- Additional reviewer fleet anchors from families not yet represented
- Sample canonical page improvements
- Analysis of the existing ledger (the dataset is open for secondary analysis)
Report issues at https://github.com/andysalvo/polybrain/issues.
@dataset{salvo_polybrainbench_2026,
  author    = {Salvo, Andy},
  title     = {PolybrainBench v16: A Living Benchmark for Cross-Model
               Consensus Verification of Natural-Language Claims},
  year      = 2026,
  publisher = {Zenodo},
  version   = {v16},
  doi       = {10.5281/zenodo.19546460},
  url       = {https://doi.org/10.5281/zenodo.19546460}
}

Paper, dataset summary, methodology, and canonical pages: CC-BY-4.0. Engine source code: proprietary.