The expensive moment in the loop is when the critic has to hold all N branches in context at once to score and rank them. Observed degradation: at ~60–80K tokens of combined branch output feeding into one critic call, scoring becomes inconsistent. Cost scales with branch_count × branch_length (the total tokens in that one call), not with branch count alone.
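As a rough illustration of that arithmetic, the sketch below estimates the size of a single critic call and how many branches fit under a token budget. The function names are hypothetical, the 60,000 default is the lower edge of the observed degradation range, and the 12,000 tokens per branch in the usage lines is a made-up figure.

```python
def critic_input_tokens(branch_count: int, avg_branch_tokens: int) -> int:
    """Approximate tokens of branch output fed into one critic call."""
    return branch_count * avg_branch_tokens


def max_branches_per_call(avg_branch_tokens: int, budget_tokens: int = 60_000) -> int:
    """Largest group of branches that stays under the budget (at least 1)."""
    return max(1, budget_tokens // avg_branch_tokens)


# Illustrative numbers only: 6 branches of ~12K tokens each.
total = critic_input_tokens(6, 12_000)
group_size = max_branches_per_call(12_000)
```

With these assumed figures the single call lands inside the 60–80K range, and the budget allows groups of five.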
Three possible fixes to investigate:
- Pairwise tournament scoring (Elo-style) — each comparison stays under context limit; aggregate over O(N log N) comparisons.
- Chunked scoring with normalization across chunks — the critic scores groups of 4–5 ideas at a time, then a second pass normalizes scores across chunks.
- Hierarchical scoring — cluster-level first (which angles look promising), then idea-level within winning clusters only. Reduces total comparisons.
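A minimal sketch of the pairwise option, assuming a hypothetical `prefer(a, b)` callback that wraps one critic call over two branches and returns True when the first is better. The Swiss-style pairing, the starting rating of 1000, and the K-factor of 32 are illustrative choices, not something these notes specify.

```python
import math
from typing import Callable, Sequence


def elo_rank(
    branches: Sequence[str],
    prefer: Callable[[str, str], bool],
    k: float = 32.0,
) -> list[tuple[str, float]]:
    """Rank branches with Swiss-style pairwise rounds and Elo updates.

    `prefer(a, b)` is a hypothetical stand-in for a critic call that
    sees only two branches at a time.
    """
    ratings = {i: 1000.0 for i in range(len(branches))}
    # ceil(log2 N) rounds of floor(N/2) comparisons: O(N log N) critic calls.
    rounds = max(1, math.ceil(math.log2(max(len(branches), 2))))
    for _ in range(rounds):
        # Pair neighbours in the current ranking so close contenders meet.
        # With an odd count, the lowest-ranked branch sits the round out.
        order = sorted(ratings, key=ratings.get, reverse=True)
        for a, b in zip(order[0::2], order[1::2]):
            expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
            score_a = 1.0 if prefer(branches[a], branches[b]) else 0.0
            delta = k * (score_a - expected_a)
            ratings[a] += delta
            ratings[b] -= delta
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    return [(branches[i], ratings[i]) for i in ranked]
```

For 8 branches this makes 12 two-branch calls instead of one 8-branch call, so every call stays small at the price of more calls overall.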
Tradeoff to measure: extra critic calls vs improved scoring consistency. The eval harness can test each variant on the existing problem set.
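One possible way to put a number on scoring consistency, which these notes leave undefined: rerun the same critic variant several times on the same branches and average the Kendall rank correlation between the resulting rankings. The function names are hypothetical, and the sketch assumes every run returns a full ordering of the same items with no ties.

```python
from itertools import combinations


def kendall_tau(rank_a: list[str], rank_b: list[str]) -> float:
    """Kendall rank correlation between two orderings of the same items."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0


def consistency(rankings: list[list[str]]) -> float:
    """Mean pairwise Kendall tau over repeated runs (1.0 = identical rankings)."""
    taus = [kendall_tau(a, b) for a, b in combinations(rankings, 2)]
    return sum(taus) / len(taus) if taus else 1.0
```

Plotting this value against the number of critic calls per variant gives one view of the tradeoff; any threshold for "consistent enough" would have to come from the eval harness results.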
Raised by u/Unlikely_Ad_8060 with concrete numbers from their own multi-agent harness (3–5 concurrent agents, raised after Anthropic doubled rate limits in May).