Conversation
📝 Walkthrough

Adds a full benchmarking feature: DB-backed run/round persistence, bench execution engine with cancellation and retry logic, REST APIs and server integration, an admin UI to start/cancel benchmarks, and a standalone React UI for viewing runs, live progress, rankings, and head‑to‑head details.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Admin
    participant AdminUI as Admin UI
    participant Server
    participant BenchEngine as Bench Engine
    participant Database
    participant WebSocket as WebSocket
    Admin->>AdminUI: Click Start Benchmark (rounds)
    AdminUI->>Server: POST /api/bench/start
    Server->>BenchEngine: runBench(config)
    BenchEngine->>Database: saveBenchRun(...)
    BenchEngine->>Server: update progress/status
    Server->>WebSocket: broadcast bench started/progress
    WebSocket->>AdminUI: deliver updates
    loop For each pairing and round
        BenchEngine->>BenchEngine: generate prompt, get answers, vote
        BenchEngine->>Database: saveBenchRound(...)
        BenchEngine->>Server: update completed rounds
        Server->>WebSocket: broadcast progress
    end
    BenchEngine->>Database: updateBenchRunStatus(completed/failed/cancelled)
    Server->>WebSocket: broadcast bench finished
    WebSocket->>AdminUI: final update
```

```mermaid
sequenceDiagram
    actor User
    participant BenchFrontend as Bench Frontend
    participant Server
    participant Database
    participant WebSocket
    User->>BenchFrontend: Load /bench
    BenchFrontend->>Server: GET /api/bench/runs
    Server->>Database: getBenchRuns()
    Database-->>Server: runs
    Server-->>BenchFrontend: runs JSON
    BenchFrontend->>Server: GET /api/bench/results/:runId
    Server->>Database: getBenchRounds(runId)
    Database-->>Server: rounds data
    Server-->>BenchFrontend: results JSON
    BenchFrontend->>BenchFrontend: computeRankings & computeH2H
    BenchFrontend->>User: render rankings/matrix
    BenchFrontend->>WebSocket: connect for live updates
    loop Live updates
        WebSocket->>BenchFrontend: bench progress/status
        BenchFrontend->>Server: refresh runs/results as needed
        BenchFrontend->>User: update UI
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Add round-robin benchmark flow with admin controls, database persistence, live WebSocket updates, and a new /bench results UI

Introduce a benchmark run loop with cancellation, persist runs and rounds, expose REST endpoints for start/cancel/status/results, broadcast progress over WebSocket, and add a /bench page with rankings, head‑to‑head matrix, and live progress. Server marks stale runs as error on startup.

📍 Where to Start: Start with the benchmark orchestration. (Macroscope summarized 7801db9.)
Actionable comments posted: 3
🧹 Nitpick comments (7)
server.ts (2)
527-528: The 50ms `setTimeout` is unnecessary.

`runBench` sets `activeBenchRun` synchronously before its first `await`, so `getActiveBenchRun()` will already return the run state immediately after the `.catch()` on line 523 is attached. The 50ms delay is harmless but misleading; it suggests timing fragility that doesn't exist.

♻️ Suggested simplification

```diff
 // Start in background (not awaited)
 runBench(config, broadcast).catch((err) => {
   log("ERROR", "bench", "Bench run failed", { error: err.message });
 });
-// Wait a tick for state to initialize
-await new Promise((r) => setTimeout(r, 50));
 const active = getActiveBenchRun();
```
553-583: Public bench endpoints lack rate limiting.

`/api/bench/status`, `/api/bench/runs`, and `/api/bench/results/:id` are unauthenticated (correctly, for the public bench page) but have no rate limiting. The results endpoint in particular could be expensive (it fetches all rounds for a run from SQLite). Consider applying a rate limit similar to `HISTORY_LIMIT_PER_MIN`.
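As an illustrative sketch only: a fixed-window limiter in the spirit of the `HISTORY_LIMIT_PER_MIN` pattern might look like this. The function name, key scheme, and window size are assumptions, not the server's actual code.

```typescript
// Illustrative fixed-window rate limiter; the key scheme, 60s window, and
// per-minute limit are assumptions, not the server's real implementation.
const windows = new Map<string, { count: number; resetAt: number }>();

function isRateLimited(key: string, limitPerMin: number, now = Date.now()): boolean {
  const w = windows.get(key);
  if (!w || now >= w.resetAt) {
    // New window: the first request always passes.
    windows.set(key, { count: 1, resetAt: now + 60_000 });
    return false;
  }
  w.count += 1;
  return w.count > limitPerMin;
}
```

A handler would then check something like ``isRateLimited(`bench:${ip}`, limit)`` before calling the `getBench*` functions and return a 429 when it trips.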
bench.ts (1)
132-168: `Promise.all` short-circuits: if model A's answer fails, model B's retry chain is abandoned.

When both answers are fetched in parallel via `Promise.all`, if one model's `withRetry` exhausts retries and throws, the other in-flight call is effectively orphaned (its result is discarded). The error message recorded may reference only the first failure. This is acceptable for a benchmark, but worth noting that the partial answer data (`answerA`/`answerB` on lines 159/161) will remain empty strings for both models even if one succeeded.
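A hedged sketch of the `Promise.allSettled` alternative: `fetchA`/`fetchB` are stand-ins for the real `withRetry(() => callGenerateAnswer(...))` chain, which isn't reproduced here.

```typescript
// Sketch of the allSettled pattern: keep whichever answer succeeded instead
// of discarding both when one sibling rejects. fetchA/fetchB are stand-ins
// for the real withRetry/callGenerateAnswer calls.
async function bothAnswers(
  fetchA: () => Promise<string>,
  fetchB: () => Promise<string>,
): Promise<{ answerA: string; answerB: string; error?: string }> {
  const [resA, resB] = await Promise.allSettled([fetchA(), fetchB()]);
  const answerA = resA.status === "fulfilled" ? resA.value : "";
  const answerB = resB.status === "fulfilled" ? resB.value : "";
  const failures = [resA, resB]
    .filter((r): r is PromiseRejectedResult => r.status === "rejected")
    .map((r) => (r.reason instanceof Error ? r.reason.message : String(r.reason)));
  return failures.length > 0
    ? { answerA, answerB, error: `Answer failed: ${failures.join("; ")}` }
    : { answerA, answerB };
}
```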
bench-frontend.tsx (2)
679-688: Inline styles on the "All Runs" heading are inconsistent with the CSS-class approach used elsewhere.

Every other heading in this file uses CSS classes (e.g., `.rankings h2`, `.matrix h2`). Consider adding a reusable class in `bench.css` instead.
53-96: `MODEL_COLORS`, `getColor`, `getLogo`, and `ModelTag` are duplicated from `frontend.tsx`.

These are verbatim copies. If model names or colors change, both files must be updated in lockstep. Consider extracting them into a shared module (e.g., `models-ui.ts`) and importing from both entry points.
bench-db.ts (1)
16-25: Consider adding an index on `bench_rounds(run_id)` for query performance.

`getBenchRounds` queries by `run_id`, and without an index this becomes a full table scan as rounds accumulate across many benchmark runs.

🔧 Suggested addition after the CREATE TABLE

```diff
 db.exec(`
   CREATE TABLE IF NOT EXISTS bench_rounds (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     run_id TEXT NOT NULL,
     pairing_index INTEGER NOT NULL,
     round_num INTEGER NOT NULL,
     data TEXT NOT NULL,
     FOREIGN KEY (run_id) REFERENCES bench_runs(id)
   );
 `);
+
+db.exec(`
+  CREATE INDEX IF NOT EXISTS idx_bench_rounds_run_id ON bench_rounds(run_id);
+`);
```
admin.tsx (1)
185-210: Unused variable: `data` on line 199 is assigned but never read.

The JSON response from `/api/bench/start` is parsed but discarded. Either remove the assignment or use it to update state directly.

🔧 Proposed fix

```diff
 if (!response.ok) {
   const text = await response.text();
   throw new Error(text || `Request failed (${response.status})`);
 }
-const data = await response.json();
+await response.json(); // consume body
 // Refresh admin status
```
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In `@bench-frontend.tsx`:
- Around lines 538-560: When a bench completes, the code refreshes the runs list but doesn't reload rounds if the top run ID equals the current `selectedRunId`. Make the completion handler also explicitly reload rounds for the selected run: after `setRuns(...)` and determining `latestId = data[0].id`, if `latestId === selectedRunId`, run the same rounds-fetch logic used in the `useEffect` (fetch "/api/bench/runs/{id}/rounds" and update `roundsData` via `setRoundsData`), or call the existing helper that loads rounds; otherwise call `setSelectedRunId(latestId)` as before.
```css
  --text-dim: #888;
  --text-muted: #444;
  --accent: #D97757;
  --serif: 'DM Serif Display', Georgia, serif;
```
Fix casing: `Georgia` should be `georgia`.

Stylelint flags `value-keyword-case`: CSS keyword values should be lowercase.
🔧 Proposed fix

```diff
-  --serif: 'DM Serif Display', Georgia, serif;
+  --serif: 'DM Serif Display', georgia, serif;
```
🧰 Tools

🪛 Stylelint (17.3.0)

[error] 17-17: Expected "Georgia" to be "georgia" (value-keyword-case)
Adds a systematic round-robin benchmark mode where every model plays against every other model (C(7,2) = 21 matchups with 7 models). Results are persisted to SQLite and displayed on a dedicated /bench page with overall rankings and a head-to-head matrix.

New files: bench-db.ts, bench.ts, bench.html, bench.css, bench-frontend.tsx
Modified: server.ts (routes + API), frontend.tsx (nav link), admin.tsx (controls)
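The 21-matchup figure follows from enumerating every unordered pair of models; an illustrative sketch of that enumeration (not necessarily the literal code in bench.ts):

```typescript
// Illustrative round-robin pairing enumeration: each unordered pair appears
// exactly once, giving C(n, 2) pairings for n models.
function roundRobinPairings<T>(models: T[]): Array<[T, T]> {
  const pairings: Array<[T, T]> = [];
  for (let i = 0; i < models.length; i++) {
    for (let j = i + 1; j < models.length; j++) {
      pairings.push([models[i]!, models[j]!]);
    }
  }
  return pairings;
}
```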
```typescript
  useEffect(() => {
    const wsProtocol = window.location.protocol === "https:" ? "wss:" : "ws:";
    const wsUrl = `${wsProtocol}//${window.location.host}/ws`;
    let ws: WebSocket;
```
🟡 Medium bench-frontend.tsx:530
WebSocket onclose fires when cleanup calls ws.close(), scheduling a reconnect after component unmounts. Consider adding a flag (e.g., let unmounted = false) set in cleanup, and checking it in onclose before scheduling the timeout.
Evidence trail:
bench-frontend.tsx lines 525-565 at commit 074c7e413e85fb33cce42e0e383746adec1427d7 - The WebSocket `onclose` handler (line 533-535) unconditionally schedules reconnect with `setTimeout(connect, 3000)`. The cleanup function (lines 562-565) calls `clearTimeout(reconnectTimer)` before `ws?.close()`, but when `ws.close()` triggers `onclose`, a new timer is created that won't be cleared.
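A framework-agnostic sketch of the suggested unmounted-flag pattern follows; `createSocket` and the returned cleanup function are assumptions standing in for the component's real `useEffect` body.

```typescript
// Sketch of the unmounted-flag fix: once cleanup runs, onclose must not
// schedule another reconnect. createSocket is a stand-in for new WebSocket(...).
type Closeable = { close(): void; onclose?: () => void };

function connectWithReconnect(createSocket: () => Closeable, delayMs = 3000): () => void {
  let unmounted = false;
  let timer: ReturnType<typeof setTimeout> | undefined;
  let ws: Closeable | undefined;

  const connect = () => {
    ws = createSocket();
    ws.onclose = () => {
      // Without this guard, cleanup's ws.close() would schedule a reconnect
      // after the component has unmounted.
      if (!unmounted) timer = setTimeout(connect, delayMs);
    };
  };
  connect();

  // Cleanup mirrors the component's useEffect teardown.
  return () => {
    unmounted = true;
    if (timer !== undefined) clearTimeout(timer);
    ws?.close();
  };
}
```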
```typescript
  );
  const shuffledNon = shuffle([...nonContestants]);
```
🟠 High bench.ts:98
Edge case: with fewer than 3 models, nonContestants is empty so prompter is undefined and runBenchRound can crash (prompter.name). Suggest validating config.models.length >= 3 up front in runBench (and/or guarding in runBenchRound) to fail fast with a clear error, or document the 3+ model requirement.
```diff
   );
+  if (nonContestants.length === 0) {
+    throw new Error("Need at least 3 models to run a bench round");
+  }
   const shuffledNon = shuffle([...nonContestants]);
```
Evidence trail:
bench.ts lines 96-100: nonContestants filtering and prompter assignment at commit 074c7e413e85fb33cce42e0e383746adec1427d7
bench.ts line 112: prompter.name access in retry label
bench.ts lines 224-262: runBench function showing no validation for minimum model count
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server.ts (1)
691-700: ⚠️ Potential issue | 🟡 Minor: Initial WS state message omits the `bench` field.

The `broadcast()` payload includes `bench: getBenchSummary()`, but the initial state sent to a newly connected client (here) does not. A client connecting mid-benchmark won't see progress until the next round completes and triggers a broadcast.

🔧 Proposed fix

```diff
 ws.send(
   JSON.stringify({
     type: "state",
     data: getClientState(),
     totalRounds: runs,
     viewerCount: clients.size,
     version: VERSION,
+    bench: getBenchSummary(),
   }),
 );
```
🧹 Nitpick comments (7)
server.ts (2)
569-583: Public bench endpoints lack rate limiting.

`/api/bench/status` and `/api/bench/runs` are unauthenticated and have no rate limiting, unlike `/api/history`. A misbehaving client could hammer these endpoints.

Consider adding rate limiting similar to the history endpoint pattern (``isRateLimited(`bench:${ip}`, ...)``).
isRateLimited(\bench:${ip}`, ...)`).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server.ts` around lines 569 - 583, These public bench endpoints (handlers around getActiveBenchRun/getBenchSummary and getBenchRuns) need the same rate-limiting protection used by the history endpoint: obtain the client's IP (the same ip variable used by the history path), call isRateLimited(`bench:${ip}`, <same limits>), and if it returns true return a 429 Response (with the same JSON content-type and Cache-Control headers). Add this check at the top of both the "/api/bench/status" and "/api/bench/runs" branches so misbehaving clients are throttled before calling getActiveBenchRun/getBenchRuns.
540-549: Fragile 50 ms sleep to capture the run ID.

`runBench` is fired-and-forgotten and the handler sleeps 50 ms hoping the run state is initialized. This works today because `activeBenchRun` is set synchronously in `runBench` before the first `await`, but it's brittle; any future refactor that introduces an early `await` would return `{ runId: null }`.

Consider having `runBench` return the `runId` synchronously (or accept a pre-generated one), so you don't need the sleep at all.

♻️ Sketch

```diff
-  runBench(config, broadcast).catch((err) => {
-    log("ERROR", "bench", "Bench run failed", { error: err.message });
-  });
-
-  await new Promise((r) => setTimeout(r, 50));
-  const active = getActiveBenchRun();
-  return new Response(JSON.stringify({ ok: true, runId: active?.id }), {
+  const runId = crypto.randomUUID();
+  runBench(config, broadcast, runId).catch((err) => {
+    log("ERROR", "bench", "Bench run failed", { error: err.message });
+  });
+
+  return new Response(JSON.stringify({ ok: true, runId }), {
```
bench-frontend.tsx (3)
55-96: `MODEL_COLORS`, `getColor`, `getLogo`, and `ModelTag` are duplicated from `frontend.tsx`.

These are copy-pasted verbatim from `frontend.tsx` (lines 57–110). If a model is added or a color changes, both files must be updated in lockstep.

Consider extracting these into a shared module (e.g., `models.ts` / `ModelTag.tsx`) that both entry points import.
679-688: Prefer CSS class over inline styles.

The `h2` in the "All Runs" panel uses inline styles that replicate the pattern already defined in `.rankings h2` and `.matrix h2` in `bench.css`. Extract a shared heading class (e.g. `.panel-section-title`) and use it here instead.
491-505: Initial runs fetch doesn't check for HTTP errors.

`r.json()` is called without checking `r.ok` first. A non-2xx response (e.g. 500) with a non-JSON body will throw a parse error, which gets caught, but the error message will be misleading ("Failed to load benchmark runs" instead of something indicating a server error).

This pattern repeats in the other `fetch` calls (lines 513, 550, 575, 585). Consider a small helper:

```typescript
async function fetchJSON<T>(url: string, init?: RequestInit): Promise<T> {
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}
```
bench-db.ts (1)
16-25: Consider adding an index on `bench_rounds.run_id`.

`getBenchRounds` filters by `run_id`. Without an index, this is a full table scan. With multiple benchmark runs each producing 21+ rounds, this will degrade over time.

🔧 Proposed fix

```diff
 db.exec(`
   CREATE TABLE IF NOT EXISTS bench_rounds (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     run_id TEXT NOT NULL,
     pairing_index INTEGER NOT NULL,
     round_num INTEGER NOT NULL,
     data TEXT NOT NULL,
     FOREIGN KEY (run_id) REFERENCES bench_runs(id)
   );
 `);
+
+db.exec(`CREATE INDEX IF NOT EXISTS idx_bench_rounds_run_id ON bench_rounds(run_id);`);
```
bench.ts (1)
132-168: `Promise.all` discards a successful answer when its sibling fails.

If model A's answer succeeds but model B's `withRetry` exhausts all 3 attempts and throws, `Promise.all` rejects immediately. The error result returned on line 153 reports `answerA: ""` even though model A's answer was actually obtained; it just never made it out of the `Promise.all` destructuring.

Consider using `Promise.allSettled` so you can capture whichever answers did succeed before recording the error.

♻️ Sketch

```diff
-  try {
-    const [ansA, ansB] = await Promise.all([
-      withRetry(
-        () => callGenerateAnswer(pairing.modelA, prompt),
-        (s) => isRealString(s, 3),
-        3,
-        `${label}:answer:${pairing.modelA.name}`,
-      ),
-      withRetry(
-        () => callGenerateAnswer(pairing.modelB, prompt),
-        (s) => isRealString(s, 3),
-        3,
-        `${label}:answer:${pairing.modelB.name}`,
-      ),
-    ]);
-    answerA = ansA;
-    answerB = ansB;
-  } catch (err: any) {
+  const [resA, resB] = await Promise.allSettled([
+    withRetry(
+      () => callGenerateAnswer(pairing.modelA, prompt),
+      (s) => isRealString(s, 3),
+      3,
+      `${label}:answer:${pairing.modelA.name}`,
+    ),
+    withRetry(
+      () => callGenerateAnswer(pairing.modelB, prompt),
+      (s) => isRealString(s, 3),
+      3,
+      `${label}:answer:${pairing.modelB.name}`,
+    ),
+  ]);
+  if (resA.status === "fulfilled") answerA = resA.value;
+  if (resB.status === "fulfilled") answerB = resB.value;
+
+  if (resA.status === "rejected" || resB.status === "rejected") {
+    const err = resA.status === "rejected" ? resA.reason : resB.reason;
     return {
       pairingIndex: pairing.index,
       roundNum,
       prompter,
       prompt,
       modelA: pairing.modelA,
       answerA,
       modelB: pairing.modelB,
       answerB,
       votes: [],
       votesA: 0,
       votesB: 0,
       winner: "tie",
-      error: `Answer failed: ${err.message}`,
+      error: `Answer failed: ${err instanceof Error ? err.message : String(err)}`,
     };
   }
```
Verify each finding against the current code and only fix it if needed. In `@bench.ts` around lines 132 - 168, The Promise.all call around withRetry(...) for callGenerateAnswer discards any successful answer when its sibling rejects; replace Promise.all with Promise.allSettled and inspect the settled results to set answerA and answerB from the fulfilled entries (referencing withRetry, callGenerateAnswer, isRealString and the local vars answerA/answerB), and if either promise rejected keep the existing error-return path but include the answers that did succeed when building the returned object (preserve pairingIndex, roundNum, prompter, prompt, modelA/modelB, votes, votesA/votesB, winner fields).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@server.ts`:
- Around line 691-700: The initial WebSocket state sent in ws.send is missing
the current bench summary so new clients miss in-progress benchmark info; update
the payload sent in the ws.send call (where getClientState() is used) to include
bench: getBenchSummary() (same field included by broadcast()) so newly connected
clients receive the current bench progress immediately.
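A minimal sketch of the suggested payload change, with stand-in stubs for the server's helpers (the shapes of `getClientState` and `getBenchSummary` are assumptions here, not the project's actual types):

```typescript
// Stubs standing in for server.ts helpers — their real shapes are assumptions.
const getClientState = () => ({ players: [] as string[], round: 0 });
const getBenchSummary = () => ({ status: "running", completedRounds: 3, totalRounds: 10 });

// Include the same `bench` field that broadcast() sends, so a freshly
// connected client sees any in-progress benchmark immediately.
const initialPayload = JSON.stringify({ ...getClientState(), bench: getBenchSummary() });
```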
---
Duplicate comments:
In `@bench-frontend.tsx`:
- Around line 538-560: The WebSocket onmessage handler (inside ws.onmessage)
updates runs and calls setSelectedRunId(data[0]!.id) on bench completion, but if
that id equals the current selectedRunId the component's useEffect that loads
roundsData (watching selectedRunId) won't re-run and roundsData stays stale;
modify the handler to force a re-fetch when a completed run is reported — e.g.
after setRuns(...) detect the newId = data[0]!.id and if newId === selectedRunId
call the rounds-refresh path explicitly (either by briefly clearing
selectedRunId then setting it back, or by invoking the same fetch logic used in
the useEffect to reload roundsData) so roundsData is refreshed even when the
selected run id is unchanged; touch functions/variables: ws.onmessage,
setActiveBench, setRuns, setSelectedRunId, selectedRunId and the rounds-loading
effect.
In `@bench.css`:
- Line 17: The CSS custom property --serif uses the keyword value Georgia with
incorrect casing; update the declaration for --serif to use lowercase keyword
georgia (i.e., change 'Georgia' to 'georgia') to satisfy the value-keyword-case
Stylelint rule while keeping the rest of the font stack ('DM Serif Display',
georgia, serif) unchanged.
In `@bench.ts`:
- Around line 96-101: The code assumes at least three models and crashes when
nonContestants is empty; add a guard after computing nonContestants/shuffledNon
to handle the case where nonContestants.length === 0 (i.e., allModels.length <
3): either early-return or skip this pairing with a clear log/error, or provide
a safe fallback (e.g., set prompter = pairing.modelA or pairing.modelB and set
voters to a non-empty array) so that calls to callGeneratePrompt never receive
undefined. Update the logic around nonContestants, shuffledNon, prompter, and
voters to implement the chosen guard/fallback and ensure downstream code
(callGeneratePrompt) receives valid values.
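One way to sketch the guard, using hypothetical local names (`pickPrompterAndVoters` and the `Model` shape are this sketch's inventions; the real bench.ts logic is inline):

```typescript
type Model = { name: string };

// Returns an error result instead of letting prompter become undefined
// when fewer than 3 models are configured.
function pickPrompterAndVoters(
  allModels: Model[],
  modelA: Model,
  modelB: Model,
): { prompter: Model; voters: Model[] } | { error: string } {
  const nonContestants = allModels.filter(
    (m) => m.name !== modelA.name && m.name !== modelB.name,
  );
  if (nonContestants.length === 0) {
    // allModels.length < 3: skip this pairing with a clear error.
    return { error: "need at least 3 models for a neutral prompter" };
  }
  const shuffledNon = [...nonContestants].sort(() => Math.random() - 0.5);
  return { prompter: shuffledNon[0]!, voters: shuffledNon };
}
```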
---
Nitpick comments:
In `@bench-db.ts`:
- Around line 16-25: Add an index on bench_rounds.run_id to speed up queries
from getBenchRounds which filter by run_id; update the table
creation/initialization logic that defines bench_rounds (and any migration/DB
init path) to create an index like "CREATE INDEX IF NOT EXISTS
idx_bench_rounds_run_id ON bench_rounds(run_id);" so lookups by run_id use the
index (optionally add run_id, round_num composite if you also frequently
order/filter by round_num).
In `@bench-frontend.tsx`:
- Around line 55-96: MODEL_COLORS, getColor, getLogo, and ModelTag are
duplicated here and in frontend.tsx; extract them into a shared module and
import it from both entry points to avoid drift. Create a new module (e.g.,
models.ts or ModelTag.tsx) that exports MODEL_COLORS, getColor, getLogo, and the
ModelTag component, update this file to import those symbols instead of
redefining them, and adjust any relative paths or typings so both
bench-frontend.tsx and frontend.tsx consume the same implementation.
- Around line 679-688: Replace the inline style block on the h2 in the "All
Runs" panel with a shared CSS class: create a new class (e.g.
.panel-section-title) in bench.css that consolidates the rules used by .rankings
h2 and .matrix h2 (font-family, font-size:13px, font-weight:700,
text-transform:uppercase, letter-spacing:1px, color:var(--text-dim),
margin-bottom:16px) and then apply className="panel-section-title" to the h2 in
bench-frontend.tsx; remove the inline style object from that h2 so the component
uses the shared CSS class instead.
- Around line 491-505: The fetch in the useEffect (which calls
fetch("/api/bench/runs") and then r.json()) doesn't check HTTP status and can
misreport server errors; add a small helper like fetchJSON<T>(url, init?) that
awaits fetch, throws when !res.ok (including status text/code) and then returns
await res.json(); replace the direct fetch(...).then(r => r.json()) usages in
the useEffect and the other fetch sites that load runs/results (the fetch calls
that currently call r.json() then setRuns, setSelectedRunId, setLoading, or
setError) to use fetchJSON and catch/handle errors to setError with a clearer
message including the HTTP status when available.
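A helper of the kind suggested above — the name and error format are this sketch's choices, not the project's:

```typescript
// Throws on non-2xx responses so callers can surface the HTTP status,
// instead of silently parsing an error page as JSON.
async function fetchJSON<T>(url: string, init?: RequestInit): Promise<T> {
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`${url} failed: HTTP ${res.status} ${res.statusText}`);
  }
  return (await res.json()) as T;
}
```

Call sites then become e.g. `const runs = await fetchJSON<BenchRun[]>("/api/bench/runs")` with a catch that feeds `setError`.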
In `@bench.ts`:
- Around line 132-168: The Promise.all call around withRetry(...) for
callGenerateAnswer discards any successful answer when its sibling rejects;
replace Promise.all with Promise.allSettled and inspect the settled results to
set answerA and answerB from the fulfilled entries (referencing withRetry,
callGenerateAnswer, isRealString and the local vars answerA/answerB), and if
either promise rejected keep the existing error-return path but include the
answers that did succeed when building the returned object (preserve
pairingIndex, roundNum, prompter, prompt, modelA/modelB, votes, votesA/votesB,
winner fields).
In `@server.ts`:
- Around line 569-583: These public bench endpoints (handlers around
getActiveBenchRun/getBenchSummary and getBenchRuns) need the same rate-limiting
protection used by the history endpoint: obtain the client's IP (the same ip
variable used by the history path), call isRateLimited(`bench:${ip}`, <same
limits>), and if it returns true return a 429 Response (with the same JSON
content-type and Cache-Control headers). Add this check at the top of both the
"/api/bench/status" and "/api/bench/runs" branches so misbehaving clients are
throttled before calling getActiveBenchRun/getBenchRuns.
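A minimal fixed-window limiter sketch; the real `isRateLimited` in server.ts (and its limits) may differ — the `bench:${ip}` key format follows the note above:

```typescript
const hits = new Map<string, number[]>();

// True once a key exceeds `max` calls inside the trailing window.
function isRateLimited(key: string, max = 30, windowMs = 60_000): boolean {
  const now = Date.now();
  const recent = (hits.get(key) ?? []).filter((t) => now - t < windowMs);
  recent.push(now);
  hits.set(key, recent);
  return recent.length > max;
}

// In each bench branch (sketch):
// if (isRateLimited(`bench:${ip}`)) {
//   return new Response(JSON.stringify({ error: "rate limited" }), {
//     status: 429,
//     headers: { "Content-Type": "application/json", "Cache-Control": "no-store" },
//   });
// }
```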
- Around line 540-549: The current handler relies on a fragile 50ms sleep to let
runBench initialize activeBenchRun; instead make the runId available
synchronously: either change runBench to accept a pre-generated runId (generate
a UUID/ID in the handler, pass it into runBench and set active run immediately)
or modify runBench to return the new runId synchronously before any async
awaits, then remove the setTimeout and use that returned runId (or the
pre-generated one) in the Response and when calling getActiveBenchRun; update
all callers of runBench to the new signature and ensure activeBenchRun is set
from that id immediately.
…mount leak
- bench.ts: Guard against crash when fewer than 3 models are configured by returning an error result instead of accessing undefined prompter
- bench-frontend.tsx: Force re-fetch round data on bench completion even when the selected run ID hasn't changed
- bench-frontend.tsx: Prevent WebSocket reconnect timer after component unmount using an unmounted flag
Summary
`/bench` page with overall rankings and a color-coded head-to-head matrix

Details
New files:
- `bench-db.ts` — Database layer (`bench_runs` + `bench_rounds` tables, crash recovery)
- `bench.ts` — Benchmark engine (pairing generation, round execution reusing existing `game.ts` exports)
- `bench.html` + `bench.css` + `bench-frontend.tsx` — Frontend with rankings table, head-to-head matrix, matchup detail view, real-time WebSocket progress

Modified files:
- `server.ts` — `/bench` route, bench API endpoints (`/api/bench/start|cancel|status|runs|results/:id`), bench status in WS broadcast, startup crash recovery
- `frontend.tsx` — "Bench" link in standings nav
- `admin.tsx` — Benchmark section with start/cancel controls and progress display

Edge cases handled:
Test plan
- `bun --hot server.ts`
- `/admin`, start a benchmark (1 round/pairing)
- `/bench`, verify progress bar updates in real-time via WebSocket

Summary by CodeRabbit