
Add benchmark feature#8

Open
aravhawk wants to merge 2 commits into T3-Content:main from aravhawk:feat/bench

Conversation

@aravhawk

@aravhawk aravhawk commented Feb 23, 2026

Summary

  • Adds a systematic round-robin benchmark mode where every model plays against every other model (C(7,2) = 21 matchups with 7 models)
  • Results are persisted to SQLite and displayed on a dedicated /bench page with overall rankings and a color-coded head-to-head matrix
  • Benchmark can be started/cancelled from the admin panel with configurable rounds-per-pairing (1/3/5)
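
For context, round-robin pairing over n models yields C(n, 2) unordered matchups. A minimal sketch of the pairing step (the actual generatePairings export in bench.ts may differ in signature and types):

```typescript
// Hypothetical sketch of round-robin pairing; the PR's actual
// generatePairings in bench.ts may differ in signature and types.
interface Pairing {
  modelA: string;
  modelB: string;
}

function generatePairings(models: string[]): Pairing[] {
  const pairings: Pairing[] = [];
  // Every unordered pair exactly once: C(n, 2) = n * (n - 1) / 2 pairings.
  for (let i = 0; i < models.length; i++) {
    for (let j = i + 1; j < models.length; j++) {
      pairings.push({ modelA: models[i]!, modelB: models[j]! });
    }
  }
  return pairings;
}
```

With 7 models this produces the 21 matchups mentioned above.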

Details

New files:

  • bench-db.ts — Database layer (bench_runs + bench_rounds tables, crash recovery)
  • bench.ts — Benchmark engine (pairing generation, round execution reusing existing game.ts exports)
  • bench.html + bench.css + bench-frontend.tsx — Frontend with rankings table, head-to-head matrix, matchup detail view, real-time WebSocket progress

Modified files:

  • server.ts — /bench route, bench API endpoints (/api/bench/start|cancel|status|runs|results/:id), bench status in WS broadcast, startup crash recovery
  • frontend.tsx — "Bench" link in standings nav
  • admin.tsx — Benchmark section with start/cancel controls and progress display

Edge cases handled:

  • Only one bench at a time (409 conflict)
  • Graceful cancel between rounds, partial results preserved
  • Per-round retry (3 attempts), failed rounds recorded with error
  • Server restart marks stale running bench runs as error
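
The per-round retry above could look roughly like this (a hedged sketch; the real withRetry helper in bench.ts is not shown in this PR excerpt and may differ):

```typescript
// Hedged sketch of per-round retry with 3 attempts; the real withRetry
// helper in bench.ts is not shown in this excerpt and may differ.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err; // remember the failure; a failed round records this error
    }
  }
  throw lastErr;
}
```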

Test plan

  • Start server with bun --hot server.ts
  • Navigate to /admin, start a benchmark (1 round/pairing)
  • Navigate to /bench, verify progress bar updates in real-time via WebSocket
  • After completion, verify rankings table and head-to-head matrix
  • Click a matrix cell to see matchup details (prompt, answers, votes)
  • Test cancel mid-run, verify partial results preserved
  • Restart server, verify stale running bench is marked as error

Summary by CodeRabbit

  • New Features
    • Full benchmarking system to run, track, and persist model comparison runs.
    • Start/cancel benchmark runs from Admin with configurable rounds per pairing.
    • Dedicated Benchmark UI page with dark-themed layout, run selector, progress, and past runs.
    • Results view: overall rankings, head‑to‑head matrix, and round‑by‑round matchup detail.
    • Real‑time updates during active runs and historical results browsing.

@coderabbitai

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

Walkthrough

Adds a full benchmarking feature: DB-backed run/round persistence, bench execution engine with cancellation and retry logic, REST APIs and server integration, an admin UI to start/cancel benchmarks, and a standalone React UI for viewing runs, live progress, rankings, and head‑to‑head details.

Changes

Cohort / File(s) Summary
Benchmark core
bench.ts
New orchestration: pairing generation, per‑pair multi‑round execution with retries, voting, persistence hooks, cancelable active run tracking, and public APIs (runBench, cancelBench, getActiveBenchRun, generatePairings).
Persistence layer
bench-db.ts
New SQLite schema and helpers for bench_runs and bench_rounds; typed row interfaces and CRUD-like functions including save/update/run/round retrieval and stale-run cleanup.
Frontend app (bench UI)
bench-frontend.tsx, bench.css, bench.html
New standalone benchmark UI: React app, styling, HTML entry; displays runs, live progress via WebSocket, rankings, head‑to‑head matrix, matchup detail views, and run list/selection.
Admin integration
admin.tsx
Admin panel extended with Bench link and benchmark controls: rounds selector, start/cancel handlers, status/progress display, and link to live results; integrates with server APIs and snapshot bench field.
Server & API
server.ts
New /bench route and bench HTML served; REST endpoints under /api/bench/* for start/cancel/status/runs/results with auth/validation; integrates bench state into broadcasts and admin snapshots; marks stale runs as error on startup.
Navigation
frontend.tsx
Added Bench link in main navigation (small UI change).

Sequence Diagram(s)

sequenceDiagram
    actor Admin
    participant AdminUI as Admin UI
    participant Server
    participant BenchEngine as Bench Engine
    participant Database
    participant WebSocket as WebSocket

    Admin->>AdminUI: Click Start Benchmark (rounds)
    AdminUI->>Server: POST /api/bench/start
    Server->>BenchEngine: runBench(config)
    BenchEngine->>Database: saveBenchRun(...)
    BenchEngine->>Server: update progress/status
    Server->>WebSocket: broadcast bench started/progress
    WebSocket->>AdminUI: deliver updates
    loop For each pairing and round
        BenchEngine->>BenchEngine: generate prompt, get answers, vote
        BenchEngine->>Database: saveBenchRound(...)
        BenchEngine->>Server: update completed rounds
        Server->>WebSocket: broadcast progress
    end
    BenchEngine->>Database: updateBenchRunStatus(completed/failed/cancelled)
    Server->>WebSocket: broadcast bench finished
    WebSocket->>AdminUI: final update
sequenceDiagram
    actor User
    participant BenchFrontend as Bench Frontend
    participant Server
    participant Database
    participant WebSocket

    User->>BenchFrontend: Load /bench
    BenchFrontend->>Server: GET /api/bench/runs
    Server->>Database: getBenchRuns()
    Database-->>Server: runs
    Server-->>BenchFrontend: runs JSON
    BenchFrontend->>Server: GET /api/bench/results/:runId
    Server->>Database: getBenchRounds(runId)
    Database-->>Server: rounds data
    Server-->>BenchFrontend: results JSON
    BenchFrontend->>BenchFrontend: computeRankings & computeH2H
    BenchFrontend->>User: render rankings/matrix
    BenchFrontend->>WebSocket: connect for live updates
    loop Live updates
        WebSocket->>BenchFrontend: bench progress/status
        BenchFrontend->>Server: refresh runs/results as needed
        BenchFrontend->>User: update UI
    end
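
The computeRankings step in the second diagram can be sketched as a win count over round results (field names here are assumptions; the actual shapes in bench-frontend.tsx are not shown in this PR page):

```typescript
// Hypothetical sketch of the ranking computation; field names are assumptions
// since computeRankings in bench-frontend.tsx is not shown on this PR page.
interface RoundResult {
  modelA: string;
  modelB: string;
  winner: string | null; // null for a tie or a failed round
}

function computeRankings(
  rounds: RoundResult[],
): { model: string; wins: number }[] {
  const wins = new Map<string, number>();
  for (const r of rounds) {
    // Ensure every contestant appears in the rankings, even with zero wins.
    for (const m of [r.modelA, r.modelB]) {
      if (!wins.has(m)) wins.set(m, 0);
    }
    if (r.winner !== null) {
      wins.set(r.winner, (wins.get(r.winner) ?? 0) + 1);
    }
  }
  return [...wins.entries()]
    .map(([model, w]) => ({ model, wins: w }))
    .sort((x, y) => y.wins - x.wins);
}
```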

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hopping through runs with a twitchy nose,

Pairings bloom where the leaderboard grows,
Rounds roll by as carrots of data fall,
WebSocket whispers, I cheer them all,
Benchmarks bloom — a rabbit's small ball of code.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 16.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title "Add benchmark feature" accurately and concisely summarizes the primary change, adding a complete round-robin benchmarking system with database persistence, backend engine, frontend UI, and admin controls. It is a short, single sentence that highlights the most important addition from the developer's perspective.


@macroscopeapp

macroscopeapp bot commented Feb 23, 2026

Add round-robin benchmark flow with admin controls, database persistence, live WebSocket updates, and a new /bench results UI

Introduce a benchmark run loop with cancellation, persist runs and rounds, expose REST endpoints for start/cancel/status/results, broadcast progress over WebSocket, and add a /bench page with rankings, head‑to‑head matrix, and live progress. Server marks stale runs as error on startup.

📍Where to Start

Start with the benchmark orchestration in runBench and API wiring in server.ts, then review persistence helpers in bench-db.ts and the UI entry in bench-frontend.tsx.


Macroscope summarized 7801db9.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (7)
server.ts (2)

527-528: The 50ms setTimeout is unnecessary.

runBench sets activeBenchRun synchronously before its first await, so getActiveBenchRun() will already return the run state immediately after the .catch() on line 523 is attached. The 50ms delay is harmless but misleading — it suggests timing fragility that doesn't exist.

♻️ Suggested simplification
      // Start in background (not awaited)
      runBench(config, broadcast).catch((err) => {
        log("ERROR", "bench", "Bench run failed", { error: err.message });
      });

-     // Wait a tick for state to initialize
-     await new Promise((r) => setTimeout(r, 50));
      const active = getActiveBenchRun();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server.ts` around lines 527 - 528, Remove the 50ms artificial delay: delete
the "await new Promise((r) => setTimeout(r, 50));" and its comment; since
runBench sets activeBenchRun synchronously before its first await,
getActiveBenchRun() will already return the run state and the .catch() is
already attached, so no additional tick wait is required—ensure the surrounding
logic that calls runBench and then reads getActiveBenchRun() remains unchanged.

553-583: Public bench endpoints lack rate limiting.

/api/bench/status, /api/bench/runs, and /api/bench/results/:id are unauthenticated (correctly, for the public bench page) but have no rate limiting. The results endpoint in particular could be expensive (fetches all rounds for a run from SQLite). Consider applying a rate limit similar to HISTORY_LIMIT_PER_MIN.
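
A fixed-window limiter along the lines suggested above might look like this (hedged sketch; the actual isRateLimited helper and HISTORY_LIMIT_PER_MIN constant in server.ts may look different):

```typescript
// Hedged sketch of a fixed-window limiter like the one suggested above; the
// actual isRateLimited/HISTORY_LIMIT_PER_MIN in server.ts may look different.
const windows = new Map<string, { start: number; count: number }>();

function isRateLimited(key: string, limit: number, windowMs = 60_000): boolean {
  const now = Date.now();
  const w = windows.get(key);
  if (!w || now - w.start >= windowMs) {
    // Start a new window for this key (e.g. a `bench:${ip}` key).
    windows.set(key, { start: now, count: 1 });
    return false;
  }
  w.count++;
  return w.count > limit;
}
```

A handler would then return a 429 Response when the check trips, before touching SQLite.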

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server.ts` around lines 553 - 583, These public bench endpoints lack rate
limiting; add a limit check using the existing HISTORY_LIMIT_PER_MIN logic (or
reuse the same rate limiter mechanism) before handling requests for
"/api/bench/status", "/api/bench/runs" and "/api/bench/results/:id". For each
handler (where getBenchSummary, getBenchRuns, getBenchRun and getBenchRounds are
called) enforce the HISTORY_LIMIT_PER_MIN rate window (keyed by requester IP or
same keying scheme used elsewhere), return a 429 Response when the limit is
exceeded, and only proceed to call the getBench* functions when the rate check
passes.
bench.ts (1)

132-168: Promise.all short-circuits: if model A's answer fails, model B's retry chain is abandoned.

When both answers are fetched in parallel via Promise.all, if one model's withRetry exhausts retries and throws, the other in-flight call is effectively orphaned (its result is discarded). The error message recorded may reference only the first failure. This is acceptable for a benchmark, but worth noting that partial answer data (answerA/answerB on lines 159/161) will remain empty strings for both models even if one succeeded.
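
The Promise.allSettled approach this implies could be sketched like so (fetchBothAnswers and its callbacks are hypothetical stand-ins for the withRetry-wrapped callGenerateAnswer calls in bench.ts):

```typescript
// Sketch of the Promise.allSettled approach described above; fetchBothAnswers
// and its callbacks are hypothetical stand-ins for the withRetry-wrapped
// callGenerateAnswer calls in bench.ts.
async function fetchBothAnswers(
  getA: () => Promise<string>,
  getB: () => Promise<string>,
): Promise<{ answerA: string; answerB: string; errors: string[] }> {
  // allSettled never short-circuits, so neither retry chain is abandoned.
  const [a, b] = await Promise.allSettled([getA(), getB()]);
  const errors: string[] = [];
  let answerA = "";
  let answerB = "";
  if (a.status === "fulfilled") answerA = a.value;
  else errors.push(`modelA: ${a.reason}`);
  if (b.status === "fulfilled") answerB = b.value;
  else errors.push(`modelB: ${b.reason}`);
  return { answerA, answerB, errors };
}
```

This preserves the successful model's answer even when the other side exhausts its retries.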

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench.ts` around lines 132 - 168, The current parallel fetch uses Promise.all
which short-circuits and discards the other in-flight withRetry call if one
rejects; change the logic to start both retry promises independently and await
them with Promise.allSettled (or await each separately) so each model's
callGenerateAnswer via withRetry (refer to withRetry(() =>
callGenerateAnswer(pairing.modelA, prompt), ...) and the pairing.modelB
equivalent) completes or fails independently; then set answerA/answerB from the
settled results (checking status === "fulfilled" and extracting value, otherwise
capture the individual error messages) and preserve partial answers instead of
returning immediately on the first rejection.
bench-frontend.tsx (2)

679-688: Inline styles on the "All Runs" heading are inconsistent with the CSS-class approach used elsewhere.

Every other heading in this file uses CSS classes (e.g., .rankings h2, .matrix h2). Consider adding a reusable class in bench.css instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench-frontend.tsx` around lines 679 - 688, The "All Runs" h2 uses inline
styles instead of the project's CSS-class approach; remove the inline style
object from the h2 in bench-frontend.tsx and add a className (e.g., "all-runs"
or "all-runs-heading") to that element, then define the same visual rules
(font-family, font-size:13, font-weight:700, text-transform:uppercase,
letter-spacing:1, color:var(--text-dim), margin-bottom:16) in bench.css under
the new selector (e.g., .all-runs h2 or .all-runs-heading) so styling is
consistent with other headings like .rankings h2 and .matrix h2.

53-96: MODEL_COLORS, getColor, getLogo, and ModelTag are duplicated from frontend.tsx.

These are verbatim copies. If model names or colors change, both files must be updated in lockstep. Consider extracting them into a shared module (e.g., models-ui.ts) and importing from both entry points.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench-frontend.tsx` around lines 53 - 96, MODEL_COLORS, getColor, getLogo and
ModelTag are duplicated here; extract these symbols into a shared UI module
(e.g., models-ui.ts or models-ui.tsx) and import them in both bench-frontend.tsx
and frontend.tsx. Move the constant MODEL_COLORS and the functions
getColor/getLogo plus the ModelTag component into the new module, export them
(keeping their names), then replace the local definitions in bench-frontend.tsx
with imports from the shared module so both entry points consume the single
source of truth.
bench-db.ts (1)

16-25: Consider adding an index on bench_rounds(run_id) for query performance.

getBenchRounds queries by run_id, and without an index this becomes a full table scan as rounds accumulate across many benchmark runs.

🔧 Suggested addition after the CREATE TABLE
 db.exec(`
   CREATE TABLE IF NOT EXISTS bench_rounds (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     run_id TEXT NOT NULL,
     pairing_index INTEGER NOT NULL,
     round_num INTEGER NOT NULL,
     data TEXT NOT NULL,
     FOREIGN KEY (run_id) REFERENCES bench_runs(id)
   );
 `);
+
+db.exec(`
+  CREATE INDEX IF NOT EXISTS idx_bench_rounds_run_id ON bench_rounds(run_id);
+`);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench-db.ts` around lines 16 - 25, Add a non-unique index on
bench_rounds(run_id) to speed queries from getBenchRounds by run_id; after the
CREATE TABLE block for bench_rounds execute a CREATE INDEX IF NOT EXISTS
statement (e.g. idx_bench_rounds_run_id on bench_rounds(run_id)) so future
queries filtering by run_id use the index and avoid full table scans.
admin.tsx (1)

185-210: Unused variable: data on line 199 is assigned but never read.

The JSON response from /api/bench/start is parsed but discarded. Either remove the assignment or use it to update state directly.

🔧 Proposed fix
      if (!response.ok) {
        const text = await response.text();
        throw new Error(text || `Request failed (${response.status})`);
      }
-     const data = await response.json();
+     await response.json(); // consume body
      // Refresh admin status
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@admin.tsx` around lines 185 - 210, In onBenchStart, the JSON response from
fetch("/api/bench/start") is parsed into the unused variable data; either remove
the line "const data = await response.json();" or consume it (e.g., use its
contents to update state instead of discarding) — if you intend to update the
admin snapshot with the start response, pass that parsed object into
setSnapshot(data) (or merge it with the result of requestAdminJson), otherwise
delete the unused data assignment to avoid the lint warning; keep the
surrounding error handling and finally block (setPending/setError) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bench-frontend.tsx`:
- Around line 538-560: When a bench completes the code refreshes the runs list
but doesn’t reload rounds if the top run ID equals the current selectedRunId, so
make the completion handler also explicitly reload rounds for the selected run:
after setRuns(...) and determining latestId = data[0].id, if latestId ===
selectedRunId call the same rounds-fetch logic used in the useEffect (fetch
"/api/bench/runs/{id}/rounds" and update roundsData via setRoundsData) or call
the existing helper that loads rounds; otherwise setSelectedRunId(latestId) as
before. Ensure you reference and use selectedRunId, setSelectedRunId and
setRoundsData (or the existing rounds-loading function) in the completion
branch.

In `@bench.css`:
- Line 17: The CSS custom property --serif uses the font-family value 'DM Serif
Display', Georgia, serif with an uppercase keyword "Georgia" that violates
value-keyword-case; update the keyword to lowercase by changing "Georgia" to
"georgia" in the --serif declaration so the value becomes 'DM Serif Display',
georgia, serif.

In `@bench.ts`:
- Around line 96-101: runBenchRound can crash when nonContestants is empty
(models length <= 2) because shuffledNon[0]! is assumed non-null; add a guard
early in runBenchRound: if allModels.length <= 2 (or nonContestants.length ===
0) handle the edge by either returning/skipping the round or choosing a safe
fallback prompter and voters (for example set prompter = pairing.modelA and
voters = [pairing.modelA, pairing.modelB] or duplicate a model), and remove the
non-null assertion on shuffledNon[0]; ensure callGeneratePrompt is only invoked
with a defined prompter and that voters is a valid non-empty array.

---

Nitpick comments:
In `@admin.tsx`:
- Around line 185-210: In onBenchStart, the JSON response from
fetch("/api/bench/start") is parsed into the unused variable data; either remove
the line "const data = await response.json();" or consume it (e.g., use its
contents to update state instead of discarding) — if you intend to update the
admin snapshot with the start response, pass that parsed object into
setSnapshot(data) (or merge it with the result of requestAdminJson), otherwise
delete the unused data assignment to avoid the lint warning; keep the
surrounding error handling and finally block (setPending/setError) unchanged.

In `@bench-db.ts`:
- Around line 16-25: Add a non-unique index on bench_rounds(run_id) to speed
queries from getBenchRounds by run_id; after the CREATE TABLE block for
bench_rounds execute a CREATE INDEX IF NOT EXISTS statement (e.g.
idx_bench_rounds_run_id on bench_rounds(run_id)) so future queries filtering by
run_id use the index and avoid full table scans.

In `@bench-frontend.tsx`:
- Around line 679-688: The "All Runs" h2 uses inline styles instead of the
project's CSS-class approach; remove the inline style object from the h2 in
bench-frontend.tsx and add a className (e.g., "all-runs" or "all-runs-heading")
to that element, then define the same visual rules (font-family, font-size:13,
font-weight:700, text-transform:uppercase, letter-spacing:1,
color:var(--text-dim), margin-bottom:16) in bench.css under the new selector
(e.g., .all-runs h2 or .all-runs-heading) so styling is consistent with other
headings like .rankings h2 and .matrix h2.
- Around line 53-96: MODEL_COLORS, getColor, getLogo and ModelTag are duplicated
here; extract these symbols into a shared UI module (e.g., models-ui.ts or
models-ui.tsx) and import them in both bench-frontend.tsx and frontend.tsx. Move
the constant MODEL_COLORS and the functions getColor/getLogo plus the ModelTag
component into the new module, export them (keeping their names), then replace
the local definitions in bench-frontend.tsx with imports from the shared module
so both entry points consume the single source of truth.

In `@bench.ts`:
- Around line 132-168: The current parallel fetch uses Promise.all which
short-circuits and discards the other in-flight withRetry call if one rejects;
change the logic to start both retry promises independently and await them with
Promise.allSettled (or await each separately) so each model's callGenerateAnswer
via withRetry (refer to withRetry(() => callGenerateAnswer(pairing.modelA,
prompt), ...) and the pairing.modelB equivalent) completes or fails
independently; then set answerA/answerB from the settled results (checking
status === "fulfilled" and extracting value, otherwise capture the individual
error messages) and preserve partial answers instead of returning immediately on
the first rejection.

In `@server.ts`:
- Around line 527-528: Remove the 50ms artificial delay: delete the "await new
Promise((r) => setTimeout(r, 50));" and its comment; since runBench sets
activeBenchRun synchronously before its first await, getActiveBenchRun() will
already return the run state and the .catch() is already attached, so no
additional tick wait is required—ensure the surrounding logic that calls
runBench and then reads getActiveBenchRun() remains unchanged.
- Around line 553-583: These public bench endpoints lack rate limiting; add a
limit check using the existing HISTORY_LIMIT_PER_MIN logic (or reuse the same
rate limiter mechanism) before handling requests for "/api/bench/status",
"/api/bench/runs" and "/api/bench/results/:id". For each handler (where
getBenchSummary, getBenchRuns, getBenchRun and getBenchRounds are called)
enforce the HISTORY_LIMIT_PER_MIN rate window (keyed by requester IP or same
keying scheme used elsewhere), return a 429 Response when the limit is exceeded,
and only proceed to call the getBench* functions when the rate check passes.

Comment on bench.css, lines 14 to 17:

--text-dim: #888;
--text-muted: #444;
--accent: #D97757;
--serif: 'DM Serif Display', Georgia, serif;

⚠️ Potential issue | 🟡 Minor

Fix casing: Georgia should be georgia.

Stylelint flags value-keyword-case: CSS keyword values should be lowercase.

🔧 Proposed fix
-  --serif: 'DM Serif Display', Georgia, serif;
+  --serif: 'DM Serif Display', georgia, serif;
🧰 Tools
🪛 Stylelint (17.3.0)

[error] 17-17: Expected "Georgia" to be "georgia" (value-keyword-case)

(value-keyword-case)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench.css` at line 17, The CSS custom property --serif uses the font-family
value 'DM Serif Display', Georgia, serif with an uppercase keyword "Georgia"
that violates value-keyword-case; update the keyword to lowercase by changing
"Georgia" to "georgia" in the --serif declaration so the value becomes 'DM Serif
Display', georgia, serif.

useEffect(() => {
const wsProtocol = window.location.protocol === "https:" ? "wss:" : "ws:";
const wsUrl = `${wsProtocol}//${window.location.host}/ws`;
let ws: WebSocket;

🟡 Medium bench-frontend.tsx:530

WebSocket onclose fires when cleanup calls ws.close(), scheduling a reconnect after component unmounts. Consider adding a flag (e.g., let unmounted = false) set in cleanup, and checking it in onclose before scheduling the timeout.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file bench-frontend.tsx around line 530:

WebSocket `onclose` fires when cleanup calls `ws.close()`, scheduling a reconnect after component unmounts. Consider adding a flag (e.g., `let unmounted = false`) set in cleanup, and checking it in `onclose` before scheduling the timeout.

Evidence trail:
bench-frontend.tsx lines 525-565 at commit 074c7e413e85fb33cce42e0e383746adec1427d7 - The WebSocket `onclose` handler (line 533-535) unconditionally schedules reconnect with `setTimeout(connect, 3000)`. The cleanup function (lines 562-565) calls `clearTimeout(reconnectTimer)` before `ws?.close()`, but when `ws.close()` triggers `onclose`, a new timer is created that won't be cleared.
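
The unmounted-flag fix described above can be sketched framework-free so the reconnect logic is visible (in the real code this lives inside a React useEffect in bench-frontend.tsx, and SocketLike stands in for the browser WebSocket):

```typescript
// Framework-free sketch of the unmounted-flag fix described above; the real
// code lives inside a React useEffect in bench-frontend.tsx, and SocketLike
// stands in for the browser WebSocket.
type SocketLike = { close(): void; onclose: (() => void) | null };

function connectWithReconnect(
  makeSocket: () => SocketLike,
  reconnectDelayMs = 3000,
): () => void {
  let unmounted = false;
  let timer: ReturnType<typeof setTimeout> | undefined;
  let ws!: SocketLike;

  function connect() {
    ws = makeSocket();
    ws.onclose = () => {
      // Skip the reconnect once cleanup has run, so no timer outlives unmount.
      if (!unmounted) timer = setTimeout(connect, reconnectDelayMs);
    };
  }
  connect();

  // Cleanup: flip the flag BEFORE close(), since close() fires onclose.
  return () => {
    unmounted = true;
    if (timer !== undefined) clearTimeout(timer);
    ws.close();
  };
}
```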

Comment on lines +98 to +99
);
const shuffledNon = shuffle([...nonContestants]);

🟠 High bench.ts:98

Edge case: with fewer than 3 models, nonContestants is empty so prompter is undefined and runBenchRound can crash (prompter.name). Suggest validating config.models.length >= 3 up front in runBench (and/or guarding in runBenchRound) to fail fast with a clear error, or document the 3+ model requirement.

 );
+  if (nonContestants.length === 0) {
+    throw new Error("Need at least 3 models to run a bench round");
+  }
   const shuffledNon = shuffle([...nonContestants]);
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file bench.ts around lines 98-99:

Edge case: with fewer than 3 models, `nonContestants` is empty so `prompter` is undefined and `runBenchRound` can crash (`prompter.name`). Suggest validating `config.models.length >= 3` up front in `runBench` (and/or guarding in `runBenchRound`) to fail fast with a clear error, or document the 3+ model requirement.

Evidence trail:
bench.ts lines 96-100: nonContestants filtering and prompter assignment at commit 074c7e413e85fb33cce42e0e383746adec1427d7
bench.ts line 112: prompter.name access in retry label
bench.ts lines 224-262: runBench function showing no validation for minimum model count


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server.ts (1)

691-700: ⚠️ Potential issue | 🟡 Minor

Initial WS state message omits bench field.

The broadcast() payload includes bench: getBenchSummary(), but the initial state sent to a newly connected client (here) does not. A client connecting mid-benchmark won't see progress until the next round completes and triggers a broadcast.

🔧 Proposed fix
       ws.send(
         JSON.stringify({
           type: "state",
           data: getClientState(),
           totalRounds: runs,
           viewerCount: clients.size,
           version: VERSION,
+          bench: getBenchSummary(),
         }),
       );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server.ts` around lines 691 - 700, The initial WebSocket state sent in
ws.send is missing the current bench summary so new clients miss in-progress
benchmark info; update the payload sent in the ws.send call (where
getClientState() is used) to include bench: getBenchSummary() (same field
included by broadcast()) so newly connected clients receive the current bench
progress immediately.
🧹 Nitpick comments (7)
server.ts (2)

569-583: Public bench endpoints lack rate limiting.

/api/bench/status and /api/bench/runs are unauthenticated and have no rate limiting, unlike /api/history. A misbehaving client could hammer these endpoints.

Consider adding rate limiting similar to the history endpoint pattern (isRateLimited(`bench:${ip}`, ...)).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server.ts` around lines 569 - 583, These public bench endpoints (handlers
around getActiveBenchRun/getBenchSummary and getBenchRuns) need the same
rate-limiting protection used by the history endpoint: obtain the client's IP
(the same ip variable used by the history path), call
isRateLimited(`bench:${ip}`, <same limits>), and if it returns true return a 429
Response (with the same JSON content-type and Cache-Control headers). Add this
check at the top of both the "/api/bench/status" and "/api/bench/runs" branches
so misbehaving clients are throttled before calling
getActiveBenchRun/getBenchRuns.

540-549: Fragile 50 ms sleep to capture the run ID.

runBench is fired-and-forgotten and the handler sleeps 50 ms hoping the run state is initialized. This works today because activeBenchRun is set synchronously in runBench before the first await, but it's brittle — any future refactor that introduces an early await would return { runId: null }.

Consider having runBench return the runId synchronously (or accept a pre-generated one), so you don't need the sleep at all.

♻️ Sketch
-      runBench(config, broadcast).catch((err) => {
-        log("ERROR", "bench", "Bench run failed", { error: err.message });
-      });
-
-      await new Promise((r) => setTimeout(r, 50));
-      const active = getActiveBenchRun();
-      return new Response(JSON.stringify({ ok: true, runId: active?.id }), {
+      const runId = crypto.randomUUID();
+      runBench(config, broadcast, runId).catch((err) => {
+        log("ERROR", "bench", "Bench run failed", { error: err.message });
+      });
+
+      return new Response(JSON.stringify({ ok: true, runId }), {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server.ts` around lines 540 - 549, The current handler relies on a fragile
50ms sleep to let runBench initialize activeBenchRun; instead make the runId
available synchronously: either change runBench to accept a pre-generated runId
(generate a UUID/ID in the handler, pass it into runBench and set active run
immediately) or modify runBench to return the new runId synchronously before any
async awaits, then remove the setTimeout and use that returned runId (or the
pre-generated one) in the Response and when calling getActiveBenchRun; update
all callers of runBench to the new signature and ensure activeBenchRun is set
from that id immediately.
bench-frontend.tsx (3)

55-96: MODEL_COLORS, getColor, getLogo, and ModelTag are duplicated from frontend.tsx.

These are copy-pasted verbatim from frontend.tsx (lines 57–110). If a model is added or a color changes, both files must be updated in lockstep.

Consider extracting these into a shared module (e.g., models.ts / ModelTag.tsx) that both entry points import.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench-frontend.tsx` around lines 55 - 96, MODEL_COLORS, getColor, getLogo,
and ModelTag are duplicated here and in frontend.tsx; extract them into a shared
module and import it from both entry points to avoid drift. Create a new module
(e.g., models.ts or ModelTag.tsx) that exports MODEL_COLORS, getColor, getLogo,
and the ModelTag component, update this file to import those symbols instead of
redefining them, and adjust any relative paths or typings so both
bench-frontend.tsx and frontend.tsx consume the same implementation.

679-688: Prefer CSS class over inline styles.

The h2 in the "All Runs" panel uses inline styles that replicate the pattern already defined in .rankings h2 and .matrix h2 in bench.css. Extract a shared heading class (e.g. .panel-section-title) and use it here instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench-frontend.tsx` around lines 679 - 688: replace the inline style block on
the h2 in the "All Runs" panel with a shared CSS class: create a new class (e.g.
.panel-section-title) in bench.css that consolidates the rules used by .rankings
h2 and .matrix h2 (font-family, font-size:13px, font-weight:700,
text-transform:uppercase, letter-spacing:1px, color:var(--text-dim),
margin-bottom:16px) and then apply className="panel-section-title" to the h2 in
bench-frontend.tsx; remove the inline style object from that h2 so the component
uses the shared CSS class instead.

491-505: Initial runs fetch doesn't check for HTTP errors.

r.json() is called without checking r.ok first. A non-2xx response (e.g. 500) with a non-JSON body will throw a parse error, which gets caught — but the error message will be misleading ("Failed to load benchmark runs" instead of something indicating a server error).

This pattern repeats in the other fetch calls (lines 513, 550, 575, 585). Consider a small helper:

async function fetchJSON<T>(url: string, init?: RequestInit): Promise<T> {
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench-frontend.tsx` around lines 491 - 505: the fetch in the useEffect (which
calls fetch("/api/bench/runs") and then r.json()) doesn't check HTTP status and
can misreport server errors; add a small helper like fetchJSON<T>(url, init?)
that awaits fetch, throws when !res.ok (including status text/code) and then
returns await res.json(); replace the direct fetch(...).then(r => r.json())
usages in the useEffect and the other fetch sites that load runs/results (the
fetch calls that currently call r.json() then setRuns, setSelectedRunId,
setLoading, or setError) to use fetchJSON and catch/handle errors to setError
with a clearer message including the HTTP status when available.
bench-db.ts (1)

16-25: Consider adding an index on bench_rounds.run_id.

getBenchRounds filters by run_id. Without an index, this is a full table scan. With multiple benchmark runs each producing 21+ rounds, this will degrade over time.

🔧 Proposed fix
 db.exec(`
   CREATE TABLE IF NOT EXISTS bench_rounds (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     run_id TEXT NOT NULL,
     pairing_index INTEGER NOT NULL,
     round_num INTEGER NOT NULL,
     data TEXT NOT NULL,
     FOREIGN KEY (run_id) REFERENCES bench_runs(id)
   );
 `);
+
+db.exec(`CREATE INDEX IF NOT EXISTS idx_bench_rounds_run_id ON bench_rounds(run_id);`);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench-db.ts` around lines 16 - 25: add an index on bench_rounds.run_id to
speed up queries from getBenchRounds which filter by run_id; update the table
creation/initialization logic that defines bench_rounds (and any migration/DB
init path) to create an index like "CREATE INDEX IF NOT EXISTS
idx_bench_rounds_run_id ON bench_rounds(run_id);" so lookups by run_id use the
index (optionally add run_id, round_num composite if you also frequently
order/filter by round_num).
bench.ts (1)

132-168: Promise.all discards a successful answer when its sibling fails.

If model A's answer succeeds but model B's withRetry exhausts all 3 attempts and throws, Promise.all rejects immediately. The error result returned on line 153 reports answerA: "" even though model A's answer was actually obtained — it just never made it out of the Promise.all destructuring.

Consider using Promise.allSettled so you can capture whichever answers did succeed before recording the error.

♻️ Sketch
-  try {
-    const [ansA, ansB] = await Promise.all([
-      withRetry(
-        () => callGenerateAnswer(pairing.modelA, prompt),
-        (s) => isRealString(s, 3),
-        3,
-        `${label}:answer:${pairing.modelA.name}`,
-      ),
-      withRetry(
-        () => callGenerateAnswer(pairing.modelB, prompt),
-        (s) => isRealString(s, 3),
-        3,
-        `${label}:answer:${pairing.modelB.name}`,
-      ),
-    ]);
-    answerA = ansA;
-    answerB = ansB;
-  } catch (err: any) {
+  const [resA, resB] = await Promise.allSettled([
+    withRetry(
+      () => callGenerateAnswer(pairing.modelA, prompt),
+      (s) => isRealString(s, 3),
+      3,
+      `${label}:answer:${pairing.modelA.name}`,
+    ),
+    withRetry(
+      () => callGenerateAnswer(pairing.modelB, prompt),
+      (s) => isRealString(s, 3),
+      3,
+      `${label}:answer:${pairing.modelB.name}`,
+    ),
+  ]);
+  if (resA.status === "fulfilled") answerA = resA.value;
+  if (resB.status === "fulfilled") answerB = resB.value;
+
+  if (resA.status === "rejected" || resB.status === "rejected") {
+    const err = resA.status === "rejected" ? resA.reason : resB.reason;
     return {
       pairingIndex: pairing.index,
       roundNum,
       prompter,
       prompt,
       modelA: pairing.modelA,
       answerA,
       modelB: pairing.modelB,
       answerB,
       votes: [],
       votesA: 0,
       votesB: 0,
       winner: "tie",
-      error: `Answer failed: ${err.message}`,
+      error: `Answer failed: ${err instanceof Error ? err.message : String(err)}`,
     };
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench.ts` around lines 132 - 168: the Promise.all call around withRetry(...)
for callGenerateAnswer discards any successful answer when its sibling rejects;
replace Promise.all with Promise.allSettled and inspect the settled results to
set answerA and answerB from the fulfilled entries (referencing withRetry,
callGenerateAnswer, isRealString and the local vars answerA/answerB), and if
either promise rejected keep the existing error-return path but include the
answers that did succeed when building the returned object (preserve
pairingIndex, roundNum, prompter, prompt, modelA/modelB, votes, votesA/votesB,
winner fields).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@server.ts`:
- Around line 691-700: The initial WebSocket state sent in ws.send is missing
the current bench summary so new clients miss in-progress benchmark info; update
the payload sent in the ws.send call (where getClientState() is used) to include
bench: getBenchSummary() (same field included by broadcast()) so newly connected
clients receive the current bench progress immediately.

---

Duplicate comments:
In `@bench-frontend.tsx`:
- Around line 538-560: The WebSocket onmessage handler (inside ws.onmessage)
updates runs and calls setSelectedRunId(data[0]!.id) on bench completion, but if
that id equals the current selectedRunId the component's useEffect that loads
roundsData (watching selectedRunId) won't re-run and roundsData stays stale;
modify the handler to force a re-fetch when a completed run is reported — e.g.
after setRuns(...) detect the newId = data[0]!.id and if newId === selectedRunId
call the rounds-refresh path explicitly (either by briefly clearing
selectedRunId then setting it back, or by invoking the same fetch logic used in
the useEffect to reload roundsData) so roundsData is refreshed even when the
selected run id is unchanged; touch functions/variables: ws.onmessage,
setActiveBench, setRuns, setSelectedRunId, selectedRunId and the rounds-loading
effect.

In `@bench.css`:
- Line 17: The CSS custom property --serif uses the keyword value Georgia with
incorrect casing; update the declaration for --serif to use lowercase keyword
georgia (i.e., change 'Georgia' to 'georgia') to satisfy the value-keyword-case
Stylelint rule while keeping the rest of the font stack ('DM Serif Display',
georgia, serif) unchanged.

In `@bench.ts`:
- Around line 96-101: The code assumes at least three models and crashes when
nonContestants is empty; add a guard after computing nonContestants/shuffledNon
to handle the case where nonContestants.length === 0 (i.e., allModels.length <
3): either early-return or skip this pairing with a clear log/error, or provide
a safe fallback (e.g., set prompter = pairing.modelA or pairing.modelB and set
voters to a non-empty array) so that calls to callGeneratePrompt never receive
undefined. Update the logic around nonContestants, shuffledNon, prompter, and
voters to implement the chosen guard/fallback and ensure downstream code
(callGeneratePrompt) receives valid values.

---

Nitpick comments:

In `@server.ts`:
- Around line 569-583: These public bench endpoints (handlers around
getActiveBenchRun/getBenchSummary and getBenchRuns) need the same rate-limiting
protection used by the history endpoint: obtain the client's IP (the same ip
variable used by the history path), call isRateLimited(`bench:${ip}`, <same
limits>), and if it returns true return a 429 Response (with the same JSON
content-type and Cache-Control headers). Add this check at the top of both the
"/api/bench/status" and "/api/bench/runs" branches so misbehaving clients are
throttled before calling getActiveBenchRun/getBenchRuns.

…mount leak

- bench.ts: Guard against crash when fewer than 3 models are configured
  by returning an error result instead of accessing undefined prompter
- bench-frontend.tsx: Force re-fetch round data on bench completion even
  when the selected run ID hasn't changed
- bench-frontend.tsx: Prevent WebSocket reconnect timer after component
  unmount using an unmounted flag
@aravhawk changed the title from "Add round-robin benchmark feature" to "Add benchmark feature" on Feb 23, 2026