[Fix] Count failed/empty generations as wrong instead of dropping them. by yl231 · Pull Request #118 · RouteWorks/RouterArena

yl231 · 2026-05-29T17:46:27Z

The router evaluation averaged accuracy only over entries that produced a valid generation, while dividing total cost by the full regular-query count. A submission could exploit this asymmetry by marking queries it did not want scored as success=false (empty generation): those entries were silently dropped from the accuracy denominator (inflating accuracy over a self-selected subset) yet the cost was still spread across all queries (diluting cost/1K). On one submission this lifted the arena score from ~0.22 to 0.76.
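To make the asymmetry concrete, here is a minimal sketch of the pre-fix arithmetic. It is an illustration, not the repository's code: `legacy_metrics`, `make_entries`, the entry fields, and all numbers are hypothetical, and it assumes dropped entries carry zero cost.

```python
def legacy_metrics(entries):
    """Hypothetical reconstruction of the pre-fix behavior described above."""
    # Accuracy is averaged only over entries with a valid generation...
    valid = [e for e in entries if e["success"]]
    accuracy = sum(e["accuracy"] for e in valid) / len(valid)
    # ...while cost is divided by the full query count.
    cost_per_1k = sum(e["cost"] for e in entries) / len(entries) * 1000
    return accuracy, cost_per_1k


def make_entries(n_correct, n_wrong, n_dropped, cost=0.5):
    """Build an illustrative result list (all values are made up)."""
    correct = [{"success": True, "accuracy": 1.0, "cost": cost}] * n_correct
    wrong = [{"success": True, "accuracy": 0.0, "cost": cost}] * n_wrong
    # Assumption: a dropped entry reports success=False and incurs no cost.
    dropped = [{"success": False, "accuracy": None, "cost": 0.0}] * n_dropped
    return correct + wrong + dropped


# Same 25 correct answers out of 100 queries in both cases.
honest = legacy_metrics(make_entries(25, 75, 0))
gamed = legacy_metrics(make_entries(25, 25, 50))
```

In this toy case, marking 50 of the 75 wrong answers as failed doubles the reported accuracy and halves the reported cost/1K, although the router answered no additional query correctly.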

Now every regular query contributes to the accuracy denominator: entries with no valid generation (success=false / missing / empty) are scored as 0 and tallied as abnormal_count. The count is logged, written to metrics.json, and surfaced in the PR evaluation comment with a warning so authors can regenerate the missing predictions and resubmit.

Routers that answered every query are unaffected (verified byte-identical scores across all currently-accepted leaderboard routers).
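The corrected rule can be sketched as follows. This is a simplified stand-in for the change in `compute_router_metrics`, not the merged code: the function name `accuracy_with_abnormal` and the flat entry layout are assumptions made for illustration.

```python
def accuracy_with_abnormal(entries):
    """Score every regular query; entries without a valid generation count as 0."""
    total = 0.0
    abnormal_count = 0
    for entry in entries:
        if entry.get("success") and entry.get("accuracy") is not None:
            total += entry["accuracy"]
        else:
            # success=false / missing / empty: contributes 0 and is tallied.
            abnormal_count += 1
    # The denominator is always the full query count.
    accuracy = total / len(entries) if entries else 0.0
    return accuracy, abnormal_count


answered = [{"success": True, "accuracy": 1.0}] * 25 + [{"success": True, "accuracy": 0.0}] * 75
dropped = [{"success": True, "accuracy": 1.0}] * 25 + [{"success": False}] * 75

full_result = accuracy_with_abnormal(answered)
dropped_result = accuracy_with_abnormal(dropped)
```

Both lists score the same accuracy because the denominator no longer depends on which entries failed; only `abnormal_count` differs, which is the value the PR surfaces in the evaluation comment.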

- llm_evaluation/run.py: compute_router_metrics counts failed/None as 0 + abnormal_count
- automation/process_pr_submission.py: same logic in compute_scores (defensive)
- .github/workflows/pr-evaluation.yml: show Abnormal Entries row + warning note

yl231 · 2026-05-29T17:47:41Z

/gemini review

gemini-code-assist

Code Review

This pull request updates the accuracy calculation in both automation/process_pr_submission.py and llm_evaluation/run.py to ensure that all regular queries contribute to the denominator. Failed or invalid generations are now scored as 0.0 and tracked via an abnormal_count metric rather than being silently omitted. The reviewer feedback recommends hardening the input validation for these untrusted inputs by explicitly checking that success is strictly True and verifying that accuracy is a valid numeric type (excluding booleans) to prevent potential runtime type errors.

- Require generated_result.success is True (reject truthy strings like "False")
- Validate accuracy is a real number (reject bool/str) before summing; invalid -> 0 + abnormal
- Apply ruff-format to satisfy pre-commit CI

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Inline the numeric isinstance check in the branch so mypy narrows the type; behavior unchanged (valid numeric accuracy kept, everything else -> 0 + abnormal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
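The hardened per-entry check from the two follow-up commits might look roughly like this. It is a sketch under stated assumptions: `score_entry` is a hypothetical helper, and placing `accuracy` next to `generated_result` in the entry is a guess at the layout, not taken from the diff.

```python
from typing import Any


def score_entry(entry: dict[str, Any]) -> tuple[float, bool]:
    """Return (score, is_abnormal) for one untrusted result entry."""
    generated = entry.get("generated_result")
    accuracy = entry.get("accuracy")
    if (
        isinstance(generated, dict)
        # Identity check: a truthy string such as "False" does not pass.
        and generated.get("success") is True
        # Inline isinstance so a type checker can narrow `accuracy` here.
        and isinstance(accuracy, (int, float))
        # bool is a subclass of int in Python, so exclude it explicitly.
        and not isinstance(accuracy, bool)
    ):
        return float(accuracy), False
    # Everything else is scored 0 and flagged as abnormal.
    return 0.0, True
```

Using `is True` instead of plain truthiness matters because submissions are untrusted JSON, and the explicit `bool` exclusion prevents `accuracy: true` from being summed as 1.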

yl231 self-assigned this May 29, 2026

gemini-code-assist Bot reviewed May 29, 2026

Comment thread automation/process_pr_submission.py Outdated

Comment thread llm_evaluation/run.py Outdated

Louie Lu and others added 2 commits May 29, 2026 12:51

yl231 merged commit f77c4be into main May 29, 2026
10 checks passed

yl231 deleted the fix-eval-coverage branch May 29, 2026 17:56

yl231 merged 3 commits into main from fix-eval-coverage
