Skip to content

[Fix] count failed/empty generations as wrong instead of droppin them. #118

Merged
yl231 merged 3 commits into
mainfrom
fix-eval-coverage
May 29, 2026
Merged

[Fix] count failed/empty generations as wrong instead of droppin them. #118
yl231 merged 3 commits into
mainfrom
fix-eval-coverage

Conversation

@yl231

@yl231 yl231 commented May 29, 2026

Copy link
Copy Markdown
Contributor

The router evaluation averaged accuracy only over entries that produced a valid generation, while dividing total cost by the full regular-query count. A submission could exploit this asymmetry by marking queries it did not want scored as success=false (empty generation): those entries were silently dropped from the accuracy denominator (inflating accuracy over a self-selected subset) yet the cost was still spread across all queries (diluting cost/1K). On one submission this lifted the arena score from ~0.22 to 0.76.

Now every regular query contributes to the accuracy denominator: entries with no valid generation (success=false / missing / empty) are scored as 0 and tallied as abnormal_count. The count is logged, written to metrics.json, and surfaced in the PR evaluation comment with a warning so authors can regenerate the missing predictions and resubmit.

Routers that answered every query are unaffected (verified byte-identical scores across all currently-accepted leaderboard routers).

  • llm_evaluation/run.py: compute_router_metrics counts failed/None as 0 + abnormal_count
  • automation/process_pr_submission.py: same logic in compute_scores (defensive)
  • .github/workflows/pr-evaluation.yml: show Abnormal Entries row + warning note

… them

The router evaluation averaged accuracy only over entries that produced a
valid generation, while dividing total cost by the full regular-query count.
A submission could exploit this asymmetry by marking queries it did not want
scored as success=false (empty generation): those entries were silently
dropped from the accuracy denominator (inflating accuracy over a self-selected
subset) yet the cost was still spread across all queries (diluting cost/1K).
On one submission this lifted the arena score from ~0.22 to 0.76.

Now every regular query contributes to the accuracy denominator: entries with
no valid generation (success=false / missing / empty) are scored as 0 and
tallied as `abnormal_count`. The count is logged, written to metrics.json, and
surfaced in the PR evaluation comment with a warning so authors can regenerate
the missing predictions and resubmit.

Routers that answered every query are unaffected (verified byte-identical
scores across all currently-accepted leaderboard routers).

- llm_evaluation/run.py: compute_router_metrics counts failed/None as 0 + abnormal_count
- automation/process_pr_submission.py: same logic in compute_scores (defensive)
- .github/workflows/pr-evaluation.yml: show Abnormal Entries row + warning note

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@yl231 yl231 self-assigned this May 29, 2026
@yl231

yl231 commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the accuracy calculation in both automation/process_pr_submission.py and llm_evaluation/run.py to ensure that all regular queries contribute to the denominator. Failed or invalid generations are now scored as 0.0 and tracked via an abnormal_count metric rather than being silently omitted. The reviewer feedback recommends hardening the input validation for these untrusted inputs by explicitly checking that success is strictly True and verifying that accuracy is a valid numeric type (excluding booleans) to prevent potential runtime type errors.

Comment thread automation/process_pr_submission.py Outdated
Comment thread llm_evaluation/run.py Outdated
Louie Lu and others added 2 commits May 29, 2026 12:51
- Require generated_result.success is True (reject truthy strings like "False")
- Validate accuracy is a real number (reject bool/str) before summing; invalid -> 0 + abnormal
- Apply ruff-format to satisfy pre-commit CI

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Inline the numeric isinstance check in the branch so mypy narrows the type;
behavior unchanged (valid numeric accuracy kept, everything else -> 0 + abnormal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@yl231 yl231 merged commit f77c4be into main May 29, 2026
10 checks passed
@yl231 yl231 deleted the fix-eval-coverage branch May 29, 2026 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant