[Fix] count failed/empty generations as wrong instead of dropping them. #118
Merged
The router evaluation averaged accuracy only over entries that produced a valid generation, while dividing total cost by the full regular-query count. A submission could exploit this asymmetry by marking queries it did not want scored as success=false (empty generation): those entries were silently dropped from the accuracy denominator (inflating accuracy over a self-selected subset), yet the cost was still spread across all queries (diluting cost/1K). On one submission this lifted the arena score from ~0.22 to 0.76.

Now every regular query contributes to the accuracy denominator: entries with no valid generation (success=false / missing / empty) are scored as 0 and tallied as `abnormal_count`. The count is logged, written to metrics.json, and surfaced in the PR evaluation comment with a warning so authors can regenerate the missing predictions and resubmit.

Routers that answered every query are unaffected (verified byte-identical scores across all currently-accepted leaderboard routers).

- llm_evaluation/run.py: compute_router_metrics counts failed/None as 0 + abnormal_count
- automation/process_pr_submission.py: same logic in compute_scores (defensive)
- .github/workflows/pr-evaluation.yml: show Abnormal Entries row + warning note

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
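The denominator asymmetry described above can be illustrated with a small sketch. This is not the code from llm_evaluation/run.py: the flat entry layout (`success`, `accuracy` keys) and the function names `accuracy_dropping_failures` and `accuracy_counting_failures` are hypothetical, chosen only to contrast the two averaging rules.

```python
# Illustrative sketch only; the entry layout and function names are
# assumptions, not the repository's actual schema or API.

def accuracy_dropping_failures(entries):
    """Old behavior: average only over entries with a valid generation."""
    valid = [e for e in entries if e.get("success")]
    if not valid:
        return 0.0
    return sum(e["accuracy"] for e in valid) / len(valid)


def accuracy_counting_failures(entries):
    """New behavior: every entry is in the denominator; failures score 0."""
    total = 0.0
    abnormal_count = 0
    for e in entries:
        if e.get("success"):
            total += e["accuracy"]
        else:
            abnormal_count += 1  # scored as 0 and tallied
    accuracy = total / len(entries) if entries else 0.0
    return accuracy, abnormal_count


# Ten queries: three answered correctly, seven marked as failed.
entries = [{"success": True, "accuracy": 1.0}] * 3 + [{"success": False}] * 7

inflated = accuracy_dropping_failures(entries)
corrected, abnormal = accuracy_counting_failures(entries)
```

In this made-up example the old rule reports an accuracy of 1.0 over the three self-selected entries, while the new rule reports 0.3 with an `abnormal_count` of 7. A submission with no failed entries gets the same value from both functions, which is consistent with the note that routers answering every query are unaffected.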
/gemini review
Code Review
This pull request updates the accuracy calculation in both automation/process_pr_submission.py and llm_evaluation/run.py to ensure that all regular queries contribute to the denominator. Failed or invalid generations are now scored as 0.0 and tracked via an abnormal_count metric rather than being silently omitted. The reviewer feedback recommends hardening the input validation for these untrusted inputs by explicitly checking that success is strictly True and verifying that accuracy is a valid numeric type (excluding booleans) to prevent potential runtime type errors.
- Require generated_result.success is True (reject truthy strings like "False")
- Validate accuracy is a real number (reject bool/str) before summing; invalid -> 0 + abnormal
- Apply ruff-format to satisfy pre-commit CI

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Inline the numeric isinstance check in the branch so mypy narrows the type; behavior unchanged (valid numeric accuracy kept, everything else -> 0 + abnormal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
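A minimal sketch of the hardened per-entry check described in these follow-up commits, assuming each entry is a dict with a nested `generated_result` dict and a top-level `accuracy` field. That layout and the helper name `score_entry` are hypothetical; the merged code may be structured differently.

```python
# Sketch under stated assumptions: the entry layout and the name
# `score_entry` are illustrative, not taken from the repository.

def score_entry(entry):
    """Return (score, is_abnormal) for one untrusted submission entry."""
    generated = entry.get("generated_result")
    accuracy = entry.get("accuracy")
    if (
        isinstance(generated, dict)
        # Identity check: the string "False" is truthy but is not True.
        and generated.get("success") is True
        # Inline isinstance so a type checker can narrow `accuracy`.
        and isinstance(accuracy, (int, float))
        # bool is a subclass of int in Python, so exclude it explicitly.
        and not isinstance(accuracy, bool)
    ):
        return float(accuracy), False
    # Everything else is scored as 0 and counted as abnormal.
    return 0.0, True


ok = score_entry({"generated_result": {"success": True}, "accuracy": 0.5})
truthy_string = score_entry({"generated_result": {"success": "False"}, "accuracy": 1.0})
bool_accuracy = score_entry({"generated_result": {"success": True}, "accuracy": True})
```

The explicit `bool` exclusion matters because `isinstance(True, int)` is true in Python, so a bare numeric check would accept `True` as an accuracy of 1. This sketch does not handle non-finite values such as NaN.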