[Feat] Support evaluation on modified prediction files by yl231 · Pull Request #48 · RouteWorks/RouterArena

yl231 · 2025-11-30T08:42:58Z

PR Summary

Enables evaluation of modified (not just newly added) prediction files. The modified test router (glm-4-air-router) was evaluated successfully; see the workflow results below.

Key Changes

Support for modified files: Both added (A) and modified (M) prediction files are now detected and evaluated.
Fork-aware comparison: To determine which router prediction files changed, the workflow compares against the fork's base branch instead of upstream main, so each submission is evaluated independently.
Force re-evaluation: Added --force flag to run the full evaluation and ensure consistent evaluation results in workflow runs.
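
As a rough illustration of the first two changes, the sketch below filters the output of `git diff --name-status <base>...HEAD` down to added (A) and modified (M) prediction files. The function name, the `predictions/` prefix, and the sample paths are assumptions made for this example; the actual workflow in this PR may detect changes differently.

```python
def changed_prediction_files(name_status: str, prefix: str) -> list[str]:
    """Return JSON paths under `prefix` whose git status is A or M.

    `name_status` is the text printed by `git diff --name-status`,
    one "<status>\t<path>" entry per line.
    """
    selected = []
    for line in name_status.splitlines():
        if not line.strip():
            continue
        status, _, path = line.partition("\t")
        # Keep added and modified files only; deletions and renames are skipped.
        if status in {"A", "M"} and path.startswith(prefix) and path.endswith(".json"):
            selected.append(path)
    return selected


# Hypothetical diff output, for illustration only.
sample = (
    "A\tpredictions/new-router.json\n"
    "M\tpredictions/glm-4-air-router.json\n"
    "D\tpredictions/old-router.json\n"
    "M\tREADME.md\n"
)
print(changed_prediction_files(sample, "predictions/"))
```

Before this PR only added files were picked up, which in terms of this sketch corresponds to accepting just the `A` status.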

…ion option
* Updated the PR evaluation workflow to detect changed prediction files more accurately by comparing against the fork's base branch.
* Added a `--force` option to the evaluation script to allow re-evaluation of all entries, even if they have already been evaluated.
* Minor adjustments to the GLM-4-air-router prediction JSON to test the above functionalities.
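
A minimal sketch of how a `--force` option of this kind could be wired up with `argparse`. Only the flag name comes from this PR; the help text, the `evaluated` field, and the helper names are hypothetical and not taken from `llm_evaluation/run.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Evaluate router predictions (sketch).")
    # store_true makes the flag default to False when it is omitted.
    parser.add_argument(
        "--force",
        action="store_true",
        help="Re-evaluate all entries, even ones that were already evaluated.",
    )
    return parser


def entries_to_evaluate(entries: list[dict], force: bool) -> list[dict]:
    """Select entries to score; `evaluated` is an assumed per-entry marker."""
    return [e for e in entries if force or not e.get("evaluated", False)]


if __name__ == "__main__":
    args = build_parser().parse_args()
    demo = [{"id": 1, "evaluated": True}, {"id": 2}]
    print(entries_to_evaluate(demo, args.force))
```

With the flag omitted, only entry 2 would be selected in the demo; with `--force`, both entries are evaluated again, which is what keeps repeated workflow runs consistent.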

github-actions · 2025-11-30T09:02:54Z

📊 Router Evaluation Results

Router: glm-4-air-router
Dataset Split: full

Metrics

Metric	Value
RouterArena Score	0.5617
Accuracy	54.65%
Total Cost	$0.396728
Avg Cost per Query	$0.000047
Avg Cost per 1K Queries	$0.0472
Number of Queries	8400

Evaluation completed by RouterArena automated workflow
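
The per-query figures in the table follow from the total cost and the query count. The short check below reproduces them, assuming the averages are plain divisions; the function name is made up for this example, and the RouterArena Score itself is not derived here.

```python
def cost_metrics(total_cost: float, num_queries: int) -> tuple[float, float]:
    """Return (average cost per query, average cost per 1,000 queries)."""
    per_query = total_cost / num_queries
    return per_query, per_query * 1000


per_query, per_1k = cost_metrics(0.396728, 8400)
print(f"Avg Cost per Query: ${per_query:.6f}")
print(f"Avg Cost per 1K Queries: ${per_1k:.4f}")
```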

jiarong0907

LGTM. One question: Will it also be triggered if we evaluate something in our own repo without a fork?

jiarong0907 · 2025-11-30T22:18:59Z

/gemini review

gemini-code-assist

Code Review

This pull request effectively enhances the evaluation process by including modified prediction files and adding a --force flag for re-evaluation. The changes are well-structured and address the intended goals. I have a couple of suggestions: one to improve the conciseness of a condition check, and another to correct a misleading help message for a command-line argument.

yl231 · 2025-11-30T23:03:37Z

> LGTM. One question: Will it also be triggered if we evaluate something in our own repo without a fork?

Yes, it will.

github-actions · 2025-11-30T23:25:49Z

📊 Router Evaluation Results

Router: glm-4-air-router
Dataset Split: full

Metrics

Metric	Value
RouterArena Score	0.5617
Accuracy	54.65%
Total Cost	$0.396728
Avg Cost per Query	$0.000047
Avg Cost per 1K Queries	$0.0472
Number of Queries	8400

Evaluation completed by RouterArena automated workflow

yl231 added 7 commits November 29, 2025 15:52

Improved the workflow

fbd1bf0

Improved.

54685fb

fixed comparing with the pr branch rather than main branch

f52e846

re-run

d9022fc

re-test

ae50686

Update PR evaluation workflow

34a10da

yl231 changed the title ~~[Enhancement] Use PR branch for evaluation and support modified prediction files~~ [Enhancement] Support evaluation on modified prediction files Nov 30, 2025

yl231 requested a review from jiarong0907 November 30, 2025 20:24

yl231 mentioned this pull request Nov 30, 2025

[TODO] Better way to trigger evaluation for new routers #38

Closed

jiarong0907 approved these changes Nov 30, 2025

gemini-code-assist Bot reviewed Nov 30, 2025

Comment thread llm_evaluation/run.py

Comment thread automation/process_pr_submission.py Outdated

Fixes to gemini code review.

255f436

jiarong0907 added the enhancement New feature or request label Nov 30, 2025

jiarong0907 changed the title ~~[Enhancement] Support evaluation on modified prediction files~~ [Feat] Support evaluation on modified prediction files Nov 30, 2025

jiarong0907 merged commit d77c288 into main Nov 30, 2025
11 checks passed

jiarong0907 deleted the trigger-evaluation branch November 30, 2025 23:39

jiarong0907 merged 8 commits into main from trigger-evaluation
