[Feat] Support evaluation on modified prediction files #48
…ion option
* Updated the PR evaluation workflow to detect changed prediction files more accurately by comparing against the fork's base branch.
* Added a `--force` option to the evaluation script to allow re-evaluation of all entries, even if they have already been evaluated.
* Minor adjustments to the GLM-4-air-router prediction JSON to test the above functionalities.
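For reference, a minimal sketch of how a `--force` option like the one described above could gate re-evaluation. This is not the code from this PR: the `build_parser` and `select_entries` helpers and the `evaluated` key on each entry are hypothetical names chosen for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with a --force switch (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Evaluate router predictions (illustrative sketch)."
    )
    parser.add_argument(
        "--force",
        action="store_true",
        help="Re-evaluate all entries, including ones that already have results.",
    )
    return parser


def select_entries(entries, force):
    """Return the entries that should be evaluated in this run.

    Assumption: each entry is a dict, and a truthy "evaluated" key marks
    an entry that already has results. The real schema may differ.
    """
    if force:
        # --force: ignore cached results and evaluate everything.
        return list(entries)
    # Default: skip entries that were already evaluated.
    return [e for e in entries if not e.get("evaluated", False)]
```

With this shape, a workflow run that passes `--force` evaluates every entry, while a run without it only evaluates entries that have no results yet.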
📊 Router Evaluation Results
Router: Metrics
Evaluation completed by RouterArena automated workflow
jiarong0907
left a comment
LGTM. One question: Will it also be triggered if we evaluate something in our own repo without a fork?
/gemini review
Code Review
This pull request effectively enhances the evaluation process by including modified prediction files and adding a --force flag for re-evaluation. The changes are well-structured and address the intended goals. I have a couple of suggestions: one to improve the conciseness of a condition check, and another to correct a misleading help message for a command-line argument.
Yes, it will.
📊 Router Evaluation Results
Router: Metrics
Evaluation completed by RouterArena automated workflow
PR Summary
Enables evaluation of modified (not just newly added) prediction files. The modified testing router has been successfully evaluated below.
Key Changes
* Support for modified files: Both added (`A`) and modified (`M`) prediction files are now detected and evaluated.
* Fork-aware comparison: Compares against the fork's base branch instead of upstream `main` for independent evaluation of each submission (determining which router prediction file is modified).
* Force re-evaluation: Added `--force` flag to run the full evaluation and ensure consistent evaluation results in workflow runs.
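To illustrate the added/modified detection described above, here is a rough sketch that filters the tab-separated output of `git diff --name-status <base>...HEAD` down to `A` and `M` prediction files. It is not the workflow's actual implementation: the `changed_prediction_files` helper, the `predictions/` directory, and the `.json` suffix are assumptions made for the example.

```python
def changed_prediction_files(name_status_output,
                             predictions_dir="predictions/",
                             suffix=".json"):
    """Pick added (A) and modified (M) prediction files from
    `git diff --name-status` output.

    Assumptions: prediction files live under `predictions_dir` and end
    with `suffix`; both defaults are hypothetical placeholders.
    """
    selected = []
    for line in name_status_output.splitlines():
        # Each line looks like "<status>\t<path>"; renames and copies
        # carry a score and two paths, e.g. "R100\t<old>\t<new>".
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status, path = parts[0], parts[-1]
        # Keep only exact A / M statuses; deletions, renames, and
        # copies are ignored in this sketch.
        if status in ("A", "M") and path.startswith(predictions_dir) \
                and path.endswith(suffix):
            selected.append(path)
    return selected
```

Usage, given output captured from the diff against the fork's base branch:

```python
sample = "A\tpredictions/new-router.json\nM\tpredictions/old-router.json\nD\tpredictions/gone.json\n"
print(changed_prediction_files(sample))
```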