[Feat] Support evaluation on modified prediction files #48

Merged
jiarong0907 merged 8 commits into main from trigger-evaluation
Nov 30, 2025
Conversation

@yl231 yl231 commented Nov 30, 2025

Contributor

PR Summary

Enables evaluation of modified (not just newly added) prediction files. A modified test router prediction file was evaluated successfully; the results are posted below.

Key Changes

  1. Support for modified files: Both added (A) and modified (M) prediction files are now detected and evaluated.

  2. Fork-aware comparison: The workflow compares against the fork's base branch instead of upstream main. Each submission is therefore evaluated independently, and the workflow can determine which router prediction file was modified.

  3. Force re-evaluation: Added a `--force` flag that runs the full evaluation, so that workflow runs produce consistent evaluation results.
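The detection described in items 1 and 2 can be sketched as follows. This is a minimal illustration, not the code in this PR: it assumes the workflow obtains text in the format printed by `git diff --name-status` (one tab-separated `STATUS<TAB>path` row per file) from a diff against the fork's base branch. The function name `select_prediction_files` and the `predictions/` prefix are hypothetical.

```python
from typing import List

# Hypothetical location of prediction files; the real path is not stated in this PR.
PREDICTION_PREFIX = "predictions/"


def select_prediction_files(name_status_output: str,
                            prefix: str = PREDICTION_PREFIX) -> List[str]:
    """Return added (A) and modified (M) JSON files under `prefix`.

    `name_status_output` is text in the format printed by
    `git diff --name-status`: one "STATUS<TAB>path" row per changed file.
    """
    selected = []
    for line in name_status_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            # Blank lines and rename/copy rows (three fields) are skipped.
            continue
        status, path = parts
        if status in ("A", "M") and path.startswith(prefix) and path.endswith(".json"):
            selected.append(path)
    return selected


sample = (
    "A\tpredictions/new-router.json\n"
    "M\tpredictions/glm-4-air-router.json\n"
    "D\tpredictions/old-router.json\n"
    "M\tREADME.md\n"
)
print(select_prediction_files(sample))
```

Deleted files and files outside the prediction directory are ignored, so only submissions that can actually be evaluated are passed on.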

…ion option

* Updated the PR evaluation workflow to detect changed prediction files more accurately by comparing against the fork's base branch.
* Added a `--force` option to the evaluation script to allow re-evaluation of all entries, even if they have already been evaluated.
* Minor adjustments to the GLM-4-air-router prediction JSON to test the above functionality.
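
The `--force` behavior described above can be sketched like this. It is an illustration under stated assumptions, not the actual code in `llm_evaluation/run.py`: the `evaluated` field on each entry and the helper `entries_to_evaluate` are hypothetical; only the flag name `--force` comes from this PR.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Evaluate router predictions.")
    # Flag name taken from the PR; the help text is illustrative.
    parser.add_argument(
        "--force",
        action="store_true",
        help="Re-evaluate all entries, even if they have already been evaluated.",
    )
    return parser


def entries_to_evaluate(entries: List[Dict], force: bool) -> List[Dict]:
    """Pick the entries to evaluate.

    Assumes (hypothetically) that each entry carries an `evaluated` boolean.
    With force=True every entry is selected; otherwise only unevaluated ones.
    """
    return [e for e in entries if force or not e.get("evaluated", False)]


entries = [
    {"id": 1, "evaluated": True},
    {"id": 2, "evaluated": False},
]
args = build_parser().parse_args(["--force"])
print([e["id"] for e in entries_to_evaluate(entries, args.force)])
```

Without `--force`, only the entry that has not been evaluated yet would be selected.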
@github-actions


📊 Router Evaluation Results

Router: glm-4-air-router
Dataset Split: full

Metrics

| Metric | Value |
| --- | --- |
| RouterArena Score | 0.5617 |
| Accuracy | 54.65% |
| Total Cost | $0.396728 |
| Avg Cost per Query | $0.000047 |
| Avg Cost per 1K Queries | $0.0472 |
| Number of Queries | 8400 |
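
As a quick consistency check, the two average-cost rows follow from the total cost and the query count reported above. The snippet below only redoes that arithmetic; it is not part of the evaluation workflow.

```python
total_cost = 0.396728   # Total Cost, in dollars, from the report
num_queries = 8400      # Number of Queries from the report

avg_cost_per_query = total_cost / num_queries
avg_cost_per_1k = avg_cost_per_query * 1000

print(f"Avg Cost per Query: ${avg_cost_per_query:.6f}")
print(f"Avg Cost per 1K Queries: ${avg_cost_per_1k:.4f}")
```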

Evaluation completed by RouterArena automated workflow

@yl231 yl231 changed the title from "[Enhancement] Use PR branch for evaluation and support modified prediction files" to "[Enhancement] Support evaluation on modified prediction files" Nov 30, 2025
@yl231 yl231 requested a review from jiarong0907 November 30, 2025 20:24

@jiarong0907 jiarong0907 left a comment

Contributor

LGTM. One question: Will it also be triggered if we evaluate something in our own repo without a fork?

@jiarong0907

Contributor

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request effectively enhances the evaluation process by including modified prediction files and adding a --force flag for re-evaluation. The changes are well-structured and address the intended goals. I have a couple of suggestions: one to improve the conciseness of a condition check, and another to correct a misleading help message for a command-line argument.

Comment thread llm_evaluation/run.py
Comment thread automation/process_pr_submission.py Outdated
@yl231

yl231 commented Nov 30, 2025

Contributor Author

> LGTM. One question: Will it also be triggered if we evaluate something in our own repo without a fork?

Yes, it will.

@github-actions


📊 Router Evaluation Results

Router: glm-4-air-router
Dataset Split: full

Metrics

| Metric | Value |
| --- | --- |
| RouterArena Score | 0.5617 |
| Accuracy | 54.65% |
| Total Cost | $0.396728 |
| Avg Cost per Query | $0.000047 |
| Avg Cost per 1K Queries | $0.0472 |
| Number of Queries | 8400 |

Evaluation completed by RouterArena automated workflow

@jiarong0907 jiarong0907 added the enhancement New feature or request label Nov 30, 2025
@jiarong0907 jiarong0907 changed the title from "[Enhancement] Support evaluation on modified prediction files" to "[Feat] Support evaluation on modified prediction files" Nov 30, 2025
@jiarong0907 jiarong0907 merged commit d77c288 into main Nov 30, 2025
11 checks passed
@jiarong0907 jiarong0907 deleted the trigger-evaluation branch November 30, 2025 23:39
Labels

enhancement New feature or request

2 participants