Automated PR-Based Router Submission and Evaluation by yl231 · Pull Request #13 · RouteWorks/RouterArena

yl231 · 2025-11-08T07:52:43Z

Summary

Implements an automated GitHub Actions workflow that evaluates router submissions via Pull Requests. Contributors submit prediction files through PRs, and the system automatically validates, evaluates, and posts results as PR comments.

Changes

New Files:

.github/workflows/pr-evaluation.yml - GitHub Actions workflow
automation/process_pr_submission.py - Core evaluation script
automation/extract_metrics.py - Metrics extraction helper

Modified Files:

router_inference/check_config_prediction_files.py - Added dictionary validation and model cost checks
llm_evaluation/run.py - Private dataset loading and error handling improvements
scripts/process_datasets/prep_datasets.py - Conditional private repo access
README.md - Updated submission instructions

- Add automatic PR comment posting with evaluation results
- Add extract_metrics.py helper to parse evaluation output
- Add upfront model cost validation in check_config_prediction_files.py
- Auto-detect dataset split based on prediction file size (809=sub_10, 8400=full)
- Always prepare dataset in workflow to ensure availability
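The split auto-detection described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `detect_split` and the lookup table are hypothetical, and only the two entry counts (809 and 8400) and their split names come from the commit message.

```python
# Hypothetical mapping from entry count to split name. Only the two
# counts and names below are taken from the commit message.
SPLIT_BY_SIZE = {809: "sub_10", 8400: "full"}


def detect_split(num_entries: int) -> str:
    """Return the split name for a prediction file with num_entries entries."""
    try:
        return SPLIT_BY_SIZE[num_entries]
    except KeyError:
        expected = ", ".join(str(n) for n in sorted(SPLIT_BY_SIZE))
        # Fail loudly instead of guessing a split for an unexpected size.
        raise ValueError(
            f"Unexpected prediction count {num_entries}; expected one of: {expected}"
        ) from None
```

Rejecting unknown sizes, rather than falling back to a default split, keeps a truncated prediction file from being evaluated against the wrong dataset.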

- Add router config with glm-4-air model
- Add prediction file with 8400 entries (full dataset)
- Remove test1.json prediction file
- Update llm_evaluation/run.py to simplify load_ground_truth_dataset
- Update prep_datasets.py to support private repo dataset loading

- Add --check-generated-result flag to check_config_prediction_files.py
  - Flag defaults to False (for pre-inference validation)
  - When True, validates that all generated_result fields are populated
- Add error handling in compute_arena_score to prevent math domain errors
  - Validates cost and accuracy inputs
  - Provides clear error messages for invalid values
- Add checks in compute_router_metrics to detect when no entries evaluated
  - Prevents silent failures when all entries are skipped
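To illustrate the kind of input check this commit describes for compute_arena_score, here is a hedged sketch. The helper name `validate_score_inputs` is hypothetical, and the valid ranges are assumptions rather than facts from the PR: cost is assumed to be strictly positive (Python's `math.log` raises `ValueError` for non-positive inputs), and accuracy is assumed to be a fraction in [0, 1].

```python
import math


def validate_score_inputs(cost: float, accuracy: float) -> None:
    """Raise ValueError with a clear message if the inputs are unusable.

    Assumptions (not confirmed by the PR): cost must be strictly positive
    because it may later be passed to a logarithm, and accuracy is a
    fraction in [0, 1].
    """
    if not math.isfinite(cost) or cost <= 0:
        raise ValueError(f"cost must be a positive finite number, got {cost!r}")
    if not math.isfinite(accuracy) or not 0.0 <= accuracy <= 1.0:
        raise ValueError(f"accuracy must be within [0, 1], got {accuracy!r}")
```

Calling a check like this before any logarithm is taken turns an opaque math domain error into a message that names the offending value.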

- Fix populate script to create generated_result as dictionary (not string)
  - generated_result now matches format expected by evaluation code
- Update README to use PR submission instead of email contact
  - Add instructions for submitting routers via Pull Request

- Update validation to check for dictionary format (not string)
- Validate required fields: generated_answer, success, token_usage
- Allow empty generated_answer when success is False (failed entries)
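The validation rules listed in this commit could look roughly like the sketch below. The function name `validate_generated_result` and the error strings are hypothetical; only the dictionary requirement, the three field names, and the exception for failed entries come from the commit message.

```python
# Field names taken from the commit message.
REQUIRED_FIELDS = ("generated_answer", "success", "token_usage")


def validate_generated_result(result: object) -> list[str]:
    """Return a list of problems found in one generated_result value."""
    if not isinstance(result, dict):
        return [f"generated_result must be a dictionary, got {type(result).__name__}"]

    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in result]
    if errors:
        return errors

    # An empty answer is tolerated only for entries marked as failed.
    if result["success"] and not result["generated_answer"]:
        errors.append("generated_answer is empty although success is True")
    return errors
```

Returning a list of problems instead of raising on the first one lets a caller report every invalid entry in a prediction file in a single pass.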

- Remove test configs (test1.json, glm-4-air-router.json)
- Remove test predictions (glm-4-air-router.json)
- Keep only your-router.json config as in main branch

jiarong0907

Address the comments and we can merge it. Thanks!

jiarong0907 · 2025-11-10T03:43:47Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a significant and well-designed automation workflow for evaluating router submissions via GitHub Actions. The changes include new scripts for processing PRs, extracting metrics, and modifications to existing evaluation and validation scripts to support the new flow. The use of git worktrees for isolation is a great choice. My review focuses on improving the robustness and maintainability of the new scripts. Key suggestions include moving an import to the top level for better code style, making metric extraction from logs less fragile, and improving the model cost validation by removing ambiguous partial-string matching in favor of exact matches to prevent potential miscalculations. Overall, this is a strong contribution that greatly improves the project's submission process.
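The point about partial-string matching in cost validation can be shown with a small sketch. Everything here is hypothetical: the model names, prices, and function names are made up for illustration and are not taken from the repository.

```python
# Hypothetical price table; names and prices are invented for illustration.
MODEL_COSTS = {
    "example-model": 1.0,
    "example-model-mini": 0.1,
}


def lookup_cost_partial(model_name: str, costs: dict[str, float]) -> float:
    """Ambiguous approach: return the first entry whose key is a substring."""
    for key, cost in costs.items():
        if key in model_name:
            return cost
    raise KeyError(f"No cost entry matches model {model_name!r}")


def lookup_cost_exact(model_name: str, costs: dict[str, float]) -> float:
    """Safer approach: accept only an exact key match."""
    if model_name not in costs:
        raise KeyError(f"No cost entry for model {model_name!r}")
    return costs[model_name]


# The partial match picks the wrong price, because "example-model" is a
# substring of "example-model-mini" and comes first in the table.
print(lookup_cost_partial("example-model-mini", MODEL_COSTS))  # 1.0
print(lookup_cost_exact("example-model-mini", MODEL_COSTS))    # 0.1
```

With exact matching, an unknown model name fails validation up front instead of silently inheriting the price of a similarly named model.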

jiarong0907 · 2025-11-10T06:07:47Z

I believe I improved this file in a previous commit, but somehow the changes were reverted when the merge conflict was resolved. See this: https://github.com/RouteWorks/RouterArena/blob/e893355c294b13c6646b4bf2268f646386c86bb6/llm_evaluation/livecodebench_util.py

Your current implementation still has an uncleaned ignore, and the implementation does not look very elegant to me. Maybe consider merging the current version with my previous version.

yl231 · 2025-11-10T22:14:00Z

You should upload the prediction file to RouterArena/router_inference/predictions and open a PR. The workflow will automatically detect your submission and run the evaluation. When it finishes, the evaluation result is posted as a comment on your PR.

* Update leaderboard: add Nadir (#112), update AgentForge (#117)
  - AgentForge (#117): refreshed to faithful post-fix metrics (arena 74.13, acc 74.72%, cost $0.13/1K, abnormal 0); moves #13 -> #2.
  - Nadir (#112): new entry, arena 73.33 (acc 74.87%, cost $0.29/1K) at #3. Optimality left as a dash pending review (computed 1.0/1.0/1.0 is an artifact of the submission listing only expensive optimality candidates).
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Update Nadir Router link in README.md

Co-authored-by: Louie Lu <yl231@datalab2.cs.rice.edu>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

yl231 added 10 commits November 7, 2025 13:21

Add router submission evaluation workflow

053de02

[Testing Automatic Eval] upload predictions and config files

4b04bc5

Sync dataset into worktree for evaluations

ecb5ae8

Fix dataset preparation and environment variable handling in workflow

5482a16

Fix validation to accept generated_result as dictionary

69db384

- Update validation to check for dictionary format (not string)
- Validate required fields: generated_answer, success, token_usage
- Allow empty generated_answer when success is False (failed entries)

Restore config and predictions to match main branch

f6ec20b

- Remove test configs (test1.json, glm-4-air-router.json)
- Remove test predictions (glm-4-air-router.json)
- Keep only your-router.json config as in main branch

yl231 requested a review from jiarong0907 November 8, 2025 07:53

jiarong0907 reviewed Nov 10, 2025

Comment thread .github/workflows/pr-evaluation.yml

jiarong0907 reviewed Nov 10, 2025

Comment thread llm_evaluation/run.py Outdated

jiarong0907 reviewed Nov 10, 2025

Comment thread llm_evaluation/run.py

jiarong0907 reviewed Nov 10, 2025

Comment thread images/routerarena-diagram.png

jiarong0907 requested changes Nov 10, 2025

gemini-code-assist Bot reviewed Nov 10, 2025

Comment thread router_inference/check_config_prediction_files.py

Comment thread automation/extract_metrics.py

Comment thread automation/process_pr_submission.py Outdated

Comment thread router_inference/check_config_prediction_files.py Outdated

Fixed PR comments.

353fa74

yl231 requested a review from jiarong0907 November 10, 2025 04:52

jiarong0907 reviewed Nov 10, 2025

jiarong0907 approved these changes Nov 10, 2025

jiarong0907 merged commit e3ac430 into main Nov 10, 2025
10 checks passed

jiarong0907 deleted the submission-test branch November 10, 2025 17:23

yl231 mentioned this pull request May 31, 2026

Update leaderboard: add Nadir (#112), update AgentForge (#117) #121

Merged

jiarong0907 merged 11 commits into main from submission-test
