Automated PR-Based Router Submission and Evaluation#13

Merged: jiarong0907 merged 11 commits into main from submission-test on Nov 10, 2025

yl231 (Contributor) commented Nov 8, 2025

Summary

Implements an automated GitHub Actions workflow that evaluates router submissions via Pull Requests. Contributors submit prediction files through PRs, and the system automatically validates, evaluates, and posts results as PR comments.

Changes

New Files:

  • .github/workflows/pr-evaluation.yml - GitHub Actions workflow
  • automation/process_pr_submission.py - Core evaluation script
  • automation/extract_metrics.py - Metrics extraction helper

Modified Files:

  • router_inference/check_config_prediction_files.py - Added dictionary validation and model cost checks
  • llm_evaluation/run.py - Private dataset loading and error handling improvements
  • scripts/process_datasets/prep_datasets.py - Conditional private repo access
  • README.md - Updated submission instructions

yl231 added 10 commits November 7, 2025 13:21
- Add automatic PR comment posting with evaluation results
- Add extract_metrics.py helper to parse evaluation output
- Add upfront model cost validation in check_config_prediction_files.py
- Auto-detect dataset split based on prediction file size (809=sub_10, 8400=full)
- Always prepare dataset in workflow to ensure availability
- Add router config with glm-4-air model
- Add prediction file with 8400 entries (full dataset)
- Remove test1.json prediction file
- Update llm_evaluation/run.py to simplify load_ground_truth_dataset
- Update prep_datasets.py to support private repo dataset loading
- Add --check-generated-result flag to check_config_prediction_files.py
  - Flag defaults to False (for pre-inference validation)
  - When True, validates that all generated_result fields are populated
- Add error handling in compute_arena_score to prevent math domain errors
  - Validates cost and accuracy inputs
  - Provides clear error messages for invalid values
- Add checks in compute_router_metrics to detect when no entries evaluated
  - Prevents silent failures when all entries are skipped
- Fix populate script to create generated_result as dictionary (not string)
- generated_result now matches format expected by evaluation code
- Update README to use PR submission instead of email contact
- Add instructions for submitting routers via Pull Request
- Update validation to check for dictionary format (not string)
- Validate required fields: generated_answer, success, token_usage
- Allow empty generated_answer when success is False (failed entries)
- Remove test configs (test1.json, glm-4-air-router.json)
- Remove test predictions (glm-4-air-router.json)
- Keep only your-router.json config as in main branch
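
The split auto-detection and the `generated_result` checks described in the commit messages above could look roughly like the sketch below. It is not the code in `check_config_prediction_files.py`: the function names `detect_split` and `validate_generated_result` are hypothetical, and only the entry counts (809, 8400), the split names, and the three required field names come from the commit messages.

```python
from __future__ import annotations

from typing import Any

# Entry counts and split names as given in the commit message (809=sub_10, 8400=full).
SPLIT_BY_ENTRY_COUNT = {809: "sub_10", 8400: "full"}

# Required keys of each generated_result dictionary, per the commit message.
REQUIRED_FIELDS = ("generated_answer", "success", "token_usage")


def detect_split(predictions: list) -> str:
    """Map the number of prediction entries to a dataset split name (hypothetical helper)."""
    count = len(predictions)
    if count not in SPLIT_BY_ENTRY_COUNT:
        raise ValueError(
            f"Unexpected prediction count {count}; "
            f"expected one of {sorted(SPLIT_BY_ENTRY_COUNT)}"
        )
    return SPLIT_BY_ENTRY_COUNT[count]


def validate_generated_result(result: Any) -> list[str]:
    """Return the problems found in one generated_result value (hypothetical helper)."""
    # The evaluation code expects a dictionary, not a serialized string.
    if not isinstance(result, dict):
        return [f"generated_result must be a dict, got {type(result).__name__}"]
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in result]
    if errors:
        return errors
    # An empty answer is tolerated only for entries marked as failed.
    if result["success"] and not result["generated_answer"]:
        errors.append("generated_answer is empty although success is True")
    return errors


if __name__ == "__main__":
    print(detect_split([{}] * 809))  # sub_10
    failed_entry = {"generated_answer": "", "success": False, "token_usage": {}}
    print(validate_generated_result(failed_entry))  # []
```

Returning a list of problems rather than raising on the first one lets a validator report every malformed entry in a single run, which is convenient when the output is posted back as a PR comment.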
yl231 requested a review from jiarong0907 November 8, 2025 07:53
Comment thread .github/workflows/pr-evaluation.yml
Comment thread llm_evaluation/run.py Outdated
Comment thread llm_evaluation/run.py
Comment thread images/routerarena-diagram.png

jiarong0907 (Contributor) left a comment

Resolve the comments and we can merge it. Thanks!

jiarong0907 (Contributor) commented

/gemini review

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a significant and well-designed automation workflow for evaluating router submissions via GitHub Actions. The changes include new scripts for processing PRs, extracting metrics, and modifications to existing evaluation and validation scripts to support the new flow. The use of git worktrees for isolation is a great choice. My review focuses on improving the robustness and maintainability of the new scripts. Key suggestions include moving an import to the top level for better code style, making metric extraction from logs less fragile, and improving the model cost validation by removing ambiguous partial-string matching in favor of exact matches to prevent potential miscalculations. Overall, this is a strong contribution that greatly improves the project's submission process.
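
To illustrate the concern about partial-string matching in the model cost validation, here is a minimal sketch. The cost table, the model names, and both lookup functions are hypothetical placeholders, not the project's actual data or code; the point is only that a substring match can silently pick the wrong model's price, while an exact match fails loudly.

```python
# Hypothetical cost table (USD per million tokens); names and values are placeholders.
MODEL_COSTS = {
    "example-model": {"input": 1.0, "output": 2.0},
    "example-model-mini": {"input": 0.1, "output": 0.2},
}


def lookup_cost_partial(model: str, costs: dict) -> dict:
    """Ambiguous lookup: returns the first entry whose name overlaps with `model`."""
    for name, cost in costs.items():
        if name in model or model in name:
            return cost
    raise KeyError(f"No cost entry for model {model!r}")


def lookup_cost_exact(model: str, costs: dict) -> dict:
    """Strict lookup: the model name must match a key exactly."""
    if model not in costs:
        raise KeyError(f"No cost entry for model {model!r}")
    return costs[model]


if __name__ == "__main__":
    # "example-model" is a substring of "example-model-mini", so the partial
    # lookup returns the larger model's price for the mini model.
    print(lookup_cost_partial("example-model-mini", MODEL_COSTS))
    print(lookup_cost_exact("example-model-mini", MODEL_COSTS))
```

With exact matching, an unknown or misspelled model name surfaces as an error at validation time instead of producing a plausible but wrong cost.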

Comment thread router_inference/check_config_prediction_files.py
Comment thread automation/extract_metrics.py
Comment thread automation/process_pr_submission.py Outdated
Comment thread router_inference/check_config_prediction_files.py Outdated
yl231 requested a review from jiarong0907 November 10, 2025 04:52

jiarong0907 (Contributor) commented Nov 10, 2025

I believe I improved this file in a previous commit, but the changes seem to have been reverted when the merge conflict was resolved. See: https://github.com/RouteWorks/RouterArena/blob/e893355c294b13c6646b4bf2268f646386c86bb6/llm_evaluation/livecodebench_util.py

Your current implementation still has leftover ignore markers that were not cleaned up, and it does not look very elegant to me. Consider merging the current version with my previous version.

jiarong0907 merged commit e3ac430 into main Nov 10, 2025
10 checks passed
jiarong0907 deleted the submission-test branch November 10, 2025 17:23
yl231 (Contributor, Author) commented Nov 10, 2025

Upload the prediction file to RouterArena/router_inference/predictions and open a PR. The workflow will automatically detect your submission and run the evaluation. When it finishes, you will see an evaluation result similar to the one below:
[Image: example-result]

yl231 added a commit that referenced this pull request May 31, 2026
* Update leaderboard: add Nadir (#112), update AgentForge (#117)

- AgentForge (#117): refreshed to faithful post-fix metrics (arena 74.13,
  acc 74.72%, cost $0.13/1K, abnormal 0) — moves #13 -> #2.
- Nadir (#112): new entry, arena 73.33 (acc 74.87%, cost $0.29/1K) at #3.
  Optimality left as — pending review (computed 1.0/1.0/1.0 is an artifact of
  the submission listing only expensive optimality candidates).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update Nadir Router link in README.md

---------

Co-authored-by: Louie Lu <yl231@datalab2.cs.rice.edu>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>