Automated PR-Based Router Submission and Evaluation #13
- Add automatic PR comment posting with evaluation results
- Add extract_metrics.py helper to parse evaluation output
- Add upfront model cost validation in check_config_prediction_files.py
- Auto-detect dataset split based on prediction file size (809=sub_10, 8400=full)
- Always prepare dataset in workflow to ensure availability

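The split auto-detection listed above maps the number of predictions to a dataset split. A minimal sketch of that idea, assuming the prediction file is a JSON array with one object per query; the names `SPLIT_BY_SIZE`, `detect_split`, and `detect_split_from_file` are hypothetical and not taken from the repository:

```python
import json
from pathlib import Path

# Entry counts taken from the commit message: 809 -> sub_10, 8400 -> full.
SPLIT_BY_SIZE = {809: "sub_10", 8400: "full"}


def detect_split(num_entries: int) -> str:
    """Return the dataset split name for a given number of predictions."""
    try:
        return SPLIT_BY_SIZE[num_entries]
    except KeyError:
        expected = ", ".join(f"{n} ({s})" for n, s in sorted(SPLIT_BY_SIZE.items()))
        raise ValueError(
            f"Unexpected prediction count {num_entries}; expected one of: {expected}"
        ) from None


def detect_split_from_file(path: str) -> str:
    """Assumption: the prediction file is a JSON array of prediction objects."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    return detect_split(len(entries))
```

Failing loudly on any other count is a deliberate choice in this sketch: a truncated prediction file is rejected rather than silently evaluated against the wrong split.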
- Add router config with glm-4-air model
- Add prediction file with 8400 entries (full dataset)
- Remove test1.json prediction file
- Update llm_evaluation/run.py to simplify load_ground_truth_dataset
- Update prep_datasets.py to support private repo dataset loading

- Add --check-generated-result flag to check_config_prediction_files.py
  - Flag defaults to False (for pre-inference validation)
  - When True, validates that all generated_result fields are populated
- Add error handling in compute_arena_score to prevent math domain errors
  - Validates cost and accuracy inputs
  - Provides clear error messages for invalid values
- Add checks in compute_router_metrics to detect when no entries evaluated
  - Prevents silent failures when all entries are skipped

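The input checks described above can be illustrated with a small guard. This is a sketch under stated assumptions, not the repository's `compute_arena_score`: it assumes the score takes a logarithm of cost (the usual source of a math domain error) and that accuracy is a fraction in [0, 1]. All function names here are hypothetical.

```python
import math


def validate_score_inputs(cost: float, accuracy: float) -> None:
    """Raise ValueError with a clear message instead of failing inside math.log2."""
    if not math.isfinite(cost) or cost <= 0:
        raise ValueError(f"cost must be a finite positive number, got {cost!r}")
    # Assumption: accuracy is expressed as a fraction, not a percentage.
    if not math.isfinite(accuracy) or not 0.0 <= accuracy <= 1.0:
        raise ValueError(f"accuracy must be in [0, 1], got {accuracy!r}")


def require_evaluated(num_evaluated: int) -> None:
    """Fail loudly when every entry was skipped, instead of reporting empty metrics."""
    if num_evaluated <= 0:
        raise ValueError("no entries were evaluated; all entries were skipped")


def safe_log_cost(cost: float, accuracy: float) -> float:
    """Hypothetical stand-in for the logarithm step of a score computation."""
    validate_score_inputs(cost, accuracy)
    return math.log2(cost)
```

Validating before the logarithm turns an opaque `ValueError: math domain error` into a message that names the offending input.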
- Fix populate script to create generated_result as dictionary (not string)
  - generated_result now matches format expected by evaluation code
- Update README to use PR submission instead of email contact
  - Add instructions for submitting routers via Pull Request

- Update validation to check for dictionary format (not string)
  - Validate required fields: generated_answer, success, token_usage
  - Allow empty generated_answer when success is False (failed entries)

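The validation rules in this commit can be sketched as a single check per entry. The field names come from the commit message; the function name `validate_generated_result` and the error wording are hypothetical, and the real check in check_config_prediction_files.py may differ:

```python
from typing import Any

# Required fields as listed in the commit message.
REQUIRED_FIELDS = ("generated_answer", "success", "token_usage")


def validate_generated_result(result: Any) -> list[str]:
    """Return a list of problems; an empty list means the entry is valid."""
    if not isinstance(result, dict):
        return [f"generated_result must be a dict, got {type(result).__name__}"]
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in result]
    if errors:
        return errors
    # An empty answer is only acceptable for entries marked as failed.
    if result["success"] and not result["generated_answer"]:
        errors.append("generated_answer is empty but success is True")
    return errors
```

Returning a list of problems rather than raising on the first one lets a validator report every malformed entry in a single run.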
- Remove test configs (test1.json, glm-4-air-router.json)
- Remove test predictions (glm-4-air-router.json)
- Keep only your-router.json config as in main branch
jiarong0907
left a comment
Address the comments and we can merge it. Thanks!
/gemini review
Code Review
This pull request introduces a significant and well-designed automation workflow for evaluating router submissions via GitHub Actions. The changes include new scripts for processing PRs, extracting metrics, and modifications to existing evaluation and validation scripts to support the new flow. The use of git worktrees for isolation is a great choice. My review focuses on improving the robustness and maintainability of the new scripts. Key suggestions include moving an import to the top level for better code style, making metric extraction from logs less fragile, and improving the model cost validation by removing ambiguous partial-string matching in favor of exact matches to prevent potential miscalculations. Overall, this is a strong contribution that greatly improves the project's submission process.
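The concern about partial-string matching in the cost validation can be illustrated with a toy example. The cost table below is entirely hypothetical (placeholder model names and values), and both lookup functions are illustrative rather than the code under review:

```python
# Hypothetical cost table; names and values are placeholders, not real pricing.
MODEL_COSTS = {
    "example-model": 1.0,
    "example-model-mini": 0.1,
}


def lookup_cost_partial(model: str, costs: dict[str, float]) -> float:
    """Fragile: returns the first table key that is a substring of the model name."""
    for name, cost in costs.items():
        if name in model:
            return cost
    raise KeyError(model)


def lookup_cost_exact(model: str, costs: dict[str, float]) -> float:
    """Robust: only an exact key match is accepted."""
    try:
        return costs[model]
    except KeyError:
        raise KeyError(f"Unknown model {model!r}; add it to the cost table") from None


# The partial lookup matches "example-model" first and returns the wrong price.
wrong = lookup_cost_partial("example-model-mini", MODEL_COSTS)
right = lookup_cost_exact("example-model-mini", MODEL_COSTS)
```

With exact matching, an unknown model name fails validation up front instead of being priced as a similarly named model.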
I believe I improved this file in a previous commit, but somehow the changes were reverted when the merge conflict was resolved. See this: https://github.com/RouteWorks/RouterArena/blob/e893355c294b13c6646b4bf2268f646386c86bb6/llm_evaluation/livecodebench_util.py
Your current implementation still has ignores that were not cleaned up, and it does not look very elegant to me. Consider merging the current version with my previous version.
* Update leaderboard: add Nadir (#112), update AgentForge (#117) - AgentForge (#117): refreshed to faithful post-fix metrics (arena 74.13, acc 74.72%, cost $0.13/1K, abnormal 0) — moves #13 -> #2. - Nadir (#112): new entry, arena 73.33 (acc 74.87%, cost $0.29/1K) at #3. Optimality left as — pending review (computed 1.0/1.0/1.0 is an artifact of the submission listing only expensive optimality candidates). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Update Nadir Router link in README.md --------- Co-authored-by: Louie Lu <yl231@datalab2.cs.rice.edu> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Summary
Implements an automated GitHub Actions workflow that evaluates router submissions via Pull Requests. Contributors submit prediction files through PRs, and the system automatically validates, evaluates, and posts results as PR comments.
Changes
New Files:
- .github/workflows/pr-evaluation.yml - GitHub Actions workflow
- automation/process_pr_submission.py - Core evaluation script
- automation/extract_metrics.py - Metrics extraction helper
Modified Files:
- router_inference/check_config_prediction_files.py - Added dictionary validation and model cost checks
- llm_evaluation/run.py - Private dataset loading and error handling improvements
- scripts/process_datasets/prep_datasets.py - Conditional private repo access
- README.md - Updated submission instructions