[Feat.] PR evaluation workflow with automatic robustness evaluation by RixinLiu · Pull Request #56 · RouteWorks/RouterArena

RixinLiu · 2025-12-04T20:39:37Z

Summary

Adds automatic robustness evaluation to the RouterArena PR workflow.

Key Features

Robustness Evaluation Integration

The robustness evaluation pipeline is now fully integrated into the existing codebase:

Dataset preparation:
prep_datasets.py automatically downloads the robustness split from the RouterArena dataset.
Prediction generation:
generate_prediction_file.py accepts a --robustness flag to run router inference specifically for the robustness split.
Config and file checking:
check_config_prediction.py accepts a --robustness flag to validate the corresponding prediction file, ensuring correct structure and routing outputs.
Score computation:
llm_evaluation/run.py adds a robustness option to compute robustness metrics independently, without performing LLM inference.
Workflow integration:
pr-evaluation.yml has been updated so that robustness evaluation is automatically executed and included in the workflow results.
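
As a rough illustration of the flag-based design described above, an opt-in `--robustness` switch might be wired up as follows. This is only a sketch: the flag name comes from the PR description, while the `router_name` argument, the `select_split` helper, and the split names are hypothetical and may not match the actual scripts.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with an opt-in robustness switch (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Generate router predictions")
    # Hypothetical positional argument; the real script's arguments may differ.
    parser.add_argument("router_name", help="name of the router to run")
    # The --robustness flag is named in the PR description; its handling here is assumed.
    parser.add_argument(
        "--robustness",
        action="store_true",
        help="run on the robustness split instead of the default split",
    )
    return parser


def select_split(args: argparse.Namespace) -> str:
    """Map the parsed flag to a split name (the split names are illustrative)."""
    return "robustness" if args.robustness else "full"


# Parse an explicit argument list so the example does not depend on sys.argv.
example_args = build_parser().parse_args(["my-router", "--robustness"])
print(f"Running inference on the {select_split(example_args)} split")
```

Using `action="store_true"` keeps the default behaviour unchanged when the flag is omitted, which fits an evaluation mode that is added alongside an existing one.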

README Update

The README has been updated to document all robustness-related features, including dataset preparation, prediction file generation, configuration checks, and score computation.

Workflow Verification

Please refer to this comment on a closed PR for the details.

jiarong0907 · 2025-12-04T20:45:13Z

/gemini review

gemini-code-assist

Code Review

This pull request effectively integrates an automatic robustness evaluation into the PR workflow. The changes are well-structured across dataset preparation, prediction generation, validation, and scoring. The README has also been updated accordingly. I've identified a few areas for improvement, mainly concerning code duplication, error handling, and a minor logging inconsistency. Addressing these will enhance the maintainability and robustness of the new evaluation pipeline.

jiarong0907

See comments.

jiarong0907 · 2025-12-04T20:57:57Z

Did we run a full PR test for all evaluation metrics?

RixinLiu · 2025-12-04T21:34:10Z

Yes, here
#55 (comment)

jiarong0907 · 2025-12-04T21:38:20Z

Is this sufficient? I saw the Arenascore and others are empty.

RixinLiu · 2025-12-04T21:47:18Z

Yes, it's sufficient. The other metrics are empty only because I don't run LLM inference; I put placeholders in for the token usage, generated_answer, and cost.
This comment explains it in detail.
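
To make the placeholder idea concrete, here is a minimal sketch of how inference-only fields could be filled in so that scoring can proceed without LLM inference. Only `generated_answer` is a name taken from the comment above; the `token_usage` and `cost` keys, the placeholder values, and the `fill_placeholders` helper are assumptions for illustration, not the repository's actual schema.

```python
import copy
from typing import Any

# Hypothetical placeholder values; the PR does not state the exact ones used.
PLACEHOLDERS: dict[str, Any] = {
    "generated_answer": "",
    "token_usage": {"input_tokens": 0, "output_tokens": 0},
    "cost": 0.0,
}


def fill_placeholders(entry: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of entry with missing inference fields set to placeholders."""
    filled = dict(entry)
    for key, value in PLACEHOLDERS.items():
        # deepcopy avoids sharing the nested token_usage dict between entries.
        filled.setdefault(key, copy.deepcopy(value))
    return filled


# A routing-only entry (the "prediction" key is also illustrative).
print(fill_placeholders({"prediction": "model-a"}))
```

Metrics that depend on these fields would then come out empty or zero, which is consistent with the empty scores seen in the test run.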

…ion.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

…s.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

jiarong0907 · 2025-12-05T02:37:54Z

/gemini review

gemini-code-assist

Code Review

This pull request successfully integrates an automatic robustness evaluation into the PR workflow. The changes are comprehensive, touching dataset preparation, prediction generation, validation, and scoring. The README is also updated accordingly.

I've identified a few areas for improvement:

There's a minor bug in the logging of the robustness score, which misrepresents the 'flip ratio'.
In several places, broad except Exception blocks are used, which could be narrowed to more specific exceptions to improve error handling and prevent masking other issues.
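
The flip-ratio point in the first item can be illustrated with a small sketch. It assumes, purely for illustration, that the flip ratio is the fraction of queries whose routing decision changes under perturbation and that the robustness score is its complement; the PR does not confirm either definition, and the function names are hypothetical.

```python
def flip_ratio(original: list[str], perturbed: list[str]) -> float:
    """Fraction of queries whose selected model differs after perturbation."""
    if len(original) != len(perturbed):
        raise ValueError("both lists must contain one decision per query")
    if not original:
        raise ValueError("no routing decisions to compare")
    flips = sum(a != b for a, b in zip(original, perturbed))
    return flips / len(original)


def format_robustness_log(ratio: float) -> str:
    """Label both numbers explicitly so the log cannot confuse them.

    Assumption: robustness score = 1 - flip ratio.
    """
    return f"Robustness score: {1.0 - ratio:.4f} (flip ratio: {ratio:.4f})"


ratio = flip_ratio(["a", "b", "c", "d"], ["a", "b", "x", "d"])
print(format_robustness_log(ratio))
```

Logging the two values under separate labels is one way to avoid reporting a score under the name of the ratio, or the reverse.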

Overall, this is a great addition to the project. My detailed comments are below.
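
As a sketch of what narrowing a broad `except Exception` block can look like, the example below loads a JSON prediction file and catches only the failures it expects. The `load_prediction_file` helper and its return convention are hypothetical and are not taken from the PR's code.

```python
import json
from typing import Any, Optional


def load_prediction_file(path: str) -> Optional[Any]:
    """Load a JSON file, handling only the failures we expect."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        # A missing file is an expected, recoverable condition.
        print(f"Prediction file not found: {path}")
    except json.JSONDecodeError as exc:
        # Malformed JSON is reported with its location in the file.
        print(f"Invalid JSON in {path}: line {exc.lineno}, column {exc.colno}")
    # Any other exception (for example a TypeError from a bug) still propagates.
    return None
```

Because unrelated exceptions are not swallowed, a programming error surfaces immediately instead of being reported as a missing or invalid file.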

RixinLiu · 2025-12-08T01:07:33Z

Looks like everything is settled, based on the result in this issue.

jiarong0907 · 2025-12-08T03:21:37Z

/gemini review

gemini-code-assist

Code Review

This pull request successfully integrates an automatic robustness evaluation into the RouterArena workflow. The changes are well-structured across multiple scripts, including dataset preparation, prediction generation, validation, and score computation. The README.md has also been updated accordingly.

I've found one critical issue: a typo in a filename within scripts/process_datasets/prep_datasets.py which would break the data preparation step for the robustness evaluation. I've also noted a minor issue with a misleading log message. Please see the specific comments for details.

Overall, this is a great feature addition that enhances the evaluation capabilities of RouterArena.

RixinLiu added 7 commits December 3, 2025 15:58

prep_datasets is ready

393e9d4

generate_prediction_file.py is ready

76e0d9c

check_config_prediction_files.py is ready

a589e93

llm_evaluation/run.py is ready

0d783fa

pr-automation is ready to test

a77f91e

[Debug]

2285c91

[Feat.] PR evaluation workflow with automatic robustness evaluation

865f500

RixinLiu requested a review from jiarong0907 December 4, 2025 20:39