
[Feat.] PR evaluation workflow with automatic robustness evaluation#56

Merged
RixinLiu merged 23 commits into main from automatic-robustness-evaluation
Dec 8, 2025

@RixinLiu

@RixinLiu RixinLiu commented Dec 4, 2025

Contributor

Summary

Adds automatic robustness evaluation to the RouterArena PR workflow.

Key Features

Robustness Evaluation Integration

The robustness evaluation pipeline is now fully integrated into the existing codebase:

  • Dataset preparation:
    prep_datasets.py automatically downloads the robustness split from the RouterArena dataset.

  • Prediction generation:
    generate_prediction_file.py accepts a --robustness flag to run router inference specifically for the robustness split.

  • Config and file checking:
    check_config_prediction.py accepts a --robustness flag to validate the corresponding prediction file, ensuring correct structure and routing outputs.

  • Score computation:
    llm_evaluation/run.py adds a robustness option to compute robustness metrics independently, without performing LLM inference.

  • Workflow integration:
    pr-evaluation.yml has been updated so that robustness evaluation is automatically executed and included in the workflow results.
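
As a rough illustration of how a single `--robustness` flag could switch a script between the default split and the robustness split, here is a minimal sketch. The split paths, the `router_name` positional argument, and the helper names are hypothetical and are not taken from the RouterArena code; only the `--robustness` flag itself comes from the description above.

```python
import argparse

# Hypothetical file locations, used only for illustration;
# the real repository layout may differ.
SPLIT_PATHS = {
    "default": "dataset/router_data.json",
    "robustness": "dataset/router_robustness.json",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with an opt-in robustness switch."""
    parser = argparse.ArgumentParser(description="Generate router predictions.")
    # Hypothetical positional argument naming the router to run.
    parser.add_argument("router_name", help="Name of the router to run")
    parser.add_argument(
        "--robustness",
        action="store_true",
        help="Run on the robustness split instead of the default split",
    )
    return parser


def select_split(args: argparse.Namespace) -> str:
    """Return the dataset path that matches the parsed flags."""
    return SPLIT_PATHS["robustness" if args.robustness else "default"]


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    print(f"{cli_args.router_name}: using {select_split(cli_args)}")
```

Using `action="store_true"` keeps the default behaviour unchanged when the flag is omitted, so existing invocations would keep working.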

README Update

The README has been updated to document all robustness-related features, including dataset preparation, prediction file generation, configuration checks, and score computation.

Workflow Verification

For details, please refer to this closed PR comment.

@RixinLiu RixinLiu requested a review from jiarong0907 December 4, 2025 20:39
Comment thread .github/workflows/pr-evaluation.yml
@jiarong0907

Contributor

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request effectively integrates an automatic robustness evaluation into the PR workflow. The changes are well-structured across dataset preparation, prediction generation, validation, and scoring. The README has also been updated accordingly. I've identified a few areas for improvement, mainly concerning code duplication, error handling, and a minor logging inconsistency. Addressing these will enhance the maintainability and robustness of the new evaluation pipeline.

Comment thread automation/process_pr_submission.py
Comment thread automation/process_pr_submission.py Outdated
Comment thread llm_evaluation/run.py Outdated
Comment thread llm_evaluation/run.py
Comment thread scripts/process_datasets/prep_datasets.py Outdated
Comment thread llm_evaluation/run.py Outdated
Comment thread llm_evaluation/run.py Outdated
Comment thread llm_evaluation/run.py Outdated
Comment thread llm_evaluation/run.py Outdated
Comment thread llm_evaluation/run.py Outdated
Comment thread scripts/process_datasets/prep_datasets.py Outdated
Comment thread README.md Outdated
Comment thread README.md

@jiarong0907 jiarong0907 left a comment

Contributor


See comments.

@jiarong0907

Contributor

Did we run a full PR test for all evaluation metrics?

@RixinLiu

RixinLiu commented Dec 4, 2025

Contributor Author

> Did we run a full PR test for all evaluation metrics?

Yes, here:
#55 (comment)

@jiarong0907

Contributor

> > Did we run a full PR test for all evaluation metrics?
>
> Yes, here:
> #55 (comment)

Is this sufficient? I saw the Arenascore and others are empty.

@RixinLiu

RixinLiu commented Dec 4, 2025

Contributor Author

> > > Did we run a full PR test for all evaluation metrics?
> >
> > Yes, here:
> > #55 (comment)
>
> Is this sufficient? I saw the Arenascore and others are empty.

Yes, it's sufficient. The other metrics are empty only because I didn't run LLM inference; I put placeholders in for the token usage, generated_answer, and cost.
This comment explains it in detail.

RixinLiu and others added 6 commits December 4, 2025 15:54
…ion.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…s.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@RixinLiu RixinLiu requested a review from jiarong0907 December 5, 2025 02:08
Comment thread global_utils/__init__.py
Comment thread llm_evaluation/run.py Outdated
Comment thread llm_evaluation/run.py Outdated
@jiarong0907

Contributor

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request successfully integrates an automatic robustness evaluation into the PR workflow. The changes are comprehensive, touching dataset preparation, prediction generation, validation, and scoring. The README is also updated accordingly.

I've identified a few areas for improvement:

  • There's a minor bug in the logging of the robustness score, which misrepresents the 'flip ratio'.
  • In several places, broad except Exception blocks are used, which could be narrowed to more specific exceptions to improve error handling and prevent masking other issues.

Overall, this is a great addition to the project. My detailed comments are below.
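
The two points above can be sketched roughly as follows. This is an illustrative example, not the code under review: the function names are hypothetical, and the definitions used here (flip ratio as the fraction of prompts whose routed model changes under perturbation, robustness score as one minus that ratio) are assumptions that may not match how RouterArena defines them.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_predictions(path: str) -> list:
    """Load a prediction file, catching only the failures we expect."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        # A missing file is reported explicitly instead of being
        # swallowed by a broad `except Exception`.
        logger.error("Prediction file not found: %s", path)
        raise
    except json.JSONDecodeError as exc:
        logger.error("Prediction file %s is not valid JSON: %s", path, exc)
        raise


def flip_ratio(original: list, perturbed: list) -> float:
    """Assumed definition: share of prompts routed to a different model."""
    if len(original) != len(perturbed):
        raise ValueError("original and perturbed lists must have equal length")
    if not original:
        return 0.0
    flips = sum(a != b for a, b in zip(original, perturbed))
    return flips / len(original)


def log_robustness(original: list, perturbed: list) -> float:
    """Log both quantities under their own names and return the score."""
    ratio = flip_ratio(original, perturbed)
    score = 1.0 - ratio  # assumed relation between score and flip ratio
    logger.info("Flip ratio: %.4f | robustness score: %.4f", ratio, score)
    return score
```

Logging the flip ratio and the score as separate, explicitly named values avoids the kind of mislabeling described in the first bullet, and the narrow `except` clauses let unexpected errors surface rather than being masked.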

Comment thread llm_evaluation/run.py Outdated
Comment thread global_utils/robustness.py Outdated
Comment thread llm_evaluation/run.py Outdated
Comment thread llm_evaluation/run.py Outdated
@RixinLiu RixinLiu requested a review from jiarong0907 December 8, 2025 01:03
@RixinLiu

RixinLiu commented Dec 8, 2025

Contributor Author

Looks like everything is settled, judging from this issue result.

@jiarong0907

Contributor

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request successfully integrates an automatic robustness evaluation into the RouterArena workflow. The changes are well-structured across multiple scripts, including dataset preparation, prediction generation, validation, and score computation. The README.md has also been updated accordingly.

I've found one critical issue: a typo in a filename within scripts/process_datasets/prep_datasets.py which would break the data preparation step for the robustness evaluation. I've also noted a minor issue with a misleading log message. Please see the specific comments for details.

Overall, this is a great feature addition that enhances the evaluation capabilities of RouterArena.

Comment thread scripts/process_datasets/prep_datasets.py Outdated
Comment thread router_inference/check_config_prediction_files.py Outdated
RixinLiu and others added 5 commits December 7, 2025 21:43
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@RixinLiu RixinLiu merged commit b92f9ec into main Dec 8, 2025
10 checks passed
@jiarong0907 jiarong0907 deleted the automatic-robustness-evaluation branch December 8, 2025 04:01

2 participants