[Feat.] PR evaluation workflow with automatic robustness evaluation #56
Merged
23 commits
393e9d4 prep_datasets is ready (RixinLiu)
76e0d9c generate_prediction_file.py is ready (RixinLiu)
a589e93 check_config_prediction_files.py is ready (RixinLiu)
0d783fa llm_evaluation/run.py is ready (RixinLiu)
a77f91e pr-automation is ready to test (RixinLiu)
2285c91 [Debug] (RixinLiu)
865f500 [Feat.] PR evaluation workflow with automatic robustness evaluation (RixinLiu)
9712ce2 [GEMINI SUGGESTION] Update try-catch in automation/process_pr_submiss… (RixinLiu)
56e09c3 [GEMINI SUGGESTION] Update try-catch in llm_evaluation/run.py (RixinLiu)
ee47815 [GEMINI SUGGESTION] Fix typo in scripts/process_datasets/prep_dataset… (RixinLiu)
83e741f Refactor robustness score compute logic (RixinLiu)
dbcdcf8 Refine compute_robustness_score implementation (RixinLiu)
d783823 Handle robustness CLI errors by raising exceptions (RixinLiu)
ef5f776 Remove type ignore (RixinLiu)
6366111 Solve conflict between local utils and global utils (RixinLiu)
4120894 Replace arg --calculate-robustness-score with robustness (RixinLiu)
407aa2e Should pass pre-commit (RixinLiu)
fd8e765 Refine code (RixinLiu)
2c24823 Update scripts/process_datasets/prep_datasets.py (RixinLiu)
94e22f3 Update router_inference/check_config_prediction_files.py (RixinLiu)
6883d6e Ready to merge (RixinLiu)
1bc8176 Ready to merge (RixinLiu)
9e60ff7 Remove incorrect files (RixinLiu)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| # SPDX-FileCopyrightText: Copyright contributors to the RouterArena project | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """Shared utilities for RouterArena scripts.""" | ||
|
|
||
| from .robustness import compute_robustness_score # noqa: F401 | ||
|
|
||
| __all__ = ["compute_robustness_score"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,90 @@ | ||
| # SPDX-FileCopyrightText: Copyright contributors to the RouterArena project | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """Utilities for computing robustness metrics across scripts.""" | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| from typing import Any, Optional | ||
|
|
||
| from universal_model_names import ModelNameManager | ||
|
|
||
| __all__ = ["compute_robustness_score"] | ||
|
|
||
|
|
||
| def _normalize_model_name( | ||
| model_name: Optional[str], name_manager: ModelNameManager | ||
| ) -> Optional[str]: | ||
| """Convert a model name to its universal form, falling back gracefully.""" | ||
| if model_name is None: | ||
| return None | ||
| try: | ||
| return name_manager.get_universal_name(model_name) | ||
| except ValueError: | ||
| return model_name | ||
|
|
||
|
|
||
| def compute_robustness_score( | ||
| full_predictions: list[dict[str, Any]], | ||
| robustness_predictions: list[dict[str, Any]], | ||
| *, | ||
| name_manager: ModelNameManager | None = None, | ||
| ) -> Optional[float]: | ||
| """ | ||
| Compute the robustness score (1 - flip ratio) between full and robustness prediction sets. | ||
|
|
||
| Args: | ||
| full_predictions: Router predictions for the full/sub_10 split. | ||
| robustness_predictions: Predictions collected from the robustness split. | ||
| name_manager: Optional shared instance to reuse universal name cache. | ||
|
|
||
| Returns: | ||
| A float in [0, 1] representing stability (1 - flip ratio), | ||
| or ``None`` if no overlapping entries were found. | ||
| """ | ||
|
|
||
| manager = name_manager or ModelNameManager() | ||
|
|
||
| def get_index(entry: dict[str, Any]) -> Optional[str]: | ||
| """Extract a normalized global index from an entry.""" | ||
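| # Check for None explicitly so that a valid index of 0 is not discarded by `or`. | ||
| value = entry.get("global index") | ||
| if value is None: | ||
| value = entry.get("global_index") | ||
| return str(value) if value is not None else None | ||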
|
|
||
| def normalize(name: object) -> Optional[str]: | ||
| """Normalize model names through the shared name manager.""" | ||
| if not name: | ||
| return None | ||
| return _normalize_model_name(str(name), manager) | ||
|
|
||
| # Build a lookup of router selections from the full split. | ||
| full_map = { | ||
| key: entry | ||
| for entry in full_predictions | ||
| if isinstance(entry, dict) | ||
| and not entry.get("for_optimality", False) | ||
| and (key := get_index(entry)) is not None | ||
| } | ||
|
|
||
| if not full_map: | ||
| return None | ||
|
|
||
| matches = [ | ||
| (full_map[key].get("prediction"), entry.get("prediction")) | ||
| for entry in robustness_predictions | ||
| if isinstance(entry, dict) | ||
| and (key := get_index(entry)) is not None | ||
| and key in full_map | ||
| and full_map[key].get("prediction") | ||
| and entry.get("prediction") | ||
| ] | ||
|
|
||
| if not matches: | ||
| return None | ||
|
|
||
| flips = sum( | ||
| 1 | ||
| for full_model, robust_model in matches | ||
| if normalize(full_model) != normalize(robust_model) | ||
| ) | ||
|
|
||
| return 1.0 - flips / len(matches) |
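
The scoring logic in the diff above can be exercised in isolation. The following is a minimal, simplified sketch rather than the project's implementation: `StubNameManager`, `flip_score`, and the model names are hypothetical stand-ins, since `ModelNameManager` from `universal_model_names` is not shown in this diff. It assumes the same entry keys (`global index` / `global_index`, `prediction`, `for_optimality`) that the function reads.

```python
from typing import Any, Optional


class StubNameManager:
    """Hypothetical stand-in for ModelNameManager (an assumption for this sketch)."""

    _ALIASES = {"model-a-v1": "model-a"}

    def get_universal_name(self, name: str) -> str:
        if name in self._ALIASES:
            return self._ALIASES[name]
        raise ValueError(name)  # mirrors the ValueError fallback path in the diff


def flip_score(
    full: list[dict[str, Any]], robust: list[dict[str, Any]]
) -> Optional[float]:
    """Return 1 - flip ratio over entries present in both prediction sets."""
    manager = StubNameManager()

    def norm(name: str) -> str:
        # Unknown names fall back to the raw string.
        try:
            return manager.get_universal_name(name)
        except ValueError:
            return name

    def index(entry: dict[str, Any]) -> Optional[str]:
        # Explicit None checks keep a valid index of 0.
        value = entry.get("global index")
        if value is None:
            value = entry.get("global_index")
        return None if value is None else str(value)

    # Router selections from the full split, skipping optimality-only entries.
    full_map = {
        index(e): e.get("prediction")
        for e in full
        if index(e) is not None
        and not e.get("for_optimality", False)
        and e.get("prediction")
    }
    pairs = [
        (full_map[index(e)], e.get("prediction"))
        for e in robust
        if index(e) in full_map and e.get("prediction")
    ]
    if not pairs:
        return None
    flips = sum(1 for a, b in pairs if norm(a) != norm(b))
    return 1.0 - flips / len(pairs)


full = [
    {"global index": 0, "prediction": "model-a"},
    {"global index": 1, "prediction": "model-a"},
    {"global index": 2, "prediction": "model-b"},
    {"global index": 3, "prediction": "model-a", "for_optimality": True},
]
robust = [
    {"global_index": 0, "prediction": "model-a-v1"},  # alias of model-a: no flip
    {"global_index": 1, "prediction": "model-b"},  # flip
    {"global_index": 2, "prediction": "model-b"},  # unchanged
    {"global_index": 3, "prediction": "model-b"},  # ignored: for_optimality
    {"global_index": 9, "prediction": "model-a"},  # no matching full entry
]
score = flip_score(full, robust)
print(round(score, 3))  # one flip out of three matched pairs
```

Note that the entry with index 0 is matched only because the index lookup tests for `None` explicitly; an `or`-based lookup would treat `0` as missing and drop that pair.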