[DON'T MERGE] Test automatic robustness evaluation by Eternal579 · Pull Request #55 · RouteWorks/RouterArena

Eternal579 · 2025-12-04T16:16:51Z

No description provided.


github-actions · 2025-12-04T19:48:21Z

Router Evaluation Results

Router: your-router
Dataset Split: full

RouterArena Metrics

| Metric | Value |
| --- | --- |
| RouterArena Score | 0.0000 |
| Accuracy | 0.00% |
| Total Cost | $0.010815 |
| Avg Cost per Query | $0.000001 |
| Avg Cost per 1K Queries | $0.0013 |
| Number of Queries | 8400 |
| Robustness Score | 0.2524 |

Optimality Metrics

| Metric | Value |
| --- | --- |
| Opt.Sel (Optimal Selection) | 0.0000 |
| Opt.Cost (Cost Efficiency) | 0.0000 |
| Opt.Acc (Accuracy vs Optimal) | 0.0000 |

Evaluation completed by RouterArena automated workflow

RixinLiu · 2025-12-04T20:03:57Z

I added placeholders (random token usage and random cost) to the prediction files, so every metric except the robustness score looks abnormal. I double-checked against my local run, though, and the values match; see the details below:

$ uv run python ./llm_evaluation/run.py your-router full
...
================================================================================
2025-12-04 14:01:35 - __main__ - INFO - Router: your-router
2025-12-04 14:01:35 - __main__ - INFO - ================================================================================
2025-12-04 14:01:35 - __main__ - INFO - Total Queries (Regular): 8400
2025-12-04 14:01:35 - __main__ - INFO - Optimality Entries: 2427 (excluded from RouterArena score)
2025-12-04 14:01:35 - __main__ - INFO - Queries with Accuracy: 8400
2025-12-04 14:01:35 - __main__ - INFO - Queries with Valid Cost: 8400
2025-12-04 14:01:35 - __main__ - INFO - Average Accuracy: 0.0000
2025-12-04 14:01:35 - __main__ - INFO - Total Cost: $0.010815
2025-12-04 14:01:35 - __main__ - INFO - Average Cost per Query: $0.000001
2025-12-04 14:01:35 - __main__ - INFO - Average Cost per 1K Queries: $0.0013
2025-12-04 14:01:35 - __main__ - INFO - RouterArena Score: 0.0000
2025-12-04 14:01:35 - __main__ - INFO - 
--------------------------------------------------------------------------------
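
The per-query and per-1K figures in this log follow from the total cost and the query count. A quick check in Python, assuming the averages are plain arithmetic means (the variable names are illustrative and not taken from `llm_evaluation/run.py`):

```python
# Values copied from the log above.
total_cost = 0.010815   # "Total Cost", in dollars
num_queries = 8400      # "Total Queries (Regular)"

# Assumption: both averages are simple means over the regular queries.
avg_cost_per_query = total_cost / num_queries
avg_cost_per_1k = avg_cost_per_query * 1000

print(f"Average Cost per Query: ${avg_cost_per_query:.6f}")
print(f"Average Cost per 1K Queries: ${avg_cost_per_1k:.4f}")
```

Rounded to the same precision as the log, both values agree with the reported `$0.000001` and `$0.0013`.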

The robustness score also matches what I got locally; see the details below:

$ uv run python ./llm_evaluation/run.py your-router --calculate-robustness-score
warning: The `tool.uv.dev-dependencies` field (used in `pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
2025-12-04 13:50:55 - __main__ - INFO - Computing robustness score using ./router_inference/predictions/your-router.json (full) and ./router_inference/predictions/your-router-robustness.json (robustness)
2025-12-04 13:50:55 - __main__ - INFO - Loaded 10827 predictions from ./router_inference/predictions/your-router.json
2025-12-04 13:50:55 - __main__ - INFO - Matched: 420, Flips: 314
2025-12-04 13:50:55 - __main__ - INFO - Robustness score: 0.2524
2025-12-04 13:50:55 - __main__ - INFO - Robustness metrics saved to ./metrics.json
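
The numbers in this log are consistent with the robustness score being the share of matched entries that did not flip. That formula is inferred from the logged values, not confirmed by the source, and the function name below is hypothetical:

```python
def robustness_score(matched: int, flips: int) -> float:
    """Share of matched entries that did not flip.

    Assumed formula, inferred from the log: 1 - flips / matched.
    """
    if matched == 0:
        raise ValueError("no matched entries to score")
    return 1.0 - flips / matched


# "Matched: 420, Flips: 314" from the log above.
score = robustness_score(matched=420, flips=314)
print(f"Robustness score: {score:.4f}")

# Side check: the 10827 loaded predictions equal the 8400 regular
# queries plus the 2427 optimality entries reported by the full run.
regular, optimality, loaded = 8400, 2427, 10827
assert regular + optimality == loaded
```

Under that assumption the result rounds to the logged `0.2524`, which is also the value in the workflow's report.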

So this test is successful, and the PR can be closed.

Test automatic robsutness evaluation

7de1c55

RixinLiu changed the title ~~Test automatic robsutness evaluation~~ [Don't Merge] Test automatic robsutness evaluation Dec 4, 2025

Eternal579 added 3 commits December 4, 2025 12:58

Merge remote-tracking branch 'upstream/automatic-robustness-evaluatio…

8c0c14f

…n' into automatic-robustness-evaluation Sync fork

add place holder for generated results JUST FOR TEST

8135898

add place holder for generated results JUST FOR TEST

1e69bf3

RixinLiu changed the title ~~[Don't Merge] Test automatic robsutness evaluation~~ [DON'T MERGE] Test automatic robsutness evaluation Dec 4, 2025

RixinLiu closed this Dec 4, 2025

RixinLiu mentioned this pull request Dec 4, 2025

[Feat.] PR evaluation workflow with automatic robustness evaluation #56

Merged

Eternal579 wants to merge 4 commits into RouteWorks:automatic-robustness-evaluation from Eternal579:automatic-robustness-evaluation
