[DON'T MERGE] Test automatic robustness evaluation #55

Closed
Eternal579 wants to merge 4 commits into RouteWorks:automatic-robustness-evaluation from Eternal579:automatic-robustness-evaluation

@Eternal579

No description provided.

@RixinLiu RixinLiu changed the title from "Test automatic robustness evaluation" to "[Don't Merge] Test automatic robustness evaluation" Dec 4, 2025
@github-actions

github-actions Bot commented Dec 4, 2025

Router Evaluation Results

Router: your-router
Dataset Split: full

RouterArena Metrics

| Metric | Value |
|---|---|
| RouterArena Score | 0.0000 |
| Accuracy | 0.00% |
| Total Cost | $0.010815 |
| Avg Cost per Query | $0.000001 |
| Avg Cost per 1K Queries | $0.0013 |
| Number of Queries | 8400 |
| Robustness Score | 0.2524 |

Optimality Metrics

| Metric | Value |
|---|---|
| Opt.Sel (Optimal Selection) | 0.0000 |
| Opt.Cost (Cost Efficiency) | 0.0000 |
| Opt.Acc (Accuracy vs Optimal) | 0.0000 |

Evaluation completed by RouterArena automated workflow

@RixinLiu

RixinLiu commented Dec 4, 2025

Contributor

I added placeholders (random token usage and random cost) to the prediction files, so every metric except the robustness score looks abnormal. I double-checked the reported values against my local run, and they match; see details below:

$ uv run python ./llm_evaluation/run.py your-router full
...
================================================================================
2025-12-04 14:01:35 - __main__ - INFO - Router: your-router
2025-12-04 14:01:35 - __main__ - INFO - ================================================================================
2025-12-04 14:01:35 - __main__ - INFO - Total Queries (Regular): 8400
2025-12-04 14:01:35 - __main__ - INFO - Optimality Entries: 2427 (excluded from RouterArena score)
2025-12-04 14:01:35 - __main__ - INFO - Queries with Accuracy: 8400
2025-12-04 14:01:35 - __main__ - INFO - Queries with Valid Cost: 8400
2025-12-04 14:01:35 - __main__ - INFO - Average Accuracy: 0.0000
2025-12-04 14:01:35 - __main__ - INFO - Total Cost: $0.010815
2025-12-04 14:01:35 - __main__ - INFO - Average Cost per Query: $0.000001
2025-12-04 14:01:35 - __main__ - INFO - Average Cost per 1K Queries: $0.0013
2025-12-04 14:01:35 - __main__ - INFO - RouterArena Score: 0.0000
2025-12-04 14:01:35 - __main__ - INFO - 
--------------------------------------------------------------------------------
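
As a quick sanity check on the cost lines above, the two averages can be recomputed from the logged total cost and query count. This is only a sketch: it assumes the averages are plain total-cost-divided-by-query-count, which the log does not state explicitly, and the variable names are illustrative rather than taken from `run.py`.

```python
# Values copied from the log output above.
total_cost = 0.010815   # "Total Cost"
num_queries = 8400      # "Total Queries (Regular)"

# Assumption: averages are simple ratios of total cost to query count.
avg_cost_per_query = total_cost / num_queries
avg_cost_per_1k = avg_cost_per_query * 1000

print(f"Average Cost per Query: ${avg_cost_per_query:.6f}")      # $0.000001
print(f"Average Cost per 1K Queries: ${avg_cost_per_1k:.4f}")    # $0.0013
```

Both values round to the figures in the log and in the bot's table, so the placeholder costs appear to be aggregated consistently in CI and locally.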

The robustness score also matches what I got locally; see details below:

$ uv run python ./llm_evaluation/run.py your-router --calculate-robustness-score
warning: The `tool.uv.dev-dependencies` field (used in `pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
2025-12-04 13:50:55 - __main__ - INFO - Computing robustness score using ./router_inference/predictions/your-router.json (full) and ./router_inference/predictions/your-router-robustness.json (robustness)
2025-12-04 13:50:55 - __main__ - INFO - Loaded 10827 predictions from ./router_inference/predictions/your-router.json
2025-12-04 13:50:55 - __main__ - INFO - Matched: 420, Flips: 314
2025-12-04 13:50:55 - __main__ - INFO - Robustness score: 0.2524
2025-12-04 13:50:55 - __main__ - INFO - Robustness metrics saved to ./metrics.json
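
The logged numbers are consistent with a score defined as the fraction of matched queries that did not flip. The sketch below checks that arithmetic; the formula is inferred from the log values, not confirmed from the `run.py` source, and the variable names are hypothetical.

```python
# Values copied from the log output above.
matched = 420   # "Matched"
flips = 314     # "Flips"

# Assumption (inferred, not confirmed): score = 1 - flips / matched.
robustness_score = 1 - flips / matched
print(f"Robustness score: {robustness_score:.4f}")  # 0.2524

# Side check: the 10827 loaded predictions equal the 8400 regular queries
# plus the 2427 optimality entries reported by the full-split run.
regular_queries = 8400
optimality_entries = 2427
loaded_predictions = regular_queries + optimality_entries
print(loaded_predictions)  # 10827
```

Under that assumption, 106 of the 420 matched queries kept the same outcome, which gives the 0.2524 reported both by the workflow and by the local run.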

So this test was successful, and the PR can be closed.

@RixinLiu RixinLiu changed the title from "[Don't Merge] Test automatic robustness evaluation" to "[DON'T MERGE] Test automatic robustness evaluation" Dec 4, 2025
@RixinLiu RixinLiu closed this Dec 4, 2025

2 participants