Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
256 changes: 256 additions & 0 deletions eval/routerarena/leaderboard-pr/README.upstream.updated.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,256 @@
<div align="center">
<img src="images/routerarena_logo_v2.png" alt="RouterArena logo" height="96" />

<br>
<p>
<a href="https://huggingface.co/blog/JerryPotter/who-routes-the-routers"><img alt="Blog" src="https://img.shields.io/badge/Blog-Read-FF5722?logo=rss&logoColor=white&labelColor=555555"></a>
<a href="https://arxiv.org/abs/2510.00202"><img alt="arXiv: RouterArena" src="https://img.shields.io/badge/arXiv-RouterArena-b31b1b?logo=arxiv&logoColor=white&labelColor=555555"></a>
<a href="https://huggingface.co/datasets/RouteWorks/RouterArena"><img alt="Hugging Face Dataset" src="https://img.shields.io/badge/%20Hugging%20Face-Dataset-yellow?logo=huggingface&logoColor=white&labelColor=555555"></a>
<br>
</p>

</div>

<h1 align="center"> Make Router Evaluation Open and Standardized </h1>

<p align="center">
<img src="images/routerarena-diagram.png" alt="RouterArena Diagram" width="700" />
</p>

**RouterArena** is an open evaluation platform and leaderboard for **LLM routers**—systems that automatically select the best model for a given query. As the LLM ecosystem diversifies with models varying in size, capability, and cost, **routing** has become critical for balancing performance and cost. Yet, LLM routers currently lack a standardized evaluation framework to assess how effectively they trade off accuracy, cost, and other related metrics.

RouterArena bridges this gap by providing an open evaluation platform and benchmarking framework for both open-source and commercial routers. It has the following key features:

- 🌍 **Diverse Data Coverage**: A principly-constructed, diverse evaluation dataset spanning 9 domains and 44 categories with easy, medium, and hard difficulty levels.
- 📊 **Comprehensive Metrics**: Five router-critical metrics measuring accuracy, cost, optimality, robustness, and latency.
- ⚙️ **Automated Evaluation**: An automated evaluation framework to simplify the evaluation process for open-source and commercial routers.
- 🏆 **Live Leaderboard**: A live leaderboard to track the performance of routers across multiple dimensions.

*We aim for RouterArena to serve as a foundation for the community to evaluate, understand, and advance LLM routing systems.*

> [!IMPORTANT]
> **RouterArena is an evaluation-only dataset.** Submissions that train, fit, or tune any router component on RouterArena data (including the label files) will be rejected, and any accepted submission found in violation will be withdrawn.

# Current Leaderboard

For more details, please see our [website](https://routeworks.github.io/leaderboard) and [blog](https://huggingface.co/blog/JerryPotter/who-routes-the-routers).

| Rank | Router | Affiliation | Acc-Cost Arena | Accuracy | Cost/1K Queries | Optimal Selection | Optimal Cost | Optimal Accuracy | Latency | Robustness |
|------|--------------------|-----------------------------|--------|----------|---------|-----------------|--------------|----------------|---------|------------|
| 🥇 | [Sqwish Router](https://www.sqwish.ai/) | | 75.27 | 76.40 | $0.18 | 7.41 | 25.10 | 90.47 | — | 100.00 |
| 🥈 | [Nadir Cascade](https://getnadir.com) | | 73.33 | 74.87 | $0.29 | — | — | — | — | 25.48 |
| 🥉 | [Weave Router](https://workweave.ai) | 🎓&nbsp;Weave | 72.82 | 76.32 | $0.94 | — | — | — | — | 100.00 |
| 4 | [OrcaRouter-Adaptive](https://www.orcarouter.ai/) | | 72.08 | 75.54 | $1.00 | — | — | — | — | 22.62 |
| 5 | [Azure-Model-Router](https://ai.azure.com/catalog/models/model-router)&nbsp;[[Web]](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/model-router) | 💼&nbsp;Microsoft | 71.87 | 72.82 | $0.22 | — | — | — | — | 71.43 |
| 6 | [R2-Router](https://arxiv.org/abs/2602.02823/) | 🎓&nbsp;UCF | 71.60 | 71.23 | $0.06 | 24.51 | 48.70 | 99.85 | — | 45.71 |
| 7 | [Auto Router]() | | 70.05 | 70.17 | $0.12 | 37.58 | 40.02 | 86.04 | — | 49.52 |
| 8 | [vLLM‑SR](https://vllm-semantic-router.com/)&nbsp;[[Code]](https://github.com/vllm-project/semantic-router)&nbsp;[[HF]](https://huggingface.co/llm-semantic-router) | 🎓&nbsp;vLLM SR Team | 67.23 | 66.53 | $0.06 | 84.66 | 90.71 | 89.24 | — | 90.95 |
| 9 | [MIRT‑BERT](https://arxiv.org/pdf/2506.01048)&nbsp;[[Code]](https://github.com/Mercidaiha/IRT-Router) | 🎓&nbsp;USTC | 66.89 | 66.88 | $0.15 | 3.44 | 19.62 | 78.18 | 27.03 | 61.19 |
| 10 | [NIRT‑BERT](https://arxiv.org/pdf/2506.01048)&nbsp;[[Code]](https://github.com/Mercidaiha/IRT-Router) | 🎓&nbsp;USTC | 66.12 | 66.34 | $0.21 | 3.83 | 14.04 | 77.88 | 10.42 | 49.29 |
| 11 | [GPT‑5](https://openai.com/index/introducing-gpt-5/) | 💼&nbsp;OpenAI | 64.32 | 73.96 | $10.02 | — | — | — | — | — |
| 12 | [CARROT](https://arxiv.org/abs/2502.03261)&nbsp;[[Code]](https://github.com/somerstep/CARROT)&nbsp;[[HF]](https://huggingface.co/CARROT-LLM-Routing) | 🎓&nbsp;UMich | 63.87 | 67.21 | $2.06 | 2.68 | 6.77 | 78.63 | 1.50 | 89.05 |
| 13 | [Chayan](https://huggingface.co/adaptive-classifier/chayan)&nbsp;[[HF]](https://huggingface.co/adaptive-classifier/chayan) | 🎓&nbsp;Adaptive&nbsp;Classifier | 63.83 | 64.89 | $0.56 | 43.03 | 43.75 | 88.74 | — | — |
| 14 | [AgentForge Router]() | | 60.12 | 59.16 | $0.09 | — | — | — | — | 100.00 |
| 15 | [RouterBench‑MLP](https://arxiv.org/pdf/2403.12031)&nbsp;[[Code]](https://github.com/withmartian/routerbench)&nbsp;[[HF]](https://huggingface.co/datasets/withmartian/routerbench) | 🎓&nbsp;Martian | 57.56 | 61.62 | $4.83 | 13.39 | 24.45 | 83.32 | 90.91 | 80.00 |
| 16 | [NotDiamond](https://www.notdiamond.ai/) | 💼&nbsp;NotDiamond | 57.29 | 60.83 | $4.10 | 1.55 | 2.14 | 76.81 | — | 55.91 |
| 17 | [GraphRouter](https://arxiv.org/abs/2410.03834)&nbsp;[[Code]](https://github.com/ulab-uiuc/GraphRouter) | 🎓&nbsp;UIUC | 57.22 | 57.00 | $0.34 | 4.73 | 38.33 | 74.25 | 2.70 | 94.29 |
| 18 | [RouterBench‑KNN](https://arxiv.org/pdf/2403.12031)&nbsp;[[Code]](https://github.com/withmartian/routerbench)&nbsp;[[HF]](https://huggingface.co/datasets/withmartian/routerbench) | 🎓&nbsp;Martian | 55.48 | 58.69 | $4.27 | 13.09 | 25.49 | 78.77 | 1.33 | 83.33 |
| 19 | [RouteLLM](https://arxiv.org/abs/2406.18665)&nbsp;[[Code]](https://github.com/lm-sys/RouteLLM)&nbsp;[[HF]](https://huggingface.co/routellm) | 🎓&nbsp;Berkeley | 48.07 | 47.04 | $0.27 | 99.72 | 99.63 | 68.76 | 0.40 | 100.00 |
| 20 | [RouterDC](https://arxiv.org/abs/2409.19886)&nbsp;[[Code]](https://github.com/shuhao02/RouterDC) | 🎓&nbsp;SUSTech | 33.75 | 32.01 | $0.07 | 39.84 | 73.00 | 49.05 | 10.75 | 85.24 |

🎓 Open-source  💼 Closed-source 

<!-- <p align="center">
<img src="images/leaderboard.png" alt="Make GPU Sharing Flexible and Easy" width="500" />
</p> -->

<!-- # Have your router on the leaderboard! -->

# Evaluating Your Router

## 1. Setup

### Step 1.1: Install uv and RouterArena

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
cd RouterArena
uv sync
```

### Step 1.2: Download Dataset
Download the dataset from [HF dataset](https://huggingface.co/datasets/RouteWorks/RouterArena).

```bash
uv run python ./scripts/process_datasets/prep_datasets.py
```

### Step 1.3: Set Up API Keys (Optional)

In the project root, copy `.env.example` as `.env` and update the API keys in `.env`. This step is **required only if you use our pipeline for LLM inferences**.

```bash
# Example .env file
OPENAI_API_KEY=<Your-Key>
ANTHROPIC_API_KEY=<Your-Key>
# ...
```

See the [`ModelInference`](./llm_inference/model_inference.py) class for the complete list of supported providers and required environment variables. You can extend that class to support more models, or submit a GitHub issue to request support for new providers.

## 2. Get Routing Decisions

Follow the steps below to obtain your router's model choices for each query. Start with the `sub_10` split (a 10% subset) for local testing. Once your setup works, you can evaluate:
- on the `full` dataset for full local evaluation and official leaderboard submission.
- on the `robustness` dataset for robustness evaluation.

### Step 2.1: Prepare Config File

Create a config file in `./router_inference/config/<router_name>.json`. An example config file is included [here](./router_inference/config/your-router.json).

```json
{
"pipeline_params": {
"router_name": "your-router",
"router_cls_name": "your_router_class_name",
"models": [
"gpt-4o-mini",
"claude-3-haiku-20240307",
"gemini-2.0-flash-001"
]
}
}
```

For each model in your config, add an entry with the pricing per million tokens in this format at [`model_cost/model_cost.json`](./model_cost/model_cost.json):

```json
{
"gpt-4o-mini": {
"input_token_price_per_million": 0.15,
"output_token_price_per_million": 0.6
},
}
```

> [!NOTE]
> Ensure all models in your above config files are listed in [`./universal_model_names.py`](./universal_model_names.py). If you add a new model, you must also add the API inference endpoint in [`llm_inference/model_inference.py`](./llm_inference/model_inference.py).

### Step 2.2: Create Your Router Class and Generate Prediction File

Create your own router class by inheriting from `BaseRouter` and implementing the `_get_prediction()` method. See [`router_inference/router/example_router.py`](./router_inference/router/example_router.py) for a complete example.

Then, modify [`router_inference/router/__init__.py`](./router_inference/router/__init__.py) to include your router class:

```python
# Import your router class
from router_inference.router.my_router import MyRouter

__all__ = ["BaseRouter", "ExampleRouter", "MyRouter"]
```

Finally, generate the prediction file:

```bash
uv run python ./router_inference/generate_prediction_file.py your-router [sub_10|full|robustness]
```

> [!NOTE]
> - The `<your-router>` argument must match your config filename (without the `.json` extension). For example, if your config file is `router_inference/config/my-router.json`, use `my-router` as the argument.
> - Your `_get_prediction()` method must return a model name that exists in your config file's `models` list. The base class will automatically validate this.

### Step 2.3: Validate Config and Prediction Files

```bash
uv run python ./router_inference/check_config_prediction_files.py your-router [sub_10|full|robustness]
```

This script checks: (1) all model names are valid, (2) prediction file has correct size (809 for `sub_10`, 8400 for `full`, 420 for `robustness`), and (3) all entries have valid `global_index`, `prompt`, and `prediction` fields.

## 3. Run LLM Inference

Run the inference script to make API calls for each query using the selected models:

```bash
uv run python ./llm_inference/run.py your-router
```

The script loads your prediction file, makes API calls using the models specified in the `prediction` field, and saves results incrementally. It uses cached results when available and saves progress after each query, so you can safely interrupt and resume. Results are saved to `./cached_results/` for reuse across routers.
> [!NOTE]
> - For robustness evaluation, we only measure the model-selection flip ratio after adding noise to the original prompt, so no additional LLM inference is required for this stage.

## 4. Run Router Evaluation

As the last step, run the evaluation script:

```bash
uv run python ./llm_evaluation/run.py your-router [sub_10|full|robustness]
```

> [!TIP]
> - Use `sub_10` or `full` to evaluate on those datasets.
> - Use `robustness` to run robustness-only evaluation (expects `<router_name>-robustness.json`).

# Submitting to the leaderboard

To get your router on the leaderboard, you can open a Pull Request with your router's prediction file to trigger our automated evaluation workflow. Details are as follows:

1. **Add your files**:
- `router_inference/config/<router_name>.json` - Your router configuration
- `router_inference/predictions/<router_name>.json` - Your prediction file with `generated_result` fields populated
- `router_inference/predictions/<router_name>-robustness.json` - Your prediction file for robustness evaluation, no `generated_result` fields needed
2. **Open a Pull Request to `main` branch and call `/evaluate` in the PR comment**
- When the PR is ready for evaluation, call `/evaluate` in the PR comment to trigger the evaluation workflow. See an example [here](https://github.com/RouteWorks/RouterArena/pull/71#issuecomment-3904936480).
- The automated workflow will:
- Validate your submission
- Run evaluation on the full dataset
- Post results as a comment on your PR
- Update the leaderboard upon approval

The Figure below shows the evaluation pipeline.

<p align="center">
<img src="images/pipeline.png" alt="RouterArena Evaluation Pipeline" width="700" />
</p>

# Contributing

We welcome and appreciate contributions and collaborations of any kind.

We use pre-commit to ensure a consistent coding style. You can set it up by

```bash
pip install pre-commit
pre-commit install
```

Before pushing your code, run the following and make sure your code passes all checks.

```bash
pre-commit run --all-files
```

# Contacts

Feel free to contact us for contributions and collaborations.

```
Yifan Lu (yifan.lu@rice.edu)
Rixin Liu (rixin.liu@rice.edu)
Jiarong Xing (jxing@rice.edu)
```

# Citation:
If you find our project helpful, please give us a star and cite us by:

```bibtax
@misc{lu2025routerarenaopenplatformcomprehensive,
title = {RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers},
author = {Yifan Lu and Rixin Liu and Jiayi Yuan and Xingqi Cui and Shenrun Zhang and Hongyi Liu and Jiarong Xing},
year = {2025},
eprint = {2510.00202},
archivePrefix= {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2510.00202}
}
```
65 changes: 65 additions & 0 deletions eval/routerarena/leaderboard-pr/SUBMIT_INSTRUCTIONS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
# RouterArena leaderboard README update — Nadir at #2

PR [RouteWorks/RouterArena#112](https://github.com/RouteWorks/RouterArena/pull/112)
(`nadir-cascade-v2`) was merged on 2026-05-31. The automated evaluation bot
returned the official scores below, but the leaderboard table in
`RouteWorks/RouterArena/README.md` has not yet been regenerated to include the
entry. This directory holds a ready-to-submit README edit that inserts Nadir at
its earned rank.

## Official evaluation result (from the PR's CI bot)

| Metric | Value |
|---|---|
| Acc-Cost Arena score | **0.7333** (→ 73.33) |
| Accuracy | 74.87% |
| Avg cost / 1K queries | $0.2932 (→ $0.29) |
| Robustness | 0.2548 (→ 25.48) |
| Queries | 8,400 (full split) |
| Gate checks | Passed all 4 |

Ranked by **Acc-Cost Arena** (the leaderboard's ranking column), 73.33 lands at
**#2**, behind Sqwish Router (75.27) and ahead of Weave Router (72.82).

> Note: the Robustness score (25.48) is low relative to the top entries
> (100.00). The leaderboard ranks by Acc-Cost Arena, so #2 stands, but expect a
> maintainer to ask about robustness. The PR notes a planned follow-up to
> rebuild the robustness predictions to mirror main routing.

## The change

Insert one row for `Nadir Cascade` as the new 🥈, demote Weave to 🥉 and
OrcaRouter to rank 4, and shift ranks 4–19 down to 5–20. See
`nadir_leaderboard.patch` (a unified diff against
`RouteWorks/RouterArena/main:README.md`) and `README.upstream.updated.md` (the
full updated upstream README for reference).

```
| 🥈 | [Nadir Cascade](https://getnadir.com) | | 73.33 | 74.87 | $0.29 | — | — | — | — | 25.48 |
```

## How to open the upstream PR (from a RouterArena fork)

This repo's automation is scoped to `doramirdor/getnadir.dev` and cannot push
to `RouteWorks/RouterArena`, so the upstream PR has to be opened from a fork.

```bash
# 1. Fork RouteWorks/RouterArena on GitHub (once), then:
git clone https://github.com/<your-username>/RouterArena.git
cd RouterArena
git remote add upstream https://github.com/RouteWorks/RouterArena.git
git fetch upstream && git checkout -b leaderboard-add-nadir upstream/main

# 2. Apply the patch from this directory
git apply /path/to/getnadir.dev/eval/routerarena/leaderboard-pr/nadir_leaderboard.patch
# (if the upstream README moved and the patch won't apply cleanly,
# copy the single Nadir row above into the table manually and renumber)

# 3. Commit and push to your fork
git add README.md
git commit -m "Add Nadir Cascade to leaderboard (#2, Acc-Cost Arena 73.33)"
git push -u origin leaderboard-add-nadir

# 4. Open a PR from your fork's branch to RouteWorks/RouterArena:main
# referencing the merged submission PR #112.
```
Loading