@@ -43,21 +43,19 @@ The current leaderboard is computed considering the accuracy and overall cost fo
 
 If you want your router on the leaderboard, please contact us via email at yifan.lu@rice.edu or jxing@rice.edu, or submit a GitHub issue. For fairness, we have withheld the ground truth answers for the full dataset. However, you can still test your router using the sub-sampled 10% dataset by following the steps below.
 
-## Setup
+## Usage
 
-### Step 1: Install uv and RouterArena
+### Step 1: Install uv (if you don't have it)
 
 ```bash
 curl -LsSf https://astral.sh/uv/install.sh | sh
-cd RouterArena
-uv sync
 ```
 
-### Step 2: Download Dataset
-Run this command to download the dataset from the [HF dataset](https://huggingface.co/datasets/RouteWorks/RouterArena).
+### Step 2: Install RouterArena
 
 ```bash
-uv run python ./scripts/process_datasets/prep_datasets.py
+cd RouterArena
+uv sync
 ```
 
 ### Step 3: Set Up API Keys
@@ -66,20 +64,24 @@ This step is **required only if you plan to use our pipeline to make LLM inferen
 
 ```bash
 # Example .env file
-OPENAI_API_KEY=<Your-Key>
-ANTHROPIC_API_KEY=<Your-Key>
-HF_TOKEN=<Your-Key>
-# ...
+OPENAI_API_KEY=your_openai_key_here
+ANTHROPIC_API_KEY=your_anthropic_key_here
+GOOGLE_API_KEY=your_google_key_here
+MISTRAL_API_KEY=your_mistral_key_here
+HF_TOKEN=your_huggingface_token_here
+# ... add other keys as needed
 ```
 
-#### Optional:
 See the `ModelInference` class in `RouterArena/llm_inference/model_inference.py` for the complete list of supported providers and required environment variables. You can extend that class to support additional models, or submit a GitHub issue to request support for new providers.
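
Before running inference, it can help to confirm that the `.env` file actually defines the keys you expect. The sketch below is a minimal, standard-library-only check and is not part of RouterArena: `parse_env_file`, `missing_keys`, and the `required` list are hypothetical names chosen for illustration, and the real set of required variables depends on the providers your router uses.

```python
import os
import tempfile


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def missing_keys(env, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


# Demo on a throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# Example .env file\n")
        handle.write("OPENAI_API_KEY=your_openai_key_here\n")
        handle.write("HF_TOKEN=\n")
    env = parse_env_file(env_path)

# Hypothetical requirement list; adjust to the providers you actually call.
required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "HF_TOKEN"]
missing = missing_keys(env, required)
print("missing keys:", missing)
```

An empty value counts as missing here, which catches a key that was left blank in the template.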
 
-## Usage
+### Step 4: Download Dataset
+Run this command to download the dataset from the [HF dataset](https://huggingface.co/datasets/RouteWorks/RouterArena).
 
-Follow the steps below to evaluate your router. You can start with the `sub_10` split (10% sub-sampled dataset) to test your setup and code. The `sub_10` split includes ground truth answers for local testing. Once ready, you can evaluate on the `full` dataset for official leaderboard submission.
+```bash
+uv run python ./scripts/process_datasets/prep_datasets.py
+```
 
-### Step 1: Prepare Config File
+### Step 5: Prepare Config File and Model Costs
 
 Create a config file in `./router_inference/config/<router_name>.json`. We have created an example router for demonstration purposes:
 
@@ -99,64 +101,151 @@ Create a config file in `./router_inference/config/<router_name>.json`. We have
 
 *Note: The model name must be the same as the one used in `./universal_model_names.py` (see next step for details)*
 
-**Important**: For each model in your config, add an entry with the pricing per million tokens in this format:
+**Important**: You also need to add cost information for each model (same model naming requirement as above) in `./model_cost/cost.json`. For each model in your config, add an entry with the pricing per million tokens:
 
 ```json
 {
   "gpt-4o-mini": {
     "input_token_price_per_million": 0.15,
     "output_token_price_per_million": 0.6
   },
+  "claude-3-haiku-20240307": {
+    "input_token_price_per_million": 0.25,
+    "output_token_price_per_million": 1.25
+  },
+  "gemini-2.0-flash-001": {
+    "input_token_price_per_million": 0.1,
+    "output_token_price_per_million": 0.4
+  },
+  "mistral-medium": {
+    "input_token_price_per_million": 2.7,
+    "output_token_price_per_million": 8.1
+  }
 }
 ```
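
To make the per-million pricing concrete, here is a small worked example. It assumes the cost of a query is linear in its token counts, which is what per-million-token pricing implies; the evaluation script's actual formula is not shown in this diff, and `inference_cost` is a hypothetical helper, not a RouterArena function.

```python
def inference_cost(prices, model, input_tokens, output_tokens):
    """Dollar cost of one query, assuming cost is linear in token counts."""
    entry = prices[model]
    total = (
        input_tokens * entry["input_token_price_per_million"]
        + output_tokens * entry["output_token_price_per_million"]
    )
    return total / 1_000_000


# Same shape as the cost.json entries above.
prices = {
    "gpt-4o-mini": {
        "input_token_price_per_million": 0.15,
        "output_token_price_per_million": 0.6,
    }
}

# 1200 input tokens and 300 output tokens: (1200*0.15 + 300*0.6) / 1e6
cost = inference_cost(prices, "gpt-4o-mini", 1200, 300)
print(f"{cost:.6f}")  # prints 0.000360
```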
 
-### Step 2: Verify Model Names
+### Step 6: Verify Model Names
+API providers may use different names for the same model (e.g., `gpt-4o`, `openai/gpt-4o`). We manage this via `./universal_model_names.py`:
+- `universal_names`: canonical model names used in this project
+- `mapping`: maps external provider names to our canonical names
 
-Ensure all models in your config are listed in `./universal_model_names.py`. If you add a new model, you must also add the API inference endpoint in `RouterArena/llm_inference/model_inference.py`.
+Make sure every model you use is listed there. If you add a model, also add its API inference endpoint in `RouterArena/llm_inference/model_inference.py`.
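
The idea of a canonical name plus a provider-name mapping can be sketched as follows. The data shapes (a list for `universal_names`, a dict for `mapping`) and the `to_universal` helper are assumptions for illustration only; check `./universal_model_names.py` for how the project actually stores and resolves names.

```python
# Illustrative data only; the real contents live in universal_model_names.py.
universal_names = ["gpt-4o", "gpt-4o-mini"]
mapping = {"openai/gpt-4o": "gpt-4o"}


def to_universal(name):
    """Resolve a provider-specific model name to its canonical name."""
    if name in universal_names:
        return name
    if name in mapping:
        return mapping[name]
    raise KeyError(f"unknown model name: {name!r}")


print(to_universal("openai/gpt-4o"))
print(to_universal("gpt-4o-mini"))
```

Raising on an unknown name, rather than passing it through, surfaces a missing entry before any API calls are made.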
 
-### Step 3: Generate Router's Prediction File
+### Step 7: Generate Router's Prediction File
 
-Generate a template prediction file:
+You need to create a prediction file that contains your router's model selections for each query. You can use the helper script to generate a template prediction file:
 
 ```bash
-uv run python ./router_inference/generate_prediction_file.py your-router sub_10
+uv run python ./router_inference/generate_prediction_file.py your-router 10
+```
+
+This command generates a prediction file at `./router_inference/predictions/your-router.json` for the 10% split. Use `full` instead of `10` for the complete dataset.
+
+**Important**: The generated file uses a **placeholder router** that simply cycles through models in the config file sequentially. You **must replace the model choices** in the `prediction` field with your router's actual selections. The script is only meant to provide a template structure with all required fields populated.
+
+An example prediction file structure:
+
+```json
+[
+  {
+    "global index": "ArcMMLU_655",
+    "prompt": "Question text here...",
+    "prediction": "gpt-4o-mini", // Auto-generated by generate_prediction_file.py
+    "generated_result": null, // Will be filled after LLM inference
+    "cost": null, // Will be filled after evaluation
+    "accuracy": null // Will be filled after evaluation
+  }
+]
 ```
 
-Use `full` instead of `sub_10` for the complete dataset. **Important**: Replace the placeholder model choices in the `prediction` field with your router's actual selections.
+Alternatively, you can create the prediction file manually or integrate it into your router's inference pipeline. The `generated_result`, `cost`, and `accuracy` fields can be left as `null` initially; they will be populated by the LLM inference and evaluation in later steps.
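
Replacing the placeholder selections could look roughly like the sketch below. `my_router` is a stand-in for your own routing logic (here a toy length-based rule), and `fill_predictions` is a hypothetical helper; neither is part of RouterArena. In practice you would load the template with `json.load`, apply the function, and write the list back with `json.dump`.

```python
import json


def my_router(prompt):
    """Toy stand-in router: short prompts go to the cheaper model."""
    return "gpt-4o-mini" if len(prompt) < 200 else "claude-3-haiku-20240307"


def fill_predictions(entries, router):
    """Overwrite each entry's `prediction` with the router's choice."""
    for entry in entries:
        entry["prediction"] = router(entry["prompt"])
    return entries


# One in-memory entry with the same fields as the example structure above.
entries = [
    {
        "global index": "ArcMMLU_655",
        "prompt": "Question text here...",
        "prediction": "placeholder-model",
        "generated_result": None,
        "cost": None,
        "accuracy": None,
    }
]

filled = fill_predictions(entries, my_router)
print(json.dumps(filled, indent=2))
```

Only the `prediction` field is touched, so the other fields stay `null` for the later steps to populate.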
 
-### Step 4: Validate Config and Prediction Files
+### Step 8: Sanity Check for Config and Prediction Files
 
-Validate your config and prediction files before proceeding:
+Before proceeding with LLM inference, it's recommended to validate your config and prediction files using our validation script:
 
 ```bash
-uv run python ./router_inference/check_config_prediction_files.py your-router sub_10
+uv run python ./router_inference/check_config_prediction_files.py your-router 10
 ```
 
-This script checks: (1) all model names are valid, (2) prediction file has correct size (809 for `sub_10`, 8400 for `full`), and (3) all entries have valid `global_index`, `prompt`, and `prediction` fields.
+This script performs the following checks:
 
-## Run LLM Inference
+1. **Config Validation**: Verifies that all model names in your config file are valid and can be found in `ModelNameManager`
+2. **Prediction File Size**: Ensures your prediction file has the correct number of entries (809 for 10% split, 8400 for full dataset)
+3. **Field Validation**: Validates that each prediction entry:
+   - Has a `global_index` that exists in the dataset
+   - Has a `prompt` that exactly matches the dataset
+   - Has a `prediction` (model selection) that is one of the models listed in your config
 
-Run the inference script to make API calls for each query using the selected models:
+If all checks pass, you'll see `✓ ALL CHECKS PASSED!` and can proceed to the next step. If there are errors, the script will list them so you can fix any issues before running LLM inference and evaluation.
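
The size and field checks described above can be approximated in a few lines. This is a sketch under stated assumptions, not the validation script itself: the dataset is modeled as a plain dict from index to prompt, `check_predictions` is a hypothetical name, and the index key is a parameter because the field is written `global_index` in the checklist but with a space in the example prediction file.

```python
def check_predictions(predictions, dataset_prompts, config_models,
                      expected_size, index_key="global_index"):
    """Return a list of human-readable errors; an empty list means success."""
    errors = []
    if len(predictions) != expected_size:
        errors.append(
            f"expected {expected_size} entries, found {len(predictions)}"
        )
    for pos, entry in enumerate(predictions):
        index = entry.get(index_key)
        if index not in dataset_prompts:
            errors.append(f"entry {pos}: unknown index {index!r}")
            continue
        if entry.get("prompt") != dataset_prompts[index]:
            errors.append(f"entry {pos}: prompt does not match dataset")
        if entry.get("prediction") not in config_models:
            errors.append(
                f"entry {pos}: model {entry.get('prediction')!r} not in config"
            )
    return errors


# Tiny illustrative dataset and config.
dataset_prompts = {"ArcMMLU_655": "Question text here..."}
config_models = {"gpt-4o-mini"}
good = [{"global_index": "ArcMMLU_655",
         "prompt": "Question text here...",
         "prediction": "gpt-4o-mini"}]
bad = [{"global_index": "ArcMMLU_655",
        "prompt": "Question text here...",
        "prediction": "unknown-model"}]

ok_errors = check_predictions(good, dataset_prompts, config_models, 1)
bad_errors = check_predictions(bad, dataset_prompts, config_models, 1)
print(ok_errors or "all checks passed")
print(bad_errors)
```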
+
+### Step 9: Run LLM Inference
+
+Once your prediction file is ready, run the LLM inference script to make API calls for each query using the selected models:
 
 ```bash
 uv run python ./llm_inference/run.py your-router
 ```
 
-The script loads your prediction file, makes API calls using the models specified in the `prediction` field, and saves results incrementally. It uses cached results when available and saves progress after each query, so you can safely interrupt and resume. Results are saved to `./cached_results/` for reuse across routers.
+This script will:
+1. **Load your prediction file** from `./router_inference/predictions/your-router.json`
+2. **Make API calls** for each query using the model specified in the `prediction` field
+3. **Use cached results** when available (if the same model has already processed the same query)
+4. **Save results incrementally** back to the prediction file, updating the `generated_result` field with:
+   - `generated_answer`: The model's response
+   - `success`: Whether the API call succeeded
+   - `token_usage`: Input/output token counts
+   - `provider`: The API provider used
+   - `error`: Any error message (if failed)
 
-**Note**: Requires valid API keys (see Setup Step 3). The script skips entries that already have successful results.
+The script automatically saves progress after each query, so you can safely interrupt and resume later. Results are also saved to `./cached_results/` for reuse across different routers.
 
-## LLM Evaluation and Compute RouterArena Score
+**Note**: This step requires valid API keys (see Step 3) for the models you're using. The script will skip entries that already have successful results, making it safe to re-run.
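
The skip-and-cache behavior described above can be illustrated with an in-memory sketch. Everything here is an assumption for illustration: `run_inference` and `fake_call` are hypothetical, the cache is a plain dict keyed by `(model, prompt)`, and the real script persists results to disk and calls actual provider APIs.

```python
def run_inference(predictions, call_model, cache):
    """Fill `generated_result`, skipping finished entries and reusing cache."""
    api_calls = 0
    for entry in predictions:
        existing = entry.get("generated_result")
        if existing and existing.get("success"):
            continue  # already done, safe to skip on re-run
        key = (entry["prediction"], entry["prompt"])
        result = cache.get(key)
        if result is None:
            result = call_model(entry["prediction"], entry["prompt"])
            api_calls += 1
            if result.get("success"):
                cache[key] = result  # only successful results are reused
        entry["generated_result"] = result
    return api_calls


def fake_call(model, prompt):
    """Stand-in for a provider API call."""
    return {"generated_answer": f"{model} answer", "success": True,
            "token_usage": {"input": 10, "output": 5},
            "provider": "fake", "error": None}


predictions = [
    {"prediction": "gpt-4o-mini", "prompt": "Q1",
     "generated_result": {"generated_answer": "done", "success": True}},
    {"prediction": "gpt-4o-mini", "prompt": "Q2", "generated_result": None},
    {"prediction": "gpt-4o-mini", "prompt": "Q2", "generated_result": None},
]
cache = {}
calls = run_inference(predictions, fake_call, cache)
print("API calls made:", calls)
```

In this toy run the first entry is skipped and the third reuses the second entry's cached result, so only one call is made.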
 
-**Important**: For the `sub_10` split (testing), you can run evaluation locally and get RouterArena scores. For the `full` dataset (official leaderboard), ground truth answers are not available locally. After running LLM inference on the `full` dataset, submit your prediction file via GitHub issue or contact us at yifan.lu@rice.edu or jxing@rice.edu for official evaluation.
+### Step 10: Run LLM Evaluation
 
-For local evaluation on the `sub_10` split, run the evaluation script:
+After LLM inference is complete, evaluate the generated answers to compute accuracy and cost metrics:
 
 ```bash
 uv run python ./llm_evaluation/run.py your-router sub_10
 ```
 
-The script evaluates generated answers against ground truth, calculates inference costs, and computes router-level metrics including the RouterArena score (ranging 0-1). It skips already-evaluated entries, making it safe to re-run or resume.
+This script will:
+1. **Load your prediction file** from `./router_inference/predictions/your-router.json`
+2. **Determine the dataset** for each query based on its `global_index` (e.g., "AIME_112" → AIME dataset)
+3. **Evaluate each generated answer** against the ground truth using dataset-specific metrics:
+   - Math problems (AIME, MATH, etc.) → `math_metric`
+4. **Calculate inference cost** based on token usage and model pricing from `./model_cost/cost.json`
+5. **Save results incrementally** to the prediction file, updating:
+   - `accuracy`: Evaluation score (0.0 to 1.0)
+   - `cost`: Inference cost in dollars
+
+The script uses the `sub_10` split for testing (with ground truth answers available locally). For the full dataset evaluation, use `full` instead, but note that ground truth answers are not available locally, so you'll need to submit your predictions via a GitHub issue for official evaluation.
+
+The script automatically skips entries that are already evaluated, making it safe to re-run or resume after interruption.
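
Deriving the dataset from an index such as "AIME_112" might be done as in the sketch below. It assumes indices have the form `<dataset>_<number>`, as in the examples shown in this diff; `dataset_of`, `metric_for`, and the `METRICS` table are hypothetical, and only the `math_metric` pairing for AIME and MATH is taken from the text above.

```python
def dataset_of(global_index):
    """Split '<dataset>_<number>' and return the dataset part."""
    name, sep, suffix = global_index.rpartition("_")
    if not sep or not name or not suffix.isdigit():
        raise ValueError(f"unexpected index format: {global_index!r}")
    return name


# Only the math pairing is stated in the text; other datasets are omitted.
METRICS = {"AIME": "math_metric", "MATH": "math_metric"}


def metric_for(global_index):
    """Return the metric name for an index, or None if it is not listed."""
    return METRICS.get(dataset_of(global_index))


print(dataset_of("AIME_112"))
print(metric_for("AIME_112"))
print(metric_for("ArcMMLU_655"))
```

Splitting from the right keeps dataset names that themselves contain underscores intact.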
+
+### Step 11: Compute RouterArena Score
+
+After evaluation is complete, compute your router's RouterArena score:
+
+```bash
+uv run python ./router_evaluation/compute_scores.py your-router
+```
+
+This script calculates:
+1. **Average Accuracy**: The mean accuracy across all evaluated queries
+2. **Total Cost**: The sum of all inference costs
+3. **Average Cost per 1K Queries**: Total cost normalized to 1000 queries
+4. **RouterArena Score**: A composite score that balances accuracy and cost. It ranges from 0 to 1, with higher scores indicating better trade-offs between accuracy and cost efficiency.
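
The first three quantities follow directly from the per-entry `accuracy` and `cost` fields, as the sketch below shows on made-up numbers. `summarize` is a hypothetical helper, and the composite RouterArena score is deliberately left out because its formula is not given in this diff.

```python
def summarize(entries):
    """Aggregate per-query accuracy and cost into router-level metrics."""
    count = len(entries)
    total_cost = sum(entry["cost"] for entry in entries)
    return {
        "average_accuracy": sum(entry["accuracy"] for entry in entries) / count,
        "total_cost": total_cost,
        # Mean cost per query, scaled to 1000 queries.
        "cost_per_1k_queries": total_cost / count * 1000,
    }


# Illustrative values only, not real evaluation results.
entries = [
    {"accuracy": 1.0, "cost": 0.001},
    {"accuracy": 0.0, "cost": 0.002},
    {"accuracy": 1.0, "cost": 0.001},
    {"accuracy": 1.0, "cost": 0.004},
]
summary = summarize(entries)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```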
+
+**Note**: Scores computed on the `sub_10` split are for testing purposes. To submit your router for the official leaderboard, you need to:
+1. Generate predictions and run evaluation for the `full` dataset
+2. Contact us at yifan.lu@rice.edu or jxing@rice.edu, or submit a GitHub issue with your results
+
+The leaderboard rankings are based on RouterArena scores computed on the full dataset.
 
 ## Citation:
 If you find our project helpful, please give us a star and cite us by: