Commit c77814f

Revert "Merge research-pipeline: better router evaluation pipeline and readme.md "
1 parent 2166456 commit c77814f

15 files changed

Lines changed: 208 additions & 194 deletions

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ repos:
     hooks:
       - id: mypy-local
         name: Run mypy for local Python installation
-        entry: tools/mypy.sh 1 "local"
+        entry: tools/mypy.sh 0 "local"
         language: python
         types: [python]
         pass_filenames: false

README.md

Lines changed: 123 additions & 34 deletions
@@ -43,21 +43,19 @@ The current leaderboard is computed considering the accuracy and overall cost fo

 If you want your router on the leaderboard, please contact us via email at yifan.lu@rice.edu or jxing@rice.edu, or submit a GitHub issue. For fairness, we have withheld the ground truth answers for the full dataset. However, you can still test your router using the sub-sampled 10% dataset by following the steps below.

-## Setup
+## Usage

-### Step 1: Install uv and RouterArena
+### Step 1: Install uv (if you don't have it)

 ```bash
 curl -LsSf https://astral.sh/uv/install.sh | sh
-cd RouterArena
-uv sync
 ```

-### Step 2: Download Dataset
-Run this command to download the dataset from the [HF dataset](https://huggingface.co/datasets/RouteWorks/RouterArena).
+### Step 2: Install RouterArena

 ```bash
-uv run python ./scripts/process_datasets/prep_datasets.py
+cd RouterArena
+uv sync
 ```

 ### Step 3: Set Up API Keys
@@ -66,20 +64,24 @@ This step is **required only if you plan to use our pipeline to make LLM inferen

 ```bash
 # Example .env file
-OPENAI_API_KEY=<Your-Key>
-ANTHROPIC_API_KEY=<Your-Key>
-HF_TOKEN=<Your-Key>
-# ...
+OPENAI_API_KEY=your_openai_key_here
+ANTHROPIC_API_KEY=your_anthropic_key_here
+GOOGLE_API_KEY=your_google_key_here
+MISTRAL_API_KEY=your_mistral_key_here
+HF_TOKEN=your_huggingface_token_here
+# ... add other keys as needed
 ```

-#### Optional:
 See the `ModelInference` class in `RouterArena/llm_inference/model_inference.py` for the complete list of supported providers and required environment variables. You can extend that class to support additional models, or submit a GitHub issue to request support for new providers.

-## Usage
+### Step 4: Download Dataset
+Run this command to download the dataset from the [HF dataset](https://huggingface.co/datasets/RouteWorks/RouterArena).

-Follow the steps below to evaluate your router. You can start with the `sub_10` split (10% sub-sampled dataset) to test your setup and code. The `sub_10` split includes ground truth answers for local testing. Once ready, you can evaluate on the `full` dataset for official leaderboard submission.
+```bash
+uv run python ./scripts/process_datasets/prep_datasets.py
+```

-### Step 1: Prepare Config File
+### Step 5: Prepare Config File and Model Costs

 Create a config file in `./router_inference/config/<router_name>.json`. We have created an example router for demonstration purposes:

@@ -99,64 +101,151 @@ Create a config file in `./router_inference/config/<router_name>.json`. We have

 *Note: The model name must be the same as the one used in `./universal_model_names.py` (see next step for details)*

-**Important**: For each model in your config, add an entry with the pricing per million tokens in this format:
+**Important**: You also need to add cost information for each model (same model naming requirement as above) in `./model_cost/cost.json`. For each model in your config, add an entry with the pricing per million tokens:

 ```json
 {
   "gpt-4o-mini": {
     "input_token_price_per_million": 0.15,
     "output_token_price_per_million": 0.6
   },
+  "claude-3-haiku-20240307": {
+    "input_token_price_per_million": 0.25,
+    "output_token_price_per_million": 1.25
+  },
+  "gemini-2.0-flash-001": {
+    "input_token_price_per_million": 0.1,
+    "output_token_price_per_million": 0.4
+  },
+  "mistral-medium": {
+    "input_token_price_per_million": 2.7,
+    "output_token_price_per_million": 8.1
+  }
 }
 ```

-### Step 2: Verify Model Names
+### Step 6:
+API providers may use different names for the same model (e.g., `gpt-4o`, `openai/gpt-4o`). We manage this via `./universal_model_names.py`:
+- `universal_names`: canonical model names used in this project
+- `mapping`: maps external provider names to our canonical names
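The two structures described in the restored Step 6 can be pictured with a minimal sketch. Only the names `universal_names` and `mapping` come from the README; the sample entries and the helper `to_universal_name` are hypothetical and are not taken from `./universal_model_names.py`.

```python
# Hypothetical sample data; the real tables live in ./universal_model_names.py.
universal_names = {"gpt-4o", "gpt-4o-mini"}
mapping = {
    "openai/gpt-4o": "gpt-4o",
    "openai/gpt-4o-mini": "gpt-4o-mini",
}


def to_universal_name(name: str) -> str:
    """Resolve a provider-specific or canonical name to the canonical name."""
    if name in universal_names:
        return name  # already canonical
    if name in mapping:
        return mapping[name]  # provider alias -> canonical name
    raise KeyError(f"Unknown model name: {name!r}")
```

A lookup that raises on unknown names, rather than passing them through, makes a missing entry visible before any API call is made.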
114131

115-
Ensure all models in your config are listed in `./universal_model_names.py`. If you add a new model, you must also add the API inference endpoint in `RouterArena/llm_inference/model_inference.py`.
132+
Please make sure the model you used are listed here, or you have added it there (if you add a model, please make sure you add the API inference endpoint at `RouterArena/llm_inference/model_inference.py`).
116133

117-
### Step 3: Generate Router's Prediction File
134+
### Step 7: Generate Router's Prediction File
118135

119-
Generate a template prediction file:
136+
You need to create a prediction file that contains your router's model selections for each query. You can use the helper script to generate a template prediction file:
120137

121138
```bash
122-
uv run python ./router_inference/generate_prediction_file.py your-router sub_10
139+
uv run python ./router_inference/generate_prediction_file.py your-router 10
140+
```
141+
142+
This command generates a prediction file at `./router_inference/predictions/your-router.json` for the 10% split. Use `full` instead of `10` for the complete dataset.
143+
144+
**Important**: The generated file uses a **placeholder router** that simply cycles through models in the config file sequentially. You **must replace the model choices** in the `prediction` field with your router's actual selections. The script is only meant to provide a template structure with all required fields populated.
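As an illustration of what "cycles through models in the config file sequentially" means, here is a minimal sketch. The function name `placeholder_predictions` and the reduced entry shape are assumptions made for this example; this is not the code in `generate_prediction_file.py`.

```python
from itertools import cycle


def placeholder_predictions(prompts, models):
    """Assign models to prompts round-robin (hypothetical stand-in router)."""
    if not models:
        raise ValueError("models must not be empty")
    model_cycle = cycle(models)
    # Each prompt gets the next model in the list, wrapping around at the end.
    return [{"prompt": p, "prediction": next(model_cycle)} for p in prompts]
```

A real router would replace `next(model_cycle)` with its own per-prompt decision.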
+
+An example prediction file structure:
+
+```json
+[
+  {
+    "global index": "ArcMMLU_655",
+    "prompt": "Question text here...",
+    "prediction": "gpt-4o-mini", // Auto generated by the generate_prediction_file.py
+    "generated_result": null, // Will be filled after LLM inference
+    "cost": null, // Will be filled after evaluation
+    "accuracy": null // Will be filled after evaluation
+  }
+]
 ```

-Use `full` instead of `sub_10` for the complete dataset. **Important**: Replace the placeholder model choices in the `prediction` field with your router's actual selections.
+Alternatively, you can create the prediction file manually or integrate it into your router's inference pipeline. The `generated_result`, `cost`, and `accuracy` fields can be left as `null` initially—they will be populated by the LLM inference and evaluation in later steps.

-### Step 4: Validate Config and Prediction Files
+### Step 8: Sanity Check for Config and Prediction Files

-Validate your config and prediction files before proceeding:
+Before proceeding with LLM inference, it's recommended to validate your config and prediction files using our validation script:

 ```bash
-uv run python ./router_inference/check_config_prediction_files.py your-router sub_10
+uv run python ./router_inference/check_config_prediction_files.py your-router 10
 ```

-This script checks: (1) all model names are valid, (2) prediction file has correct size (809 for `sub_10`, 8400 for `full`), and (3) all entries have valid `global_index`, `prompt`, and `prediction` fields.
+This script performs the following checks:

-## Run LLM Inference
+1. **Config Validation**: Verifies that all model names in your config file are valid and can be found in `ModelNameManager`
+2. **Prediction File Size**: Ensures your prediction file has the correct number of entries (809 for 10% split, 8400 for full dataset)
+3. **Field Validation**: Validates that each prediction entry:
+   - Has a `global_index` that exists in the dataset
+   - Has a `prompt` that exactly matches the dataset
+   - Has a `prediction` (model selection) that is one of the models listed in your config
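The size and field checks listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the logic of `check_config_prediction_files.py`: the function name, the `dataset` shape (a dict from `global_index` to prompt), and the use of the key `global_index` (the README example shows `"global index"`) are all hypothetical.

```python
def check_predictions(predictions, dataset, config_models, expected_size):
    """Return a list of human-readable errors; an empty list means all checks passed.

    dataset: hypothetical dict mapping global_index -> prompt text.
    """
    errors = []
    if len(predictions) != expected_size:
        errors.append(f"expected {expected_size} entries, found {len(predictions)}")
    for i, entry in enumerate(predictions):
        idx = entry.get("global_index")
        if idx not in dataset:
            errors.append(f"entry {i}: unknown global_index {idx!r}")
        elif entry.get("prompt") != dataset[idx]:
            errors.append(f"entry {i}: prompt does not match the dataset")
        if entry.get("prediction") not in config_models:
            errors.append(f"entry {i}: prediction is not a model from the config")
    return errors
```

Collecting every error before reporting, instead of stopping at the first one, matches the README's description that the script lists the problems so they can be fixed in one pass.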

-Run the inference script to make API calls for each query using the selected models:
+If all checks pass, you'll see `✓ ALL CHECKS PASSED!` and can proceed to the next step. If there are errors, the script will list them so you can fix any issues before running LLM inference and evaluation.
+
+### Step 9: Run LLM Inference
+
+Once your prediction file is ready, run the LLM inference script to make API calls for each query using the selected models:

 ```bash
 uv run python ./llm_inference/run.py your-router
 ```

-The script loads your prediction file, makes API calls using the models specified in the `prediction` field, and saves results incrementally. It uses cached results when available and saves progress after each query, so you can safely interrupt and resume. Results are saved to `./cached_results/` for reuse across routers.
+This script will:
+1. **Load your prediction file** from `./router_inference/predictions/your-router.json`
+2. **Make API calls** for each query using the model specified in the `prediction` field
+3. **Use cached results** when available (if the same model has already processed the same query)
+4. **Save results incrementally** back to the prediction file, updating the `generated_result` field with:
+   - `generated_answer`: The model's response
+   - `success`: Whether the API call succeeded
+   - `token_usage`: Input/output token counts
+   - `provider`: The API provider used
+   - `error`: Any error message (if failed)
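The skip-and-save behaviour described in this list can be sketched as a small loop. Everything here is an assumption for illustration: `call_model` and `save` are hypothetical callables supplied by the caller, only three of the listed `generated_result` fields are filled in, and the actual `llm_inference/run.py` may be organised differently.

```python
def run_inference(entries, call_model, save):
    """Fill generated_result for entries that do not yet have a successful result.

    call_model(model, prompt) -> str and save(entries) are hypothetical hooks.
    """
    for entry in entries:
        result = entry.get("generated_result")
        if result and result.get("success"):
            continue  # already done, so a re-run skips this entry
        try:
            answer = call_model(entry["prediction"], entry["prompt"])
            entry["generated_result"] = {
                "generated_answer": answer,
                "success": True,
                "error": None,
            }
        except Exception as exc:  # record the failure instead of aborting the run
            entry["generated_result"] = {
                "generated_answer": None,
                "success": False,
                "error": str(exc),
            }
        save(entries)  # persist after every query so the run can be resumed
    return entries
```

Saving after each entry trades some I/O for the ability to interrupt the run at any point without losing completed calls.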

-**Note**: Requires valid API keys (see Setup Step 3). The script skips entries that already have successful results.
+The script automatically saves progress after each query, so you can safely interrupt and resume later. Results are also saved to `./cached_results/` for reuse across different routers.

-## LLM Evaluation and Compute RouterArena Score
+**Note**: This step requires valid API keys (see Step 3) for the models you're using. The script will skip entries that already have successful results, making it safe to re-run.

-**Important**: For the `sub_10` split (testing), you can run evaluation locally and get RouterArena scores. For the `full` dataset (official leaderboard), ground truth answers are not available locally. After running LLM inference on the `full` dataset, submit your prediction file via GitHub issue or contact us at yifan.lu@rice.edu or jxing@rice.edu for official evaluation.
+### Step 10: Run LLM Evaluation

-For local evaluation on the `sub_10` split, run the evaluation script:
+After LLM inference is complete, evaluate the generated answers to compute accuracy and cost metrics:

 ```bash
 uv run python ./llm_evaluation/run.py your-router sub_10
 ```

-The script evaluates generated answers against ground truth, calculates inference costs, and computes router-level metrics including the RouterArena score (ranging 0-1). It skips already-evaluated entries, making it safe to re-run or resume.
+This script will:
+1. **Load your prediction file** from `./router_inference/predictions/your-router.json`
+2. **Determine the dataset** for each query based on its `global_index` (e.g., "AIME_112" → AIME dataset)
+3. **Evaluate each generated answer** against the ground truth using dataset-specific metrics:
+   - Math problems (AIME, MATH, etc.) → `math_metric`
+   - Multiple-choice questions (MMLUPro, ArcMMLU, etc.) → `mcq_accuracy`
+   - Code problems (LiveCodeBench) → `code_accuracy`
+   - And other specialized metrics as needed
+4. **Calculate inference cost** based on token usage and model pricing from `./model_cost/cost.json`
+5. **Save results incrementally** to the prediction file, updating:
+   - `accuracy`: Evaluation score (0.0 to 1.0)
+   - `cost`: Inference cost in dollars
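Two of the steps above reduce to simple arithmetic and can be sketched directly. The helper names are hypothetical, and the split-on-last-underscore rule is an assumption inferred from the "AIME_112" → AIME example; the price keys are the ones shown in the `cost.json` example.

```python
def dataset_of(global_index: str) -> str:
    """Derive the dataset name, assuming the index is '<dataset>_<number>'."""
    return global_index.rsplit("_", 1)[0]


def inference_cost(input_tokens: int, output_tokens: int, price: dict) -> float:
    """Cost in dollars given per-million-token prices for one model."""
    return (
        input_tokens * price["input_token_price_per_million"]
        + output_tokens * price["output_token_price_per_million"]
    ) / 1_000_000
```

For instance, 1,000 input tokens and 500 output tokens at the `gpt-4o-mini` prices above (0.15 and 0.6 per million) come to (150 + 300) / 1,000,000 = 0.00045 dollars.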
+
+The script uses the `sub_10` split for testing (with ground truth answers available locally). For the full dataset evaluation, use `full` instead, but note that ground truth answers are not available locally—you'll need to submit your predictions via GitHub Issue for official evaluation.
+
+The script automatically skips entries that are already evaluated, making it safe to re-run or resume after interruption.
+
+### Step 11: Compute RouterArena Score
+
+After evaluation is complete, compute your router's RouterArena score:
+
+```bash
+uv run python ./router_evaluation/compute_scores.py your-router
+```
+
+This script calculates:
+1. **Average Accuracy**: The mean accuracy across all evaluated queries
+2. **Total Cost**: The sum of all inference costs
+3. **Average Cost per 1K Queries**: Total cost normalized to 1000 queries
+4. **RouterArena Score**: A composite score that balances accuracy and cost. It ranges from 0 to 1, with higher scores indicating better trade-offs between accuracy and cost efficiency.
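The first three quantities follow directly from their definitions and can be sketched as below. The function name and return keys are hypothetical. The RouterArena score is deliberately left out, since its formula does not appear in this README text.

```python
def summarize(entries):
    """Aggregate per-query accuracy and cost (hypothetical helper).

    Each entry is expected to carry numeric 'accuracy' and 'cost' fields.
    """
    n = len(entries)
    if n == 0:
        raise ValueError("no evaluated entries")
    total_cost = sum(e["cost"] for e in entries)
    return {
        "avg_accuracy": sum(e["accuracy"] for e in entries) / n,
        "total_cost": total_cost,
        # Average cost per query, scaled up to 1000 queries.
        "cost_per_1k_queries": total_cost / n * 1000,
    }
```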
+
+**Note**: Scores computed on the `sub_10` split are for testing purposes. To submit your router for the official leaderboard, you need to:
+1. Generate predictions and run evaluation for the `full` dataset
+2. Contact us at yifan.lu@rice.edu or jxing@rice.edu, or submit a GitHub issue with your results
+
+The leaderboard rankings are based on RouterArena scores computed on the full dataset.

 ## Citation:
 If you find our project helpful, please give us a star and cite us by:

llm_evaluation/eval_reasoning.py

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
     superglue_exact_match,
     superglue_clozetest,
 )
-from datasets import load_from_disk # type: ignore[import-not-found,import-untyped]
+from datasets import load_from_disk

 # Dataset to metric mapping
 dataset2metric = {

llm_evaluation/evaluate_models.py

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
 import glob
 from typing import Dict, List, Any, Optional
 import sys
-from tqdm import tqdm # type: ignore[import-untyped]
+from tqdm import tqdm

 # Add the current directory to Python path to import eval modules
 sys.path.append(os.path.dirname(os.path.abspath(__file__)))

llm_evaluation/livecodebench_util.py

Lines changed: 16 additions & 43 deletions
@@ -488,44 +488,20 @@ def reliability_guard(maximum_memory_bytes: Optional[int] = None):

     import builtins

-    from typing import Any, cast
-
-    builtins.exit = cast(Any, None) # type: ignore[assignment]
-    builtins.quit = cast(Any, None) # type: ignore[assignment]
+    from typing import Any

     # Prepare Any-typed aliases to avoid mypy assignment errors
+    builtins_mod: Any = builtins
     os_mod: Any = os
+    shutil_mod: Any = shutil
     subprocess_mod: Any = subprocess
+    modules_any: Any = sys.modules

     os.environ["OMP_NUM_THREADS"] = "1"

-    os.kill = cast(Any, None) # type: ignore[assignment]
-    os.system = cast(Any, None) # type: ignore[assignment]
-    os.putenv = cast(Any, None) # type: ignore[assignment]
-    os.remove = cast(Any, None) # type: ignore[assignment]
-    os.removedirs = cast(Any, None) # type: ignore[assignment]
-    os.rmdir = cast(Any, None) # type: ignore[assignment]
-    os.fchdir = cast(Any, None) # type: ignore[assignment]
-    os.setuid = cast(Any, None) # type: ignore[assignment]
-    os.fork = cast(Any, None) # type: ignore[assignment]
-    os.forkpty = cast(Any, None) # type: ignore[assignment]
-    os.killpg = cast(Any, None) # type: ignore[assignment]
-    os.rename = cast(Any, None) # type: ignore[assignment]
-    os.renames = cast(Any, None) # type: ignore[assignment]
-    os.truncate = cast(Any, None) # type: ignore[assignment]
-    os.replace = cast(Any, None) # type: ignore[assignment]
-    os.unlink = cast(Any, None) # type: ignore[assignment]
-    os.fchmod = cast(Any, None) # type: ignore[assignment]
-    os.fchown = cast(Any, None) # type: ignore[assignment]
-    os.chmod = cast(Any, None) # type: ignore[assignment]
-    os.chown = cast(Any, None) # type: ignore[assignment]
-    os.chroot = cast(Any, None) # type: ignore[assignment]
-    os.fchdir = cast(Any, None) # type: ignore[assignment]
-    os.lchflags = cast(Any, None) # type: ignore[attr-defined,assignment]
-    os.lchmod = cast(Any, None) # type: ignore[attr-defined,assignment]
-    os.lchown = cast(Any, None) # type: ignore[assignment]
-    os.getcwd = cast(Any, None) # type: ignore[assignment]
-    os.chdir = cast(Any, None) # type: ignore[assignment]
+    # Disable selected builtins
+    setattr(builtins_mod, "exit", None)
+    setattr(builtins_mod, "quit", None)

     # Disable destructive os functions (guard where platform-specific)
     for name in [
@@ -561,22 +537,19 @@ def reliability_guard(maximum_memory_bytes: Optional[int] = None):
         except Exception:
             pass

-    shutil.rmtree = cast(Any, None) # type: ignore[assignment]
-    shutil.move = cast(Any, None) # type: ignore[assignment]
-    shutil.chown = cast(Any, None) # type: ignore[assignment]
+    # Disable dangerous shutil functions
+    for name in ["rmtree", "move", "chown"]:
+        try:
+            setattr(shutil_mod, name, None)
+        except Exception:
+            pass

     # Disable subprocess.Popen
     setattr(subprocess_mod, "Popen", None)

-    setattr(subprocess, "Popen", cast(Any, None)) # type: ignore[misc]
-
-    # __builtins__["help"] = None # this line is commented out as it results into error
-
-    sys.modules["ipdb"] = None # type: ignore[assignment]
-    sys.modules["joblib"] = None # type: ignore[assignment]
-    sys.modules["resource"] = None # type: ignore[assignment]
-    sys.modules["psutil"] = None # type: ignore[assignment]
-    sys.modules["tkinter"] = None # type: ignore[assignment]
+    # Hide selected modules
+    for name in ["ipdb", "joblib", "resource", "psutil", "tkinter"]:
+        modules_any[name] = None


 def save_original_references():
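The pattern this hunk restores (an `Any`-typed alias plus `setattr` in a loop, and `None` entries in `sys.modules`) can be tried in isolation with the minimal sketch below. It deliberately patches a throwaway namespace and a made-up module name instead of real `os` or `shutil` functions; `disable_attrs`, `hide_modules`, `fake_shutil`, and `routerarena_hypothetical_blocked_module` are hypothetical names for this example only.

```python
import sys
import types
from typing import Any


def disable_attrs(module: Any, names):
    """Set each named attribute to None; the Any-typed parameter avoids
    static type-checker complaints about assigning None to a function."""
    for name in names:
        try:
            setattr(module, name, None)
        except Exception:
            pass  # attribute could not be set; skip it


def hide_modules(names):
    """Make `import <name>` fail by storing None in sys.modules."""
    modules_any: Any = sys.modules
    for name in names:
        modules_any[name] = None


# Demonstration on a stand-in object, not on the real shutil module.
fake_shutil = types.SimpleNamespace(rmtree=print, move=print)
disable_attrs(fake_shutil, ["rmtree", "move", "chown"])

blocked_name = "routerarena_hypothetical_blocked_module"
hide_modules([blocked_name])
try:
    import routerarena_hypothetical_blocked_module  # noqa: F401
    blocked = False
except ImportError:
    blocked = True  # a None entry in sys.modules halts the import
finally:
    sys.modules.pop(blocked_name, None)  # undo the demo entry
```

Compared with one `# type: ignore` comment per assignment, the loop keeps the list of disabled names in one place, at the cost of static checking no longer seeing which attributes are being overwritten.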

llm_evaluation/metric_utils.py

Lines changed: 5 additions & 5 deletions
@@ -2,15 +2,15 @@
 # SPDX-License-Identifier: Apache-2.0

 import re
-import regex # type: ignore[import-untyped]
+import regex
 from math import isclose

 from typing import Any, Optional, List

-from latex2sympy2 import latex2sympy # type: ignore[import-not-found,import-untyped]
-from sympy import N, simplify # type: ignore[import-not-found,import-untyped]
-from sympy.parsing.latex import parse_latex # type: ignore[import-not-found,import-untyped]
-from sympy.parsing.sympy_parser import parse_expr # type: ignore[import-not-found,import-untyped]
+from latex2sympy2 import latex2sympy
+from sympy import N, simplify
+from sympy.parsing.latex import parse_latex
+from sympy.parsing.sympy_parser import parse_expr


 def choice_answer_clean(pred: str) -> str:

llm_evaluation/metrics.py

Lines changed: 4 additions & 4 deletions
@@ -6,13 +6,13 @@
 import json
 import copy

-import jieba # type: ignore[import-not-found,import-untyped]
-from fuzzywuzzy import fuzz # type: ignore[import-not-found,import-untyped]
+import jieba
+from fuzzywuzzy import fuzz
 import difflib

 from collections import Counter
-from rouge import Rouge # type: ignore[import-not-found,import-untyped]
-import regex # type: ignore[import-untyped]
+from rouge import Rouge
+import regex

 from metric_utils import (
     choice_answer_clean,
