Commit c54eb68

Merge remote-tracking branch 'origin/main' into feature/integrate-gpqa-dataset
2 parents 1537d80 + 951eddf commit c54eb68

19 files changed

Lines changed: 195003 additions & 563 deletions

README.md

Lines changed: 11 additions & 9 deletions
@@ -34,11 +34,11 @@ For more details, please see our [website](https://routeworks.github.io/leaderbo

 | Rank | Router | Affiliation | Acc-Cost Arena | Accuracy | Cost/1K Queries | Optimal Selection | Optimal Cost | Optimal Accuracy | Latency | Robustness |
 |------|--------------------|-----------------------------|--------|----------|---------|-----------------|--------------|----------------|---------|------------|
-| 🥇 | [MIRT‑BERT](https://arxiv.org/pdf/2506.01048) [[Code]](https://github.com/Mercidaiha/IRT-Router) | 🎓 USTC | 66.89 | 66.88 | $0.15 | 3.44 | 19.62 | 78.18 | 27.03 | 61.19 |
-| 🥈 | [Azure‑Router](https://ai.azure.com/catalog/models/model-router) [[Web]](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/model-router) | 💼 Microsoft | 66.66 | 68.09 | $0.54 | 22.52 | 46.32 | 81.96 | | 54.07 |
-| 🥉 | [NIRT‑BERT](https://arxiv.org/pdf/2506.01048) [[Code]](https://github.com/Mercidaiha/IRT-Router) | 🎓 USTC | 66.12 | 66.34 | $0.21 | 3.83 | 14.04 | 77.88 | 10.42 | 49.29 |
-| 4 | [GPT‑5](https://openai.com/index/introducing-gpt-5/)| 💼 OpenAI | 64.32 | 73.96 | $10.02 | | | | | |
-| 5 | [vLLM‑SR](https://vllm-semantic-router.com/) [[Code]](https://github.com/vllm-project/semantic-router) [[HF]](https://huggingface.co/llm-semantic-router) | 🎓 vLLM SR Team | 64.32 | 67.28 | $1.67 | 4.79 | 12.54 | 79.33 | 0.19 | 35.00 |
+| 🥇 | [vLLM‑SR](https://vllm-semantic-router.com/) [[Code]](https://github.com/vllm-project/semantic-router) [[HF]](https://huggingface.co/llm-semantic-router) | 🎓 vLLM SR Team | 67.23 | 66.53 | $0.06 | 94.10 | 90.12 | 100.00 | | 90.95 |
+| 🥈 | [MIRT‑BERT](https://arxiv.org/pdf/2506.01048) [[Code]](https://github.com/Mercidaiha/IRT-Router) | 🎓 USTC | 66.89 | 66.88 | $0.15 | 3.44 | 19.62 | 78.18 | 27.03 | 61.19 |
+| 🥉 | [Azure‑Router](https://ai.azure.com/catalog/models/model-router) [[Web]](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/model-router) | 💼 Microsoft | 66.66 | 68.09 | $0.54 | 22.52 | 46.32 | 81.96 | | 54.07 |
+| 4 | [NIRT‑BERT](https://arxiv.org/pdf/2506.01048) [[Code]](https://github.com/Mercidaiha/IRT-Router) | 🎓 USTC | 66.12 | 66.34 | $0.21 | 3.83 | 14.04 | 77.88 | 10.42 | 49.29 |
+| 5 | [GPT‑5](https://openai.com/index/introducing-gpt-5/)| 💼 OpenAI | 64.32 | 73.96 | $10.02 | | | | | |
 | 6 | [CARROT](https://arxiv.org/abs/2502.03261) [[Code]](https://github.com/somerstep/CARROT) [[HF]](https://huggingface.co/CARROT-LLM-Routing) | 🎓 UMich | 63.87 | 67.21 | $2.06 | 2.68 | 6.77 | 78.63 | 1.50 | 89.05 |
 | 7 | [Chayan](https://huggingface.co/adaptive-classifier/chayan) [[HF]](https://huggingface.co/adaptive-classifier/chayan) | 🎓 Adaptive Classifier | 63.83 | 64.89 | $0.56 | 43.03 | 43.75 | 88.74 |||
 | 8 | [RouterBench‑MLP](https://arxiv.org/pdf/2403.12031) [[Code]](https://github.com/withmartian/routerbench) [[HF]](https://huggingface.co/datasets/withmartian/routerbench) | 🎓 Martian | 57.56 | 61.62 | $4.83 | 13.39 | 24.45 | 83.32 | 90.91 | 80.00 |
@@ -102,6 +102,7 @@ Create a config file in `./router_inference/config/<router_name>.json`. An examp
 {
   "pipeline_params": {
     "router_name": "your-router",
+    "router_cls_name": "your_router_class_name",
     "models": [
       "gpt-4o-mini",
       "claude-3-haiku-20240307",
@@ -111,7 +112,7 @@ Create a config file in `./router_inference/config/<router_name>.json`. An examp
 }
 ```

-For each model in your config, add an entry with the pricing per million tokens in this format at [`model_cost/cost.json`](./model_cost/cost.json):
+For each model in your config, add an entry with the pricing per million tokens in this format at [`model_cost/model_cost.json`](./model_cost/model_cost.json):

 ```json
 {
@@ -129,12 +130,13 @@ For each model in your config, add an entry with the pricing per million tokens

 Create your own router class by inheriting from `BaseRouter` and implementing the `_get_prediction()` method. See [`router_inference/router/example_router.py`](./router_inference/router/example_router.py) for a complete example.

-Then, modify [`router_inference/generate_prediction_file.py`](./router_inference/generate_prediction_file.py#L150) to use your router class:
+Then, modify [`router_inference/router/__init__.py`](./router_inference/router/__init__.py) to include your router class:

 ```python
-# Replace ExampleRouter with your router class
+# Import your router class
 from router_inference.router.my_router import MyRouter
-router = MyRouter(args.router_name)
+
+__all__ = ["BaseRouter", "ExampleRouter", "MyRouter"]
 ```
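Reviewer note: this README hunk adds a `router_cls_name` config key and moves registration into `router/__init__.py`, but the diff does not show how the name is resolved to a class. The sketch below is one plausible way to do it, not the repository's implementation. `BaseRouter` here is a local stand-in whose constructor and `_get_prediction()` signature are assumptions, and `ROUTER_CLASSES` and `build_router` are hypothetical names.

```python
from abc import ABC, abstractmethod


class BaseRouter(ABC):
    """Local stand-in for the repository's BaseRouter (real interface may differ)."""

    def __init__(self, router_name: str, models: list[str]):
        self.router_name = router_name
        self.models = models

    @abstractmethod
    def _get_prediction(self, query: str) -> str:
        """Return the name of the model chosen for `query`."""


class MyRouter(BaseRouter):
    def _get_prediction(self, query: str) -> str:
        # Toy rule (assumption): short queries go to the first model,
        # longer ones to the last model in the configured list.
        return self.models[0] if len(query) < 80 else self.models[-1]


# Hypothetical registry playing the role of the names exported via __all__.
ROUTER_CLASSES = {"MyRouter": MyRouter}


def build_router(pipeline_params: dict) -> BaseRouter:
    """Instantiate the router class named by `router_cls_name`."""
    cls_name = pipeline_params["router_cls_name"]
    if cls_name not in ROUTER_CLASSES:
        raise ValueError(f"Unknown router class: {cls_name}")
    return ROUTER_CLASSES[cls_name](
        pipeline_params["router_name"], pipeline_params["models"]
    )


params = {
    "router_name": "your-router",
    "router_cls_name": "MyRouter",
    "models": ["gpt-4o-mini", "claude-3-haiku-20240307"],
}
router = build_router(params)
print(router._get_prediction("What is 2 + 2?"))  # gpt-4o-mini
```

Looking the class up by name keeps the prediction script free of router-specific imports, which appears to be the motivation for replacing the edit to `generate_prediction_file.py`.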

 Finally, generate the prediction file:

config/eval_config/zero-shot/GPQA.json

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@
       "output_config": "output_config.json"
     }
   }
-}
+}

llm_evaluation/batch_evaluate.py

Lines changed: 133 additions & 37 deletions
@@ -4,18 +4,23 @@
 """
 Batch Model Evaluation Script

-This script runs evaluation for multiple models in parallel using the universal model names
-from universal_model_names.py. It can process up to 16 models concurrently for efficiency.
+This script runs evaluation for multiple models sequentially.
+It processes models one by one, but each model evaluation uses
+query-level parallelism for efficiency.
+
+The script evaluates all models that:
+1. Have a cached results file (.jsonl) in the cached_results directory
+2. Are listed in model_cost.json (for cost calculation)

 Usage:
-    python batch_evaluate.py [--cached-results-dir CACHED_RESULTS_DIR] [--max-workers MAX_WORKERS] [--models MODEL1 MODEL2 ...]
+    python batch_evaluate.py [--cached-results-dir CACHED_RESULTS_DIR] [--num-workers NUM_WORKERS] [--model-cost-path PATH]
 """

 import os
 import sys
 import argparse
 import subprocess
-import concurrent.futures
+import json
 from typing import List, Optional
 import time

@@ -28,9 +33,48 @@
     print(f"Loaded {len(universal_names)} universal model names")
 except ImportError:
     print(
-        "Error: Could not import universal_model_names. Make sure the file exists in the parent directory."
+        "Warning: Could not import universal_model_names. Model name validation may be limited."
     )
-    sys.exit(1)
+    universal_names = []
+
+
+def load_models_from_cost_config(
+    cost_config_path: Optional[str], project_root: str
+) -> List[str]:
+    """
+    Load list of model names from model_cost.json.
+
+    Args:
+        cost_config_path: Path to model_cost.json (can be relative or absolute).
+            If None or empty, constructs path as project_root/model_cost/model_cost.json
+        project_root: Path to project root directory
+
+    Returns:
+        List of model names (keys from model_cost.json)
+    """
+    # If no path provided, use default location in project root
+    if not cost_config_path:
+        cost_config_path = os.path.join(project_root, "model_cost", "model_cost.json")
+    elif not os.path.isabs(cost_config_path):
+        # If relative path, make it relative to project root
+        cost_config_path = os.path.join(project_root, cost_config_path)
+
+    if not os.path.exists(cost_config_path):
+        print(f"Error: model_cost.json not found at: {cost_config_path}")
+        return []
+
+    try:
+        with open(cost_config_path, "r", encoding="utf-8") as f:
+            cost_config = json.load(f)
+
+        models = list(cost_config.keys())
+        print(f"Loaded {len(models)} models from {cost_config_path}")
+        return models
+    except (json.JSONDecodeError, IOError) as e:
+        print(
+            f"Error: Could not load or parse cost configuration from {cost_config_path}: {e}"
+        )
+        return []


 def check_cached_results_exist(cached_results_dir: str, model_name: str) -> bool:
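Reviewer note: the new docstring says a model is evaluated only if it is listed in `model_cost.json` and has a cached `.jsonl` file. The bodies of `check_cached_results_exist` and `get_available_models` are outside this diff, so the following is only a minimal sketch of that selection rule. The `<model_name>.jsonl` file naming and the helper name `select_models` are assumptions, and the pricing entries are left empty because only the keys matter here.

```python
import json
import os
import tempfile


def select_models(cost_config_path: str, cached_results_dir: str) -> list[str]:
    """Return models listed in the cost config that also have cached results."""
    with open(cost_config_path, "r", encoding="utf-8") as f:
        listed = list(json.load(f).keys())
    # Assumption: cached results live in one "<model_name>.jsonl" file per model.
    return [
        name
        for name in listed
        if os.path.exists(os.path.join(cached_results_dir, f"{name}.jsonl"))
    ]


with tempfile.TemporaryDirectory() as root:
    cost_path = os.path.join(root, "model_cost.json")
    with open(cost_path, "w", encoding="utf-8") as f:
        # Placeholder entries; the real file holds per-model pricing.
        json.dump({"gpt-4o-mini": {}, "claude-3-haiku-20240307": {}}, f)

    cached_dir = os.path.join(root, "cached_results")
    os.makedirs(cached_dir)
    # Only one of the two listed models has cached results.
    open(os.path.join(cached_dir, "gpt-4o-mini.jsonl"), "w").close()

    selected = select_models(cost_path, cached_dir)

print(selected)  # ['gpt-4o-mini']
```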
@@ -69,7 +113,7 @@ def get_available_models(


 def run_evaluation(
-    model_name: str, cached_results_dir: str, rerun: bool = False
+    model_name: str, cached_results_dir: str, num_workers: int = 16, rerun: bool = False
 ) -> dict:
     """Run evaluation for a single model."""
     start_time = time.time()
@@ -86,6 +130,8 @@ def run_evaluation(
             model_name,
             "--cached-results-dir",
             cached_results_dir,
+            "--num-workers",
+            str(num_workers),
         ]

         # Add rerun flag if specified
@@ -94,35 +140,73 @@ def run_evaluation(

         print(f"🔄 Starting evaluation for {model_name}")

-        result = subprocess.run(
+        # Run with real-time output streaming
+        # Use Popen to stream output while still being able to capture it for error reporting
+        process = subprocess.Popen(
             cmd,
             stdout=subprocess.PIPE,
-            stderr=subprocess.PIPE,
+            stderr=subprocess.STDOUT,  # Merge stderr into stdout
             universal_newlines=True,
-            timeout=43200,  # 12 hours timeout per model
+            bufsize=1,  # Line buffered
         )

+        # Stream output in real time and collect it
+        stdout_lines: list[str] = []
+        try:
+            import sys as sys_module
+
+            # Read lines and print in real time
+            stdout_stream = process.stdout
+            if stdout_stream is None:
+                raise RuntimeError("stdout is None despite PIPE")
+            while True:
+                line = stdout_stream.readline()
+                if not line:
+                    if process.poll() is not None:
+                        break  # Process finished
+                    continue
+                sys_module.stdout.write(line)
+                sys_module.stdout.flush()
+                stdout_lines.append(line)
+
+            # The loop exits once readline() returns an empty string and the process has finished
+            # Wait to ensure it's fully terminated (12-hour timeout)
+            process.wait(timeout=43200)
+
+        except (KeyboardInterrupt, Exception) as e:
+            if process.poll() is None:
+                process.kill()
+                process.wait()
+            if isinstance(e, KeyboardInterrupt):
+                raise
+            # Continue to handle as error below
+
+        stdout_text = "".join(stdout_lines)
         duration = time.time() - start_time

-        if result.returncode == 0:
+        if process.returncode == 0:
             print(f"✅ Completed {model_name} in {duration:.1f}s")
             return {
                 "model_name": model_name,
                 "status": "success",
                 "duration": duration,
-                "stdout": result.stdout,
-                "stderr": result.stderr,
+                "stdout": stdout_text,
+                "stderr": "",  # Merged into stdout
             }
         else:
             print(f"❌ Failed {model_name} after {duration:.1f}s")
-            print(f" Error: {result.stderr.strip()}")
+            # Extract error from last few lines if available
+            error_msg = (
+                "\n".join(stdout_lines[-10:]) if stdout_lines else "Unknown error"
+            )
+            print(f" Error: {error_msg.strip()}")
             return {
                 "model_name": model_name,
                 "status": "failed",
                 "duration": duration,
-                "stdout": result.stdout,
-                "stderr": result.stderr,
-                "return_code": result.returncode,
+                "stdout": stdout_text,
+                "stderr": "",  # Merged into stdout
+                "return_code": process.returncode,
             }

     except subprocess.TimeoutExpired:
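Reviewer note: the hunk above replaces `subprocess.run` with a `Popen` loop that echoes the child's output while collecting it. A minimal standalone sketch of the same pattern follows. The helper name `run_streaming`, the 60-second default timeout, and the inline child command are illustrative assumptions, not code from this commit.

```python
import subprocess
import sys


def run_streaming(cmd: list[str], timeout: float = 60.0) -> tuple[int, str]:
    """Run `cmd`, echo its merged stdout/stderr line by line, and return both."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the stdout pipe
        text=True,
        bufsize=1,  # line buffered in text mode
    )
    if process.stdout is None:
        raise RuntimeError("stdout is None despite PIPE")

    lines: list[str] = []
    try:
        # Iterating the pipe yields lines until EOF, i.e. until the child
        # closes its end of the pipe.
        for line in process.stdout:
            sys.stdout.write(line)
            sys.stdout.flush()
            lines.append(line)
        process.wait(timeout=timeout)
    except BaseException:
        # Do not leave an orphaned child behind on Ctrl-C or any other error.
        process.kill()
        process.wait()
        raise
    return process.returncode, "".join(lines)


child = (
    "import sys; print('to stdout'); sys.stdout.flush(); "
    "print('to stderr', file=sys.stderr)"
)
code, output = run_streaming([sys.executable, "-c", child])
```

One consequence of this pattern is that the timeout only bounds the final `wait()` after the pipe has closed. Reading from the pipe blocks without a deadline, so a child that hangs while keeping its output open is not interrupted after 12 hours the way the previous `subprocess.run(..., timeout=43200)` call would have been.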
@@ -156,27 +240,50 @@ def main():
         help="Directory containing cached results (default: ../cached_results/)",
     )
     parser.add_argument(
-        "--max-workers",
+        "--num-workers",
         type=int,
         default=16,
-        help="Maximum number of parallel evaluations (default: 16)",
+        help="Number of parallel workers for query-level evaluation (default: 16)",
     )
     parser.add_argument(
         "--rerun",
         action="store_true",
         help="Force re-evaluation of all entries, even if already evaluated",
     )
+    parser.add_argument(
+        "--model-cost-path",
+        type=str,
+        default=None,
+        help="Path to model_cost.json file (default: {project_root}/model_cost/model_cost.json). Can be absolute or relative to project root.",
+    )

     args = parser.parse_args()

+    # --num-workers replaces the former --max-workers flag: it now sets the
+    # query-level parallelism inside each model evaluation.
+    # Models themselves are processed sequentially.
+    num_workers = args.num_workers
+
+    # Get project root directory (parent of llm_evaluation/)
+    script_dir = os.path.dirname(os.path.abspath(__file__))
+    project_root = os.path.dirname(script_dir)
+
     # Validate cached results directory
     if not os.path.exists(args.cached_results_dir):
         print(
             f"Error: Cached results directory does not exist: {args.cached_results_dir}"
         )
         return 1

-    model_list = universal_names
+    # Load models from model_cost.json instead of universal_names
+    # Default path is project_root/model_cost/model_cost.json
+    model_list = load_models_from_cost_config(args.model_cost_path, project_root)
+
+    if not model_list:
+        print(
+            "Error: No models found in model_cost.json. Cannot proceed with evaluation."
+        )
+        return 1

     # Get available models (those with cached results)
     available_models = get_available_models(args.cached_results_dir, model_list)
@@ -186,29 +293,18 @@ def main():
         return 1

     print(f"\n🚀 Starting batch evaluation of {len(available_models)} models")
-    print(f"📊 Using {min(args.max_workers, len(available_models))} parallel workers")
+    print(f"📊 Using {num_workers} parallel workers per model (sequential models)")
     print(f"📁 Cached results directory: {args.cached_results_dir}")
     print("=" * 80)

-    # Run evaluations in parallel
+    # Run evaluations sequentially (one model at a time)
+    # Each model evaluation will internally use query-level parallelism
     start_time = time.time()
     results = []

-    with concurrent.futures.ThreadPoolExecutor(
-        max_workers=args.max_workers
-    ) as executor:
-        # Submit all tasks
-        future_to_model = {
-            executor.submit(
-                run_evaluation, model, args.cached_results_dir, args.rerun
-            ): model
-            for model in available_models
-        }
-
-        # Collect results as they complete
-        for future in concurrent.futures.as_completed(future_to_model):
-            result = future.result()
-            results.append(result)
+    for model in available_models:
+        result = run_evaluation(model, args.cached_results_dir, num_workers, args.rerun)
+        results.append(result)

     # Print final summary
     total_duration = time.time() - start_time
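Reviewer note: this commit moves the parallelism from the model level (one thread per model) to the query level (models in sequence, queries in parallel within each model). The per-model evaluation script is not part of the diff, so the sketch below only illustrates the shape of that arrangement. `evaluate_query` is a hypothetical stand-in scorer, and using a thread pool for the inner level is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_query(model: str, query: str) -> bool:
    # Hypothetical scorer standing in for grading one cached answer.
    return len(query) % 2 == 0


def evaluate_model(model: str, queries: list[str], num_workers: int = 16) -> float:
    """Score all queries for one model using query-level parallelism."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves input order, so results line up with queries.
        results = list(pool.map(lambda q: evaluate_query(model, q), queries))
    return sum(results) / len(results)


queries = ["ab", "abc", "abcd", "abcde"]
summary = {}
# Outer loop is sequential: only one model is evaluated at a time.
for model in ["model-a", "model-b"]:
    summary[model] = evaluate_model(model, queries, num_workers=4)

print(summary)  # {'model-a': 0.5, 'model-b': 0.5}
```

With this layout the total number of concurrent workers is bounded by `num_workers` regardless of how many models are evaluated, and the streamed output of one model is not interleaved with another's.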
