RouteWorks
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 123 additions & 34 deletions
diff --git a/llm_evaluation/eval_reasoning.py b/llm_evaluation/eval_reasoning.py
Lines changed: 1 addition & 1 deletion
diff --git a/llm_evaluation/evaluate_models.py b/llm_evaluation/evaluate_models.py
Lines changed: 1 addition & 1 deletion
diff --git a/llm_evaluation/livecodebench_util.py b/llm_evaluation/livecodebench_util.py
Lines changed: 16 additions & 43 deletions
diff --git a/llm_evaluation/metric_utils.py b/llm_evaluation/metric_utils.py
Lines changed: 5 additions & 5 deletions
diff --git a/llm_evaluation/metrics.py b/llm_evaluation/metrics.py
Lines changed: 4 additions & 4 deletions
@@ -70,7 +70,7 @@ repos:
   hooks:
   - id: mypy-local
     name: Run mypy for local Python installation
-    entry: tools/mypy.sh 1 "local"
+    entry: tools/mypy.sh 0 "local"
     language: python
     types: [python]
     pass_filenames: false
 
@@ -43,21 +43,19 @@ The current leaderboard is computed considering the accuracy and overall cost fo
 
 If you want your router on the leaderboard, please contact us via email at yifan.lu@rice.edu or jxing@rice.edu, or submit a GitHub issue. For fairness, we have withheld the ground truth answers for the full dataset. However, you can still test your router using the sub-sampled 10% dataset by following the steps below.
 
-## Setup
+## Usage
 
-### Step 1: Install uv and RouterArena
+### Step 1: Install uv (if you don't have it)
 
 ```bash
 curl -LsSf https://astral.sh/uv/install.sh | sh
-cd RouterArena
-uv sync
 ```
 
-### Step 2: Download Dataset
-Run this command to download the dataset from the [HF dataset](https://huggingface.co/datasets/RouteWorks/RouterArena).
+### Step 2: Install RouterArena
 
 ```bash
-uv run python ./scripts/process_datasets/prep_datasets.py
+cd RouterArena
+uv sync
 ```
 
 ### Step 3: Set Up API Keys
@@ -66,20 +64,24 @@ This step is **required only if you plan to use our pipeline to make LLM inferen
 
 ```bash
 # Example .env file
-OPENAI_API_KEY=<Your-Key>
-ANTHROPIC_API_KEY=<Your-Key>
-HF_TOKEN=<Your-Key>
-# ...
+OPENAI_API_KEY=your_openai_key_here
+ANTHROPIC_API_KEY=your_anthropic_key_here
+GOOGLE_API_KEY=your_google_key_here
+MISTRAL_API_KEY=your_mistral_key_here
+HF_TOKEN=your_huggingface_token_here
+# ... add other keys as needed
 ```
 
-#### Optional:
 See the `ModelInference` class in `RouterArena/llm_inference/model_inference.py` for the complete list of supported providers and required environment variables. You can extend that class to support additional models, or submit a GitHub issue to request support for new providers.
 
-## Usage
+### Step 4: Download Dataset
+Run this command to download the dataset from the [HF dataset](https://huggingface.co/datasets/RouteWorks/RouterArena).
 
-Follow the steps below to evaluate your router. You can start with the `sub_10` split (10% sub-sampled dataset) to test your setup and code. The `sub_10` split includes ground truth answers for local testing. Once ready, you can evaluate on the `full` dataset for official leaderboard submission.
+```bash
+uv run python ./scripts/process_datasets/prep_datasets.py
+```
 
-### Step 1: Prepare Config File
+### Step 5: Prepare Config File and Model Costs
 
 Create a config file in `./router_inference/config/<router_name>.json`. We have created an example router for demonstration purposes:
 
@@ -99,64 +101,151 @@ Create a config file in `./router_inference/config/<router_name>.json`. We have
 
 *Note: The model name must be the same as the one used in `./universal_model_names.py` (see next step for details)*
 
-**Important**: For each model in your config, add an entry with the pricing per million tokens in this format:
+**Important**: You also need to add cost information for each model (same model naming requirement as above) in `./model_cost/cost.json`. For each model in your config, add an entry with the pricing per million tokens:
 
 ```json
 {
   "gpt-4o-mini": {
     "input_token_price_per_million": 0.15,
     "output_token_price_per_million": 0.6
   },
+  "claude-3-haiku-20240307": {
+    "input_token_price_per_million": 0.25,
+    "output_token_price_per_million": 1.25
+  },
+  "gemini-2.0-flash-001": {
+    "input_token_price_per_million": 0.1,
+    "output_token_price_per_million": 0.4
+  },
+  "mistral-medium": {
+    "input_token_price_per_million": 2.7,
+    "output_token_price_per_million": 8.1
+  }
 }
 ```
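
As a rough illustration of how per-million-token prices like the ones above could turn into a per-query cost, here is a minimal Python sketch. It assumes cost is linear in input and output token counts; the function name `inference_cost` and the token counts are hypothetical, and the project's actual evaluation script may compute cost differently.

```python
import math

# Pricing entry copied from the cost.json example above.
PRICING = {
    "gpt-4o-mini": {
        "input_token_price_per_million": 0.15,
        "output_token_price_per_million": 0.6,
    },
}


def inference_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Hypothetical helper: dollars spent on one query, assuming linear pricing."""
    price = PRICING[model]
    total = (
        input_tokens * price["input_token_price_per_million"]
        + output_tokens * price["output_token_price_per_million"]
    )
    # Prices are quoted per one million tokens.
    return total / 1_000_000


# Illustrative token counts, not taken from the dataset.
example_cost = inference_cost("gpt-4o-mini", input_tokens=1000, output_tokens=500)
assert math.isclose(example_cost, 0.00045)
```

With 1000 input tokens and 500 output tokens, the sketch yields (1000 × 0.15 + 500 × 0.6) / 1,000,000 = 0.00045 dollars.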
 
-### Step 2: Verify Model Names
+### Step 6: Verify Model Names
+API providers may use different names for the same model (e.g., `gpt-4o`, `openai/gpt-4o`). We manage this via `./universal_model_names.py`:
+- `universal_names`: canonical model names used in this project
+- `mapping`: maps external provider names to our canonical names
 
-Ensure all models in your config are listed in `./universal_model_names.py`. If you add a new model, you must also add the API inference endpoint in `RouterArena/llm_inference/model_inference.py`.
+Please make sure the models you use are listed there. If you add a new model, also add its API inference endpoint in `RouterArena/llm_inference/model_inference.py`.
 
-### Step 3: Generate Router's Prediction File
+### Step 7: Generate Router's Prediction File
 
-Generate a template prediction file:
+You need to create a prediction file that contains your router's model selections for each query. You can use the helper script to generate a template prediction file:
 
 ```bash
-uv run python ./router_inference/generate_prediction_file.py your-router sub_10
+uv run python ./router_inference/generate_prediction_file.py your-router 10
+```
+
+This command generates a prediction file at `./router_inference/predictions/your-router.json` for the 10% split. Use `full` instead of `10` for the complete dataset.
+
+**Important**: The generated file uses a **placeholder router** that simply cycles through models in the config file sequentially. You **must replace the model choices** in the `prediction` field with your router's actual selections. The script is only meant to provide a template structure with all required fields populated.
+
+An example prediction file structure:
+
+```json
+[
+  {
+    "global_index": "ArcMMLU_655",
+    "prompt": "Question text here...",
+    "prediction": "gpt-4o-mini",  // Auto-generated by generate_prediction_file.py
+    "generated_result": null,     // Will be filled after LLM inference
+    "cost": null,                 // Will be filled after evaluation
+    "accuracy": null              // Will be filled after evaluation
+  }
+]
 ```
 
-Use `full` instead of `sub_10` for the complete dataset. **Important**: Replace the placeholder model choices in the `prediction` field with your router's actual selections.
+Alternatively, you can create the prediction file manually or integrate it into your router's inference pipeline. The `generated_result`, `cost`, and `accuracy` fields can be left as `null` initially—they will be populated by the LLM inference and evaluation in later steps.
 
-### Step 4: Validate Config and Prediction Files
+### Step 8: Sanity Check for Config and Prediction Files
 
-Validate your config and prediction files before proceeding:
+Before proceeding with LLM inference, it's recommended to validate your config and prediction files using our validation script:
 
 ```bash
-uv run python ./router_inference/check_config_prediction_files.py your-router sub_10
+uv run python ./router_inference/check_config_prediction_files.py your-router 10
 ```
 
-This script checks: (1) all model names are valid, (2) prediction file has correct size (809 for `sub_10`, 8400 for `full`), and (3) all entries have valid `global_index`, `prompt`, and `prediction` fields.
+This script performs the following checks:
 
-## Run LLM Inference
+1. **Config Validation**: Verifies that all model names in your config file are valid and can be found in `ModelNameManager`
+2. **Prediction File Size**: Ensures your prediction file has the correct number of entries (809 for 10% split, 8400 for full dataset)
+3. **Field Validation**: Validates that each prediction entry:
+   - Has a `global_index` that exists in the dataset
+   - Has a `prompt` that exactly matches the dataset
+   - Has a `prediction` (model selection) that is one of the models listed in your config
 
-Run the inference script to make API calls for each query using the selected models:
+If all checks pass, you'll see `✓ ALL CHECKS PASSED!` and can proceed to the next step. If there are errors, the script will list them so you can fix any issues before running LLM inference and evaluation.
+
+### Step 9: Run LLM Inference
+
+Once your prediction file is ready, run the LLM inference script to make API calls for each query using the selected models:
 
 ```bash
 uv run python ./llm_inference/run.py your-router
 ```
 
-The script loads your prediction file, makes API calls using the models specified in the `prediction` field, and saves results incrementally. It uses cached results when available and saves progress after each query, so you can safely interrupt and resume. Results are saved to `./cached_results/` for reuse across routers.
+This script will:
+1. **Load your prediction file** from `./router_inference/predictions/your-router.json`
+2. **Make API calls** for each query using the model specified in the `prediction` field
+3. **Use cached results** when available (if the same model has already processed the same query)
+4. **Save results incrementally** back to the prediction file, updating the `generated_result` field with:
+   - `generated_answer`: The model's response
+   - `success`: Whether the API call succeeded
+   - `token_usage`: Input/output token counts
+   - `provider`: The API provider used
+   - `error`: Any error message (if failed)
 
-**Note**: Requires valid API keys (see Setup Step 3). The script skips entries that already have successful results.
+The script automatically saves progress after each query, so you can safely interrupt and resume later. Results are also saved to `./cached_results/` for reuse across different routers.
 
-## LLM Evaluation and Compute RouterArena Score
+**Note**: This step requires valid API keys (see Step 3) for the models you're using. The script will skip entries that already have successful results, making it safe to re-run.
 
-**Important**: For the `sub_10` split (testing), you can run evaluation locally and get RouterArena scores. For the `full` dataset (official leaderboard), ground truth answers are not available locally. After running LLM inference on the `full` dataset, submit your prediction file via GitHub issue or contact us at yifan.lu@rice.edu or jxing@rice.edu for official evaluation.
+### Step 10: Run LLM Evaluation
 
-For local evaluation on the `sub_10` split, run the evaluation script:
+After LLM inference is complete, evaluate the generated answers to compute accuracy and cost metrics:
 
 ```bash
 uv run python ./llm_evaluation/run.py your-router sub_10
 ```
 
-The script evaluates generated answers against ground truth, calculates inference costs, and computes router-level metrics including the RouterArena score (ranging 0-1). It skips already-evaluated entries, making it safe to re-run or resume.
+This script will:
+1. **Load your prediction file** from `./router_inference/predictions/your-router.json`
+2. **Determine the dataset** for each query based on its `global_index` (e.g., "AIME_112" → AIME dataset)
+3. **Evaluate each generated answer** against the ground truth using dataset-specific metrics:
+   - Math problems (AIME, MATH, etc.) → `math_metric`
+   - Multiple-choice questions (MMLUPro, ArcMMLU, etc.) → `mcq_accuracy`
+   - Code problems (LiveCodeBench) → `code_accuracy`
+   - And other specialized metrics as needed
+4. **Calculate inference cost** based on token usage and model pricing from `./model_cost/cost.json`
+5. **Save results incrementally** to the prediction file, updating:
+   - `accuracy`: Evaluation score (0.0 to 1.0)
+   - `cost`: Inference cost in dollars
+
+The script uses the `sub_10` split for testing (with ground truth answers available locally). For the full dataset evaluation, use `full` instead, but note that ground truth answers are not available locally—you'll need to submit your predictions via GitHub Issue for official evaluation.
+
+The script automatically skips entries that are already evaluated, making it safe to re-run or resume after interruption.
+
+### Step 11: Compute RouterArena Score
+
+After evaluation is complete, compute your router's RouterArena score:
+
+```bash
+uv run python ./router_evaluation/compute_scores.py your-router
+```
+
+This script calculates:
+1. **Average Accuracy**: The mean accuracy across all evaluated queries
+2. **Total Cost**: The sum of all inference costs
+3. **Average Cost per 1K Queries**: Total cost normalized to 1000 queries
+4. **RouterArena Score**: A composite score that balances accuracy and cost. It ranges from 0 to 1, with higher scores indicating better trade-offs between accuracy and cost efficiency.
+
+**Note**: Scores computed on the `sub_10` split are for testing purposes. To submit your router for the official leaderboard, you need to:
+1. Generate predictions and run evaluation for the `full` dataset
+2. Contact us at yifan.lu@rice.edu or jxing@rice.edu, or submit a GitHub issue with your results
+
+The leaderboard rankings are based on RouterArena scores computed on the full dataset.
 
 ## Citation:
 If you find our project helpful, please give us a star and cite us by:
 
@@ -15,7 +15,7 @@
     superglue_exact_match,
     superglue_clozetest,
 )
-from datasets import load_from_disk  # type: ignore[import-not-found,import-untyped]
+from datasets import load_from_disk
 
 # Dataset to metric mapping
 dataset2metric = {
 
@@ -18,7 +18,7 @@
 import glob
 from typing import Dict, List, Any, Optional
 import sys
-from tqdm import tqdm  # type: ignore[import-untyped]
+from tqdm import tqdm
 
 # Add the current directory to Python path to import eval modules
 sys.path.append(os.path.dirname(os.path.abspath(__file__)))
 
@@ -488,44 +488,20 @@ def reliability_guard(maximum_memory_bytes: Optional[int] = None):
 
     import builtins
 
-    from typing import Any, cast
-
-    builtins.exit = cast(Any, None)  # type: ignore[assignment]
-    builtins.quit = cast(Any, None)  # type: ignore[assignment]
+    from typing import Any
 
     # Prepare Any-typed aliases to avoid mypy assignment errors
+    builtins_mod: Any = builtins
     os_mod: Any = os
+    shutil_mod: Any = shutil
     subprocess_mod: Any = subprocess
+    modules_any: Any = sys.modules
 
     os.environ["OMP_NUM_THREADS"] = "1"
 
-    os.kill = cast(Any, None)  # type: ignore[assignment]
-    os.system = cast(Any, None)  # type: ignore[assignment]
-    os.putenv = cast(Any, None)  # type: ignore[assignment]
-    os.remove = cast(Any, None)  # type: ignore[assignment]
-    os.removedirs = cast(Any, None)  # type: ignore[assignment]
-    os.rmdir = cast(Any, None)  # type: ignore[assignment]
-    os.fchdir = cast(Any, None)  # type: ignore[assignment]
-    os.setuid = cast(Any, None)  # type: ignore[assignment]
-    os.fork = cast(Any, None)  # type: ignore[assignment]
-    os.forkpty = cast(Any, None)  # type: ignore[assignment]
-    os.killpg = cast(Any, None)  # type: ignore[assignment]
-    os.rename = cast(Any, None)  # type: ignore[assignment]
-    os.renames = cast(Any, None)  # type: ignore[assignment]
-    os.truncate = cast(Any, None)  # type: ignore[assignment]
-    os.replace = cast(Any, None)  # type: ignore[assignment]
-    os.unlink = cast(Any, None)  # type: ignore[assignment]
-    os.fchmod = cast(Any, None)  # type: ignore[assignment]
-    os.fchown = cast(Any, None)  # type: ignore[assignment]
-    os.chmod = cast(Any, None)  # type: ignore[assignment]
-    os.chown = cast(Any, None)  # type: ignore[assignment]
-    os.chroot = cast(Any, None)  # type: ignore[assignment]
-    os.fchdir = cast(Any, None)  # type: ignore[assignment]
-    os.lchflags = cast(Any, None)  # type: ignore[attr-defined,assignment]
-    os.lchmod = cast(Any, None)  # type: ignore[attr-defined,assignment]
-    os.lchown = cast(Any, None)  # type: ignore[assignment]
-    os.getcwd = cast(Any, None)  # type: ignore[assignment]
-    os.chdir = cast(Any, None)  # type: ignore[assignment]
+    # Disable selected builtins
+    setattr(builtins_mod, "exit", None)
+    setattr(builtins_mod, "quit", None)
 
     # Disable destructive os functions (guard where platform-specific)
     for name in [
@@ -561,22 +537,19 @@ def reliability_guard(maximum_memory_bytes: Optional[int] = None):
         except Exception:
             pass
 
-    shutil.rmtree = cast(Any, None)  # type: ignore[assignment]
-    shutil.move = cast(Any, None)  # type: ignore[assignment]
-    shutil.chown = cast(Any, None)  # type: ignore[assignment]
+    # Disable dangerous shutil functions
+    for name in ["rmtree", "move", "chown"]:
+        try:
+            setattr(shutil_mod, name, None)
+        except Exception:
+            pass
 
     # Disable subprocess.Popen
     setattr(subprocess_mod, "Popen", None)
 
-    setattr(subprocess, "Popen", cast(Any, None))  # type: ignore[misc]
-
-    # __builtins__["help"] = None   # this line is commented out as it results into error
-
-    sys.modules["ipdb"] = None  # type: ignore[assignment]
-    sys.modules["joblib"] = None  # type: ignore[assignment]
-    sys.modules["resource"] = None  # type: ignore[assignment]
-    sys.modules["psutil"] = None  # type: ignore[assignment]
-    sys.modules["tkinter"] = None  # type: ignore[assignment]
+    # Hide selected modules
+    for name in ["ipdb", "joblib", "resource", "psutil", "tkinter"]:
+        modules_any[name] = None
 
 
 def save_original_references():
 
@@ -2,15 +2,15 @@
 # SPDX-License-Identifier: Apache-2.0
 
 import re
-import regex  # type: ignore[import-untyped]
+import regex
 from math import isclose
 
 from typing import Any, Optional, List
 
-from latex2sympy2 import latex2sympy  # type: ignore[import-not-found,import-untyped]
-from sympy import N, simplify  # type: ignore[import-not-found,import-untyped]
-from sympy.parsing.latex import parse_latex  # type: ignore[import-not-found,import-untyped]
-from sympy.parsing.sympy_parser import parse_expr  # type: ignore[import-not-found,import-untyped]
+from latex2sympy2 import latex2sympy
+from sympy import N, simplify
+from sympy.parsing.latex import parse_latex
+from sympy.parsing.sympy_parser import parse_expr
 
 
 def choice_answer_clean(pred: str) -> str:
 
@@ -6,13 +6,13 @@
 import json
 import copy
 
-import jieba  # type: ignore[import-not-found,import-untyped]
-from fuzzywuzzy import fuzz  # type: ignore[import-not-found,import-untyped]
+import jieba
+from fuzzywuzzy import fuzz
 import difflib
 
 from collections import Counter
-from rouge import Rouge  # type: ignore[import-not-found,import-untyped]
-import regex  # type: ignore[import-untyped]
+from rouge import Rouge
+import regex
 
 from metric_utils import (
     choice_answer_clean,