RouteWorks · yl231 · Jan 16, 2026
diff --git a/llm_inference/README_PARALLEL.md b/llm_inference/README_PARALLEL.md
@@ -0,0 +1,140 @@
+# Parallel Inference Quick Start Guide
+
+## Overview
+
+This directory now supports parallel inference. Models are processed sequentially, one at a time, and the queries for each model are distributed across multiple workers that run in parallel.
+
+## Quick Start
+
+### 1. Process All Models (Recommended)
+
+Process all 26 models from `model_cost/model_cost.json` with 16 workers per model:
+
+```bash
+cd /path/to/RouterArena  # project root, e.g. /home/yl231/routers/RouterArena
+uv run python llm_inference/batch_inference.py --num-workers 16
+```
+
+**What this does:**
+- Loads all model names from `model_cost/model_cost.json`
+- Processes each model sequentially
+- Uses 16 workers per model to process queries in parallel
+- Skips already processed queries automatically
+- Saves results to `./cached_results/{model_name}.jsonl`
+
+### 2. Process Specific Models
+
+Process only certain models:
+
+```bash
+uv run python llm_inference/batch_inference.py \
+    --models gemini-2.0-flash-001 gpt-5-mini claude-sonnet-4-5 \
+    --num-workers 16
+```
+
+### 3. Single Model Inference
+
+Process a single model with parallel workers:
+
+```bash
+# With 16 workers (parallel)
+uv run python llm_inference/main.py \
+    --model_name gemini-2.0-flash-001 \
+    --num-workers 16
+
+# Sequential (backward compatible)
+uv run python llm_inference/main.py \
+    --model_name gemini-2.0-flash-001
+```
+
+## Configuration Options
+
+### batch_inference.py
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--num-workers` | 16 | Number of parallel workers per model |
+| `--models` | All | Specific models to process (space-separated) |
+| `--cache-dir` | `./cached_results` | Cache directory path |
+| `--model-cost-path` | `./model_cost/model_cost.json` | Path to model cost file |
+| `--input-file` | `./llm_inference/datasets/router_data.json` | Input data file |
+
+### main.py
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--model_name` | Required | Model name to process |
+| `--num-workers` | 1 | Number of parallel workers (1 = sequential) |
+| `--run-full` | False | Process full dataset |
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────┐
+│  batch_inference.py                                 │
+│  ┌───────────────────────────────────────────────┐ │
+│  │  Load models from model_cost.json (26 models)│ │
+│  │  Load dataset (8400 queries)                  │ │
+│  └───────────────────────────────────────────────┘ │
+│                                                     │
+│  FOR EACH MODEL (Sequential):                      │
+│  ┌───────────────────────────────────────────────┐ │
+│  │  Model 1: gemini-3-pro-preview                │ │
+│  │  ┌─────────────────────────────────────────┐ │ │
+│  │  │  Check cache → 8200 done, 200 remaining │ │ │
+│  │  │  Launch 16 workers (Parallel)            │ │ │
+│  │  │  ┌──────┐ ┌──────┐       ┌──────┐       │ │ │
+│  │  │  │Worker│ │Worker│  ...  │Worker│       │ │ │
+│  │  │  │  1   │ │  2   │       │  16  │       │ │ │
+│  │  │  │~12 q │ │~12 q │       │~13 q │       │ │ │
+│  │  │  └──────┘ └──────┘       └──────┘       │ │ │
+│  │  │  Wait for completion                     │ │ │
+│  │  │  Save results                            │ │ │
+│  │  └─────────────────────────────────────────┘ │ │
+│  └───────────────────────────────────────────────┘ │
+│  ┌───────────────────────────────────────────────┐ │
+│  │  Model 2: gemini-3-flash-preview             │ │
+│  │  (Same parallel processing...)                │ │
+│  └───────────────────────────────────────────────┘ │
+│  ...                                                │
+│  ┌───────────────────────────────────────────────┐ │
+│  │  Model 26: meta-llama_llama-3.1-405b-instruct│ │
+│  │  (Same parallel processing...)                │ │
+│  └───────────────────────────────────────────────┘ │
+└─────────────────────────────────────────────────────┘
+```
+
+## Next Steps
+
+After inference completes:
+
+1. **Run Evaluation**:
+
+   ```bash
+   uv run python llm_evaluation/batch_evaluate.py \
+       --cached-results-dir ./cached_results \
+       --max-workers 16
+   ```
+
+2. **Compute Scores**:
+
+   ```bash
+   uv run python router_evaluation/compute_scores.py <router_name>
+   ```
+
+## Files Modified/Created
+
+1. ✅ `llm_inference/parallel_inference.py` - Parallel inference manager
+2. ✅ `llm_inference/pipeline.py` - Added parallel support
+3. ✅ `llm_inference/main.py` - Added --num-workers argument
+4. ✅ `llm_inference/batch_inference.py` - Batch processing script
+5. ✅ `docs/PARALLEL_INFERENCE_IMPLEMENTATION.md` - Implementation details
+6. ✅ `llm_inference/README_PARALLEL.md` - This guide
+
+## Support
+
+For issues or questions:
+1. Check logs for error messages
+2. Review `docs/PARALLEL_INFERENCE_IMPLEMENTATION.md` for details
+3. Verify model names in `model_cost/model_cost.json`
+4. Ensure the dataset is prepared (run `uv run python scripts/process_datasets/prep_datasets.py`)
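
The architecture described in the README (models in sequence, queries in parallel, cached queries skipped) can be sketched as follows. This is a minimal sketch under stated assumptions, not the PR's `ParallelInferenceManager`, whose source is not part of this diff. The helper names (`run_query`, `load_done_ids`, `process_model`, `run_models_sequentially`), the use of a thread pool, and the `id` field in each cached JSONL record are all hypothetical choices made for illustration.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_query(model_name, query):
    """Hypothetical stand-in for a real LLM API call."""
    return {"id": query["id"], "model": model_name, "answer": query["text"].upper()}


def load_done_ids(cache_path):
    """Return the IDs of queries already stored in a JSONL cache file."""
    if not os.path.exists(cache_path):
        return set()
    with open(cache_path, "r", encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def process_model(model_name, queries, cache_dir, num_workers):
    """Run the uncached queries for one model in parallel; return how many ran."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, f"{model_name}.jsonl")
    done = load_done_ids(cache_path)
    pending = [q for q in queries if q["id"] not in done]
    if not pending:
        return 0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves input order, so results line up with `pending`.
        results = list(pool.map(lambda q: run_query(model_name, q), pending))
    with open(cache_path, "a", encoding="utf-8") as f:
        for result in results:
            f.write(json.dumps(result) + "\n")
    return len(results)


def run_models_sequentially(models, queries, cache_dir, num_workers):
    """Process models one at a time; only queries within a model run concurrently."""
    return {m: process_model(m, queries, cache_dir, num_workers) for m in models}


# Usage: the second pass finds every query in the cache and does no new work.
cache_dir = tempfile.mkdtemp()
queries = [{"id": i, "text": f"query {i}"} for i in range(10)]
first_pass = run_models_sequentially(["model-a", "model-b"], queries, cache_dir, 4)
second_pass = run_models_sequentially(["model-a", "model-b"], queries, cache_dir, 4)
```

Threads are a reasonable fit here on the assumption that each query is dominated by network latency rather than CPU work. The sketch writes results only after all workers finish; an implementation that appends to the cache as each result arrives would lose less work if a run is interrupted.
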
diff --git a/llm_inference/batch_inference.py b/llm_inference/batch_inference.py
@@ -0,0 +1,248 @@
+# SPDX-FileCopyrightText: Copyright contributors to the RouterArena project
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Batch LLM Inference Script
+
+This script processes multiple models from model_cost.json sequentially,
+using parallel workers for query processing within each model.
+
+Architecture:
+- Processes models sequentially (one at a time)
+- Within each model, uses k workers for parallel query processing
+- Example: 8400 queries with 16 workers → each worker handles ~525 queries
+
+Usage:
+    # Process all models from model_cost.json with 16 workers per model
+    uv run python llm_inference/batch_inference.py --num-workers 16
+
+    # Process specific models only
+    uv run python llm_inference/batch_inference.py \
+        --models gemini-2.0-flash-001 gpt-5-mini \
+        --num-workers 16
+"""
+
+import argparse
+import json
+import os
+import sys
+import logging
+import datetime
+from typing import List, Optional
+from parallel_inference import ParallelInferenceManager
+
+# Add parent directory to path for imports
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
+
+# Change to project root BEFORE loading .env file
+current_dir = os.path.dirname(os.path.abspath(__file__))
+base_dir = os.path.abspath(os.path.join(current_dir, "../"))
+os.chdir(base_dir)
+
+# Load environment variables from .env file (now in project root)
+try:
+    from dotenv import load_dotenv
+
+    load_dotenv()
+except ImportError:
+    # dotenv is optional
+    pass
+
+
+# Set up logging
+logging.basicConfig(
+    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
+)
+logger = logging.getLogger(__name__)
+
+
+def load_model_list_from_cost_file(
+    model_cost_path: str = "./model_cost/model_cost.json",
+    specified_models: Optional[List[str]] = None,
+) -> List[str]:
+    """
+    Load list of models to process from model_cost.json
+
+    Args:
+        model_cost_path: Path to model_cost.json
+        specified_models: Optional list of specific models to process
+
+    Returns:
+        List of model names to process
+    """
+    if not os.path.exists(model_cost_path):
+        raise FileNotFoundError(f"model_cost.json not found at {model_cost_path}")
+
+    with open(model_cost_path, "r", encoding="utf-8") as f:
+        model_cost = json.load(f)
+
+    all_models = list(model_cost.keys())
+
+    if specified_models:
+        # Validate specified models exist in model_cost.json
+        invalid_models = [m for m in specified_models if m not in all_models]
+        if invalid_models:
+            logger.warning(
+                f"These models not found in model_cost.json: {invalid_models}"
+            )
+
+        models = [m for m in specified_models if m in all_models]
+        logger.info(
+            f"Processing {len(models)} specified models (out of {len(all_models)} available)"
+        )
+    else:
+        models = all_models
+        logger.info(f"Processing all {len(models)} models from model_cost.json")
+
+    return models
+
+
+def main():
+    """Main function to handle batch inference."""
+    parser = argparse.ArgumentParser(
+        description="Batch LLM Inference - Process multiple models from model_cost.json",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Process all models from model_cost.json with 16 workers per model
+  uv run python llm_inference/batch_inference.py --num-workers 16
+
+  # Process all models with 2 runs per query
+  uv run python llm_inference/batch_inference.py \\
+      --num-workers 16 \\
+      --num-runs 2
+
+  # Process specific models only with 3 runs per query
+  uv run python llm_inference/batch_inference.py \\
+      --models gemini-2.0-flash-001 gpt-5-mini \\
+      --num-workers 16 \\
+      --num-runs 3
+
+  # Process with custom cache directory and 8 workers
+  uv run python llm_inference/batch_inference.py \\
+      --cache-dir ./my_cache \\
+      --num-workers 8
+        """,
+    )
+
+    parser.add_argument(
+        "--num-workers",
+        type=int,
+        default=16,
+        help="Number of parallel workers per model (default: 16)",
+    )
+    parser.add_argument(
+        "--num-runs",
+        type=int,
+        default=1,
+        help="Target number of successful inference runs per query (default: 1)",
+    )
+    parser.add_argument(
+        "--models",
+        nargs="+",
+        help="Specific models to process (default: all models from model_cost.json)",
+    )
+    parser.add_argument(
+        "--cache-dir",
+        default="./cached_results",
+        help="Directory where cached results are stored (default: ./cached_results)",
+    )
+    parser.add_argument(
+        "--model-cost-path",
+        default="./model_cost/model_cost.json",
+        help="Path to model_cost.json (default: ./model_cost/model_cost.json)",
+    )
+    parser.add_argument(
+        "--input-file",
+        default="./llm_inference/datasets/router_data.json",
+        help="Path to input data file (default: ./llm_inference/datasets/router_data.json)",
+    )
+
+    args = parser.parse_args()
+
+    try:
+        # Note: Working directory already changed to project root at module load time
+        start_time = datetime.datetime.now()
+
+        logger.info("\n" + "=" * 80)
+        logger.info("BATCH INFERENCE STARTING")
+        logger.info("=" * 80)
+        logger.info(f"Start time: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
+        logger.info(f"Workers per model: {args.num_workers}")
+        logger.info(f"Target runs per query: {args.num_runs}")
+        logger.info(f"Cache directory: {args.cache_dir}")
+        logger.info(f"Input file: {args.input_file}")
+        logger.info("=" * 80 + "\n")
+
+        # Validate input file exists
+        if not os.path.exists(args.input_file):
+            raise FileNotFoundError(
+                f"Input file not found: {args.input_file}\n"
+                "Please run: uv run python scripts/process_datasets/prep_datasets.py"
+            )
+
+        # Load models to process
+        models = load_model_list_from_cost_file(
+            model_cost_path=args.model_cost_path, specified_models=args.models
+        )
+
+        if not models:
+            logger.error("No models to process!")
+            return 1
+
+        logger.info(f"Models to process: {models}\n")
+
+        # Initialize parallel inference manager
+        manager = ParallelInferenceManager(
+            cache_dir=args.cache_dir, workers=args.num_workers
+        )
+
+        # Load input data once (will be reused for all models)
+        data = manager.load_input_data(args.input_file)
+        logger.info(f"Loaded {len(data)} queries from input file\n")
+
+        # Process all models sequentially
+        all_stats = manager.process_all_models(
+            models=models,
+            data=data,
+            num_workers=args.num_workers,
+            num_runs=args.num_runs,
+        )
+
+        # Final summary
+        end_time = datetime.datetime.now()
+        duration = (end_time - start_time).total_seconds() / 60
+
+        logger.info("\n" + "=" * 80)
+        logger.info("BATCH INFERENCE COMPLETED")
+        logger.info("=" * 80)
+        logger.info(f"End time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
+        logger.info(f"Total duration: {duration:.1f} minutes")
+        logger.info("=" * 80)
+
+        # Summary statistics
+        total_processed = sum(s["processed"] for s in all_stats.values())
+        total_successful = sum(s["successful"] for s in all_stats.values())
+        total_failed = sum(s["failed"] for s in all_stats.values())
+
+        logger.info("\nSummary Statistics:")
+        logger.info(f"  Models processed: {len(models)}")
+        logger.info(f"  Total queries processed: {total_processed}")
+        logger.info(f"  Total successful: {total_successful}")
+        logger.info(f"  Total failed: {total_failed}")
+
+        if total_processed > 0:
+            success_rate = (total_successful / total_processed) * 100
+            logger.info(f"  Success rate: {success_rate:.1f}%")
+
+        logger.info("=" * 80 + "\n")
+
+        return 0
+
+    except Exception as e:
+        logger.error(f"Error in batch inference: {e}", exc_info=True)
+        return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
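
The summary block in `main()` assumes that `process_all_models` returns a dictionary mapping each model name to a dictionary with `processed`, `successful`, and `failed` counters. The sketch below isolates that aggregation so the expected shape is explicit. The function name `summarize` and the example numbers are hypothetical and used only for illustration; they are not measured results.

```python
def summarize(all_stats):
    """Aggregate per-model counters into batch-level totals."""
    totals = {
        key: sum(stats[key] for stats in all_stats.values())
        for key in ("processed", "successful", "failed")
    }
    # Guard against division by zero when every query was already cached.
    if totals["processed"] > 0:
        totals["success_rate"] = totals["successful"] / totals["processed"] * 100
    else:
        totals["success_rate"] = None
    return totals


# Illustrative numbers only: model-b had nothing left to process.
example_stats = {
    "model-a": {"processed": 400, "successful": 300, "failed": 100},
    "model-b": {"processed": 0, "successful": 0, "failed": 0},
}
summary = summarize(example_stats)
```

A model whose queries were all cached contributes zeros to every total, which is why the script only reports a success rate when `total_processed > 0`.
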