This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is a vLLM setup project for running large language models locally with OpenAI-compatible API endpoints. The codebase is designed to work in WSL2 (Windows Subsystem for Linux) environments with GPU support.
This project uses uv for Python package management (Python 3.11+ required).
Initial setup:
```bash
# Install system dependencies (Ubuntu/WSL2)
./install-systems-deps.sh

# Create and activate virtual environment, install dependencies
./reset_env.sh
source .venv/bin/activate
```

Check installation:

```bash
python check_vllm.py
```

The primary way to run the vLLM server is through start_server.sh:
```bash
# Run with default model (qwen-7b)
./start_server.sh

# Run specific model
./start_server.sh --model qwen-math-72b

# Custom port
./start_server.sh --model qwen-14b --port 8001

# See all available models and options
./start_server.sh --help
```

Alternative Python server:

```bash
python vllm_server.py --model Qwen/Qwen2.5-7B-Instruct --port 8000
```

Models are pre-configured in start_server.sh with optimized settings (GPU memory utilization, max context length, quantization):
- 7B models (~14GB VRAM): qwen-7b, qwen-math-7b, mistral-7b, deepseek-math
- 14B models (~28GB VRAM): qwen-14b
- 24-32B models (~30GB VRAM): mistral-small, qwen-32b
- 72B models (AWQ 4-bit, ~31GB VRAM): qwen-72b, qwen-math-72b
Each model shorthand maps to: model_id|gpu_mem_utilization|max_context_length|quantization_method
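As an illustration of that pipe-delimited format, a preset entry could be unpacked as in the minimal Python sketch below. The concrete model id and numeric values are hypothetical placeholders, not copied from start_server.sh:

```python
# Hypothetical preset string in the model_id|gpu_mem_utilization|max_context_length|quantization format.
# The values below are illustrative only.
preset = "Qwen/Qwen2.5-72B-Instruct-AWQ|0.95|4096|awq"

model_id, gpu_mem_util, max_len, quantization = preset.split("|")

print(f"--model {model_id}")
print(f"--gpu-memory-utilization {gpu_mem_util}")
print(f"--max-model-len {max_len}")
print(f"--quantization {quantization}")
```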
Basic vLLM functionality test:
```bash
python test_vllm.py
```

This performs direct inference using the vLLM LLM class (not via API server).
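For reference, direct inference with the LLM class looks roughly like the sketch below. This is a minimal illustration, not the contents of test_vllm.py; the model name, prompt, and sampling settings are placeholders:

```python
# Minimal direct-inference sketch using vLLM's offline LLM class.
# Model name and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Solve 12 * 7 and explain the steps."], params)
for output in outputs:
    print(output.outputs[0].text)
```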
API server test:
```bash
# First, start the server in another terminal
./start_server.sh

# Then run the test
python test_math.py
```

This tests the running API server at http://localhost:8000/v1 with mathematical problems.
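The running server can also be queried directly with the OpenAI Python client, as in the minimal sketch below. This is not the contents of test_math.py; the prompt and temperature are placeholders, and the api_key can be any string unless the server was started with an API key. Note that the model must be requested as "vllm-model", the name the server always serves under:

```python
# Minimal sketch of querying the OpenAI-compatible endpoint exposed by the server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="vllm-model",  # the server registers every model under this name
    messages=[{"role": "user", "content": "What is the derivative of x^3 + 2x?"}],
    temperature=0.2,
)
print(response.choices[0].message.content)
```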
Entry points:
- vllm_server.py: Python wrapper for vLLM's OpenAI API server with custom argument handling
- start_server.sh: Bash launcher with pre-configured model presets and a user-friendly interface
Server architecture:
- The server uses vLLM's built-in OpenAI-compatible API server (vllm.entrypoints.openai.api_server)
- vllm_server.py constructs arguments and delegates to vLLM's main server entry point
- Models are served with the name "vllm-model" regardless of the actual model used
API endpoints (when server is running):
- http://localhost:8000/v1/models - List available models
- http://localhost:8000/v1/completions - Text completions
- http://localhost:8000/v1/chat/completions - Chat completions
- http://localhost:8000/docs - Swagger UI documentation
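For a quick raw check of these endpoints, the sketch below uses the requests library. The payload fields are the standard OpenAI-compatible ones, not project-specific; it assumes the server is already running on the default port:

```python
# Minimal sketch of hitting the raw OpenAI-compatible endpoints with requests.
import requests

base = "http://localhost:8000/v1"

# List the models the server exposes (should show "vllm-model").
print(requests.get(f"{base}/models").json())

# Plain text completion against /v1/completions.
payload = {"model": "vllm-model", "prompt": "2 + 2 =", "max_tokens": 16}
resp = requests.post(f"{base}/completions", json=payload)
print(resp.json()["choices"][0]["text"])
```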
When launching models, important parameters include:
- --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0), typically 0.85-0.95
- --max-model-len: Maximum context length (auto-detected if not specified)
- --tensor-parallel-size: Number of GPUs for tensor parallelism (default: 1)
- --quantization: Quantization method (e.g., "awq" for 72B models)
- --trust-remote-code: Required for some models with custom code
- --served-model-name: Name used in API responses (set to "vllm-model")
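As a combined illustration, launching vLLM's OpenAI-compatible server directly with these flags could look like the sketch below. The model id and numeric values are hypothetical placeholders for a 72B AWQ preset; this is not a copy of vllm_server.py or start_server.sh:

```python
# Illustrative launch of vLLM's OpenAI-compatible server with the flags described above.
# The model id and numeric values are placeholders, not the project's actual presets.
import subprocess

args = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "Qwen/Qwen2.5-72B-Instruct-AWQ",
    "--served-model-name", "vllm-model",
    "--gpu-memory-utilization", "0.95",
    "--max-model-len", "4096",
    "--quantization", "awq",
    "--tensor-parallel-size", "1",
    "--trust-remote-code",
    "--host", "0.0.0.0",  # bind to all interfaces so the Windows host can reach it under WSL2
    "--port", "8000",
]
subprocess.run(args, check=True)
```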
Additional notes:
- This project is primarily for running inference, not training
- The codebase contains Russian language comments and messages (designed for Russian-speaking users)
- WSL2-specific: the server binds to 0.0.0.0 to allow access from the Windows host
- The project structure is simple and script-based rather than application-based