244 changes: 244 additions & 0 deletions docs/vllm_integration.md
@@ -0,0 +1,244 @@
# vLLM Integration Guide for RAG-Anything

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput, memory-efficient inference engine for LLMs. It exposes an OpenAI-compatible API, making it a drop-in backend for RAG-Anything in production environments.

## Why vLLM?

| Feature | vLLM | Ollama | LM Studio |
|---------|------|--------|-----------|
| **Continuous batching** | ✅ | ❌ | ❌ |
| **PagedAttention** | ✅ | ❌ | ❌ |
| **Tensor parallelism** | ✅ | ❌ | ❌ |
| **Production throughput** | ✅ High | Moderate | Low |
| **Quantization (AWQ/GPTQ/FP8)** | ✅ | ✅ (GGUF) | ✅ (GGUF) |
| **Multi-GPU support** | ✅ Native | Limited | ❌ |
| **Ease of setup** | Moderate | Easy | Easy |
| **GUI** | ❌ | ❌ | ✅ |

**Choose vLLM when:** You need production-grade throughput, serve multiple concurrent users, or run large models across multiple GPUs.

## Prerequisites

1. **NVIDIA GPU(s)** with CUDA support (compute capability ≥ 7.0)
2. **Python 3.10+** (the package declares `requires-python = ">=3.10"` in `pyproject.toml`)

3. **vLLM installed:**
   ```bash
   pip install vllm
   ```
4. **RAG-Anything installed:**
   ```bash
   pip install raganything
   ```

## Quick Start

### 1. Start vLLM Server

**Chat/Completion model:**
```bash
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --port 8000
```

**Embedding model** (separate process, different port):
```bash
vllm serve BAAI/bge-m3 \
    --task embedding \
    --port 8001
```
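
With both servers up, it is worth a quick smoke test before wiring up RAG-Anything. A minimal sketch using the `openai` Python client, assuming this guide's defaults (`pip install openai`, the placeholder key `token-abc123`, and no `--api-key` set at launch, in which case vLLM accepts any non-empty key):

```python
from openai import OpenAI

# Separate clients for the chat server (:8000) and embedding server (:8001).
chat = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
embed = OpenAI(base_url="http://localhost:8001/v1", api_key="token-abc123")

# The served model IDs must match LLM_MODEL / EMBEDDING_MODEL exactly.
print([m.id for m in chat.models.list().data])   # e.g. ['Qwen/Qwen2.5-72B-Instruct']
print([m.id for m in embed.models.list().data])  # e.g. ['BAAI/bge-m3']

# One round trip through each server.
reply = chat.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=8,
)
print(reply.choices[0].message.content)

vec = embed.embeddings.create(model="BAAI/bge-m3", input="hello")
print(len(vec.data[0].embedding))  # should equal EMBEDDING_DIM (1024 for bge-m3)
```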

### 2. Configure Environment

Create a `.env` file:

```bash
### vLLM Configuration
LLM_BINDING=vllm
LLM_MODEL=Qwen/Qwen2.5-72B-Instruct
LLM_BINDING_HOST=http://localhost:8000/v1
LLM_BINDING_API_KEY=token-abc123

### Embedding via vLLM
EMBEDDING_BINDING=vllm
EMBEDDING_MODEL=BAAI/bge-m3
EMBEDDING_DIM=1024
EMBEDDING_BINDING_HOST=http://localhost:8001/v1
EMBEDDING_BINDING_API_KEY=token-abc123
```
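
RAG-Anything reads these variables at startup. To confirm the file is wired correctly before a full run, you can load it yourself and query the endpoint it points at — a sketch assuming `python-dotenv` and `openai` are installed:

```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads .env from the current directory

client = OpenAI(
    base_url=os.environ["LLM_BINDING_HOST"],
    api_key=os.environ["LLM_BINDING_API_KEY"],
)

# LLM_MODEL must exactly match one of the IDs vLLM reports.
served = [m.id for m in client.models.list().data]
assert os.environ["LLM_MODEL"] in served, f"LLM_MODEL not among served models: {served}"
print("LLM endpoint OK:", served)
```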

### 3. Run the Example

```bash
cd examples
python vllm_integration_example.py
```

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_BINDING` | — | Set to `vllm` |
| `LLM_MODEL` | `Qwen/Qwen2.5-72B-Instruct` | Model name (must match what vLLM is serving) |
| `LLM_BINDING_HOST` | `http://localhost:8000/v1` | vLLM API base URL |
| `LLM_BINDING_API_KEY` | `token-abc123` | API key (vLLM default: any non-empty string) |
| `EMBEDDING_BINDING` | — | Set to `vllm` |
| `EMBEDDING_MODEL` | `BAAI/bge-m3` | Embedding model name |
| `EMBEDDING_DIM` | `1024` | Embedding dimensions |
| `EMBEDDING_BINDING_HOST` | `http://localhost:8001/v1` | Embedding endpoint URL |
| `EMBEDDING_BINDING_API_KEY` | `token-abc123` | Embedding API key |

## Model Configurations

### Qwen 2.5 (Recommended for RAG)
```bash
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768
```

### Mistral / Mixtral
```bash
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --max-model-len 32768
```

### Llama 3.1 70B
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192
```

### With AWQ Quantization (reduced memory)
```bash
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 32768
```

### With GPTQ Quantization
```bash
vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
    --tensor-parallel-size 2 \
    --quantization gptq
```

## Performance Tips

### Tensor Parallelism
Distribute large models across GPUs. Set `--tensor-parallel-size` to the number of GPUs:
```bash
# 4x A100 80GB → can serve 72B models in full precision
vllm serve Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 4
```

### GPU Memory Utilization
Increase if you have headroom (default 0.9):
```bash
vllm serve ... --gpu-memory-utilization 0.95
```

### Max Model Length
Reduce if you don't need full context (saves memory):
```bash
# RAG chunks are typically <4K tokens; 8192 is often sufficient
vllm serve ... --max-model-len 8192
```

### Concurrency
vLLM handles batching automatically. On the RAG-Anything side, increase `MAX_ASYNC` in your `.env`:
```bash
MAX_ASYNC=16 # vLLM handles concurrent requests efficiently
```
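
To see what this buys you client-side, the sketch below fires 16 requests at once with `AsyncOpenAI`; vLLM's continuous batching folds them into shared forward passes instead of queueing them serially. Model name and prompts are placeholders matching the Quick Start setup:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

async def ask(i: int) -> str:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": f"Summarize chunk {i} in one line."}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    # 16 requests in flight at once, mirroring MAX_ASYNC=16 on the client side.
    results = await asyncio.gather(*(ask(i) for i in range(16)))
    print(f"{len(results)} responses received")

asyncio.run(main())
```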

### Speculative Decoding (vLLM ≥ 0.4)
Use a small draft model to speed up generation:
```bash
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --speculative-model Qwen/Qwen2.5-0.5B-Instruct \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 4
```

## Embedding Options

### Option A: vLLM Embedding Server (Recommended)
Run a dedicated vLLM instance for embeddings:
```bash
vllm serve BAAI/bge-m3 --task embedding --port 8001
```
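
To confirm the server behaves like a retrieval backend (and that `EMBEDDING_DIM` matches what it returns), here is a short ranking sketch — it assumes `numpy` and `openai` are installed and uses the Quick Start endpoint; the texts are illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="token-abc123")

query = "How does PagedAttention reduce memory waste?"
passages = [
    "PagedAttention stores the KV cache in fixed-size blocks, like virtual memory pages.",
    "The embedding server runs on port 8001 in this guide.",
]

resp = client.embeddings.create(model="BAAI/bge-m3", input=[query] + passages)
vecs = np.array([d.embedding for d in resp.data])
print(vecs.shape[1])  # should equal EMBEDDING_DIM (1024 for bge-m3)

vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize for cosine similarity
scores = vecs[1:] @ vecs[0]  # each passage scored against the query
for passage, score in sorted(zip(passages, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
```

The on-topic passage should rank first; if it does not, the wrong model is probably being served.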

### Option B: Use Ollama for Embeddings
If you already run Ollama, you can mix backends:
```bash
EMBEDDING_BINDING=ollama
EMBEDDING_MODEL=bge-m3:latest
EMBEDDING_BINDING_HOST=http://localhost:11434
```

### Option C: OpenAI Embeddings
Use OpenAI's embedding API alongside vLLM for chat:
```bash
EMBEDDING_BINDING=openai
EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIM=3072
EMBEDDING_BINDING_HOST=https://api.openai.com/v1
EMBEDDING_BINDING_API_KEY=sk-...
```

## Architecture

```
        ┌──────────────────────┐
        │     RAG-Anything     │
        │ (Document Processing │
        │   + Query Engine)    │
        └──────────┬───────────┘
                   │ OpenAI-compatible API
         ┌─────────┴─────────┐
         ▼                   ▼
┌──────────────────────┐  ┌──────────────────────┐
│   vLLM Chat Server   │  │ vLLM Embedding Server│
│       :8000/v1       │  │       :8001/v1       │
│   (Qwen-72B, etc.)   │  │    (bge-m3, etc.)    │
└──────────┬───────────┘  └──────────┬───────────┘
           │                         │
           ▼                         ▼
┌────────────────────────────────────────────────┐
│                  GPU Cluster                   │
│     PagedAttention · Continuous Batching       │
│      Tensor Parallelism · Quantization         │
└────────────────────────────────────────────────┘
```

## Troubleshooting

### Connection Refused
```
❌ Connection failed: Connection refused
```
- Ensure vLLM is running: `curl http://localhost:8000/v1/models`
- Check the port matches your `LLM_BINDING_HOST`
- Wait for model loading to complete (large models can take minutes) — the poll loop below automates this wait
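
A readiness check saves guessing at how long loading will take. A sketch that polls the `/v1/models` endpoint until the server answers (URL and timeout are illustrative; assumes the `requests` package):

```python
import time

import requests

def wait_for_vllm(base_url: str = "http://localhost:8000/v1", timeout: float = 600.0) -> None:
    """Poll the models endpoint until vLLM finishes loading or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/models", timeout=5).status_code == 200:
                print("vLLM is ready")
                return
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(5)
    raise TimeoutError(f"vLLM at {base_url} not ready after {timeout}s")

wait_for_vllm()
```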

### Out of Memory
```
torch.cuda.OutOfMemoryError
```
- Use quantized models (`--quantization awq` or `gptq`)
- Reduce `--max-model-len`
- Increase `--tensor-parallel-size` (more GPUs)
- Lower `--gpu-memory-utilization`

### Model Not Found
```
Model 'xxx' not found
```
- `LLM_MODEL` must exactly match the model name vLLM is serving
- Check available models: `curl http://localhost:8000/v1/models`

### Slow First Request
This is normal — vLLM compiles CUDA kernels on first use. Subsequent requests are fast.
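
If you want to absorb that cost before real traffic arrives, send a throwaway request at startup and time it — a sketch reusing the Quick Start endpoint and model:

```python
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def timed_request(tag: str) -> None:
    start = time.perf_counter()
    client.chat.completions.create(
        model="Qwen/Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": "warm-up"}],
        max_tokens=1,
    )
    print(f"{tag}: {time.perf_counter() - start:.2f}s")

timed_request("first request (warm-up)")  # pays the one-time compilation cost
timed_request("second request")           # steady-state latency
```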
11 changes: 9 additions & 2 deletions env.example
@@ -109,7 +109,7 @@ MAX_ASYNC=4
 ### MAX_TOKENS: max tokens send to LLM for entity relation summaries (less than context size of the model)
 ### MAX_TOKENS: set as num_ctx option for Ollama by API Server
 MAX_TOKENS=32768
-### LLM Binding type: openai, ollama, lollms, azure_openai, lmstudio
+### LLM Binding type: openai, ollama, lollms, azure_openai, lmstudio, vllm
 LLM_BINDING=openai
 LLM_MODEL=gpt-4o
 LLM_BINDING_HOST=https://api.openai.com/v1
@@ -118,8 +118,15 @@ LLM_BINDING_API_KEY=your_api_key
 # AZURE_OPENAI_API_VERSION=2024-08-01-preview
 # AZURE_OPENAI_DEPLOYMENT=gpt-4o
 
+### vLLM Configuration (high-throughput production inference)
+### See docs/vllm_integration.md for setup guide
+# LLM_BINDING=vllm
+# LLM_MODEL=Qwen/Qwen2.5-72B-Instruct
+# LLM_BINDING_HOST=http://localhost:8000/v1
+# LLM_BINDING_API_KEY=token-abc123
+
 ### Embedding Configuration
-### Embedding Binding type: openai, ollama, lollms, azure_openai, lmstudio
+### Embedding Binding type: openai, ollama, lollms, azure_openai, lmstudio, vllm
 EMBEDDING_BINDING=ollama
 EMBEDDING_MODEL=bge-m3:latest
 EMBEDDING_DIM=1024