42 changes: 42 additions & 0 deletions README.md
@@ -214,6 +214,48 @@ The latest roadmap is also tracked in [issue #125](https://github.com/ovg-projec
- [x] NVIDIA GPUs
- [ ] AMD GPUs

## FAQ

### What is the difference between kvcached and Paged Attention?

Both technologies involve GPU memory management for LLM inference, but they differ fundamentally in their approach:

| Aspect | Paged Attention | kvcached |
|--------|-----------------|----------|
| **Memory Allocation** | Static reservation at startup | Dynamic on-demand allocation |
| **Scope** | Optimizes single-model serving | Enables multi-model GPU sharing |
| **Idle Memory Usage** | Reserved memory stays allocated | Zero GPU memory when idle |
| **Virtual Memory** | Maps logical blocks to physical GPU memory | Full OS-style virtual memory abstraction |

**Paged Attention** (used in vLLM, SGLang) organizes KV cache into fixed-size blocks and uses a page table to map logical blocks to physical GPU memory. However, the physical memory is **statically reserved** during engine initialization.

**kvcached** builds on this by adding true virtual memory semantics: engines initially reserve only *virtual* address space, and physical GPU memory is allocated **on demand** as requests arrive. When a model is idle, its physical memory can be reclaimed for other models.

This enables multiple LLMs to elastically share the same GPU without rigid memory partitioning.
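
As a rough illustration of the difference, the toy model below tracks how many physical pages each approach holds while busy and while idle. It is a conceptual sketch only: `StaticPool` and `OnDemandPool` are hypothetical names, the page counts are arbitrary, and neither class reflects how vLLM, SGLang, or kvcached is actually implemented.

```python
class StaticPool:
    """Toy model of static reservation: every page is claimed at startup."""

    def __init__(self, total_pages):
        self.reserved = total_pages  # physical pages held for the engine's lifetime

    def physical_pages(self):
        return self.reserved  # unchanged even when no request is running


class OnDemandPool:
    """Toy model of on-demand mapping: virtual slots exist, physical pages come and go."""

    def __init__(self, virtual_pages):
        self.virtual_pages = virtual_pages  # address space only, no physical memory yet
        self.mapped = set()                 # virtual page indices currently backed

    def map_page(self, index):
        if not 0 <= index < self.virtual_pages:
            raise IndexError("page outside the reserved virtual range")
        self.mapped.add(index)

    def release_all(self):
        self.mapped.clear()  # physical pages are returned, the virtual range is kept

    def physical_pages(self):
        return len(self.mapped)


static = StaticPool(total_pages=1024)
elastic = OnDemandPool(virtual_pages=1024)

# A request arrives and touches eight pages of KV cache.
for page in range(8):
    elastic.map_page(page)
busy = (static.physical_pages(), elastic.physical_pages())

# The model goes idle and its physical pages are reclaimed.
elastic.release_all()
idle = (static.physical_pages(), elastic.physical_pages())

print("busy:", busy)
print("idle:", idle)
```

In this model the static pool's footprint never changes, while the on-demand pool's footprint follows the workload, which is the property that lets several models share one GPU.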

### Do I need to set `--gpu-memory-utilization` when using kvcached?

**No.** When kvcached is enabled, it automatically manages GPU memory allocation. Do NOT use:
- vLLM: `--gpu-memory-utilization`
- SGLang: `--mem-fraction-static`

Instead, configure memory limits via kvcached settings (e.g., `kvcached_gpu_utilization` in the controller YAML or via `kvctl`).
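
To catch leftover flags before launch, a small pre-flight check along these lines may help. This is a sketch, not part of kvcached: `find_conflicting_flags` is a hypothetical helper, and it assumes engine arguments arrive as a list of strings such as `"--flag value"` or `"--flag=value"`.

```python
# Static-memory flags that should not be combined with kvcached.
CONFLICTING_FLAGS = ("--gpu-memory-utilization", "--mem-fraction-static")


def find_conflicting_flags(engine_args):
    """Return the engine args whose flag name is a static-memory flag."""
    conflicts = []
    for arg in engine_args:
        # Normalize "--flag=0.5" to "--flag 0.5", then take the flag name.
        parts = arg.replace("=", " ").split()
        if parts and parts[0] in CONFLICTING_FLAGS:
            conflicts.append(arg)
    return conflicts


example_args = [
    "--host=localhost",
    "--gpu-memory-utilization 0.5",
    "--enable-sleep-mode",
]
print(find_conflicting_flags(example_args))
```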

### Can kvcached work with prefix caching?

Not yet. Prefix caching requires keeping KV cache blocks allocated across requests, which prevents kvcached from reclaiming memory when models are idle. Use:
- vLLM: `--no-enable-prefix-caching`
- SGLang: `--disable-radix-cache`
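
The conflict can be pictured with simple set arithmetic. The sketch below is a toy model under assumed names (`reclaimable_blocks` and the block numbering are hypothetical); it does not describe the internals of any engine's prefix cache.

```python
def reclaimable_blocks(all_blocks, used_by_requests, held_by_prefix_cache):
    """Return blocks that no live request and no cached prefix still needs."""
    return set(all_blocks) - set(used_by_requests) - set(held_by_prefix_cache)


all_blocks = range(6)

# Idle model, prefix caching off: every block can be handed back.
without_cache = reclaimable_blocks(all_blocks, used_by_requests=[], held_by_prefix_cache=[])

# Idle model, but the prefix cache still retains blocks 0-2 for future reuse.
with_cache = reclaimable_blocks(all_blocks, used_by_requests=[], held_by_prefix_cache=[0, 1, 2])

print(sorted(without_cache))
print(sorted(with_cache))
```

Even with no live requests, the blocks retained for prefix reuse stay out of the reclaimable set, so an idle model cannot drop to zero physical memory.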

### How do I monitor kvcached memory usage?

Use the `kvctl` CLI tool:
```bash
kvctl shell # Interactive shell
kvctl list # Show IPC segments and usage
kvctl kvtop # Launch curses UI for real-time monitoring
```

## Contributing

We are grateful for and open to contributions and collaborations of any kind.
60 changes: 60 additions & 0 deletions controller/README.md
@@ -166,3 +166,63 @@ python test_traffic_monitor.py
| `/action/sleep/{model_name}` | Manually put a model to sleep |
| `/action/wakeup/{model_name}` | Manually wake up a sleeping model |
| `/sleep/candidates` | Models that are candidates for sleep mode |

---

## Troubleshooting Multi-Model Setups

### Memory Allocation Failures

If you see errors such as "cannot allocate memory" or out-of-memory (OOM) failures when starting multiple models, work through the following steps:

**1. Do NOT use static memory allocation flags with kvcached**

When kvcached is enabled, it manages GPU memory dynamically. Remove these conflicting flags:
- vLLM: `--gpu-memory-utilization`
- SGLang: `--mem-fraction-static`

Instead, use `kvcached_gpu_utilization` in your YAML config (default: 0.95).

**2. Add startup delays between instances**

When launching multiple models, stagger their startup so that kvcached can stabilize before the next instance begins allocating:
```yaml
instances:
  - name: model1
    # ... config ...
  - name: model2
    launch_delay_seconds: 30  # Wait 30s after model1 starts
```
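
The effect of such a delay can be sketched as a simple ordered launch loop. This is an illustration under assumptions, not the controller's actual code: `launch_staggered` and its `start` callback are hypothetical, and only the `launch_delay_seconds` key comes from the config above.

```python
import time


def launch_staggered(instances, start, sleep=time.sleep):
    """Start instances in order, pausing before any that sets launch_delay_seconds."""
    for instance in instances:
        delay = instance.get("launch_delay_seconds", 0)
        if delay > 0:
            sleep(delay)  # give the previously started instance time to settle
        start(instance["name"])


# Dry run: record the sequence of actions instead of sleeping or spawning processes.
events = []
launch_staggered(
    [{"name": "model1"}, {"name": "model2", "launch_delay_seconds": 30}],
    start=lambda name: events.append(("start", name)),
    sleep=lambda seconds: events.append(("sleep", seconds)),
)
print(events)
```

Passing `sleep` and `start` as parameters keeps the ordering logic testable without waiting or launching real engines.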

**3. Monitor memory usage with kvctl**

Use the kvcached CLI to monitor real-time memory allocation:
```bash
# Start the interactive shell; the commands below are typed at its kvcached> prompt
kvctl shell

# List active IPC segments
kvcached> list

# Watch memory usage in real-time
kvcached> watch -n 2

# Launch curses UI for detailed view
kvcached> kvtop
```

### Model Not Responding

1. Check if the model is sleeping: `curl http://localhost:8080/sleep/status`
2. Wake it up: `curl -X POST "http://localhost:8080/action/wakeup/model-name"`
3. Check backend health: `curl "http://localhost:8080/health/model-name"`
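
The same three checks can be scripted. The sketch below uses only the Python standard library and the endpoints listed above; `build_checks` and `run_checks` are hypothetical helper names, and the base URL `http://localhost:8080` is taken from the curl examples and may differ in your deployment.

```python
import urllib.request
from urllib.parse import quote


def build_checks(model_name, base_url="http://localhost:8080"):
    """Return (method, url) pairs for the three troubleshooting steps, in order."""
    name = quote(model_name)  # escape spaces and similar; "/" is left as-is
    return [
        ("GET", f"{base_url}/sleep/status"),
        ("POST", f"{base_url}/action/wakeup/{name}"),
        ("GET", f"{base_url}/health/{name}"),
    ]


def run_checks(model_name, base_url="http://localhost:8080", timeout=5.0):
    """Issue each request and print the HTTP status; raises on connection or HTTP errors."""
    for method, url in build_checks(model_name, base_url):
        request = urllib.request.Request(url, method=method)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            print(method, url, response.status)


for method, url in build_checks("model-name"):
    print(method, url)

# run_checks("model-name")  # uncomment when a controller is running
```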

### Understanding kvcached vs Engine Memory Settings

| Setting | Purpose | When to Use |
|---------|---------|-------------|
| `kvcached_gpu_utilization` | Max GPU fraction kvcached can use | Always with kvcached |
| `--gpu-memory-utilization` | vLLM static reservation | **Never** with kvcached |
| `--mem-fraction-static` | SGLang static reservation | **Never** with kvcached |

kvcached allocates memory **on demand** as requests arrive, unlike static allocation, which reserves memory up front. This enables elastic sharing between multiple models.
6 changes: 4 additions & 2 deletions controller/example-config.yaml
@@ -36,7 +36,8 @@ instances: # instances configuration
- "--no-enable-prefix-caching"
- "--host=localhost"
- "--port=12346"
- - "--gpu-memory-utilization 0.5"
+ # NOTE: Do NOT use --gpu-memory-utilization with kvcached.
+ # kvcached manages memory dynamically via kvcached_gpu_utilization above.
- "--enable-sleep-mode"
- name: instance2
model: Qwen/Qwen3-0.6B
@@ -49,7 +50,8 @@ instances: # instances configuration
engine_args:
- "--disable-radix-cache"
- "--trust-remote-code"
- - "--mem-fraction-static 0.5"
+ # NOTE: Do NOT use --mem-fraction-static with kvcached.
+ # kvcached manages memory dynamically via kvcached_gpu_utilization above.
- "--host=localhost"
- "--port=30000"
- "--enable-memory-saver"