diff --git a/README.md b/README.md
index c73d5c22..c483fd14 100644
--- a/README.md
+++ b/README.md
@@ -214,6 +214,48 @@ The latest roadmap is also tracked in [issue #125](https://github.com/ovg-projec
 - [x] NVIDIA GPUs
 - [ ] AMD GPUs
 
+## FAQ
+
+### What is the difference between kvcached and Paged Attention?
+
+Both technologies involve GPU memory management for LLM inference, but they differ fundamentally in their approach:
+
+| Aspect | Paged Attention | kvcached |
+|--------|-----------------|----------|
+| **Memory Allocation** | Static reservation at startup | Dynamic on-demand allocation |
+| **Scope** | Optimizes single-model serving | Enables multi-model GPU sharing |
+| **Idle Memory Usage** | Reserved memory stays allocated | Zero GPU memory when idle |
+| **Virtual Memory** | Maps logical blocks to physical | Full OS-style virtual memory abstraction |
+
+**Paged Attention** (used in vLLM, SGLang) organizes KV cache into fixed-size blocks and uses a page table to map logical blocks to physical GPU memory. However, the physical memory is **statically reserved** during engine initialization.
+
+**kvcached** builds on top of this by adding true virtual memory semantics: engines reserve only *virtual* address space initially, and physical GPU memory is allocated **on-demand** as requests arrive. When a model is idle, its physical memory can be reclaimed for other models.
+
+This enables multiple LLMs to elastically share the same GPU without rigid memory partitioning.
+
+### Do I need to set `--gpu-memory-utilization` when using kvcached?
+
+**No.** When kvcached is enabled, it automatically manages GPU memory allocation. Do NOT use:
+- vLLM: `--gpu-memory-utilization`
+- SGLang: `--mem-fraction-static`
+
+Instead, configure memory limits via kvcached settings (e.g., `kvcached_gpu_utilization` in the controller YAML or via `kvctl`).
+
+### Can kvcached work with prefix caching?
+
+Not yet. Prefix caching requires keeping KV cache blocks allocated across requests, which prevents kvcached from reclaiming memory when models are idle. Use:
+- vLLM: `--no-enable-prefix-caching`
+- SGLang: `--disable-radix-cache`
+
+### How do I monitor kvcached memory usage?
+
+Use the `kvctl` CLI tool:
+```bash
+kvctl shell # Interactive shell
+kvctl list # Show IPC segments and usage
+kvctl kvtop # Launch curses UI for real-time monitoring
+```
+
 ## Contributing
 
 We are grateful for and open to contributions and collaborations of any kind.
diff --git a/controller/README.md b/controller/README.md
index 61c53b66..65e84325 100644
--- a/controller/README.md
+++ b/controller/README.md
@@ -166,3 +166,63 @@ python test_traffic_monitor.py
 | `/action/sleep/{model_name}` | Manually put a model to sleep |
 | `/action/wakeup/{model_name}` | Manually wake up a sleeping model |
 | `/sleep/candidates` | Models that are candidates for sleep mode |
+
+---
+
+## Troubleshooting Multi-Model Setups
+
+### Memory Allocation Failures
+
+If you encounter errors like "cannot allocate memory" or OOM when starting multiple models:
+
+**1. Do NOT use static memory allocation flags with kvcached**
+
+When kvcached is enabled, it manages GPU memory dynamically. Remove these conflicting flags:
+- vLLM: `--gpu-memory-utilization`
+- SGLang: `--mem-fraction-static`
+
+Instead, use `kvcached_gpu_utilization` in your YAML config (default: 0.95).
+
+**2. Add startup delays between instances**
+
+When launching multiple models, add delays to allow kvcached to stabilize:
+```yaml
+instances:
+  - name: model1
+    # ... config ...
+  - name: model2
+    launch_delay_seconds: 30 # Wait 30s after model1 starts
+```
+
+**3. Monitor memory usage with kvctl**
+
+Use the kvcached CLI to monitor real-time memory allocation:
+```bash
+# Interactive shell
+kvctl shell
+
+# List active IPC segments
+kvcached> list
+
+# Watch memory usage in real-time
+kvcached> watch -n 2
+
+# Launch curses UI for detailed view
+kvcached> kvtop
+```
+
+### Model Not Responding
+
+1. Check if the model is sleeping: `curl http://localhost:8080/sleep/status`
+2. Wake it up: `curl -X POST "http://localhost:8080/action/wakeup/model-name"`
+3. Check backend health: `curl "http://localhost:8080/health/model-name"`
+
+### Understanding kvcached vs Engine Memory Settings
+
+| Setting | Purpose | When to Use |
+|---------|---------|-------------|
+| `kvcached_gpu_utilization` | Max GPU fraction kvcached can use | Always with kvcached |
+| `--gpu-memory-utilization` | vLLM static reservation | **Never** with kvcached |
+| `--mem-fraction-static` | SGLang static reservation | **Never** with kvcached |
+
+kvcached allocates memory **on-demand** as requests arrive, unlike static allocation which reserves memory upfront. This enables elastic sharing between multiple models.
diff --git a/controller/example-config.yaml b/controller/example-config.yaml
index 7ab58284..bb71e7fa 100644
--- a/controller/example-config.yaml
+++ b/controller/example-config.yaml
@@ -36,7 +36,8 @@ instances: # instances configuration
       - "--no-enable-prefix-caching"
       - "--host=localhost"
       - "--port=12346"
-      - "--gpu-memory-utilization 0.5"
+      # NOTE: Do NOT use --gpu-memory-utilization with kvcached.
+      # kvcached manages memory dynamically via kvcached_gpu_utilization above.
       - "--enable-sleep-mode"
   - name: instance2
     model: Qwen/Qwen3-0.6B
@@ -49,7 +50,8 @@ instances: # instances configuration
     engine_args:
       - "--disable-radix-cache"
       - "--trust-remote-code"
-      - "--mem-fraction-static 0.5"
+      # NOTE: Do NOT use --mem-fraction-static with kvcached.
+      # kvcached manages memory dynamically via kvcached_gpu_utilization above.
       - "--host=localhost"
       - "--port=30000"
       - "--enable-memory-saver"
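
Reviewer note: the rule this change documents (no static memory flags in `engine_args` when kvcached is enabled) could also be checked mechanically before launch. The following is only a minimal sketch under stated assumptions: `STATIC_MEMORY_FLAGS` and `find_static_memory_flags` are hypothetical names that do not exist in kvcached, and the only facts taken from the diff are the two flag names and the `engine_args` list shape shown in `example-config.yaml`.

```python
# Hypothetical helper, not part of kvcached: it only illustrates the rule
# documented above. The two flag names come from the diff; everything else
# is an assumption made for this sketch.
STATIC_MEMORY_FLAGS = ("--gpu-memory-utilization", "--mem-fraction-static")


def find_static_memory_flags(engine_args):
    """Return the entries of engine_args that set a static memory reservation."""
    conflicts = []
    for arg in engine_args:
        # An entry may be written as "--flag", "--flag=0.5", or "--flag 0.5",
        # so normalize "=" to a space and keep only the flag name.
        flag = arg.strip().replace("=", " ").split(" ", 1)[0]
        if flag in STATIC_MEMORY_FLAGS:
            conflicts.append(arg)
    return conflicts


if __name__ == "__main__":
    # engine_args as they looked for the first instance before this change.
    old_args = [
        "--no-enable-prefix-caching",
        "--host=localhost",
        "--port=12346",
        "--gpu-memory-utilization 0.5",
        "--enable-sleep-mode",
    ]
    for entry in find_static_memory_flags(old_args):
        print(f"conflicts with kvcached: {entry}")
```

A check like this would be a convenience only; the documentation change above remains the source of truth for which flags to avoid.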