ovg-project · yurekami · Dec 27, 2025
diff --git a/README.md b/README.md
@@ -214,6 +214,48 @@ The latest roadmap is also tracked in [issue #125](https://github.com/ovg-projec
   - [x] NVIDIA GPUs
   - [ ] AMD GPUs
 
+## FAQ
+
+### What is the difference between kvcached and Paged Attention?
+
+Both manage GPU memory for LLM inference, but they differ in when physical memory is committed and in how many models can share it:
+
+| Aspect | Paged Attention | kvcached |
+|--------|-----------------|----------|
+| **Memory Allocation** | Static reservation at startup | Dynamic on-demand allocation |
+| **Scope** | Optimizes single-model serving | Enables multi-model GPU sharing |
+| **Idle Memory Usage** | Reserved memory stays allocated | Physical memory can be reclaimed when idle |
+| **Virtual Memory** | Maps logical blocks to physical blocks | Full OS-style virtual memory abstraction |
+
+**Paged Attention** (used in vLLM, SGLang) organizes KV cache into fixed-size blocks and uses a page table to map logical blocks to physical GPU memory. However, the physical memory is **statically reserved** during engine initialization.
+
+**kvcached** builds on top of this by adding true virtual memory semantics: engines reserve only *virtual* address space initially, and physical GPU memory is allocated **on-demand** as requests arrive. When a model is idle, its physical memory can be reclaimed for other models.
+
+This enables multiple LLMs to elastically share the same GPU without rigid memory partitioning.
+
+### Do I need to set `--gpu-memory-utilization` when using kvcached?
+
+**No.** When kvcached is enabled, it automatically manages GPU memory allocation. Do NOT use:
+- vLLM: `--gpu-memory-utilization`
+- SGLang: `--mem-fraction-static`
+
+Instead, configure memory limits via kvcached settings (e.g., `kvcached_gpu_utilization` in the controller YAML or via `kvctl`).
+
+### Can kvcached work with prefix caching?
+
+Not yet. Prefix caching keeps KV cache blocks allocated across requests, which prevents kvcached from reclaiming memory when a model is idle. Disable it with:
+- vLLM: `--no-enable-prefix-caching`
+- SGLang: `--disable-radix-cache`
+
+### How do I monitor kvcached memory usage?
+
+Use the `kvctl` CLI tool:
+```bash
+kvctl shell          # Interactive shell
+kvctl list           # Show IPC segments and usage
+kvctl kvtop          # Launch curses UI for real-time monitoring
+```
+
 ## Contributing
 
 We are grateful for and open to contributions and collaborations of any kind.

diff --git a/controller/README.md b/controller/README.md
@@ -166,3 +166,63 @@ python test_traffic_monitor.py
 | `/action/sleep/{model_name}` | Manually put a model to sleep |
 | `/action/wakeup/{model_name}` | Manually wake up a sleeping model |
 | `/sleep/candidates` | Models that are candidates for sleep mode |
+
+---
+
+## Troubleshooting Multi-Model Setups
+
+### Memory Allocation Failures
+
+If you see errors such as "cannot allocate memory" or out-of-memory (OOM) failures when starting multiple models:
+
+**1. Do NOT use static memory allocation flags with kvcached**
+
+When kvcached is enabled, it manages GPU memory dynamically. Remove these conflicting flags:
+- vLLM: `--gpu-memory-utilization`
+- SGLang: `--mem-fraction-static`
+
+Instead, use `kvcached_gpu_utilization` in your YAML config (default: 0.95).
+
+**2. Add startup delays between instances**
+
+When launching multiple models, add a delay between launches so that kvcached can stabilize after one instance starts before the next one begins:
+```yaml
+instances:
+  - name: model1
+    # ... config ...
+  - name: model2
+    launch_delay_seconds: 30  # Wait 30s after model1 starts
+```
+
+**3. Monitor memory usage with kvctl**
+
+Use the kvcached CLI to monitor real-time memory allocation:
+```bash
+# Start the interactive shell; the commands below are typed at its kvcached> prompt
+kvctl shell
+
+# List active IPC segments
+kvcached> list
+
+# Watch memory usage in real-time
+kvcached> watch -n 2
+
+# Launch curses UI for detailed view
+kvcached> kvtop
+```
+
+### Model Not Responding
+
+1. Check if the model is sleeping: `curl "http://localhost:8080/sleep/status"`
+2. Wake it up (replace `model-name` with the model's name): `curl -X POST "http://localhost:8080/action/wakeup/model-name"`
+3. Check backend health: `curl "http://localhost:8080/health/model-name"`
+
+### Understanding kvcached vs Engine Memory Settings
+
+| Setting | Purpose | When to Use |
+|---------|---------|-------------|
+| `kvcached_gpu_utilization` | Max GPU fraction kvcached can use | Always with kvcached |
+| `--gpu-memory-utilization` | vLLM static reservation | **Never** with kvcached |
+| `--mem-fraction-static` | SGLang static reservation | **Never** with kvcached |
+
+kvcached allocates memory **on-demand** as requests arrive, unlike static allocation, which reserves memory upfront. This enables elastic sharing between multiple models.
diff --git a/controller/example-config.yaml b/controller/example-config.yaml
@@ -36,7 +36,8 @@ instances: # instances configuration
       - "--no-enable-prefix-caching"
       - "--host=localhost"
       - "--port=12346"
-      - "--gpu-memory-utilization 0.5"
+      # NOTE: Do NOT use --gpu-memory-utilization with kvcached.
+      # kvcached manages memory dynamically via kvcached_gpu_utilization above.
       - "--enable-sleep-mode"
   - name: instance2
     model: Qwen/Qwen3-0.6B
@@ -49,7 +50,8 @@ instances: # instances configuration
     engine_args:
       - "--disable-radix-cache"
       - "--trust-remote-code"
-      - "--mem-fraction-static 0.5"
+      # NOTE: Do NOT use --mem-fraction-static with kvcached.
+      # kvcached manages memory dynamically via kvcached_gpu_utilization above.
       - "--host=localhost"
       - "--port=30000"
       - "--enable-memory-saver"