xiaguan · odysa · Jun 7, 2026 · Jun 8, 2026
diff --git a/docs/index.md b/docs/index.md
@@ -25,12 +25,10 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
 | Path | TL;DR |
 | --- | --- |
 | `models/qwen3/roadmap.md` | Qwen3-4B roadmap (2026-06 review): line is the maturity bar; open set is #220 RoPE OOB, per-row batch sampling, zero TP coverage, zero-adapter-only LoRA gate, dropped prefix-cache observability, stale docs. Sequenced Now/Next/Later + cleanup ledger. |
-| `models/qwen3/model-crate.md` | `pegainfer-qwen3-4b` owns Qwen3 config/weights/executor/scheduler/tests/kernel plan; root sees generic `EngineHandle`; split-K retuned to `256/64`, with 4k/64 serving TPOT p50 at `6.46ms` on RTX 5090. |
+| `models/qwen3/crate-layout.md` | `pegainfer-qwen3-4b` owns Qwen3 config/weights/executor/scheduler/tests/kernel plan; kernel surface and build live in `pegainfer-kernels`; server sees only `start_engine() -> EngineHandle`. Records the split-K decode gate (`256/64`, `padded_bs<=2 && seq>=1024`) and CUPTI/bench gotchas. |
 | `models/qwen3/prefix-cache.md` | Prefix caching on by default for Qwen3-4B: full-block kvbm radix matching at the executor, suffix-only prefill. Repeated ~1900-token prompt TTFT 141.8 → 16.3ms p50 (8.7×); warm TTFT ≈ TPOT + ~5ms setup. Includes the RoPE scalar-path corruption fix and the drain-the-stream TTFT measurement pitfall. |
 | `models/qwen3/accuracy-gate.md` | Qwen3-4B instance of the logits golden gate (`tests/hf_golden_gate.rs`): 48 teacher-forced sequences / 816 positions vs a stored HF bf16 golden, replayed over bs=1 / batched eager / CUDA-graph. Strict guards: regret check + mean ≤ 0.06 + p99 ≤ 0.20; absolute max printed but not asserted (coverage-unstable). Methodology in `subsystems/correctness/`. |
-| `models/qwen3/kernels-crate.md` | Phase 1 split implemented and 5090-verified: Qwen3-4B kernel surface lives in `pegainfer-kernels`; release build, test-target compile, accuracy gate, and bench snapshot pass. |
-| `models/qwen3/tp-design.md` | Qwen3 tensor-parallel design: `TP=2` milestone scope plus the controller/worker broadcast execution model, request identity, and coarse-grained step protocol for future TP/MoE work. |
-| `models/qwen3/kv-pressure-hang.md` | Issue #85 Qwen3-4B KV pressure hang fixed by full-lifetime scheduler KV admission, waiting-queue deferral, cleanup on disconnect/error, impossible-request errors, scheduler/bridge gates, and real `vllm bench serve` QPS=2 `500/500` pass with post-pressure completion healthy. |
+| `models/qwen3/tp-design.md` | Qwen3 TP runtime as implemented: controller/worker broadcast execution (`RankWorker` per rank, coarse `StepCommand`, barrier per step), plan/resolve/effects scheduler boundaries, TP=2 partitioning spec. Open: TP correctness coverage, vocab-parallel embedding/lm_head, TP CUDA-graph. |
 
 ## models / qwen35
 
@@ -134,6 +132,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
 | `lessons/moe-dplb-decode-imbalance.md` | DPLB lesson for future PegaFlow/WiDeep MoE+EP serving: decode-side DP imbalance is a sticky KV-state problem; engines should emit raw progress while external router/proxy derive load and routing. |
 | `lessons/moe-zero-prefill-long-prefill.md` | ZeRO-Prefill lesson for future long-prefill MoE serving: once a router selects long-P work, maximize batch throughput by preserving compute-bound execution, hiding expert-weight movement, respecting KV handoff boundaries, and measuring bottlenecks before committing to an AsyncEP-style backend. |
 | `lessons/exact-match-gate-thread-cublas.md` | Two durable lessons from a Qwen3.5 e2e gibberish bug: worker threads that run a model must rebind the CUDA context and init thread-local cuBLAS handles, and exact-match greedy gates are sensitive to equal-logit top1 choices (keep a single FlashInfer selector). |
+| `lessons/kv-full-lifetime-reservation.md` | Schedulers without preemption must admit on full-lifetime KV budget (not prefill footprint), reject never-fits requests explicitly, release KV on every exit path, and prove recovery with a post-pressure request. From qwen3 issue #85; reused for kimi-k2 #239. |
 | `lessons/kimi-bringup-numerics.md` | Three MoE+TP greedy-parity / reporting lessons from Kimi-K2 bring-up, reusable on any MoE+TP decode engine gated on token-id parity: reduce hidden states in F32 not BF16 (BF16 bulk all-reduce silently breaks greedy); don't merge shared+routed expert reduce into one collective (breaks cold-batch greedy); always report p50+p99, never just mean (tail dominates on barrier-synced MoE+EP decode). |
 
 ## benchmarks

diff --git a/docs/lessons/kv-full-lifetime-reservation.md b/docs/lessons/kv-full-lifetime-reservation.md
@@ -0,0 +1,15 @@
+# KV Admission Is a Full-Lifetime Reservation (Until Preemption Exists)
+
+**TL;DR**: A scheduler without preemption must admit requests on their full-lifetime KV budget, not their prefill footprint — and must explicitly reject requests that can never fit, release KV on every exit path, and prove recovery with a post-pressure request. Learned from the Qwen3-4B issue #85 pressure hang (fixed in PR #131); the same admission rule was reused for Kimi-K2 paged KV (#239).
+
+## The failure mode
+
+Admitting on prefill-only capacity lets many active requests grow into new KV pages together until a decode step cannot allocate. The server then enters a half-alive state: `/v1/models` answers, completions hang forever. The observed symptom (issue #85: `vllm bench serve` QPS=2 over Qwen3-4B) looked like a deadlock but was an admission-accounting bug plus leaked request state.
+
+## The rules
+
+1. **Reserve the full lifetime at admission.** Active requests reserve the remaining pages they may need until `max_tokens`. A pending request is admitted only if its prompt plus maximum generated-token KV footprint fits *after* those reservations. This is deliberately conservative — it defers earlier than a preemption-capable scheduler would — but it makes decode-time allocation failure impossible without implementing preemption.
+2. **Defer the temporarily-over-budget; reject the impossible.** A request that fits the model instance but not the current free pool stays in the waiting queue. A request larger than the instance's total usable KV capacity must be rejected explicitly (as a request *error*, not an empty success) — otherwise it waits forever and blocks the queue head.
+3. **Count tokens actually written to KV, not tokens returned to the client.** In a prefill/decode split, the sampled token does not occupy KV until it is fed back as the next decode input, so a request returning `N` completion tokens occupies at most `prompt_len + N - 1` KV tokens. Review bots will confidently tell you the formula is `prompt_len + max_tokens`; check what the kernels write before "fixing" it.
+4. **Every exit path must release request state.** KV pages are RAII-returned only when request state drops, so client disconnect, execution error, and send-failure paths all need to route through the owner `drop_request` — finishing normally is the only path that happens for free.
+5. **Pressure-test evidence needs a post-pressure probe.** Because the failure mode keeps the health endpoints alive, "the benchmark completed" is not enough: the gate is pressure-client success *plus* a fresh completion returning afterwards.
diff --git a/docs/models/deepseek-v4/kernel-paths.md b/docs/models/deepseek-v4/kernel-paths.md
@@ -9,15 +9,15 @@
   - `docs/index.md` - showed DeepSeek V4 support, kernel boundary, and Qwen3 kernel extraction as the relevant prior work.
   - `docs/models/deepseek-v4/support.md` - confirmed DeepSeek V4 currently has native MP8 runtime, TileLang build-time kernels, exact E2E coverage, and a documented CUDA split by subsystem.
   - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - confirmed kernels belong in the shared kernels crate, while model DAG/runtime policy stays in the model crate.
-  - `docs/models/qwen3/kernels-crate.md` - established the existing crate-first split and the role of `pegainfer-kernels/KERNELS.md`.
+  - `docs/models/qwen3/crate-layout.md` (at the time `kernels-crate.md`) - records kernel ownership in `pegainfer-kernels` and the role of `pegainfer-kernels/KERNELS.md`.
   - `docs/conventions/coding-style.md` - reminded that GPU kernels deserve targeted tests, while broad behavior is better covered by integration/E2E.
   - `pegainfer-kernels/build.rs` - showed DeepSeek kernels are feature-gated by filename prefix in a flat `csrc/` scan, and TileLang generation was hard-coded to the old flat `tools/tilelang/gen_deepseek_v4_tilelang.py` path.
   - `pegainfer-kernels/KERNELS.md` - currently indexes Qwen3 and only mentions DeepSeek as compatibility symbols, so DSV4 has no routing table.
   - `pegainfer-kernels/csrc/deepseek_*.cu` and `pegainfer-kernels/csrc/deepseek_common.cuh` - confirmed the CUDA side is already split by subsystem but still lives in the root kernel source directory.
   - `pegainfer-deepseek-v4/src/runtime/*` - confirmed runtime calls reach DeepSeek symbols through `pegainfer_kernels::ffi`, so path cleanup should not require runtime API changes.
 - **Relevant history**:
   - `docs/models/deepseek-v4/support.md` records that the current DeepSeek CUDA glue is intentionally split by subsystem; this cleanup should preserve that split instead of merging files.
-  - `docs/models/qwen3/kernels-crate.md` moved kernel ownership into `pegainfer-kernels`; the same pattern supports moving model-specific source into a clearer subdirectory without changing model runtime ownership.
+  - The Qwen3 kernels-crate split (recorded today in `docs/models/qwen3/crate-layout.md`) moved kernel ownership into `pegainfer-kernels`; the same pattern supports moving model-specific source into a clearer subdirectory without changing model runtime ownership.
 - **Plan**:
   1. First slice: move DeepSeek V4 CUDA sources from `pegainfer-kernels/csrc/deepseek_*.cu` and `deepseek_common.cuh` into `pegainfer-kernels/csrc/deepseek_v4/`, then update `pegainfer-kernels/build.rs` to discover CUDA files recursively and feature-gate DeepSeek by path instead of flat filename prefix.
   2. Keep object file names stable or explicitly namespace them so `ar` input names remain collision-free when sources live in subdirectories.

diff --git a/docs/models/deepseek-v4/pplx-ep-integration.md b/docs/models/deepseek-v4/pplx-ep-integration.md
@@ -41,7 +41,7 @@
 - `SendBuf / RecvBuf`：裸 device pointer + elem_count + elem_size + 可选 scale pointer；调用方持有底层 allocation 的所有权。
 - `RdmaBackend`（`src/backend/rdma.rs`）：私有类型，四个 trait 方法全是 `todo!()`，构造函数当前只存了 `EpTopology`，没拿 `AllToAllContext`。
 
-### pplx wrapper（`crates/pegainfer-comm-p2p-all-to-all/`）
+### pplx wrapper（`pegainfer-comm/crates/pegainfer-comm-p2p-all-to-all/`）
 
 - `AllToAllContext::new(...)`：21 个参数，需要外部传入 `TransferEngine`、`rank_handles`、预注册的 send/recv buffer + MR、host pointer arrays（sync/send/recv），构造时启动一个 `"p2p_all_to_all Worker"` 后台线程，固定 CPU 亲和性。
 - 调用形态是 **四步**（不是 trait 现在写的两步）：

diff --git a/docs/models/qwen3/crate-layout.md b/docs/models/qwen3/crate-layout.md
@@ -0,0 +1,49 @@
+# Qwen3-4B Crate Layout
+
+> **TL;DR:** `pegainfer-qwen3-4b` owns Qwen3 config, weights, executor, scheduler, LoRA, tests, and the kernel routing plan; the kernel surface (CUDA/Triton/FlashInfer source, build, FFI, reusable ops) lives in `pegainfer-kernels`; the server sees only `start_engine() -> EngineHandle`. Replaces the 2026-05 `model-crate.md`/`kernels-crate.md` extraction records — bring-up history lives in git, this doc describes what exists.
+>
+> **Last touched:** 2026-06
+
+## Crate boundary
+
+Dependency direction: `pegainfer-qwen3-4b` → `pegainfer-core` + `pegainfer-kernels` + `pegainfer-kv-cache`. The server (`pegainfer-server`) depends on the model crate only at registry/startup glue; it never sees `Qwen3Model`, KV state, TP rank workers, or prefill/decode plans.
+
+| What | Where |
+| --- | --- |
+| Config / weights / RoPE cache | `pegainfer-qwen3-4b/src/{config,weights}.rs` |
+| Executor (single-GPU + TP rank workers, CUDA graphs, split-K gate) | `src/executor.rs`, `src/batch_decode*.rs`, `src/prefill.rs`, `src/unified_forward.rs` |
+| Scheduler (admission, plan → resolve → effects) | `src/scheduler.rs` + `src/scheduler/{plan,resolve,effects}.rs` |
+| LoRA load/unload/activation | `src/lora.rs` |
+| Kernel routing index (model DAG phase → reusable kernel) | `src/kernel_plan.rs` (typed Rust, not a hand-maintained manifest) |
+| Reusable kernel wrappers, FFI, tensor helpers | `pegainfer-kernels/src/` |
+| CUDA source, Triton AOT, FlashInfer submodule, nvcc build | `pegainfer-kernels/csrc/`, `tools/triton/`, `build.rs` |
+| Human/LLM kernel routing table | `pegainfer-kernels/KERNELS.md` |
+| Tests | `tests/{hf_golden_gate,prefix_cache,lora_smoke,scheduler_robustness}.rs` |
+| Report binaries (feature `kernel-report`) | `src/bin/{qwen3_kernel_report,qwen3_model_report}.rs`; `qwen3_decode_context` is the fixed-context decode probe |
+
+Build and test commands are in the repo-root `CLAUDE.md`; per-op report tooling is documented in `docs/subsystems/kernels/kernel-op-reports.md`.
+
+## Public surface
+
+- `start_engine(model_path, EngineLoadOptions) -> EngineHandle` — the only entry the server uses.
+- `start_engine_with_lora_control(...)` — same, plus LoRA load/unload control.
+- `pegainfer_qwen3_4b::runtime` — deliberate low-level escape hatch re-exporting `Qwen3Executor` and the prefill/decode/unified plan types. It is the production phase boundary used by the scheduler and model-local tools; the server must not use it.
+- `kernel_plan()` — model-owned index from DAG phases to reusable kernels.
+
+There is no Criterion bench target in this crate (`autobenches = false`). The old `qwen3_runtime`, `qwen3_attention`, and `qwen3_kernel_snapshot` benches were retired; kernel measurement goes through the `kernel-report` binaries instead.
+
+## Decode split-K gate (load-bearing perf facts)
+
+Low-batch long-context decode under-fills the GPU on the non-partition FlashInfer paged decode path (grid is `(batch, num_kv_heads)`, so `bs=1` launches 8 CTAs scanning the whole KV context — ~7% of peak DRAM bandwidth, CUPTI-verified). The runtime therefore gates FlashInfer split-K decode:
+
+- split-K when `padded_bs <= 2 && max_seq_len >= 1024`, otherwise non-partition;
+- tuned to `SPLIT_KV_CHUNK_TOKENS=256`, `SPLIT_KV_MAX_CHUNKS_PER_REQUEST=64` (cold-L2 CUPTI sweep, 2026-05, RTX 5090);
+- CUDA graph cache is keyed by `(batch_bucket, attention_path)` — a request can cross the split-K threshold mid-decode and needs a separate graph capture.
+
+Effect at the time of tuning: 4k/64 serving steady TPOT p50 `11.7ms → 6.46ms` on RTX 5090. The batch sweep is why the gate is conservative: at `kv_len=1024`, split-K only wins for `bs<=2`.
+
+## Gotchas worth keeping
+
+- **CUPTI Range Profiler crashes on verbose range names** (`NVPW_CUDA_Profiler_DecodeCounters` inside `libnvperf_host.so`). Use compact range names like `qk/non_partition/b1/k1024`, keep metadata in the JSON output. The first profiled launch also needs an unprofiled warmup launch or CUDA lazy init pollutes its GPU time.
+- **FlashInfer C++ objects need `stdc++` linked for test binaries** — owned by `pegainfer-kernels/build.rs`; symptom is link failures only in test targets.
+- **Single-layer synthetic kernel benches lie about DRAM** — the working set fits in L2 (RTX 5090: 96MiB), so event-timer "effective bandwidth" can exceed 100% of peak. Use CUPTI `dram__bytes_*` counters for utilization claims.