Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 3 additions & 4 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,12 +25,10 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| Path | TL;DR |
| --- | --- |
| `models/qwen3/roadmap.md` | Qwen3-4B roadmap (2026-06 review): line is the maturity bar; open set is #220 RoPE OOB, per-row batch sampling, zero TP coverage, zero-adapter-only LoRA gate, dropped prefix-cache observability, stale docs. Sequenced Now/Next/Later + cleanup ledger. |
| `models/qwen3/model-crate.md` | `pegainfer-qwen3-4b` owns Qwen3 config/weights/executor/scheduler/tests/kernel plan; root sees generic `EngineHandle`; split-K retuned to `256/64`, with 4k/64 serving TPOT p50 at `6.46ms` on RTX 5090. |
| `models/qwen3/crate-layout.md` | `pegainfer-qwen3-4b` owns Qwen3 config/weights/executor/scheduler/tests/kernel plan; kernel surface and build live in `pegainfer-kernels`; server sees only `start_engine() -> EngineHandle`. Records the split-K decode gate (`256/64`, `padded_bs<=2 && seq>=1024`) and CUPTI/bench gotchas. |
| `models/qwen3/prefix-cache.md` | Prefix caching on by default for Qwen3-4B: full-block kvbm radix matching at the executor, suffix-only prefill. Repeated ~1900-token prompt TTFT 141.8 → 16.3ms p50 (8.7×); warm TTFT ≈ TPOT + ~5ms setup. Includes the RoPE scalar-path corruption fix and the drain-the-stream TTFT measurement pitfall. |
| `models/qwen3/accuracy-gate.md` | Qwen3-4B instance of the logits golden gate (`tests/hf_golden_gate.rs`): 48 teacher-forced sequences / 816 positions vs a stored HF bf16 golden, replayed over bs=1 / batched eager / CUDA-graph. Strict guards: regret check + mean ≤ 0.06 + p99 ≤ 0.20; absolute max printed but not asserted (coverage-unstable). Methodology in `subsystems/correctness/`. |
| `models/qwen3/kernels-crate.md` | Phase 1 split implemented and 5090-verified: Qwen3-4B kernel surface lives in `pegainfer-kernels`; release build, test-target compile, accuracy gate, and bench snapshot pass. |
| `models/qwen3/tp-design.md` | Qwen3 tensor-parallel design: `TP=2` milestone scope plus the controller/worker broadcast execution model, request identity, and coarse-grained step protocol for future TP/MoE work. |
| `models/qwen3/kv-pressure-hang.md` | Issue #85 Qwen3-4B KV pressure hang fixed by full-lifetime scheduler KV admission, waiting-queue deferral, cleanup on disconnect/error, impossible-request errors, scheduler/bridge gates, and real `vllm bench serve` QPS=2 `500/500` pass with post-pressure completion healthy. |
| `models/qwen3/tp-design.md` | Qwen3 TP runtime as implemented: controller/worker broadcast execution (`RankWorker` per rank, coarse `StepCommand`, barrier per step), plan/resolve/effects scheduler boundaries, TP=2 partitioning spec. Open: TP correctness coverage, vocab-parallel embedding/lm_head, TP CUDA-graph. |

## models / qwen35

Expand Down Expand Up @@ -134,6 +132,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| `lessons/moe-dplb-decode-imbalance.md` | DPLB lesson for future PegaFlow/WiDeep MoE+EP serving: decode-side DP imbalance is a sticky KV-state problem; engines should emit raw progress while external router/proxy derive load and routing. |
| `lessons/moe-zero-prefill-long-prefill.md` | ZeRO-Prefill lesson for future long-prefill MoE serving: once a router selects long-P work, maximize batch throughput by preserving compute-bound execution, hiding expert-weight movement, respecting KV handoff boundaries, and measuring bottlenecks before committing to an AsyncEP-style backend. |
| `lessons/exact-match-gate-thread-cublas.md` | Two durable lessons from a Qwen3.5 e2e gibberish bug: worker threads that run a model must rebind the CUDA context and init thread-local cuBLAS handles, and exact-match greedy gates are sensitive to equal-logit top1 choices (keep a single FlashInfer selector). |
| `lessons/kv-full-lifetime-reservation.md` | Schedulers without preemption must admit on full-lifetime KV budget (not prefill footprint), reject never-fits requests explicitly, release KV on every exit path, and prove recovery with a post-pressure request. From qwen3 issue #85; reused for kimi-k2 #239. |
| `lessons/kimi-bringup-numerics.md` | Three MoE+TP greedy-parity / reporting lessons from Kimi-K2 bring-up, reusable on any MoE+TP decode engine gated on token-id parity: reduce hidden states in F32 not BF16 (BF16 bulk all-reduce silently breaks greedy); don't merge shared+routed expert reduce into one collective (breaks cold-batch greedy); always report p50+p99, never just mean (tail dominates on barrier-synced MoE+EP decode). |

## benchmarks
Expand Down
15 changes: 15 additions & 0 deletions docs/lessons/kv-full-lifetime-reservation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
# KV Admission Is a Full-Lifetime Reservation (Until Preemption Exists)

**TL;DR**: A scheduler without preemption must admit requests on their full-lifetime KV budget, not their prefill footprint — and must explicitly reject requests that can never fit, release KV on every exit path, and prove recovery with a post-pressure request. Learned from the Qwen3-4B issue #85 pressure hang (fixed in PR #131); the same admission rule was reused for Kimi-K2 paged KV (#239).

## The failure mode

Admitting on prefill-only capacity lets many active requests grow into new KV pages together until a decode step cannot allocate. The server then enters a half-alive state: `/v1/models` answers, completions hang forever. The observed symptom (issue #85: `vllm bench serve` QPS=2 over Qwen3-4B) looked like a deadlock but was an admission-accounting bug plus leaked request state.

## The rules

1. **Reserve the full lifetime at admission.** Active requests reserve the remaining pages they may need until `max_tokens`. A pending request is admitted only if its prompt plus maximum generated-token KV footprint fits *after* those reservations. This is deliberately conservative — it defers earlier than a preemption-capable scheduler would — but it makes decode-time allocation failure impossible without implementing preemption.
2. **Defer the temporarily-over-budget; reject the impossible.** A request that fits the model instance but not the current free pool stays in the waiting queue. A request larger than the instance's total usable KV capacity must be rejected explicitly (as a request *error*, not an empty success) — otherwise it waits forever and blocks the queue head.
3. **Count tokens actually written to KV, not tokens returned to the client.** In a prefill/decode split, the sampled token does not occupy KV until it is fed back as the next decode input, so a request returning `N` completion tokens occupies at most `prompt_len + N - 1` KV tokens. Review bots will confidently tell you the formula is `prompt_len + max_tokens`; check what the kernels write before "fixing" it.
4. **Every exit path must release request state.** KV pages are RAII-returned only when request state drops, so client disconnect, execution error, and send-failure paths all need to route through the owner `drop_request` — finishing normally is the only path that happens for free.
5. **Pressure-test evidence needs a post-pressure probe.** Because the failure mode keeps the health endpoints alive, "the benchmark completed" is not enough: the gate is pressure-client success *plus* a fresh completion returning afterwards.
4 changes: 2 additions & 2 deletions docs/models/deepseek-v4/kernel-paths.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,15 +9,15 @@
- `docs/index.md` - showed DeepSeek V4 support, kernel boundary, and Qwen3 kernel extraction as the relevant prior work.
- `docs/models/deepseek-v4/support.md` - confirmed DeepSeek V4 currently has native MP8 runtime, TileLang build-time kernels, exact E2E coverage, and a documented CUDA split by subsystem.
- `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - confirmed kernels belong in the shared kernels crate, while model DAG/runtime policy stays in the model crate.
- `docs/models/qwen3/kernels-crate.md` - established the existing crate-first split and the role of `pegainfer-kernels/KERNELS.md`.
- `docs/models/qwen3/crate-layout.md` (at the time `kernels-crate.md`) - records kernel ownership in `pegainfer-kernels` and the role of `pegainfer-kernels/KERNELS.md`.
- `docs/conventions/coding-style.md` - reminded that GPU kernels deserve targeted tests, while broad behavior is better covered by integration/E2E.
- `pegainfer-kernels/build.rs` - showed DeepSeek kernels are feature-gated by filename prefix in a flat `csrc/` scan, and TileLang generation was hard-coded to the old flat `tools/tilelang/gen_deepseek_v4_tilelang.py` path.
- `pegainfer-kernels/KERNELS.md` - currently indexes Qwen3 and only mentions DeepSeek as compatibility symbols, so DSV4 has no routing table.
- `pegainfer-kernels/csrc/deepseek_*.cu` and `pegainfer-kernels/csrc/deepseek_common.cuh` - confirmed the CUDA side is already split by subsystem but still lives in the root kernel source directory.
- `pegainfer-deepseek-v4/src/runtime/*` - confirmed runtime calls reach DeepSeek symbols through `pegainfer_kernels::ffi`, so path cleanup should not require runtime API changes.
- **Relevant history**:
- `docs/models/deepseek-v4/support.md` records that the current DeepSeek CUDA glue is intentionally split by subsystem; this cleanup should preserve that split instead of merging files.
- `docs/models/qwen3/kernels-crate.md` moved kernel ownership into `pegainfer-kernels`; the same pattern supports moving model-specific source into a clearer subdirectory without changing model runtime ownership.
- The Qwen3 kernels-crate split (recorded today in `docs/models/qwen3/crate-layout.md`) moved kernel ownership into `pegainfer-kernels`; the same pattern supports moving model-specific source into a clearer subdirectory without changing model runtime ownership.
- **Plan**:
1. First slice: move DeepSeek V4 CUDA sources from `pegainfer-kernels/csrc/deepseek_*.cu` and `deepseek_common.cuh` into `pegainfer-kernels/csrc/deepseek_v4/`, then update `pegainfer-kernels/build.rs` to discover CUDA files recursively and feature-gate DeepSeek by path instead of flat filename prefix.
2. Keep object file names stable or explicitly namespace them so `ar` input names remain collision-free when sources live in subdirectories.
Expand Down
2 changes: 1 addition & 1 deletion docs/models/deepseek-v4/pplx-ep-integration.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@
- `SendBuf / RecvBuf`:裸 device pointer + elem_count + elem_size + 可选 scale pointer;调用方持有底层 allocation 的所有权。
- `RdmaBackend`(`src/backend/rdma.rs`):私有类型,四个 trait 方法全是 `todo!()`,构造函数当前只存了 `EpTopology`,没拿 `AllToAllContext`。

### pplx wrapper(`crates/pegainfer-comm-p2p-all-to-all/`)
### pplx wrapper(`pegainfer-comm/crates/pegainfer-comm-p2p-all-to-all/`)

- `AllToAllContext::new(...)`:21 个参数,需要外部传入 `TransferEngine`、`rank_handles`、预注册的 send/recv buffer + MR、host pointer arrays(sync/send/recv),构造时启动一个 `"p2p_all_to_all Worker"` 后台线程,固定 CPU 亲和性。
- 调用形态是 **四步**(不是 trait 现在写的两步):
Expand Down
49 changes: 49 additions & 0 deletions docs/models/qwen3/crate-layout.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# Qwen3-4B Crate Layout

> **TL;DR:** `pegainfer-qwen3-4b` owns Qwen3 config, weights, executor, scheduler, LoRA, tests, and the kernel routing plan; the kernel surface (CUDA/Triton/FlashInfer source, build, FFI, reusable ops) lives in `pegainfer-kernels`; the server sees only `start_engine() -> EngineHandle`. Replaces the 2026-05 `model-crate.md`/`kernels-crate.md` extraction records — bring-up history lives in git, this doc describes what exists.
>
> **Last touched:** 2026-06

## Crate boundary

Dependency direction: `pegainfer-qwen3-4b` → `pegainfer-core` + `pegainfer-kernels` + `pegainfer-kv-cache`. The server (`pegainfer-server`) depends on the model crate only at registry/startup glue; it never sees `Qwen3Model`, KV state, TP rank workers, or prefill/decode plans.

| What | Where |
| --- | --- |
| Config / weights / RoPE cache | `pegainfer-qwen3-4b/src/{config,weights}.rs` |
| Executor (single-GPU + TP rank workers, CUDA graphs, split-K gate) | `src/executor.rs`, `src/batch_decode*.rs`, `src/prefill.rs`, `src/unified_forward.rs` |
| Scheduler (admission, plan → resolve → effects) | `src/scheduler.rs` + `src/scheduler/{plan,resolve,effects}.rs` |
| LoRA load/unload/activation | `src/lora.rs` |
| Kernel routing index (model DAG phase → reusable kernel) | `src/kernel_plan.rs` (typed Rust, not a hand-maintained manifest) |
| Reusable kernel wrappers, FFI, tensor helpers | `pegainfer-kernels/src/` |
| CUDA source, Triton AOT, FlashInfer submodule, nvcc build | `pegainfer-kernels/csrc/`, `tools/triton/`, `build.rs` |
| Human/LLM kernel routing table | `pegainfer-kernels/KERNELS.md` |
| Tests | `tests/{hf_golden_gate,prefix_cache,lora_smoke,scheduler_robustness}.rs` |
| Report binaries (feature `kernel-report`) | `src/bin/{qwen3_kernel_report,qwen3_model_report}.rs`; `qwen3_decode_context` is the fixed-context decode probe |

Build and test commands are in the repo-root `CLAUDE.md`; per-op report tooling is documented in `docs/subsystems/kernels/kernel-op-reports.md`.

## Public surface

- `start_engine(model_path, EngineLoadOptions) -> EngineHandle` — the only entry the server uses.
- `start_engine_with_lora_control(...)` — same, plus LoRA load/unload control.
- `pegainfer_qwen3_4b::runtime` — deliberate low-level escape hatch re-exporting `Qwen3Executor` and the prefill/decode/unified plan types. It is the production phase boundary used by the scheduler and model-local tools; the server must not use it.
- `kernel_plan()` — model-owned index from DAG phases to reusable kernels.

There is no Criterion bench target in this crate (`autobenches = false`). The old `qwen3_runtime`, `qwen3_attention`, and `qwen3_kernel_snapshot` benches were retired; kernel measurement goes through the `kernel-report` binaries instead.

## Decode split-K gate (load-bearing perf facts)

Low-batch long-context decode under-fills the GPU on the non-partition FlashInfer paged decode path (grid is `(batch, num_kv_heads)`, so `bs=1` launches 8 CTAs scanning the whole KV context — ~7% of peak DRAM bandwidth, CUPTI-verified). The runtime therefore gates FlashInfer split-K decode:

- split-K when `padded_bs <= 2 && max_seq_len >= 1024`, otherwise non-partition;
- tuned to `SPLIT_KV_CHUNK_TOKENS=256`, `SPLIT_KV_MAX_CHUNKS_PER_REQUEST=64` (cold-L2 CUPTI sweep, 2026-05, RTX 5090);
- CUDA graph cache is keyed by `(batch_bucket, attention_path)` — a request can cross the split-K threshold mid-decode and needs a separate graph capture.

Effect at the time of tuning: 4k/64 serving steady TPOT p50 `11.7ms → 6.46ms` on RTX 5090. The batch sweep is why the gate is conservative: at `kv_len=1024`, split-K only wins for `bs<=2`.

## Gotchas worth keeping

- **CUPTI Range Profiler crashes on verbose range names** (`NVPW_CUDA_Profiler_DecodeCounters` inside `libnvperf_host.so`). Use compact range names like `qk/non_partition/b1/k1024`, keep metadata in the JSON output. The first profiled launch also needs an unprofiled warmup launch or CUDA lazy init pollutes its GPU time.
- **FlashInfer C++ objects need `stdc++` linked for test binaries** — owned by `pegainfer-kernels/build.rs`; symptom is link failures only in test targets.
- **Single-layer synthetic kernel benches lie about DRAM** — the working set fits in L2 (RTX 5090: 96MiB), so event-timer "effective bandwidth" can exceed 100% of peak. Use CUPTI `dram__bytes_*` counters for utilization claims.
Loading