diff --git a/docs/index.md b/docs/index.md index 165274a7..b34be50d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -25,12 +25,10 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l | Path | TL;DR | | --- | --- | | `models/qwen3/roadmap.md` | Qwen3-4B roadmap (2026-06 review): line is the maturity bar; open set is #220 RoPE OOB, per-row batch sampling, zero TP coverage, zero-adapter-only LoRA gate, dropped prefix-cache observability, stale docs. Sequenced Now/Next/Later + cleanup ledger. | -| `models/qwen3/model-crate.md` | `pegainfer-qwen3-4b` owns Qwen3 config/weights/executor/scheduler/tests/kernel plan; root sees generic `EngineHandle`; split-K retuned to `256/64`, with 4k/64 serving TPOT p50 at `6.46ms` on RTX 5090. | +| `models/qwen3/crate-layout.md` | `pegainfer-qwen3-4b` owns Qwen3 config/weights/executor/scheduler/tests/kernel plan; kernel surface and build live in `pegainfer-kernels`; server sees only `start_engine() -> EngineHandle`. Records the split-K decode gate (`256/64`, `padded_bs<=2 && seq>=1024`) and CUPTI/bench gotchas. | | `models/qwen3/prefix-cache.md` | Prefix caching on by default for Qwen3-4B: full-block kvbm radix matching at the executor, suffix-only prefill. Repeated ~1900-token prompt TTFT 141.8 → 16.3ms p50 (8.7×); warm TTFT ≈ TPOT + ~5ms setup. Includes the RoPE scalar-path corruption fix and the drain-the-stream TTFT measurement pitfall. | | `models/qwen3/accuracy-gate.md` | Qwen3-4B instance of the logits golden gate (`tests/hf_golden_gate.rs`): 48 teacher-forced sequences / 816 positions vs a stored HF bf16 golden, replayed over bs=1 / batched eager / CUDA-graph. Strict guards: regret check + mean ≤ 0.06 + p99 ≤ 0.20; absolute max printed but not asserted (coverage-unstable). Methodology in `subsystems/correctness/`. | -| `models/qwen3/kernels-crate.md` | Phase 1 split implemented and 5090-verified: Qwen3-4B kernel surface lives in `pegainfer-kernels`; release build, test-target compile, accuracy gate, and bench snapshot pass. | -| `models/qwen3/tp-design.md` | Qwen3 tensor-parallel design: `TP=2` milestone scope plus the controller/worker broadcast execution model, request identity, and coarse-grained step protocol for future TP/MoE work. | -| `models/qwen3/kv-pressure-hang.md` | Issue #85 Qwen3-4B KV pressure hang fixed by full-lifetime scheduler KV admission, waiting-queue deferral, cleanup on disconnect/error, impossible-request errors, scheduler/bridge gates, and real `vllm bench serve` QPS=2 `500/500` pass with post-pressure completion healthy. | +| `models/qwen3/tp-design.md` | Qwen3 TP runtime as implemented: controller/worker broadcast execution (`RankWorker` per rank, coarse `StepCommand`, barrier per step), plan/resolve/effects scheduler boundaries, TP=2 partitioning spec. Open: TP correctness coverage, vocab-parallel embedding/lm_head, TP CUDA-graph. | ## models / qwen35 @@ -134,6 +132,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l | `lessons/moe-dplb-decode-imbalance.md` | DPLB lesson for future PegaFlow/WiDeep MoE+EP serving: decode-side DP imbalance is a sticky KV-state problem; engines should emit raw progress while external router/proxy derive load and routing. | | `lessons/moe-zero-prefill-long-prefill.md` | ZeRO-Prefill lesson for future long-prefill MoE serving: once a router selects long-P work, maximize batch throughput by preserving compute-bound execution, hiding expert-weight movement, respecting KV handoff boundaries, and measuring bottlenecks before committing to an AsyncEP-style backend. | | `lessons/exact-match-gate-thread-cublas.md` | Two durable lessons from a Qwen3.5 e2e gibberish bug: worker threads that run a model must rebind the CUDA context and init thread-local cuBLAS handles, and exact-match greedy gates are sensitive to equal-logit top1 choices (keep a single FlashInfer selector). | +| `lessons/kv-full-lifetime-reservation.md` | Schedulers without preemption must admit on full-lifetime KV budget (not prefill footprint), reject never-fits requests explicitly, release KV on every exit path, and prove recovery with a post-pressure request. From qwen3 issue #85; reused for kimi-k2 #239. | | `lessons/kimi-bringup-numerics.md` | Three MoE+TP greedy-parity / reporting lessons from Kimi-K2 bring-up, reusable on any MoE+TP decode engine gated on token-id parity: reduce hidden states in F32 not BF16 (BF16 bulk all-reduce silently breaks greedy); don't merge shared+routed expert reduce into one collective (breaks cold-batch greedy); always report p50+p99, never just mean (tail dominates on barrier-synced MoE+EP decode). | ## benchmarks diff --git a/docs/lessons/kv-full-lifetime-reservation.md b/docs/lessons/kv-full-lifetime-reservation.md new file mode 100644 index 00000000..ae31bc87 --- /dev/null +++ b/docs/lessons/kv-full-lifetime-reservation.md @@ -0,0 +1,15 @@ +# KV Admission Is a Full-Lifetime Reservation (Until Preemption Exists) + +**TL;DR**: A scheduler without preemption must admit requests on their full-lifetime KV budget, not their prefill footprint — and must explicitly reject requests that can never fit, release KV on every exit path, and prove recovery with a post-pressure request. Learned from the Qwen3-4B issue #85 pressure hang (fixed in PR #131); the same admission rule was reused for Kimi-K2 paged KV (#239). + +## The failure mode + +Admitting on prefill-only capacity lets many active requests grow into new KV pages together until a decode step cannot allocate. The server then enters a half-alive state: `/v1/models` answers, completions hang forever. The observed symptom (issue #85: `vllm bench serve` QPS=2 over Qwen3-4B) looked like a deadlock but was an admission-accounting bug plus leaked request state. + +## The rules + +1. **Reserve the full lifetime at admission.** Active requests reserve the remaining pages they may need until `max_tokens`. A pending request is admitted only if its prompt plus maximum generated-token KV footprint fits *after* those reservations. This is deliberately conservative — it defers earlier than a preemption-capable scheduler would — but it makes decode-time allocation failure impossible without implementing preemption. +2. **Defer the temporarily-over-budget; reject the impossible.** A request that fits the model instance but not the current free pool stays in the waiting queue. A request larger than the instance's total usable KV capacity must be rejected explicitly (as a request *error*, not an empty success) — otherwise it waits forever and blocks the queue head. +3. **Count tokens actually written to KV, not tokens returned to the client.** In a prefill/decode split, the sampled token does not occupy KV until it is fed back as the next decode input, so a request returning `N` completion tokens occupies at most `prompt_len + N - 1` KV tokens. Review bots will confidently tell you the formula is `prompt_len + max_tokens`; check what the kernels write before "fixing" it. +4. **Every exit path must release request state.** KV pages are RAII-returned only when request state drops, so client disconnect, execution error, and send-failure paths all need to route through the owner `drop_request` — finishing normally is the only path that happens for free. +5. **Pressure-test evidence needs a post-pressure probe.** Because the failure mode keeps the health endpoints alive, "the benchmark completed" is not enough: the gate is pressure-client success *plus* a fresh completion returning afterwards. diff --git a/docs/models/deepseek-v4/kernel-paths.md b/docs/models/deepseek-v4/kernel-paths.md index e581719c..da66ba1e 100644 --- a/docs/models/deepseek-v4/kernel-paths.md +++ b/docs/models/deepseek-v4/kernel-paths.md @@ -9,7 +9,7 @@ - `docs/index.md` - showed DeepSeek V4 support, kernel boundary, and Qwen3 kernel extraction as the relevant prior work. - `docs/models/deepseek-v4/support.md` - confirmed DeepSeek V4 currently has native MP8 runtime, TileLang build-time kernels, exact E2E coverage, and a documented CUDA split by subsystem. - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - confirmed kernels belong in the shared kernels crate, while model DAG/runtime policy stays in the model crate. - - `docs/models/qwen3/kernels-crate.md` - established the existing crate-first split and the role of `pegainfer-kernels/KERNELS.md`. + - `docs/models/qwen3/crate-layout.md` (at the time `kernels-crate.md`) - records kernel ownership in `pegainfer-kernels` and the role of `pegainfer-kernels/KERNELS.md`. - `docs/conventions/coding-style.md` - reminded that GPU kernels deserve targeted tests, while broad behavior is better covered by integration/E2E. - `pegainfer-kernels/build.rs` - showed DeepSeek kernels are feature-gated by filename prefix in a flat `csrc/` scan, and TileLang generation was hard-coded to the old flat `tools/tilelang/gen_deepseek_v4_tilelang.py` path. - `pegainfer-kernels/KERNELS.md` - currently indexes Qwen3 and only mentions DeepSeek as compatibility symbols, so DSV4 has no routing table. @@ -17,7 +17,7 @@ - `pegainfer-deepseek-v4/src/runtime/*` - confirmed runtime calls reach DeepSeek symbols through `pegainfer_kernels::ffi`, so path cleanup should not require runtime API changes. - **Relevant history**: - `docs/models/deepseek-v4/support.md` records that the current DeepSeek CUDA glue is intentionally split by subsystem; this cleanup should preserve that split instead of merging files. - - `docs/models/qwen3/kernels-crate.md` moved kernel ownership into `pegainfer-kernels`; the same pattern supports moving model-specific source into a clearer subdirectory without changing model runtime ownership. + - The Qwen3 kernels-crate split (recorded today in `docs/models/qwen3/crate-layout.md`) moved kernel ownership into `pegainfer-kernels`; the same pattern supports moving model-specific source into a clearer subdirectory without changing model runtime ownership. - **Plan**: 1. First slice: move DeepSeek V4 CUDA sources from `pegainfer-kernels/csrc/deepseek_*.cu` and `deepseek_common.cuh` into `pegainfer-kernels/csrc/deepseek_v4/`, then update `pegainfer-kernels/build.rs` to discover CUDA files recursively and feature-gate DeepSeek by path instead of flat filename prefix. 2. Keep object file names stable or explicitly namespace them so `ar` input names remain collision-free when sources live in subdirectories. diff --git a/docs/models/deepseek-v4/pplx-ep-integration.md b/docs/models/deepseek-v4/pplx-ep-integration.md index 13b66e2f..e0ff82eb 100644 --- a/docs/models/deepseek-v4/pplx-ep-integration.md +++ b/docs/models/deepseek-v4/pplx-ep-integration.md @@ -41,7 +41,7 @@ - `SendBuf / RecvBuf`:裸 device pointer + elem_count + elem_size + 可选 scale pointer;调用方持有底层 allocation 的所有权。 - `RdmaBackend`(`src/backend/rdma.rs`):私有类型,四个 trait 方法全是 `todo!()`,构造函数当前只存了 `EpTopology`,没拿 `AllToAllContext`。 -### pplx wrapper(`crates/pegainfer-comm-p2p-all-to-all/`) +### pplx wrapper(`pegainfer-comm/crates/pegainfer-comm-p2p-all-to-all/`) - `AllToAllContext::new(...)`:21 个参数,需要外部传入 `TransferEngine`、`rank_handles`、预注册的 send/recv buffer + MR、host pointer arrays(sync/send/recv),构造时启动一个 `"p2p_all_to_all Worker"` 后台线程,固定 CPU 亲和性。 - 调用形态是 **四步**(不是 trait 现在写的两步): diff --git a/docs/models/qwen3/crate-layout.md b/docs/models/qwen3/crate-layout.md new file mode 100644 index 00000000..f45a396a --- /dev/null +++ b/docs/models/qwen3/crate-layout.md @@ -0,0 +1,49 @@ +# Qwen3-4B Crate Layout + +> **TL;DR:** `pegainfer-qwen3-4b` owns Qwen3 config, weights, executor, scheduler, LoRA, tests, and the kernel routing plan; the kernel surface (CUDA/Triton/FlashInfer source, build, FFI, reusable ops) lives in `pegainfer-kernels`; the server sees only `start_engine() -> EngineHandle`. Replaces the 2026-05 `model-crate.md`/`kernels-crate.md` extraction records — bring-up history lives in git, this doc describes what exists. +> +> **Last touched:** 2026-06 + +## Crate boundary + +Dependency direction: `pegainfer-qwen3-4b` → `pegainfer-core` + `pegainfer-kernels` + `pegainfer-kv-cache`. The server (`pegainfer-server`) depends on the model crate only at registry/startup glue; it never sees `Qwen3Model`, KV state, TP rank workers, or prefill/decode plans. + +| What | Where | +| --- | --- | +| Config / weights / RoPE cache | `pegainfer-qwen3-4b/src/{config,weights}.rs` | +| Executor (single-GPU + TP rank workers, CUDA graphs, split-K gate) | `src/executor.rs`, `src/batch_decode*.rs`, `src/prefill.rs`, `src/unified_forward.rs` | +| Scheduler (admission, plan → resolve → effects) | `src/scheduler.rs` + `src/scheduler/{plan,resolve,effects}.rs` | +| LoRA load/unload/activation | `src/lora.rs` | +| Kernel routing index (model DAG phase → reusable kernel) | `src/kernel_plan.rs` (typed Rust, not a hand-maintained manifest) | +| Reusable kernel wrappers, FFI, tensor helpers | `pegainfer-kernels/src/` | +| CUDA source, Triton AOT, FlashInfer submodule, nvcc build | `pegainfer-kernels/csrc/`, `tools/triton/`, `build.rs` | +| Human/LLM kernel routing table | `pegainfer-kernels/KERNELS.md` | +| Tests | `tests/{hf_golden_gate,prefix_cache,lora_smoke,scheduler_robustness}.rs` | +| Report binaries (feature `kernel-report`) | `src/bin/{qwen3_kernel_report,qwen3_model_report}.rs`; `qwen3_decode_context` is the fixed-context decode probe | + +Build and test commands are in the repo-root `CLAUDE.md`; per-op report tooling is documented in `docs/subsystems/kernels/kernel-op-reports.md`. + +## Public surface + +- `start_engine(model_path, EngineLoadOptions) -> EngineHandle` — the only entry the server uses. +- `start_engine_with_lora_control(...)` — same, plus LoRA load/unload control. +- `pegainfer_qwen3_4b::runtime` — deliberate low-level escape hatch re-exporting `Qwen3Executor` and the prefill/decode/unified plan types. It is the production phase boundary used by the scheduler and model-local tools; the server must not use it. +- `kernel_plan()` — model-owned index from DAG phases to reusable kernels. + +There is no Criterion bench target in this crate (`autobenches = false`). The old `qwen3_runtime`, `qwen3_attention`, and `qwen3_kernel_snapshot` benches were retired; kernel measurement goes through the `kernel-report` binaries instead. + +## Decode split-K gate (load-bearing perf facts) + +Low-batch long-context decode under-fills the GPU on the non-partition FlashInfer paged decode path (grid is `(batch, num_kv_heads)`, so `bs=1` launches 8 CTAs scanning the whole KV context — ~7% of peak DRAM bandwidth, CUPTI-verified). The runtime therefore gates FlashInfer split-K decode: + +- split-K when `padded_bs <= 2 && max_seq_len >= 1024`, otherwise non-partition; +- tuned to `SPLIT_KV_CHUNK_TOKENS=256`, `SPLIT_KV_MAX_CHUNKS_PER_REQUEST=64` (cold-L2 CUPTI sweep, 2026-05, RTX 5090); +- CUDA graph cache is keyed by `(batch_bucket, attention_path)` — a request can cross the split-K threshold mid-decode and needs a separate graph capture. + +Effect at the time of tuning: 4k/64 serving steady TPOT p50 `11.7ms → 6.46ms` on RTX 5090. The batch sweep is why the gate is conservative: at `kv_len=1024`, split-K only wins for `bs<=2`. + +## Gotchas worth keeping + +- **CUPTI Range Profiler crashes on verbose range names** (`NVPW_CUDA_Profiler_DecodeCounters` inside `libnvperf_host.so`). Use compact range names like `qk/non_partition/b1/k1024`, keep metadata in the JSON output. The first profiled launch also needs an unprofiled warmup launch or CUDA lazy init pollutes its GPU time. +- **FlashInfer C++ objects need `stdc++` linked for test binaries** — owned by `pegainfer-kernels/build.rs`; symptom is link failures only in test targets. +- **Single-layer synthetic kernel benches lie about DRAM** — the working set fits in L2 (RTX 5090: 96MiB), so event-timer "effective bandwidth" can exceed 100% of peak. Use CUPTI `dram__bytes_*` counters for utilization claims. diff --git a/docs/models/qwen3/kernels-crate.md b/docs/models/qwen3/kernels-crate.md deleted file mode 100644 index f6af61fb..00000000 --- a/docs/models/qwen3/kernels-crate.md +++ /dev/null @@ -1,111 +0,0 @@ -# Qwen3 Kernels Crate Extraction - -**Created**: 2026-05-03 -**Status**: complete -**TL;DR**: Phase 1 now extracts the Qwen3-4B dense full-attention kernel surface into `crates/pegainfer-kernels`, with a compact kernel index so future LLM sessions can jump from model DAG nodes to Rust wrappers, FFI symbols, CUDA/Triton sources, and shape constraints. `KvPool`, `PagePool`, and `SamplingParams` stay in the root runtime. Local metadata/format checks pass; GPU release build, release test-target compilation, release clippy, Qwen3-4B e2e, and `bench_serving snapshot` pass. - -## Preparation - -- **Read**: - - `docs/index.md` - confirmed the relevant architecture, kernel, TP, benchmarking, and Qwen3 history docs. - - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - recorded the per-model engine direction, but its near-term ordering needs to be corrected from ledger-first to crate-first. - - `docs/models/qwen3/tp-design.md` - confirmed Qwen3-4B TP constraints and runtime hazards around per-thread CUDA/cuBLAS state. - - `src/model/qwen3/*`, `src/ops/*`, `src/ffi.rs`, `src/tensor.rs`, `src/kv_pool.rs`, `src/page_pool.rs`, and `build.rs` - mapped the current Qwen3-4B kernel calls, tensor/runtime dependencies, paged KV metadata, and CUDA/Triton build pipeline. -- **Relevant history**: - - `docs/models/qwen3/tp-design.md` shows that Qwen3 execution is already rank-local and step-oriented, so the kernel crate must not hide device binding or TP collective points. -- **Plan**: - 1. Convert the repository into a Cargo workspace while keeping the root `pegainfer` package as the server/control-plane crate. - 2. Create `crates/pegainfer-kernels` with the Qwen3-4B kernel surface: kernel ABI tensor helpers, Qwen3-used `ops`, FFI declarations, CUDA/Triton build support, and Qwen3 paged-attention layout metadata helpers. - 3. Move Qwen3 call sites to import `pegainfer_kernels::{ops, tensor}` and remove direct Qwen3 dependence on root-local `ops`, `ffi`, and `tensor` modules. - 4. Preserve repository build health. If Qwen3.5 still requires symbols from the old combined CUDA library, either keep those symbols as compatibility exports in the kernels crate or explicitly document and gate any temporary Qwen3-only limitation before making code changes. - 5. Add a kernel index for LLM navigation under the new crate: - - `KERNELS.md`: short human/LLM routing table from `qwen3_4b::::` to Rust wrapper, FFI symbol, source file, backend, shape/layout constraints, and status. - - Machine-readable model DAG metadata should wait for the Qwen3-4B model crate, where it can be generated or validated from model code instead of hand-maintained in the generic kernels crate. - 6. Update `docs/subsystems/kernels/pegainfer-kernels-boundary.md` and `docs/index.md` so the recorded next step is crate-first, with ledger/trace/simulator as metadata products of the crate boundary. - 7. Verify with `cargo test --release` or, if the local environment blocks full release tests, at least `cargo check --release` and report the exact blocker. -- **Risks / open questions**: - - A strict Qwen3-only CUDA extraction can conflict with the current default binary because Qwen3.5 still compiles in the same root crate and references some shared FFI symbols. The safest implementation may need to move the link/build owner to `pegainfer-kernels` while only stabilizing and indexing the Qwen3 API first. - - `kv_pool` and `page_pool` sit between model state and kernel metadata. For Phase 1, only the kernel-facing layout/descriptor pieces should move if needed; scheduler-owned allocation policy should remain in the root crate unless compilation forces a narrower split. - - Build-script path handling is fragile when moving kernel source into `crates/pegainfer-kernels/`. The plan should prefer one build owner and avoid compiling the same C symbols in both root and dependency crates. - -## Execution Log - -### Step 1: Create kernels crate and move build ownership -- Converted the repository into a Cargo workspace with `crates/pegainfer-kernels`. -- Added `pegainfer-kernels` as a root dependency. -- Moved CUDA source from root `csrc/` to `crates/pegainfer-kernels/csrc/`. -- Moved Triton AOT files from root `tools/triton/` to `crates/pegainfer-kernels/tools/triton/`. -- Moved the FlashInfer submodule path from `third_party/flashinfer` to `crates/pegainfer-kernels/third_party/flashinfer`. -- Replaced the root `build.rs` with an intentionally empty build script; `crates/pegainfer-kernels/build.rs` now owns CUDA/Triton compilation. - -- Moved kernel-owned ABI and operator code into `crates/pegainfer-kernels/src/`: `ffi`, tensor helpers, paged-KV geometry metadata, and the Qwen3-used `ops` modules. -- Kept `KvPool`, `PagePool`, and `SamplingParams` in the root crate because they are runtime allocation/policy state, not kernels. -- Replaced root `src/ffi.rs` and `src/tensor.rs` with compatibility re-exports. -- Replaced root `src/ops.rs` with re-exports from `pegainfer-kernels` plus thin root adapters for sampling, paged prefill planning, paged attention layout conversion, and the remaining Qwen3.5 recurrent wrapper. -- Removed duplicate root `src/ops/{attention,elementwise,embedding,linear,norm,sampling}.rs`. -- Kept `src/ops/recurrent.rs` in root for now because it depends on Qwen3.5's model-local `GdrChunkwiseScratch35`; moving that would expand Phase 1 beyond Qwen3-4B. - -### Step 3: Add kernel index for LLM navigation -- Added `crates/pegainfer-kernels/KERNELS.md`. -- The index maps each Qwen3-4B op ID to phase, Rust wrapper, FFI symbol, source file, backend, and shape/layout notes. -- Removed the initial `kernel_manifest/qwen3_4b.toml` idea from the kernels crate. A hand-maintained machine-readable manifest in the generic kernel crate would drift; the right place is the future Qwen3-4B model crate, where the manifest can describe the model DAG and be generated or checked against code. - -### Step 4: Documentation updates -- Updated `CLAUDE.md`, `README.md`, and `docs/playbooks/developer-onboarding.md` to point CUDA/Triton paths at `crates/pegainfer-kernels/`. -- Updated `docs/subsystems/kernels/pegainfer-kernels-boundary.md` to record crate-first ordering before ledger/simulator work. - -### Step 5: Verification -- `cargo metadata --no-deps --format-version 1` succeeded and showed both workspace packages: root `pegainfer` and `pegainfer-kernels`. -- `cargo fmt --all` applied formatting, then `cargo fmt --all --check` passed. -- `PEGAINFER_CUDA_SM=120 cargo check --release` reached the `pegainfer-kernels` build script and failed at `nvcc` execution because this machine has no `nvcc`. - -### Step 6: GPU release compile -- Avoided overwriting `` because that validation checkout has unrelated uncommitted work. -- Synced the local working tree to `` with `rsync`, excluding `.git/`, `target/`, `.venv/`, and `models/`. -- Copied the existing validation FlashInfer submodule contents from `/third_party/flashinfer` into `crates/pegainfer-kernels/third_party/flashinfer` inside the build directory. -- `PEGAINFER_CUDA_SM=120 cargo build --release` passed on the CUDA validation host. First pass exposed two Rust warnings from this split (`SamplingParams::is_greedy` unused and root `PrefillPagedPlan` visibility too wide); both were cleaned up. -- Re-synced and reran `PEGAINFER_CUDA_SM=120 cargo build --release`; it passed in 14.16s with only build-script informational warnings. -- `PEGAINFER_CUDA_SM=120 cargo test --release --no-run` passed in 12.28s and compiled all unit, binary, e2e, paged-attention, and regen test targets. - -### Step 7: GPU e2e and serving benchmark -- Ran Qwen3-4B e2e on the same validation build directory: - - `PEGAINFER_CUDA_SM=120 PEGAINFER_TEST_MODEL_PATH= cargo test --release --test e2e -- --nocapture` - - Result: pass, 1 test passed in 9.36s. - - Covered greedy golden outputs, multi-request generation, and consumer-drop scheduler survival. -- Ran the standard in-process serving snapshot: - - `RUST_LOG=warn PEGAINFER_CUDA_SM=120 cargo run --release --bin bench_serving -- --model-path snapshot` - - Result: pass. - - RTX 5090 Qwen3-4B snapshot: - - `prefill_heavy (10000,1)`: TTFT p50 `501.93ms`, p99 `503.75ms`. - - `decode_heavy (1024,256)`: TPOT p50 `7.40ms`, p99 `7.46ms`. - - Snapshot was written on the validation build dir at `bench_snapshots/rtx-5090/qwen3-4b.json`. - - Pulled the snapshot back into the local repo as `bench_snapshots/rtx-5090/qwen3-4b.json` so it can be committed with the crate split. - - The isolated rsync build directory intentionally excludes `.git/`, so the generated `commit` field was `unknown`; after pulling it back, set it to the current local `HEAD` short hash `3448f87`. -- Checked ``; it is not present on this CUDA validation host, so no Qwen3.5 e2e was run. - -### Step 8: GPU clippy and final local checks -- Ran local `cargo fmt --all --check`: pass. -- Ran local `cargo metadata --no-deps --format-version 1`: pass. -- Synced the current working tree to ``. -- Ran `PEGAINFER_CUDA_SM=120 cargo clippy --release --all-targets -- -D warnings` on the CUDA validation host: pass in 1m42s. - -### Unexpected -- Local `cargo check --release` reached `pegainfer-kernels` build script but failed because this machine does not have `nvcc`; the user will provide a GPU build machine for compilation. -- A second `cargo check --release -p pegainfer-kernels --lib` without `PEGAINFER_CUDA_SM` failed earlier at GPU SM detection, which is expected on this local machine without `nvidia-smi`. -- The validation checkout was dirty, so verification used a separate validation build directory instead of modifying that checkout. -- The validation build directory does not include `.git/`, so `bench_serving snapshot` reports `commit: unknown`. - -## Debrief - -- **Outcome**: Implemented and validated the crate-first Phase 1 split. Kernel source, Triton source, FlashInfer submodule ownership, CUDA/Triton build script, FFI, kernel ABI tensor helpers, paged-KV layout metadata, and Qwen3-used Rust ops now live under `crates/pegainfer-kernels`. Root `pegainfer` keeps server/model code, `KvPool`, `PagePool`, `SamplingParams`, and thin compatibility adapters. The split passes local format/metadata checks, GPU release build/test-target compilation, release clippy, Qwen3-4B e2e, and the standard Qwen3-4B `bench_serving snapshot`. -- **Pitfalls encountered**: - - Root `src/ops/recurrent.rs` cannot be moved cleanly in this pass because it takes Qwen3.5's `GdrChunkwiseScratch35` type. Moving it would pull hybrid-model scratch ownership into the kernels crate, which is outside the Qwen3-4B Phase 1 scope. - - Initially moved `KvPool`, `PagePool`, and `SamplingParams` into the kernels crate. That was too broad; those belong to runtime policy and have been moved back to root. - - Local compile verification is blocked by missing `nvcc`, so GPU compile verification should happen on a CUDA build host. -- **Lessons learned**: - - The kernel crate should own source and build artifacts physically, not only re-export copied Rust wrappers. Keeping `csrc/`, `tools/triton/`, and `third_party/flashinfer` in root creates exactly the duplicate context we are trying to remove. - - The human/LLM routing index belongs beside the kernels crate because it helps edit reusable kernels. Machine-readable model DAG manifests should not live there unless they are generated or validated; they belong with the model crate that owns the DAG. -- **Follow-ups**: - - Phase 2 can extract the Qwen3 model crate on top of `pegainfer-kernels`. - - In the Qwen3 model crate, define the model-owned kernel DAG and decide whether any TOML/JSON manifest is generated from Rust code, validated against wrappers, or avoided entirely in favor of trace IDs emitted directly from the executor. - - Run Qwen3.5 e2e separately on a box with `` if later changes touch the compatibility kernels or recurrent wrappers. diff --git a/docs/models/qwen3/kv-pressure-hang.md b/docs/models/qwen3/kv-pressure-hang.md deleted file mode 100644 index 3c3c1a76..00000000 --- a/docs/models/qwen3/kv-pressure-hang.md +++ /dev/null @@ -1,138 +0,0 @@ -# Qwen3 KV Pressure Hang - -**Created**: 2026-05-15 -**Status**: complete - -**TL;DR**: Qwen3-4B scheduler admission now reserves each admitted request's full KV lifetime budget, keeps temporarily over-budget requests in the waiting queue, rejects only requests that can never fit this model instance, reports those rejects to vLLM as request errors, and releases request state on client-drop/execution-error paths. RTX 5090 issue #85 `vllm bench serve` QPS=2 now completes `500/500` with `0` failures, and post-pressure `/v1/completions` still returns. - -## Preparation - -- **Read**: - - `docs/index.md` - routed this issue to Qwen3 batching, scheduler, and benchmark docs. - - `docs/subsystems/scheduler/scheduler.md` - contains the exact `vllm bench serve` QPS=2 workload from issue #85 and the expected serving evidence shape. - - `docs/subsystems/scheduler/scheduler.md` - explains Qwen3 scheduler, paged KV, and the page-pool design contract. - - `docs/conventions/bench-regression.md` - gives benchmark evidence discipline and threshold language. - - `.codex/harness/README.md` - confirms the verification ladder and safety boundaries. - - `.codex/harness/commands.md` - provides Qwen3 e2e, server, and benchmark commands. - - `.codex/harness/verification.md` - classifies this as serving/scheduler behavior needing a narrow repro plus HTTP/benchmark evidence. - - `pegainfer-qwen3-4b/src/scheduler.rs` - admission control currently defers requests under KV pressure. - - `pegainfer-qwen3-4b/src/scheduler/plan.rs` - execution plans currently consume pending requests before failures are handled. - - `pegainfer-qwen3-4b/src/scheduler/effects.rs` - successful finishes drop request state; scheduler execution errors do not. - - `pegainfer-qwen3-4b/src/executor.rs` - `drop_request` is the existing owner API for releasing per-request KV state. - - `pegainfer-core/src/kv_pool.rs` and `pegainfer-core/src/page_pool.rs` - KV pages are RAII-returned only when request state is dropped. - - GitHub issue #85 - observed server stays alive but completions hang after QPS=2 KV pressure. -- **Relevant history**: - - `docs/subsystems/scheduler/scheduler.md` - QPS=2 varied workload is near capacity and already had some failed requests; the fix must handle pressure explicitly rather than claim higher throughput. - - `docs/subsystems/scheduler/scheduler.md` - page-pool RAII is the intended cleanup mechanism; scheduler must call the owner drop path when abandoning a request. -- **Plan**: - 1. Add a scheduler-level regression using a fake executor so admission deadlock and execution-error cleanup are testable without GPU/model weights. - 2. Refactor Qwen3 scheduler admission into a small helper that rejects requests that can never fit in the KV pool and keeps temporarily deferred requests. - 3. Preserve touched request IDs for each execution plan; if a prefill/decode/unified step fails, send explicit errors and call `drop_request` for active plus plan-pending requests. - 4. Run `cargo fmt --check`, the targeted Qwen3 scheduler/lib tests, and `git diff --check` locally. - 5. Run a read-only DeepSeek diff review focused on missed cleanup/admission cases. - 6. Use the authorized remote GPU host for Qwen3-4B e2e and the issue #85 `vllm bench serve` workload; verify a post-pressure completion returns. -- **Risks / open questions**: - - The real hang could include another path beyond leaked KV pages; the pressure test is the裁判. - - The QPS=2 benchmark is long and may fail some requests by design; the claim boundary is recovery/no permanent hang, not zero failures or performance improvement. - -## Execution Log - -### Step 1: Scheduler cleanup and admission regression -- Made `start_with_executor`/`scheduler_loop` generic over `ModelExecutor` so scheduler behavior can be tested with a fake executor without GPU/model weights. -- Added fake-executor regression coverage for: - - requests that can never fit being rejected without blocking later work; - - temporary KV pressure keeping requests waiting until full KV budget is available; - - decode errors surfacing as `TokenEvent::Error`, dropping request state, and allowing recovery; - - client/receiver drop releasing request state. -- Changed `DecodeEffect::EmitAndContinue` send-failure handling to call `drop_request` before retiring the active request. -- Result: remote RTX 5090 `cargo test --release -p pegainfer-qwen3-4b --lib scheduler -- --nocapture` passed, `4 passed`. - -### Step 2: Maintainer feedback refinement -- The maintainer clarified that the basic fix should keep requests that cannot get KV allocation in the waiting queue; preemption can be deferred. -- Updated scheduler admission from prefill-only accounting to full lifetime accounting: - - active requests reserve the remaining pages they may need until `max_tokens`; - - pending requests are admitted only if their prompt plus maximum generated-token KV footprint fits after those active reservations; - - temporarily over-budget pending requests stay in `deferred`; - - only requests larger than this model instance's total usable KV capacity are rejected to avoid permanent head-of-line deadlock. -- This is intentionally conservative: it may defer earlier than a preemption-capable scheduler would, but it prevents decode-time allocation failure for newly admitted batches without implementing preemption in this PR. - -### Step 3: Build and static gates -- Remote environment: - - GPU: NVIDIA GeForce RTX 5090, driver `580.76.05`, 32607 MiB. - - CUDA: `nvcc` `13.0.88`, `PEGAINFER_CUDA_SM=120`. - - Rust: `rustc 1.97.0-nightly (7c3c88f42 2026-05-14)`. - - Model: `models/Qwen3-4B`, HF revision metadata `1cfa9a7208912126459214e8b04321603b3df60c`. -- Commands: - - `cargo fmt --check` — passed. - - `cargo test --release -p pegainfer-qwen3-4b --lib scheduler -- --nocapture` — passed, `4 passed`. - - `cargo clippy --release -p pegainfer-qwen3-4b --lib -- -D warnings` — passed. - - `cargo build --release -p pegainfer-server` — passed. -- Local command: - - `~/.cargo/bin/cargo fmt --check` — passed. - -### Step 4: E2E and serving pressure validation -- Installed `vllm 0.21.0` in the validation venv to run the issue's real `vllm bench serve` client. -- Ran a host-local exact e2e check against the validation model snapshot: - - `PEGAINFER_TEST_MODEL_PATH=models/Qwen3-4B cargo test --release -p pegainfer-qwen3-4b --test e2e -- --nocapture` - - Result after local fixture regeneration for that model snapshot: passed, `1 passed`. - - PR review later found the regenerated fixture was not portable to the standard local model snapshot, so the repository `test_data/Qwen3-4B.json` change was reverted and this e2e result is not used as a merge gate. -- Ran a small issue-shaped benchmark first: - - `vllm bench serve ... --num-prompts 20 --request-rate 2 ...` - - Result: `20/20` successful, `0` failed. -- Ran the full issue #85 workload against the rebuilt server: - - `vllm bench serve --backend openai --model models/Qwen3-4B --port 8000 --dataset-name random --random-input-len 2048 --random-output-len 128 --random-range-ratio 0.5 --num-prompts 500 --request-rate 2 --seed 42 --ignore-eos --temperature 0 --tokenizer models/Qwen3-4B` - - Result: `500` successful, `0` failed, duration `250.89s`, peak concurrency `13`, throughput `1.99 req/s`, mean TTFT `129.27ms`, mean TPOT `12.54ms`. -- Post-pressure checks: - - `/v1/models` returned model id `models/Qwen3-4B`. - - `timeout 30 curl ... /v1/completions` returned HTTP completion text and usage with `completion_tokens=16`. -- Also ran an overload HTTP probe (`200` concurrent-ish long requests at `80 rps`); it returned explicit HTTP 500s quickly and a post-pressure completion still returned. This was not the acceptance gate, but it confirmed the server did not enter the old half-alive hang state under more aggressive pressure. - -### Step 5: Compatibility fix encountered during validation -- Remote CUDA 13.0 initially failed with the existing `cudarc` `cuda-13010` feature because the driver/runtime lacked `cuDevSmResourceSplit`. -- Kept the workspace on `cuda-13010`; changing the shared `cudarc` feature would widen the PR's collaboration surface beyond issue #85. -- Fixed `qwen3_decode_context` test-target compilation by linking `cudaProfilerStart/Stop` directly from `cudart`; the symbols were not exposed through `pegainfer_core::ffi`. - -### Step 6: Final diff hygiene -- `git diff --check` — passed. -- Confirmed the remote pegainfer server process was stopped after validation. - -### Step 7: Maintainer-style review follow-up -- Re-reviewed the changed scheduler and bridge paths after the main fix. -- Found one API-contract issue: `TokenEvent::Rejected` was being translated to vLLM `EngineCoreFinishReason::Stop`, which would make an impossible KV request look like an empty successful response. -- Changed `pegainfer-server/src/vllm_frontend.rs` so `Rejected` maps to `EngineCoreFinishReason::Error` with the rejection message as `stop_reason`. -- Added `vllm_frontend::tests::rejected_request_is_reported_as_error`. -- Remote RTX 5090 command: - - `cargo test --release -p pegainfer-server rejected_request_is_reported_as_error --lib` — passed, `1 passed`. - -### Step 8: PR review comment follow-up -- Read PR #131 review comments from `gemini-code-assist`. The comments claimed the KV budget formulas should use `prompt_len + max_tokens` and `prompt_len + generated_count`. -- Source check showed that Qwen3 prefill writes only prompt tokens to KV; the sampled first output token is not appended until it is fed as a later decode input. Therefore a request returning `N` completion tokens occupies at most `prompt_len + N - 1` KV tokens. -- Kept the scheduler formula unchanged and added explicit boundary coverage for this contract: - - helper-level assertions for current/max KV token counts; - - a scheduler regression proving `prompt_len=page_size, max_tokens=1` fits in one prompt page and finishes without a decode KV page. - -### Step 9: Maintainer review portability fixes -- Maintainer review on PR #131 reproduced the issue-shaped HTTP pressure workload successfully on PR head `6b5f963`, then requested two portability fixes before merge. -- Reverted `test_data/Qwen3-4B.json` to avoid carrying a non-portable exact-golden refresh in a scheduler/KV PR. -- Rewrote validation evidence to use checkout-neutral paths such as `models/Qwen3-4B` and "validation venv" instead of machine-local absolute paths. - -### Unexpected -- The exact Qwen3-4B e2e initially failed because the checked-in golden text did not match the validation host's current HF revision/runtime output. This matches prior project history around Qwen3 greedy near-tie/golden drift. Maintainer review showed the regenerated fixture was not portable to the standard local model snapshot, so the fixture change was reverted and the scheduler/HTTP gates carry this PR. -- DeepSeek diff-review was attempted twice and timed out (`180s`, then `300s`), so no external advisor result is counted. - -## Debrief - -- **Outcome**: Issue #85's observed hang is addressed for the measured QPS=2 Qwen3-4B serving workload. Scheduler admission now keeps temporarily over-budget requests waiting instead of admitting them on prefill-only capacity, successful/client-dropped/error paths release request state, impossible requests surface as vLLM request errors, and the real `vllm bench serve` workload completed `500/500` with post-pressure completion still healthy. -- **Pitfalls encountered**: - - Full lifetime KV accounting is the basic no-preemption fix. Prefill-only accounting can still allow decode-time allocation failure when many active requests grow into new pages together. - - Exact text e2e depends on the model snapshot/golden pairing; do not refresh `test_data/Qwen3-4B.json` in scheduler PRs unless the regeneration contract is reproducible across the standard validation paths. - - `vllm 0.21.0` installation pulled a large PyTorch/CUDA 13 stack. The install was slow but completed and enabled the real issue client. - - CUDA feature selection mattered on the remote 5090: `cuda-13010` expects CUDA 13.1 driver API symbols, so validation hosts using CUDA 13.0 need a CUDA 13.1/compat runtime rather than a source-level downgrade. -- **Lessons learned**: - - The scheduler should treat KV as a lifetime reservation until preemption exists. That is simpler and safer than relying on decode-time allocation errors. - - KV budget math must count tokens actually written to KV, not sampled tokens already returned to the client. For Qwen3, the first sampled token does not occupy KV until the next decode step. - - Requests larger than a single model instance's usable KV capacity need explicit rejection; otherwise they would wait forever and block the queue. - - Serving regression evidence should include both the pressure client result and a post-pressure completion, because the original failure mode kept `/v1/models` alive while completions hung. -- **Follow-ups**: - - Design real preemption/cancellation semantics for active requests when the scheduler wants to trade fairness/throughput against full lifetime reservation. - - Decide whether exact Qwen3 greedy golden drift should get a stronger deterministic tie-breaking gate or remain a regenerated-snapshot fixture. diff --git a/docs/models/qwen3/model-crate.md b/docs/models/qwen3/model-crate.md deleted file mode 100644 index 39636723..00000000 --- a/docs/models/qwen3/model-crate.md +++ /dev/null @@ -1,541 +0,0 @@ -# Qwen3-4B Model Crate - -**Created**: 2026-05-03 -**Status**: ready for diff review -**TL;DR**: `crates/pegainfer-qwen3-4b` now owns Qwen3 config, weights, execution, scheduler, tests, benches, and kernel plan. Root `pegainfer` loads Qwen3 through a generic `EngineHandle` and no longer contains `Qwen3Model`, `Qwen3Executor`, `ModelRuntimeConfig`, root Qwen3 tests, or `src/model/qwen3/*`. The old `ModelForward` path has been removed; decode length-limit now emits the final token before `Finished`. Long-context `bs=1` TPOT was traced to non-partition FlashInfer paged decode under-filling the GPU; Qwen3 runtime gates FlashInfer split-K decode for `padded_bs<=2 && seq_len>=1024` and was retuned to `chunk_tokens=256,max_chunks=64`, cutting 4k/64 serving steady TPOT from about `11.7ms` to `6.46ms` on RTX 5090. Qwen3 now keeps a single model-crate bench entry: `qwen3_kernel_snapshot`, a JSON snapshot runner with warm/cold-L2 latency, default-on CUPTI counters, and compare. Correctness/truth is intentionally out of this snapshot for now. - -## Preparation - -- **Read**: - - `docs/index.md` - identified the kernels/core crate split and per-model boundary docs. - - `docs/models/qwen3/kernels-crate.md` - Qwen3 kernel source/build ownership and human kernel index already live in `pegainfer-kernels`; model-owned DAG metadata should live with the model crate. - - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - records the per-model engine direction and says root should be reusable frontend/control-plane infrastructure, not a universal model abstraction. - - `src/main.rs`, `src/lib.rs`, `src/server_engine.rs`, `src/scheduler.rs`, `src/model_executor.rs`, `src/model/qwen3/*`, `src/bin/bench_serving.rs`, and Qwen3 tests - mapped what root currently knows about Qwen3. -- **Relevant history**: - - The earlier shared-runtime work (now consolidated into `docs/subsystems/runtime/runtime.md`) was a useful simplification, but the next boundary should not make `ModelForward` the long-term universal engine API. -- **Plan**: - 1. Define the model crate/root interface before moving code. - 2. Move the generic text-generation handle/request/event types into `pegainfer-core` so root and model crates can communicate without model crates depending on root. - 3. Create `crates/pegainfer-qwen3-4b` and move Qwen3 config, weights, forward paths, decode buffers, `Qwen3Executor`, Qwen3 scheduler internals, Qwen3 correctness tests, and Qwen3-specific benches into it. - 4. Keep root `pegainfer` as frontend plus model registry. The registry can know crate names, but `main`, `vllm_frontend`, and generic benchmark code should only see `EngineHandle`, `ModelInfo`, and tokenizer path. - 5. Add a model-owned `kernel_plan.rs` in the Qwen3 crate as the LLM/human index from model DAG phases to reusable kernels. Do not add a hand-maintained public TOML in `pegainfer-kernels`. - 6. Verify locally with format/metadata, then on the CUDA validation host with release build, clippy, Qwen3 crate e2e, and root `bench_serving snapshot`. Keep microbench timing in Criterion benches instead of duplicating it as a test. -- **Risks / open questions**: - - If the scheduler stays in root, root still knows Qwen3's execution shape. To meet the stated goal, the Qwen3 scheduler should move into the Qwen3 crate and expose only a generic handle. - - `bench_serving` previously had a direct `ModelForward` path for Qwen3 and a scheduler path for Qwen3.5. It needed to become generic over `EngineHandle`, while Qwen3 crate-local benches should use the model executor phase API. - - Qwen3.5 remains in root for this phase. The registry may temporarily wrap root-local Qwen3.5, but new Qwen3 code should not depend on that temporary shape. - -## Interface Proposal - -The root-visible interface should be request/response oriented, not prefill/decode oriented. - -```rust -// pegainfer-core -pub struct EngineLoadOptions { - pub enable_cuda_graph: bool, - pub device_ordinals: Vec, - pub seed: u64, -} - -pub struct ModelInfo { - pub id: &'static str, - pub display_name: String, - pub max_model_len: Option, -} - -pub struct GenerateRequest { - pub prompt_tokens: Vec, - pub params: SamplingParams, - pub max_tokens: usize, - pub token_tx: tokio::sync::mpsc::UnboundedSender, - pub logprobs: usize, - pub echo: bool, -} - -pub enum TokenEvent { - Token { id: u32, logprob: Option }, - PromptTokens { ids: Vec, logprobs: Vec> }, - Finished { finish_reason: FinishReason, prompt_tokens: usize, completion_tokens: usize }, -} - -#[derive(Clone)] -pub struct EngineHandle { - submit_tx: tokio::sync::mpsc::UnboundedSender, -} -``` - -```rust -// pegainfer-qwen3-4b -pub fn probe_model(model_path: &std::path::Path) -> anyhow::Result>; -pub fn start_engine( - model_path: &std::path::Path, - options: EngineLoadOptions, -) -> anyhow::Result; -pub fn kernel_plan() -> &'static KernelPlan; -``` - -`Qwen3Model`, `BatchDecodeBuffers`, and `KvState` should not be root-facing APIs. The deliberate low-level escape hatch is `pegainfer_qwen3_4b::runtime`, which exposes `Qwen3Executor` plus prefill/decode/unified plan types. That is the production phase boundary used by the scheduler and by model-local benches; root should still use `start_engine`. - -## Execution Log - -### Step 1: Add generic engine API to core -- Added `pegainfer_core::engine` with: - - `EngineLoadOptions` - - `ModelInfo` - - `TokenLogprob` - - `FinishReason` - - `GenerateRequest` - - `TokenEvent` - - `EngineHandle` -- Root `server_engine` now re-exports `FinishReason` and `TokenLogprob` for compatibility. -- Root `scheduler.rs` is reduced to compatibility re-exports for `SchedulerHandle`, `SchedulerRequest`, and `TokenEvent`. - -### Step 2: Extract Qwen3 crate -- Added `crates/pegainfer-qwen3-4b`. -- Moved Qwen3-owned code into the crate: - - config/weights/forward/prefill/decode/unified forward - - batch decode buffers - - `Qwen3Executor` - - Qwen3 scheduler internals - - Qwen3 e2e and paged-attention correctness tests - - Qwen3 regression data generator - - Qwen3 prefill Criterion bench -- Added `kernel_plan.rs` as the model-owned kernel routing index. It is typed Rust metadata, not a hand-maintained public TOML. - -### Step 3: Remove root Qwen3 execution knowledge -- Root no longer has: - - `src/model/qwen3.rs` - - `src/model/qwen3/*` - - `src/model_executor.rs` - - Qwen3 root tests: `tests/e2e.rs`, `tests/paged_attention.rs`, `tests/bench_prefill.rs` -- Root `main.rs` starts Qwen3 through `pegainfer_qwen3_4b::start_engine(...)`. -- Root `vllm_frontend.rs` accepts a generic `EngineHandle`. -- Root `bench_serving` uses the same generic scheduler bench path for Qwen3 instead of constructing `Qwen3Model` directly. -- Checked root with `rg` and confirmed no hits for `Qwen3Model`, `Qwen3Executor`, `ModelRuntimeConfig`, `model_executor`, `src/model/qwen3`, or stale "Qwen3 continuous" comments under root source/tests/benches/README. - -### Step 4: Link and validation fixes -- Added explicit `stdc++` link output in `pegainfer-kernels` build script. Once Qwen3 became an independent crate with its own tests, the FlashInfer C++ CUDA objects needed the C++ runtime linked for test binaries as well as root binaries. -- Fixed the Qwen3 crate prefill test to respect `PEGAINFER_TEST_MODEL_PATH`. -- The validation build directory still has no `.git`, so `bench_serving snapshot` writes `commit: unknown`; after pulling it back with `rsync -e 'ssh -S none'`, the local snapshot commit field was set to current local `HEAD` short hash `0f54a1d`. - -### Step 5: Verification -- Local: - - `cargo fmt --all --check` passes. - - `cargo metadata --no-deps --format-version 1` passes. -- CUDA validation host (RTX 5090): - - `PEGAINFER_CUDA_SM=120 cargo clippy --release --all-targets -- -D warnings` passes. - - `PEGAINFER_CUDA_SM=120 cargo build --release` passes. - - `PEGAINFER_CUDA_SM=120 cargo test --release --workspace --no-run` passes. - - `PEGAINFER_CUDA_SM=120 PEGAINFER_TEST_MODEL_PATH= cargo test --release -p pegainfer-qwen3-4b --test e2e -- --nocapture` passes. - - `RUST_LOG=warn PEGAINFER_CUDA_SM=120 cargo run --release --bin bench_serving -- --model-path snapshot` passes: - - `prefill_heavy (10000,1)`: TTFT p50 `500.90ms`, p99 `503.30ms` - - `decode_heavy (1024,256)`: TPOT p50 `7.57ms`, p99 `7.74ms` - - This run exposed a scheduler length-limit bug: `max_tokens=256` emitted only `255` token events because the limit path finished without emitting the final decoded token. It was fixed in Step 7. -- Snapshot pulled back to `bench_snapshots/rtx-5090/qwen3-4b.json`. - -### Step 6: Bench Boundary Cleanup -- Removed the duplicate Qwen3 `tests/bench_prefill.rs`; performance timing belongs under Criterion benches, while tests keep correctness/e2e coverage. -- Rejected a bench-only support API and also rejected using `ModelForward` as the benchmark entry. -- Added an explicit `runtime` module that re-exports the scheduler's real `Qwen3Executor` phase API: `PrefillPlan`, `DecodePlan`, `UnifiedPlan`, request items, and result types. -- Removed top-level public `Qwen3Model`, `ModelRuntimeConfig`, and `Qwen3State` re-exports. External low-level tools must opt into `runtime`; root continues to use `start_engine`. -- Replaced `crates/pegainfer-qwen3-4b/benches/qwen3_prefill.rs` with `benches/qwen3_runtime.rs`. It measures executor prefill TTFT over `128`, `512`, `1024`, `2048`, `4096`, and `10000` token prompts, plus executor decode TPOT for batch sizes `1`, `2`, `4`, `8`, `16`, and `32` at a `1024` token context. -- Updated `tests/paged_attention.rs` to use the same executor phase API: prefill once to create KV state, then decode through `execute_decode`. -- Verification after the cleanup: - - Local `cargo fmt --all --check` and `cargo metadata --no-deps --format-version 1` pass. - - Local `cargo check --release -p pegainfer-qwen3-4b --benches --tests` cannot run on the Mac without CUDA/nvcc; with `PEGAINFER_CUDA_SM=120` it still fails at local `nvcc`. - - CUDA host `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-qwen3-4b --benches --tests` passes. - - CUDA host `PEGAINFER_CUDA_SM=120 cargo clippy --release --all-targets -- -D warnings` passes. - - CUDA host `PEGAINFER_CUDA_SM=120 PEGAINFER_TEST_MODEL_PATH= cargo test --release -p pegainfer-qwen3-4b --test paged_attention -- --nocapture` passes. - - CUDA host full Criterion bench passes with `PEGAINFER_CUDA_SM=120 PEGAINFER_TEST_MODEL_PATH= cargo bench -p pegainfer-qwen3-4b --bench qwen3_runtime`: - - Prefill TTFT: `128 -> 11.804ms`, `512 -> 23.200ms`, `1024 -> 44.114ms`, `2048 -> 87.327ms`, `4096 -> 179.60ms`, `10000 -> 505.55ms`. - - Decode one-step batch time at 1024-token context: `bs1 -> 9.3095ms`, `bs2 -> 9.3207ms`, `bs4 -> 9.4059ms`, `bs8 -> 10.960ms`, `bs16 -> 11.718ms`, `bs32 -> 13.196ms`. - -### Step 7: Retire ModelForward and Fix Length Limit -- Deleted `pegainfer_core::model::{ModelForward, GenerationState}` and removed the root `src/model.rs` re-export. -- Deleted the Qwen3 `forward.rs` compatibility path. Qwen3 tests that used it now build their baselines from `batch_prefill(bs=1)` plus `batch_decode(bs=1)`, so they exercise the same phase APIs as production. -- Fixed Qwen3 decode length-limit handling by adding `DecodeEffect::EmitAndFinish`. EOS behavior is unchanged: EOS finishes without emitting the stop token. Length limit now emits the sampled final token, then sends `Finished { finish_reason: Length }`. -- Regenerated `test_data/Qwen3-4B.json` because every length-limited golden output now includes the final requested token. -- Re-ran `bench_serving snapshot` on the CUDA validation host and pulled back `bench_snapshots/rtx-5090/qwen3-4b.json`; `decode_heavy (1024,256)` now records `generated_tokens min=max=avg=256`. -- Performance stayed within noise on RTX 5090: - - `prefill_heavy (10000,1)`: TTFT p50 `501.69ms`, p99 `503.16ms`. - - `decode_heavy (1024,256)`: TPOT p50 `7.56ms`, p99 `7.73ms`. -- Final verification after this step: - - Local `cargo fmt --all --check`, `cargo metadata --no-deps --format-version 1`, and `git diff --check` pass. - - CUDA host `PEGAINFER_CUDA_SM=120 cargo clippy --release --all-targets -- -D warnings` passes. - - CUDA host `PEGAINFER_CUDA_SM=120 PEGAINFER_TEST_MODEL_PATH= cargo test --release -p pegainfer-qwen3-4b --test e2e -- --nocapture` passes. - -### Step 8: Decode Context-Length Sweep and Compile Audit -- Added `crates/pegainfer-qwen3-4b/src/bin/qwen3_decode_context.rs` as a production-path fixed-context decode probe. It prefills a fresh request to a selected context length, then measures or profiles real `Qwen3Executor::execute_decode`; the optional `cudaProfilerStart/Stop` range only exists for profiler capture and does not run in normal serving. -- GPU fixed-context command: - - `PEGAINFER_CUDA_SM=120 target/release/qwen3_decode_context --model-path --iters 10 --contexts 128,512,1024,2048,4096,8192,10000` -- Result on RTX 5090: - -| Context | Decode p50 | -| --- | ---: | -| 128 | `6.1107ms` | -| 512 | `6.7094ms` | -| 1024 | `7.4256ms` | -| 2048 | `8.8918ms` | -| 4096 | `11.7912ms` | -| 8192 | `17.5457ms` | -| 10000 | `20.0653ms` | - -- Linear fit across the sweep: `TPOT ~= 5.9789ms + 1.411us/token * context`, `R^2=0.99997`. -- `nsys` with `--cuda-graph-trace=node` shows the growth is almost entirely FlashInfer paged decode attention: - -| Context | Total kernel time / step | Attention / step | Non-attention / step | -| --- | ---: | ---: | ---: | -| 1024 | `7.3287ms` | `1.5390ms` | `5.7897ms` | -| 10000 | `19.6907ms` | `13.8868ms` | `5.8039ms` | - -- H2D traffic in the profiled decode range was only about `20-23us/step`, so metadata dirty caching is good runtime hygiene but cannot explain a multi-ms TPOT gap. -- Compile audit on the same validation worktree: - - GPU reports compute capability `12.0`; default toolkit is CUDA `12.9` (`nvcc V12.9.86`), driver `575.57.08`. - - `crates/pegainfer-kernels/build.rs` emits `-O3 -gencode arch=compute_120,code=sm_120 -gencode arch=compute_120,code=compute_120 --compiler-options -fPIC`; FlashInfer translation units add `--std=c++17` and the FlashInfer include path. - - `cuobjdump -lelf` confirms both `libkernels_cuda.a` and `target/release/pegainfer` contain `sm_120.cubin`. `compute_120` PTX fallback is also embedded, but the matching SASS is present, so this is not PTX-JIT-only execution. - - CUDA `13.1` is installed and can build the same code into `sm_120` cubins, but the current driver/runtime combination cannot run it (`cudaError=35` after linking `libcudart.so.13`). Until the driver is upgraded, CUDA `12.9` is the latest runnable toolkit on this box. -- Interpretation: the compile target is correct. The `bs=1` long-context slope is the known non-partition FlashInfer paged decode issue: grid shape is effectively `(batch_size, num_kv_heads) = (1, 8)`, so only 8 CTAs scan the whole KV context. At `ctx=4096`, Qwen3-4B attention reads about `604MB` (`576MiB`) of K/V per token; the measured attention time is about `5.7ms`, or roughly `105GB/s` effective aggregate bandwidth, far below the RTX 5090 memory system because the kernel under-fills the GPU. The next real fix is partition-KV/split-K decode for `bs=1` or low-batch, not build-flag tuning. - -### Step 9: Pure Paged Decode Attention Bench -- Added `crates/pegainfer-qwen3-4b/benches/qwen3_attention.rs`. -- The bench does not load Qwen3 weights. It constructs synthetic non-zero Q and paged KV buffers using Qwen3-4B attention shape: `num_qo_heads=32`, `num_kv_heads=8`, `head_dim=128`, `page_size=16`, one layer. -- The bench calls the FlashInfer paged decode FFI directly and uses CUDA events around the kernel launches. It measures decode attention only; it excludes QKV projection, KV append, O projection, MLP, scheduler, tokenizer, and host-side serving overhead. -- Added `paged_attention_decode_split_kv_cuda` as a reusable kernel entry for FlashInfer partition-KV/split-K decode. Runtime dispatch still uses the existing non-partition path; this step only exposes and benchmarks the candidate operator. -- The split-K bench uses `chunk_size=512` and `max_chunks_per_request=64`. Active chunks are packed for `o_indptr`, and remaining graph-stability slots are masked with `block_valid_mask=0`. -- Bench setup runs a non-timed D2H sanity check comparing split-K output with the non-partition output for every synthetic case. The GPU run below passed that check. -- Registered it as a model-crate Criterion bench instead of a kernels-crate bench because the shape, context sweep, and interpretation are Qwen3-specific; the implementation still directly indexes the reusable kernel entry point. -- Local verification: - - `cargo fmt --all --check` passes. - - `cargo metadata --no-deps --format-version 1` passes. - - `git diff --check` passes. -- GPU compile: - - `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-qwen3-4b --bench qwen3_attention` passes. - - `PEGAINFER_CUDA_SM=120 cargo clippy --release -p pegainfer-qwen3-4b --all-targets -- -D warnings` passes. -- GPU run: - - `PEGAINFER_CUDA_SM=120 cargo bench -p pegainfer-qwen3-4b --bench qwen3_attention -- --noplot` passes. - -Single-layer `bs=1` context sweep on RTX 5090: - -| KV length | Non-partition | Split-K 512/padded64 | Speedup | -| --- | ---: | ---: | ---: | -| 128 | `6.56us` | `8.91us` | `0.74x` | -| 512 | `18.70us` | `20.96us` | `0.89x` | -| 1024 | `34.94us` | `21.17us` | `1.65x` | -| 2048 | `65.78us` | `20.85us` | `3.16x` | -| 4096 | `129.23us` | `21.59us` | `5.99x` | -| 8192 | `254.98us` | `22.81us` | `11.18x` | -| 10000 | `311.51us` | `23.31us` | `13.36x` | - -Batch sweep at `kv_len=1024`: - -| Batch size | Non-partition | Split-K 512/padded64 | Speedup | -| --- | ---: | ---: | ---: | -| 1 | `34.97us` | `20.89us` | `1.67x` | -| 2 | `34.96us` | `27.07us` | `1.29x` | -| 4 | `34.90us` | `38.44us` | `0.91x` | -| 8 | `35.00us` | `45.95us` | `0.76x` | -| 16 | `35.05us` | `51.11us` | `0.69x` | -| 32 | `86.12us` | `92.51us` | `0.93x` | - -Interpretation: the pure operator data reproduces the same shape as the full decode profile. At `bs=1`, non-partition time grows almost linearly with KV length, while graph-stable split-K stays near `21-23us/layer` once context reaches 1k. Multiplying the synthetic ctx10000 split-K result by 36 layers gives about `0.84ms` attention-only time instead of about `11.2ms` from non-partition. That is the right order of magnitude for fixing the long-context TPOT slope. The batch sweep also shows the guard must be conservative: at `kv_len=1024`, split-K only wins for `bs<=2`, and non-partition is better once batch already provides enough request/head CTAs. - -### Step 10: Runtime Split-K Decode Gate -- Integrated `paged_attention_decode_split_kv_cuda` into the Qwen3 decode runtime. -- `BatchDecodeBuffers` now owns split-K metadata and workspace: - - `split_request_indices_d` - - `split_kv_tile_indices_d` - - `split_kv_chunk_size_d` - - `split_o_indptr_d` - - `split_block_valid_mask_d` - - `split_tmp_v` - - `split_tmp_s` -- CUDA graph cache is now keyed by `(batch_bucket, attention_path)` instead of only `batch_bucket`. This matters because a request can first capture the `bs=1` non-partition graph at short context and later cross the split-K threshold; the split-K path needs its own graph capture. -- Runtime gate: - - split-K when `padded_bs <= 2 && max_seq_len >= 1024` - - otherwise keep non-partition decode -- Split-K metadata uses `chunk_size=max(512, ceil(max_seq_len / 64))` and `64` reserved chunk slots per request. Real chunk slots are packed for `o_indptr`; unused graph-stability slots are masked with `block_valid_mask=0`. -- Padding batch slots get zero active split chunks. Their output is discarded, and the batch columns remain independent through GEMMs. - -GPU validation: - -| Check | Result | -| --- | --- | -| `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-qwen3-4b --all-targets` | pass | -| `PEGAINFER_CUDA_SM=120 PEGAINFER_TEST_MODEL_PATH= cargo test --release -p pegainfer-qwen3-4b --test e2e -- --nocapture` | pass | -| `PEGAINFER_CUDA_SM=120 cargo clippy --release --all-targets -- -D warnings` | pass | - -Fixed-context decode probe after runtime integration: - -| Prompt context | Decode KV len | p50 TPOT before | p50 TPOT after | -| --- | ---: | ---: | ---: | -| 1024 | 1025 | `7.43ms` | `6.74ms` | -| 4096 | 4097 | `11.79ms` | `6.82ms` | -| 10000 | 10001 | `20.07ms` | `7.06ms` | - -Command: - -```bash -PEGAINFER_CUDA_SM=120 target/release/qwen3_decode_context \ - --model-path \ - --iters 10 \ - --contexts 1024,4096,10000 -``` - -Cross-threshold smoke: - -```bash -PEGAINFER_CUDA_SM=120 target/release/qwen3_decode_context \ - --model-path \ - --iters 600 \ - --contexts 512 -``` - -Result: pass, `p50=6.7156ms`. This exercises a single request growing from non-partition territory into the split-K threshold with separate graph captures. - -Serving request check after rebuilding `bench_serving`: - -```bash -RUST_LOG=warn PEGAINFER_CUDA_SM=120 target/release/bench_serving \ - --model-path \ - request --prompt-len 4096 --output-len 64 -``` - -Result: - -| Metric | Before | After | -| --- | ---: | ---: | -| `first_decode_step_ms p50` | `11.74ms` | `6.82ms` | -| `steady_tpot_ms p50` | `11.72ms` | `6.77ms` | -| `e2e_ms p50` | `916.34ms` | `604.24ms` | - -Interpretation: split-K removes the long-context attention slope for the low-batch case. The remaining `~6.8-7.1ms` TPOT is now dominated by the non-attention decode body: GEMMs/GEMVs, MLP, norms, logits, sampling, and graph replay overhead. Next optimization work should not keep pushing paged attention first; it should re-profile the post-split decode step and pick the new largest kernel family. - -### Step 11: Attention Theoretical Bandwidth Estimate -- Updated `crates/pegainfer-qwen3-4b/benches/qwen3_attention.rs` to print a one-time theoretical bandwidth report before Criterion runs. -- The report queries CUDA Driver attributes: - - `CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE` - - `CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH` -- On RTX 5090, CUDA reports `14001MHz` memory clock and a `512-bit` memory bus. Using `2` transfers per memory clock gives `1792.128GB/s`, matching the public RTX 5090 bandwidth figure. -- The report uses Qwen3 KV read bytes only: - - `bs * kv_len * num_kv_heads * head_dim * 2(K,V) * sizeof(bf16)` - - This is a counter-free lower-bound estimate, not measured DRAM bytes. -- Verification command: - -```bash -PEGAINFER_CUDA_SM=120 cargo bench -p pegainfer-qwen3-4b \ - --bench qwen3_attention -- --noplot -``` - -Key RTX 5090 report rows: - -| Case | KV read | Time | Effective GB/s | Peak % | -| --- | ---: | ---: | ---: | ---: | -| `ctx bs1 non_partition 4096` | `16.777MB` | `129.285us` | `129.769` | `7.24%` | -| `ctx bs1 non_partition 10000` | `40.960MB` | `309.744us` | `132.238` | `7.38%` | -| `ctx bs1 split_k512_padded64 4096` | `16.777MB` | `21.494us` | `780.536` | `43.55%` | -| `ctx bs1 split_k512_padded64 10000` | `40.960MB` | `23.294us` | `1758.386` | `98.12%` | - -Batch sweep sanity rows at `kv_len=1024`: - -| Case | KV read | Time | Effective GB/s | Peak % | -| --- | ---: | ---: | ---: | ---: | -| `batch non_partition bs8` | `33.554MB` | `34.935us` | `960.482` | `53.59%` | -| `batch non_partition bs16` | `67.109MB` | `35.004us` | `1917.174` | `106.98%` | -| `batch split_k512_padded64 bs32` | `134.218MB` | `92.533us` | `1450.489` | `80.94%` | - -Interpretation: the estimate is good enough to prove the original `bs=1` non-partition path was badly under-filling memory bandwidth. It is not good enough to make final hardware-utilization claims because single-layer KV working sets fit in the RTX 5090's `96MiB` L2; the `bs16` non-partition row exceeding `100%` of DRAM peak is the warning sign. The next measurement step should use CUPTI Profiler or NCU counters for `dram__bytes_*`, `lts__t_bytes.*`, and `*_pct_of_peak_sustained_elapsed`. - -### Step 12: CUPTI Counters and Split-K Retune -- Added `crates/pegainfer-cupti`, a small CUPTI Range Profiler wrapper used by the attention bench. It profiles only the attention launch range and lets the bench clear L2 before `cuptiRangeProfilerStart`, so cache-clear traffic is excluded from the measured range. -- Extended `crates/pegainfer-qwen3-4b/benches/qwen3_attention.rs`: - - `PEGAINFER_QWEN3_ATTENTION_CUPTI=1` prints cold-L2 CUPTI rows for `gpu__time_duration.sum`, `dram__bytes.sum`, `dram__bytes_op_read.sum`, `dram__bytes_op_write.sum`, and `lts__t_bytes.sum`. - - `PEGAINFER_QWEN3_ATTENTION_SPLITK_SWEEP=1` sweeps split-K chunk sizes and max chunk slots. - - `PEGAINFER_QWEN3_ATTENTION_REPORT_ONLY=1` prints reports without running Criterion samples. -- GPU CUPTI command: - -```bash -PEGAINFER_CUDA_SM=120 \ -PEGAINFER_QWEN3_ATTENTION_REPORT_ONLY=1 \ -PEGAINFER_QWEN3_ATTENTION_CUPTI=1 \ -cargo bench -p pegainfer-qwen3-4b --bench qwen3_attention -- --noplot -``` - -Key cold-L2 CUPTI rows at `bs=1,ctx=10000`: - -| Path | GPU time | DRAM read | DRAM total | DRAM GB/s | Peak % | KV read / DRAM read | -| --- | ---: | ---: | ---: | ---: | ---: | ---: | -| non-partition | `425.920us` | `41.028MB` | `55.123MB` | `129.421` | `7.22%` | `99.83%` | -| split-K 512/64 | `76.896us` | `41.019MB` | `54.431MB` | `707.849` | `39.50%` | `99.86%` | -| split-K 256/64 | `66.976us` | `41.020MB` | `48.712MB` | `727.301` | `40.58%` | `99.85%` | - -Interpretation: FlashInfer is not rereading KV many times from DRAM. The non-partition path reads roughly the theoretical KV bytes but uses only about `7%` of peak DRAM bandwidth because `bs=1` launches too little work. Split-K increases parallelism and moves the same required KV read to roughly `40%` of peak DRAM bandwidth in the cold-L2 CUPTI range. - -Split-K sweep command: - -```bash -PEGAINFER_CUDA_SM=120 \ -PEGAINFER_QWEN3_ATTENTION_REPORT_ONLY=1 \ -PEGAINFER_QWEN3_ATTENTION_SPLITK_SWEEP=1 \ -cargo bench -p pegainfer-qwen3-4b --bench qwen3_attention -- --noplot -``` - -Representative cold-L2 sweep rows: - -| Case | 256/64 | 512/64 | Result | -| --- | ---: | ---: | --- | -| `bs1 ctx1024` | `22.197us` | `30.823us` | 256/64 wins | -| `bs1 ctx4096` | `26.703us` | `35.159us` | 256/64 wins | -| `bs1 ctx10000` | `38.912us` | `46.824us` | 256/64 wins | -| `bs2 ctx1024` | `23.713us` | `34.705us` | 256/64 wins | -| `bs2 ctx8192` | `54.637us` | `55.200us` | tied, 256/64 slightly ahead | -| `bs2 ctx10000` | `62.417us` | `63.620us` | tied, 256/64 slightly ahead | - -Runtime change: `BatchDecodeBuffers` now uses `SPLIT_KV_CHUNK_TOKENS=256` with `SPLIT_KV_MAX_CHUNKS_PER_REQUEST=64`. This keeps the same graph-stable padded slot budget as the original `512/64` integration, while doubling active chunks for low-batch long-context decode. - -Production decode probe after retune: - -```bash -PEGAINFER_CUDA_SM=120 cargo build --release \ - -p pegainfer-qwen3-4b --bin qwen3_decode_context -PEGAINFER_CUDA_SM=120 target/release/qwen3_decode_context \ - --model-path \ - --iters 10 \ - --contexts 1024,4096,10000 -``` - -| Prompt context | Decode KV len | p50 TPOT | -| --- | ---: | ---: | -| 1024 | 1025 | `6.4002ms` | -| 4096 | 4097 | `6.5327ms` | -| 10000 | 10001 | `7.0436ms` | - -Serving check after syncing the root `src/` worktree on the CUDA validation host: - -```bash -RUST_LOG=warn PEGAINFER_CUDA_SM=120 cargo run --release \ - --bin bench_serving -- \ - --model-path \ - request --prompt-len 4096 --output-len 64 --warmup 5 --iters 20 -``` - -| Metric | p50 | p95 | Samples | -| --- | ---: | ---: | ---: | -| `ttft_ms` | `177.21ms` | `177.86ms` | 20 | -| `first_decode_step_ms` | `6.51ms` | `6.52ms` | 20 | -| `steady_tpot_ms` | `6.46ms` | `6.48ms` | 1240 | -| `e2e_ms` | `584.79ms` | `585.25ms` | 20 | - -Verification: - -| Check | Result | -| --- | --- | -| `PEGAINFER_CUDA_SM=120 cargo clippy --release -p pegainfer-cupti -p pegainfer-qwen3-4b --bench qwen3_attention -- -D warnings` | pass | -| `PEGAINFER_CUDA_SM=120 PEGAINFER_TEST_MODEL_PATH= cargo test --release -p pegainfer-qwen3-4b --test e2e -- --nocapture` | pass | -| `PEGAINFER_CUDA_SM=120 PEGAINFER_QWEN3_ATTENTION_REPORT_ONLY=1 cargo bench -p pegainfer-qwen3-4b --bench qwen3_attention -- --noplot` | pass | -| `cargo fmt --all --check` | pass | -| `cargo metadata --no-deps --format-version 1` | pass | -| `git diff --check` | pass | - -Note: an initial remote e2e run failed because the remote `test_data/Qwen3-4B.json` was stale and expected the pre length-limit baseline. Syncing the tracked baseline fixed it; this was not a split-K numerical drift. - -### Step 13: Kernel Snapshot MVP -- Extracted the Qwen3 paged decode attention case construction into `crates/pegainfer-qwen3-4b/src/kernel_bench.rs`. -- Added `crates/pegainfer-qwen3-4b/benches/qwen3_kernel_snapshot.rs` as a deterministic `harness=false` runner. -- Removed the temporary correctness envelope from the snapshot runner. We do not have a settled truth source for this layer yet, so correctness belongs in a separate design rather than a misleading "non-partition equals truth" field. -- CUPTI is default-on in the snapshot runner. `--no-cupti` is available only for latency-only smoke runs. - -Snapshot command: - -```bash -PEGAINFER_CUDA_SM=120 cargo bench -p pegainfer-qwen3-4b \ - --bench qwen3_kernel_snapshot -- \ - run \ - --contexts 1024 \ - --batch-sizes 1 \ - --variants non_partition,split_kv_256x64 \ - --iters 4 \ - --out $RESULT_ROOT/qwen3_kernel_snapshot_smoke.json -``` - -Compare command: - -```bash -cargo bench -p pegainfer-qwen3-4b \ - --bench qwen3_kernel_snapshot -- \ - compare \ - --base $RESULT_ROOT/qwen3_kernel_snapshot_smoke.json \ - --new $RESULT_ROOT/qwen3_kernel_snapshot_smoke.json -``` - -The JSON snapshot records: -- model/op identity: `qwen3-4b`, `paged_decode_attention` -- hardware: GPU name, compute capability, memory clock, memory bus width, theoretical peak bandwidth, L2 size, cache-clear size -- measurement recipe: warm iters, cold-L2 iters, `INNER_LAUNCHES` -- CUPTI recipe: enabled flag and metric list -- per-case shape: batch, KV length, head shape, page size, dtype -- per-case variant/params: non-partition or split-K `chunk_tokens/max_chunks` -- warm and cold-L2 CUDA event latency -- CUPTI counters: GPU time, DRAM read/write/total bytes, L2 bytes, SM throughput percentage, active-warp percentage, DRAM bandwidth, peak percentage, and theoretical KV-read over DRAM-read percentage -- theoretical KV read bytes - -GPU smoke result for `bs=1,ctx=1024,iters=4`: - -| Variant | Warm | Cold-L2 | CUPTI GPU | CUPTI DRAM read | SM throughput | Active warps | -| --- | ---: | ---: | ---: | ---: | ---: | ---: | -| `non_partition` | about `35us` | about `46us` | `75.776us` | `4.236MB` | `0.75%` | `8.27%` | -| `split_kv_256x64` | about `13us` | about `21us` | `48.736us` | `4.249MB` | `1.31%` | `11.77%` | - -Snapshot compare result: - -```text -kernel snapshot compare complete: warnings=0 failures=0 -``` - -CUPTI note: the standalone snapshot runner originally crashed inside `libnvperf_host.so` at `NVPW_CUDA_Profiler_DecodeCounters`. The root cause was the verbose user range name, not the attention case or Rust callback trampoline. The fix is to use compact range names such as `qk/non_partition/b1/k1024` and keep full metadata in JSON fields. The first profiled launch also needs an unprofiled warmup launch; otherwise CUDA lazy initialization pollutes the first CUPTI GPU time. - -```bash -PEGAINFER_CUDA_SM=120 cargo bench -p pegainfer-qwen3-4b \ - --bench qwen3_kernel_snapshot -- \ - run \ - --contexts 1024 \ - --batch-sizes 1 \ - --variants non_partition,split_kv_256x64 \ - --iters 4 \ - --out $RESULT_ROOT/qwen3_kernel_snapshot_cupti_smoke.json -``` - -Verification: - -| Check | Result | -| --- | --- | -| `PEGAINFER_CUDA_SM=120 cargo clippy --release -p pegainfer-cupti -p pegainfer-qwen3-4b --bench qwen3_kernel_snapshot -- -D warnings` | pass | -| `PEGAINFER_CUDA_SM=120 PEGAINFER_TEST_MODEL_PATH= cargo test --release -p pegainfer-qwen3-4b --test e2e -- --nocapture` | pass | -| `PEGAINFER_CUDA_SM=120 cargo bench -p pegainfer-qwen3-4b --bench qwen3_kernel_snapshot -- run --contexts 1024 --batch-sizes 1 --variants non_partition,split_kv_256x64 --iters 4 --out $RESULT_ROOT/qwen3_kernel_snapshot_cupti_smoke.json` | pass | - -The SM counters are intentionally minimal. `sm__throughput.avg.pct_of_peak_sustained_elapsed` shows whether SMs are busy over elapsed time; `smsp__warps_active.avg.pct_of_peak_sustained_active` shows active-warp residency while SM partitions are active. At `bs=1,ctx=10000`, non-partition measured `1.19%` SM throughput and `6.59%` DRAM peak, while split-K measured `8.74%` SM throughput and `41.06%` DRAM peak for nearly identical DRAM read bytes. That is the kernel snapshot evidence for low-batch underfill. - -### Step 14: Consolidate Bench Entry Points -- Deleted the retired Criterion benches: - - `crates/pegainfer-qwen3-4b/benches/qwen3_runtime.rs` - - `crates/pegainfer-qwen3-4b/benches/qwen3_attention.rs` -- Removed their `[[bench]]` entries and the Qwen3 crate-local `criterion` dev dependency. -- Qwen3 now has exactly one model-crate bench entry: `qwen3_kernel_snapshot`. -- Rationale: the human CSV report, split-K tuning sweep, and machine-readable JSON runner were duplicating case construction, metric selection, and interpretation. Kernel maintenance should have one durable artifact first; optional human views should be generated from snapshot data rather than maintained as separate benches. - -Verification after consolidation: - -| Check | Result | -| --- | --- | -| `cargo fmt --all --check` | pass | -| `cargo metadata --no-deps --format-version 1` | pass | -| `git diff --check` | pass | -| `PEGAINFER_CUDA_SM=120 cargo check --release -p pegainfer-qwen3-4b --bench qwen3_kernel_snapshot` on the CUDA validation host | pass | -| `PEGAINFER_CUDA_SM=120 cargo clippy --release -p pegainfer-cupti -p pegainfer-qwen3-4b --bench qwen3_kernel_snapshot -- -D warnings` on the CUDA validation host | pass | -| `PEGAINFER_CUDA_SM=120 cargo bench -p pegainfer-qwen3-4b --bench qwen3_kernel_snapshot -- run --contexts 1024 --batch-sizes 1 --variants non_partition,split_kv_256x64 --iters 4 --out $RESULT_ROOT/qwen3_kernel_snapshot_single_bench_smoke.json` on the CUDA validation host | pass | -| `PEGAINFER_CUDA_SM=120 cargo bench -p pegainfer-qwen3-4b --bench qwen3_kernel_snapshot -- compare --base $RESULT_ROOT/qwen3_kernel_snapshot_single_bench_smoke.json --new $RESULT_ROOT/qwen3_kernel_snapshot_single_bench_smoke.json` on the CUDA validation host | pass | - -## Debrief - -The Qwen3 split now enforces the intended dependency direction: model execution code depends on `pegainfer-core` and `pegainfer-kernels`; root depends on the model crate only at registry/startup glue points. Root still has a `ModelType::Qwen3` enum and default Qwen3 model path because the product needs a loader choice, but it no longer sees Qwen3 layers, KV state, TP rank workers, or prefill/decode/unified plans. - -Next cleanup should be a generic model registry module so `main.rs` and `bench_serving.rs` stop matching model crate names directly. Performance-wise, the next target is the post-split decode body: GEMM/GEMV, MLP, norms, logits, sampling, and graph replay overhead now dominate the remaining `~6.5-7.0ms` TPOT. Kernel DevOps-wise, the next target is defining a real correctness/truth source for kernel snapshots instead of treating one implementation path as the oracle. diff --git a/docs/models/qwen3/roadmap.md b/docs/models/qwen3/roadmap.md index 8d71059c..c747d036 100644 --- a/docs/models/qwen3/roadmap.md +++ b/docs/models/qwen3/roadmap.md @@ -1,6 +1,6 @@ # Qwen3-4B Roadmap -> **TL;DR:** Qwen3-4B is the maturity bar of the project — continuous batching, TP=2, default-on prefix cache (#216), and the HF logits golden gate are all live — so its roadmap is sharpening, not bring-up. The verified open set: one real correctness bug (#220 RoPE cache 4096 vs 40960 admitted, silent OOB), per-row batch-decode sampling (O(batch) launches + syncs per step despite a production-proven batched primitive in-tree), zero TP correctness coverage, LoRA built but gated only by a zero-adapter smoke, prefix-cache observability dropped at the scheduler boundary, and a docs layer that describes deleted tooling. Findings verified 2026-06-04 against `6ee9247`. +> **TL;DR:** Qwen3-4B is the maturity bar of the project — continuous batching, TP=2, default-on prefix cache (#216), and the HF logits golden gate are all live — so its roadmap is sharpening, not bring-up. The verified open set: one real correctness bug (#220 RoPE cache 4096 vs 40960 admitted, silent OOB), per-row batch-decode sampling (O(batch) launches + syncs per step despite a production-proven batched primitive in-tree), zero TP correctness coverage, LoRA built but gated only by a zero-adapter smoke, and prefix-cache observability dropped at the scheduler boundary. Findings verified 2026-06-04 against `6ee9247`; the docs/dead-code ledger (#248) was cleared 2026-06-07. > > **Last touched:** 2026-06 @@ -46,9 +46,9 @@ Tracking issue: see the `[Model] Qwen3-4B roadmap` GitHub issue. Cross-model ite ## Cleanup ledger - **Issue hygiene:** #188 references a test target deleted in #194 — close as superseded by the golden gate. #203 §1 still claims qwen3 has no prefix reuse — stale since #216. -- **Dead code:** `batch_decode_trace.rs` `HIDDEN_SIZE`/`INTERMEDIATE_SIZE` consts (pub, zero readers); qwen3 `probe_model()`+`ModelInfo` remain uncalled (server inlines its own detection — qwen35's matching dead pair was removed in #258). - **File size:** `executor.rs` (1435), `scheduler.rs` (1420, ~826 of them inline tests), `kernel_bench.rs` (1112) breach the 1k-line redline. -- **Docs:** `model-crate.md` TL;DR advertises a deleted `qwen3_kernel_snapshot` bench and, with `kernels-crate.md`, uses the obsolete `crates/` layout in every command — collapse both into one slim layout doc. `tp-design.md` describes the implemented controller/worker runtime as future direction — rewrite to past tense, promote the 3 real open items. `kv-pressure-hang.md` — lift the KV-lifetime-reservation lessons to `docs/lessons/`, then delete. `execution.md` Done list predates #216. +- **Docs:** `execution.md` Done list predates #216. +- ~~Dead code + stale qwen3 docs~~ — cleared by #248: `HIDDEN_SIZE`/`INTERMEDIATE_SIZE` consts and `probe_model()`+`ModelInfo` removed; `model-crate.md`+`kernels-crate.md` collapsed into `crate-layout.md`; `tp-design.md` rewritten around the implemented runtime; `kv-pressure-hang.md` lifted to `lessons/kv-full-lifetime-reservation.md` and deleted. ## Done criteria diff --git a/docs/models/qwen3/tp-design.md b/docs/models/qwen3/tp-design.md index fd74c09a..7485b280 100644 --- a/docs/models/qwen3/tp-design.md +++ b/docs/models/qwen3/tp-design.md @@ -1,765 +1,60 @@ -# Qwen3 Tensor Parallelism Design +# Qwen3 Tensor Parallelism -> **TL;DR:** Add `TP=2` support for `Qwen3-4B` as the first model-parallel milestone. The goal is correctness and a clean architectural foundation for larger dense models and future MoE work, with the runtime moving toward a controller-plus-workers broadcast execution model instead of scheduler-owned cross-thread mutable state. +> **TL;DR:** The TP runtime is implemented, not a plan. Qwen3-4B runs `TP=2` end-to-end (TP=8 smoke-tested on 8×4090) through a controller/worker broadcast model: every rank — including rank 0 — executes on a `RankWorker` thread under a coarse-grained `StepCommand` protocol, and the scheduler loop is plan → execute → resolve → apply (`scheduler/{plan,resolve,effects}.rs`). The real open remainder: zero automated TP correctness coverage, replicated (not vocab-parallel) embedding/lm_head, and no TP CUDA-graph path. > -> **Status:** Active. `Qwen3-4B` has now been brought up end-to-end with `TP=2` on a single machine. `TP=8` has also been smoke-tested on an 8x4090 host, but the implementation still carries first-pass runtime debt and has not yet gone through systematic correctness validation. +> **Last touched:** 2026-06 -## Goal +## Open items -Add tensor parallelism for `Qwen3-4B` with a narrow and explicit first target: +These are the three real gaps, in priority order (sequenced in `roadmap.md`): -- support `TP=2` -- preserve `TP=1` -- keep the milestone focused on model-parallel correctness -- establish the right abstractions for later large-model and MoE work +1. **TP correctness coverage.** Every test in the crate runs `device_ordinals: vec![0]`. A reduction-order or shard-offset bug is invisible to every gate. The step is running the existing HF golden gate over `device_ordinals [0,1]` (skip when <2 GPUs), then a systematic TP=8 pass — TP=8 today is only "loads, serves, non-degenerate text" on an 8×4090 host. +2. **Vocab-parallel embedding / lm_head.** Both are replicated per rank by first-pass design. Fine for 4B; becomes the memory bottleneck for larger dense models. +3. **TP CUDA-graph.** Decode graph capture exists only on the single-GPU path; TP decode runs eager. Deferred deliberately until the runtime shape stabilized — it has. -This milestone is about making pegainfer capable of serving a single model replica across two GPUs. It is not about multi-replica throughput scaling. +## Execution model (as implemented, `executor.rs`) -## Why This Matters +One controller decides each step; all ranks execute it under an ordered broadcast: -Large-model serving and MoE serving both require model-parallel building blocks. For pegainfer, tensor parallelism is the first such building block. +- `Qwen3Executor` owns a `primary: RankWorker` plus `workers: Vec`. Rank 0 is not special-cased onto the scheduler thread — it executes on the primary worker thread under the same protocol. +- `StepCommand` is coarse-grained and step-oriented: `Prefill { requests, kv_views, echo }`, `Decode { requests, kv_views }`, `Unified { ... }`. KV mutation details stay inside a step; there are no low-level `EnsureCapacity`/`Advance`-style protocol messages. +- Requests are identified by `RequestId(u64)` from a monotone counter — never by slot indices or parallel-vector alignment. +- Barrier semantics: no worker starts command `N+1` until all workers finished `N`. +- Result flow is asymmetric: non-primary workers return ack/failure only; the primary worker returns the step artifacts. +- Sampling policy (params, RNG inputs) is controller-owned and travels with step items; GPU sampling and logprob extraction execute worker-side. +- Each worker owns its rank-local state: model shard, decode buffers, scratch. The executor owns the KV manager and per-request KV; the scheduler owns request lifecycle (streaming handles, finish bookkeeping, admission). -The immediate value is: +The rejected alternative — scheduler-owned rank-local mutable state with worker threads borrowing `&mut` into it via pointer wrappers — was the bring-up shape and was deliberately removed. TP is a replicated-local-state problem, not a shared-mutable-state problem. -- larger dense models become reachable without forcing a single-GPU fit -- the runtime starts carrying rank-local weights and rank-local execution state -- later MoE work can build on model-parallel foundations instead of retrofitting them into a single-GPU design +### Scheduler boundaries (as implemented, `scheduler/`) -## Scope +The scheduler loop is structured around three step-scoped boundary types: -This first pass is intentionally narrow: +| Boundary | File | Role | +| --- | --- | --- | +| `ExecutionPlan` | `scheduler/plan.rs` | what runs this step (kind + participating requests) | +| `ExecutionArtifacts` | `scheduler/plan.rs` | raw executor products, before lifecycle interpretation | +| `StepEffects` | `scheduler/effects.rs` | lifecycle transitions + token events, applied to scheduler state | -- model: `Qwen3-4B` -- parallel degree: `TP=2` -- focus: correctness and architecture -- deployment target: a single machine +`resolve_step` (`scheduler/resolve.rs`) turns artifacts into effects; `apply_effects` mutates scheduler-owned state. The split isolates the three independent change vectors: batching/admission policy → scheduler, parallel execution strategy → executor, sampling/logprobs/finish semantics → resolver. -The first milestone does not need to solve every parallelism problem. It needs to prove that pegainfer can run one dense model replica across two GPUs without breaking correctness or making the architecture harder to evolve. +## Partitioning spec (reference) -## Design Constraints +Standard dense-model TP layout (vLLM/SGLang-style): attention partitioned by head, MLP by intermediate dim, one all-reduce after attention output projection and one after MLP down projection per layer. Embedding and tied lm_head replicated (open item 2). -- Tensor parallelism must be model-parallel, not data-parallel in disguise. -- The design must serve future larger dense models and MoE, not only Qwen3-4B. -- `TP=1` must remain a supported and healthy path. -- The first version may simplify some weight handling, but those simplifications must not block later movement toward a more fully sharded design. -- The external user experience should stay simple. Tensor parallel support should not require the user to reason about a heavyweight distributed system. -- The design should avoid coupling the abstraction too tightly to one specific kernel path or one specific Qwen3 implementation detail. +Qwen3-4B at `TP=2` (`hidden=2560`, `q_heads=32`, `kv_heads=8`, `head_dim=128`, `intermediate=9728`): -## Explicit Non-Goals +| Tensor | Global | Local per rank | +| --- | --- | --- | +| fused `qkv_proj` | `[6144, 2560]` | `[3072, 2560]` (16 q heads + 4 kv heads, head-aligned slices) | +| `o_proj` | row-parallel | partial hidden, all-reduced | +| fused `gate_up_proj` | `[19456, 2560]` | `[9728, 2560]` (intermediate 4864) | +| `down_proj` | row-parallel | partial hidden, all-reduced | -The first TP milestone does not aim to do the following: +Divisibility (`q_heads % tp == 0`, `kv_heads % tp == 0`, `intermediate % tp == 0`) is a hard requirement, not an accident. -- support `Qwen3.5` -- support MoE expert parallelism -- support data parallelism -- support pipeline parallelism -- preserve current CUDA Graph behavior from day one -- introduce vocab-parallel embedding or vocab-parallel `lm_head` in the first pass -- optimize every path for peak throughput before the basic design is proven correct +## Bring-up hazards (fixed, kept for the next model-parallel bring-up) -## Simplifications Allowed In First Pass - -The first pass is allowed to trade some efficiency for speed of validation, as long as the trade does not poison the long-term design. - -Allowed simplifications: - -- tied embedding / `lm_head` may be replicated instead of sharded -- some paths may be brought up incrementally as long as the final milestone still has a clear acceptance target -- correctness and maintainable abstraction take priority over immediate performance parity - -These are first-pass scope controls, not permanent architectural commitments. - -## Tensor-Parallel Partitioning Spec - -The first `Qwen3-4B` tensor-parallel milestone should follow the mainstream dense-model TP layout used by systems such as vLLM and SGLang: - -- attention projections are partitioned by head -- MLP projections are partitioned by intermediate dimension -- layer outputs that rejoin the residual stream are reduced across ranks -- tied embedding / `lm_head` is replicated in the first pass - -This is the intended layout for `TP=2`. - -### Qwen3-4B Local Dimensions At TP=2 - -`Qwen3-4B` runtime dimensions: - -- `hidden_size = 2560` -- `num_attention_heads = 32` -- `num_key_value_heads = 8` -- `head_dim = 128` -- `intermediate_size = 9728` - -Under `TP=2`, the local dimensions per rank are: - -- local query heads: `16` -- local KV heads: `4` -- local query projection dim: `16 * 128 = 2048` -- local KV projection dim: `4 * 128 = 512` -- local intermediate dim: `9728 / 2 = 4864` - -The first pass should explicitly require divisibility for these dimensions. If a model does not divide cleanly across TP ranks, it is out of scope for this milestone. - -### Attention Projection Layout - -For Qwen3 attention, the fused `qkv_proj` should be partitioned by head-aligned output slices. - -Global layout: - -- `qkv_proj`: `[q_dim + 2 * kv_dim, hidden_size]` -- for Qwen3-4B: `[4096 + 1024 + 1024, 2560] = [6144, 2560]` - -Local layout at `TP=2`: - -- local `q_proj`: `[2048, 2560]` -- local `k_proj`: `[512, 2560]` -- local `v_proj`: `[512, 2560]` -- local fused `qkv_proj`: `[3072, 2560]` - -This partitioning is semantic, not just row-chunking by index. Each rank owns a contiguous subset of query heads and KV heads. - -### Attention Output Projection Layout - -The output projection should follow the standard row-parallel pattern. - -- each rank consumes its local attention output -- each rank produces a partial hidden-state contribution -- the partial hidden states are combined with an `all-reduce` - -The residual stream after this reduction is logically full-width hidden state. - -### MLP Projection Layout - -The MLP should be partitioned by intermediate dimension. - -For Qwen3-4B: - -- global `intermediate_size = 9728` -- local `intermediate_size = 4864` at `TP=2` - -Projection layout: - -- `gate_up_proj` is column-parallel over intermediate dimension -- `down_proj` is row-parallel and its outputs are combined with an `all-reduce` - -With the current fused MLP layout, each rank owns its local fused gate/up rows: - -- global fused `gate_up_proj`: `[2 * 9728, 2560] = [19456, 2560]` -- local fused `gate_up_proj`: `[2 * 4864, 2560] = [9728, 2560]` - -As with attention, the residual stream after the MLP output reduction is logically full-width hidden state. - -### Embedding And LM Head - -For the first pass: - -- token embedding is replicated -- tied `lm_head` is replicated - -This is a deliberate simplification for the initial milestone. It is acceptable because the goal of the first pass is to validate model-parallel execution and establish the TP boundary, not to fully optimize vocab-side memory layout on day one. - -### Communication Points - -For the first dense TP pass, the communication pattern should stay minimal. - -Per transformer layer, the expected TP collectives are: - -- one `all-reduce` after attention output projection -- one `all-reduce` after MLP output projection - -No additional collective requirements are introduced by the first-pass embedding / `lm_head` choice. - -### Runtime Bring-Up Notes - -The first end-to-end `TP=2` bring-up exposed a few concrete runtime hazards that are worth recording because they are not obvious from the high-level TP partitioning design alone. - -- `cuBLAS` handle and workspace state must not be process-global when TP ranks execute on different GPUs from different threads -- TP worker threads must explicitly bind both the CUDA runtime device and the driver context before using cuBLAS, FlashInfer, or NCCL - -In practice, the initial TP implementation initially hit: - -- an illegal memory access reported later in `paged_kv_scatter_cuda` -- intermittent hangs after some requests had already succeeded - -The root cause was not the scheduler boundary. It was runtime state management. - -These issues have now been fixed in the current bring-up implementation: - -- `cuBLAS` handles and workspaces needed to become thread-local -- TP worker threads needed explicit per-thread device binding before GPU work -- request-scoped worker-thread cuBLAS resources needed explicit teardown so repeated TP requests did not accumulate unstable per-thread state - -The later TP correctness pass exposed a separate decode-path bug that should be recorded explicitly: - -- decode was using a specialized paged KV append path that did not stay aligned with the generic paged scatter semantics used by prefill -- this caused decode-built KV state to drift from a fresh prefill-built KV state for the same logical prefix -- the fix was to stop using the decode-only append path and route decode KV writes through the same explicit paged scatter path used elsewhere - -That decode-state corruption bug is now fixed. The remaining TP correctness work is narrower than the original bring-up failures. - -This means the first-pass TP executor is now correct enough to run end-to-end, but the runtime shape is still more fragile than the eventual target design. - -### First-Pass Validity Constraints - -The `Qwen3-4B, TP=2` first pass assumes: - -- `num_attention_heads % tp_size == 0` -- `num_key_value_heads % tp_size == 0` -- `intermediate_size % tp_size == 0` - -These constraints are part of the milestone definition, not an implementation accident. - -## ModelExecutor Abstraction - -The next architectural step should be extracting a synchronous `ModelExecutor` boundary from the current scheduler-owned execution path. - -This is the key abstraction for future model-execution strategies: - -- single GPU execution -- tensor parallel execution -- later tensor-parallel plus expert-parallel execution - -It should be the execution abstraction for one logical model replica. It should not become the abstraction for request queueing, service-layer data parallelism, or cluster orchestration. - -### Why This Is The Next Step - -The current scheduler owns both: - -- control-plane logic such as active/deferred request management, admission control, and token streaming -- execution-plane logic such as prefill, decode, and unified-step GPU execution - -Tensor parallelism changes the execution plane much more than it changes the control plane. - -So the right next step is not a new scheduler design. The right next step is to extract the execution plane behind a stable executor interface while keeping the scheduler responsible for request lifecycle, KV allocation, and batching policy. - -### Control Plane Versus Execution Plane - -The scheduler should remain the control plane. - -The scheduler should continue to own: - -- request queueing -- active / deferred request lifecycle -- admission control -- `KvPool` -- KV page allocation and recycling -- deciding whether the next step is prefill, decode, or unified -- sampling policy -- token streaming -- finish reasons -- HTTP / API semantics - -The executor should become the execution plane. - -The executor should own: - -- model weights -- shared execution resources such as decode buffers, graph state, and TP communication state -- the implementation of batch-level prefill, decode, and unified-step execution - -This means the scheduler decides what batch should run next, while the executor decides how to execute that batch on the underlying device topology. - -### Request-Owned Versus Executor-Owned State - -The scheduler should continue to own request lifecycle state. - -Examples: - -- KV allocation state -- page lists or page-table metadata -- request-local sequence lengths -- last-token bookkeeping and generation counters - -The executor should own shared execution resources. - -Examples: - -- model weights -- shared decode buffers -- CUDA graph state -- TP communication state -- rank-local scratch buffers - -This split keeps admission control and KV budgeting where they already belong, while still moving model execution out of the scheduler. - -### Interface Shape - -The interface should be batch-step oriented, not kernel oriented. - -The scheduler should build a batch specification for one of the existing runtime step types: - -- prefill batch -- decode batch -- unified step - -The executor should synchronously execute that batch specification and return the outputs needed for scheduler-side post-processing. - -The first version should stay synchronous. - -The current runtime already serializes GPU ownership through the scheduler thread, and the next architectural goal is to separate responsibilities, not to introduce a second concurrency model. - -### Batch Specification - -The batch specification should describe one execution step, not a whole request lifecycle. - -The important information is: - -- which requests participate in the step -- whether the step is prefill, decode, or unified -- the input tokens for each request -- mutable references to request-owned execution state -- KV page metadata or equivalent scheduler-owned KV views needed by the kernels - -This keeps the executor narrow: it consumes one scheduler-chosen batch plan and executes it. - -### Three Runtime Boundaries - -The point of the next refactor is not to make the scheduler look smaller on paper. - -The point is to isolate three different responsibilities that already change independently: - -- control-plane step selection -- model execution -- step-result resolution back into request lifecycle state - -Those responsibilities should be represented by three explicit boundary types. - -#### `ExecutionPlan` - -`ExecutionPlan` is the boundary between the scheduler and the executor. - -Its job is to describe what should run in this step. - -It should contain: - -- the step kind -- the participating prefill and decode requests -- the input tokens or prompt slices for those requests -- mutable references or views into scheduler-owned request execution state such as `KvState` -- any per-step ordering or indexing the executor needs - -It should not contain: - -- finish reasons -- token streaming semantics -- HTTP / API semantics -- admission policy -- executor-internal resource ownership - -Ownership model: - -- owned and constructed by the scheduler -- consumed by the executor -- valid only for one execution step - -#### `ExecutionArtifacts` - -`ExecutionArtifacts` is the boundary between the executor and the step-result resolver. - -Its job is to describe what the executor produced, before those results are interpreted as request lifecycle outcomes. - -Examples: - -- prefill logits -- unified-step prefill and decode logits -- an executor-owned decode-step view over batched decode buffers - -It should contain raw execution products and executor-owned views. - -It should not contain: - -- finish reasons -- retirement decisions -- request promotions -- token events - -Ownership model: - -- produced by the executor -- consumed by a scheduler-local result-resolution layer -- short-lived, step-scoped - -#### `StepEffects` - -`StepEffects` is the boundary between the step-result resolver and the scheduler's long-lived request state. - -Its job is to describe how this step changes request lifecycle state. - -Examples: - -- which pending requests become active -- which active requests retire -- which token events should be emitted -- prompt echo payloads -- finish reasons -- updates to `last_token` and generation counters - -It should contain scheduler-facing state transitions and event payloads. - -It should not contain: - -- raw executor buffer views -- CUDA graph state -- TP communication state -- executor-owned staging resources - -Ownership model: - -- produced by the step-result resolver -- applied by the scheduler -- step-scoped, but expressed in scheduler terms rather than executor terms - -### Why These Boundaries Matter - -This is not abstraction for its own sake. - -These boundaries isolate three different change vectors: - -- batching and admission policy change the scheduler -- TP and other model-parallel execution strategies change the executor -- sampling, logprobs, echo, and finish handling change step-result resolution - -## Controller / Worker Broadcast Execution Model - -The execution model for Qwen3 tensor parallelism should now be treated as part of the main TP design, not as a separate note. - -The core idea is: - -- one controller decides the next step -- one primary worker plus zero or more additional rank workers execute it -- an ordered broadcast command stream keeps worker-local state synchronized - -This is the direction we want for the steady-state runtime. The earlier shape where the scheduler owned rank-local mutable state and worker threads borrowed into it was acceptable for bring-up, but it should not be the long-term architecture. - -### What We Are Rejecting - -We do not want the long-term design to rely on: - -- the scheduler thread owning all rank-local KV or execution objects -- worker threads borrowing `&mut` access into those objects -- raw pointer wrappers and timing assumptions as the main cross-thread correctness mechanism - -Tensor parallelism is a replicated local-state problem, not a shared-mutable-state problem. - -### Controller Responsibilities - -The controller is the only place that decides: - -- which requests participate in the next step -- whether the next step is prefill, decode, or unified -- which high-level lifecycle transitions should happen -- which ordered command should be broadcast next - -The controller should not directly execute GPU work, including rank 0 work. Rank 0 should execute on the primary worker thread under the same protocol as the additional ranks. - -### Worker Responsibilities - -Each worker owns its rank-local execution state, including: - -- its local model shard -- its local decode buffers -- its local KV state -- its local per-request execution state keyed by request identity - -The scheduler should continue to own user-facing lifecycle state such as streaming handles, sampling params, generation counters, and finish bookkeeping. This preserves the existing control-plane role while moving rank-local execution state to the workers. - -### Request Identity - -The broadcast protocol should use an explicit process-local request identity: - -- `RequestId(u64)` - -The controller assigns it from a monotonically increasing counter. Protocol messages should identify requests by `RequestId`, not by slot indices or by aligning multiple parallel vectors. - -### Command Protocol - -The protocol should stay coarse-grained and step-oriented. The primary commands are: - -- `RunPrefillStep` -- `RunDecodeStep` -- `RunUnifiedStep` -- `DropRequest` -- `Shutdown` - -We explicitly prefer this over exposing low-level commands such as `EnsureCapacity`, `Advance`, or `Reset` as first-class protocol messages. Internal KV mutation details should remain implementation details of one step whenever possible. - -For step payloads, the command shape should stay request-oriented: - -- `RunPrefillStep { requests: Vec, echo: bool }` -- `RunDecodeStep { requests: Vec }` -- `RunUnifiedStep { prefill: Vec, decode: Vec }` - -At minimum: - -- `PrefillStepItem` contains `request_id` and prompt tokens -- `DecodeStepItem` contains `request_id` and the decode token - -### Synchronization Rule - -The runtime should obey one simple rule: - -- no worker starts command `N + 1` until all workers have finished command `N` - -This gives deterministic ordering, simpler failure handling, and avoids reintroducing cross-thread mutable-borrow coupling through the side door. - -### Execution Shape - -Qwen3 should keep one executor shape: - -- one `Qwen3Executor` -- one primary local lane -- zero or more additional rank workers - -Under this model: - -- single-GPU execution is the `tp_size == 1` case -- tensor parallel execution is the `tp_size > 1` case - -The controller-side protocol should not split into unrelated single-GPU and TP command families. Both modes should use the same coarse-grained `StepCommand`, with broadcast fanout degenerating naturally to the single-worker case. - -### Result Ownership - -For now, result flow should stay asymmetric: - -- non-primary workers return acknowledgement or step failure only -- the primary worker returns the step artifacts needed by the controller - -This keeps workers responsible for local execution while the controller remains responsible for resolving execution artifacts into scheduler-visible effects. - -### Sampling Ownership - -Sampling is split into two responsibilities: - -- the controller owns sampling policy and random input generation -- the primary worker executes GPU sampling and logprob extraction - -Concretely, `SamplingParams` and per-step random values travel with the step items, worker threads run the GPU sampling path, and the controller consumes CPU-visible step artifacts after execution. - -### Request Destruction - -Request destruction should remain an explicit protocol action: - -- `DropRequest { request_id }` - -This keeps lifecycle transitions visible at the command layer instead of hiding them inside unrelated step commands. - -### Near-Term Cleanup Direction - -For the next cleanup passes, we should bias toward: - -- worker-owned rank-local state -- broadcast command protocol -- explicit barrier semantics -- request-oriented payloads - -We should bias away from: - -- controller ownership of rank-local execution objects -- command payloads that smuggle `&mut` semantics through raw pointers -- designs that depend on multiple parallel vectors and positional alignment to identify one logical request - -Without these boundaries, those changes accumulate in the same scheduler functions and the runtime becomes harder to extend in predictable ways. - -### Scheduler-Local Result Resolution - -Not all logic currently living in the scheduler should move into the executor. - -There is a third layer that should remain scheduler-local, but should not stay inline inside the scheduler loop. - -That layer is step-result resolution. - -It should be responsible for: - -- first-token handling after prefill or unified execution -- decode-token handling after decode or unified execution -- logprob assembly -- prompt-echo assembly -- EOS / max-length / consumer-drop retirement decisions -- promotion of newly-prefilled requests into the active set - -This logic is scheduler policy, not model execution. - -So the right direction is: - -- keep execution in `ModelExecutor` -- keep request lifecycle ownership in the scheduler -- move step-result interpretation into a scheduler-local resolver layer - -### Next Refactor Shape - -The next step after introducing `ModelExecutor` should be to restructure the scheduler around these boundaries: - -1. build an `ExecutionPlan` -2. execute it to produce `ExecutionArtifacts` -3. resolve those artifacts into `StepEffects` -4. apply those effects to scheduler-owned state - -Conceptually: - -```rust -loop { - let plan = build_next_plan(...); - let artifacts = executor.execute(plan)?; - let effects = resolve_step(plan, artifacts, ...); - apply_effects(effects, ...); -} -``` - -The primary value of this refactor is responsibility isolation, not reducing the line count of `scheduler.rs` by itself. - -Current status after the latest TP cleanup: - -- the scheduler no longer runs Qwen3 GPU execution directly -- `tp_size == 1` and `tp_size > 1` both go through the same coarse-grained step protocol -- rank 0 execution now happens on a primary worker thread rather than on the scheduler thread -- scheduler-side work is limited to planning, command submission, and effect application -- sampling policy remains controller-owned, but sampling execution and logprob extraction now run on the worker side - -### Rust Sketch - -The following sketch is intentionally narrow. It is meant to capture the boundary, not to freeze the final implementation. - -```rust -use anyhow::Result; - -use crate::kv_pool::KvState; -use crate::tensor::DeviceVec; - -pub struct Qwen3RequestState { - pub kv: KvState, -} - -pub enum BatchKind { - Prefill, - Decode, - Unified, -} - -pub struct PrefillItem<'a> { - pub prompt_tokens: &'a [u32], - pub state: &'a mut Qwen3RequestState, -} - -pub struct DecodeItem<'a> { - pub token: u32, - pub state: &'a mut Qwen3RequestState, -} - -pub struct BatchSpec<'a> { - pub kind: BatchKind, - pub prefills: &'a mut [PrefillItem<'a>], - pub decodes: &'a mut [DecodeItem<'a>], -} - -pub struct BatchResult { - pub prefill_logits: Vec, - pub decode_logits: Vec, -} - -pub trait ModelExecutor: Send { - fn execute_batch(&mut self, spec: BatchSpec<'_>) -> Result; -} -``` - -The exact output representation may later need adjustment. - -In particular, decode output should be allowed to preserve a batched representation if that is important for throughput. - -What matters here is the ownership and control boundary, not freezing `Vec` as the permanent decode API. - -### Design Intent - -The intent of this interface is: - -- scheduler remains the control plane -- executor becomes the execution plane -- TP is hidden inside the executor, not leaked into scheduler logic -- `KvPool` and request admission stay in the scheduler - -This keeps the future shape clean: - -- `SingleGpuQwen3Executor` -- `TensorParallelQwen3Executor` - -Both should be able to sit behind the same scheduler-facing interface. - -### Relationship To DP - -This abstraction is the right carrier for model-internal parallelism such as TP and later EP. - -It is not the right abstraction for service-layer data parallelism across multiple model replicas. - -If pegainfer later needs multiple model replicas, that should live above the executor layer. A `ModelExecutor` still represents one logical model replica, even if that replica internally spans multiple GPUs. - -## What Success Looks Like - -The milestone is successful when all of the following are true: - -- `Qwen3-4B` runs correctly with `TP=2` -- existing single-GPU behavior remains intact with `TP=1` -- the runtime can be reasonably extended later toward larger dense models and MoE-related model parallelism -- the resulting design is understandable and does not create a one-off parallel path that the rest of the codebase must work around - -## Acceptance Criteria - -Primary acceptance criteria: - -- `Qwen3-4B` under `TP=2` passes the existing end-to-end test suite, or passes an equivalent minimally adjusted `e2e` path if the test harness needs TP-aware setup -- generated outputs under `TP=2` match the `TP=1` baseline for the covered `e2e` cases -- `TP=1` continues to pass its existing `e2e` coverage -- runtime stability is acceptable: no hangs, no cross-device state corruption, no obvious lifecycle failures during model load or generation - -## Current State And Remaining Issues - -At this point, the implementation meets the basic smoke-test bar for `Qwen3-4B TP=2`: - -- model load succeeds on two GPUs -- requests complete end-to-end through the existing OpenAI-compatible HTTP path -- generated outputs are sensible and clearly non-degenerate - -The current implementation has also passed a narrower `TP=8` smoke test on an 8x4090 machine: - -- eight-way weight load succeeds -- the server reaches `Scheduler ready` and starts listening -- simple completion requests return non-degenerate text - -That means the executor and sharding path are no longer merely `TP=2`-shaped, but `TP=8` should still be treated as an experimental validated configuration rather than a fully qualified support target. - -However, a few important engineering issues still remain open: - -- TP-vs-TP=1 exact parity is still not fully settled, but the old decode-state corruption bug is no longer the main blocker -- embedding and `lm_head` are still replicated by design in this first pass -- some of the runtime fixes are pragmatic bring-up fixes rather than final abstractions, especially around thread-scoped CUDA runtime / cuBLAS setup and teardown - -The next practical steps should be: - -- keep the current TP path stable and avoid reopening the earlier decode append bug -- further unify the `tp=1` and `tp>1` scheduler / executor flow now that both paths are real and runnable -- defer TP-specific CUDA Graph work until after that runtime shape is cleaner and more stable -- then revisit vocab-side replication only after the execution and correctness story is stable - -So the right reading of the current status is: - -- the architecture direction is validated -- the TP path is real and runnable -- the implementation is still a first-pass runtime bring-up, not the final production shape -- TP-specific CUDA Graph support should still be treated as follow-up work rather than a solved part of the current baseline - -The core bar for this milestone is straightforward: - -- `TP=2` for `Qwen3-4B` must be real, correct, and regression-safe - -## Out Of Scope Questions - -The following questions are intentionally deferred until after the first TP milestone is proven: - -- how far to push throughput optimization in the first TP implementation -- when to restore or redesign CUDA Graph support for TP paths -- when to shard vocab-facing weights instead of replicating them -- whether later multi-GPU support should expand first toward larger dense models, MoE, or broader serving topology work - -## Summary - -This milestone should stay disciplined. - -The job is not to build a full distributed inference platform in one step. The job is to make pegainfer capable of correct `TP=2` execution for `Qwen3-4B`, while establishing the architectural boundary that future large dense and MoE work can build on. +- **cuBLAS handles/workspaces must be thread-local, and every TP worker thread must bind both the CUDA runtime device and the driver context** before touching cuBLAS/FlashInfer/NCCL. The original symptoms were an illegal memory access surfacing later in `paged_kv_scatter_cuda` and intermittent hangs after successful requests — runtime state management, not the scheduler boundary. (Generalized in `docs/lessons/exact-match-gate-thread-cublas.md`.) +- **Request-scoped worker-thread cuBLAS resources need explicit teardown**, or repeated TP requests accumulate unstable per-thread state. +- **Decode KV writes must use the same paged scatter path as prefill.** A decode-only specialized append path silently drifted from the generic scatter semantics, so decode-built KV state diverged from a fresh prefill of the same prefix. The fix was deleting the special path, not patching it. diff --git a/docs/models/qwen35/model-crate.md b/docs/models/qwen35/model-crate.md index 8153c58f..461aea11 100644 --- a/docs/models/qwen35/model-crate.md +++ b/docs/models/qwen35/model-crate.md @@ -8,13 +8,13 @@ - **Read**: - `docs/index.md` - identified the existing core split, Qwen3 model crate split, and Qwen3.5 accuracy/optimization docs. - - `docs/models/qwen3/model-crate.md` - Qwen3 already owns its scheduler, executor/runtime API, tests, benches, and root-facing `EngineHandle` entry. + - `docs/models/qwen3/crate-layout.md` (at the time `model-crate.md`) - Qwen3 already owns its scheduler, executor/runtime API, tests, benches, and root-facing `EngineHandle` entry. - `docs/models/qwen35/accuracy.md` - at the time of this migration, Qwen3.5 e2e tests were regression guards against `test_data/Qwen3.5-4B.json`; current accuracy coverage is the HF logits gate recorded there. - `docs/models/qwen35/optimization.md` - Qwen3.5 should keep its hybrid linear/full-attention scheduler/state architecture. - GitHub issue #79 - acceptance criteria require `pegainfer-qwen35-4b`, removal of root `pegainfer::model::Qwen35Model` and `pegainfer::scheduler_qwen35`, generic root `bench_serving`, and CUDA validation. - `Cargo.toml`, `src/lib.rs`, `src/main.rs`, `src/ops.rs`, `src/scheduler.rs`, `src/model/qwen35.rs`, and `pegainfer-qwen3-4b/src/lib.rs` - mapped the current root Qwen3.5 surface and the Qwen3 crate interface to copy. - **Relevant history**: - - `docs/models/qwen3/model-crate.md` - root should load model crates through `EngineHandle`; model-owned execution details should move behind crate-local modules. + - `docs/models/qwen3/crate-layout.md` (at the time `model-crate.md`) - root should load model crates through `EngineHandle`; model-owned execution details should move behind crate-local modules. - **Plan**: 1. Add `pegainfer-qwen35-4b` to the workspace with dependencies mirroring the Qwen3 crate plus the root dependencies Qwen3.5 currently uses. 2. Move `src/model/qwen35.rs`, `src/model/qwen35/*`, `src/scheduler_qwen35.rs`, and Qwen3.5 recurrent op wrappers into the new crate, keeping CUDA/Triton kernel sources and FFI in `pegainfer-kernels`. diff --git a/docs/models/qwen35/roadmap.md b/docs/models/qwen35/roadmap.md index 6a406e4b..0dc49019 100644 --- a/docs/models/qwen35/roadmap.md +++ b/docs/models/qwen35/roadmap.md @@ -45,7 +45,7 @@ Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc: ## Cleanup ledger -- **Dead code:** ✓ qwen35 `probe_model()`+`ModelInfo` and the `start_with_model` entry point removed (#258); the same dead pair still exists in qwen3 (owned there). +- **Dead code:** ✓ qwen35 `probe_model()`+`ModelInfo` and the `start_with_model` entry point removed (#258); qwen3's matching pair removed in #248. - **Docs:** ✓ qwen35 docs cleaned (#258): `Status:` enum headers dropped, obsolete `crates/` paths corrected to top-level, parity numbers reconciled to one ledger (234ms/11.77ms), and the e2e-gibberish story lifted to `docs/lessons/exact-match-gate-thread-cublas.md`. #186 then added the HF logits gate and retired the exact-text baseline. - **Shared with qwen3 (owned there):** batched greedy decode sampling (`batch_decode.rs` has the same per-row pattern), non-greedy sampling correctness coverage, frontend usage accounting (#78). diff --git a/docs/subsystems/kernels/kernel-op-reports.md b/docs/subsystems/kernels/kernel-op-reports.md index 15f4b385..93fe5e1b 100644 --- a/docs/subsystems/kernels/kernel-op-reports.md +++ b/docs/subsystems/kernels/kernel-op-reports.md @@ -9,17 +9,17 @@ - **Read**: - `docs/index.md` - located the active benchmarking, CUPTI, kernel-boundary, and Qwen3 model-crate docs. - - `docs/models/qwen3/model-crate.md` - confirmed `qwen3_kernel_snapshot` was the current Qwen3 kernel snapshot runner and already captured warm/cold-L2 latency plus default CUPTI counters. + - `docs/models/qwen3/crate-layout.md` (at the time `model-crate.md`) - confirmed `qwen3_kernel_snapshot` was then the Qwen3 kernel snapshot runner and already captured warm/cold-L2 latency plus default CUPTI counters. - `docs/conventions/bench-regression.md` - clarified that the existing serving benchmark remains the model-level regression artifact; this task should not mix per-op reports with E2E snapshots. - `docs/subsystems/kernels/pegainfer-kernels-boundary.md` - confirmed kernels should become first-class measurable assets and model DAG manifests should live with model crates. - - `docs/models/qwen3/kernels-crate.md` - confirmed kernel source/build ownership now lives in `pegainfer-kernels`, while model-owned DAG metadata belongs in the Qwen3 crate. + - `docs/models/qwen3/crate-layout.md` (at the time `kernels-crate.md`) - confirmed kernel source/build ownership lives in `pegainfer-kernels`, while model-owned DAG metadata belongs in the Qwen3 crate. - `docs/playbooks/profiling-guide.md` - confirmed the diagnostic split between kernel composition/proportions and benchmark-grade latency. - **Relevant history**: - - `docs/models/qwen3/model-crate.md` showed the current single-op snapshot already found the low-batch long-context decode-attention bottleneck. + - The Qwen3 crate records (now `docs/models/qwen3/crate-layout.md`) showed the single-op snapshot already found the low-batch long-context decode-attention bottleneck. - **Plan**: 1. Add direct Qwen3 crate dev-dependencies for generic infrastructure (`clap` derive for CLI and `toml` for manifest parsing) instead of extending the hand-written parser. 2. Add a model-local TOML manifest for Qwen3-4B kernel reports, initially covering only op names, phases, shape sweeps, and variants. - 3. Replace `crates/pegainfer-qwen3-4b/benches/qwen3_kernel_snapshot.rs` with a manifest-driven `qwen3_kernel_report` bin; do not keep a bench wrapper. + 3. Replace the `qwen3_kernel_snapshot` bench with a manifest-driven `qwen3_kernel_report` bin; do not keep a bench wrapper. 4. Add a composition command that reads per-op case results and emits a decode phase report by joining the manifest's op repeat rules with measured per-op reports. 5. Run formatting and the strongest local compile checks available; GPU execution may still require the CUDA validation host because this machine lacks local CUDA tooling. - **Risks / open questions**: @@ -29,13 +29,13 @@ ## Execution Log ### Step 1: Move from bench target to bin -- Removed the `qwen3_kernel_snapshot` bench target from `crates/pegainfer-qwen3-4b/Cargo.toml`. -- Moved the report runner to `crates/pegainfer-qwen3-4b/src/bin/qwen3_kernel_report.rs`. +- Removed the `qwen3_kernel_snapshot` bench target from `pegainfer-qwen3-4b/Cargo.toml`. +- Moved the report runner to `pegainfer-qwen3-4b/src/bin/qwen3_kernel_report.rs`. - Added a `kernel-report` feature for generic tool dependencies (`clap`, `toml`, `sha2`, `hex`) and `pegainfer-cupti`; the bin requires that feature so normal Qwen3 library/server builds do not pull CUPTI into the default dependency graph. - Removed the temporary `cargo bench` compatibility argument handling after the tool became a normal binary. ### Step 2: Add model-local manifest -- Added `crates/pegainfer-qwen3-4b/kernel_manifests/qwen3-4b.toml`. +- Added `pegainfer-qwen3-4b/kernel_manifests/qwen3-4b.toml`. - The first manifest now stays deliberately thin: `model`, `[[ops]]`, `phase`, per-op shape sweep fields, and variant labels. Provider-owned facts such as dtype, head counts, head dimension, page size, thresholds, and composition policy stay in Rust. ### Step 3: Refactor report schema and commands @@ -46,11 +46,11 @@ ### Step 4: Local validation - `cargo fmt --all --check` passed. - `cargo metadata --no-deps --format-version 1` passed. -- Local `cargo check --release -p pegainfer-qwen3-4b --bench qwen3_kernel_snapshot` previously failed before Rust type checking because this local host lacks a usable `nvcc`; GPU validation moved to the CUDA validation host. +- Local compile checks of the report bin fail before Rust type checking because this local host lacks a usable `nvcc`; GPU validation moved to the CUDA validation host (build/run commands in Step 5). ### Step 5: GPU minimal validation - Rebuilt the disposable validation worktree at `` from local `HEAD` commit `612850f`, then rsynced the current working tree changes over it. -- Copied initialized FlashInfer headers from `/third_party/flashinfer` into the clean worktree's `crates/pegainfer-kernels/third_party/flashinfer` directory. +- Copied initialized FlashInfer headers from `/third_party/flashinfer` into the clean worktree's `pegainfer-kernels/third_party/flashinfer` directory. - `PEGAINFER_CUDA_SM=120 cargo build --release -p pegainfer-qwen3-4b --features kernel-report --bin qwen3_kernel_report` passed. - `PEGAINFER_CUDA_SM=120 cargo run --release -p pegainfer-qwen3-4b --features kernel-report --bin qwen3_kernel_report -- run --no-cupti --iters 1 --contexts 1024 --batch-sizes 1 --variants non_partition --out $RESULT_ROOT/qwen3_kernel_op_report_min.json` passed. - `PEGAINFER_CUDA_SM=120 cargo run --release -p pegainfer-qwen3-4b --features kernel-report --bin qwen3_kernel_report -- compare --base $RESULT_ROOT/qwen3_kernel_op_report_min.json --new $RESULT_ROOT/qwen3_kernel_op_report_min.json` passed with `warnings=0 failures=0`. @@ -242,10 +242,10 @@ ### Step 16: Inspect local FlashInfer prefill implementation - Initialized the local FlashInfer submodule: - - `git submodule update --init crates/pegainfer-kernels/third_party/flashinfer` + - `git submodule update --init pegainfer-kernels/third_party/flashinfer` - Checked out `779c24d1c9e6fcc51aa2359884696fbf4ac69b3b`. - Confirmed the current PegaInfer wrapper calls FlashInfer's FA2 paged prefill path: - - `crates/pegainfer-kernels/csrc/paged_attention.cu` computes `cta_tile_q = FA2DetermineCtaTileQ(packed_qo_len, head_dim)` and dispatches `BatchPrefillWithPagedKVCacheDispatched`. + - `pegainfer-kernels/csrc/paged_attention.cu` computes `cta_tile_q = FA2DetermineCtaTileQ(packed_qo_len, head_dim)` and dispatches `BatchPrefillWithPagedKVCacheDispatched`. - For Qwen3 `seq_len=10000`, `packed_qo_len = seq_len * (num_qo_heads / num_kv_heads) = 40000`, and FlashInfer's `FA2DetermineCtaTileQ` selects `CTA_TILE_Q=128`. - Rust prefill planning also calls `batch_prefill_cta_tile_q`, so any `CTA_TILE_Q` override must be plumbed into both the plan metadata and the kernel launch. - Matched the NCU kernel traits to FlashInfer source: diff --git a/pegainfer-engine/src/engine.rs b/pegainfer-engine/src/engine.rs index 2bb6b9c5..ab7fdfe3 100644 --- a/pegainfer-engine/src/engine.rs +++ b/pegainfer-engine/src/engine.rs @@ -41,14 +41,6 @@ pub enum EpBackend { Pplx, } -#[derive(Clone, Debug)] -pub struct ModelInfo { - pub id: &'static str, - pub display_name: String, - pub model_path: PathBuf, - pub max_model_len: Option, -} - #[derive(Clone, Debug, PartialEq)] pub struct TokenLogprob { pub logprob: f32, diff --git a/pegainfer-qwen3-4b/src/batch_decode_trace.rs b/pegainfer-qwen3-4b/src/batch_decode_trace.rs index b7e5c850..b5e5483d 100644 --- a/pegainfer-qwen3-4b/src/batch_decode_trace.rs +++ b/pegainfer-qwen3-4b/src/batch_decode_trace.rs @@ -12,8 +12,6 @@ use crate::weights::{ModelRuntimeConfig, Qwen3Model}; pub const MODEL: &str = "qwen3-4b"; pub const PHASE_DECODE: &str = "decode"; -pub const HIDDEN_SIZE: usize = 2560; -pub const INTERMEDIATE_SIZE: usize = 9728; pub const NUM_LAYERS: usize = 36; pub const NUM_Q_HEADS: usize = 32; pub const NUM_KV_HEADS: usize = 8; diff --git a/pegainfer-qwen3-4b/src/lib.rs b/pegainfer-qwen3-4b/src/lib.rs index 43dcf7c3..3575d203 100644 --- a/pegainfer-qwen3-4b/src/lib.rs +++ b/pegainfer-qwen3-4b/src/lib.rs @@ -16,7 +16,7 @@ mod weights; use std::path::Path; use anyhow::Result; -use pegainfer_core::engine::{EngineHandle, EngineLoadOptions, ModelInfo}; +use pegainfer_core::engine::{EngineHandle, EngineLoadOptions}; pub use kernel_plan::kernel_plan; @@ -32,29 +32,6 @@ pub mod runtime { }; } -pub fn probe_model(model_path: &Path) -> Result> { - let config_path = model_path.join("config.json"); - let content = match std::fs::read_to_string(&config_path) { - Ok(content) => content, - Err(err) if err.kind() == std::io::ErrorKind::NotFound => return Ok(None), - Err(err) => return Err(err.into()), - }; - let json: serde_json::Value = serde_json::from_str(&content)?; - if json.get("text_config").is_some() { - return Ok(None); - } - - Ok(Some(ModelInfo { - id: "qwen3-4b", - display_name: "Qwen3-4B".to_string(), - model_path: model_path.to_path_buf(), - max_model_len: json - .get("max_position_embeddings") - .and_then(serde_json::Value::as_u64) - .and_then(|value| u32::try_from(value).ok()), - })) -} - pub fn start_engine(model_path: &Path, options: EngineLoadOptions) -> Result { let EngineLoadOptions { enable_cuda_graph,