diff --git a/docs/index.md b/docs/index.md
index b833cad6..328acac7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -93,7 +93,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
 | --- | --- |
 | `subsystems/runtime/runtime.md` | Runtime complexity is controlled by a shared `openinfer-core` that owns the generation contract and orchestration; per-model crates implement `ModelForward` so prefill/decode and hybrid attention stay hidden from the caller. State (`&mut`) is separated from weights (`&self`) for future bs > 1. |
 | `subsystems/runtime/kv-cache-design.md` | Dynamo-style KV cache with a logical/physical split: BlockManager owns block lifecycle and admission; the PhysicalBackend trait owns GPU memory and layout (FullAttention / MLA). Supports TP / DP. Based on a survey of vLLM/Dynamo/pegaflow. |
-| `subsystems/runtime/pegaflow-offload-integration.md` | Uses `pegaflow-core` as an in-process Rust library for the physical KV offload backend (HBM→DRAM/SSD/RDMA), filling in the offload layer that kvbm never implemented. **Qwen3-4B full-attn ships first; verified end to end on a real GPU** (async SAVE+LOAD wired into executor/scheduler; after a pure CPU hit or a combined GPU+CPU hit, restored logits agree with a cold compute). pegaflow is pinned by git rev (#331+#333). Off by default; not wired into the server CLI. linear excluded, sparse deferred. |
+| `subsystems/runtime/pegaflow-offload-integration.md` | Uses `pegaflow-core` as an in-process Rust library for the physical KV offload backend (HBM→DRAM/SSD/RDMA), filling in the offload layer that kvbm never implemented. **Qwen3-4B full-attn ships first; verified end to end on a real GPU** (async SAVE+LOAD wired into executor/scheduler; after a pure CPU hit or a combined GPU+CPU hit, restored logits agree with a cold compute). pegaflow is pinned by git rev (#331+#333). Off by default; server CLI wired (#316: `--kv-offload`/`--no-prefix-cache`, plain+LoRA). linear excluded, sparse deferred. |
 
 ## subsystems / scheduler
 
diff --git a/docs/models/qwen3/roadmap.md b/docs/models/qwen3/roadmap.md
index d079db0a..9f3273fb 100644
--- a/docs/models/qwen3/roadmap.md
+++ b/docs/models/qwen3/roadmap.md
@@ -1,6 +1,6 @@
 # Qwen3-4B Roadmap
 
-> **TL;DR:** Qwen3-4B is the maturity bar of the project — continuous batching, TP=2, default-on prefix cache (#216), and the HF logits golden gate are all live — so its roadmap is sharpening, not bring-up. The #220 RoPE OOB bug is now fixed (cos/sin cache sized from `max_position_embeddings`, admission rejects past the window, kernel traps an out-of-range position; gated by both an oversized-reject and an in-window >4096 IT). The verified open set: per-row batch-decode sampling (O(batch) launches + syncs per step despite a production-proven batched primitive in-tree), zero TP correctness coverage, LoRA built but gated only by a zero-adapter smoke, prefix-cache observability dropped at the scheduler boundary, a docs layer that describes deleted tooling, and the YaRN #8 follow-up for rope-scaled checkpoints. Open-set findings verified 2026-06-04 against `6ee9247`.
+> **TL;DR:** Qwen3-4B is the maturity bar of the project — continuous batching, TP=2, default-on prefix cache (#216), and the HF logits golden gate are all live — so its roadmap is sharpening, not bring-up. The #220 RoPE OOB bug is now fixed (cos/sin cache sized from `max_position_embeddings`, admission rejects past the window, kernel traps an out-of-range position; gated by both an oversized-reject and an in-window >4096 IT). The verified open set: per-row batch-decode sampling (O(batch) launches + syncs per step despite a production-proven batched primitive in-tree), zero TP correctness coverage, LoRA built but gated only by a zero-adapter smoke, prefix-cache observability dropped at the scheduler boundary, a docs layer that describes deleted tooling, and the YaRN #8 follow-up for rope-scaled checkpoints. Open-set findings verified 2026-06-04 against `6ee9247`. Since then: batched greedy sampling phase 1 landed (#307) and in-process pegaflow KV offload / L2 host tier shipped (#316, pure-L2 TTFT 195→40ms) — both folded into the table below; phase-2 random sampling and the rest of the open set stand.
 >
 > **Last touched:** 2026-06
 
@@ -14,7 +14,8 @@ Tracking issue: see the `[Model] Qwen3-4B roadmap` GitHub issue. Cross-model ite
 | Prefix cache | ✓ default-on full-block kvbm matching (#216); 4 cache-hit replay passes in the golden gate | `executor.rs:750-751`, `tests/hf_golden_gate.rs` |
 | Accuracy gate | ✓ HF bf16 golden, bs=1/batched/graph + cached replays; single-GPU, ≤256-token prompts | `tests/hf_golden_gate.rs:451` |
 | Long context | ✓ fixed: RoPE cache sized from `max_position_embeddings`, admission rejects past the window, kernel traps OOB; gated by reject + in-window >4096 ITs. YaRN #8 still open for scaled checkpoints | `weights.rs:310-318`, `tests/context_window.rs`, `tests/context_window_in_window.rs` |
-| Batch sampling | ✗ per-row: O(batch) launches + O(batch) D2H syncs per decode step; 1MB scratch literal | `executor.rs:159-179,212-214`, `ops/sampling.rs:7,204-208` |
+| Batch sampling | ⚠ greedy phase landed (#307: one `argmax_batch_bf16_into` launch/step, not O(batch)); batched random per-row path + 1MB scratch-from-literal still open (phase 2) | `ops/sampling.rs`, `csrc/shared/argmax.cu` |
+| KV offload (L2) | ✓ in-process pegaflow host-tier save/restore (#316); CLI `--kv-offload`/`--no-prefix-cache`, plain + LoRA; pure-L2 TTFT 195→40ms | `subsystems/runtime/pegaflow-offload-integration.md` |
 | TP correctness | ✗ zero automated coverage — every test runs `device_ordinals: vec![0]` | grep `tests/` |
 | LoRA | ⚠ load/unload/TP/request-level all built; only test uses a **zero adapter** | `lora.rs`, `tests/lora_smoke.rs:91-130` |
 | Non-greedy sampling | ✗ zero correctness coverage (all tests greedy); penalties/min_p absent from `SamplingParams` | grep `tests/` |
@@ -26,7 +27,7 @@ Tracking issue: see the `[Model] Qwen3-4B roadmap` GitHub issue. Cross-model ite
 ### Now
 
 1. **YaRN for rope-scaled checkpoints (#8).** The #220 RoPE OOB fix landed scope (a): the cos/sin cache is sized from `config.max_position_embeddings`, admission crash-early rejects past the window (distinct context-length vs KV-budget reasons), the kernel `__trap`s an out-of-range position as a last-resort backstop, and the gate now covers both an oversized reject and an in-window >4096 case (`tests/context_window.rs`, `tests/context_window_in_window.rs`). That precompute is correct *only because this checkpoint has `rope_scaling: null`*. Scope (b) remains open: #8 YaRN is the prerequisite for any rope-scaled checkpoint — the precompute length must come from the scaled schedule, coordinated with the qwen3.5 sibling fix so both crates share the pattern.
-2. **Batched greedy decode sampling.** Phase 1: route all-greedy batches through `argmax_batch_bf16_into` — one launch + one D2H per step; this primitive is production-proven in deepseek-v2-lite (`runtime.rs:1379`). `flashinfer_top1_batch_into` has *no* production caller and needs its own validation before use. Phase 2: batched random path with per-row params; source the 1MB FlashInfer row-state scratch from the kernel instead of the literal. Shared `openinfer-core/kernels` work — covers qwen35 too. Gated by the existing golden gate.
+2. **Batched greedy decode sampling.** ✓ Phase 1 landed (#307): all-greedy batches route through `argmax_batch_bf16_into` — one launch + one D2H per step (the deepseek-v2-lite-proven primitive, now in `openinfer-core` + `csrc/shared/argmax.cu`). Phase 2 still open: batched random path with per-row params; source the 1MB FlashInfer row-state scratch from the kernel instead of the literal. Shared `openinfer-core/kernels` work — covers qwen35 too. Gated by the existing golden gate.
 3. **Sampling correctness coverage.** Every test in both qwen crates is greedy. Add seed-determinism + temperature/top_k/top_p behavioral tests, and audit the frontend for silently-dropped params (penalties, min_p are absent from `SamplingParams` entirely) — the kimi-k2 silent-greedy bug (#237) shows this class is real and currently nothing would catch it here.
 4. **Prefix-cache observability.** `cached_tokens` is computed (`executor.rs:751`) and dies at the scheduler boundary; the frontend hardcodes `num_cached_tokens: 0`. Thread it through `TokenEvent::Scheduled` into usage; log hit rate. Adjacent: #78 (streaming usage discards completion_tokens) — same usage-accounting surface.
 
diff --git a/docs/roadmap/execution.md b/docs/roadmap/execution.md
index 39348869..25c4f6cf 100644
--- a/docs/roadmap/execution.md
+++ b/docs/roadmap/execution.md
@@ -10,6 +10,7 @@ These are the shared layers — frontend, runtime, kernels, ledger/simulator/tra
 
 - **Model-owned kernel plans.** Qwen3 already carries a light `kernel_plan` mapping prefill/decode/unified phases → Rust wrappers, FFI symbols, and CUDA/Triton/cuBLAS backends. Extend the same shape to Qwen3.5 and DeepSeek V4 so each model crate is self-describing.
 - **Frontend polish.** `vllm-frontend-rs` is the default OpenAI surface, talking to openinfer via a local engine-core IPC bridge. Outstanding: logprobs / prompt-logprobs translation, usage accounting, and a deliberate decision on whether the served-model-id should decouple from the tokenizer path.
+- **KV data plane (pegaflow).** First concrete tier landed: `openinfer-kv-offload::OffloadEngine` wraps in-process `pegaflow-core` as a host-tier ("L2") KV backend (mechanism), with per-model schedulers owning the residency policy (no universal connector trait — policy/mechanism split per `direction.md`). Shipped on Qwen3-4B full-attn (#316). Next candidate: Kimi-K2 MLA (layout zero-impedance, reuses the same connector pattern). SSD/RDMA tiers and DeepSeek sparse remain unrealized scope. See `subsystems/runtime/pegaflow-offload-integration.md`.
 
 ### Next
 
@@ -56,7 +57,7 @@ Each model crate owns its own scheduler, kernels, accuracy story, and benchmarks
 
 **Goal:** stay ahead of vLLM on serving experience while expanding the parallel-strategy and scheduling surface.
 
-**Done:** single-request and continuous batching ahead of vLLM (decode TPOT wins at all concurrencies; QPS=2 within 2% throughput while leading TTFT, TPOT, and latency stability). TP=2 brought up end-to-end on one machine; TP=8 smoke-tested on 8×4090. Issue #85 KV-pressure hang at QPS=2 is fixed.
+**Done:** single-request and continuous batching ahead of vLLM (decode TPOT wins at all concurrencies; QPS=2 within 2% throughput while leading TTFT, TPOT, and latency stability). TP=2 brought up end-to-end on one machine; TP=8 smoke-tested on 8×4090. Issue #85 KV-pressure hang at QPS=2 is fixed. Batched greedy decode sampling landed (#307 — one `argmax_batch_bf16_into` launch per step, not O(batch)). **In-process pegaflow KV offload shipped (#316)** — host-tier "L2" save/restore wired into executor + scheduler, server CLI (`--kv-offload` / `--kv-offload-host-gib` / `--no-prefix-cache`) on both plain and LoRA paths; pure-L2 TTFT 195→40 ms measured. See `subsystems/runtime/pegaflow-offload-integration.md`.
 
 **Next:**
 - Explore pipeline parallelism (PP) as a complement to TP — particularly for larger model fits and multi-node layouts.
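diff --git a/docs/subsystems/runtime/pegaflow-offload-integration.md b/docs/subsystems/runtime/pegaflow-offload-integration.md
index c0653d8e..38163070 100644
--- a/docs/subsystems/runtime/pegaflow-offload-integration.md
+++ b/docs/subsystems/runtime/pegaflow-offload-integration.md
@@ -1,6 +1,6 @@
 # pegaflow KV Offload Integration Spec
 
-> **TL;DR**: Use `pegaflow-core` as an **in-process Rust library** for the physical KV offload backend (HBM→DRAM/SSD/RDMA), filling in the offload layer that kvbm left unwritten. The connector brain (which decides which blocks to load/save) is built in-house on kvbm's logical/physical layering idea, and pegaflow is reduced to a semantics-agnostic raw block transfer backend. **The route has been changed to ship Qwen3-4B full-attn first** (the original plan shipped Kimi first): the page-first single buffer is adapted through pegaflow's `block_stride_bytes` (PR #331). **Verified end to end on a real GPU**: async SAVE + async LOAD are wired into `Qwen3Executor` + scheduler; `tests/kv_offload_cpu_hit.rs` covers a pure CPU hit and a combined GPU+CPU hit, and restored logits agree with a cold compute; at the connector layer, `OffloadEngine` + `tests/cpu_roundtrip.rs` round-trips byte for byte. Off by default (builder flag opt-in); not wired into the server CLI. **Qwen3.5 linear/SSM state is explicitly excluded**; **DeepSeek sparse is deferred**.
+> **TL;DR**: Use `pegaflow-core` as an **in-process Rust library** for the physical KV offload backend (HBM→DRAM/SSD/RDMA), filling in the offload layer that kvbm left unwritten. The connector brain (which decides which blocks to load/save) is built in-house on kvbm's logical/physical layering idea, and pegaflow is reduced to a semantics-agnostic raw block transfer backend. **The route has been changed to ship Qwen3-4B full-attn first** (the original plan shipped Kimi first): the page-first single buffer is adapted through pegaflow's `block_stride_bytes` (PR #331). **Verified end to end on a real GPU**: async SAVE + async LOAD are wired into `Qwen3Executor` + scheduler; `tests/kv_offload_cpu_hit.rs` covers a pure CPU hit and a combined GPU+CPU hit, and restored logits agree with a cold compute; at the connector layer, `OffloadEngine` + `tests/cpu_roundtrip.rs` round-trips byte for byte. Off by default (builder flag opt-in); **server CLI wired** (#316: `--kv-offload` / `--kv-offload-host-gib` / `--no-prefix-cache`, passed through on both the plain and `--enable-lora` startup paths). A pure-L2 benchmark measured Qwen3-4B mean TTFT 195→40ms (−79%; evict-before-probe → `gpu_hit=0`, the whole prefix restored from the host tier). **Qwen3.5 linear/SSM state is explicitly excluded**; **DeepSeek sparse is deferred**.
 >
 > Last touched: 2026-06
 
@@ -17,7 +17,7 @@
 - **Live wiring (§9, landed)**: `Qwen3Executor` holds an `Option` (`Qwen3OffloadOptions` opt-in, off by default); a SAVE hook (`save_sealed_blocks`, async fire-and-forget) plus non-blocking prefetch admission (`begin_kv_prefetch`/`drain_ready_prefetch`/`wait_ready_prefetch`, scheduler `loading` state). The `tests/kv_offload_cpu_hit.rs` test runs two scenarios in sequence, a pure CPU restore (`gpu_hit==0`) and a combined GPU+CPU hit (G=3+C=3 stitched into one contiguous prefix); in both, the restored first-token logits agree with a cold compute (mean Δ≈0.03 nat, the bf16 floor).
 - **Three correctness hardenings** (after the toxic-review): ① when `reserve_loaded_blocks` fails or the `load` submission fails, the query lease is explicitly released via `release_query_lease` instead of leaking until the 600s TTL; ② when admission rejects (context / KV budget / unknown LoRA), `drop_request` releases the already-settled prefetch state instead of leaking committed blocks; ③ async SAVE holds a strong `ImmutableBlock` reference (`KvBlockGuard`) to each block being saved inside the spawned task and drops it only after the D2H copy lands, closing the silent-corruption window of "request ends → slot reallocated → D2H captures the wrong KV and writes it under the old hash".
 
-Not wired into the server CLI (reachable only via `start_engine_with_offload` / test entry points). **The dependency has been moved off the fork**: PRs #331+#333 are both merged into upstream master (squashed into `07cac7e`), `third_party/pegaflow` has been deleted, and `pegaflow-core` is now a **git dependency** pinned to that rev (see §5.2); GPU tests behave the same under the git dep (identical delta).
+**Server CLI wired (#316)**: `--kv-offload` (bool) / `--kv-offload-host-gib` (f64, default 8.0; pegaflow `cudaHostAlloc`s the whole region at startup, so RSS reflects it immediately) / `--no-prefix-cache` (vLLM-style; without offload it disables prefix matching, with offload it selects pure-L2 mode, where evict-before-probe makes every prefix restore from the host tier). Both the plain and `--enable-lora` paths pass through `offload_options` + `no_prefix_cache`; this is safe under LoRA because prefix block hashes are salted with the adapter name (`compute_salt_hash`), so restored KV (from HBM or the host tier) never crosses adapters. Three #316 review hardenings: echo requests are not offered prefetch (their prefill skips `match_and_add_prefix`, so prefetched blocks would go unused); admission credits already-settled prefix blocks via `prefetched_blocks`; and `drop_request` waits for in-flight H2D to land before releasing the reservation. **The dependency has been moved off the fork**: PRs #331+#333 are both merged into upstream master (squashed into `07cac7e`), `third_party/pegaflow` has been deleted, and `pegaflow-core` is now a **git dependency** pinned to that rev (see §5.2); GPU tests behave the same under the git dep (identical delta).
 
 Related: [kv-cache-design.md](kv-cache-design.md) (logical/physical layering; already lists pegaflow in its design survey) · [qwen3-kvbm-integration-spec.md](qwen3-kvbm-integration-spec.md) (kvbm-logical already integrated) · `models/kimi-k2/kv-cache-design.md` (Kimi already uses `BlockPool`) · `models/qwen3/prefix-cache.md` (in-HBM prefix reuse already landed).
 
@@ -45,8 +45,8 @@ The design goal of the `kvbm-physical` / `kvbm-engine` vendored in the openinfer repo is exactly
 
 | Model | KV layout | active set | Cross-request reuse | Offload verdict |
 | --- | --- | --- | --- | --- |
-| **Kimi-K2 MLA** | paged, per-layer ckv/kpe arena backed by `BlockPool`; latent 68.6 KiB/token, no per-head | None (dense full prefix) | Yes (in-HBM prefix cache landed) | **First to ship**: cleanest integration surface; layout maps directly onto pegaflow registration |
-| **Qwen3 / Qwen3.5 full-attn** | paged, page-first single buffer, `PagePool` | None (dense full prefix) | Yes (prefix cache landed) | **Second to ship**: page-first conflicts with pegaflow's `stride==copy-size` ABI and needs `block_stride` added (see §5.R1) |
+| **Qwen3 / Qwen3.5 full-attn** | paged, page-first single buffer, `PagePool` | None (dense full prefix) | Yes (prefix cache landed) | **Shipped first (#316)**: the page-first conflict with pegaflow's `stride==copy-size` ABI is resolved by `block_stride` (§5.R1); runs end to end |
+| **Kimi-K2 MLA** | paged, per-layer ckv/kpe arena backed by `BlockPool`; latent 68.6 KiB/token, no per-head | None (dense full prefix) | Yes (in-HBM prefix cache landed) | **Next candidate**: layout maps directly onto pegaflow registration (cleanest integration surface); it can reuse the Qwen3-4B connector pattern as is |
 | **Qwen3.5 linear (24 layers)** | per-request `RecurrentState` [32,128,128] f32, 2 MiB/layer; not paged, allocated separately | None (every step reads and writes the whole matrix) | **Zero** (a lossy summary of this request only, not content-addressable) | **Excluded**: offload brings no prefix/dedup benefit; saving HBM here means per-request swap-out, which is a different mechanism |
 | **DeepSeek-V4 sparse** | per-request per-layer dense arena [window\|compressed], not paged; compressor 4:1 | **Explicit**: `topk_idxs` = window rows + compressed rows selected by the indexer, at token/row granularity, reselected every step | Partial | **Deferred**: the compressor already bounds the footprint; the indexer signal is readily available but token granularity ≠ block granularity (see §7) |
 
@@ -56,8 +56,8 @@ The design goal of the `kvbm-physical` / `kvbm-engine` vendored in the openinfer repo is exactly
 
 ## 4. Route
 
-1. **Kimi MLA first**: pegaflow serves as the host/SSD tier beneath `BlockPool`; blocks are demoted to host on eviction and restored from host when a prefix query hits. Bandwidth is cheap (the latent is small) and the layout has zero impedance.
-2. **Qwen full-attn second**: first add `block_stride_bytes` to pegaflow (R1), then hook up the page-first buffer.
+1. **Qwen full-attn shipped first (#316)**: `block_stride_bytes` (R1) was added to pegaflow to resolve the page-first ABI conflict; async SAVE + non-blocking prefetch admission are wired into `Qwen3Executor` + scheduler; the server CLI is wired.
+2. **Kimi MLA next candidate**: pegaflow serves as the host/SSD tier beneath `BlockPool`; blocks are demoted to host on eviction and restored from host when a prefix query hits. Bandwidth is cheap (the latent is small) and the layout has zero impedance, so it can directly reuse the Qwen3-4B connector pattern.
 3. **linear excluded, sparse deferred**.
 
 ## 5. Feasibility (adversarial verification conclusions, with evidence)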