openinfer-project · xiaguan · Jun 9, 2026 · Jun 9, 2026
diff --git a/docs/index.md b/docs/index.md
@@ -93,7 +93,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
 | --- | --- |
 | `subsystems/runtime/runtime.md` | Runtime complexity is controlled by a shared `openinfer-core` that owns the generation contract and orchestration; per-model crates implement `ModelForward` so prefill/decode and hybrid attention stay hidden from the caller. State (`&mut`) is separated from weights (`&self`) for future bs > 1. |
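 | `subsystems/runtime/kv-cache-design.md` | Dynamo-style KV cache with a logical/physical split: BlockManager owns block lifecycle and admission; the PhysicalBackend trait owns GPU memory and layout (FullAttention / MLA). Supports TP / DP. Based on a survey of vLLM, Dynamo, and pegaflow. |
-| `subsystems/runtime/pegaflow-offload-integration.md` | Uses `pegaflow-core` as an in-process Rust library for the physical KV offload backend (HBM→DRAM/SSD/RDMA), filling in the offload layer kvbm never implemented. **Qwen3-4B full-attn ships first; end-to-end runs and is verified on real GPUs** (async SAVE+LOAD wired into the executor/scheduler; after restore, logits match cold compute for both pure CPU-hit and combined GPU+CPU hit). pegaflow is pinned by git rev (#331+#333). Off by default; not yet wired into the server CLI. Linear excluded; sparse deferred. |
+| `subsystems/runtime/pegaflow-offload-integration.md` | Uses `pegaflow-core` as an in-process Rust library for the physical KV offload backend (HBM→DRAM/SSD/RDMA), filling in the offload layer kvbm never implemented. **Qwen3-4B full-attn ships first; end-to-end runs and is verified on real GPUs** (async SAVE+LOAD wired into the executor/scheduler; after restore, logits match cold compute for both pure CPU-hit and combined GPU+CPU hit). pegaflow is pinned by git rev (#331+#333). Off by default; server CLI wired (#316: `--kv-offload`/`--no-prefix-cache`, plain+LoRA). Linear excluded; sparse deferred. |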
 
 ## subsystems / scheduler
 

diff --git a/docs/models/qwen3/roadmap.md b/docs/models/qwen3/roadmap.md
@@ -1,6 +1,6 @@
 # Qwen3-4B Roadmap
 
-> **TL;DR:** Qwen3-4B is the maturity bar of the project — continuous batching, TP=2, default-on prefix cache (#216), and the HF logits golden gate are all live — so its roadmap is sharpening, not bring-up. The #220 RoPE OOB bug is now fixed (cos/sin cache sized from `max_position_embeddings`, admission rejects past the window, kernel traps an out-of-range position; gated by both an oversized-reject and an in-window >4096 IT). The verified open set: per-row batch-decode sampling (O(batch) launches + syncs per step despite a production-proven batched primitive in-tree), zero TP correctness coverage, LoRA built but gated only by a zero-adapter smoke, prefix-cache observability dropped at the scheduler boundary, a docs layer that describes deleted tooling, and the YaRN #8 follow-up for rope-scaled checkpoints. Open-set findings verified 2026-06-04 against `6ee9247`.
+> **TL;DR:** Qwen3-4B is the maturity bar of the project — continuous batching, TP=2, default-on prefix cache (#216), and the HF logits golden gate are all live — so its roadmap is sharpening, not bring-up. The #220 RoPE OOB bug is now fixed (cos/sin cache sized from `max_position_embeddings`, admission rejects past the window, kernel traps an out-of-range position; gated by both an oversized-reject and an in-window >4096 IT). The verified open set: per-row batch-decode sampling (O(batch) launches + syncs per step despite a production-proven batched primitive in-tree), zero TP correctness coverage, LoRA built but gated only by a zero-adapter smoke, prefix-cache observability dropped at the scheduler boundary, a docs layer that describes deleted tooling, and the YaRN #8 follow-up for rope-scaled checkpoints. Open-set findings verified 2026-06-04 against `6ee9247`. Since then: batched greedy sampling phase 1 landed (#307) and in-process pegaflow KV offload / L2 host tier shipped (#316, pure-L2 TTFT 195→40ms) — both folded into the table below; phase-2 random sampling and the rest of the open set stand.
 >
 > **Last touched:** 2026-06
 
@@ -14,7 +14,8 @@ Tracking issue: see the `[Model] Qwen3-4B roadmap` GitHub issue. Cross-model ite
 | Prefix cache | ✓ default-on full-block kvbm matching (#216); 4 cache-hit replay passes in the golden gate | `executor.rs:750-751`, `tests/hf_golden_gate.rs` |
 | Accuracy gate | ✓ HF bf16 golden, bs=1/batched/graph + cached replays; single-GPU, ≤256-token prompts | `tests/hf_golden_gate.rs:451` |
 | Long context | ✓ fixed: RoPE cache sized from `max_position_embeddings`, admission rejects past the window, kernel traps OOB; gated by reject + in-window >4096 ITs. YaRN #8 still open for scaled checkpoints | `weights.rs:310-318`, `tests/context_window.rs`, `tests/context_window_in_window.rs` |
-| Batch sampling | ✗ per-row: O(batch) launches + O(batch) D2H syncs per decode step; 1MB scratch literal | `executor.rs:159-179,212-214`, `ops/sampling.rs:7,204-208` |
+| Batch sampling | ⚠ greedy phase landed (#307: one `argmax_batch_bf16_into` launch/step, not O(batch)); batched random per-row path + 1MB scratch-from-literal still open (phase 2) | `ops/sampling.rs`, `csrc/shared/argmax.cu` |
+| KV offload (L2) | ✓ in-process pegaflow host-tier save/restore (#316); CLI `--kv-offload`/`--no-prefix-cache`, plain + LoRA; pure-L2 TTFT 195→40ms | `subsystems/runtime/pegaflow-offload-integration.md` |
 | TP correctness | ✗ zero automated coverage — every test runs `device_ordinals: vec![0]` | grep `tests/` |
 | LoRA | ⚠ load/unload/TP/request-level all built; only test uses a **zero adapter** | `lora.rs`, `tests/lora_smoke.rs:91-130` |
 | Non-greedy sampling | ✗ zero correctness coverage (all tests greedy); penalties/min_p absent from `SamplingParams` | grep `tests/` |
@@ -26,7 +27,7 @@ Tracking issue: see the `[Model] Qwen3-4B roadmap` GitHub issue. Cross-model ite
 ### Now
 
 1. **YaRN for rope-scaled checkpoints (#8).** The #220 RoPE OOB fix landed scope (a): the cos/sin cache is sized from `config.max_position_embeddings`, admission crash-early rejects past the window (distinct context-length vs KV-budget reasons), the kernel `__trap`s an out-of-range position as a last-resort backstop, and the gate now covers both an oversized reject and an in-window >4096 case (`tests/context_window.rs`, `tests/context_window_in_window.rs`). That precompute is correct *only because this checkpoint has `rope_scaling: null`*. Scope (b) remains open: #8 YaRN is the prerequisite for any rope-scaled checkpoint — the precompute length must come from the scaled schedule, coordinated with the qwen3.5 sibling fix so both crates share the pattern.
-2. **Batched greedy decode sampling.** Phase 1: route all-greedy batches through `argmax_batch_bf16_into` — one launch + one D2H per step; this primitive is production-proven in deepseek-v2-lite (`runtime.rs:1379`). `flashinfer_top1_batch_into` has *no* production caller and needs its own validation before use. Phase 2: batched random path with per-row params; source the 1MB FlashInfer row-state scratch from the kernel instead of the literal. Shared `openinfer-core/kernels` work — covers qwen35 too. Gated by the existing golden gate.
+2. **Batched greedy decode sampling.** ✓ Phase 1 landed (#307): all-greedy batches route through `argmax_batch_bf16_into` — one launch + one D2H per step (the deepseek-v2-lite-proven primitive, now in `openinfer-core` + `csrc/shared/argmax.cu`). Phase 2 still open: batched random path with per-row params; source the 1MB FlashInfer row-state scratch from the kernel instead of the literal. Shared `openinfer-core/kernels` work — covers qwen35 too. Gated by the existing golden gate.
 3. **Sampling correctness coverage.** Every test in both qwen crates is greedy. Add seed-determinism + temperature/top_k/top_p behavioral tests, and audit the frontend for silently-dropped params (penalties, min_p are absent from `SamplingParams` entirely) — the kimi-k2 silent-greedy bug (#237) shows this class is real and currently nothing would catch it here.
 4. **Prefix-cache observability.** `cached_tokens` is computed (`executor.rs:751`) and dies at the scheduler boundary; the frontend hardcodes `num_cached_tokens: 0`. Thread it through `TokenEvent::Scheduled` into usage; log hit rate. Adjacent: #78 (streaming usage discards completion_tokens) — same usage-accounting surface.
 

diff --git a/docs/roadmap/execution.md b/docs/roadmap/execution.md
@@ -10,6 +10,7 @@ These are the shared layers — frontend, runtime, kernels, ledger/simulator/tra
 
 - **Model-owned kernel plans.** Qwen3 already carries a light `kernel_plan` mapping prefill/decode/unified phases → Rust wrappers, FFI symbols, and CUDA/Triton/cuBLAS backends. Extend the same shape to Qwen3.5 and DeepSeek V4 so each model crate is self-describing.
 - **Frontend polish.** `vllm-frontend-rs` is the default OpenAI surface, talking to openinfer via a local engine-core IPC bridge. Outstanding: logprobs / prompt-logprobs translation, usage accounting, and a deliberate decision on whether the served-model-id should decouple from the tokenizer path.
+- **KV data plane (pegaflow).** First concrete tier landed: `openinfer-kv-offload::OffloadEngine` wraps in-process `pegaflow-core` as a host-tier ("L2") KV backend (mechanism), with per-model schedulers owning the residency policy (no universal connector trait — policy/mechanism split per `direction.md`). Shipped on Qwen3-4B full-attn (#316). Next candidate: Kimi-K2 MLA (layout zero-impedance, reuses the same connector pattern). SSD/RDMA tiers and DeepSeek sparse remain unrealized scope. See `subsystems/runtime/pegaflow-offload-integration.md`.
 
 ### Next
 
@@ -56,7 +57,7 @@ Each model crate owns its own scheduler, kernels, accuracy story, and benchmarks
 
 **Goal:** stay ahead of vLLM on serving experience while expanding the parallel-strategy and scheduling surface.
 
-**Done:** single-request and continuous batching ahead of vLLM (decode TPOT wins at all concurrencies; QPS=2 within 2% throughput while leading TTFT, TPOT, and latency stability). TP=2 brought up end-to-end on one machine; TP=8 smoke-tested on 8×4090. Issue #85 KV-pressure hang at QPS=2 is fixed.
+**Done:** single-request and continuous batching ahead of vLLM (decode TPOT wins at all concurrencies; QPS=2 within 2% throughput while leading TTFT, TPOT, and latency stability). TP=2 brought up end-to-end on one machine; TP=8 smoke-tested on 8×4090. Issue #85 KV-pressure hang at QPS=2 is fixed. Batched greedy decode sampling landed (#307 — one `argmax_batch_bf16_into` launch per step, not O(batch)). **In-process pegaflow KV offload shipped (#316)** — host-tier "L2" save/restore wired into executor + scheduler, server CLI (`--kv-offload` / `--kv-offload-host-gib` / `--no-prefix-cache`) on both plain and LoRA paths; pure-L2 TTFT 195→40 ms measured. See `subsystems/runtime/pegaflow-offload-integration.md`.
 
 **Next:**
 - Explore pipeline parallelism (PP) as a complement to TP — particularly for larger model fits and multi-node layouts.

diff --git a/docs/subsystems/runtime/pegaflow-offload-integration.md b/docs/subsystems/runtime/pegaflow-offload-integration.md
@@ -1,6 +1,6 @@
 # pegaflow KV Offload Integration Spec
 
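-> **TL;DR**: Use `pegaflow-core` as an **in-process Rust library** for the physical backend of KV offload (HBM→DRAM/SSD/RDMA), filling in the offload layer that kvbm left unwritten. The connector brain (deciding which blocks to load/save) is built in-house on kvbm's logical/physical layering idea, and pegaflow is reduced to a semantics-agnostic raw block transfer backend. **The route has changed to ship Qwen3-4B full-attn first** (the original plan was Kimi first): the page-first single buffer is adapted via pegaflow's `block_stride_bytes` (PR #331). **End-to-end runs and is verified on real GPUs**: async SAVE + async LOAD are wired into `Qwen3Executor` + scheduler; `tests/kv_offload_cpu_hit.rs` covers pure CPU-hit and combined GPU+CPU hit, and logits after restore match cold compute; at the connector layer, `OffloadEngine` with `tests/cpu_roundtrip.rs` round-trips byte-for-byte. Off by default (builder flag opt-in); not yet wired into the server CLI. **Qwen3.5 linear/SSM state is explicitly excluded**; **DeepSeek sparse is deferred**.
+> **TL;DR**: Use `pegaflow-core` as an **in-process Rust library** for the physical backend of KV offload (HBM→DRAM/SSD/RDMA), filling in the offload layer that kvbm left unwritten. The connector brain (deciding which blocks to load/save) is built in-house on kvbm's logical/physical layering idea, and pegaflow is reduced to a semantics-agnostic raw block transfer backend. **The route has changed to ship Qwen3-4B full-attn first** (the original plan was Kimi first): the page-first single buffer is adapted via pegaflow's `block_stride_bytes` (PR #331). **End-to-end runs and is verified on real GPUs**: async SAVE + async LOAD are wired into `Qwen3Executor` + scheduler; `tests/kv_offload_cpu_hit.rs` covers pure CPU-hit and combined GPU+CPU hit, and logits after restore match cold compute; at the connector layer, `OffloadEngine` with `tests/cpu_roundtrip.rs` round-trips byte-for-byte. Off by default (builder flag opt-in); **the server CLI is wired** (#316: `--kv-offload` / `--kv-offload-host-gib` / `--no-prefix-cache`, passed through on both the plain and `--enable-lora` launch paths). The pure-L2 benchmark measured Qwen3-4B mean TTFT 195→40ms (−79%; evict-before-probe → `gpu_hit=0`, with the whole prefix restored from the host tier). **Qwen3.5 linear/SSM state is explicitly excluded**; **DeepSeek sparse is deferred**.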
 >
 > Last touched: 2026-06
 
@@ -17,7 +17,7 @@
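 - **Live wiring (§9, landed)**: `Qwen3Executor` holds an `Option<OffloadEngine>` (opt-in via `Qwen3OffloadOptions`, off by default); a SAVE hook (`save_sealed_blocks`, async fire-and-forget) plus non-blocking prefetch admission (`begin_kv_prefetch`/`drain_ready_prefetch`/`wait_ready_prefetch`, scheduler `loading` state). A single test in `tests/kv_offload_cpu_hit.rs` runs two scenes in sequence, pure CPU restore (`gpu_hit==0`) and a combined GPU+CPU hit (G=3 + C=3 stitched into one contiguous prefix), and after restore the first-token logits match cold compute (mean Δ≈0.03 nat, the bf16 floor).
 - **Three correctness hardenings** (after a toxic-review pass): ① the query lease is explicitly released via `release_query_lease` when `reserve_loaded_blocks` fails or the `load` submission fails, so it no longer leaks until the 600s TTL; ② on admission rejection (context / KV budget / unknown LoRA), `drop_request` releases already-settled prefetch state, so committed blocks no longer leak; ③ async SAVE moves a strong `ImmutableBlock` reference (`KvBlockGuard`) for each saved block into the spawned task and drops it only once the D2H copy lands, closing the silent-corruption window of "request ends → slot reallocated → D2H captures the wrong KV and writes it under the old hash".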
 
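-Not yet wired into the server CLI (reachable only via `start_engine_with_offload` / test entry points). **The dependency no longer comes from a fork**: PRs #331 and #333 are both merged into upstream master (squashed into `07cac7e`), `third_party/pegaflow` has been deleted, and `pegaflow-core` is now a **git dependency** pinned to that rev (see §5.2); GPU tests behave the same under the git dep (identical deltas).
+**Server CLI is wired (#316)**: `--kv-offload` (bool) / `--kv-offload-host-gib` (f64, default 8.0; pegaflow allocates the whole pool with `cudaHostAlloc` at startup, so RSS reflects it immediately) / `--no-prefix-cache` (vLLM-style; without offload it disables prefix matching, with offload it is pure-L2 mode, where evict-before-probe makes every prefix restore from the host tier). Both the plain and `--enable-lora` paths pass through `offload_options` + `no_prefix_cache`. This is safe under LoRA because prefix block hashes are salted with the adapter name (`compute_salt_hash`), so restored KV (from HBM or the host tier) never crosses adapters. Three hardenings from the #316 review: echo requests are not offered prefetch (their prefill skips `match_and_add_prefix`, so prefetched blocks would go unused); admission credits already-settled prefix blocks via `prefetched_blocks`; and `drop_request` waits for in-flight H2D copies to land before releasing the reservation. **The dependency no longer comes from a fork**: PRs #331 and #333 are both merged into upstream master (squashed into `07cac7e`), `third_party/pegaflow` has been deleted, and `pegaflow-core` is now a **git dependency** pinned to that rev (see §5.2); GPU tests behave the same under the git dep (identical deltas).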
 
 Related: [kv-cache-design.md](kv-cache-design.md) (logical/physical layering; already lists pegaflow in its design survey) · [qwen3-kvbm-integration-spec.md](qwen3-kvbm-integration-spec.md) (kvbm-logical already integrated) · `models/kimi-k2/kv-cache-design.md` (Kimi already uses `BlockPool`) · `models/qwen3/prefix-cache.md` (in-HBM prefix reuse already landed).
 
@@ -45,8 +45,8 @@ The design goal of the `kvbm-physical` / `kvbm-engine` vendored in the openinfer repo is
 
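 | Model | KV layout | Active set | Cross-request reuse | Offload verdict |
 | --- | --- | --- | --- | --- |
-| **Kimi-K2 MLA** | paged, per-layer ckv/kpe arena, backed by `BlockPool`; latent 68.6 KiB/token, no per-head | None (dense, full prefix) | Yes (in-HBM prefix cache already landed) | **Ships first**: cleanest integration surface; layout maps directly onto pegaflow registration |
-| **Qwen3 / Qwen3.5 full-attn** | paged, page-first single buffer, `PagePool` | None (dense, full prefix) | Yes (prefix cache already landed) | **Ships second**: page-first conflicts with pegaflow's `stride==copy-size` ABI; needs `block_stride` added (see §5.R1) |
+| **Qwen3 / Qwen3.5 full-attn** | paged, page-first single buffer, `PagePool` | None (dense, full prefix) | Yes (prefix cache already landed) | **Shipped first (#316)**: the conflict between page-first and pegaflow's `stride==copy-size` ABI is resolved by `block_stride` (§5.R1); runs end to end |
+| **Kimi-K2 MLA** | paged, per-layer ckv/kpe arena, backed by `BlockPool`; latent 68.6 KiB/token, no per-head | None (dense, full prefix) | Yes (in-HBM prefix cache already landed) | **Next candidate**: layout maps directly onto pegaflow registration (cleanest integration surface); it only needs to reuse the Qwen3-4B connector pattern |
 | **Qwen3.5 linear (24 layers)** | per-request `RecurrentState` [32,128,128] f32, 2 MiB/layer, non-paged, independently allocated | None (reads and writes the whole matrix every step) | **Zero** (a lossy summary of this request only, not content-addressable) | **Excluded**: offload yields no prefix/dedup benefit; saving VRAM here means per-request swap-out, which is a separate mechanism |
 | **DeepSeek-V4 sparse** | per-request per-layer dense arena [window\|compressed], non-paged; compressor 4:1 | **Explicit**: `topk_idxs` = window rows + compressed rows selected by the indexer, at token/row granularity, reselected every step | Partial | **Deferred**: the compressor already bounds the footprint; the indexer signal is available, but token granularity ≠ block granularity (see §7) |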
 
@@ -56,8 +56,8 @@ The design goal of the `kvbm-physical` / `kvbm-engine` vendored in the openinfer repo is
 
 ## 4. Roadmap
 
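-1. **Kimi MLA ships first**: pegaflow serves as the host/SSD tier under `BlockPool`; blocks are demoted to host on eviction and restored from host when a prefix query hits. Bandwidth is cheap (the latent is small) and the layout is zero-impedance.
-2. **Qwen full-attn ships second**: first add `block_stride_bytes` to pegaflow (R1), then hook up the page-first buffer.
+1. **Qwen full-attn shipped first (#316)**: added `block_stride_bytes` to pegaflow (R1) to resolve the page-first ABI conflict; async SAVE + non-blocking prefetch admission are wired into `Qwen3Executor` + scheduler, and the server CLI is wired.
+2. **Kimi MLA is the next candidate**: pegaflow serves as the host/SSD tier under `BlockPool`; blocks are demoted to host on eviction and restored from host when a prefix query hits. Bandwidth is cheap (the latent is small), the layout is zero-impedance, and it directly reuses the Qwen3-4B connector pattern.
 3. **Linear excluded, sparse deferred**.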
 
 ## 5. Feasibility (adversarial-verification conclusions, with evidence)