2 changes: 1 addition & 1 deletion docs/index.md
@@ -93,7 +93,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| --- | --- |
| `subsystems/runtime/runtime.md` | Runtime complexity is controlled by a shared `openinfer-core` that owns the generation contract and orchestration; per-model crates implement `ModelForward` so prefill/decode and hybrid attention stay hidden from the caller. State (`&mut`) is separated from weights (`&self`) for future bs > 1. |
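| `subsystems/runtime/kv-cache-design.md` | Dynamo-style logical/physical layered KV cache: BlockManager owns block lifecycle and admission; the PhysicalBackend trait owns GPU memory and layout (FullAttention / MLA). Supports TP / DP. Based on a survey of vLLM/Dynamo/pegaflow. |
| `subsystems/runtime/pegaflow-offload-integration.md` | Uses `pegaflow-core` as an in-process Rust library for the physical KV-offload backend (HBM→DRAM/SSD/RDMA), filling in the offload layer kvbm left unwritten. **Qwen3-4B full-attn ships first; the end-to-end path has been run and verified on real GPUs** (async SAVE+LOAD wired into executor/scheduler; after restore, logits match a cold compute for both pure CPU-hit and combined GPU+CPU hit). pegaflow is pinned by git rev (#331+#333). Off by default; not wired into the server CLI. linear excluded, sparse deferred. |
| `subsystems/runtime/pegaflow-offload-integration.md` | Uses `pegaflow-core` as an in-process Rust library for the physical KV-offload backend (HBM→DRAM/SSD/RDMA), filling in the offload layer kvbm left unwritten. **Qwen3-4B full-attn ships first; the end-to-end path has been run and verified on real GPUs** (async SAVE+LOAD wired into executor/scheduler; after restore, logits match a cold compute for both pure CPU-hit and combined GPU+CPU hit). pegaflow is pinned by git rev (#331+#333). Off by default; server CLI wired (#316: `--kv-offload`/`--no-prefix-cache`, plain+LoRA). linear excluded, sparse deferred. |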

## subsystems / scheduler

7 changes: 4 additions & 3 deletions docs/models/qwen3/roadmap.md
@@ -1,6 +1,6 @@
# Qwen3-4B Roadmap

> **TL;DR:** Qwen3-4B is the maturity bar of the project — continuous batching, TP=2, default-on prefix cache (#216), and the HF logits golden gate are all live — so its roadmap is sharpening, not bring-up. The #220 RoPE OOB bug is now fixed (cos/sin cache sized from `max_position_embeddings`, admission rejects past the window, kernel traps an out-of-range position; gated by both an oversized-reject and an in-window >4096 IT). The verified open set: per-row batch-decode sampling (O(batch) launches + syncs per step despite a production-proven batched primitive in-tree), zero TP correctness coverage, LoRA built but gated only by a zero-adapter smoke, prefix-cache observability dropped at the scheduler boundary, a docs layer that describes deleted tooling, and the YaRN #8 follow-up for rope-scaled checkpoints. Open-set findings verified 2026-06-04 against `6ee9247`.
> **TL;DR:** Qwen3-4B is the maturity bar of the project — continuous batching, TP=2, default-on prefix cache (#216), and the HF logits golden gate are all live — so its roadmap is sharpening, not bring-up. The #220 RoPE OOB bug is now fixed (cos/sin cache sized from `max_position_embeddings`, admission rejects past the window, kernel traps an out-of-range position; gated by both an oversized-reject and an in-window >4096 IT). The verified open set: per-row batch-decode sampling (O(batch) launches + syncs per step despite a production-proven batched primitive in-tree), zero TP correctness coverage, LoRA built but gated only by a zero-adapter smoke, prefix-cache observability dropped at the scheduler boundary, a docs layer that describes deleted tooling, and the YaRN #8 follow-up for rope-scaled checkpoints. Open-set findings verified 2026-06-04 against `6ee9247`. Since then: batched greedy sampling phase 1 landed (#307) and in-process pegaflow KV offload / L2 host tier shipped (#316, pure-L2 TTFT 195→40ms) — both folded into the table below; phase-2 random sampling and the rest of the open set stand.
>
> **Last touched:** 2026-06

@@ -14,7 +14,8 @@ Tracking issue: see the `[Model] Qwen3-4B roadmap` GitHub issue. Cross-model ite
| Prefix cache | ✓ default-on full-block kvbm matching (#216); 4 cache-hit replay passes in the golden gate | `executor.rs:750-751`, `tests/hf_golden_gate.rs` |
| Accuracy gate | ✓ HF bf16 golden, bs=1/batched/graph + cached replays; single-GPU, ≤256-token prompts | `tests/hf_golden_gate.rs:451` |
| Long context | ✓ fixed: RoPE cache sized from `max_position_embeddings`, admission rejects past the window, kernel traps OOB; gated by reject + in-window >4096 ITs. YaRN #8 still open for scaled checkpoints | `weights.rs:310-318`, `tests/context_window.rs`, `tests/context_window_in_window.rs` |
| Batch sampling | ✗ per-row: O(batch) launches + O(batch) D2H syncs per decode step; 1MB scratch literal | `executor.rs:159-179,212-214`, `ops/sampling.rs:7,204-208` |
| Batch sampling | ⚠ greedy phase landed (#307: one `argmax_batch_bf16_into` launch/step, not O(batch)); batched random per-row path + 1MB scratch-from-literal still open (phase 2) | `ops/sampling.rs`, `csrc/shared/argmax.cu` |
| KV offload (L2) | ✓ in-process pegaflow host-tier save/restore (#316); CLI `--kv-offload`/`--no-prefix-cache`, plain + LoRA; pure-L2 TTFT 195→40ms | `subsystems/runtime/pegaflow-offload-integration.md` |
| TP correctness | ✗ zero automated coverage — every test runs `device_ordinals: vec![0]` | grep `tests/` |
| LoRA | ⚠ load/unload/TP/request-level all built; only test uses a **zero adapter** | `lora.rs`, `tests/lora_smoke.rs:91-130` |
| Non-greedy sampling | ✗ zero correctness coverage (all tests greedy); penalties/min_p absent from `SamplingParams` | grep `tests/` |
@@ -26,7 +27,7 @@ Tracking issue: see the `[Model] Qwen3-4B roadmap` GitHub issue. Cross-model ite
### Now

1. **YaRN for rope-scaled checkpoints (#8).** The #220 RoPE OOB fix landed scope (a): the cos/sin cache is sized from `config.max_position_embeddings`, admission crash-early rejects past the window (distinct context-length vs KV-budget reasons), the kernel `__trap`s an out-of-range position as a last-resort backstop, and the gate now covers both an oversized reject and an in-window >4096 case (`tests/context_window.rs`, `tests/context_window_in_window.rs`). That precompute is correct *only because this checkpoint has `rope_scaling: null`*. Scope (b) remains open: #8 YaRN is the prerequisite for any rope-scaled checkpoint — the precompute length must come from the scaled schedule, coordinated with the qwen3.5 sibling fix so both crates share the pattern.
2. **Batched greedy decode sampling.** Phase 1: route all-greedy batches through `argmax_batch_bf16_into` — one launch + one D2H per step; this primitive is production-proven in deepseek-v2-lite (`runtime.rs:1379`). `flashinfer_top1_batch_into` has *no* production caller and needs its own validation before use. Phase 2: batched random path with per-row params; source the 1MB FlashInfer row-state scratch from the kernel instead of the literal. Shared `openinfer-core/kernels` work — covers qwen35 too. Gated by the existing golden gate.
2. **Batched greedy decode sampling.** Phase 1 landed (#307): all-greedy batches route through `argmax_batch_bf16_into` — one launch + one D2H per step (the deepseek-v2-lite-proven primitive, now in `openinfer-core` + `csrc/shared/argmax.cu`). Phase 2 still open: batched random path with per-row params; source the 1MB FlashInfer row-state scratch from the kernel instead of the literal. Shared `openinfer-core/kernels` work — covers qwen35 too. Gated by the existing golden gate.
3. **Sampling correctness coverage.** Every test in both qwen crates is greedy. Add seed-determinism + temperature/top_k/top_p behavioral tests, and audit the frontend for silently-dropped params (penalties, min_p are absent from `SamplingParams` entirely) — the kimi-k2 silent-greedy bug (#237) shows this class is real and currently nothing would catch it here.
4. **Prefix-cache observability.** `cached_tokens` is computed (`executor.rs:751`) and dies at the scheduler boundary; the frontend hardcodes `num_cached_tokens: 0`. Thread it through `TokenEvent::Scheduled` into usage; log hit rate. Adjacent: #78 (streaming usage discards completion_tokens) — same usage-accounting surface.
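The launch-count difference behind item 2 can be illustrated with a small host-side model. This is only a sketch in plain Python, not the CUDA primitive: `LaunchCounter`, `sample_greedy_per_row`, and `sample_greedy_batched` are hypothetical names, and a "launch" is simulated by incrementing a counter.

```python
from typing import List, Sequence


class LaunchCounter:
    """Counts simulated kernel launches and device-to-host (D2H) copies."""

    def __init__(self) -> None:
        self.launches = 0
        self.d2h_syncs = 0


def argmax_row(logits: Sequence[float]) -> int:
    """Index of the largest logit; ties resolve to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_greedy_per_row(batch: List[Sequence[float]], counter: LaunchCounter) -> List[int]:
    """Per-row path: one simulated launch and one D2H sync for every row."""
    tokens = []
    for row in batch:
        counter.launches += 1
        counter.d2h_syncs += 1
        tokens.append(argmax_row(row))
    return tokens


def sample_greedy_batched(batch: List[Sequence[float]], counter: LaunchCounter) -> List[int]:
    """Batched path: one simulated launch and one D2H sync for the whole step."""
    counter.launches += 1
    counter.d2h_syncs += 1
    return [argmax_row(row) for row in batch]


if __name__ == "__main__":
    batch = [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2], [0.0, 0.0, 1.0]]
    per_row, batched = LaunchCounter(), LaunchCounter()
    assert sample_greedy_per_row(batch, per_row) == sample_greedy_batched(batch, batched)
    print(per_row.launches, batched.launches)
```

Both paths return the same tokens for an all-greedy batch; only the number of launches and syncs per decode step differs, which is the property the phase-1 change targets.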

3 changes: 2 additions & 1 deletion docs/roadmap/execution.md
@@ -10,6 +10,7 @@ These are the shared layers — frontend, runtime, kernels, ledger/simulator/tra

- **Model-owned kernel plans.** Qwen3 already carries a light `kernel_plan` mapping prefill/decode/unified phases → Rust wrappers, FFI symbols, and CUDA/Triton/cuBLAS backends. Extend the same shape to Qwen3.5 and DeepSeek V4 so each model crate is self-describing.
- **Frontend polish.** `vllm-frontend-rs` is the default OpenAI surface, talking to openinfer via a local engine-core IPC bridge. Outstanding: logprobs / prompt-logprobs translation, usage accounting, and a deliberate decision on whether the served-model-id should decouple from the tokenizer path.
- **KV data plane (pegaflow).** First concrete tier landed: `openinfer-kv-offload::OffloadEngine` wraps in-process `pegaflow-core` as a host-tier ("L2") KV backend (mechanism), with per-model schedulers owning the residency policy (no universal connector trait — policy/mechanism split per `direction.md`). Shipped on Qwen3-4B full-attn (#316). Next candidate: Kimi-K2 MLA (layout zero-impedance, reuses the same connector pattern). SSD/RDMA tiers and DeepSeek sparse remain unrealized scope. See `subsystems/runtime/pegaflow-offload-integration.md`.

### Next

@@ -56,7 +57,7 @@ Each model crate owns its own scheduler, kernels, accuracy story, and benchmarks

**Goal:** stay ahead of vLLM on serving experience while expanding the parallel-strategy and scheduling surface.

**Done:** single-request and continuous batching ahead of vLLM (decode TPOT wins at all concurrencies; QPS=2 within 2% throughput while leading TTFT, TPOT, and latency stability). TP=2 brought up end-to-end on one machine; TP=8 smoke-tested on 8×4090. Issue #85 KV-pressure hang at QPS=2 is fixed.
**Done:** single-request and continuous batching ahead of vLLM (decode TPOT wins at all concurrencies; QPS=2 within 2% throughput while leading TTFT, TPOT, and latency stability). TP=2 brought up end-to-end on one machine; TP=8 smoke-tested on 8×4090. Issue #85 KV-pressure hang at QPS=2 is fixed. Batched greedy decode sampling landed (#307 — one `argmax_batch_bf16_into` launch per step, not O(batch)). **In-process pegaflow KV offload shipped (#316)** — host-tier "L2" save/restore wired into executor + scheduler, server CLI (`--kv-offload` / `--kv-offload-host-gib` / `--no-prefix-cache`) on both plain and LoRA paths; pure-L2 TTFT 195→40 ms measured. See `subsystems/runtime/pegaflow-offload-integration.md`.

**Next:**
- Explore pipeline parallelism (PP) as a complement to TP — particularly for larger model fits and multi-node layouts.
12 changes: 6 additions & 6 deletions docs/subsystems/runtime/pegaflow-offload-integration.md
@@ -1,6 +1,6 @@
# pegaflow KV Offload Integration Spec

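> **TL;DR**: Use `pegaflow-core` as an **in-process Rust library** for the physical KV-offload backend (HBM→DRAM/SSD/RDMA), filling in the offload layer that kvbm left unwritten. The connector brain (which decides which blocks to load/save) is built in-house on kvbm's logical/physical layering idea; pegaflow is reduced to a semantics-agnostic raw block transfer backend. **The route has been changed to ship Qwen3-4B full-attn first** (the original plan was Kimi first): the page-first single buffer is adapted via pegaflow `block_stride_bytes` (PR #331). **The end-to-end path has been run and verified on real GPUs**: async SAVE + async LOAD are wired into `Qwen3Executor` + scheduler; `tests/kv_offload_cpu_hit.rs` covers pure CPU-hit and combined GPU+CPU hit, and logits after restore match a cold compute; the connector layer `OffloadEngine` + `tests/cpu_roundtrip.rs` round-trips byte-identically. Off by default (builder flag opt-in); not wired into the server CLI. **Qwen3.5 linear/SSM state is explicitly excluded**; **DeepSeek sparse is deferred**.
> **TL;DR**: Use `pegaflow-core` as an **in-process Rust library** for the physical KV-offload backend (HBM→DRAM/SSD/RDMA), filling in the offload layer that kvbm left unwritten. The connector brain (which decides which blocks to load/save) is built in-house on kvbm's logical/physical layering idea; pegaflow is reduced to a semantics-agnostic raw block transfer backend. **The route has been changed to ship Qwen3-4B full-attn first** (the original plan was Kimi first): the page-first single buffer is adapted via pegaflow `block_stride_bytes` (PR #331). **The end-to-end path has been run and verified on real GPUs**: async SAVE + async LOAD are wired into `Qwen3Executor` + scheduler; `tests/kv_offload_cpu_hit.rs` covers pure CPU-hit and combined GPU+CPU hit, and logits after restore match a cold compute; the connector layer `OffloadEngine` + `tests/cpu_roundtrip.rs` round-trips byte-identically. Off by default (builder flag opt-in); **server CLI wired** (#316: `--kv-offload` / `--kv-offload-host-gib` / `--no-prefix-cache`, passed through on both the plain and `--enable-lora` startup paths). A pure-L2 benchmark measured Qwen3-4B mean TTFT 195→40ms (−79%; evict-before-probe → `gpu_hit=0`, with the whole prefix restored from the host tier). **Qwen3.5 linear/SSM state is explicitly excluded**; **DeepSeek sparse is deferred**.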
>
> Last touched: 2026-06

@@ -17,7 +17,7 @@
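- **Live wiring (§9, landed)**: `Qwen3Executor` holds an `Option<OffloadEngine>` (`Qwen3OffloadOptions` opt-in, off by default); SAVE hook (`save_sealed_blocks`, async fire-and-forget) + non-blocking prefetch admission (`begin_kv_prefetch`/`drain_ready_prefetch`/`wait_ready_prefetch`, scheduler `loading` state). A single test in `tests/kv_offload_cpu_hit.rs` runs two scenarios in sequence, pure CPU restore (`gpu_hit==0`) and combined GPU+CPU hit (G=3 + C=3 stitched into one contiguous prefix); after restore, first-token logits match a cold compute (mean Δ≈0.03 nat, the bf16 floor).
- **Three correctness hardenings** (after toxic-review): ① the query lease is explicitly released via `release_query_lease` when `reserve_loaded_blocks` fails or the `load` submission fails, so it no longer leaks until the 600s TTL; ② on admission rejection (context / KV budget / unknown LoRA), `drop_request` releases already-settled prefetch state, so committed blocks no longer leak; ③ async SAVE moves a strong `ImmutableBlock` reference (`KvBlockGuard`) for each block being saved into the spawned task and drops it only after the D2H lands, closing the silent-corruption window of "request ends → slot reallocated → D2H captures the wrong KV and writes it under the old hash".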
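The third hardening (holding a strong block reference until the D2H copy lands) can be modelled with a toy reference-counted slot pool. This is a hypothetical sketch, not the `KvBlockGuard` implementation: `SlotPool`, `Guard`, and `begin_save` are made-up names, and the async copy is modelled as a deferred callable.

```python
from typing import Callable, Dict


class SlotPool:
    """Toy block pool: a slot can only be reallocated when nothing references it."""

    def __init__(self) -> None:
        self.data: Dict[int, bytes] = {}
        self.refs: Dict[int, int] = {}

    def write(self, slot: int, kv: bytes) -> None:
        self.data[slot] = kv
        self.refs.setdefault(slot, 0)

    def acquire(self, slot: int) -> "Guard":
        self.refs[slot] += 1
        return Guard(self, slot)

    def try_reallocate(self, slot: int, new_kv: bytes) -> bool:
        """Refuse to overwrite a slot that a pending save still references."""
        if self.refs[slot] > 0:
            return False
        self.data[slot] = new_kv
        return True


class Guard:
    """Strong reference to one slot; release() drops it exactly once."""

    def __init__(self, pool: SlotPool, slot: int) -> None:
        self.pool, self.slot, self.live = pool, slot, True

    def release(self) -> None:
        if self.live:
            self.pool.refs[self.slot] -= 1
            self.live = False


def begin_save(pool: SlotPool, slot: int, block_hash: str,
               host_store: Dict[str, bytes]) -> Callable[[], None]:
    """Start a simulated async save; the returned callable is the D2H completion."""
    guard = pool.acquire(slot)  # taken before the "spawn", held until the copy lands

    def finish_d2h() -> None:
        host_store[block_hash] = pool.data[slot]  # copy happens while the slot is pinned
        guard.release()

    return finish_d2h
```

With the guard held, a request that finishes early cannot cause its slot to be reused before the copy runs, so the bytes stored under `block_hash` are always the ones that hash describes.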

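Not wired into the server CLI (reachable only via `start_engine_with_offload` / test entry points). **Dependency moved off the fork**: PRs #331+#333 are both merged into upstream master (squashed into `07cac7e`), `third_party/pegaflow` has been deleted, and `pegaflow-core` is now a **git dependency** pinned to that rev (see §5.2); GPU test behavior is unchanged under the git dep (same delta).
**Server CLI wired (#316)**: `--kv-offload` (bool) / `--kv-offload-host-gib` (f64, default 8.0; pegaflow `cudaHostAlloc`s the whole region at startup, so RSS reflects it immediately) / `--no-prefix-cache` (vLLM style; without offload it turns prefix matching off, with offload it selects pure-L2 mode, where evict-before-probe makes every prefix restore from the host tier). Both the plain and `--enable-lora` paths pass through `offload_options` + `no_prefix_cache`; this is safe under LoRA because prefix block hashes are salted with the adapter name (`compute_salt_hash`), so restored KV (from HBM or the host tier) never crosses adapters. Three #316 review hardenings: echo requests are not offered prefetch (their prefill skips `match_and_add_prefix`, so prefetched blocks would go unused); admission credits already-settled prefix blocks via `prefetched_blocks`; `drop_request` waits for in-flight H2D to land before releasing the reservation. **Dependency moved off the fork**: PRs #331+#333 are both merged into upstream master (squashed into `07cac7e`), `third_party/pegaflow` has been deleted, and `pegaflow-core` is now a **git dependency** pinned to that rev (see §5.2); GPU test behavior is unchanged under the git dep (same delta).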
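How the three flags combine can be sketched with a stand-in parser. The flag names and the 8.0 default come from the text above; everything else is an assumption for illustration: the real server is not a Python program, and `build_parser`, `resolve_mode`, `KvCacheMode`, and the mode labels are hypothetical.

```python
import argparse
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class KvCacheMode:
    label: str        # illustrative label, not a name used by the server
    offload: bool     # host-tier ("L2") offload enabled
    host_gib: float   # size of the pinned host region requested


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="server-sketch")
    parser.add_argument("--kv-offload", action="store_true")
    parser.add_argument("--kv-offload-host-gib", type=float, default=8.0)
    parser.add_argument("--no-prefix-cache", action="store_true")
    return parser


def resolve_mode(argv: List[str]) -> KvCacheMode:
    """Map the flag combination to the behavior described in the paragraph above."""
    args = build_parser().parse_args(argv)
    if args.kv_offload and args.no_prefix_cache:
        label = "pure-L2"             # every prefix restores from the host tier
    elif args.kv_offload:
        label = "hbm+L2"              # HBM prefix cache backed by the host tier
    elif args.no_prefix_cache:
        label = "no-prefix-matching"  # prefix matching turned off
    else:
        label = "hbm-prefix-cache"    # default: prefix cache on, offload off
    return KvCacheMode(label, args.kv_offload, args.kv_offload_host_gib)


if __name__ == "__main__":
    print(resolve_mode(["--kv-offload", "--no-prefix-cache"]))
```

The point of the table-like branching is that `--no-prefix-cache` means two different things depending on whether `--kv-offload` is also set.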

Related: [kv-cache-design.md](kv-cache-design.md) (logical/physical layering; already lists pegaflow in its design survey) · [qwen3-kvbm-integration-spec.md](qwen3-kvbm-integration-spec.md) (kvbm-logical integrated) · `models/kimi-k2/kv-cache-design.md` (Kimi already uses `BlockPool`) · `models/qwen3/prefix-cache.md` (in-HBM prefix reuse landed).

@@ -45,8 +45,8 @@ The design goal of the `kvbm-physical` / `kvbm-engine` crates vendored in the openinfer repo is exactly

| Model | KV form | active set | Cross-request reuse | Offload verdict |
| --- | --- | --- | --- | --- |
| **Kimi-K2 MLA** | paged, per-layer ckv/kpe arena, backed by `BlockPool`; latent 68.6 KiB/token, no per-head | none (dense, full prefix) | yes (in-HBM prefix cache landed) | **Ships first**: cleanest integration surface; layout maps directly onto pegaflow registration |
| **Qwen3 / Qwen3.5 full-attn** | paged, page-first single buffer, `PagePool` | none (dense, full prefix) | yes (prefix cache landed) | **Ships second**: page-first conflicts with pegaflow's `stride==copy-size` ABI; needs `block_stride` added (see §5.R1) |
| **Qwen3 / Qwen3.5 full-attn** | paged, page-first single buffer, `PagePool` | none (dense, full prefix) | yes (prefix cache landed) | **Shipped first (#316)**: the page-first conflict with pegaflow's `stride==copy-size` ABI is resolved by `block_stride` (§5.R1); the end-to-end path runs |
| **Kimi-K2 MLA** | paged, per-layer ckv/kpe arena, backed by `BlockPool`; latent 68.6 KiB/token, no per-head | none (dense, full prefix) | yes (in-HBM prefix cache landed) | **Next candidate**: layout maps directly onto pegaflow registration (cleanest integration surface); it can reuse the Qwen3-4B connector pattern as-is |
| **Qwen3.5 linear (24 layers)** | per-request `RecurrentState` [32,128,128] f32, 2 MiB/layer, non-paged, separately allocated | none (every step reads and writes the whole matrix) | **zero** (a lossy summary of this request, not content-addressable) | **Excluded**: offload brings no prefix/dedup benefit; saving HBM here means per-request swap-out, which is a different mechanism |
| **DeepSeek-V4 sparse** | per-request per-layer dense arena [window\|compressed], non-paged; compressor 4:1 | **explicit**: `topk_idxs` = window rows + compressed rows selected by the indexer, token/row granularity, reselected every step | partial | **Deferred**: the compressor already bounds the footprint; the indexer signal is readily available but token granularity ≠ block granularity (see §7) |

@@ -56,8 +56,8 @@ The design goal of the `kvbm-physical` / `kvbm-engine` crates vendored in the openinfer repo is exactly

## 4. Route

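1. **Kimi MLA ships first**: pegaflow serves as the host/SSD tier under `BlockPool`; a block is demoted to host on eviction and restored from host when a prefix query hits. Bandwidth is cheap (the latent is small) and the layout has zero impedance
2. **Qwen full-attn ships second**: first add `block_stride_bytes` to pegaflow (R1), then wire up the page-first buffer
1. **Qwen full-attn shipped first (#316)**: pegaflow gained `block_stride_bytes` (R1) to resolve the page-first ABI conflict; async SAVE + non-blocking prefetch admission are wired into `Qwen3Executor` + scheduler; the server CLI is wired
2. **Kimi MLA is the next candidate**: pegaflow serves as the host/SSD tier under `BlockPool`; a block is demoted to host on eviction and restored from host when a prefix query hits. Bandwidth is cheap (the latent is small) and the layout has zero impedance, so it reuses the Qwen3-4B connector pattern directly
3. **linear excluded, sparse deferred**.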
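The demote-on-evict / restore-on-hit behavior described for the Kimi route can be sketched as a two-tier block cache. This is a minimal illustration under stated assumptions: `TwoTierBlockCache` is a hypothetical name, LRU is an assumed eviction order (the route does not specify one), and both tiers are plain Python dictionaries rather than GPU or pinned host memory.

```python
from collections import OrderedDict
from typing import Dict, Optional, Tuple


class TwoTierBlockCache:
    """Toy model: a small 'HBM' tier backed by an unbounded 'host' tier."""

    def __init__(self, hbm_capacity: int) -> None:
        if hbm_capacity < 1:
            raise ValueError("hbm_capacity must be at least 1")
        self.hbm_capacity = hbm_capacity
        self.hbm: "OrderedDict[str, bytes]" = OrderedDict()  # oldest entry first
        self.host: Dict[str, bytes] = {}

    def put(self, block_hash: str, payload: bytes) -> None:
        """Insert into HBM; blocks evicted to make room are demoted to host."""
        self.hbm[block_hash] = payload
        self.hbm.move_to_end(block_hash)
        while len(self.hbm) > self.hbm_capacity:
            victim, data = self.hbm.popitem(last=False)
            self.host[victim] = data  # demote instead of dropping

    def get(self, block_hash: str) -> Optional[Tuple[str, bytes]]:
        """Return (tier, payload) for a hit, restoring host hits into HBM."""
        if block_hash in self.hbm:
            self.hbm.move_to_end(block_hash)
            return ("hbm", self.hbm[block_hash])
        if block_hash in self.host:
            payload = self.host[block_hash]
            self.put(block_hash, payload)  # restore; may demote another block
            return ("host", payload)
        return None


if __name__ == "__main__":
    cache = TwoTierBlockCache(hbm_capacity=2)
    for name in ("a", "b", "c"):
        cache.put(name, name.upper().encode())
    print(cache.get("a"))  # served from the host tier, then resident in HBM again
```

A prefix that has fallen out of HBM is still a hit here, just a slower one, which is the effect the host tier is meant to provide.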

## 5. Feasibility (adversarial-verification conclusions, with evidence)