feat(kv-offload): in-process pegaflow KV-cache offload on Qwen3-4B#316
Conversation
Introduce `pegainfer-kv-offload`, the in-process bridge that moves KV blocks between pegainfer's GPU paged cache and pegaflow's host/SSD tiers. `OffloadEngine` owns a `PegaEngine` plus a small tokio runtime, translates pegainfer's page-first `KvLayout` into pegaflow's per-layer strided registration (one [K|V] segment per layer, `block_stride` decoupling the fused page interleave), and exposes best-effort async save plus a polled, non-blocking load handle. pegaflow-core is pinned to upstream rev 07cac7e (carries block_stride #331 + the in-process load API #333); the workspace cudarc gains `nvrtc` for its embedded JIT copy kernel. Add the model-agnostic cache primitives the connector needs: `KvBuffer::device_ptr`, the `KvBlockGuard` / `assigned_block_guards` strong pins that keep source blocks alive across an in-flight save D2H, and the `PrefixProbe` / `LoadReservation` reservation path in `BlockPool`. `BlockManager::evict_inactive` drains the inactive pool without the reset assertion, for a cold-cache flush that tolerates still-pinned blocks. Drop the dead `kvbm-config`/`kvbm-engine`/`kvbm-physical` crates from the workspace — they pinned an older tokio and are unused by the offload path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drive the offload connector from the Qwen3-4B path. `Qwen3OffloadOptions` turns it on per executor; sealed prefix blocks are SAVEd best-effort after prefill (pinned via `KvBlockGuard` until the D2H lands), and a request can RESTORE a CPU-only prefix through `begin_kv_prefetch`: probe GPU prefix → query pegaflow → reserve loaded blocks → async load → settle at the right prefix offset, stacking the CPU continuation onto any GPU hit. The scheduler grows a `loading` admission state (`offer_prefetch` / `reclaim_ready_prefetch` / `block_on_loading`) so prefetch never blocks the GPU thread, and `release_rejected` releases any settled prefetch state when a request is rejected at admission so committed prefix blocks can't leak. On the load failure paths the query lease is released so its pinned host blocks are not stranded for the TTL. `tests/kv_offload_cpu_hit.rs` is the live gate on real Qwen3-4B weights: one executor, two sequential scenarios (pure CPU-tier restore after HBM eviction, and a combined GPU+CPU split prefix), asserting the restored KV reproduces the cold first-token logits to the bf16 floor. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the integration spec: pegaflow-core as an in-process physical offload tier behind a self-built kvbm-style connector brain, the page-first single-buffer registration, the live Qwen3-4B save/prefetch wiring, the three post-review correctness hardenings, and the git-rev dependency pin. Refresh the index.md row to match (Qwen3-4B full-attn shipped first and is live-validated). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request integrates pegaflow-core as an in-process Rust library to support KV cache offloading (HBM to DRAM/SSD/RDMA) for the single-GPU Qwen3-4B path. It introduces the pegainfer-kv-offload crate with an OffloadEngine to manage block transfers, updates the logical cache manager to support prefetching and block pinning, and wires these hooks into the Qwen3 executor and scheduler. A critical feedback item highlights a potential silent data corruption vulnerability in drop_request: if a request is cancelled while an async prefetch is in flight, its destination blocks are returned to the free pool and could be reassigned before the active DMA transfer completes. Synchronously waiting for the in-flight DMA to finish before releasing the reservation is recommended to prevent this issue.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| fn drop_request(&mut self, request_id: RequestId) -> Result<()> { | ||
| // Remove and drop — RAII on SchedulableSequence's block guards | ||
| // returns all allocated blocks regardless of lifecycle state. | ||
| // returns all allocated blocks regardless of lifecycle state. The same | ||
| // RAII frees any parked prefetch's reserved/held blocks. | ||
| self.request_kvs.remove(&request_id); | ||
| self.prefetch.remove(&request_id); | ||
| self.saved_cursor.remove(&request_id); | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
⚠️ 潜在的在途 DMA 内存污染与静默数据损坏风险 (In-flight DMA Memory Corruption Risk)
在当前实现中,如果一个请求在 begin_kv_prefetch 启动了异步加载之后、但在加载完成前被取消或释放(例如客户端断开连接),drop_request 会直接将该请求从 self.prefetch 中移除:
self.prefetch.remove(&request_id);这会直接 drop 掉 PrefetchState,从而释放其持有的 LoadReservation(即 MutableBlock 目标 GPU 块)。这些块会被立即归还给 BlockManager 的空闲池,并可能在接下来的 Tick 中被分配给其他全新请求。
然而,此时 pegaflow 已经在工作线程中提交了针对这些物理块 ID 的异步 CUDA DMA 拷贝(D2H/H2D)。由于拷贝无法被取消,在途的 DMA 最终落地时,会静默覆写已被重新分配给新请求的 GPU 块,导致极其难以排查的静默数据损坏 (Silent Data Corruption)。
建议解决方案
在 drop_request 中,如果发现该请求存在在途的预取句柄(handle),应当同步等待(wait())该 DMA 拷贝彻底完成后,再允许释放 reservation 归还物理块。由于调度器运行在专属的同步 OS 线程上,此处的短时间等待是完全安全且必要的。
fn drop_request(&mut self, request_id: RequestId) -> Result<()> {
// Remove and drop — RAII on SchedulableSequence's block guards
// returns all allocated blocks regardless of lifecycle state. The same
// RAII frees any parked prefetch's reserved/held blocks.
self.request_kvs.remove(&request_id);
if let Some(mut st) = self.prefetch.remove(&request_id) {
if let Some(handle) = st.handle.take() {
// 必须同步等待在途 DMA 完成,防止物理块被提前归还空闲池并被新请求复用,
// 从而避免在途 DMA 落地时造成静默数据损坏。
let _ = handle.wait();
}
}
self.saved_cursor.remove(&request_id);
Ok(())
}There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 79057768f6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self.prefetch.insert( | ||
| request_id, | ||
| PrefetchState { | ||
| probe, | ||
| reservation: Some(reservation), | ||
| handle: Some(handle), | ||
| }, |
There was a problem hiding this comment.
Account for prefetched pages during admission
When KV offload is enabled, offer_prefetch runs before admit_deferred_requests, and storing this PrefetchState keeps the LoadReservation's GPU pages allocated while the request is parked. After the load is promoted back to deferred, admission still checks max_needed <= available_blocks as if the request did not already own those restored prefix pages, so a long CPU-hit request near the KV budget can pin N pages, reduce available_blocks by N, fail admission forever, and never release the prefetch. Either admit with the prefetched pages credited to this request or avoid reserving them until after admission.
Useful? React with 👍 / 👎.
Drop-during-load corruption: `drop_request` released the `LoadReservation` even when the CPU→GPU load was still in flight. A request cancelled mid-load freed its destination blocks for immediate reuse while pegaflow's worker kept writing them — silent KV corruption, the load-side mirror of the SAVE keep-alive pin. It now blocks on the in-flight `LoadHandle` before the reservation drops; the scheduler is a dedicated sync thread, so the wait is free. Admission double-count: a settled prefetch's blocks are already out of `available_blocks`, but `admit_deferred_requests` still charged the full request footprint, so a near-budget CPU-hit request could fail admission forever and never release its prefetch. Admission now credits each request's held prefetch blocks (`prefetched_blocks` → `PrefixProbe::held_blocks`) against the free-pool budget, while keeping the full footprint for the per-request cap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…arking Wire `start_engine_with_offload` into the Qwen3 server path behind three flags: `--kv-offload` (enable the pegaflow host tier), `--kv-offload-host-gib` (pool size; pegaflow allocates it up front, so RSS reflects it at startup), and `--no-prefix-cache`. The last is the vLLM-style switch threaded through `start_qwen3` to `Qwen3Executor::set_no_prefix_cache`: without offload it disables prefix matching outright; with offload it is the pure-L2 mode (no cross-request HBM reuse, every prefix restored from the host tier) used to measure the L2 TTFT win. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
L2 (host-tier) TTFT benchmark + review fixesPure-L2 TTFT measurementAdded Setup: Qwen3-4B on RTX 5070 Ti (16 GB HBM), 30 GiB host pool, 64 × 2048-token prompts,
Server logs confirm it is genuinely L2, not a residual L1 hit: Every request restores the full 127-block prefix from the host tier ( Memory: pegaflow
Review fixes (
|
|
To use Codex here, create an environment for this repo. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f5db7ddfa1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if executor.begin_kv_prefetch( | ||
| req.request_id, | ||
| &req.prompt_tokens, | ||
| req.lora_adapter.as_deref(), | ||
| ) { |
There was a problem hiding this comment.
Skip KV prefetch for echo requests
When --kv-offload is enabled, this offers every deferred request to CPU-tier prefetch, including echo requests. Echo prefills later deliberately skip match_and_add_prefix (!req.echo) because prompt logprobs need the whole prompt forwarded, so a CPU-hit echo request can park and reserve restored blocks that admission credits via prefetched_blocks, then execute_prefill tries to schedule a full cold prefill while those pointless prefetch blocks are still held. Under tight KV budgets this makes repeated echo requests fail scheduling even though the same cold request would fit; avoid prefetching req.echo requests or release the prefetch before admission/scheduling.
Useful? React with 👍 / 👎.
Echo requests forward the whole prompt to recover prompt logprobs, so their prefill skips match_and_add_prefix and can never reuse a restored prefix. Offering them to async CPU-tier prefetch parked restored KV that admission credited via prefetched_blocks but prefill never spent, starving the request under tight KV budgets. Guard offer_prefetch with !req.echo so echo requests never probe. With no prefetch entry, prefetched_blocks(echo) is always 0, so admission charges the full footprint — consistent with the cold prefill echo actually runs. The echo => no-prefetch-state => zero-credit invariant is now structural rather than a runtime patch. Flagged by Codex review on #316. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Fixed the P2 echo finding in @chatgpt-codex-connector (P2 — echo + prefetch): confirmed and fixed. Echo requests forward the whole prompt for prompt logprobs, so their prefill skips The two other latest inline comments (gemini @ Verified: 35 lib tests (incl. the new echo test) + @codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f958c57b6e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| &device_ordinals, | ||
| seed, | ||
| lora_options.validate()?, | ||
| Qwen3OffloadOptions::disabled(), |
There was a problem hiding this comment.
Honor KV-offload flags in LoRA startup
When the server is started with --enable-lora --kv-offload (or --enable-lora --no-prefix-cache), it takes the LoRA-control startup path, which calls this wrapper; the wrapper now hard-codes Qwen3OffloadOptions::disabled() and has no way to propagate no_prefix_cache. That makes the newly added CLI flags silently no-op for LoRA-enabled Qwen3 even though the scheduler/executor path accepts offload options, so users benchmarking or serving LoRA with requested L2 offload still run with the normal GPU-only prefix cache.
Useful? React with 👍 / 👎.
start_engine_with_lora_control hard-coded Qwen3OffloadOptions::disabled() and had no no_prefix_cache parameter, so --enable-lora silently dropped --kv-offload / --kv-offload-host-gib / --no-prefix-cache — the flags no-op'd instead of erroring. Thread offload_options and no_prefix_cache through the LoRA-control path the same way the non-LoRA path does, and build the offload options once in main.rs for both branches. Offload under LoRA is correct by construction: the prefix block hash is salted with the adapter name (compute_salt_hash), so restored KV — HBM or host tier — never crosses adapters. Flagged by Codex review on #316. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Fixed the LoRA-startup P2 in @chatgpt-codex-connector (P2 — LoRA startup ignored offload flags): confirmed and fixed. Offload under LoRA is correct by construction — the prefix block hash is salted with the adapter name ( Verified: server + qwen3-4b build clean, The three older inline comments now re-anchored to @codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ff9fab39f2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if self.l1_retention_disabled { | ||
| // Pure-L2 mode: drop any cross-request HBM retention so the probe | ||
| // sees gpu_hit == 0 and queries the whole cacheable prefix from the | ||
| // host tier. Only inactive (completed, unheld) blocks are drained — | ||
| // the current request holds nothing yet, and in-flight prefetches | ||
| // keep their reserved blocks, so this never touches live KV. | ||
| self.kv_mgr.pool().evict_inactive(); |
There was a problem hiding this comment.
Enforce pure-L2 mode before matching save-pinned blocks
When --kv-offload --no-prefix-cache is used and the next identical request arrives while the previous request's fire-and-forget save is still in flight, save_sealed_blocks is holding KvBlockGuard pins for those HBM pages, so they are still active rather than inactive. This evict_inactive() call cannot remove them, and the following probe_prefix() can see a GPU hit (possibly leaving query_hashes empty), so the request reuses HBM instead of restoring from the host tier. That breaks the documented pure-L2/no-prefix-cache behavior for immediate TTFT-style repeats; the path needs to wait/flush saves or otherwise make save-pinned blocks unmatchable before probing.
Useful? React with 👍 / 👎.
What
Integrate pegaflow (Apache-2.0) into pegainfer as an in-process Rust KV-cache offload tier, dense-attention-first, on Qwen3-4B. A prefix that has fallen out of HBM — or was never GPU-resident — can be restored from pegaflow's host (and, later, SSD/RDMA) tiers instead of recomputed.
End-to-end is live and validated on real Qwen3-4B weights: async SAVE + async LOAD wired into
Qwen3Executor+ the scheduler, with the restored KV reproducing the cold first-token logits to the bf16 floor.How
pegainfer-kv-offload(new crate) —OffloadEngineowns aPegaEngine+ a small tokio runtime.Registration::from_buffermaps pegainfer's fused page-first single buffer into pegaflow's per-layer strided registration: within a page K and V are contiguous (layer_stride= one[K|V]segment), sosegments = 1andblock_stridedecouples the page interleave — not the K/V-split path. Save is best-effort fire-and-forget; load is a polled, non-blockingLoadHandle.pegainfer-kv-cache) —KvBuffer::device_ptr, theKvBlockGuard/assigned_block_guardsstrong pins, and thePrefixProbe/LoadReservationreservation path.BlockManager::evict_inactivegives a cold-cache flush that tolerates still-pinned blocks.Qwen3OffloadOptionsopt-in (default off). SAVE after prefill;begin_kv_prefetchprobes the GPU prefix, queries pegaflow, reserves + async-loads the CPU continuation, and settles it at the right prefix offset, stacking CPU blocks onto any GPU hit. The scheduler grows a non-blockingloadingadmission state so prefetch never stalls the GPU thread.Correctness hardenings (post toxic-review)
release_query_leaseon the reserve/load failure paths so pinned host blocks aren't stranded for the 600s TTL.release_rejectedcallsdrop_requestwhen a request is rejected at admission, so committed prefix blocks can't leak.KvBlockGuarduntil the D2H lands, closing the "request ends → slot reused → D2H reads the wrong KV" silent-corruption window.Tests
pegainfer-kv-offload/tests/cpu_roundtrip.rs— byte-level save → query → load on a realKvBuffer, plus a zero-block negative control.pegainfer-qwen3-4b/tests/kv_offload_cpu_hit.rs— live gate on real weights: one executor, two sequential scenarios (pure CPU-tier restore after HBM eviction, and a combined GPU+CPU split prefix), asserting warm logits match cold (mean Δ ≈ 0.03 nat).Run:
PEGAINFER_TEST_MODEL_PATH=models/Qwen3-4B cargo test --release -p pegainfer-qwen3-4b --test kv_offload_cpu_hitDependency
pegaflow-coreis a git dependency pinned to upstream rev07cac7e, which carries the two upstream changes this needs:block_stride_bytes(novitalabs/pegaflow#331) and the in-process load API (#333).default-features = falsedrops itscuda-12/rdma; the workspace cudarc gainsnvrtcfor pegaflow's embedded JIT copy kernel. The deadkvbm-config/kvbm-engine/kvbm-physicalcrates (older-tokio pin, unused) are dropped from the workspace.Scope
Default off; not yet wired into the server CLI (only
start_engine_with_offload+ tests). Qwen3.5 linear/SSM state is excluded; DeepSeek sparse is deferred. Seedocs/subsystems/runtime/pegaflow-offload-integration.md.🤖 Generated with Claude Code