feat(kv-offload): in-process pegaflow KV-cache offload on Qwen3-4B #316

Merged
xiaguan merged 7 commits into main from feat/pegaflow-kv-offload-qwen3
Jun 9, 2026

xiaguan commented Jun 9, 2026
Collaborator

What

Integrate pegaflow (Apache-2.0) into pegainfer as an in-process Rust KV-cache offload tier, dense-attention-first, on Qwen3-4B. A prefix that has fallen out of HBM — or was never GPU-resident — can be restored from pegaflow's host (and, later, SSD/RDMA) tiers instead of recomputed.

End-to-end is live and validated on real Qwen3-4B weights: async SAVE + async LOAD wired into Qwen3Executor + the scheduler, with the restored KV reproducing the cold first-token logits to the bf16 floor.

How

per-model scheduler   ← policy: which blocks should be resident
  ↓ load/save intents (a set of blocks)
pegainfer-kv-offload  ← the connector "brain": block identity, GPU-slot
  (OffloadEngine)        orchestration, transfer scheduling (kvbm logical/physical idea)
  ↓ semantics-free raw transfer
pegaflow-core         ← muscle: D2H/H2D + DRAM/SSD/RDMA tiers
  • pegainfer-kv-offload (new crate): OffloadEngine owns a PegaEngine plus a small tokio runtime. Registration::from_buffer maps pegainfer's fused page-first single buffer into pegaflow's per-layer strided registration. Within a page, K and V are contiguous (layer_stride = one [K|V] segment), so segments = 1 and block_stride decouples the page interleave; this is not the K/V-split path. Save is best-effort fire-and-forget; load is a polled, non-blocking LoadHandle.
  • Shared cache primitives (pegainfer-kv-cache) — KvBuffer::device_ptr, the KvBlockGuard / assigned_block_guards strong pins, and the PrefixProbe / LoadReservation reservation path. BlockManager::evict_inactive gives a cold-cache flush that tolerates still-pinned blocks.
  • Qwen3-4B wiring: Qwen3OffloadOptions is opt-in (default off). SAVE runs after prefill; begin_kv_prefetch probes the GPU prefix, queries pegaflow, reserves and async-loads the CPU continuation, and settles it at the right prefix offset, stacking CPU blocks onto any GPU hit. The scheduler grows a non-blocking loading admission state so prefetch never stalls the GPU thread.
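The stride mapping described in the first bullet can be sketched as plain offset arithmetic. This is a minimal illustration, not the pegaflow or pegainfer API: `PageFirstLayout` and its methods are hypothetical names, and the sketch assumes the fused buffer is laid out as `[page][layer][K|V]`, so that `layer_stride` is one segment and `block_stride` is one whole page.

```rust
/// Hypothetical description of a fused page-first KV buffer (assumed layout:
/// buffer[page][layer] holds one contiguous [K|V] segment).
#[derive(Debug, Clone, Copy)]
pub struct PageFirstLayout {
    pub num_layers: usize,
    /// Bytes of one [K|V] segment, i.e. one layer of one page.
    pub segment_bytes: usize,
}

impl PageFirstLayout {
    /// Distance in bytes between adjacent layers of the same page.
    pub fn layer_stride(&self) -> usize {
        self.segment_bytes
    }

    /// Distance in bytes between the same layer of adjacent pages.
    /// This is the stride that steps over the other layers interleaved in a page.
    pub fn block_stride(&self) -> usize {
        self.num_layers * self.segment_bytes
    }

    /// Byte offset of the [K|V] segment for `block` in `layer`.
    pub fn segment_offset(&self, block: usize, layer: usize) -> usize {
        assert!(layer < self.num_layers, "layer out of range");
        block * self.block_stride() + layer * self.layer_stride()
    }
}
```

Under these assumptions a per-layer view needs only a base pointer offset by `layer * layer_stride` and a step of `block_stride` per block, which is why a single segment per layer is enough.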

Correctness hardenings (post toxic-review)

  1. Query-lease leak: release_query_lease is called on the reserve/load failure paths so pinned host blocks aren't stranded for the 600s TTL.
  2. Prefetch-state leak on rejection: release_rejected calls drop_request when a request is rejected at admission, so committed prefix blocks can't leak.
  3. SAVE D2H race — source blocks are pinned via KvBlockGuard until the D2H lands, closing the "request ends → slot reused → D2H reads the wrong KV" silent-corruption window.
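The keep-alive idea behind item 3 can be illustrated with a small RAII sketch. It assumes a strong pin modelled as a shared `Arc`; `FreeList`, `BlockPin`, and `save_async` are hypothetical stand-ins, not the real `KvBlockGuard` or pegaflow save path, and a thread stands in for the D2H copy.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical free list of GPU block ids (not the real BlockManager).
#[derive(Default)]
pub struct FreeList {
    free: Mutex<Vec<u32>>,
}

impl FreeList {
    pub fn free_count(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}

/// Strong pin on one block. The id returns to the free list only when the
/// pin itself drops, i.e. when the last `Arc<BlockPin>` goes away.
pub struct BlockPin {
    pub id: u32,
    pub pool: Arc<FreeList>,
}

impl Drop for BlockPin {
    fn drop(&mut self) {
        self.pool.free.lock().unwrap().push(self.id);
    }
}

/// Fire-and-forget save: the worker owns its own clones of the pins, so the
/// source blocks cannot be reused even if the request finishes first.
pub fn save_async(pins: Vec<Arc<BlockPin>>) -> thread::JoinHandle<usize> {
    thread::spawn(move || {
        // Stand-in for the D2H copy: the blocks stay pinned while it runs.
        let copied = pins.iter().map(|pin| pin.id).count();
        drop(pins); // the copy has landed, release the worker's pins
        copied
    })
}
```

With this shape, dropping the request's own reference while a save is in flight leaves the block out of the free list until the worker finishes.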

Tests

  • pegainfer-kv-offload/tests/cpu_roundtrip.rs — byte-level save → query → load on a real KvBuffer, plus a zero-block negative control.
  • pegainfer-qwen3-4b/tests/kv_offload_cpu_hit.rs — live gate on real weights: one executor, two sequential scenarios (pure CPU-tier restore after HBM eviction, and a combined GPU+CPU split prefix), asserting warm logits match cold (mean Δ ≈ 0.03 nat).

Run: PEGAINFER_TEST_MODEL_PATH=models/Qwen3-4B cargo test --release -p pegainfer-qwen3-4b --test kv_offload_cpu_hit

Dependency

pegaflow-core is a git dependency pinned to upstream rev 07cac7e, which carries the two upstream changes this needs: block_stride_bytes (novitalabs/pegaflow#331) and the in-process load API (#333). default-features = false drops its cuda-12/rdma; the workspace cudarc gains nvrtc for pegaflow's embedded JIT copy kernel. The dead kvbm-config/kvbm-engine/kvbm-physical crates (older-tokio pin, unused) are dropped from the workspace.

Scope

Default off; not yet wired into the server CLI (only start_engine_with_offload + tests). Qwen3.5 linear/SSM state is excluded; DeepSeek sparse is deferred. See docs/subsystems/runtime/pegaflow-offload-integration.md.

🤖 Generated with Claude Code

xiaguan and others added 3 commits June 9, 2026 16:01
Introduce `pegainfer-kv-offload`, the in-process bridge that moves KV
blocks between pegainfer's GPU paged cache and pegaflow's host/SSD tiers.
`OffloadEngine` owns a `PegaEngine` plus a small tokio runtime, translates
pegainfer's page-first `KvLayout` into pegaflow's per-layer strided
registration (one [K|V] segment per layer, `block_stride` decoupling the
fused page interleave), and exposes best-effort async save plus a polled,
non-blocking load handle. pegaflow-core is pinned to upstream rev 07cac7e
(carries block_stride #331 + the in-process load API #333); the workspace
cudarc gains `nvrtc` for its embedded JIT copy kernel.

Add the model-agnostic cache primitives the connector needs:
`KvBuffer::device_ptr`, the `KvBlockGuard` / `assigned_block_guards` strong
pins that keep source blocks alive across an in-flight save D2H, and the
`PrefixProbe` / `LoadReservation` reservation path in `BlockPool`.
`BlockManager::evict_inactive` drains the inactive pool without the reset
assertion, for a cold-cache flush that tolerates still-pinned blocks.

Drop the dead `kvbm-config`/`kvbm-engine`/`kvbm-physical` crates from the
workspace — they pinned an older tokio and are unused by the offload path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drive the offload connector from the Qwen3-4B path. `Qwen3OffloadOptions`
turns it on per executor; sealed prefix blocks are SAVEd best-effort after
prefill (pinned via `KvBlockGuard` until the D2H lands), and a request can
RESTORE a CPU-only prefix through `begin_kv_prefetch`: probe GPU prefix →
query pegaflow → reserve loaded blocks → async load → settle at the right
prefix offset, stacking the CPU continuation onto any GPU hit.

The scheduler grows a `loading` admission state (`offer_prefetch` /
`reclaim_ready_prefetch` / `block_on_loading`) so prefetch never blocks the
GPU thread, and `release_rejected` releases any settled prefetch state when a
request is rejected at admission so committed prefix blocks can't leak. On
the load failure paths the query lease is released so its pinned host blocks
are not stranded for the TTL.
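A polled, non-blocking load handle of the kind this paragraph relies on might look like the sketch below. It is only an illustration built on a standard-library channel: `PolledLoad` and `LoadStatus` are hypothetical names, a spawned thread stands in for the pegaflow worker, and the real `LoadHandle` API may differ.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread;

/// Result of one non-blocking poll.
#[derive(Debug, PartialEq, Eq)]
pub enum LoadStatus {
    Pending,
    /// Number of blocks restored. Reported once; later polls return `Failed`.
    Ready(usize),
    Failed,
}

/// Hypothetical load handle backed by a channel.
pub struct PolledLoad {
    rx: Receiver<usize>,
}

impl PolledLoad {
    /// Start a stand-in load of `blocks` blocks on a worker thread.
    pub fn start(blocks: usize) -> Self {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            // A real worker would issue the H2D copies here.
            let _ = tx.send(blocks);
        });
        Self { rx }
    }

    /// Never blocks, so a scheduler tick can call it on every parked request.
    pub fn poll(&self) -> LoadStatus {
        match self.rx.try_recv() {
            Ok(n) => LoadStatus::Ready(n),
            Err(TryRecvError::Empty) => LoadStatus::Pending,
            Err(TryRecvError::Disconnected) => LoadStatus::Failed,
        }
    }

    /// Blocks until the load finishes; `None` if the worker went away.
    pub fn wait(self) -> Option<usize> {
        self.rx.recv().ok()
    }
}
```

The scheduler side would call `poll` each tick and keep the request parked while it returns `Pending`; `wait` is the blocking fallback for paths that must not release destination blocks early.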

`tests/kv_offload_cpu_hit.rs` is the live gate on real Qwen3-4B weights:
one executor, two sequential scenarios (pure CPU-tier restore after HBM
eviction, and a combined GPU+CPU split prefix), asserting the restored KV
reproduces the cold first-token logits to the bf16 floor.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the integration spec: pegaflow-core as an in-process physical offload
tier behind a self-built kvbm-style connector brain, the page-first single-buffer
registration, the live Qwen3-4B save/prefetch wiring, the three post-review
correctness hardenings, and the git-rev dependency pin. Refresh the index.md row
to match (Qwen3-4B full-attn shipped first and is live-validated).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request integrates pegaflow-core as an in-process Rust library to support KV cache offloading (HBM to DRAM/SSD/RDMA) for the single-GPU Qwen3-4B path. It introduces the pegainfer-kv-offload crate with an OffloadEngine to manage block transfers, updates the logical cache manager to support prefetching and block pinning, and wires these hooks into the Qwen3 executor and scheduler. A critical feedback item highlights a potential silent data corruption vulnerability in drop_request: if a request is cancelled while an async prefetch is in flight, its destination blocks are returned to the free pool and could be reassigned before the active DMA transfer completes. Synchronously waiting for the in-flight DMA to finish before releasing the reservation is recommended to prevent this issue.

Comment on lines 979 to 987
    fn drop_request(&mut self, request_id: RequestId) -> Result<()> {
        // Remove and drop: RAII on SchedulableSequence's block guards
        // returns all allocated blocks regardless of lifecycle state. The same
        // RAII frees any parked prefetch's reserved/held blocks.
        self.request_kvs.remove(&request_id);
        self.prefetch.remove(&request_id);
        self.saved_cursor.remove(&request_id);
        Ok(())
    }

high

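⚠️ In-flight DMA memory corruption and silent data corruption risk

In the current implementation, if a request is cancelled or released (for example, the client disconnects) after begin_kv_prefetch has started the async load but before the load completes, drop_request removes the request from self.prefetch directly:

    self.prefetch.remove(&request_id);

This drops the PrefetchState and thereby releases the LoadReservation it holds (the MutableBlock destination GPU blocks). Those blocks are returned to BlockManager's free pool immediately and may be assigned to a brand-new request on a subsequent tick.

However, pegaflow has already submitted the async CUDA DMA copy (D2H/H2D) for those physical block IDs on a worker thread. Because the copy cannot be cancelled, the in-flight DMA will, when it lands, silently overwrite GPU blocks that have been reassigned to the new request, causing silent data corruption that is extremely hard to debug.

Suggested fix

In drop_request, if the request has an in-flight prefetch handle, wait synchronously (wait()) for the DMA copy to finish completely before releasing the reservation and returning the physical blocks. Because the scheduler runs on a dedicated synchronous OS thread, a short wait here is safe and necessary.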

    fn drop_request(&mut self, request_id: RequestId) -> Result<()> {
        // Remove and drop — RAII on SchedulableSequence's block guards
        // returns all allocated blocks regardless of lifecycle state. The same
        // RAII frees any parked prefetch's reserved/held blocks.
        self.request_kvs.remove(&request_id);
        if let Some(mut st) = self.prefetch.remove(&request_id) {
            if let Some(handle) = st.handle.take() {
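                // Wait synchronously for the in-flight DMA to finish, so the physical blocks
                // are not returned to the free pool and reused by a new request before the
                // copy lands, which would cause silent data corruption.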
                let _ = handle.wait();
            }
        }
        self.saved_cursor.remove(&request_id);
        Ok(())
    }

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 79057768f6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1038 to +1044
    self.prefetch.insert(
        request_id,
        PrefetchState {
            probe,
            reservation: Some(reservation),
            handle: Some(handle),
        },

P1: Account for prefetched pages during admission

When KV offload is enabled, offer_prefetch runs before admit_deferred_requests, and storing this PrefetchState keeps the LoadReservation's GPU pages allocated while the request is parked. After the load is promoted back to deferred, admission still checks max_needed <= available_blocks as if the request did not already own those restored prefix pages, so a long CPU-hit request near the KV budget can pin N pages, reduce available_blocks by N, fail admission forever, and never release the prefetch. Either admit with the prefetched pages credited to this request or avoid reserving them until after admission.


xiaguan and others added 2 commits June 9, 2026 17:09
Drop-during-load corruption: `drop_request` released the `LoadReservation`
even when the CPU→GPU load was still in flight. A request cancelled mid-load
freed its destination blocks for immediate reuse while pegaflow's worker kept
writing them — silent KV corruption, the load-side mirror of the SAVE
keep-alive pin. It now blocks on the in-flight `LoadHandle` before the
reservation drops; the scheduler is a dedicated sync thread, so the wait is
free.

Admission double-count: a settled prefetch's blocks are already out of
`available_blocks`, but `admit_deferred_requests` still charged the full
request footprint, so a near-budget CPU-hit request could fail admission
forever and never release its prefetch. Admission now credits each request's
held prefetch blocks (`prefetched_blocks` → `PrefixProbe::held_blocks`) against
the free-pool budget, while keeping the full footprint for the per-request cap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…arking

Wire `start_engine_with_offload` into the Qwen3 server path behind three
flags: `--kv-offload` (enable the pegaflow host tier), `--kv-offload-host-gib`
(pool size; pegaflow allocates it up front, so RSS reflects it at startup), and
`--no-prefix-cache`. The last is the vLLM-style switch threaded through
`start_qwen3` to `Qwen3Executor::set_no_prefix_cache`: without offload it
disables prefix matching outright; with offload it is the pure-L2 mode (no
cross-request HBM reuse, every prefix restored from the host tier) used to
measure the L2 TTFT win.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@xiaguan

xiaguan commented Jun 9, 2026

Collaborator Author

L2 (host-tier) TTFT benchmark + review fixes

Pure-L2 TTFT measurement

Added --kv-offload / --kv-offload-host-gib / --no-prefix-cache to the Qwen3 server (f5db7dd) and ran vllm bench serve twice against the same seeded prompt set, so pass 2 finds its prefix only on the host tier.

Setup: Qwen3-4B on RTX 5070 Ti (16 GB HBM), 30 GiB host pool, 64 × 2048-token prompts, --max-concurrency 1 (bs=1), --no-prefix-cache = pure-L2 mode (cross-request HBM/L1 reuse off, all reuse from L2).

| Metric | Cold (first sight) | Warm (pure-L2 restore) | Δ |
| --- | --- | --- | --- |
| Mean TTFT | 195.2 ms | 40.6 ms | −79% (4.8×) |
| Median TTFT | 194.8 ms | 38.1 ms | −80% |
| P99 TTFT | 203.9 ms | 54.5 ms | −73% |
| Duration (64 req) | 14.9 s | 4.8 s | 3.1× |

Server logs confirm it is genuinely L2, not a residual L1 hit:

Prefetch local-hit: total_keys=127 hit=127 missing=0      (gpu_hit=0)
Load task completed: layers=36 blocks=127 bytes=299630592 (286 MiB)
                     elapsed_ms≈20  bandwidth≈15 GB/s  backend=direct

Every request restores the full 127-block prefix from the host tier (gpu_hit=0 — the evict-before-probe path is doing its job). Warm TTFT ≈ 286 MiB H2D (~20 ms) + the 16-token tail prefill. The ~15 GB/s is consumer-PCIe-bound; a server GPU would push the restore cost down further.
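The logged byte count and restore time can be cross-checked with a little arithmetic. In the sketch below, the layer and block counts come from the log above; the 16-token block size, 8 KV heads, head dimension of 128, and 2-byte bf16 elements are assumptions, chosen because together they reproduce the logged 299630592 bytes. The function names are illustrative only.

```rust
/// Bytes of KV for one block in one layer: K and V, each
/// block_tokens x kv_heads x head_dim elements of elem_bytes bytes.
pub fn segment_bytes(block_tokens: u64, kv_heads: u64, head_dim: u64, elem_bytes: u64) -> u64 {
    2 * block_tokens * kv_heads * head_dim * elem_bytes
}

/// Total bytes moved when restoring `blocks` blocks across `layers` layers.
pub fn restore_bytes(layers: u64, blocks: u64, segment: u64) -> u64 {
    layers * blocks * segment
}

/// Transfer time in milliseconds at a given bandwidth in GB/s (1 GB = 1e9 bytes).
pub fn transfer_ms(bytes: u64, gb_per_s: f64) -> f64 {
    bytes as f64 / (gb_per_s * 1e9) * 1e3
}
```

With those inputs, `restore_bytes(36, 127, segment_bytes(16, 8, 128, 2))` gives 299630592 bytes, and at 15 GB/s the transfer takes just under 20 ms, in line with the log. A 2048-token prompt at 16 tokens per block also leaves the 16-token tail mentioned above once 127 full blocks are restored.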

Memory: pegaflow cudaHostAllocs the whole pool at startup — server RSS measured 30.8 GB with --kv-offload-host-gib 30, confirming the pool is eager, not lazy.

Note on --no-prefix-cache semantics under offload: L1-resident and L2-stored blocks share one content hash, so the cache cannot be told to "prefer L2" for a block still in HBM. Pure-L2 therefore means not retaining completed blocks in HBM (evict before the prefetch probe → gpu_hit=0), which is exactly vLLM's --no-enable-prefix-caching behaviour (free, don't keep). Prefix matching stays on, since the L2 restore rides on it.

Review fixes (b5ac797)

Both bot findings were real and are fixed:

  • @gemini-code-assist (HIGH — in-flight DMA corruption): confirmed. drop_request released the LoadReservation while the H2D load could still be in flight, so a cancel-mid-load freed pages that pegaflow's worker kept writing → silent KV corruption. drop_request now blocks on the in-flight LoadHandle before the reservation drops — the load-side mirror of the SAVE keep-alive pin. The scheduler is a dedicated sync thread, so the wait is free.
  • @chatgpt-codex-connector (P1, admission accounting): confirmed. A settled prefetch's blocks are already out of available_blocks, but admit_deferred_requests still charged the full request footprint, so a near-budget CPU-hit request could fail admission forever and never release its prefetch. Admission now credits each request's held prefetch blocks (prefetched_blocks → PrefixProbe::held_blocks) against the free-pool budget, while keeping the full footprint for the per-request cap.
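The admission fix can be sketched as a single predicate. This is a simplified model, not the scheduler's code: `can_admit` and its parameters are hypothetical, and it assumes block counts are the only resource being checked.

```rust
/// Hypothetical admission check with prefetch credit.
///
/// * `max_needed`: full block footprint of the request
/// * `prefetched`: blocks the request already holds from a settled prefetch
/// * `available`: blocks currently in the free pool
/// * `per_request_cap`: upper bound on any single request's footprint
pub fn can_admit(
    max_needed: usize,
    prefetched: usize,
    available: usize,
    per_request_cap: usize,
) -> bool {
    // The per-request cap still sees the full footprint.
    if max_needed > per_request_cap {
        return false;
    }
    // Held prefetch blocks are already out of the free pool, so only the
    // remainder has to fit into what is still available.
    max_needed.saturating_sub(prefetched) <= available
}
```

For example, a request needing 100 blocks that already holds 60 restored blocks fits into a free pool of 50, whereas the uncredited check `100 <= 50` would reject it on every tick.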

Verified: kv_offload_cpu_hit integration gate passes unchanged (mean Δ 0.028 / 0.030 nat), 17 scheduler unit tests pass, and an adversarial re-review found no residual holes in either fix.

@chatgpt-codex-connector


To use Codex here, create an environment for this repo.

@xiaguan

xiaguan commented Jun 9, 2026

Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f5db7ddfa1

Comment on lines +175 to +179
    if executor.begin_kv_prefetch(
        req.request_id,
        &req.prompt_tokens,
        req.lora_adapter.as_deref(),
    ) {

P2: Skip KV prefetch for echo requests

When --kv-offload is enabled, this offers every deferred request to CPU-tier prefetch, including echo requests. Echo prefills later deliberately skip match_and_add_prefix (!req.echo) because prompt logprobs need the whole prompt forwarded, so a CPU-hit echo request can park and reserve restored blocks that admission credits via prefetched_blocks, then execute_prefill tries to schedule a full cold prefill while those pointless prefetch blocks are still held. Under tight KV budgets this makes repeated echo requests fail scheduling even though the same cold request would fit; avoid prefetching req.echo requests or release the prefetch before admission/scheduling.


Echo requests forward the whole prompt to recover prompt logprobs, so
their prefill skips match_and_add_prefix and can never reuse a restored
prefix. Offering them to async CPU-tier prefetch parked restored KV that
admission credited via prefetched_blocks but prefill never spent,
starving the request under tight KV budgets.

Guard offer_prefetch with !req.echo so echo requests never probe. With
no prefetch entry, prefetched_blocks(echo) is always 0, so admission
charges the full footprint — consistent with the cold prefill echo
actually runs. The echo => no-prefetch-state => zero-credit invariant is
now structural rather than a runtime patch.

Flagged by Codex review on #316.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@xiaguan

xiaguan commented Jun 9, 2026

Collaborator Author

Fixed the P2 echo finding in f958c57.

@chatgpt-codex-connector (P2 — echo + prefetch): confirmed and fixed. Echo requests forward the whole prompt for prompt logprobs, so their prefill skips match_and_add_prefix and can never reuse a restored prefix. Offering them to prefetch parked restored KV that admission credited via prefetched_blocks but prefill never spent, starving the request under tight budgets. offer_prefetch now guards with !req.echo, so echo never probes — with no prefetch entry, prefetched_blocks(echo) is always 0 and admission charges the full footprint, matching the cold prefill echo actually runs. The echo ⇒ no prefetch state ⇒ zero credit invariant is now structural rather than a runtime patch. Added a unit test (echo_requests_are_never_offered_to_prefetch).
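The invariant described here (echo implies no prefetch state implies zero credit) can be shown in a few lines. `Request` and `SchedulerSketch` below are hypothetical, and `cpu_hit_blocks` stands in for whatever the real probe reports; only the `!req.echo` guard and the zero default mirror the fix described above.

```rust
use std::collections::HashMap;

pub struct Request {
    pub id: u64,
    pub echo: bool,
}

/// Hypothetical scheduler state: request id -> blocks held by its prefetch.
#[derive(Default)]
pub struct SchedulerSketch {
    prefetch: HashMap<u64, usize>,
}

impl SchedulerSketch {
    /// Offer a request to prefetch. Echo requests are never offered, so they
    /// can never own a prefetch entry.
    pub fn offer_prefetch(&mut self, req: &Request, cpu_hit_blocks: usize) -> bool {
        if req.echo || cpu_hit_blocks == 0 {
            return false;
        }
        self.prefetch.insert(req.id, cpu_hit_blocks);
        true
    }

    /// Admission credit: zero whenever there is no prefetch entry.
    pub fn prefetched_blocks(&self, id: u64) -> usize {
        self.prefetch.get(&id).copied().unwrap_or(0)
    }
}
```

Because the guard sits at the only insertion point, no later code path needs to special-case echo when computing the credit.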

The two other latest inline comments (gemini @ executor.rs:1045 in-flight DMA; codex @ executor.rs:1110 admission double-count) are the original findings re-anchored to the new commit — both were already fixed in b5ac797.

Verified: 35 lib tests (incl. the new echo test) + kv_offload_cpu_hit integration gate pass; clippy clean on scheduler.rs; adversarial re-review found no residual holes.

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f958c57b6e

Comment thread pegainfer-qwen3-4b/src/lib.rs Outdated
&device_ordinals,
seed,
lora_options.validate()?,
Qwen3OffloadOptions::disabled(),

P2: Honor KV-offload flags in LoRA startup

When the server is started with --enable-lora --kv-offload (or --enable-lora --no-prefix-cache), it takes the LoRA-control startup path, which calls this wrapper; the wrapper now hard-codes Qwen3OffloadOptions::disabled() and has no way to propagate no_prefix_cache. That makes the newly added CLI flags silently no-op for LoRA-enabled Qwen3 even though the scheduler/executor path accepts offload options, so users benchmarking or serving LoRA with requested L2 offload still run with the normal GPU-only prefix cache.


start_engine_with_lora_control hard-coded Qwen3OffloadOptions::disabled()
and had no no_prefix_cache parameter, so --enable-lora silently dropped
--kv-offload / --kv-offload-host-gib / --no-prefix-cache — the flags
no-op'd instead of erroring. Thread offload_options and no_prefix_cache
through the LoRA-control path the same way the non-LoRA path does, and
build the offload options once in main.rs for both branches.

Offload under LoRA is correct by construction: the prefix block hash is
salted with the adapter name (compute_salt_hash), so restored KV — HBM
or host tier — never crosses adapters.

Flagged by Codex review on #316.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@xiaguan

xiaguan commented Jun 9, 2026

Collaborator Author

Fixed the LoRA-startup P2 in ff9fab3.

@chatgpt-codex-connector (P2 — LoRA startup ignored offload flags): confirmed and fixed. start_engine_with_lora_control hard-coded Qwen3OffloadOptions::disabled() and had no no_prefix_cache parameter, so --enable-lora silently dropped --kv-offload / --kv-offload-host-gib / --no-prefix-cache. Now offload_options + no_prefix_cache thread through the LoRA-control path exactly as the non-LoRA path does (offer/set_no_prefix_cache), and the offload options are built once in main.rs for both branches.

Offload under LoRA is correct by construction — the prefix block hash is salted with the adapter name (compute_salt_hash, pool.rs:90-102, guarded by the salt (lora) must scope the prefix cache test), so restored KV (HBM or host tier) never crosses adapters.
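Why a salted hash keeps adapters apart can be seen in a small chained-hash sketch. This is not the real `compute_salt_hash`: the functions below are hypothetical, they use the standard library's `DefaultHasher` purely for illustration, and they assume each block hash chains on the previous one with the adapter salt as the seed.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Salt derived from the adapter name; `None` stands for the base model.
pub fn salt_hash(adapter: Option<&str>) -> u64 {
    let mut h = DefaultHasher::new();
    adapter.hash(&mut h);
    h.finish()
}

/// One hash per full block of tokens. Each hash covers the parent hash plus
/// the block's tokens, and the chain is seeded with the adapter salt.
/// `block_tokens` must be non-zero; a trailing partial block gets no hash.
pub fn block_hashes(salt: u64, tokens: &[u32], block_tokens: usize) -> Vec<u64> {
    let mut parent = salt;
    tokens
        .chunks_exact(block_tokens)
        .map(|chunk| {
            let mut h = DefaultHasher::new();
            parent.hash(&mut h);
            chunk.hash(&mut h);
            parent = h.finish();
            parent
        })
        .collect()
}
```

Since the salt feeds the first block and every later block chains on its parent, the same prompt under two adapters yields disjoint keys at every position, so a lookup in either the HBM or the host tier cannot return another adapter's KV.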

Verified: server + qwen3-4b build clean, lora_smoke compiles against the new signature, 35 lib tests pass.

The three older inline comments now re-anchored to ff9fab3 (gemini executor.rs:1045 DMA, codex executor.rs:1110 admission, codex scheduler.rs:186 echo) are all already fixed — in b5ac797, b5ac797, and f958c57 respectively.

@codex review

@xiaguan xiaguan merged commit cb49983 into main Jun 9, 2026
1 check passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: ff9fab39f2

Comment on lines +1059 to +1065
    if self.l1_retention_disabled {
        // Pure-L2 mode: drop any cross-request HBM retention so the probe
        // sees gpu_hit == 0 and queries the whole cacheable prefix from the
        // host tier. Only inactive (completed, unheld) blocks are drained:
        // the current request holds nothing yet, and in-flight prefetches
        // keep their reserved blocks, so this never touches live KV.
        self.kv_mgr.pool().evict_inactive();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Enforce pure-L2 mode before matching save-pinned blocks

When --kv-offload --no-prefix-cache is used and the next identical request arrives while the previous request's fire-and-forget save is still in flight, save_sealed_blocks is holding KvBlockGuard pins for those HBM pages, so they are still active rather than inactive. This evict_inactive() call cannot remove them, and the following probe_prefix() can see a GPU hit (possibly leaving query_hashes empty), so the request reuses HBM instead of restoring from the host tier. That breaks the documented pure-L2/no-prefix-cache behavior for immediate TTFT-style repeats; the path needs to wait/flush saves or otherwise make save-pinned blocks unmatchable before probing.


xiaguan added a commit that referenced this pull request Jun 9, 2026
Roadmap-doc drift fixes for already-landed work (#307 batched greedy sampling, #316 in-process pegaflow KV offload). Doc-only: offload integration doc CLI status + Kimi/Qwen first-launch order, execution.md data-plane entry, qwen3 roadmap rows, index.md.