feat(kv-offload): in-process pegaflow KV-cache offload on Qwen3-4B by xiaguan · Pull Request #316 · openinfer-project/openinfer

xiaguan · 2026-06-09T08:03:06Z

What

Integrate pegaflow (Apache-2.0) into pegainfer as an in-process Rust KV-cache offload tier, dense-attention-first, on Qwen3-4B. A prefix that has fallen out of HBM — or was never GPU-resident — can be restored from pegaflow's host (and, later, SSD/RDMA) tiers instead of recomputed.

End-to-end is live and validated on real Qwen3-4B weights: async SAVE + async LOAD wired into Qwen3Executor + the scheduler, with the restored KV reproducing the cold first-token logits to the bf16 floor.

How

per-model scheduler   ← policy: which blocks should be resident
  ↓ load/save intents (a set of blocks)
pegainfer-kv-offload  ← the connector "brain": block identity, GPU-slot
  (OffloadEngine)        orchestration, transfer scheduling (kvbm logical/physical idea)
  ↓ semantics-free raw transfer
pegaflow-core         ← muscle: D2H/H2D + DRAM/SSD/RDMA tiers

pegainfer-kv-offload (new crate) — OffloadEngine owns a PegaEngine + a small tokio runtime. Registration::from_buffer maps pegainfer's fused page-first single buffer into pegaflow's per-layer strided registration: within a page K and V are contiguous (layer_stride = one [K|V] segment), so segments = 1 and block_stride decouples the page interleave — not the K/V-split path. Save is best-effort fire-and-forget; load is a polled, non-blocking LoadHandle.
Shared cache primitives (pegainfer-kv-cache) — KvBuffer::device_ptr, the KvBlockGuard / assigned_block_guards strong pins, and the PrefixProbe / LoadReservation reservation path. BlockManager::evict_inactive gives a cold-cache flush that tolerates still-pinned blocks.
Qwen3-4B wiring — Qwen3OffloadOptions opt-in (default off). SAVE after prefill; begin_kv_prefetch probes the GPU prefix, queries pegaflow, reserves + async-loads the CPU continuation, and settles it at the right prefix offset, stacking CPU blocks onto any GPU hit. The scheduler grows a non-blocking loading admission state so prefetch never stalls the GPU thread.

Correctness hardenings (post toxic-review)

Query-lease leak — release_query_lease on the reserve/load failure paths so pinned host blocks aren't stranded for the 600s TTL.
Prefetch-state leak on rejection — release_rejected calls drop_request when a request is rejected at admission, so committed prefix blocks can't leak.
SAVE D2H race — source blocks are pinned via KvBlockGuard until the D2H lands, closing the "request ends → slot reused → D2H reads the wrong KV" silent-corruption window.

Tests

pegainfer-kv-offload/tests/cpu_roundtrip.rs — byte-level save → query → load on a real KvBuffer, plus a zero-block negative control.
pegainfer-qwen3-4b/tests/kv_offload_cpu_hit.rs — live gate on real weights: one executor, two sequential scenarios (pure CPU-tier restore after HBM eviction, and a combined GPU+CPU split prefix), asserting warm logits match cold (mean Δ ≈ 0.03 nat).

Run: PEGAINFER_TEST_MODEL_PATH=models/Qwen3-4B cargo test --release -p pegainfer-qwen3-4b --test kv_offload_cpu_hit

Dependency

pegaflow-core is a git dependency pinned to upstream rev 07cac7e, which carries the two upstream changes this needs: block_stride_bytes (novitalabs/pegaflow#331) and the in-process load API (#333). default-features = false drops its cuda-12/rdma; the workspace cudarc gains nvrtc for pegaflow's embedded JIT copy kernel. The dead kvbm-config/kvbm-engine/kvbm-physical crates (older-tokio pin, unused) are dropped from the workspace.

Scope

Default off; not yet wired into the server CLI (only start_engine_with_offload + tests). Qwen3.5 linear/SSM state is excluded; DeepSeek sparse is deferred. See docs/subsystems/runtime/pegaflow-offload-integration.md.

🤖 Generated with Claude Code

Introduce `pegainfer-kv-offload`, the in-process bridge that moves KV blocks between pegainfer's GPU paged cache and pegaflow's host/SSD tiers. `OffloadEngine` owns a `PegaEngine` plus a small tokio runtime, translates pegainfer's page-first `KvLayout` into pegaflow's per-layer strided registration (one [K|V] segment per layer, `block_stride` decoupling the fused page interleave), and exposes best-effort async save plus a polled, non-blocking load handle. pegaflow-core is pinned to upstream rev 07cac7e (carries block_stride #331 + the in-process load API #333); the workspace cudarc gains `nvrtc` for its embedded JIT copy kernel. Add the model-agnostic cache primitives the connector needs: `KvBuffer::device_ptr`, the `KvBlockGuard` / `assigned_block_guards` strong pins that keep source blocks alive across an in-flight save D2H, and the `PrefixProbe` / `LoadReservation` reservation path in `BlockPool`. `BlockManager::evict_inactive` drains the inactive pool without the reset assertion, for a cold-cache flush that tolerates still-pinned blocks. Drop the dead `kvbm-config`/`kvbm-engine`/`kvbm-physical` crates from the workspace — they pinned an older tokio and are unused by the offload path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Drive the offload connector from the Qwen3-4B path. `Qwen3OffloadOptions` turns it on per executor; sealed prefix blocks are SAVEd best-effort after prefill (pinned via `KvBlockGuard` until the D2H lands), and a request can RESTORE a CPU-only prefix through `begin_kv_prefetch`: probe GPU prefix → query pegaflow → reserve loaded blocks → async load → settle at the right prefix offset, stacking the CPU continuation onto any GPU hit. The scheduler grows a `loading` admission state (`offer_prefetch` / `reclaim_ready_prefetch` / `block_on_loading`) so prefetch never blocks the GPU thread, and `release_rejected` releases any settled prefetch state when a request is rejected at admission so committed prefix blocks can't leak. On the load failure paths the query lease is released so its pinned host blocks are not stranded for the TTL. `tests/kv_offload_cpu_hit.rs` is the live gate on real Qwen3-4B weights: one executor, two sequential scenarios (pure CPU-tier restore after HBM eviction, and a combined GPU+CPU split prefix), asserting the restored KV reproduces the cold first-token logits to the bf16 floor. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add the integration spec: pegaflow-core as an in-process physical offload tier behind a self-built kvbm-style connector brain, the page-first single-buffer registration, the live Qwen3-4B save/prefetch wiring, the three post-review correctness hardenings, and the git-rev dependency pin. Refresh the index.md row to match (Qwen3-4B full-attn shipped first and is live-validated). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request integrates pegaflow-core as an in-process Rust library to support KV cache offloading (HBM to DRAM/SSD/RDMA) for the single-GPU Qwen3-4B path. It introduces the pegainfer-kv-offload crate with an OffloadEngine to manage block transfers, updates the logical cache manager to support prefetching and block pinning, and wires these hooks into the Qwen3 executor and scheduler. A critical feedback item highlights a potential silent data corruption vulnerability in drop_request: if a request is cancelled while an async prefetch is in flight, its destination blocks are returned to the free pool and could be reassigned before the active DMA transfer completes. Synchronously waiting for the in-flight DMA to finish before releasing the reservation is recommended to prevent this issue.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-09T08:05:53Z

    fn drop_request(&mut self, request_id: RequestId) -> Result<()> {
        // Remove and drop — RAII on SchedulableSequence's block guards
-        // returns all allocated blocks regardless of lifecycle state.
+        // returns all allocated blocks regardless of lifecycle state. The same
+        // RAII frees any parked prefetch's reserved/held blocks.
        self.request_kvs.remove(&request_id);
+        self.prefetch.remove(&request_id);
+        self.saved_cursor.remove(&request_id);
        Ok(())
    }


⚠️ 潜在的在途 DMA 内存污染与静默数据损坏风险 (In-flight DMA Memory Corruption Risk)

在当前实现中，如果一个请求在 begin_kv_prefetch 启动了异步加载之后、但在加载完成前被取消或释放（例如客户端断开连接），drop_request 会直接将该请求从 self.prefetch 中移除：

self.prefetch.remove(&request_id);

这会直接 drop 掉 PrefetchState，从而释放其持有的 LoadReservation（即 MutableBlock 目标 GPU 块）。这些块会被立即归还给 BlockManager 的空闲池，并可能在接下来的 Tick 中被分配给其他全新请求。

然而，此时 pegaflow 已经在工作线程中提交了针对这些物理块 ID 的异步 CUDA DMA 拷贝（D2H/H2D）。由于拷贝无法被取消，在途的 DMA 最终落地时，会静默覆写已被重新分配给新请求的 GPU 块，导致极其难以排查的静默数据损坏 (Silent Data Corruption)。

建议解决方案

在 drop_request 中，如果发现该请求存在在途的预取句柄（handle），应当同步等待（wait()）该 DMA 拷贝彻底完成后，再允许释放 reservation 归还物理块。由于调度器运行在专属的同步 OS 线程上，此处的短时间等待是完全安全且必要的。

fn drop_request(&mut self, request_id: RequestId) -> Result<()> { // Remove and drop — RAII on SchedulableSequence's block guards // returns all allocated blocks regardless of lifecycle state. The same // RAII frees any parked prefetch's reserved/held blocks. self.request_kvs.remove(&request_id); if let Some(mut st) = self.prefetch.remove(&request_id) { if let Some(handle) = st.handle.take() { // 必须同步等待在途 DMA 完成，防止物理块被提前归还空闲池并被新请求复用， // 从而避免在途 DMA 落地时造成静默数据损坏。 let _ = handle.wait(); } } self.saved_cursor.remove(&request_id); Ok(()) }

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 79057768f6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-09T08:06:59Z

+        self.prefetch.insert(
+            request_id,
+            PrefetchState {
+                probe,
+                reservation: Some(reservation),
+                handle: Some(handle),
+            },


Account for prefetched pages during admission

When KV offload is enabled, offer_prefetch runs before admit_deferred_requests, and storing this PrefetchState keeps the LoadReservation's GPU pages allocated while the request is parked. After the load is promoted back to deferred, admission still checks max_needed <= available_blocks as if the request did not already own those restored prefix pages, so a long CPU-hit request near the KV budget can pin N pages, reduce available_blocks by N, fail admission forever, and never release the prefetch. Either admit with the prefetched pages credited to this request or avoid reserving them until after admission.

Useful? React with 👍 / 👎.

Drop-during-load corruption: `drop_request` released the `LoadReservation` even when the CPU→GPU load was still in flight. A request cancelled mid-load freed its destination blocks for immediate reuse while pegaflow's worker kept writing them — silent KV corruption, the load-side mirror of the SAVE keep-alive pin. It now blocks on the in-flight `LoadHandle` before the reservation drops; the scheduler is a dedicated sync thread, so the wait is free. Admission double-count: a settled prefetch's blocks are already out of `available_blocks`, but `admit_deferred_requests` still charged the full request footprint, so a near-budget CPU-hit request could fail admission forever and never release its prefetch. Admission now credits each request's held prefetch blocks (`prefetched_blocks` → `PrefixProbe::held_blocks`) against the free-pool budget, while keeping the full footprint for the per-request cap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…arking Wire `start_engine_with_offload` into the Qwen3 server path behind three flags: `--kv-offload` (enable the pegaflow host tier), `--kv-offload-host-gib` (pool size; pegaflow allocates it up front, so RSS reflects it at startup), and `--no-prefix-cache`. The last is the vLLM-style switch threaded through `start_qwen3` to `Qwen3Executor::set_no_prefix_cache`: without offload it disables prefix matching outright; with offload it is the pure-L2 mode (no cross-request HBM reuse, every prefix restored from the host tier) used to measure the L2 TTFT win. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

xiaguan · 2026-06-09T09:12:32Z

L2 (host-tier) TTFT benchmark + review fixes

Pure-L2 TTFT measurement

Added --kv-offload / --kv-offload-host-gib / --no-prefix-cache to the Qwen3 server (f5db7dd) and ran vllm bench serve twice against the same seeded prompt set, so pass 2 finds its prefix only on the host tier.

Setup: Qwen3-4B on RTX 5070 Ti (16 GB HBM), 30 GiB host pool, 64 × 2048-token prompts, --max-concurrency 1 (bs=1), --no-prefix-cache = pure-L2 mode (cross-request HBM/L1 reuse off, all reuse from L2).

metric	Cold (first sight)	Warm (pure-L2 restore)	Δ
Mean TTFT	195.2 ms	40.6 ms	−79% (4.8×)
Median TTFT	194.8 ms	38.1 ms	−80%
P99 TTFT	203.9 ms	54.5 ms	−73%
duration (64 req)	14.9 s	4.8 s	3.1×

Server logs confirm it is genuinely L2, not a residual L1 hit:

Prefetch local-hit: total_keys=127 hit=127 missing=0      (gpu_hit=0)
Load task completed: layers=36 blocks=127 bytes=299630592 (286 MiB)
                     elapsed_ms≈20  bandwidth≈15 GB/s  backend=direct

Every request restores the full 127-block prefix from the host tier (gpu_hit=0 — the evict-before-probe path is doing its job). Warm TTFT ≈ 286 MiB H2D (~20 ms) + the 16-token tail prefill. The ~15 GB/s is consumer-PCIe-bound; a server GPU would push the restore cost down further.

Memory: pegaflow cudaHostAllocs the whole pool at startup — server RSS measured 30.8 GB with --kv-offload-host-gib 30, confirming the pool is eager, not lazy.

Note on --no-prefix-cache semantics under offload: L1-resident and L2-stored blocks share one content hash, so the cache cannot be told to "prefer L2" for a block still in HBM. Pure-L2 therefore means not retaining completed blocks in HBM (evict before the prefetch probe → gpu_hit=0), which is exactly vLLM's --no-enable-prefix-caching behaviour (free, don't keep). Prefix matching stays on, since the L2 restore rides on it.

Review fixes (`b5ac797`)

Both bot findings were real and are fixed:

@gemini-code-assist (HIGH — in-flight DMA corruption): confirmed. drop_request released the LoadReservation while the H2D load could still be in flight, so a cancel-mid-load freed pages that pegaflow's worker kept writing → silent KV corruption. drop_request now blocks on the in-flight LoadHandle before the reservation drops — the load-side mirror of the SAVE keep-alive pin. The scheduler is a dedicated sync thread, so the wait is free.
@chatgpt-codex-connector (P1 — admission accounting): confirmed. A settled prefetch's blocks are already out of available_blocks, but admit_deferred_requests still charged the full request footprint, so a near-budget CPU-hit request could fail admission forever and never release its prefetch. Admission now credits each request's held prefetch blocks (prefetched_blocks → PrefixProbe::held_blocks) against the free-pool budget, while keeping the full footprint for the per-request cap.

Verified: kv_offload_cpu_hit integration gate passes unchanged (mean Δ 0.028 / 0.030 nat), 17 scheduler unit tests pass, and an adversarial re-review found no residual holes in either fix.

chatgpt-codex-connector · 2026-06-09T09:12:41Z

To use Codex here, create an environment for this repo.

xiaguan · 2026-06-09T09:28:07Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f5db7ddfa1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-09T09:31:34Z

+            if executor.begin_kv_prefetch(
+                req.request_id,
+                &req.prompt_tokens,
+                req.lora_adapter.as_deref(),
+            ) {


Skip KV prefetch for echo requests

When --kv-offload is enabled, this offers every deferred request to CPU-tier prefetch, including echo requests. Echo prefills later deliberately skip match_and_add_prefix (!req.echo) because prompt logprobs need the whole prompt forwarded, so a CPU-hit echo request can park and reserve restored blocks that admission credits via prefetched_blocks, then execute_prefill tries to schedule a full cold prefill while those pointless prefetch blocks are still held. Under tight KV budgets this makes repeated echo requests fail scheduling even though the same cold request would fit; avoid prefetching req.echo requests or release the prefetch before admission/scheduling.

Useful? React with 👍 / 👎.

Echo requests forward the whole prompt to recover prompt logprobs, so their prefill skips match_and_add_prefix and can never reuse a restored prefix. Offering them to async CPU-tier prefetch parked restored KV that admission credited via prefetched_blocks but prefill never spent, starving the request under tight KV budgets. Guard offer_prefetch with !req.echo so echo requests never probe. With no prefetch entry, prefetched_blocks(echo) is always 0, so admission charges the full footprint — consistent with the cold prefill echo actually runs. The echo => no-prefetch-state => zero-credit invariant is now structural rather than a runtime patch. Flagged by Codex review on #316. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

xiaguan · 2026-06-09T09:54:56Z

Fixed the P2 echo finding in f958c57.

@chatgpt-codex-connector (P2 — echo + prefetch): confirmed and fixed. Echo requests forward the whole prompt for prompt logprobs, so their prefill skips match_and_add_prefix and can never reuse a restored prefix. Offering them to prefetch parked restored KV that admission credited via prefetched_blocks but prefill never spent, starving the request under tight budgets. offer_prefetch now guards with !req.echo, so echo never probes — with no prefetch entry, prefetched_blocks(echo) is always 0 and admission charges the full footprint, matching the cold prefill echo actually runs. The echo ⇒ no prefetch state ⇒ zero credit invariant is now structural rather than a runtime patch. Added a unit test (echo_requests_are_never_offered_to_prefetch).

The two other latest inline comments (gemini @ executor.rs:1045 in-flight DMA; codex @ executor.rs:1110 admission double-count) are the original findings re-anchored to the new commit — both were already fixed in b5ac797.

Verified: 35 lib tests (incl. the new echo test) + kv_offload_cpu_hit integration gate pass; clippy clean on scheduler.rs; adversarial re-review found no residual holes.

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f958c57b6e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-09T10:01:46Z

        &device_ordinals,
        seed,
        lora_options.validate()?,
+        Qwen3OffloadOptions::disabled(),


Honor KV-offload flags in LoRA startup

When the server is started with --enable-lora --kv-offload (or --enable-lora --no-prefix-cache), it takes the LoRA-control startup path, which calls this wrapper; the wrapper now hard-codes Qwen3OffloadOptions::disabled() and has no way to propagate no_prefix_cache. That makes the newly added CLI flags silently no-op for LoRA-enabled Qwen3 even though the scheduler/executor path accepts offload options, so users benchmarking or serving LoRA with requested L2 offload still run with the normal GPU-only prefix cache.

Useful? React with 👍 / 👎.

start_engine_with_lora_control hard-coded Qwen3OffloadOptions::disabled() and had no no_prefix_cache parameter, so --enable-lora silently dropped --kv-offload / --kv-offload-host-gib / --no-prefix-cache — the flags no-op'd instead of erroring. Thread offload_options and no_prefix_cache through the LoRA-control path the same way the non-LoRA path does, and build the offload options once in main.rs for both branches. Offload under LoRA is correct by construction: the prefix block hash is salted with the adapter name (compute_salt_hash), so restored KV — HBM or host tier — never crosses adapters. Flagged by Codex review on #316. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

xiaguan · 2026-06-09T13:22:35Z

Fixed the LoRA-startup P2 in ff9fab3.

@chatgpt-codex-connector (P2 — LoRA startup ignored offload flags): confirmed and fixed. start_engine_with_lora_control hard-coded Qwen3OffloadOptions::disabled() and had no no_prefix_cache parameter, so --enable-lora silently dropped --kv-offload / --kv-offload-host-gib / --no-prefix-cache. Now offload_options + no_prefix_cache thread through the LoRA-control path exactly as the non-LoRA path does (offer/set_no_prefix_cache), and the offload options are built once in main.rs for both branches.

Offload under LoRA is correct by construction — the prefix block hash is salted with the adapter name (compute_salt_hash, pool.rs:90-102, guarded by the salt (lora) must scope the prefix cache test), so restored KV (HBM or host tier) never crosses adapters.

Verified: server + qwen3-4b build clean, lora_smoke compiles against the new signature, 35 lib tests pass.

The three older inline comments now re-anchored to ff9fab3 (gemini executor.rs:1045 DMA, codex executor.rs:1110 admission, codex scheduler.rs:186 echo) are all already fixed — in b5ac797, b5ac797, and f958c57 respectively.

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff9fab39f2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-09T13:26:59Z

+        if self.l1_retention_disabled {
+            // Pure-L2 mode: drop any cross-request HBM retention so the probe
+            // sees gpu_hit == 0 and queries the whole cacheable prefix from the
+            // host tier. Only inactive (completed, unheld) blocks are drained —
+            // the current request holds nothing yet, and in-flight prefetches
+            // keep their reserved blocks, so this never touches live KV.
+            self.kv_mgr.pool().evict_inactive();


Enforce pure-L2 mode before matching save-pinned blocks

When --kv-offload --no-prefix-cache is used and the next identical request arrives while the previous request's fire-and-forget save is still in flight, save_sealed_blocks is holding KvBlockGuard pins for those HBM pages, so they are still active rather than inactive. This evict_inactive() call cannot remove them, and the following probe_prefix() can see a GPU hit (possibly leaving query_hashes empty), so the request reuses HBM instead of restoring from the host tier. That breaks the documented pure-L2/no-prefix-cache behavior for immediate TTFT-style repeats; the path needs to wait/flush saves or otherwise make save-pinned blocks unmatchable before probing.

Useful? React with 👍 / 👎.

Roadmap-doc drift fixes for already-landed work (#307 batched greedy sampling, #316 in-process pegaflow KV offload). Doc-only: offload integration doc CLI status + Kimi/Qwen first-launch order, execution.md data-plane entry, qwen3 roadmap rows, index.md.

xiaguan and others added 3 commits June 9, 2026 16:01

gemini-code-assist Bot reviewed Jun 9, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 9, 2026

View reviewed changes

xiaguan and others added 2 commits June 9, 2026 17:09

chatgpt-codex-connector Bot reviewed Jun 9, 2026

View reviewed changes

xiaguan merged commit cb49983 into main Jun 9, 2026
1 check passed

chatgpt-codex-connector Bot reviewed Jun 9, 2026

View reviewed changes

Conversation

xiaguan commented Jun 9, 2026

What

How

Correctness hardenings (post toxic-review)

Tests

Dependency

Scope

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

⚠️ 潜在的在途 DMA 内存污染与静默数据损坏风险 (In-flight DMA Memory Corruption Risk)

建议解决方案

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

xiaguan commented Jun 9, 2026

L2 (host-tier) TTFT benchmark + review fixes

Pure-L2 TTFT measurement

Review fixes (b5ac797)

Uh oh!

chatgpt-codex-connector Bot commented Jun 9, 2026

Uh oh!

xiaguan commented Jun 9, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

xiaguan commented Jun 9, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

xiaguan commented Jun 9, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Review fixes (`b5ac797`)