Qwen3 prefix-cache usage in OpenAI responses to enhance observability by jackyYang6 · Pull Request #299 · openinfer-project/openinfer

jackyYang6 · 2026-06-07T07:22:23Z

Description

Fixes issue #246 by propagating Qwen3 prefix-cache hit information from the model scheduler to OpenAI-compatible API usage fields, so users and benchmark tooling can observe whether a request reused cached prompt tokens.

What changed

Extended TokenEvent::Scheduled with cached_tokens.
Preserved Qwen3 executor-computed PrefillRequestResult.cached_tokens through scheduler planning/effects.
Emitted Qwen3 Scheduled events before prompt echo, first token, or terminal finish events so frontend usage accounting has the cache metadata in time.
Updated non-Qwen3 scheduled senders to report cached_tokens: 0, preserving existing behavior for engines without prefix-cache accounting.
Updated the vLLM/OpenAI frontend to map engine-provided cache metadata into usage:
- usage.prompt_tokens_details.cached_tokens
- vLLM PrefillStats.num_cached_tokens
- vLLM PrefillStats.num_local_cached_tokens
- num_computed_tokens = prompt_tokens.saturating_sub(cached_tokens)
Added usage patching for OpenAI response bodies where the Rust vLLM server does not currently copy cached-token PrefillStats into the final OpenAI usage object.
Added Qwen3 engine-level prefix-cache debug stats for operators.

Why

Prefix caching is enabled by default for Qwen3-4B and can significantly reduce warm-request TTFT, but before this change the cache hit count was dropped at the scheduler/frontend boundary. API users, benchmarks, and operators could not tell whether a repeated prompt actually hit the prefix cache because usage always reported zero cached tokens and there was no engine-level cache hit/miss visibility.

This change makes per-request cache reuse visible in OpenAI-compatible usage while keeping cache accounting owned by the Qwen3 executor/scheduler rather than guessing in the frontend.

Usage patch design

The Rust vLLM frontend receives PrefillStats from TokenEvent::Scheduled, but OpenAI response usage did not reliably preserve the cached-token fields in the final response object. To avoid recomputing cache hits in the frontend, this PR adds a small response wrapper in pegainfer-vllm-frontend/src/patch_usage.rs.

The wrapper:

stores the engine-reported cached-token count by external request id when TokenEvent::Scheduled is processed;
wraps both /v1/completions and /v1/chat/completions;
patches only the returned usage.prompt_tokens_details.cached_tokens field;
handles both non-streaming responses and streaming final usage chunks when stream_options.include_usage=true;
removes the cached-token entry after patching to avoid stale request-id state;
defaults to 0 if an engine did not report cache metadata.

This keeps the frontend as a pass-through/patch layer: it does not infer prefix-cache behavior from prompt length, token ids, or repeated requests. The Qwen3 executor remains the source of truth for cached_tokens.

Engine-level cache info design

Issue #246 also asks for engine-level hit/miss counters logged or exposed for operators. This PR adds a minimal Qwen3-local debug log rather than a new metrics subsystem.

Qwen3 scheduler effects now maintain cumulative prefix-cache stats for the lifetime of the scheduler loop:

total_requests
hit_requests
miss_requests
hit_rate
total_prompt_tokens
total_cached_tokens
token_hit_rate

The stats are logged at debug! after each scheduled observation:

pegainfer_qwen3_4b::scheduler::effects: effects.rs:98 Qwen3 prefix cache stats: total_requests=8, hit_requests=6, miss_requests=2, hit_rate=0.7500, total_prompt_tokens=52184, total_cached_tokens=26080, token_hit_rate=0.4998

The log is debug-level by design: operators can enable it when investigating prefix-cache effectiveness, while default info logs remain quiet under normal serving load. No prompt text, token ids, generated text, or request payloads are logged.

Example startup filter:

RUST_LOG=debug  cargo run --release -- --model-path Qwen/Qwen3-4B --port 8000

Type of Change

Validation

GPU-machine validation performed:

cargo test --release -p pegainfer-qwen3-4b --lib scheduler
- 17 passed; 0 failed
Real /v1/completions cold/warm validation:
- cold: cached_tokens=0
- warm repeated prompt: cached_tokens=6752
Real /v1/chat/completions non-streaming cold/warm validation after chat route patch:
- cold: cached_tokens=0
- warm repeated prompt: cached_tokens=6512
Real streaming validation with stream_options: {"include_usage": true}:
- /v1/completions usage chunk includes prompt_tokens_details.cached_tokens
- /v1/chat/completions usage chunk includes prompt_tokens_details.cached_tokens

Local/sub-agent validation performed:

cargo test --release -p pegainfer-vllm-frontend --lib
cargo test --release -p pegainfer-sim --test frontend_e2e cached_tokens
cargo clippy --release -p pegainfer-vllm-frontend -p pegainfer-sim --all-targets -- -D warnings
cargo fmt --check
git diff --check

Suggested final checks before merge, if not already run on the final clean branch:

cargo test --release -p pegainfer-qwen3-4b --lib scheduler
cargo test --release -p pegainfer-vllm-frontend --lib
cargo test --release -p pegainfer-sim --test frontend_e2e cached_tokens
cargo clippy --release -p pegainfer-vllm-frontend -p pegainfer-sim --all-targets -- -D warnings

Checklist

API usage reports real Qwen3 cached-token counts instead of hard-coded zero.
TokenEvent::Scheduled carries cached_tokens.
Qwen3 scheduler preserves executor-computed cached-token counts through effects.
Qwen3 emits Scheduled before prompt echo, token, or finish events.
Non-Qwen3 engines preserve current behavior with cached_tokens: 0.
/v1/completions non-streaming and streaming usage paths are covered.
/v1/chat/completions non-streaming and streaming usage paths are covered.
Engine-level Qwen3 prefix-cache hit/miss counters are available through debug logs.
No frontend-side cache inference or recomputation was added.

gemini-code-assist

Code Review

This pull request introduces tracking and reporting of cached tokens across multiple engines (including DeepSeek, Kimi, Qwen, and Simulated engines) and integrates this with the vLLM frontend. Specifically, a new patch_usage module is added to the frontend to intercept completion endpoints and patch the OpenAI usage response with the engine-provided cached_tokens count. The review feedback highlights two critical memory leaks in this new module: one occurring on non-streaming error responses where failed requests are not cleaned up from the global tracking map, and another during streaming cancellations where early client disconnects bypass the cleanup logic after the generator loop.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-07T07:24:16Z

+    let status = response.status();
+    if !status.is_success() {
+        return response;
+    }
+    let (parts, body) = response.into_parts();
+    let bytes = match to_bytes(body, COMPLETION_USAGE_PATCH_BODY_LIMIT).await {
+        Ok(bytes) => bytes,
+        Err(error) => {
+            return (
+                StatusCode::INTERNAL_SERVER_ERROR,
+                Json(ErrorBody {
+                    error: format!("failed to read completion response body: {error}"),
+                }),
+            )
+                .into_response();
+        }
+    };


Memory Leak on Non-Streaming Error Responses

If the response status is not successful (e.g., 4xx or 5xx), the function currently returns early without removing the request_id from the global cached_tokens_by_request_id map. This causes a memory leak for every failed request.

We can fix this by reading the response body first, attempting to parse the id to clean up the map, and then returning early if the status is not successful.

let status = response.status(); let (parts, body) = response.into_parts(); let bytes = match to_bytes(body, COMPLETION_USAGE_PATCH_BODY_LIMIT).await { Ok(bytes) => bytes, Err(error) => { return ( StatusCode::INTERNAL_SERVER_ERROR, Json(ErrorBody { error: format!("failed to read completion response body: {error}"), }), ) .into_response(); } }; if let Ok(value) = serde_json::from_slice::<serde_json::Value>(&bytes) { if let Some(request_id) = value.get("id").and_then(serde_json::Value::as_str) { state.cached_tokens_by_request_id.write().await.remove(request_id); } } if !status.is_success() { return Response::from_parts(parts, Body::from(bytes)); }

gemini-code-assist · 2026-06-07T07:24:16Z

+    let status = response.status();
+    if !status.is_success() {
+        return response;
+    }
+    let cached_tokens_by_request_id = state.cached_tokens_by_request_id;
+    let (mut parts, body) = response.into_parts();
+    parts.headers.remove(CONTENT_LENGTH);
+    let stream = body.into_data_stream();
+    let body = Body::from_stream(async_stream::stream! {
+        let mut stream = stream;
+        let mut request_id_for_cleanup = None;
+        while let Some(next) = stream.next().await {
+            let Ok(bytes) = next else {
+                continue;
+            };
+            yield Ok::<Bytes, std::convert::Infallible>(patch_sse_usage_chunk(
+                bytes,
+                &cached_tokens_by_request_id,
+                include_usage,
+                &mut request_id_for_cleanup,
+            ).await);
+        }
+        if let Some(request_id) = request_id_for_cleanup {
+            cached_tokens_by_request_id.write().await.remove(&request_id);
+        }
+    });


Memory Leak on Streaming Cancellation / Client Disconnect

In patch_streaming_completion_usage, if the client disconnects mid-stream, the stream is dropped. In Rust, dropping an async_stream::stream! generator drops its local variables but does not resume execution to run any code after the loop (lines 180-182). Thus, the cleanup .remove(&request_id) is bypassed, causing a memory leak.

We can fix this robustly by implementing a Drop guard. Since local variables are guaranteed to be dropped when the generator is dropped, the Drop implementation of the guard will always run even on early stream cancellation.

let status = response.status(); if !status.is_success() { return response; } let cached_tokens_by_request_id = state.cached_tokens_by_request_id; let (mut parts, body) = response.into_parts(); parts.headers.remove(CONTENT_LENGTH); let stream = body.into_data_stream(); struct StreamCleanupGuard { request_id: Option<String>, map: CachedTokenUsageMap, } impl Drop for StreamCleanupGuard { fn drop(&mut self) { if let Some(ref id) = self.request_id { let map = self.map.clone(); let id = id.clone(); tokio::spawn(async move { map.write().await.remove(&id); }); } } } let body = Body::from_stream(async_stream::stream! { let mut stream = stream; let mut guard = StreamCleanupGuard { request_id: None, map: cached_tokens_by_request_id.clone(), }; while let Some(next) = stream.next().await { let Ok(bytes) = next else { continue; }; yield Ok::<Bytes, std::convert::Infallible>(patch_sse_usage_chunk( bytes, &guard.map, include_usage, &mut guard.request_id, ).await); } });

xiaguan · 2026-06-07T14:54:13Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d8487980c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T14:57:31Z

+        if request_id_for_cleanup.is_none() {
+            if let Some(request_id) = value.get("id").and_then(serde_json::Value::as_str) {
+                *request_id_for_cleanup = Some(request_id.to_string());


Update cleanup id after seeing stream id

For streaming requests that include X-Request-Id or a body request_id, request_id_for_cleanup starts as the fallback chatcmpl-<client-id>, but vLLM still emits and the bridge stores cached-token usage under the actual generated response id. Because this block only records the stream's id when the cleanup id is None, a client disconnect before the usage chunk (or a stream without included usage) drops the cleanup guard while still removing only the fallback key, leaving the actual cached-token entry in CACHED_TOKENS_BY_REQUEST_ID indefinitely for those requests.

Useful? React with 👍 / 👎.

xiaguan · 2026-06-08T03:20:40Z

Heads up — the root cause is upstream: the vllm rust frontend never fills usage.prompt_tokens_details.cached_tokens (Usage::from_counts hardcodes it to None). Filed as vllm-project/vllm#44824 — the cleanest path is to land that upstream and bump our pinned rev, then drop the response-patch layer here.

jackyYang6 · 2026-06-08T06:49:31Z

Totally agreed. I initially resorted to this response-patch layer precisely because of this upstream limitation, and as a result, the patching code has grown increasingly bloated and ugly.

Once the upstream fix lands and we bump our pinned rev, I'll gladly drop the patch layer, rewrite it with clean code, and resubmit.

gemini-code-assist Bot reviewed Jun 7, 2026

View reviewed changes

fix(qwen3): propagate cached token usage

9d84879

jackyYang6 force-pushed the fix/issue-246-cached-token-usage-clean branch from c968f1f to 9d84879 Compare June 7, 2026 10:30

chatgpt-codex-connector Bot reviewed Jun 7, 2026

View reviewed changes

xiaguan mentioned this pull request Jun 9, 2026

chore: rename pegainfer → openinfer across the workspace #318

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qwen3 prefix-cache usage in OpenAI responses to enhance observability#299

Qwen3 prefix-cache usage in OpenAI responses to enhance observability#299
jackyYang6 wants to merge 1 commit into
openinfer-project:mainfrom
jackyYang6:fix/issue-246-cached-token-usage-clean

jackyYang6 commented Jun 7, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 7, 2026

Uh oh!

gemini-code-assist Bot Jun 7, 2026

Uh oh!

xiaguan commented Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Uh oh!

xiaguan commented Jun 8, 2026

Uh oh!

jackyYang6 commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jackyYang6 commented Jun 7, 2026

Description

What changed

Why

Usage patch design

Engine-level cache info design

Type of Change

Validation

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 7, 2026

Choose a reason for hiding this comment

Memory Leak on Non-Streaming Error Responses

Uh oh!

gemini-code-assist Bot Jun 7, 2026

Choose a reason for hiding this comment

Memory Leak on Streaming Cancellation / Client Disconnect

Uh oh!

xiaguan commented Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

xiaguan commented Jun 8, 2026

Uh oh!

jackyYang6 commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants