Skip to content

Qwen3 prefix-cache usage in OpenAI responses to enhance observability#299

Open
jackyYang6 wants to merge 1 commit into
openinfer-project:mainfrom
jackyYang6:fix/issue-246-cached-token-usage-clean
Open

Qwen3 prefix-cache usage in OpenAI responses to enhance observability#299
jackyYang6 wants to merge 1 commit into
openinfer-project:mainfrom
jackyYang6:fix/issue-246-cached-token-usage-clean

Conversation

@jackyYang6

Copy link
Copy Markdown
Contributor

Description

Fixes issue #246 by propagating Qwen3 prefix-cache hit information from the model scheduler to OpenAI-compatible API usage fields, so users and benchmark tooling can observe whether a request reused cached prompt tokens.

What changed

  • Extended TokenEvent::Scheduled with cached_tokens.
  • Preserved Qwen3 executor-computed PrefillRequestResult.cached_tokens through scheduler planning/effects.
  • Emitted Qwen3 Scheduled events before prompt echo, first token, or terminal finish events so frontend usage accounting has the cache metadata in time.
  • Updated non-Qwen3 scheduled senders to report cached_tokens: 0, preserving existing behavior for engines without prefix-cache accounting.
  • Updated the vLLM/OpenAI frontend to map engine-provided cache metadata into usage:
    • usage.prompt_tokens_details.cached_tokens
    • vLLM PrefillStats.num_cached_tokens
    • vLLM PrefillStats.num_local_cached_tokens
    • num_computed_tokens = prompt_tokens.saturating_sub(cached_tokens)
  • Added usage patching for OpenAI response bodies where the Rust vLLM server does not currently copy cached-token PrefillStats into the final OpenAI usage object.
  • Added Qwen3 engine-level prefix-cache debug stats for operators.

Why

Prefix caching is enabled by default for Qwen3-4B and can significantly reduce warm-request TTFT, but before this change the cache hit count was dropped at the scheduler/frontend boundary. API users, benchmarks, and operators could not tell whether a repeated prompt actually hit the prefix cache because usage always reported zero cached tokens and there was no engine-level cache hit/miss visibility.

This change makes per-request cache reuse visible in OpenAI-compatible usage while keeping cache accounting owned by the Qwen3 executor/scheduler rather than guessing in the frontend.

Usage patch design

The Rust vLLM frontend receives PrefillStats from TokenEvent::Scheduled, but OpenAI response usage did not reliably preserve the cached-token fields in the final response object. To avoid recomputing cache hits in the frontend, this PR adds a small response wrapper in pegainfer-vllm-frontend/src/patch_usage.rs.

The wrapper:

  • stores the engine-reported cached-token count by external request id when TokenEvent::Scheduled is processed;
  • wraps both /v1/completions and /v1/chat/completions;
  • patches only the returned usage.prompt_tokens_details.cached_tokens field;
  • handles both non-streaming responses and streaming final usage chunks when stream_options.include_usage=true;
  • removes the cached-token entry after patching to avoid stale request-id state;
  • defaults to 0 if an engine did not report cache metadata.

This keeps the frontend as a pass-through/patch layer: it does not infer prefix-cache behavior from prompt length, token ids, or repeated requests. The Qwen3 executor remains the source of truth for cached_tokens.

Engine-level cache info design

Issue #246 also asks for engine-level hit/miss counters logged or exposed for operators. This PR adds a minimal Qwen3-local debug log rather than a new metrics subsystem.

Qwen3 scheduler effects now maintain cumulative prefix-cache stats for the lifetime of the scheduler loop:

  • total_requests
  • hit_requests
  • miss_requests
  • hit_rate
  • total_prompt_tokens
  • total_cached_tokens
  • token_hit_rate

The stats are logged at debug! after each scheduled observation:

pegainfer_qwen3_4b::scheduler::effects: effects.rs:98 Qwen3 prefix cache stats: total_requests=8, hit_requests=6, miss_requests=2, hit_rate=0.7500, total_prompt_tokens=52184, total_cached_tokens=26080, token_hit_rate=0.4998

The log is debug-level by design: operators can enable it when investigating prefix-cache effectiveness, while default info logs remain quiet under normal serving load. No prompt text, token ids, generated text, or request payloads are logged.

Example startup filter:

RUST_LOG=debug  cargo run --release -- --model-path Qwen/Qwen3-4B --port 8000

Type of Change

  • Bug fix
  • Tests
  • New feature
  • Refactor
  • Documentation
  • Performance change

Validation

GPU-machine validation performed:

  • cargo test --release -p pegainfer-qwen3-4b --lib scheduler
    • 17 passed; 0 failed
  • Real /v1/completions cold/warm validation:
    • cold: cached_tokens=0
    • warm repeated prompt: cached_tokens=6752
  • Real /v1/chat/completions non-streaming cold/warm validation after chat route patch:
    • cold: cached_tokens=0
    • warm repeated prompt: cached_tokens=6512
  • Real streaming validation with stream_options: {"include_usage": true}:
    • /v1/completions usage chunk includes prompt_tokens_details.cached_tokens
    • /v1/chat/completions usage chunk includes prompt_tokens_details.cached_tokens

Local/sub-agent validation performed:

  • cargo test --release -p pegainfer-vllm-frontend --lib
  • cargo test --release -p pegainfer-sim --test frontend_e2e cached_tokens
  • cargo clippy --release -p pegainfer-vllm-frontend -p pegainfer-sim --all-targets -- -D warnings
  • cargo fmt --check
  • git diff --check

Suggested final checks before merge, if not already run on the final clean branch:

cargo test --release -p pegainfer-qwen3-4b --lib scheduler
cargo test --release -p pegainfer-vllm-frontend --lib
cargo test --release -p pegainfer-sim --test frontend_e2e cached_tokens
cargo clippy --release -p pegainfer-vllm-frontend -p pegainfer-sim --all-targets -- -D warnings

Checklist

  • API usage reports real Qwen3 cached-token counts instead of hard-coded zero.
  • TokenEvent::Scheduled carries cached_tokens.
  • Qwen3 scheduler preserves executor-computed cached-token counts through effects.
  • Qwen3 emits Scheduled before prompt echo, token, or finish events.
  • Non-Qwen3 engines preserve current behavior with cached_tokens: 0.
  • /v1/completions non-streaming and streaming usage paths are covered.
  • /v1/chat/completions non-streaming and streaming usage paths are covered.
  • Engine-level Qwen3 prefix-cache hit/miss counters are available through debug logs.
  • No frontend-side cache inference or recomputation was added.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces tracking and reporting of cached tokens across multiple engines (including DeepSeek, Kimi, Qwen, and Simulated engines) and integrates this with the vLLM frontend. Specifically, a new patch_usage module is added to the frontend to intercept completion endpoints and patch the OpenAI usage response with the engine-provided cached_tokens count. The review feedback highlights two critical memory leaks in this new module: one occurring on non-streaming error responses where failed requests are not cleaned up from the global tracking map, and another during streaming cancellations where early client disconnects bypass the cleanup logic after the generator loop.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +117 to +133
let status = response.status();
if !status.is_success() {
return response;
}
let (parts, body) = response.into_parts();
let bytes = match to_bytes(body, COMPLETION_USAGE_PATCH_BODY_LIMIT).await {
Ok(bytes) => bytes,
Err(error) => {
return (
StatusCode::INTERNAL_SERVER_ERROR,
Json(ErrorBody {
error: format!("failed to read completion response body: {error}"),
}),
)
.into_response();
}
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Memory Leak on Non-Streaming Error Responses

If the response status is not successful (e.g., 4xx or 5xx), the function currently returns early without removing the request_id from the global cached_tokens_by_request_id map. This causes a memory leak for every failed request.

We can fix this by reading the response body first, attempting to parse the id to clean up the map, and then returning early if the status is not successful.

    let status = response.status();
    let (parts, body) = response.into_parts();
    let bytes = match to_bytes(body, COMPLETION_USAGE_PATCH_BODY_LIMIT).await {
        Ok(bytes) => bytes,
        Err(error) => {
            return (
                StatusCode::INTERNAL_SERVER_ERROR,
                Json(ErrorBody {
                    error: format!("failed to read completion response body: {error}"),
                }),
            )
                .into_response();
        }
    };

    if let Ok(value) = serde_json::from_slice::<serde_json::Value>(&bytes) {
        if let Some(request_id) = value.get("id").and_then(serde_json::Value::as_str) {
            state.cached_tokens_by_request_id.write().await.remove(request_id);
        }
    }

    if !status.is_success() {
        return Response::from_parts(parts, Body::from(bytes));
    }

Comment on lines +158 to +183
let status = response.status();
if !status.is_success() {
return response;
}
let cached_tokens_by_request_id = state.cached_tokens_by_request_id;
let (mut parts, body) = response.into_parts();
parts.headers.remove(CONTENT_LENGTH);
let stream = body.into_data_stream();
let body = Body::from_stream(async_stream::stream! {
let mut stream = stream;
let mut request_id_for_cleanup = None;
while let Some(next) = stream.next().await {
let Ok(bytes) = next else {
continue;
};
yield Ok::<Bytes, std::convert::Infallible>(patch_sse_usage_chunk(
bytes,
&cached_tokens_by_request_id,
include_usage,
&mut request_id_for_cleanup,
).await);
}
if let Some(request_id) = request_id_for_cleanup {
cached_tokens_by_request_id.write().await.remove(&request_id);
}
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Memory Leak on Streaming Cancellation / Client Disconnect

In patch_streaming_completion_usage, if the client disconnects mid-stream, the stream is dropped. In Rust, dropping an async_stream::stream! generator drops its local variables but does not resume execution to run any code after the loop (lines 180-182). Thus, the cleanup .remove(&request_id) is bypassed, causing a memory leak.

We can fix this robustly by implementing a Drop guard. Since local variables are guaranteed to be dropped when the generator is dropped, the Drop implementation of the guard will always run even on early stream cancellation.

    let status = response.status();
    if !status.is_success() {
        return response;
    }
    let cached_tokens_by_request_id = state.cached_tokens_by_request_id;
    let (mut parts, body) = response.into_parts();
    parts.headers.remove(CONTENT_LENGTH);
    let stream = body.into_data_stream();

    struct StreamCleanupGuard {
        request_id: Option<String>,
        map: CachedTokenUsageMap,
    }
    impl Drop for StreamCleanupGuard {
        fn drop(&mut self) {
            if let Some(ref id) = self.request_id {
                let map = self.map.clone();
                let id = id.clone();
                tokio::spawn(async move {
                    map.write().await.remove(&id);
                });
            }
        }
    }

    let body = Body::from_stream(async_stream::stream! {
        let mut stream = stream;
        let mut guard = StreamCleanupGuard {
            request_id: None,
            map: cached_tokens_by_request_id.clone(),
        };
        while let Some(next) = stream.next().await {
            let Ok(bytes) = next else {
                continue;
            };
            yield Ok::<Bytes, std::convert::Infallible>(patch_sse_usage_chunk(
                bytes,
                &guard.map,
                include_usage,
                &mut guard.request_id,
            ).await);
        }
    });

@jackyYang6 jackyYang6 force-pushed the fix/issue-246-cached-token-usage-clean branch from c968f1f to 9d84879 Compare June 7, 2026 10:30
@xiaguan

xiaguan commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d8487980c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +354 to +356
if request_id_for_cleanup.is_none() {
if let Some(request_id) = value.get("id").and_then(serde_json::Value::as_str) {
*request_id_for_cleanup = Some(request_id.to_string());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Update cleanup id after seeing stream id

For streaming requests that include X-Request-Id or a body request_id, request_id_for_cleanup starts as the fallback chatcmpl-<client-id>, but vLLM still emits and the bridge stores cached-token usage under the actual generated response id. Because this block only records the stream's id when the cleanup id is None, a client disconnect before the usage chunk (or a stream without included usage) drops the cleanup guard while still removing only the fallback key, leaving the actual cached-token entry in CACHED_TOKENS_BY_REQUEST_ID indefinitely for those requests.

Useful? React with 👍 / 👎.

@xiaguan

xiaguan commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Heads up — the root cause is upstream: the vllm rust frontend never fills usage.prompt_tokens_details.cached_tokens (Usage::from_counts hardcodes it to None). Filed as vllm-project/vllm#44824 — the cleanest path is to land that upstream and bump our pinned rev, then drop the response-patch layer here.

@jackyYang6

Copy link
Copy Markdown
Contributor Author

Totally agreed. I initially resorted to this response-patch layer precisely because of this upstream limitation, and as a result, the patching code has grown increasingly bloated and ugly.

Once the upstream fix lands and we bump our pinned rev, I'll gladly drop the patch layer, rewrite it with clean code, and resubmit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants