
Conversation


@mhdawson commented Sep 26, 2025

Refs: llamastack/llama-stack#3571

Llama stack unconditionally expects usage information when using Responses API and streaming when telemetry is enabled. For full details see llamastack/llama-stack#3571.

Debugging that issue revealed that LiteLLM does not honour a request for usage when streaming with the Vertex AI API. This PR adds that reporting using the same function that is used elsewhere.
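For context, here is a minimal sketch of the kind of request involved. It is not taken from the PR; the model name is a placeholder, and it assumes the OpenAI-style stream_options flag that LiteLLM exposes:

```python
# Minimal sketch (not from the PR): request usage while streaming through
# LiteLLM against a Vertex AI / Gemini model. Model name and credentials are
# placeholders.
import litellm

response = litellm.completion(
    model="vertex_ai/gemini-1.5-pro",          # placeholder model
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
    stream_options={"include_usage": True},    # ask for usage in the stream
)

last_chunk = None
for chunk in response:
    last_chunk = chunk

# With include_usage set, the final chunk is expected to carry aggregated
# token usage; the issue reported here is that it was missing on the
# Vertex AI path.
print(getattr(last_chunk, "usage", None))
```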

NOTE: Some of the changes come from running make format. The files I touched did not previously meet the formatting requirements, so formatting added changes beyond the lines I added/changed.

Title

Fix usage reporting with CustomWrapper

Relevant issues

Refs: llamastack/llama-stack#3571

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have Added testing in the tests/litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
[screenshot of the new test passing locally]

Type

🐛 Bug Fix

Changes

Refs: llamastack/llama-stack#3571

Llama stack unconditionally expects usage information when using Responses API and streaming when telemetry is enabled. For full details see llamastack/llama-stack#3571.

Debugging that issue revealed that LiteLLM does not honour a request for usage when streaming with the Vertex AI API. This PR adds that reporting using the same function as used elsewhere.

Signed-off-by: Michael Dawson <[email protected]>

@CLAassistant commented Sep 26, 2025

CLA assistant check
All committers have signed the CLA.

@mhdawson (Author) commented Sep 26, 2025

The linting failures don't seem related to any files that I changed and also seem to fail on prior PRs.

@mhdawson (Author) commented Sep 26, 2025

Looking through recent history, I don't see many PRs that have passed the Mock tests. That, along with the failed test not appearing to relate to any of the changes in this PR, makes me think the failure is not related to the PR.

and hasattr(self, "chunks")
):
# Calculate usage from accumulated chunks
usage = calculate_total_usage(chunks=self.chunks)
@krrishdholakia (Contributor) commented:
this would do it every time the model response object is created.

I can see us doing a usage calculation for gemini on streaming already @mhdawson

usage = VertexGeminiConfig._calculate_usage(

Is there a minimal script you can share for me to reproduce the issue? Curious what's happening.

@mhdawson (Author) commented Sep 29, 2025

@krrishdholakia thanks for following up. My reproduction unfortunately involves Llama Stack and the Responses API, where the usage field was not being populated. Does the test added in this PR potentially show the issue? I think it confirms that usage is not populated when it is not requested and the custom wrapper is in use.

@mhdawson (Author) commented Sep 30, 2025

Since I don't know the code base well, I asked Claude to explain the issue. This is what it said:

_calculate_usage was being called during streaming - specifically in ModelResponseIterator.chunk_parser() at vertex_and_google_ai_studio_gemini.py:2130, where it calculates usage for each individual chunk.

The problem was that CustomStreamWrapper wasn't aggregating this usage information from the chunks.

Here's the flow:

  1. Per-chunk calculation (already happening): ModelResponseIterator.chunk_parser() calls _calculate_usage() on each streaming chunk and sets model_response.usage (line 2142)
  2. Missing aggregation (the bug): CustomStreamWrapper collects these chunks in self.chunks but wasn't extracting/aggregating the usage data when stream_options was enabled
  3. The fix:
    - Passes stream_options to CustomStreamWrapper so it knows to enable usage tracking (lines 1749, 1763 in vertex file)
    - Added code in CustomStreamWrapper.model_response_creator() (streaming_handler.py:665-672) that calls calculate_total_usage(chunks=self.chunks) to aggregate usage from all chunks

So _calculate_usage was running, but its results were being discarded. The fix enabled CustomStreamWrapper to collect and report the aggregated usage when send_stream_usage=True.
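To make the aggregation step concrete, here is a minimal sketch. The helper names and chunk shape are hypothetical, it assumes per-chunk usage carries cumulative token counts, and it is not LiteLLM's actual calculate_total_usage:

```python
# Hypothetical sketch of the aggregation step described above; not LiteLLM's
# actual implementation.
from typing import Any, List, Optional


def aggregate_stream_usage(chunks: List[Any]) -> Optional[dict]:
    """Combine per-chunk usage into one usage summary.

    Assumes each chunk may carry a `usage` attribute with prompt_tokens /
    completion_tokens, as produced by the per-chunk _calculate_usage() calls
    described above, and that later chunks report cumulative counts.
    """
    latest = None
    for chunk in chunks:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            latest = usage  # keep the most recent (cumulative) usage seen
    if latest is None:
        return None
    prompt = getattr(latest, "prompt_tokens", 0) or 0
    completion = getattr(latest, "completion_tokens", 0) or 0
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }


def attach_usage_if_requested(response: Any, chunks: List[Any],
                              send_stream_usage: bool) -> None:
    # Only report usage when the caller asked for it via stream_options,
    # mirroring the send_stream_usage gating mentioned above (hypothetical
    # wiring).
    if send_stream_usage:
        usage = aggregate_stream_usage(chunks)
        if usage is not None:
            response.usage = usage
```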

@mhdawson (Author) commented:

@krrishdholakia not sure if there is anything I need to do to get this untagged as waiting on a response.
