fix: fix usage reporting with CustomWrapper #14961
Conversation
Refs: llamastack/llama-stack#3571

Llama Stack unconditionally expects usage information when using the Responses API with streaming and telemetry enabled. For full details see llamastack/llama-stack#3571. Debugging that issue revealed that LiteLLM does not honour a request for usage when streaming with the Vertex API. This PR adds that reporting using the same function as is used elsewhere.

Signed-off-by: Michael Dawson <[email protected]>
The linting failures don't seem related to any files that I changed and also seem to fail on prior PRs.
Looking through recent history I don't see very many PRs that have passed the Mock tests. That, along with not seeing how the failing test could be related to any of the changes in this PR, makes me think it's not related to the PR.
```python
    and hasattr(self, "chunks")
):
    # Calculate usage from accumulated chunks
    usage = calculate_total_usage(chunks=self.chunks)
```
this would do it every time the model response object is created.
i can see us doing a usage calculation for gemini on streaming already @mhdawson
`usage = VertexGeminiConfig._calculate_usage(`
is there a minimal script you can share for me to reproduce the issue? Curious what's happening
@krrishdholakia thanks for following up. My recreate unfortunately is with llama stack and the responses API, where the usage field was not being populated. Does the test being added in this PR potentially show the issue? I think it confirms that usage is not populated when it is not requested and the custom wrapper is in use.
Since I don't know the code base well, I asked Claude to explain the issue. This is what it said:
_calculate_usage was being called during streaming - specifically in ModelResponseIterator.chunk_parser() at vertex_and_google_ai_studio_gemini.py:2130, where it calculates usage for each individual chunk.
The problem was that CustomStreamWrapper wasn't aggregating this usage information from the chunks.
Here's the flow:
- Per-chunk calculation (already happening): ModelResponseIterator.chunk_parser() calls _calculate_usage() on each streaming chunk and sets model_response.usage (line 2142)
- Missing aggregation (the bug): CustomStreamWrapper collects these chunks in self.chunks but wasn't extracting/aggregating the usage data when stream_options was enabled
- The fix:
- Passes stream_options to CustomStreamWrapper so it knows to enable usage tracking (lines 1749, 1763 in vertex file)
- Added code in CustomStreamWrapper.model_response_creator() (streaming_handler.py:665-672) that calls calculate_total_usage(chunks=self.chunks) to aggregate usage from all chunks
So _calculate_usage was running, but its results were being discarded. The fix enabled CustomStreamWrapper to collect and report the aggregated usage when send_stream_usage=True.
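To make the aggregation step concrete, here is a minimal, self-contained sketch of the idea, assuming per-chunk usage objects that carry prompt/completion token counts. It is not LiteLLM's actual calculate_total_usage or CustomStreamWrapper code, the names below are simplified assumptions, and the real helper may combine the counts differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkUsage:
    # Per-chunk usage as a provider's chunk parser might attach it
    # (i.e. the kind of values _calculate_usage sets on each streamed chunk).
    prompt_tokens: int
    completion_tokens: int


@dataclass
class Chunk:
    content: str
    usage: Optional[ChunkUsage] = None


def aggregate_stream_usage(chunks: List[Chunk]) -> Optional[ChunkUsage]:
    """Fold per-chunk usage into one total for the final response.

    Simplifying assumption: prompt tokens are the same on every chunk that
    reports them, while completion tokens accumulate across chunks.
    """
    prompt_tokens = 0
    completion_tokens = 0
    saw_usage = False
    for chunk in chunks:
        if chunk.usage is None:
            continue
        saw_usage = True
        prompt_tokens = chunk.usage.prompt_tokens
        completion_tokens += chunk.usage.completion_tokens
    return ChunkUsage(prompt_tokens, completion_tokens) if saw_usage else None


if __name__ == "__main__":
    chunks = [
        Chunk("Hel", ChunkUsage(prompt_tokens=12, completion_tokens=1)),
        Chunk("lo", ChunkUsage(prompt_tokens=12, completion_tokens=1)),
        Chunk("!", ChunkUsage(prompt_tokens=12, completion_tokens=1)),
    ]
    print(aggregate_stream_usage(chunks))  # ChunkUsage(prompt_tokens=12, completion_tokens=3)
```

The point is that the per-chunk numbers already exist; the fix only needs to sum them once, at the moment the wrapper builds the final response.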
@krrishdholakia not sure if there is anything I need to do so this gets untagged as waiting on a response.
Refs: llamastack/llama-stack#3571
Llama Stack unconditionally expects usage information when using the Responses API with streaming and telemetry enabled. For full details see llamastack/llama-stack#3571.
Debugging that issue revealed that LiteLLM does not honour a request for usage when streaming with the Vertex API. This PR adds that reporting using the same function as is used elsewhere.
NOTE: Some of the changes are due to running `make format`. The files I updated did not previously meet the formatting requirements, so this added changes beyond the lines I added/changed.
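For context, this is the kind of call that exercises the fixed path: a minimal sketch assuming Vertex AI credentials are already configured in the environment, with an illustrative model name.

```python
import litellm

# Request usage on the stream via stream_options; with this fix the Vertex
# streaming path should report aggregated usage once the stream finishes.
response = litellm.completion(
    model="vertex_ai/gemini-1.5-pro",  # illustrative model name
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
    stream_options={"include_usage": True},
)

final_usage = None
for chunk in response:
    # Most chunks carry only content deltas; usage appears once it is aggregated.
    usage = getattr(chunk, "usage", None)
    if usage is not None:
        final_usage = usage

print(final_usage)  # expected to be populated when usage is requested
```

When usage is not requested via stream_options it should remain unset, which is what the added test checks.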
Title
Fix usage reporting with CustomWrapper
Relevant issues
Refs: llamastack/llama-stack#3571
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
- Tests added in the `tests/litellm/` directory (adding at least 1 test is a hard requirement - see details)
- Unit tests run with `make test-unit`
Type
🐛 Bug Fix
Changes