fix(llm): Add async streaming support to ChatNVIDIA provider patch #1504
base: develop
Conversation
Enables `stream_async()` to work with ChatNVIDIA/NIM models by implementing an async streaming decorator and an `_agenerate` method. Prior to this fix, `stream_async()` would fail with NIM engine configurations.
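As a rough illustration of the pattern (not the exact code in this patch), an async streaming decorator along these lines can route `_agenerate` through `_astream` and aggregate the chunks with LangChain's `agenerate_from_stream` helper; the `streaming` flag handling below is an assumption.

```python
# A sketch of the async streaming decorator pattern; assumes the provider
# implements `_astream` and that a `streaming` flag requests streaming.
import functools
from typing import Any, List, Optional

from langchain_core.callbacks import AsyncCallbackManagerForLLMRun
from langchain_core.language_models.chat_models import agenerate_from_stream
from langchain_core.messages import BaseMessage
from langchain_core.outputs import ChatResult


def async_stream_decorator(func):
    """Route `_agenerate` through `_astream` when streaming is requested."""

    @functools.wraps(func)
    async def wrapper(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        should_stream = kwargs.pop("streaming", getattr(self, "streaming", False))
        if should_stream:
            # Stream token chunks from the provider and aggregate them into a
            # single ChatResult once the stream is exhausted.
            stream_iter = self._astream(
                messages, stop=stop, run_manager=run_manager, **kwargs
            )
            return await agenerate_from_stream(stream_iter)
        # Fall back to the wrapped non-streaming implementation.
        return await func(self, messages, stop=stop, run_manager=run_manager, **kwargs)

    return wrapper
```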
tgasser-nv
left a comment
Thanks for fixing this. I'm really surprised streaming with the `nim` engine type has been broken the whole time (unless I misunderstood?). Here are a couple of follow-up actions:
- Before merging, can you run this yourself locally (using curl or similar) to make sure the chunks stream correctly after the fix?
- How was this not caught in QA in a streaming test until now? We need a test in QA that fails prior to this fix and passes after it; that will prevent regressions (a sketch of such a test is included after this list).
- What is the guidance on `LIVE_TEST_MODE`? Should we run it as part of the test plan for each PR? I usually just run `pytest tests -q` but don't enable `LIVE_TEST_MODE`.
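A hedged sketch of the kind of live regression test requested above: before this fix everything arrives as a single chunk after the full generation time, so the assertions below fail; after the fix they pass. The `LIVE_TEST_MODE` gating, config path, and prompt are assumptions, not the repository's actual test conventions.

```python
# Requires pytest-asyncio; gated on LIVE_TEST_MODE so it only runs against a
# live NIM endpoint. Config path and prompt are illustrative assumptions.
import os
import time

import pytest

from nemoguardrails import LLMRails, RailsConfig

LIVE = os.environ.get("LIVE_TEST_MODE") == "1"


@pytest.mark.skipif(not LIVE, reason="requires a live NIM endpoint")
@pytest.mark.asyncio
async def test_stream_async_yields_incremental_chunks():
    config = RailsConfig.from_path("./config_with_nim_engine")
    rails = LLMRails(config)

    start = time.monotonic()
    chunk_times = []
    async for _chunk in rails.stream_async(
        messages=[{"role": "user", "content": "Tell me a short story."}]
    ):
        chunk_times.append(time.monotonic() - start)

    # Before this fix everything arrived as one chunk at the very end.
    assert len(chunk_times) > 1
    # The first chunk should arrive well before the last one.
    assert chunk_times[0] < 0.5 * chunk_times[-1]
```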
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Greptile Summary
Confidence Score: 4/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant LLMRails
    participant ChatNVIDIA
    participant async_stream_decorator
    participant _astream
    participant agenerate_from_stream
    participant NIM
    User->>LLMRails: "stream_async(messages)"
    LLMRails->>ChatNVIDIA: "_agenerate(messages, streaming=True)"
    ChatNVIDIA->>async_stream_decorator: "wrapper called"
    async_stream_decorator->>async_stream_decorator: "check should_stream=True"
    async_stream_decorator->>_astream: "self._astream(messages)"
    _astream->>NIM: "send request"
    loop For each token
        NIM-->>_astream: "yield chunk"
        _astream-->>async_stream_decorator: "yield chunk"
    end
    async_stream_decorator->>agenerate_from_stream: "await agenerate_from_stream(stream_iter)"
    agenerate_from_stream->>agenerate_from_stream: "aggregate chunks"
    agenerate_from_stream-->>async_stream_decorator: "ChatResult"
    async_stream_decorator-->>ChatNVIDIA: "ChatResult"
    ChatNVIDIA-->>LLMRails: "ChatResult"
    LLMRails-->>User: "streamed response"
```
Greptile review: 2 files reviewed, 1 comment.
```python
async def _agenerate(
    self,
    messages: List[BaseMessage],
    stop: Optional[List[str]] = None,
    run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
    **kwargs: Any,
) -> ChatResult:
    return await super()._agenerate(
        messages=messages, stop=stop, run_manager=run_manager, **kwargs
    )
```
logic: The async `_agenerate` method is missing the `callbacks` parameter that was added to the sync `_generate` (line 137). This inconsistency may cause issues if callers expect a uniform API.
Suggested change:

```diff
+@async_stream_decorator
 async def _agenerate(
     self,
     messages: List[BaseMessage],
     stop: Optional[List[str]] = None,
     run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
+    callbacks: Callbacks = None,
     **kwargs: Any,
 ) -> ChatResult:
     return await super()._agenerate(
-        messages=messages, stop=stop, run_manager=run_manager, **kwargs
+        messages=messages, stop=stop, run_manager=run_manager, callbacks=callbacks, **kwargs
     )
```
Description
Adds async streaming support to the patched ChatNVIDIA provider, enabling `stream_async()` to properly stream with NIM models. Prior to this fix, `stream_async()` with the NIM engine would block for the complete generation and yield all content at once, rather than streaming incrementally.

Problem
Before this fix, async streaming tests would fail because chunks weren't streamed incrementally. For example:
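The original example output isn't reproduced here, but a minimal script along these lines shows the symptom and is roughly how TTFT numbers like the table below can be collected: before the fix there is a single chunk and TTFT is about equal to the total time; after the fix many chunks arrive and TTFT is much smaller. The config path and prompt are illustrative assumptions.

```python
import asyncio
import os
import time

from nemoguardrails import LLMRails, RailsConfig


async def main():
    assert os.environ.get("NVIDIA_API_KEY"), "set NVIDIA_API_KEY first"
    # Hypothetical config directory that uses the `nim` engine type.
    rails = LLMRails(RailsConfig.from_path("./config_with_nim_engine"))

    start = time.monotonic()
    first_token_at = None
    chunks = 0
    async for _chunk in rails.stream_async(
        messages=[{"role": "user", "content": "Write two sentences about GPUs."}]
    ):
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        chunks += 1
    total = time.monotonic() - start

    # Before the fix: chunks == 1 and TTFT ~= total (nothing streams).
    # After the fix: many chunks, with TTFT well below the total time.
    print(f"chunks={chunks} TTFT={first_token_at:.3f}s total={total:.3f}s")


asyncio.run(main())
```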
Tested it on all the supported versions:
| Version | TTFT (s) | Total Time (s) |
|---------|----------|----------------|
| 0.3.1   | 0.473    | 2.875          |
| 0.3.2   | 0.498    | 2.797          |
| 0.3.3   | 0.476    | 2.745          |
| 0.3.4   | 0.450    | 2.507          |
| 0.3.5   | 0.489    | 2.760          |
| 0.3.6   | 0.461    | 2.787          |
| 0.3.7   | 0.444    | 2.886          |
| 0.3.8   | 0.432    | 2.831          |
| 0.3.9   | 0.479    | 2.794          |
| 0.3.10  | 0.440    | 2.426          |
| 0.3.11  | 0.433    | 2.909          |
| 0.3.12  | 0.471    | 2.692          |
| 0.3.13  | 0.455    | 2.603          |
| 0.3.14  | 0.474    | 2.825          |
| 0.3.15  | 0.444    | 2.488          |
| 0.3.16  | 0.473    | 2.645          |
| 0.3.17  | 0.432    | 2.857          |
| 0.3.18  | 0.473    | 3.438          |
| 0.3.19  | 0.458    | 2.824          |

Prior to the fix, 0.3.19 produced a different error because async API support was added to ChatNVIDIA in 0.3.19; this is fixed too.
Ensure you have `langchain-nvidia-ai-endpoints` installed and `NVIDIA_API_KEY` set.
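For a quick sanity check of the underlying provider outside of guardrails, something like the following can confirm that `ChatNVIDIA` itself streams asynchronously (the model name is an assumption):

```python
import asyncio
import os

from langchain_nvidia_ai_endpoints import ChatNVIDIA


async def main():
    assert os.environ.get("NVIDIA_API_KEY"), "set NVIDIA_API_KEY first"
    llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")  # model name is an assumption
    async for chunk in llm.astream("Say hello in five words."):
        print(chunk.content, end="", flush=True)
    print()


asyncio.run(main())
```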