fix(task-store): retry on transient transport errors instead of dropping prompt #2090
Open
yashrajshuklaaa wants to merge 1 commit into
Pull request overview
Adds retry handling in the shared KAgentTaskStore HTTP layer to prevent BYO agents from silently dropping prompts when the agent→controller hop encounters transient httpx.TransportError conditions (e.g., stale keep-alive connections reset by the mesh).
Changes:
- Introduce _request_with_retry() in KAgentTaskStore to retry once on httpx.TransportError.
- Route save/get/delete through the new retry helper and document the new error behavior.
- Add logging for transport retry attempts.
…ing prompt Fixes kagent-dev#2086 Signed-off-by: Yashraj Shukla <shuklayashraj68@gmail.com>
0175791 to
084d78b
Compare
When the agent→controller HTTP hop raises httpx.TransportError (idle keep-alive connection reset by the Istio/HBONE mesh, controller pod reschedule, etc.), the error previously propagated uncaught out of KAgentTaskStore.get/save, silently dropping the user prompt with no error surfaced and no recovery short of a pod restart.
Fix:
Introduce _request_with_retry() in KAgentTaskStore, which catches TransportError, calls aclose() to flush the stale connection pool, and retries once on a fresh connection. Non-transport HTTP errors (4xx/5xx) are re-raised immediately without retrying. If the transport error persists after all retries, it is re-raised so the caller sees a real error rather than a silent drop.
The fix lives entirely in kagent-core/_task_store.py and covers all four framework adapters (langgraph, adk, openai, crewai) automatically, since they all share KAgentTaskStore.
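For reviewers, here is a rough sketch of the retry shape described above. It is not the code in this PR. To run without httpx installed it uses a stand-in `TransportError` and a hypothetical `FakeClient`. `TaskStoreSketch`, `client_factory`, `max_retries` and `run` are illustrative names, not taken from kagent-core. The sketch also assumes a closed client cannot be reused, so it builds a fresh one from a factory after `aclose()`; how the real helper gets its fresh connection is not shown here.

```python
import asyncio


class TransportError(Exception):
    """Stand-in for httpx.TransportError so the sketch runs without httpx."""


class FakeClient:
    """Hypothetical client that fails while `failures_left` is positive."""

    failures_left = 0  # class-level counter simulating transient network state

    def __init__(self):
        self.closed = False

    async def request(self, method, url):
        if self.closed:
            raise RuntimeError("client is closed")
        if FakeClient.failures_left > 0:
            FakeClient.failures_left -= 1
            raise TransportError("connection reset")
        return f"{method} {url} ok"

    async def aclose(self):
        self.closed = True


class TaskStoreSketch:
    """Illustrative store; only the retry helper is modelled."""

    def __init__(self, client_factory, max_retries=1):
        self._client_factory = client_factory
        self._client = client_factory()
        self._max_retries = max_retries

    async def _request_with_retry(self, method, url):
        for attempt in range(self._max_retries + 1):
            try:
                return await self._client.request(method, url)
            except TransportError:
                # Out of retries: re-raise so the caller sees a real error.
                if attempt == self._max_retries:
                    raise
                # Drop the stale connections and retry on a fresh client.
                await self._client.aclose()
                self._client = self._client_factory()
            # Any other exception (e.g. an HTTP status error) is not caught
            # here, so it propagates immediately without a retry.


def run(failures):
    """Simulate `failures` consecutive transport errors, then issue one request."""
    FakeClient.failures_left = failures
    store = TaskStoreSketch(FakeClient)
    return asyncio.run(store._request_with_retry("GET", "/tasks/1"))
```

With one simulated failure the request succeeds on the second attempt. With two, the `TransportError` reaches the caller instead of being swallowed.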
Fixes #2086