fix: Retry logic for when Aurora is in paused state #858
Issue #848
Description of changes:
This PR addresses issue [#848] by adding a two-stage retry mechanism whenever the Aurora Serverless v2 cluster is auto-paused (min ACU=0). Both the high-level tool wrapper (`knowledge.py`) and the low-level retrieval function (`vector_search.py`) now detect the "resuming after auto-pause" error message, pause for 15s on first failure (to allow the cluster to wake up), then for 45s on second failure, before finally surfacing any remaining exception.

The retries only kick in when we hit the Aurora "resuming after auto-pause" error. The code looks for the exact substring (case-insensitive) `"resuming after auto-pause"` in the exception message. The other back-end mechanisms remain unchanged.

Background
When using Aurora Serverless as the vector store backing a Bedrock Knowledge Base, the first retrieval after a pause fails with:
Furthermore, once that initial error occurs, subsequent tool invocations in the same conversation skip the KB entirely. Adding retries ensures that a paused cluster has time to resume and that embeddings are actually retrieved on the first user turn.
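As a rough illustration of the retry behavior this PR describes, here is a minimal sketch. It is not the code from the diff: the helper names (`is_resuming_error`, `call_with_resume_retry`) and the injectable `sleep` parameter are hypothetical, and it catches a generic `Exception` rather than the specific error type the real code handles. It follows one reading of the description: try, wait 15s, try again, wait 45s, then make a final attempt whose exception is surfaced.

```python
import logging
import time

logger = logging.getLogger(__name__)

# Substring and delays taken from the PR description.
RESUME_MARKER = "resuming after auto-pause"
RETRY_DELAYS = [15, 45]  # seconds


def is_resuming_error(exc: Exception) -> bool:
    """Case-insensitive substring check on the exception message (hypothetical helper)."""
    return RESUME_MARKER in str(exc).lower()


def call_with_resume_retry(fn, *args, delays=RETRY_DELAYS, sleep=time.sleep, **kwargs):
    """Call fn; on the auto-pause error, wait and retry once per delay (hypothetical helper)."""
    for delay in delays:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            # Any unrelated error is re-raised immediately, untouched.
            if not is_resuming_error(exc):
                raise
            logger.warning(
                "Aurora is resuming after auto-pause; retrying in %ss", delay
            )
            sleep(delay)
    # Final attempt: any remaining exception propagates to the caller.
    return fn(*args, **kwargs)
```

Passing `sleep` as a parameter is only a convenience for testing the loop without real waits; a direct `time.sleep()` call inside the loop behaves the same way.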
Changes
- Import `time` in `knowledge.py` and `vector_search.py` to support `time.sleep()` delays.
- `knowledge.py` (`search_knowledge`)
  - Wraps `search_related_docs()` in a for-loop over two delays: `[15, 45]`.
  - If the error message contains `"resuming after auto-pause"`, log a warning and `sleep()` before retrying.
- `vector_search.py` (`_bedrock_knowledge_base_search`)
  - Wraps the `agent_client.retrieve(...)` call in the same two-stage retry loop.
  - Catches `ClientError` and re-issues the call after delays.
- Cleaned up unreachable code
  - Removed the `ClientError` handling at the end of `_bedrock_knowledge_base_search`, since all error handling now occurs within the retry loop.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.