fix: Retry logic for when Aurora is in paused state #858

axelpina · 2025-05-06T20:31:50Z

Issue #848

Description of changes:

This PR addresses issue [#848] by adding a two-stage retry mechanism for when the Aurora Serverless v2 cluster is auto-paused (min ACU=0). Both the high-level tool wrapper (knowledge.py) and the low-level retrieval function (vector_search.py) now detect the “resuming after auto-pause” error message. They wait 15s after the first failure to let the cluster wake up, then 45s after the second failure, and only then surface any remaining exception.

The retries only kick in when we hit the Aurora “resuming after auto-pause” error. The code looks for the exact substring (case-insensitive) "resuming after auto-pause" in the exception message. The other back-end mechanisms remain unchanged.
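A minimal sketch of the detection step, for illustration only. The helper name `is_aurora_resuming_error` and the `PAUSE_MARKERS` tuple are hypothetical and not taken from the diff. Note that the error text quoted under Background reads "resuming after being auto-paused", which does not contain the literal substring "resuming after auto-pause", so this sketch accepts both wordings. Which wording the actual change matches is not shown here.

```python
# Hypothetical helper: decides whether an exception is the Aurora
# "resuming after auto-pause" error by a case-insensitive substring check.
PAUSE_MARKERS = (
    "resuming after auto-pause",         # substring named in this description
    "resuming after being auto-paused",  # wording of the error quoted under Background
)


def is_aurora_resuming_error(exc: BaseException) -> bool:
    """Return True if the exception message contains one of the pause markers."""
    message = str(exc).lower()
    return any(marker in message for marker in PAUSE_MARKERS)
```

Any exception for which this returns False would fall through to the existing error handling unchanged.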

Background

When using Aurora Serverless as the vector store backing a Bedrock Knowledge Base, the first retrieval after a pause fails with:

The Aurora DB instance arn:… is resuming after being auto-paused. Please wait a few seconds and try again

Furthermore, once that initial error occurs, subsequent tool invocations in the same conversation skip the KB entirely. Adding retries ensures that a paused cluster has time to resume and that embeddings are actually retrieved on the first user turn.

Changes

import time
- Added to both knowledge.py and vector_search.py to support time.sleep() delays.
knowledge.py (search_knowledge)
- Wrapped the call to search_related_docs() in a for-loop over two delays: [15, 45].
- On catching any exception whose message contains "resuming after auto-pause", log a warning and sleep() before retrying.
- All other exceptions, and pause errors that persist after both delays, are logged and re-raised with the full traceback.
vector_search.py (_bedrock_knowledge_base_search)
- Wrapped the agent_client.retrieve(...) call in the same two-stage retry loop.
- Detects the same pause message (case-insensitive) in ClientError and re-issues the call after delays.
- Consolidated error logging: non-retryable errors now bubble up immediately; retryable ones are distinguished by warning logs.
Cleaned up unreachable code
- Removed the standalone except ClientError at the end of _bedrock_knowledge_base_search, since all error handling now occurs within the retry loop.
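The retry loop described above could look roughly like the following. This is a sketch under stated assumptions, not the code from the diff: `call_with_pause_retry`, `_is_resuming`, and the injectable `sleep` parameter are hypothetical names, and the sketch assumes one final attempt is made after the 45s delay (three attempts in total), which the description does not state explicitly.

```python
import logging
import time
from typing import Callable, Sequence, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")

RETRY_DELAYS_SECONDS = (15, 45)  # delays named in this PR description


def _is_resuming(exc: Exception) -> bool:
    # Loose, case-insensitive match covering both
    # "resuming after auto-pause" and "resuming after being auto-paused".
    text = str(exc).lower()
    return "resuming after" in text and "auto-pause" in text


def call_with_pause_retry(
    fn: Callable[[], T],
    delays: Sequence[float] = RETRY_DELAYS_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), sleeping and retrying only on the Aurora resume error."""
    for delay in delays:
        try:
            return fn()
        except Exception as exc:
            if not _is_resuming(exc):
                raise  # non-retryable errors bubble up immediately
            logger.warning("Aurora is resuming; retrying in %ss: %s", delay, exc)
            sleep(delay)
    # Assumed final attempt: any exception raised here propagates to the caller.
    return fn()
```

In search_knowledge this would wrap the search_related_docs() call, for example `call_with_pause_retry(lambda: search_related_docs(...))`, with the arguments left as they are today. Passing `sleep` as a parameter keeps the delays out of unit tests.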

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

backend/app/vector_search.py

backend/app/agents/tools/knowledge.py

statefb

Please fix the CI failures, thank you!

statefb · 2025-05-13T01:57:51Z

backend/app/agents/tools/knowledge.py

@@ -1,5 +1,7 @@
 import logging
 import traceback
+from retry import retry


Could you replace it with reretry as merged on #860?

Axel Pina and others added 4 commits April 29, 2025 11:34

fix: strip unsigned reasoning blocks for DeepSeek R1 to avoid ValidationException

8672759

fix: lint / formatting

e295fe6

Merge branch 'aws-samples:v3' into v3

d9753a1

fix: Retry logic for when Aurora is in paused state

919ef04

axelpina requested review from statefb, wadabee and Yukinobu-Mine as code owners May 6, 2025 20:31

statefb reviewed May 8, 2025

backend/app/vector_search.py (outdated, resolved)

backend/app/agents/tools/knowledge.py (resolved)

Axel Pina added 2 commits May 8, 2025 11:25

fix: reverted vector search changes

32bf293

fix: Addressing PR comments

a00bc0c

axelpina requested a review from statefb May 8, 2025 15:46

statefb reviewed May 13, 2025

Merge branch 'aws-samples:v3' into v3

c9d0875

axelpina closed this May 26, 2025
