Skip to content

fix: Retry logic for when Aurora is in paused state #858

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

axelpina
Copy link
Contributor

@axelpina axelpina commented May 6, 2025

Issue #848

Description of changes:

This PR addresses issue [#848] by adding a two-stage retry mechanism whenever the Aurora Serverless v2 cluster is auto-paused (min ACU=0). Both the high-level tool wrapper (knowledge.py) and the low-level retrieval function (vector_search.py) now detect the “resuming after auto-pause” error message, pause for 15s on first failure (to allow the cluster to wake up), then for 45s on second failure, before finally surfacing any remaining exception.

The retries only kick in when we hit the Aurora “resuming after auto-pause” error. The code looks for the exact substring (case-insensitive) "resuming after auto-pause" in the exception message. The other back-end mechanisms remain unchanged.

Background

When using Aurora Serverless as the vector store backing a Bedrock Knowledge Base, the first retrieval after a pause fails with:

The Aurora DB instance arn:… is resuming after being auto-paused. Please wait a few seconds and try again

Furthermore, once that initial error occurs, subsequent tool invocations in the same conversation skip the KB entirely. Adding retries ensures that a paused cluster has time to resume and that embeddings are actually retrieved on the first user turn.

Changes

  • Import time

    • Added to both knowledge.py and vector_search.py to support time.sleep() delays.
  • knowledge.py (search_knowledge)

    • Wrapped the call to search_related_docs() in a for-loop over two delays: [15, 45].
    • On catching any exception whose message contains "resuming after auto-pause", log a warning and sleep() before retrying.
    • All other exceptions (or after both delays) are logged and re-raised with full traceback.
  • vector_search.py (_bedrock_knowledge_base_search)

    • Wrapped the agent_client.retrieve(...) call in the same two-stage retry loop.
    • Detects the same pause message (case-insensitive) in ClientError and re-issues the call after delays.
    • Consolidated error logging: non-retryable errors now bubble up immediately; retryable ones are distinguished by warning logs.
  • Cleaned up unreachable code

    • Removed the standalone except ClientError at the end of _bedrock_knowledge_base_search, since all error handling now occurs within the retry loop.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@axelpina axelpina requested a review from statefb May 8, 2025 15:46
Copy link
Contributor

@statefb statefb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the CI fails, thank you!

@@ -1,5 +1,7 @@
import logging
import traceback
from retry import retry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you replace it with reretry as merged on #860?

@axelpina axelpina closed this May 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants