release-26.2: cloud/amazon: retry s3 requests on credential-expiry errors #169775
Merged
trunk-io[bot] merged 1 commit into cockroachdb:release-26.2 from May 6, 2026
When the configured credentials provider issues short-lived tokens, a long-running operation that issues many concurrent multipart `UploadPart` calls can race the credential expiry: a request signed just before the credentials expire can arrive at S3 just after, and S3 returns `ExpiredToken` / `ExpiredTokenException` / `RequestExpired`.

`aws-sdk-go` v1 swallowed this race by classifying these three codes as retryable in its default retryer (see [`credsExpiredCodes`][v1-codes] folded into [`isCodeRetryable`][v1-isretry], plus the matching post-attempt cache invalidation in [the after-retry handler][v1-handler], which calls `Credentials.Expire()` between attempts). `aws-sdk-go-v2`'s standard retryer ships [`DefaultRetryableErrorCodes`][v2-codes] containing only `RequestTimeout` / `RequestTimeoutException`, so what was a transparent retry in v1 became a hard, user-visible failure when CRDB migrated to v2.

This commit restores the v1 behavior for the s3 client by wrapping the standard retryer with [`retry.AddWithErrorCodes`][v2-add] for the same three codes v1 used. The cache-invalidation half from v1 is not separately needed in practice: by the time the retry's backoff completes, local time has advanced past the cached credentials' `Expires`, so [the cache's `IsExpired` check on `Retrieve`][v2-cache] forces a fresh retrieve on the next signing pass on its own.

The `aws_kms.go` retryer has the same gap and could be addressed in a follow-up; this commit keeps the s3 fix isolated for ease of backporting to release branches.
[v1-codes]: https://github.com/aws/aws-sdk-go/blob/v1.40.37/aws/request/retryer.go#L98-L105
[v1-isretry]: https://github.com/aws/aws-sdk-go/blob/v1.40.37/aws/request/retryer.go#L112-L118
[v1-handler]: https://github.com/aws/aws-sdk-go/blob/v1.40.37/aws/corehandlers/handlers.go#L209-L214
[v2-codes]: https://github.com/aws/aws-sdk-go-v2/blob/v1.36.3/aws/retry/standard.go#L51-L65
[v2-add]: https://github.com/aws/aws-sdk-go-v2/blob/v1.36.3/aws/retry/retry.go#L10-L23
[v2-cache]: https://github.com/aws/aws-sdk-go-v2/blob/v1.36.3/aws/credential_cache.go#L98

Release note (bug fix): A long-running BACKUP to s3 using AUTH=implicit no longer fails with an ExpiredToken error when it races the rotation of the underlying short-lived credentials; the s3 client now retries ExpiredToken, ExpiredTokenException, and RequestExpired the same way the legacy aws-sdk-go v1 client did.
Thanks for opening a backport. Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.
😎 Merged successfully - details.
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
msbutler approved these changes May 5, 2026
Backport 1/1 commits from #169763 on behalf of @dt.
Release justification: