release-26.2: cloud/amazon: retry s3 requests on credential-expiry errors#169775

Merged
trunk-io[bot] merged 1 commit into cockroachdb:release-26.2 from dt:blathers/backport-release-26.2-169763
May 6, 2026
Conversation

@dt
Contributor

@dt dt commented May 5, 2026

Backport 1/1 commits from #169763 on behalf of @dt.


When the configured credentials provider issues short-lived tokens, a long-running operation that issues many concurrent multipart UploadPart calls can race the credential expiry: a request signed just before the credentials expire can arrive at S3 just after, and S3 returns ExpiredToken / ExpiredTokenException / RequestExpired.

aws-sdk-go v1 swallowed this race by classifying these three codes as retryable in its default retryer (see credsExpiredCodes folded into isCodeRetryable, plus the matching post-attempt cache invalidation in the after-retry handler, which calls Credentials.Expire() between attempts). aws-sdk-go-v2's standard retryer ships DefaultRetryableErrorCodes containing only RequestTimeout / RequestTimeoutException, so what was a transparent retry in v1 became a hard, user-visible failure when CRDB migrated to v2.

This commit restores the v1 behavior for the s3 client by wrapping the standard retryer with retry.AddWithErrorCodes for the same three codes v1 used. The cache-invalidation half from v1 is not separately needed in practice: by the time the retry's backoff completes, local time has advanced past the cached credentials' `Expires`, so the cache's IsExpired check on Retrieve forces a fresh retrieve on the next signing pass on its own.
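As a rough illustration of the wrapping described above, here is a stdlib-only Go model of a retry classifier that is extended with extra error codes. It is a sketch, not the SDK's code: `codedError`, `retryer`, `codeRetryer`, and `addWithErrorCodes` are hypothetical names chosen to mirror the shape of the change, and the real fix uses `retry.AddWithErrorCodes` from aws-sdk-go-v2 as stated in the description.

```go
package main

import (
	"errors"
	"fmt"
)

// codedError is a hypothetical stand-in for an API error that carries a
// service error code such as "ExpiredToken".
type codedError struct{ code string }

func (e *codedError) Error() string { return "api error: " + e.code }

// retryer is a minimal model of a retry classifier (not the SDK interface).
type retryer interface {
	IsErrorRetryable(err error) bool
}

// codeRetryer treats an error as retryable only if its code is in the set.
type codeRetryer struct{ codes map[string]struct{} }

func newCodeRetryer(codes ...string) *codeRetryer {
	set := make(map[string]struct{}, len(codes))
	for _, c := range codes {
		set[c] = struct{}{}
	}
	return &codeRetryer{codes: set}
}

func (r *codeRetryer) IsErrorRetryable(err error) bool {
	var ce *codedError
	if !errors.As(err, &ce) {
		return false
	}
	_, ok := r.codes[ce.code]
	return ok
}

// wrappedRetryer consults the extra codes first, then defers to the base.
type wrappedRetryer struct {
	base  retryer
	extra *codeRetryer
}

func (w *wrappedRetryer) IsErrorRetryable(err error) bool {
	return w.extra.IsErrorRetryable(err) || w.base.IsErrorRetryable(err)
}

// addWithErrorCodes models wrapping a base retryer with additional codes.
func addWithErrorCodes(base retryer, codes ...string) retryer {
	return &wrappedRetryer{base: base, extra: newCodeRetryer(codes...)}
}

func main() {
	// Model of the v2 default set named in the description.
	standard := newCodeRetryer("RequestTimeout", "RequestTimeoutException")
	// The three credential-expiry codes that v1 retried.
	wrapped := addWithErrorCodes(standard,
		"ExpiredToken", "ExpiredTokenException", "RequestExpired")

	for _, code := range []string{"RequestTimeout", "ExpiredToken", "AccessDenied"} {
		err := &codedError{code: code}
		fmt.Printf("%-16s standard=%t wrapped=%t\n",
			code, standard.IsErrorRetryable(err), wrapped.IsErrorRetryable(err))
	}
}
```

In this model the wrapper only widens the set of retryable codes; anything the base retryer already retries stays retryable, and unrelated errors such as `AccessDenied` still fail immediately.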

The aws_kms.go retryer has the same gap and could be addressed in a follow-up; this commit keeps the s3 fix isolated for ease of backporting to release branches.

Release note (bug fix): A long-running BACKUP to s3 using AUTH=implicit no longer fails with an ExpiredToken error when it races the rotation of the underlying short-lived credentials; the s3 client now retries ExpiredToken, ExpiredTokenException, and RequestExpired the same way the legacy aws-sdk-go v1 client did.


Release justification:

When the configured credentials provider issues short-lived tokens, a
long-running operation that issues many concurrent multipart
`UploadPart` calls can race the credential expiry: a request signed
just before the credentials expire can arrive at S3 just after, and S3
returns `ExpiredToken` / `ExpiredTokenException` / `RequestExpired`.

`aws-sdk-go` v1 swallowed this race by classifying these three codes
as retryable in its default retryer (see [`credsExpiredCodes`][v1-codes]
folded into [`isCodeRetryable`][v1-isretry], plus the matching
post-attempt cache invalidation in
[the after-retry handler][v1-handler] which calls `Credentials.Expire()`
between attempts). `aws-sdk-go-v2`'s standard retryer ships
[`DefaultRetryableErrorCodes`][v2-codes] containing only
`RequestTimeout` / `RequestTimeoutException`, so what was a transparent
retry in v1 became a hard, user-visible failure when CRDB migrated to
v2.

This commit restores the v1 behavior for the s3 client by wrapping the
standard retryer with [`retry.AddWithErrorCodes`][v2-add] for the same
three codes v1 used. The cache-invalidation half from v1 is not
separately needed in practice: by the time the retry's backoff
completes, local time has advanced past the cached credentials'
`Expires`, so [the cache's `IsExpired` check on `Retrieve`][v2-cache]
forces a fresh retrieve on the next signing pass on its own.
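
The argument that no explicit cache invalidation is needed can be sketched with a small stdlib-only model. This is a simplified illustration under stated assumptions, not the SDK's credential cache: `credentials`, `cache`, `retrieve`, and `simulate` are hypothetical names, the clock is injected so the timeline is deterministic, and the two-second backoff is an arbitrary value picked for the example.

```go
package main

import (
	"fmt"
	"time"
)

// credentials models a short-lived token with an expiry time.
type credentials struct {
	token   string
	expires time.Time
}

// cache is a simplified model of a credentials cache that refetches
// whenever the cached value has expired at retrieve time.
type cache struct {
	now    func() time.Time
	fetch  func() credentials
	cached credentials
	has    bool
}

func (c *cache) retrieve() credentials {
	// Expired (or never fetched): go back to the provider.
	if !c.has || !c.now().Before(c.cached.expires) {
		c.cached = c.fetch()
		c.has = true
	}
	return c.cached
}

// simulate walks through the race: sign just before expiry, fail at the
// server, back off, then sign again for the retry.
func simulate() (signedWith, retriedWith string) {
	clock := time.Date(2026, 5, 5, 12, 0, 0, 0, time.UTC)
	n := 0
	c := &cache{
		now: func() time.Time { return clock },
		fetch: func() credentials {
			n++
			return credentials{
				token:   fmt.Sprintf("token-%d", n),
				expires: clock.Add(time.Hour),
			}
		},
	}

	first := c.retrieve()

	// The request is signed one millisecond before the credentials expire,
	// so the cache still hands out the old token.
	clock = first.expires.Add(-time.Millisecond)
	signedWith = c.retrieve().token

	// The server rejects the request as expired; the retry backs off.
	// Assumed backoff of 2s, long enough to move past the expiry.
	clock = clock.Add(2 * time.Second)
	retriedWith = c.retrieve().token
	return signedWith, retriedWith
}

func main() {
	signed, retried := simulate()
	fmt.Println("first attempt signed with:", signed)
	fmt.Println("retry signed with:", retried)
}
```

In this model the retry picks up a fresh token purely because the clock has passed the cached expiry by the time the request is signed again, which is the behavior the paragraph above relies on.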

The `aws_kms.go` retryer has the same gap and could be addressed in a
follow-up; this commit keeps the s3 fix isolated for ease of
backporting to release branches.

[v1-codes]: https://github.com/aws/aws-sdk-go/blob/v1.40.37/aws/request/retryer.go#L98-L105
[v1-isretry]: https://github.com/aws/aws-sdk-go/blob/v1.40.37/aws/request/retryer.go#L112-L118
[v1-handler]: https://github.com/aws/aws-sdk-go/blob/v1.40.37/aws/corehandlers/handlers.go#L209-L214
[v2-codes]: https://github.com/aws/aws-sdk-go-v2/blob/v1.36.3/aws/retry/standard.go#L51-L65
[v2-add]: https://github.com/aws/aws-sdk-go-v2/blob/v1.36.3/aws/retry/retry.go#L10-L23
[v2-cache]: https://github.com/aws/aws-sdk-go-v2/blob/v1.36.3/aws/credential_cache.go#L98

Release note (bug fix): A long-running BACKUP to s3 using AUTH=implicit
no longer fails with an ExpiredToken error when it races the rotation
of the underlying short-lived credentials; the s3 client now retries
ExpiredToken, ExpiredTokenException, and RequestExpired the same way
the legacy aws-sdk-go v1 client did.
@dt dt requested a review from a team as a code owner May 5, 2026 20:58
@dt dt requested review from andrew-r-thomas and removed request for a team May 5, 2026 20:58
@blathers-crl blathers-crl Bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels May 5, 2026
@blathers-crl

blathers-crl Bot commented May 5, 2026

Thanks for opening a backport.

Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.

@blathers-crl blathers-crl Bot assigned dt May 5, 2026
@trunk-io
Contributor

trunk-io Bot commented May 5, 2026

😎 Merged successfully - details.

@blathers-crl blathers-crl Bot requested a review from msbutler May 5, 2026 20:58
@blathers-crl blathers-crl Bot added backport Label PR's that are backports to older release branches T-disaster-recovery labels May 5, 2026
@blathers-crl

blathers-crl Bot commented May 5, 2026

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@trunk-io trunk-io Bot merged commit 6cbcccd into cockroachdb:release-26.2 May 6, 2026
21 of 23 checks passed

Labels

backport: Label PR's that are backports to older release branches
blathers-backport: This is a backport that Blathers created automatically.
O-robot: Originated from a bot.
T-disaster-recovery
target-release-26.2.2

3 participants