[apiserver] Add retry and timeout to apiserver V2 #3869

Status: Open. Wants to merge 4 commits into base: master.

Conversation

@kenchung285 kenchung285 commented Jul 15, 2025

Why are these changes needed?

Add retry logic and timeout to apiserver v2:

  1. The timeout is set to a long value (30 seconds) to prevent unbounded requests.
  2. The retry logic determines whether an HTTP status code is retryable.

Wrap errors with more information.

Related issue number

Closes #3606

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kenchung285 kenchung285 marked this pull request as ready for review July 17, 2025 16:57
@kenchung285

@dentiny PTAL, thanks!

Signed-off-by: Cheng-Yeh Chung <[email protected]>
@kenchung285 kenchung285 requested a review from dentiny July 18, 2025 10:04
return nil, err
}
proxy.Transport = newRetryRoundTripper(baseTransport, 3)

Make the retry count a constant, and make sure it's compatible with v1:
https://github.com/machichima/kuberay/blob/a321fbc81220c75c0d52b4ebab337810f7cd4f50/apiserver/pkg/util/config.go#L28-L37
We should at least follow up with a PR to unify these retry configs between v1 and v2.

@dentiny dentiny left a comment

I don't see any unit tests; curious how you tested it?

@kenchung285

> I don't see any unit tests; curious how you tested it?

Currently my implementation logs the error message to stdout, so I tested it manually by deploying an apiserver.

I tested with the following steps:

cd apiserver
make start-local-apiserver

# An invalid path
curl http://localhost:31888/apis/ray.io/v1/namespaces/ray-system/invalid/path

# See stdout logs of the apiserver pod
kubectl logs -n ray-system pod/kuberay-apiserver-85744c5487-8csl4

And what I got is:
[Screenshot 2025-07-18 at 8:32:43 PM]

@kenchung285

However, this may not work in the case where we wrap the error message into an error.

@kenchung285 kenchung285 changed the title from "[apiserver]: Add retry and timeout to apiserver V2" to "[apiserver] Add retry and timeout to apiserver V2" Jul 18, 2025
@kenchung285 kenchung285 requested a review from dentiny July 18, 2025 14:33
@kenchung285

@dentiny PTAL, thanks!

Signed-off-by: Cheng-Yeh Chung <[email protected]>
for attempt := 0; attempt <= rrt.retries; attempt++ {
	if attempt > 0 && req.GetBody != nil {
		var bodyCopy io.ReadCloser
		bodyCopy, err = req.GetBody()

Can we merge bodyPreserveMiddleware into this function? At least merge it into retryRoundTripper.

req.Body = bodyCopy
}

resp, err = rrt.base.RoundTrip(req)

Should we drain the old response before the next retry?


import "time"

// Compatible with apiserver V1

No need to do it in this PR, but could you please leave a TODO noting that we will merge the v1/v2 retry and timeout constants and logic into one?

sleepDuration = HTTPClientDefaultMaxBackoff
}

if deadline, ok := ctx.Deadline(); ok {

return 200 <= statusCode && statusCode < 300
}

func retryableHTTPStatusCodes(statusCode int) bool {

And it's better to use an array instead of a map for a small number of lookups.
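A sketch of the array-based lookup this comment suggests. The set of status codes is an assumption, since the PR's actual list is not shown in this conversation, and `isSuccessfulStatusCode` is a hypothetical name for the helper whose body appears in the diff.

```go
package main

import (
	"fmt"
	"net/http"
)

// retryableStatusCodes is an assumed list, not the PR's actual set.
var retryableStatusCodes = [...]int{
	http.StatusRequestTimeout,      // 408
	http.StatusTooManyRequests,     // 429
	http.StatusInternalServerError, // 500
	http.StatusBadGateway,          // 502
	http.StatusServiceUnavailable,  // 503
	http.StatusGatewayTimeout,      // 504
}

// isSuccessfulStatusCode reports whether the code is in the 2xx range.
func isSuccessfulStatusCode(statusCode int) bool {
	return 200 <= statusCode && statusCode < 300
}

// retryableHTTPStatusCodes does a linear scan; for a handful of entries
// this avoids hashing and keeps the data in one contiguous block.
func retryableHTTPStatusCodes(statusCode int) bool {
	for _, code := range retryableStatusCodes {
		if code == statusCode {
			return true
		}
	}
	return false
}

func main() {
	for _, code := range []int{200, 404, 429, 503} {
		fmt.Println(code, isSuccessfulStatusCode(code), retryableHTTPStatusCodes(code))
	}
}
```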

return
}
err = r.Body.Close()
if err != nil {

Should we include err in the error message?

retries int
}

func newRetryRoundTripper(base http.RoundTripper, retries int) http.RoundTripper {

On testing: RoundTripper is an interface, which means you can write a mock implementation to simulate errors. I think we should be able to test retry and timeout with unit tests?
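A rough sketch of the mocking approach described in this comment. The `retryRoundTripper` below is a simplified stand-in (no backoff, retries only on transport errors and 503), not the PR's implementation; `roundTripperFunc`, `flakyTransport`, and `attemptsFor` are hypothetical helpers. No network access is needed because the mock transport answers every request.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// roundTripperFunc lets a plain function act as an http.RoundTripper.
type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(req *http.Request) (*http.Response, error) {
	return f(req)
}

// retryRoundTripper is a simplified stand-in for the type in the diff.
type retryRoundTripper struct {
	base    http.RoundTripper
	retries int
}

func (rrt *retryRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	var err error
	for attempt := 0; attempt <= rrt.retries; attempt++ {
		resp, err = rrt.base.RoundTrip(req)
		if err == nil && resp.StatusCode != http.StatusServiceUnavailable {
			return resp, nil
		}
		if resp != nil && attempt < rrt.retries {
			resp.Body.Close() // release the failed attempt before retrying
		}
	}
	return resp, err
}

// flakyTransport fails `failures` times, then answers 200, counting calls.
func flakyTransport(failures int, calls *int) http.RoundTripper {
	return roundTripperFunc(func(req *http.Request) (*http.Response, error) {
		*calls++
		if *calls <= failures {
			return nil, errors.New("simulated network error")
		}
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader("ok")),
			Request:    req,
		}, nil
	})
}

// attemptsFor sends one request and reports how many calls reached the mock.
func attemptsFor(failures, retries int) (int, error) {
	calls := 0
	client := &http.Client{Transport: &retryRoundTripper{
		base:    flakyTransport(failures, &calls),
		retries: retries,
	}}
	resp, err := client.Get("http://example.invalid/")
	if err == nil {
		resp.Body.Close()
	}
	return calls, err
}

func main() {
	fmt.Println(attemptsFor(2, 3)) // recovers within the retry budget
	fmt.Println(attemptsFor(5, 3)) // exhausts the retry budget
}
```

In a real test file the same helpers would be driven from a `testing.T` function, with assertions on the call count and the returned error.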

Development

Successfully merging this pull request may close these issues.

[Feature] [apiserver] Add timeout and retry for apiserver v2
3 participants