[Autoscaler][V2] Check IM instance_status before terminating nodes #50707
Conversation
Signed-off-by: Ryan O'Leary <[email protected]>
Closes #50868
cc: @kevin85421 @rueian
Co-authored-by: Rueian <[email protected]> Signed-off-by: ryanaoleary <[email protected]>
…tion Signed-off-by: Ryan O'Leary <[email protected]>
Would you mind providing more context in the PR description about which case this PR fixes (a simple repro program is helpful)? I actually don't understand what this PR is trying to solve.
Signed-off-by: Ryan O'Leary <[email protected]>
I have a reproduction script for the issue in #50868. I'll update the PR description with more context and a link to that issue.
Co-authored-by: Kai-Hsun Chen <[email protected]> Signed-off-by: ryanaoleary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
…tional in proto Signed-off-by: Ryan O'Leary <[email protected]>
Sorry for the multiple review iterations on this PR; it took more rounds than expected. I believe the main issue is that the existing API (e.g., the termination request / scheduling node) does not clearly indicate which fields are optional and when they should be provided. This is something we need to improve in the existing codebase.
Would you mind adding some tests? I know you have a PR for end-to-end tests in the KubeRay repo, but would you mind adding a non-KubeRay test as well?
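For illustration, a non-KubeRay unit test for the new behavior might look like the following sketch. The status constants and the `next_status_for_scale_down` helper are local placeholders mirroring the PR's intended logic, not the real instance-manager utilities a test in the Ray repo would import.

```python
# Hypothetical test sketch: `next_status_for_scale_down` is a local
# placeholder for the PR's intended logic, not the real autoscaler API.

ALLOCATED = "ALLOCATED"
RAY_RUNNING = "RAY_RUNNING"
RAY_STOP_REQUESTED = "RAY_STOP_REQUESTED"
TERMINATING = "TERMINATING"


def next_status_for_scale_down(im_instance_status: str) -> str:
    # Mirrors the intended fix: instances where Ray never started are
    # terminated directly instead of being asked to stop Ray first.
    return TERMINATING if im_instance_status == ALLOCATED else RAY_STOP_REQUESTED


def test_allocated_instance_terminates_directly():
    # ALLOCATED -> RAY_STOP_REQUESTED is an invalid transition, so the
    # autoscaler should move the instance straight to TERMINATING.
    assert next_status_for_scale_down(ALLOCATED) == TERMINATING


def test_running_instance_requests_ray_stop():
    # Instances that already started Ray go through the normal stop flow.
    assert next_status_for_scale_down(RAY_RUNNING) == RAY_STOP_REQUESTED
```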
Why are these changes needed?
Currently, when the autoscaler attempts to scale down Ray nodes due to `max number of worker nodes per type reached`, it's possible for the autoscaler to error and fail to reconcile the nodes with `Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED`. This occurs when cloud IM instances (for KubeRay, these would be Pods) have not yet been able to start Ray but are still selected for scale-down due to the max-nodes-per-type constraint. This PR adds a check of `im_instance_status` when terminating nodes, transitioning directly from `ALLOCATED` to `TERMINATING` in the case where Ray has not yet started on the node but the autoscaler needs to scale it down. The linked issue contains a reproduction script and more context for why this change is necessary.
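A minimal sketch of the intended check (the string constants and the helper name below are illustrative stand-ins, not the V2 autoscaler's actual instance-manager API):

```python
# Illustrative sketch only: these constants and this helper are hypothetical
# stand-ins for the V2 autoscaler's instance-manager status handling.

ALLOCATED = "ALLOCATED"
RAY_STOP_REQUESTED = "RAY_STOP_REQUESTED"
TERMINATING = "TERMINATING"


def next_status_for_scale_down(im_instance_status: str) -> str:
    """Choose the termination path for an instance selected for scale-down.

    If Ray never started on the instance (it is still ALLOCATED), requesting
    a Ray stop would be an invalid status transition, so terminate the cloud
    instance directly; otherwise go through the normal Ray-stop flow.
    """
    if im_instance_status == ALLOCATED:
        return TERMINATING
    return RAY_STOP_REQUESTED
```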
Related issue number

Closes #50868
Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.