
[Autoscaler][V2] Check IM instance_status before terminating nodes #50707

Open

wants to merge 12 commits into master

Conversation

@ryanaoleary (Contributor) commented Feb 19, 2025

Why are these changes needed?

Currently, when the autoscaler attempts to scale down Ray nodes because the maximum number of worker nodes per type has been reached, it can error and fail to reconcile the nodes with `Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED`. This occurs when cloud IM instances (for KubeRay, these are Pods) have not yet been able to start Ray but are still selected for scale-down due to the max-nodes-per-type constraint. This PR adds a check of `im_instance_status` when terminating nodes, transitioning directly from `ALLOCATED` to `TERMINATING` when Ray has not yet started on a node that the autoscaler needs to scale down. The linked issue contains a reproduction script and more context on why this change is necessary.
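To make the intended behavior concrete, here is a minimal sketch of the check, using hypothetical names (`IMInstanceStatus`, `next_status_for_scale_down`) rather than the actual identifiers in `ray/autoscaler/v2`:

```python
# A minimal sketch of the transition logic this PR describes. The enum and
# helper names are illustrative stand-ins, not Ray's actual identifiers.
from enum import Enum


class IMInstanceStatus(Enum):
    ALLOCATED = "ALLOCATED"                    # cloud instance exists; Ray not started yet
    RAY_RUNNING = "RAY_RUNNING"                # Ray is running on the instance
    RAY_STOP_REQUESTED = "RAY_STOP_REQUESTED"  # drain Ray before terminating
    TERMINATING = "TERMINATING"                # terminate the cloud instance


def next_status_for_scale_down(im_status: IMInstanceStatus) -> IMInstanceStatus:
    """Choose the next IM status when the autoscaler must remove a node.

    ALLOCATED -> RAY_STOP_REQUESTED is an invalid transition because there is
    no Ray process to stop yet, so such instances go straight to TERMINATING.
    """
    if im_status == IMInstanceStatus.ALLOCATED:
        return IMInstanceStatus.TERMINATING
    return IMInstanceStatus.RAY_STOP_REQUESTED
```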

Related issue number

Closes #50868

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary ryanaoleary marked this pull request as ready for review February 24, 2025 22:54
@ryanaoleary ryanaoleary requested review from hongchaodeng and a team as code owners February 24, 2025 22:54
@ryanaoleary (Contributor, Author)

Closes #50868

@ryanaoleary (Contributor, Author)

cc: @kevin85421 @rueian

@ryanaoleary ryanaoleary requested a review from rueian March 10, 2025 21:48
@kevin85421 (Member) commented Mar 11, 2025

Would you mind providing more context in the PR description about which case this PR fixes (a simple repro program would be helpful)? I don't understand what this PR is trying to solve.

@ryanaoleary (Contributor, Author)

> Would you mind providing more context in the PR description about which case this PR fixes (a simple repro program would be helpful)? I don't understand what this PR is trying to solve.

I have a reproduction script for the issue in #50868. I'll update the PR description with more context and a link to that issue.

@ryanaoleary ryanaoleary requested a review from kevin85421 March 11, 2025 18:41
@ryanaoleary ryanaoleary requested a review from kevin85421 March 14, 2025 16:49
@kevin85421 kevin85421 self-assigned this Mar 18, 2025
@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Mar 18, 2025
@kevin85421 (Member)

Sorry for the multiple review iterations on this PR; they exceeded our expectations. I believe the main issue is that the existing API (e.g., the termination request and node scheduling APIs) does not clearly indicate which fields are optional and when they should be provided. This is something we need to improve in the existing codebase.

@kevin85421 (Member) left a comment

Would you mind adding some tests? I know you have a PR for end-to-end tests in the KubeRay repo, but would you mind adding a non-KubeRay test as well?
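As a rough illustration of the kind of non-KubeRay unit test being requested, here is a pytest-style sketch exercising the transition logic from the hypothetical helper in the PR description above (assumed importable here); a real test would target the autoscaler v2 instance manager internals instead:

```python
# Pytest-style sketch; next_status_for_scale_down and IMInstanceStatus are
# the illustrative names from the sketch in the PR description, not Ray's
# actual test utilities or internals.
def test_allocated_instance_terminates_directly():
    # Ray never started on the instance, so skip RAY_STOP_REQUESTED entirely.
    assert (
        next_status_for_scale_down(IMInstanceStatus.ALLOCATED)
        == IMInstanceStatus.TERMINATING
    )


def test_running_instance_is_drained_first():
    # Ray is running, so request a Ray stop before terminating.
    assert (
        next_status_for_scale_down(IMInstanceStatus.RAY_RUNNING)
        == IMInstanceStatus.RAY_STOP_REQUESTED
    )
```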

Labels: go add ONLY when ready to merge, run all tests

Successfully merging this pull request may close these issues:

[Autoscaler][V2] Updating max replicas while Pods are pending causes v2 autoscaler to hang (#50868)