
[Autoscaler][V2] Check IM instance_status before terminating nodes #50707

Open

wants to merge 12 commits into master

Conversation

@ryanaoleary (Contributor) commented Feb 19, 2025

Why are these changes needed?

Currently, when the autoscaler attempts to scale down Ray nodes because the maximum number of worker nodes per type has been reached, it can error and fail to reconcile the nodes with `Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED`. This occurs when cloud IM instances (for KubeRay, these are Pods) have not yet been able to start Ray but are still selected for scale-down due to the max-nodes-per-type constraint. This PR adds a check of `im_instance_status` when terminating nodes, transitioning directly from `ALLOCATED` to `TERMINATING` when Ray has not yet started on a node that the autoscaler needs to scale down. The linked issue contains a reproduction script and more context on why this change is necessary.
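To make the intended behavior concrete, here is a minimal sketch of the check, using hypothetical names (`IMInstanceStatus`, `next_status_for_scale_down`) rather than the actual identifiers in `ray/autoscaler/v2`:

```python
# A minimal sketch of the transition logic this PR describes. The enum and
# helper names are illustrative stand-ins, not Ray's actual identifiers.
from enum import Enum


class IMInstanceStatus(Enum):
    ALLOCATED = "ALLOCATED"                    # cloud instance exists; Ray not started yet
    RAY_RUNNING = "RAY_RUNNING"                # Ray is running on the instance
    RAY_STOP_REQUESTED = "RAY_STOP_REQUESTED"  # drain Ray before terminating
    TERMINATING = "TERMINATING"                # terminate the cloud instance


def next_status_for_scale_down(im_status: IMInstanceStatus) -> IMInstanceStatus:
    """Choose the next IM status when the autoscaler must remove a node.

    ALLOCATED -> RAY_STOP_REQUESTED is an invalid transition because there is
    no Ray process to stop yet, so such instances go straight to TERMINATING.
    """
    if im_status == IMInstanceStatus.ALLOCATED:
        return IMInstanceStatus.TERMINATING
    return IMInstanceStatus.RAY_STOP_REQUESTED
```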

Related issue number

Closes #50868

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary ryanaoleary marked this pull request as ready for review February 24, 2025 22:54
@ryanaoleary ryanaoleary requested review from hongchaodeng and a team as code owners February 24, 2025 22:54
@ryanaoleary (Contributor, Author)

Closes #50868

@ryanaoleary (Contributor, Author)

cc: @kevin85421 @rueian

@ryanaoleary ryanaoleary requested a review from rueian March 10, 2025 21:48
@kevin85421 (Member) commented Mar 11, 2025

Would you mind providing more context in the PR description about which case this PR fixes (a simple repro program would be helpful)? I don't understand what this PR is trying to solve.

@ryanaoleary (Contributor, Author)

> Would you mind providing more context in the PR description about which case this PR fixes (a simple repro program would be helpful)? I don't understand what this PR is trying to solve.

I have a reproduction script for the issue in #50868. I'll update the PR description with more context and a link to that issue.

@ryanaoleary ryanaoleary requested a review from kevin85421 March 11, 2025 18:41
@ryanaoleary ryanaoleary requested a review from kevin85421 March 14, 2025 16:49
@kevin85421 kevin85421 self-assigned this Mar 18, 2025
@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Mar 18, 2025
@kevin85421 (Member)

Sorry for the multiple review iterations on this PR; they exceeded our expectations. I believe the main issue is that the existing API (e.g., the termination request and node scheduling APIs) does not clearly indicate which fields are optional and when they should be provided. This is something we need to improve in the existing codebase.

@kevin85421 (Member) left a comment

Would you mind adding some tests? I know you have a PR for end-to-end tests in the KubeRay repo, but would you mind adding a non-KubeRay test as well?
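As a rough illustration of the kind of non-KubeRay unit test being requested, here is a pytest-style sketch exercising the transition logic from the hypothetical helper in the PR description above (assumed importable here); a real test would target the autoscaler v2 instance manager internals instead:

```python
# Pytest-style sketch; next_status_for_scale_down and IMInstanceStatus are
# the illustrative names from the sketch in the PR description, not Ray's
# actual test utilities or internals.
def test_allocated_instance_terminates_directly():
    # Ray never started on the instance, so skip RAY_STOP_REQUESTED entirely.
    assert (
        next_status_for_scale_down(IMInstanceStatus.ALLOCATED)
        == IMInstanceStatus.TERMINATING
    )


def test_running_instance_is_drained_first():
    # Ray is running, so request a Ray stop before terminating.
    assert (
        next_status_for_scale_down(IMInstanceStatus.RAY_RUNNING)
        == IMInstanceStatus.RAY_STOP_REQUESTED
    )
```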

Labels: go add ONLY when ready to merge, run all tests

Successfully merging this pull request may close these issues:

[Autoscaler][V2] Updating max replicas while Pods are pending causes v2 autoscaler to hang (#50868)