@ryanaoleary
Collaborator

**This PR is a cherry-pick of the PR described below onto the v1.5 release branch.**

Why are these changes needed?

This PR enhances the RayMultiHostIndexing feature, implemented in #3998, which adds replica id and host-index labels to Pods created by KubeRay.

To support the multi-host use case for TPUs and GPUs, we already set a replica id label (a unique string based on the group name) and a host-index label (a unique int from 0 to NumOfHosts-1) when this feature is enabled and NumOfHosts > 1. These labels help users in two ways: they improve observability when running SPMD workloads, and they enable atomic creation, deletion, and re-creation of multi-host groups.

However, these labels do not support multi-slice workloads, where it is important to know the ordered index of each replica within the multi-slice set. Frameworks like JAX require the slice ID, an int from 0 to the number of slices minus 1, to be set in order to configure multi-slice workloads (source).

To solve this issue, this PR adds a label for the ordered replica index, an int from 0 to replicas-1 within each worker group, when this feature is enabled. This label greatly simplifies setting environment variables like MEGASCALE_SLICE_ID for multi-slice workloads that use JAX. The KubeRay TPU webhook can then read these KubeRay labels when injecting environment variables. Until the TPU webhook change lands, the labels are still useful because users can pass them into the Pod environment using the Kubernetes Downward API.
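
As a rough illustration of how a consumer such as a webhook might turn the ordered replica index label into a slice ID, here is a minimal Go sketch. It is not the code from this PR: the label key `example.com/replica-index` and the helper `sliceIDEnv` are hypothetical names chosen for the example, and the real key set by KubeRay may differ. The sketch only assumes what is stated above, namely that the label holds an int from 0 to replicas-1.

```go
package main

import (
	"fmt"
	"strconv"
)

// replicaIndexLabel is a hypothetical label key used only for illustration;
// the actual key set by KubeRay may differ.
const replicaIndexLabel = "example.com/replica-index"

// sliceIDEnv derives a value suitable for MEGASCALE_SLICE_ID from a Pod's
// labels. It validates that the replica index is an int in [0, replicas).
func sliceIDEnv(labels map[string]string, replicas int) (string, error) {
	raw, ok := labels[replicaIndexLabel]
	if !ok {
		return "", fmt.Errorf("label %q not set", replicaIndexLabel)
	}
	idx, err := strconv.Atoi(raw)
	if err != nil {
		return "", fmt.Errorf("label %q is not an int: %w", replicaIndexLabel, err)
	}
	if idx < 0 || idx >= replicas {
		return "", fmt.Errorf("replica index %d out of range [0, %d)", idx, replicas)
	}
	return strconv.Itoa(idx), nil
}

func main() {
	// Labels as they might appear on a worker Pod in the third replica
	// of a worker group with four replicas.
	labels := map[string]string{replicaIndexLabel: "2"}
	id, err := sliceIDEnv(labels, 4)
	if err != nil {
		panic(err)
	}
	fmt.Printf("MEGASCALE_SLICE_ID=%s\n", id)
}
```

Validating the range up front means a mislabeled Pod fails fast instead of joining the multi-slice job with a slice ID that collides with or falls outside the expected set.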

Related issue number

#3902

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

…i-slice (ray-project#4163)

* [Feature Enhancement] Set ordered replica index label to support multi-slice

Signed-off-by: Ryan O'Leary <[email protected]>

* rename replica-id -> replica-name

Signed-off-by: Ryan O'Leary <[email protected]>

* Separate replica index feature gate logic

Signed-off-by: Ryan O'Leary <[email protected]>

* remove index arg in createWorkerPod

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Collaborator

@rueian rueian left a comment

Thanks!

@andrewsykim andrewsykim merged commit 27117ef into ray-project:release-1.5 Nov 3, 2025
27 checks passed