[Feature Enhancement] Set ordered replica index label to support multi-slice #4163
cc: @andrewsykim this is the change we discussed offline
…i-slice Signed-off-by: Ryan O'Leary <[email protected]>
1e52764 to 9a8a405
Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary if we want to include this in v1.5, we need to ensure that the code path is unchanged when the feature gate is disabled. It seems we are modifying some code paths even when the feature gate is disabled. Could we guard all the changes with the feature gate by end of day tomorrow?
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Should be fixed with d7c48a2 and 784993b, all the new logic is behind a feature gate and separated into a different code path than the regular |
Reviewed and confirmed that the behavior is unchanged when the feature gate is disabled. Given this has a feature gate, I think it's OK to include in v1.5. I will also do some manual testing after the next v1.5 release candidate.
…i-slice (ray-project#4163)
* [Feature Enhancement] Set ordered replica index label to support multi-slice
  Signed-off-by: Ryan O'Leary <[email protected]>
* rename replica-id -> replica-name
  Signed-off-by: Ryan O'Leary <[email protected]>
* Separate replica index feature gate logic
  Signed-off-by: Ryan O'Leary <[email protected]>
* remove index arg in createWorkerPod
  Signed-off-by: Ryan O'Leary <[email protected]>
---------
Signed-off-by: Ryan O'Leary <[email protected]>

…i-slice (#4163) (#4171)
* [Feature Enhancement] Set ordered replica index label to support multi-slice
* rename replica-id -> replica-name
* Separate replica index feature gate logic
* remove index arg in createWorkerPod
---------
Signed-off-by: Ryan O'Leary <[email protected]>
Why are these changes needed?
This is a feature enhancement to the RayMultiHostIndexing feature, implemented in #3998, which adds replica ID and host-index labels to Pods created by KubeRay.

To support the multi-host use case for TPUs and GPUs, we already set a replica ID label (a unique string based on the group name) and a host-index label (a unique int from 0 to NumOfHosts-1) when this feature is enabled and NumOfHosts > 1. These labels add value to users through observability when running SPMD workloads, and through atomic creation and deletion/re-creation of multi-host groups.

However, these labels do not support the use case for multi-slice workloads, where it is important to know the ordered index of the replica within the multi-slice set. Frameworks like JAX require the slice ID, which is an int between 0 and the # of slices, to be set to configure multi-slice workloads (source).

To solve this issue, this PR adds a label for the ordered replica index, an int value between 0 and replicas-1 for each worker group, when this feature is enabled. This label will greatly simplify the process of setting environment variables like MEGASCALE_SLICE_ID for multi-slice workloads that use JAX. We can then check these KubeRay labels from the KubeRay TPU webhook when injecting environment vars. Until the TPU webhook change lands, these labels are still useful because users can pass them to the Pod environment using the downward API.

Related issue number
#3902
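
For readers following the description under "Why are these changes needed?", here is a rough Python sketch of the indexing scheme it describes: one ordered replica index per replica (0 to replicas-1) and one host index per Pod within a replica (0 to NumOfHosts-1). It is not the KubeRay implementation. The label keys and both function names are hypothetical placeholders, and the mapping of the replica index directly onto MEGASCALE_SLICE_ID is an assumption made for illustration.

```python
from typing import Dict, List

# Hypothetical label keys for illustration only; the keys KubeRay actually
# sets may be named differently.
REPLICA_INDEX_LABEL = "example.com/replica-index"
HOST_INDEX_LABEL = "example.com/host-index"


def index_labels(replicas: int, num_of_hosts: int) -> List[Dict[str, str]]:
    """Return one label dict per Pod in a worker group.

    The replica index runs from 0 to replicas-1 and the host index runs
    from 0 to num_of_hosts-1 within each replica.
    """
    pods = []
    for replica_index in range(replicas):
        for host_index in range(num_of_hosts):
            pods.append(
                {
                    # Kubernetes label values are strings.
                    REPLICA_INDEX_LABEL: str(replica_index),
                    HOST_INDEX_LABEL: str(host_index),
                }
            )
    return pods


def slice_env(pod_labels: Dict[str, str]) -> Dict[str, str]:
    """Derive a slice ID environment variable from a Pod's labels.

    Assumption: the slice ID is taken to be the ordered replica index.
    """
    return {"MEGASCALE_SLICE_ID": pod_labels[REPLICA_INDEX_LABEL]}


if __name__ == "__main__":
    # Two slices (replicas) of four hosts each.
    for labels in index_labels(replicas=2, num_of_hosts=4):
        print(labels, slice_env(labels))
```

In this sketch every Pod in the same replica ends up with the same slice ID, while the host index distinguishes Pods within that replica.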
Checks