[Bugfix] Missing NIXL metadata for handshake initialization if instance spans multi-node #26338
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
will look into it more in depth asap, in the meantime tagging other people that may be interested in reviewing @wseaton @markmc @njhill
PS, to summarize: the listener thread is moved from the worker to the scheduler. The scheduler aggregates metadata from all workers, and workers carry out the handshake by fetching that data from the scheduler over a single port.
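To make that flow concrete, here is a minimal, hypothetical sketch of the pattern (class and method names are illustrative, not the PR's actual code, and a plain HTTP server stands in for whatever transport the connector really uses):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetadataAggregator:
    """Scheduler-side: collect per-TP-rank handshake metadata, serve it on one port."""

    def __init__(self, tp_size: int):
        self.tp_size = tp_size
        self._metadata: dict[int, dict] = {}
        self._lock = threading.Lock()

    def add_worker_metadata(self, tp_rank: int, metadata: dict) -> None:
        # Each TP worker reports its metadata once (e.g. over the existing RPC path).
        with self._lock:
            self._metadata[tp_rank] = metadata

    def ready(self) -> bool:
        with self._lock:
            return len(self._metadata) == self.tp_size

    def serve(self, port: int) -> HTTPServer:
        # Remote workers fetch the aggregated blob from this single endpoint
        # instead of handshaking with every worker individually.
        aggregator = self

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self):
                body = json.dumps(aggregator._metadata).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)

        server = HTTPServer(("0.0.0.0", port), Handler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        return server
```

A remote worker would then issue a single GET against the scheduler's address and look up its peer's entry by TP rank, rather than needing one listener port per worker.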
This does seem like a good way to get in the collective RPC and scheduler infra changes that #22274 needs, so I am supportive 🙂
Hey @GuanLuo thanks for the patience and the great work with this PR!
Left some comments; I think the main concern to address is the DP handling. TL;DR: AFAIK we're not aggregating across DP ranks, so I don't think we should add that [dp_rank] dimension in the metadata.
Nice one @GuanLuo , I think it looks much cleaner now!
Only left one comment about device_id.
Other than that this LGTM.
Would be nice to get an ack from @wseaton or anyone interested in this #26338: we basically "squashed" the TP dimension in this PR by grouping across TP workers, but we retain DP separation since we don't aggregate across DP ranks (therefore we cleaned up all the dp references we had).
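For reviewers skimming, the resulting payload shape could look roughly like this (field names are illustrative guesses, not the PR's actual dataclass):

```python
from dataclasses import dataclass


@dataclass
class WorkerHandshakeMetadata:
    # Illustrative fields only; the real connector metadata differs.
    agent_metadata: bytes
    kv_caches_base_addr: list[int]
    num_blocks: int
    device_id: int


# One entry per TP rank of this engine. There is deliberately no dp_rank key:
# each DP rank runs its own scheduler and serves its own mapping, so DP
# separation is preserved without an extra dimension in the payload.
aggregated: dict[int, WorkerHandshakeMetadata] = {}
```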
Documentation preview: https://vllm--26338.org.readthedocs.build/en/26338/
LGTM, thanks for the hard work @GuanLuo !
Let's leave a small window for other reviewers to chime in today, o/w we merge when CI is green again (unrelated failures should be fixed on main first).
# Need to make sure the device ID is non-negative for NIXL,
# Torch uses -1 to indicate CPU tensors while NIXL uses explicit
# memory type.
self.device_id = max(cache.get_device(), 0)
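A quick illustration of the behaviour that comment refers to (assuming a reasonably recent PyTorch, where Tensor.get_device() returns -1 for CPU tensors; the variable names here are made up):

```python
import torch

cpu_cache = torch.zeros(4)
print(cpu_cache.get_device())          # -1: torch's sentinel for CPU tensors

if torch.cuda.is_available():
    gpu_cache = torch.zeros(4, device="cuda:0")
    print(gpu_cache.get_device())      # 0: a real device index NIXL can use

device_id = max(cpu_cache.get_device(), 0)   # clamps the CPU sentinel to 0
```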
Out of curiosity, what happens if all caches were indeed CPU tensors (not a use case we have, but good to highlight)?
# Get descs ids.
local_block_descs_ids: np.ndarray
cruft
SGTM, this shouldn't impact changing handshake strategy in the future.
@GuanLuo good to merge after the conflict is resolved
@NickLucche Resolved, kicked off CI |
Purpose
Fix the NIXL handshake issue when a model instance spans multiple nodes due to its parallelism strategy (e.g. TP=16 running on 2 × H100x8 nodes); see #25981 for details.
Test Plan
- test_nixl_connector.py for unit testing: test_kv_transfer_metadata verifies that NIXLConnectorWorker properly returns its handshake metadata and retrieves the target metadata, and that NIXLConnectorScheduler can serve the collective metadata.
- nixl_integration/** for integration testing: this covers whether EngineCore properly gathers and serves the handshake metadata.
Test Result