[P/D] [NixlConnector] zmq handshake abstraction #26527
Code Review
This pull request refactors the NIXL connector by abstracting the ZMQ handshake logic into a separate `HandshakeStrategy`. This is a good architectural improvement that enhances modularity and prepares for supporting other communication protocols. The changes are well-structured, with new files for lazy imports and the handshake strategy itself. However, I've identified a critical issue where a new method will cause an `AttributeError` due to accessing a non-existent attribute. I've also noted a potential performance issue in the new handshake logic that appears to perform an unnecessary network operation.
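For readers skimming the thread, here is a minimal sketch of what a handshake-strategy abstraction of this kind could look like. Only the name `HandshakeStrategy` comes from the PR; the method `initiate_handshake`, the `AgentMetadata` payload, the in-memory stand-in, and the `Connector` wrapper are all hypothetical and chosen purely for illustration. A real strategy would talk over ZMQ instead of a dict.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentMetadata:
    """Hypothetical payload returned by a handshake (not the PR's type)."""
    agent_name: str
    tp_size: int


class HandshakeStrategy(ABC):
    """Transport-agnostic handshake interface; method name is an assumption."""

    @abstractmethod
    def initiate_handshake(self, host: str, port: int, remote_rank: int) -> AgentMetadata:
        """Contact the remote rank and return its metadata."""


class InMemoryHandshakeStrategy(HandshakeStrategy):
    """Stand-in for a ZMQ-backed strategy: resolves peers from a dict."""

    def __init__(self, peers: dict) -> None:
        # Keys are (host, port, remote_rank) tuples.
        self._peers = peers

    def initiate_handshake(self, host: str, port: int, remote_rank: int) -> AgentMetadata:
        try:
            return self._peers[(host, port, remote_rank)]
        except KeyError:
            raise ConnectionError(
                f"no peer at {host}:{port} rank {remote_rank}"
            ) from None


class Connector:
    """Hypothetical connector that depends only on the strategy interface."""

    def __init__(self, strategy: HandshakeStrategy) -> None:
        self._strategy = strategy

    def connect(self, host: str, port: int, remote_rank: int) -> str:
        # The connector never sees the transport, only the metadata.
        return self._strategy.initiate_handshake(host, port, remote_rank).agent_name
```

Because the connector holds only a `HandshakeStrategy`, swapping the transport would mean passing a different subclass, which is the modularity benefit the review summary refers to.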
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector/__init__.py
# Handshake with remote agent-rank0 first to get the tp_size of remote
path = make_zmq_path("tcp", host, port)
logger.debug("Querying master rank metadata on path: %s", path)
metadata, agent_name_0 = handshake(path, 0)
The logic here initiates a handshake with the remote rank 0, but the agent for rank 0 appears to be unused if the target remote rank (`p_remote_rank`) is not 0. The `_read_blocks` method only uses the agent corresponding to `p_remote_rank`. This initial handshake with rank 0 seems to introduce an unnecessary network round trip and agent registration. The previous implementation only connected to the required `p_remote_rank`. Could you clarify the need for this initial handshake with rank 0 or remove it if it's redundant? The comment "to get the tp_size of remote" is also misleading as `remote_tp_size` is an argument.
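To make the round-trip concern concrete, here is a rough sketch comparing the two connection patterns. It assumes a `handshake(path, rank)` callable returning `(metadata, agent_name)`, matching the shape of the call in the snippet above; the helper names `connect_via_rank0`, `connect_target_only`, and `make_counting_handshake` are hypothetical, and the fake handshake only records which ranks were contacted.

```python
from typing import Callable

# Assumed shape of the handshake callable: (path, rank) -> (metadata, agent_name).
Handshake = Callable[[str, int], tuple]


def connect_via_rank0(handshake: Handshake, path: str, p_remote_rank: int) -> dict:
    """Pattern questioned in the review: always contact rank 0 first."""
    agents = {}
    _metadata, agents[0] = handshake(path, 0)
    if p_remote_rank != 0:
        # Second round trip for the rank that is actually read from.
        _metadata, agents[p_remote_rank] = handshake(path, p_remote_rank)
    return agents


def connect_target_only(handshake: Handshake, path: str, p_remote_rank: int) -> dict:
    """Pattern described as the previous behavior: contact only the target rank."""
    _metadata, agent_name = handshake(path, p_remote_rank)
    return {p_remote_rank: agent_name}


def make_counting_handshake():
    """Build a fake handshake that logs every rank it is asked to contact."""
    calls = []

    def handshake(path: str, rank: int):
        calls.append(rank)
        return {"rank": rank}, f"agent-{rank}"

    return handshake, calls
```

With a target rank of 2, the first pattern contacts ranks 0 and 2 and registers two agents, while the second contacts only rank 2. Whether the extra rank-0 exchange is worth it depends on whether anything downstream actually needs rank 0's metadata.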
cc @NickLucche, I think we were saying the handshake through rank 0 was desirable in certain cases?
Ended up changing to align with current behavior for now
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector/handshake.py
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector/__init__.py
@codex review
9790a03 to fc48b3f
Need to align the interface with #26338, since they interact
Signed-off-by: Will Eaton <[email protected]>
5c3e7f5 to 3251cdd
This pull request has merge conflicts that must be resolved before it can be
Just the ZMQ and base class changes split from: #22274