Skip to content

Conversation

wseaton
Copy link
Contributor

@wseaton wseaton commented Oct 9, 2025

Just the ZMQ and base class changes split from: #22274

@wseaton wseaton changed the title [P/D] [NixlConnector] zmq handshake abstraction [P/D] [NixlConnector] draft: zmq handshake abstraction Oct 9, 2025
@mergify mergify bot added the kv-connector label Oct 9, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the NIXL connector by abstracting the ZMQ handshake logic into a separate HandshakeStrategy. This is a good architectural improvement that enhances modularity and prepares for supporting other communication protocols. The changes are well-structured, with new files for lazy imports and the handshake strategy itself. However, I've identified a critical issue where a new method will cause an AttributeError due to accessing a non-existent attribute. I've also noted a potential performance issue in the new handshake logic that appears to perform an unnecessary network operation.

Comment on lines +144 to +141
# Handshake with remote agent-rank0 first to get the tp_size of remote
path = make_zmq_path("tcp", host, port)
logger.debug("Querying master rank metadata on path: %s", path)
metadata, agent_name_0 = handshake(path, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The logic here initiates a handshake with the remote rank 0, but the agent for rank 0 appears to be unused if the target remote rank (p_remote_rank) is not 0. The _read_blocks method only uses the agent corresponding to p_remote_rank. This initial handshake with rank 0 seems to introduce an unnecessary network round trip and agent registration. The previous implementation only connected to the required p_remote_rank. Could you clarify the need for this initial handshake with rank 0 or remove it if it's redundant? The comment "to get the tp_size of remote" is also misleading as remote_tp_size is an argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @NickLucche, I think we were saying the handshake through rank 0 was desirable in certain cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up changing to align with current behavior for now

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

@wseaton wseaton changed the title [P/D] [NixlConnector] draft: zmq handshake abstraction [P/D] [NixlConnector] zmq handshake abstraction Oct 9, 2025
@wseaton
Copy link
Contributor Author

wseaton commented Oct 9, 2025

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

@wseaton wseaton force-pushed the handshake-abstraction branch from 9790a03 to fc48b3f Compare October 10, 2025 17:56
@wseaton
Copy link
Contributor Author

wseaton commented Oct 13, 2025

Need to align the interface with #26338, since they interact

@wseaton wseaton force-pushed the handshake-abstraction branch from 5c3e7f5 to 3251cdd Compare October 13, 2025 15:56
Copy link

mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wseaton.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant