[Tool] Add network topology and status reporter#94

Merged
yubofredwang merged 8 commits into
lightseekorg:mainfrom
ppraneth:tool
May 12, 2026

Contributor

@ppraneth ppraneth commented May 10, 2026

Add check_network_topology.py — network topology and status reporter

Closes #26

This script does three things:

  1. Lists all RDMA devices on each node (device name, port, state, link rate, link layer, phys state) — reads from /sys/class/infiniband/ with ibv_devinfo as fallback
  2. Lists UP network interfaces and recommends which one to use for NCCL_SOCKET_IFNAME — identifies which interfaces are RDMA-backed
  3. Runs pairwise TCP connectivity probes across all nodes in a Ray cluster and prints a latency matrix
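A minimal sketch of the sysfs read in step 1, not the code from this PR. It assumes the standard Linux layout `/sys/class/infiniband/<device>/ports/<port>/` with the attribute files `state`, `rate`, `link_layer` and `phys_state`; the function names and the `root` parameter are hypothetical and exist only so the walk can be pointed at a test directory. The `ibv_devinfo` fallback is left out.

```python
from pathlib import Path


def _read_attr(path):
    """Return the stripped file contents, or 'unknown' if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return "unknown"


def list_rdma_devices(root="/sys/class/infiniband"):
    """Collect per-port attributes for every RDMA device found under `root`."""
    base = Path(root)
    if not base.is_dir():
        return []  # no RDMA devices exposed on this host
    rows = []
    for dev in sorted(base.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            rows.append({
                "device": dev.name,
                "port": port.name,
                "state": _read_attr(port / "state"),
                "rate": _read_attr(port / "rate"),
                "link_layer": _read_attr(port / "link_layer"),
                "phys_state": _read_attr(port / "phys_state"),
            })
    return rows


if __name__ == "__main__":
    for row in list_rdma_devices():
        print(row)
```

On a host without RDMA hardware the function returns an empty list, which matches the "no RDMA devices" result reported for the RunPod test below.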

What was tested:

Ran local mode on a RunPod RTX PRO 4500 pod (PyTorch image). Script runs cleanly, correctly finds no RDMA devices (expected on community GPU pods), correctly identifies eth0 as the active interface and recommends it for NCCL_SOCKET_IFNAME.
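For illustration, one way the interface recommendation could work. The PR text does not show the selection logic, so the policy here is an assumption: prefer an UP interface that is RDMA-backed, otherwise fall back to the first UP non-loopback interface. It reads `operstate` under `/sys/class/net/` and treats any netdev listed under `/sys/class/infiniband/<device>/device/net/` as RDMA-backed. All function names are hypothetical.

```python
from pathlib import Path


def up_interfaces(net_root="/sys/class/net"):
    """Names of non-loopback interfaces whose operstate is 'up'."""
    base = Path(net_root)
    if not base.is_dir():
        return []
    names = []
    for iface in sorted(base.iterdir()):
        if iface.name == "lo":
            continue
        try:
            state = (iface / "operstate").read_text().strip()
        except OSError:
            continue
        if state == "up":
            names.append(iface.name)
    return names


def rdma_backed_interfaces(ib_root="/sys/class/infiniband"):
    """Netdev names listed under each RDMA device's device/net directory."""
    base = Path(ib_root)
    backed = set()
    if not base.is_dir():
        return backed
    for dev in base.iterdir():
        net_dir = dev / "device" / "net"
        if net_dir.is_dir():
            backed.update(p.name for p in net_dir.iterdir())
    return backed


def recommend_ifname(net_root="/sys/class/net", ib_root="/sys/class/infiniband"):
    """Assumed policy: first RDMA-backed UP interface, else first UP interface."""
    up = up_interfaces(net_root)
    backed = rdma_backed_interfaces(ib_root)
    for name in up:
        if name in backed:
            return name
    return up[0] if up else None


if __name__ == "__main__":
    print("NCCL_SOCKET_IFNAME =", recommend_ifname())
```

On a pod with only `eth0` up and no RDMA devices, this policy returns `eth0`, consistent with the result described above.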

I also set up a 2-node Ray cluster on RunPod to test cluster mode. Ray connected successfully across pods, but both pods reported the same internal IP (172.21.0.2) because of RunPod's isolated Docker networking, so the connectivity matrix couldn't distinguish the two nodes. Cluster mode otherwise ran without errors.
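The probe and the duplicate-IP problem can be sketched as follows. This is a hedged illustration, not the script's implementation: it times a plain TCP connect, flags IPs shared by more than one node (the RunPod situation above), and renders a small latency matrix. The Ray dispatch that would run the probe on each node is omitted, and every function name is hypothetical.

```python
import socket
import time
from collections import defaultdict


def tcp_probe(host, port, timeout=2.0):
    """Return TCP connect latency in milliseconds, or None if the connect fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def duplicate_ips(node_ips):
    """Map each IP used by more than one node to the sorted node IDs sharing it."""
    by_ip = defaultdict(list)
    for node_id, ip in node_ips.items():
        by_ip[ip].append(node_id)
    return {ip: sorted(ids) for ip, ids in by_ip.items() if len(ids) > 1}


def format_matrix(node_ids, latencies):
    """Render a text matrix; latencies[(src, dst)] is milliseconds or None."""
    lines = [" " * 8 + "".join(f"{dst:>10}" for dst in node_ids)]
    for src in node_ids:
        cells = []
        for dst in node_ids:
            ms = latencies.get((src, dst))
            if src == dst:
                cell = "-"
            elif ms is None:
                cell = "FAIL"
            else:
                cell = f"{ms:.2f}"
            cells.append(f"{cell:>10}")
        lines.append(f"{src:<8}" + "".join(cells))
    return "\n".join(lines)
```

Checking `duplicate_ips` before probing lets the tool warn that rows of the matrix are indistinguishable, rather than printing latencies that all refer to the same address.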

What wasn't tested:

  • RDMA device discovery on actual InfiniBand or RoCE hardware (no access to H100/H200 cluster with IB)
  • Connectivity matrix on a real multi-node setup where each node has a distinct routable IP

This would ideally be validated on the same hardware the existing configs target (H100/H200 multi-node). Happy to incorporate any feedback after you've had a chance to run it on real hardware.

@ppraneth
Contributor Author

@yubofredwang Could you verify this? If any issues come up, please comment and I’ll try to fix them.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81f3ed2c66

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread tools/check_network_topology.py Outdated
ppraneth added 2 commits May 10, 2026 13:39
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6065479b66

Comment thread tools/check_network_topology.py Outdated
ppraneth added 2 commits May 10, 2026 13:54
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
@yubofredwang
Collaborator

Thanks for the PR. I think probing of the RDMA data plane is missing. Running something like ib_write_bw -d 127.0.0.1 per HCA would help, since it can measure the actual bandwidth of the IB devices.
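A rough sketch of what such a loopback probe might look like, assuming the perftest `ib_write_bw` binary is installed. It reads the suggestion as: pass the HCA name to `-d` and use `127.0.0.1` as the server address on the client side. The device name `mlx5_0`, the port number, the duration, and both function names are placeholders chosen for illustration, not values taken from the PR.

```python
import shutil
import subprocess
import time


def loopback_commands(device, port=18515, duration_s=5):
    """Build (server_argv, client_argv) for a same-host ib_write_bw run."""
    common = [
        "ib_write_bw",
        "-d", device,           # HCA to test, e.g. "mlx5_0" (placeholder name)
        "-p", str(port),        # TCP port used for the out-of-band handshake
        "-D", str(duration_s),  # run for a fixed duration in seconds
        "--report_gbits",       # report bandwidth in Gb/s
    ]
    server = list(common)
    client = common + ["127.0.0.1"]  # client connects to the local server
    return server, client


def run_loopback(device, port=18515, duration_s=5):
    """Run the loopback test; return raw client output, or None if unavailable."""
    if shutil.which("ib_write_bw") is None:
        return None  # perftest is not installed on this host
    server_cmd, client_cmd = loopback_commands(device, port, duration_s)
    server = subprocess.Popen(
        server_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        time.sleep(1.0)  # crude wait for the server to start listening
        result = subprocess.run(
            client_cmd, capture_output=True, text=True, timeout=duration_s + 30
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        return None
    finally:
        server.terminate()
        server.wait()
```

Calling `run_loopback` once per device found under `/sys/class/infiniband/` would give a per-HCA bandwidth figure; parsing the output table is left out here because its exact format is not described in this thread.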

Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
@ppraneth
Contributor Author

@yubofredwang I've made the changes. Could you check it again?

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da932b28e6

Comment thread tools/check_network_topology.py Outdated
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
@ppraneth
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0fec96fcc6

Comment thread tools/check_network_topology.py Outdated
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4c24e515ee

Comment thread tools/check_network_topology.py Outdated
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
Collaborator

@yubofredwang yubofredwang left a comment

LGTM!

@yubofredwang yubofredwang merged commit 87dfadf into lightseekorg:main May 12, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

[Feature] Build a tool to report network topology and status

2 participants