[Tool] Add network topology and status reporter #94
@yubofredwang Could you verify this? If any issues come up, please comment and I’ll try to fix them.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 81f3ed2c66
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
💡 Codex Review
Reviewed commit: 6065479b66
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
Thanks for the PR. I think probing of the RDMA data plane is missing. Something like `ib_write_bw -d 127.0.0.1` per HCA would be helpful, since it can measure the actual bandwidth of the IB devices.
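A rough sketch of what such a per-HCA loopback probe could look like. The helper name `loopback_bw_commands` and the device name `mlx5_0` are hypothetical, and it assumes the perftest `ib_write_bw` binary is installed. As far as I know, `-d` selects the IB device by name rather than by address, so the sketch passes the device to `-d` and gives the client `localhost` as the server host.

```python
def loopback_bw_commands(device, duration_s=5):
    """Build server and client argv lists for a loopback ib_write_bw run.

    Hypothetical helper: it only constructs the commands and does not run them.
    """
    # Server side: bind to one HCA, run for a fixed duration, report in Gbit/s.
    server = ["ib_write_bw", "-d", device, "-D", str(duration_s), "--report_gbits"]
    # Client side: same options plus the server host (loopback here).
    client = server + ["localhost"]
    return server, client


def commands_for_all(devices, duration_s=5):
    """Return {device: (server_argv, client_argv)} for every HCA name given."""
    return {dev: loopback_bw_commands(dev, duration_s) for dev in devices}
```

To use it, the server command would be started first (for example with `subprocess.Popen`), then the client command run against it, one HCA at a time so the runs do not compete for bandwidth.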
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
@yubofredwang I've made the changes. Could you check it again?
💡 Codex Review
Reviewed commit: da932b28e6
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
@codex review
💡 Codex Review
Reviewed commit: 0fec96fcc6
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
💡 Codex Review
Reviewed commit: 4c24e515ee
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
Add check_network_topology.py — network topology and status reporter
Closes #26
This script does three things:
`/sys/class/infiniband/` with `ibv_devinfo` as fallback
`NCCL_SOCKET_IFNAME`; identifies which interfaces are RDMA-backed
What was tested:
Ran local mode on a RunPod RTX PRO 4500 pod (PyTorch image). Script runs cleanly, correctly finds no RDMA devices (expected on community GPU pods), correctly identifies `eth0` as the active interface and recommends it for `NCCL_SOCKET_IFNAME`.
Also set up a 2-node Ray cluster on RunPod to test cluster mode. Ray connected successfully across pods, but both pods ended up with the same internal IP (`172.21.0.2`) due to RunPod's isolated Docker networking, so the connectivity matrix couldn't distinguish the two nodes. Cluster mode ran without errors.
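The shared-IP situation described above could be flagged explicitly before a connectivity matrix is built. This is a minimal sketch, not what the script currently does: `find_ip_collisions` and the node IDs are hypothetical, and it assumes each node can be given some stable identifier other than its IP (for example a Ray node ID).

```python
from collections import defaultdict


def find_ip_collisions(node_ips):
    """Return {ip: [node_id, ...]} for every IP reported by more than one node.

    node_ips maps a stable node identifier to the IP that node reported.
    """
    by_ip = defaultdict(list)
    for node_id, ip in node_ips.items():
        by_ip[ip].append(node_id)
    # Keep only IPs shared by two or more nodes; sort IDs for stable output.
    return {ip: sorted(ids) for ip, ids in by_ip.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical node IDs; the IP is the one observed in the RunPod test.
    reported = {"node-a": "172.21.0.2", "node-b": "172.21.0.2"}
    for ip, ids in find_ip_collisions(reported).items():
        print(f"warning: {ip} is shared by {', '.join(ids)}")
```

Keying the matrix by node identifier instead of IP, and printing a warning when a collision is found, would make it clear that the results for those nodes cannot be told apart.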
What wasn't tested:
This would ideally be validated on the same hardware the existing configs target (H100/H200 multi-node). Happy to incorporate any feedback after you've had a chance to run it on real hardware.
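For reviewers who want to sanity-check the device discovery on real hardware, the sysfs-first lookup with an `ibv_devinfo` fallback mentioned in the summary could be sketched roughly as follows. The function names are hypothetical and this is not the code in `check_network_topology.py`; it assumes `ibv_devinfo` prints one `hca_id:` line per device.

```python
import shutil
import subprocess
from pathlib import Path


def parse_hca_ids(ibv_devinfo_output):
    """Extract device names from 'hca_id: <name>' lines of ibv_devinfo output."""
    names = []
    for line in ibv_devinfo_output.splitlines():
        line = line.strip()
        if line.startswith("hca_id:"):
            names.append(line.split(":", 1)[1].strip())
    return names


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return RDMA device names, preferring sysfs and falling back to ibv_devinfo."""
    root = Path(sysfs_root)
    if root.is_dir():
        names = sorted(p.name for p in root.iterdir())
        if names:
            return names
    # Fallback: only attempted if the ibv_devinfo binary is on PATH.
    if shutil.which("ibv_devinfo") is None:
        return []
    try:
        result = subprocess.run(
            ["ibv_devinfo"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    return parse_hca_ids(result.stdout)


if __name__ == "__main__":
    devices = list_rdma_devices()
    print(devices if devices else "no RDMA devices found")
```

On a pod without RDMA hardware, such as the RunPod instance used for testing, this would be expected to report that no devices were found.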