Updates multi-node training commands to also support Spark #3978
base: main
Conversation
Greptile Overview
Greptile Summary
Simplified multi-node training setup by replacing rendezvous backend parameters (--rdzv_id, --rdzv_backend, --rdzv_endpoint) with direct master address parameters (--master_addr, --master_port).
Key changes:
- Updated all PyTorch multi-node training commands in `multi_gpu.rst` to use `--master_addr=<ip_of_master>` and `--master_port=5555` instead of rendezvous parameters
- Commands affected: `rl_games`, `rsl_rl`, and `skrl` (PyTorch backend) training scripts
- Removed the DGX Spark limitation note about multi-node training requiring additional network configurations from `installation/index.rst`
- JAX-based training commands remain unchanged (already use `--coordinator_address`)

Issues found:
- Minor spacing issue on line 163 in `multi_gpu.rst` (extra space before `scripts/`)
This simplification aligns with PyTorch's recommended approach and makes multi-node setup more straightforward, particularly for DGX Spark environments.
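For reference, a minimal two-node launch sketch based on the updated commands (the worker-node command here is an illustrative assumption, since the diff below only shows `--node_rank=0`; adjust `--nproc_per_node` to the number of GPUs per node):

```shell
# Master node (rank 0) -- run on the machine whose IP is <ip_of_master>
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 \
    --master_addr=<ip_of_master> --master_port=5555 \
    scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed

# Worker node (rank 1) -- identical command, only --node_rank differs
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 \
    --master_addr=<ip_of_master> --master_port=5555 \
    scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
```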
Confidence Score: 4/5
- This PR is safe to merge with minimal risk - it's a documentation-only change that simplifies multi-node training setup.
- Score of 4 reflects that the changes are documentation-only and use a valid PyTorch distributed training approach. The simplified parameter set (`master_addr`/`master_port`) is standard and well-supported. One minor spacing issue was found that should be fixed before merge.
- Pay attention to `docs/source/features/multi_gpu.rst` line 163, which has an extra space that should be removed.
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| docs/source/features/multi_gpu.rst | 4/5 | Replaced rendezvous parameters with simpler master_addr/master_port for multi-node training commands. One minor spacing issue found on line 163. |
| docs/source/setup/installation/index.rst | 5/5 | Removed limitation note about multi-node training requiring additional network configurations on DGX Spark, aligning with the simplified approach. |
Sequence Diagram
sequenceDiagram
participant User as User/Developer
participant Docs as Documentation
participant MasterNode as Master Node
participant WorkerNode as Worker Node(s)
participant PyTorch as PyTorch Distributed
User->>Docs: Read multi-node training setup
Note over Docs: OLD: rdzv_backend + rdzv_id + rdzv_endpoint
Note over Docs: NEW: master_addr + master_port
User->>MasterNode: Launch with --node_rank=0 --master_addr=<ip> --master_port=5555
MasterNode->>PyTorch: Initialize distributed training (rank 0)
User->>WorkerNode: Launch with --node_rank=1 --master_addr=<ip> --master_port=5555
WorkerNode->>PyTorch: Connect to master using master_addr:master_port
PyTorch->>MasterNode: Coordinate training processes
PyTorch->>WorkerNode: Coordinate training processes
Note over MasterNode,WorkerNode: Synchronize gradients across nodes
Note over MasterNode,WorkerNode: No rendezvous backend needed (simpler setup)
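Since the new setup has no rendezvous service, the only networking requirement is that every worker node can reach the master at `<ip_of_master>` on the chosen port. A quick pre-launch sanity check from a worker node (a sketch; it assumes `ping` and netcat's `nc` are available on the host):

```shell
# Confirm basic connectivity to the master node
ping -c 1 <ip_of_master>

# Check that TCP port 5555 is reachable; note this only succeeds once the
# master-side process is already listening on that port
nc -zv <ip_of_master> 5555
```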
2 files reviewed, 1 comment
    .. code-block:: shell

    -    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
    +    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
syntax: Extra space before scripts/reinforcement_learning
Suggested change (removes the extra space before `scripts/reinforcement_learning`):

    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
Description
Removes the rendezvous backend from the multi-node training commands, since it does not appear to be necessary and it prevents multi-node setup on the DGX Spark.
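As a practical note, `<ip_of_master>` needs to be an address of the master node that the worker nodes can reach. One way to look it up on a Linux master node such as the DGX Spark (a sketch; which address to pick depends on the cluster network):

```shell
# List the master node's IP addresses; use the one on the network shared with the workers
hostname -I
# For per-interface detail:
ip addr show
```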
Type of change
Checklist
- I have run the `pre-commit` checks with `./isaaclab.sh --format`
- I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- I have added my name to the `CONTRIBUTORS.md` or my name already exists there