Updates multi-node training commands to also support Spark #3978
base: main
Conversation
Greptile Overview
Greptile Summary
Simplified multi-node training setup by replacing rendezvous backend parameters (--rdzv_id, --rdzv_backend, --rdzv_endpoint) with direct master address parameters (--master_addr, --master_port).
Key changes:
- Updated all PyTorch multi-node training commands in `multi_gpu.rst` to use `--master_addr=<ip_of_master>` and `--master_port=5555` instead of rendezvous parameters
- Commands affected: `rl_games`, `rsl_rl`, and `skrl` (PyTorch backend) training scripts
- Removed the DGX Spark limitation note about multi-node training requiring additional network configurations from `installation/index.rst`
- JAX-based training commands remain unchanged (already use `--coordinator_address`)

Issues found:
- Minor spacing issue on line 163 in `multi_gpu.rst` (extra space before `scripts/`)
This simplification aligns with PyTorch's recommended approach and makes multi-node setup more straightforward, particularly for DGX Spark environments.
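For reference, a minimal two-node launch sketch based on the updated commands (the worker-node command here is an illustrative assumption, since the diff below only shows `--node_rank=0`; adjust `--nproc_per_node` to the number of GPUs per node):

```shell
# Master node (rank 0) -- run on the machine whose IP is <ip_of_master>
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 \
    --master_addr=<ip_of_master> --master_port=5555 \
    scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed

# Worker node (rank 1) -- identical command, only --node_rank differs
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 \
    --master_addr=<ip_of_master> --master_port=5555 \
    scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
```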
Confidence Score: 4/5
- This PR is safe to merge with minimal risk - it's a documentation-only change that simplifies multi-node training setup.
- Score of 4 reflects that the changes are documentation-only and use a valid PyTorch distributed training approach. The simplified parameter set (`master_addr`/`master_port`) is standard and well-supported. One minor spacing issue was found that should be fixed before merge.
- Pay attention to `docs/source/features/multi_gpu.rst` line 163, which has an extra space that should be removed.
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| docs/source/features/multi_gpu.rst | 4/5 | Replaced rendezvous parameters with simpler master_addr/master_port for multi-node training commands. One minor spacing issue found on line 163. |
| docs/source/setup/installation/index.rst | 5/5 | Removed limitation note about multi-node training requiring additional network configurations on DGX Spark, aligning with the simplified approach. |
Sequence Diagram
sequenceDiagram
participant User as User/Developer
participant Docs as Documentation
participant MasterNode as Master Node
participant WorkerNode as Worker Node(s)
participant PyTorch as PyTorch Distributed
User->>Docs: Read multi-node training setup
Note over Docs: OLD: rdzv_backend + rdzv_id + rdzv_endpoint
Note over Docs: NEW: master_addr + master_port
User->>MasterNode: Launch with --node_rank=0 --master_addr=<ip> --master_port=5555
MasterNode->>PyTorch: Initialize distributed training (rank 0)
User->>WorkerNode: Launch with --node_rank=1 --master_addr=<ip> --master_port=5555
WorkerNode->>PyTorch: Connect to master using master_addr:master_port
PyTorch->>MasterNode: Coordinate training processes
PyTorch->>WorkerNode: Coordinate training processes
Note over MasterNode,WorkerNode: Synchronize gradients across nodes
Note over MasterNode,WorkerNode: No rendezvous backend needed (simpler setup)
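Since the new setup has no rendezvous service, the only networking requirement is that every worker node can reach the master at `<ip_of_master>` on the chosen port. A quick pre-launch sanity check from a worker node (a sketch; it assumes `ping` and netcat's `nc` are available on the host):

```shell
# Confirm basic connectivity to the master node
ping -c 1 <ip_of_master>

# Check that TCP port 5555 is reachable; note this only succeeds once the
# master-side process is already listening on that port
nc -zv <ip_of_master> 5555
```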
2 files reviewed, 1 comment
    .. code-block:: shell

    -    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
    +    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
syntax: Extra space before scripts/reinforcement_learning
Suggested change (removes the extra space before `scripts/reinforcement_learning`):

    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
Description
Removes the rendezvous backend from the multi-node training commands, since it does not appear to be necessary and it prevents multi-node setup on the DGX Spark.
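As a practical note, `<ip_of_master>` needs to be an address of the master node that the worker nodes can reach. One way to look it up on a Linux master node such as the DGX Spark (a sketch; which address to pick depends on the cluster network):

```shell
# List the master node's IP addresses; use the one on the network shared with the workers
hostname -I
# For per-interface detail:
ip addr show
```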
Type of change
Checklist
- I have run the `pre-commit` checks with `./isaaclab.sh --format`
- I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- I have added my name to the `CONTRIBUTORS.md` or my name already exists there