
Conversation

@kellyguo11
Contributor

Description

Removes the rendezvous backend parameters from the multi-node training instructions, since they do not appear to be necessary and they prevent multi-node setup on the DGX Spark.

Type of change

  • Documentation update

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 8, 2025
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

Simplified multi-node training setup by replacing rendezvous backend parameters (--rdzv_id, --rdzv_backend, --rdzv_endpoint) with direct master address parameters (--master_addr, --master_port).

Key changes:

  • Updated all PyTorch multi-node training commands in multi_gpu.rst to use --master_addr=<ip_of_master> and --master_port=5555 instead of rendezvous parameters (see the example launch commands after this list)
  • Commands affected: rl_games, rsl_rl, and skrl (PyTorch backend) training scripts
  • Removed DGX Spark limitation note about multi-node training requiring additional network configurations from installation/index.rst
  • JAX-based training commands remain unchanged (already use --coordinator_address)
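For reference, a minimal sketch of the simplified two-node launch described above. The node/GPU counts, task name, and port 5555 are taken from the existing documentation examples; <ip_of_master> is a placeholder for the master node's reachable address.

# on the master node (node_rank 0)
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed

# on the second node, only --node_rank changes; --master_addr and --master_port stay the same
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed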

Issues found:

  • Minor spacing issue on line 163 in multi_gpu.rst (extra space before scripts/)

This simplification aligns with PyTorch's recommended approach and makes multi-node setup more straightforward, particularly for DGX Spark environments.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it's a documentation-only change that simplifies multi-node training setup.
  • Score of 4 reflects that the changes are documentation-only and use a valid PyTorch distributed training approach. The simplified parameter set (master_addr/master_port) is standard and well-supported. One minor spacing issue was found that should be fixed before merge.
  • Pay attention to docs/source/features/multi_gpu.rst line 163 which has an extra space that should be removed.

Important Files Changed

File Analysis

  • docs/source/features/multi_gpu.rst (4/5): Replaced rendezvous parameters with simpler master_addr/master_port for multi-node training commands. One minor spacing issue found on line 163.
  • docs/source/setup/installation/index.rst (5/5): Removed the limitation note about multi-node training requiring additional network configurations on DGX Spark, aligning with the simplified approach.

Sequence Diagram

sequenceDiagram
    participant User as User/Developer
    participant Docs as Documentation
    participant MasterNode as Master Node
    participant WorkerNode as Worker Node(s)
    participant PyTorch as PyTorch Distributed
    
    User->>Docs: Read multi-node training setup
    Note over Docs: OLD: rdzv_backend + rdzv_id + rdzv_endpoint
    Note over Docs: NEW: master_addr + master_port
    
    User->>MasterNode: Launch with --node_rank=0 --master_addr=<ip> --master_port=5555
    MasterNode->>PyTorch: Initialize distributed training (rank 0)
    
    User->>WorkerNode: Launch with --node_rank=1 --master_addr=<ip> --master_port=5555
    WorkerNode->>PyTorch: Connect to master using master_addr:master_port
    
    PyTorch->>MasterNode: Coordinate training processes
    PyTorch->>WorkerNode: Coordinate training processes
    
    Note over MasterNode,WorkerNode: Synchronize gradients across nodes
    Note over MasterNode,WorkerNode: No rendezvous backend needed (simpler setup)
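To make the diagram concrete, here is a small self-contained sketch (an illustration only, not part of the PR; the helper script name show_env.py and the single-machine layout are assumptions): torchrun derives MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the launch flags and exports them to every training process, where PyTorch's default env:// initialization reads them.

# hypothetical helper that prints the variables torchrun exports to each process
cat > show_env.py <<'EOF'
import os
# MASTER_ADDR/MASTER_PORT come from --master_addr/--master_port;
# RANK and WORLD_SIZE identify this process within the whole job.
print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"],
      os.environ["RANK"], os.environ["WORLD_SIZE"])
EOF

# single machine, two processes, same flags as the documented multi-node commands
python -m torch.distributed.run --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=5555 show_env.py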

2 files reviewed, 1 comment


 .. code-block:: shell

-    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555  scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed


syntax: Extra space before scripts/reinforcement_learning

Suggested change
    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555  scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed

