
feat: Multi-GPU Data Parallel (DDP) for training#117

Open
aniruddhaadak80 wants to merge 1 commit into karpathy:master from aniruddhaadak80:feat/multi-gpu-ddp

Conversation


@aniruddhaadak80 aniruddhaadak80 commented Mar 10, 2026

Summary

Introduces a torch.distributed (DDP) setup so the training script runs natively on multi-GPU machines via torchrun.

Key Capabilities

  • Multi-Device Sync: Adapts the existing single-file train.py architecture into a data-parallel training setup.
  • Dynamic Adjustments: Automatically rescales gradient accumulation for the world size and assigns device=cuda:local_rank so each process maps to its own GPU.
  • DDP Micro-Step Optimization: Enables require_backward_grad_sync only on the final micro-step of each accumulation window, skipping redundant gradient all-reduces and avoiding network stalls during gradient accumulation.
  • Rank 0 Isolation: Suppresses log output on non-master ranks (builtins.print is disabled) to prevent terminal flooding.

Runs out of the box; no modifications to prepare.py are required.
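The micro-step optimization bullet above can be sketched as a small helper; the training-loop usage shown in comments is illustrative of the general DDP pattern, not a verbatim excerpt from train.py:

```python
def is_last_micro_step(micro_step, grad_accum_steps):
    # Gradients only need to be all-reduced once per optimizer step, so
    # DDP's require_backward_grad_sync can stay False on every micro-step
    # except the final one of each accumulation window.
    return micro_step == grad_accum_steps - 1

# Sketch of the training loop (with `model` wrapped in DistributedDataParallel):
#   for micro_step in range(grad_accum_steps):
#       model.require_backward_grad_sync = is_last_micro_step(micro_step, grad_accum_steps)
#       loss = model(x, y) / grad_accum_steps
#       loss.backward()
#   optimizer.step()
```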

Usage

torchrun --nproc_per_node=4 train.py
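The "Dynamic Adjustments" bullet implies the gradient-accumulation count shrinks as more GPUs join. A minimal sketch of that arithmetic, with hypothetical variable names (the actual names in train.py may differ):

```python
def grad_accum_steps(total_batch_size, device_batch_size, seq_len, world_size):
    # Tokens processed in one forward/backward pass across all ranks.
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    assert total_batch_size % tokens_per_micro_step == 0, \
        "total batch size must divide evenly across ranks"
    return total_batch_size // tokens_per_micro_step
```

With 4 GPUs the accumulation count drops to a quarter of the single-GPU value, keeping the effective total batch size constant.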

Copilot AI review requested due to automatic review settings March 10, 2026 07:45

Copilot AI left a comment


Pull request overview

This PR adds DistributedDataParallel (DDP) support to train.py, enabling multi-GPU training via torchrun. It sets up the distributed process group, adjusts random seeds per rank, scales gradient accumulation for the world size, wraps the model with DDP, implements a dataloader sharding wrapper, suppresses prints on non-master ranks, and disables gradient synchronization on non-final micro-steps for efficiency.

Changes:

  • Add DDP initialization/teardown with NCCL backend, per-rank device assignment, and print suppression on non-master processes
  • Scale gradient accumulation steps and total batch size to account for ddp_world_size, and use require_backward_grad_sync to skip allreduce on intermediate micro-steps
  • Implement a generator-based dataloader wrapper to shard training data across DDP ranks by interleaving
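The initialization/teardown described above hinges on the environment variables torchrun exports. A minimal sketch of the detection step, with the torch-specific calls shown as comments (this is an illustration of the standard torchrun pattern, not the PR's exact code):

```python
import os

def parse_ddp_env(env=None):
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every spawned
    # process; a plain `python train.py` launch has none of them set.
    env = os.environ if env is None else env
    if int(env.get("RANK", -1)) == -1:
        return dict(ddp=False, rank=0, local_rank=0, world_size=1)
    return dict(
        ddp=True,
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )

# In the DDP branch the script would then (sketch):
#   dist.init_process_group(backend="nccl")
#   device = f"cuda:{cfg['local_rank']}"; torch.cuda.set_device(device)
#   ...train...
#   dist.destroy_process_group()
```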


Comment on lines +483 to +484
torch.manual_seed(42 + ddp_rank)
torch.cuda.manual_seed(42 + ddp_rank)

Copilot AI Mar 10, 2026


Critical bug: Setting per-rank seeds (42 + ddp_rank) before model initialization (line 509: model.init_weights()) means each DDP rank will initialize the model with different random weights. DDP requires all ranks to start with identical parameters — it only synchronizes gradients, not initial weights. This will cause training to diverge immediately.

The fix is to use the same seed across all ranks for model initialization, and only diverge seeds afterward (before the dataloader) to ensure each rank sees different data. For example, set torch.manual_seed(42) before init_weights(), then set torch.manual_seed(42 + ddp_rank) after model initialization and before the dataloader setup.

Suggested change
torch.manual_seed(42 + ddp_rank)
torch.cuda.manual_seed(42 + ddp_rank)
torch.manual_seed(42)
torch.cuda.manual_seed(42)
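The ordering constraint in the review can be illustrated with the stdlib random module standing in for torch's RNG (an analogy only, not the PR's code): identical seeds before init give identical "weights" on every rank, and re-seeding per rank afterwards gives each rank a distinct data stream.

```python
import random

def init_then_shard(base_seed, ddp_rank, n_params=4):
    # Seed identically on every rank BEFORE weight init, so all ranks
    # start from the same parameters (the DDP requirement).
    rng = random.Random(base_seed)
    weights = [rng.random() for _ in range(n_params)]
    # Re-seed per rank AFTER init, so each rank's dataloader draws a
    # different stream.
    rng.seed(base_seed + ddp_rank)
    data_draw = rng.random()
    return weights, data_draw
```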


import gc
import time
import contextlib

Copilot AI Mar 10, 2026


Unused import: contextlib is imported but never used anywhere in the file. It should be removed to keep the imports clean.

Suggested change
import contextlib

Comment on lines +538 to +544
def ddp_dataloader_wrapper(loader, rank, world_size):
    for _ in range(rank):
        next(loader)
    while True:
        yield next(loader)
        for _ in range(world_size - 1):
            next(loader)

Copilot AI Mar 10, 2026


The dataloader wrapper makes every rank produce all batches from the underlying generator (including the expensive best-fit packing work), then discard (world_size - 1) / world_size of them. For example, with 8 GPUs, each rank does 8× the CPU data-loading work but only uses 1/8 of it. This effectively scales CPU cost by world_size and can become a bottleneck on large clusters.

Consider passing the rank/world_size into make_dataloader (or its underlying _document_batches) to shard the parquet files or row groups at the source, so each rank only loads its own slice of the data.
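Source-level sharding as suggested here can be as simple as a strided slice over the file list before any batches are built (a sketch; the real fix would thread rank/world_size through make_dataloader):

```python
def shard_for_rank(files, rank, world_size):
    # Rank r takes files r, r + world_size, r + 2 * world_size, ...
    # The shards are disjoint and together cover every file, so each rank
    # only pays the loading/packing cost for its own slice.
    return files[rank::world_size]
```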

@aniruddhaadak80
Author

@copilot open a new pull request to apply changes based on the comments in this thread

