feat: Multi-GPU Data Parallel (DDP) for training #117
aniruddhaadak80 wants to merge 1 commit into karpathy:master
Conversation
Pull request overview
This PR adds DistributedDataParallel (DDP) support to train.py, enabling multi-GPU training via torchrun. It sets up the distributed process group, adjusts random seeds per rank, scales gradient accumulation for the world size, wraps the model with DDP, implements a dataloader sharding wrapper, suppresses prints on non-master ranks, and disables gradient synchronization on non-final micro-steps for efficiency.
Changes:
- Add DDP initialization/teardown with NCCL backend, per-rank device assignment, and print suppression on non-master processes
- Scale gradient accumulation steps and total batch size to account for ddp_world_size, and use require_backward_grad_sync to skip the allreduce on intermediate micro-steps
- Implement a generator-based dataloader wrapper to shard training data across DDP ranks by interleaving
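The no-sync pattern from the second bullet can be sketched without torch, using a stub in place of the DDP-wrapped model to show when the flag flips (the stub class and grad_accum_steps value are illustrative, not from the PR):

```python
class StubDDPModel:
    """Stand-in for a DDP-wrapped model: records the sync flag at each backward."""
    def __init__(self):
        self.require_backward_grad_sync = True
        self.sync_log = []

    def backward_step(self):
        # A real DDP model would all-reduce gradients only when the flag is True.
        self.sync_log.append(self.require_backward_grad_sync)

grad_accum_steps = 4  # illustrative value
model = StubDDPModel()
for micro_step in range(grad_accum_steps):
    # Synchronize (all-reduce) only on the final micro-step of the window.
    model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
    model.backward_step()

print(model.sync_log)  # → [False, False, False, True]
```

With this pattern, gradients are averaged across ranks once per optimizer step instead of once per micro-step, so accumulation does not multiply network traffic.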
```python
torch.manual_seed(42 + ddp_rank)
torch.cuda.manual_seed(42 + ddp_rank)
```
Critical bug: Setting per-rank seeds (42 + ddp_rank) before model initialization (line 509: model.init_weights()) means each DDP rank will initialize the model with different random weights. DDP requires all ranks to start with identical parameters — it only synchronizes gradients, not initial weights. This will cause training to diverge immediately.
The fix is to use the same seed across all ranks for model initialization, and only diverge seeds afterward (before the dataloader) to ensure each rank sees different data. For example, set torch.manual_seed(42) before init_weights(), then set torch.manual_seed(42 + ddp_rank) after model initialization and before the dataloader setup.
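The ordering this comment asks for can be illustrated with Python's stdlib `random` as a stand-in for torch's RNG (the helper name and list sizes are illustrative):

```python
import random

def init_and_load(ddp_rank, seed_base=42):
    # Same seed on every rank BEFORE "model init" -> identical initial weights.
    random.seed(seed_base)
    fake_init_weights = [random.random() for _ in range(3)]
    # Per-rank seed AFTER init, before the dataloader -> different data per rank.
    random.seed(seed_base + ddp_rank)
    fake_data_order = [random.random() for _ in range(3)]
    return fake_init_weights, fake_data_order

w0, d0 = init_and_load(ddp_rank=0)
w1, d1 = init_and_load(ddp_rank=1)
assert w0 == w1  # identical "model init" across ranks
assert d0 != d1  # distinct data streams per rank
```

The same ordering applies with `torch.manual_seed`: fixed seed before `init_weights()`, per-rank seed before the dataloader.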
```diff
- torch.manual_seed(42 + ddp_rank)
- torch.cuda.manual_seed(42 + ddp_rank)
+ torch.manual_seed(42)
+ torch.cuda.manual_seed(42)
```
```python
import gc
import time
import contextlib
```
Unused import: contextlib is imported but never used anywhere in the file. It should be removed to keep the imports clean.
```diff
- import contextlib
```
```python
def ddp_dataloader_wrapper(loader, rank, world_size):
    for _ in range(rank):
        next(loader)
    while True:
        yield next(loader)
        for _ in range(world_size - 1):
            next(loader)
```
The dataloader wrapper makes every rank produce all batches from the underlying generator (including the expensive best-fit packing work), then discard (world_size - 1) / world_size of them. For example, with 8 GPUs, each rank does 8× the CPU data-loading work but only uses 1/8 of it. This effectively scales CPU cost by world_size and can become a bottleneck on large clusters.
Consider passing the rank/world_size into make_dataloader (or its underlying _document_batches) to shard the parquet files or row groups at the source, so each rank only loads its own slice of the data.
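One way to shard at the source, sketched with a hypothetical list of parquet file paths (the internals of make_dataloader / _document_batches are not shown in this thread, so the function below is an assumption about the shape of the fix):

```python
def shard_files(files, rank, world_size):
    """Give each DDP rank a disjoint, interleaved slice of the input files.

    Each rank then only opens and packs its own files, so CPU data-loading
    cost no longer scales with world_size.
    """
    return files[rank::world_size]

files = [f"shard_{i:04d}.parquet" for i in range(10)]  # hypothetical paths
per_rank = [shard_files(files, r, world_size=4) for r in range(4)]

# The slices are disjoint and together cover every file exactly once:
flat = sorted(f for slice_ in per_rank for f in slice_)
assert flat == sorted(files)
print(per_rank[0])  # → ['shard_0000.parquet', 'shard_0004.parquet', 'shard_0008.parquet']
```

Interleaving by file keeps per-rank work roughly balanced when files are similarly sized; sharding by row group would balance better when file sizes vary widely.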
|
@copilot open a new pull request to apply changes based on the comments in this thread
Summary

Introduces a torch.distributed (DDP) setup to natively run the training script on multi-GPU machines using torchrun.

Key Capabilities

- Converts the train.py architecture into a data-parallel approach.
- Assigns device=cuda:local_rank so each process maps directly to its own GPU in the hardware topology.
- Limits require_backward_grad_sync strictly to the terminal micro-step, greatly increasing throughput with gradient accumulation by not stalling compute on network communication.
- Suppresses output on non-master ranks (builtins.print disabled) to prevent terminal flooding.
- Runs out of the box with zero modifications required inside prepare.py.

Usage
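A typical launch command, assuming a single node with 8 GPUs (the flag values here are illustrative, not taken from the PR — adjust --nproc_per_node to your GPU count):

```shell
# Launch one process per GPU on a single node; torchrun sets RANK,
# LOCAL_RANK, and WORLD_SIZE, which train.py's DDP setup reads.
torchrun --standalone --nproc_per_node=8 train.py
```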