action dataloader: episode-shuffle stream (fix DROID grad-norm instability) #37
Merged
f786168 to
8eec346
Compare
Collaborator
LGTM
lfengad
previously approved these changes
Jun 12, 2026
…ebased on main) Rebased onto current main. main NVIDIA#34 upstreamed the DROID dataset (joint_pos, use_state, keep-ranges filter, action_space) so droid_lerobot_dataset.py now carries only the get_shuffle_blocks helper grafted onto main's version; NVIDIA#29's recipe change (dropped /cluster override) is incorporated. Remaining contribution: action_policy_droid_nano recipe (mode=policy, lr=2e-4 @ 8192 global, max_num_tokens_after_packing=-1, scrubbed comments), the episode-shuffle stream (action_sft_dataset.py), the multi-node-capable SFT launcher (NNODES/NODE_RANK/MASTER_ADDR passthrough + EXTRA_TAIL_OVERRIDES), and the post-train doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Hao Liang <haolia@nvidia.com>
ae3db20 to
f34031a
Compare
…None joint_pos uses raw (un-normalized) joint actions, so DROIDLeRobotDataset sets action_normalization=None — but _build_result called normalize_action() unconditionally, which raises 'Unknown normalization method: None'. Guard it so None means raw actions (caught by a 2-node sanity run on the rebased branch). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Hao Liang <haolia@nvidia.com>
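The guard described in this commit can be sketched as follows. This is only an illustration: `normalize_action` here is a hypothetical stand-in (its real signature is not shown in this PR), the `"min_max"` method name and the `stats` layout are assumptions, and `build_action` stands in for the relevant part of `_build_result`.

```python
def normalize_action(action, method, stats):
    """Hypothetical stand-in for the real normalize_action()."""
    if method == "min_max":  # assumed method name, for illustration only
        lo, hi = stats["min"], stats["max"]
        return [2.0 * (a - l) / (h - l) - 1.0 for a, l, h in zip(action, lo, hi)]
    raise ValueError(f"Unknown normalization method: {method}")


def build_action(action, method, stats=None):
    """Stand-in for the action handling inside _build_result."""
    # Guard: None means "use raw (un-normalized) actions", so skip normalization.
    if method is None:
        return list(action)
    return normalize_action(action, method, stats)


raw = build_action([0.1, -0.2], None)
scaled = build_action([0.5], "min_max", {"min": [0.0], "max": [1.0]})
```

Without the `method is None` branch, the first call would fall through to `normalize_action` and raise the error quoted in the commit message.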
…reference The bare recipe trained with the NANO default loss_scale=1.0, weighting the vision flow-matching loss 10x lower than the Cosmos3-Nano-Policy-DROID reference (which uses 10.0). Set it post-construction so the recipe reproduces without launcher overrides. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Hao Liang <haolia@nvidia.com>
This was referenced Jun 17, 2026
fwd4
added a commit
to fwd4/cosmos-framework
that referenced
this pull request
Jun 20, 2026
…ks, structure) Apply ychao-nvidia's review suggestions: rename DROID->Cosmos3-DROID (nvidia/Cosmos3-DROID HF dataset) + Cosmos3-Nano is a 16B MoT, add HF links; restructure doc (## Prerequisites, drop the obsolete 'Dataset: to be released' section + the redundant shuffle/filter blockquote); recipe table (init=Cosmos3-Nano, keep_ranges_1_0_1.json blob link, global batch 64/rank x 128 ranks); convert-cmd arg order; EXTRA_TAIL_OVERRIDES formatting + TOML link; launcher DATASET_PATH desc + dataset-check message -> Cosmos3-DROID; recipe filter comment -> keep_ranges_1_0_1.json. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Hao Liang <haolia@nvidia.com>
fwd4
added a commit
to fwd4/cosmos-framework
that referenced
this pull request
Jun 21, 2026
- Cosmos3-DROID dataset naming + HF links; doc restructured (Prerequisites / Inputs You Provide / Recipe / Full Reproduction / Checkpoints).
- Launcher: DATASET_PATH + EXTRA_DATASET_CHECK for the Cosmos3-DROID success dir.
- State multi-node GB200 validation at 8192 global batch (drop H200-specific notes).
- Describe max_samples_per_batch precisely: samples packed into each per-rank batch (num_workers x prefetch_factor workers decode in parallel to feed it); spell out global batch = max_samples_per_batch x world size x grad_accum_iter.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
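As a quick check of the global-batch formula in this commit message, the arithmetic can be written out. The function name is illustrative, and `grad_accum_iter=1` is an assumption; the 64 samples/rank and 128 ranks come from the recipe table mentioned in an earlier commit.

```python
def global_batch(max_samples_per_batch, world_size, grad_accum_iter=1):
    """global batch = max_samples_per_batch x world size x grad_accum_iter."""
    return max_samples_per_batch * world_size * grad_accum_iter


# 64 samples per rank on 128 ranks, assuming no gradient accumulation.
validated = global_batch(max_samples_per_batch=64, world_size=128)
print(validated)  # 8192
```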
86be70e to
d5ab046
Compare
d5ab046 to
3f47229
Compare
lfengad
approved these changes
Jun 22, 2026
Problem
The DROID action SFT dataloader trained with an unstable, slow-settling grad-norm (and a noisy action-loss plateau) vs the internal reference. Root cause: the DROID action dataset is map-style and, unlike the iterable vision SFTDataset (which self-shuffles), does not shuffle; RankPartitionedDataLoader wraps it in a DataLoader with no shuffle, i.e. a SequentialSampler. Every rank then iterates the same consecutive, overlapping windows, so the all-reduced global batch is effectively ~1 episode → high gradient variance.

(Forward + gradients were verified numerically equivalent to the internal model on identical input, so this was a data-path issue, not the model/loss/optimizer.)
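A toy model of the failure mode, in plain Python rather than the real loader: it assumes, as stated above, that every rank reads the same sequential indices. The episode length of 200 windows and the world size of 8 are made-up values chosen only for illustration.

```python
def episode_of(index, windows_per_episode):
    """Map a flat sample index to its episode id (fixed-length episodes assumed)."""
    return index // windows_per_episode


def sequential_global_batch(step, per_rank_batch, world_size):
    """Every rank runs the same sequential sampler, so all ranks yield the same indices."""
    start = step * per_rank_batch
    rank_batch = list(range(start, start + per_rank_batch))
    return [rank_batch for _ in range(world_size)]


batches = sequential_global_batch(step=0, per_rank_batch=64, world_size=8)
episodes = {episode_of(i, 200) for rank_batch in batches for i in rank_batch}
print(len(episodes))  # 1
```

All 8 × 64 samples in the all-reduced batch come from a single episode, which is the correlation that drives the gradient variance.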
Fix
- ActionIterableShuffleDataset (iterable_shuffle=True): an IterableDataset view of the map-style dataset that streams rank × worker-sharded, episode-order-shuffled, sequential-within-episode samples, giving decorrelated batches with sequential reads. This preserves I/O locality and copy-on-write; a plain shuffle=True / RandomSampler instead does random-access I/O → ~11 min/iter and OOM from broken COW. Mirrors the internal iterable dataset's per-worker episode assignment.
- DROIDLeRobotDataset.get_shuffle_blocks(): per-episode/segment flat-index blocks the iterable streams.
- No DataLoader/sampler change needed: IterableDataset is handled natively (sampler=None).

Validation (8192 global batch)
Per-component action loss converges to ~0.0055 (matches internal ~0.005; the no-shuffle run plateaued noisily at 0.03–0.07). Builds on #24 (recipe + FusedAdam optimizer).
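The episode-shuffle stream described under Fix can be sketched in plain Python. This is not the PR's ActionIterableShuffleDataset: the `(start, stop)` block representation of what get_shuffle_blocks() returns, the strided shard assignment, and the seeding scheme are all assumptions made for illustration.

```python
import random


def episode_shuffle_stream(blocks, rank, world_size, worker_id, num_workers,
                           seed=0, epoch=0):
    """Yield flat sample indices for one (rank, worker) shard.

    blocks: list of (start, stop) flat-index ranges, one per episode (assumed layout).
    Episodes are sharded across rank x worker, their order is shuffled,
    and indices inside each episode are emitted sequentially.
    """
    shard = rank * num_workers + worker_id
    num_shards = world_size * num_workers
    my_blocks = blocks[shard::num_shards]  # disjoint episodes per shard

    # Integer seed that differs per shard and per epoch (illustrative scheme).
    rng = random.Random(seed * 1_000_003 + epoch * 1_009 + shard)
    order = list(range(len(my_blocks)))
    rng.shuffle(order)

    for i in order:
        start, stop = my_blocks[i]
        yield from range(start, stop)  # sequential reads within the episode


blocks = [(0, 3), (3, 5), (5, 9), (9, 10)]
r0 = list(episode_shuffle_stream(blocks, rank=0, world_size=2, worker_id=0, num_workers=1))
r1 = list(episode_shuffle_stream(blocks, rank=1, world_size=2, worker_id=0, num_workers=1))
```

Each shard sees a disjoint set of episodes, so a global batch mixes episodes across ranks, while reads inside an episode stay contiguous.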
🤖 Generated with Claude Code
Added commits (recipe correctness)
- mode="policy" default: DROIDLeRobotDataset defaulted to mode="joint" (random forward_dynamics/inverse_dynamics/policy per sample), so the policy recipe was silently training multi-task. inverse_dynamics zeros the vision loss and forward_dynamics zeros the action loss, diluting each per-task loss by ~1/3 vs the policy-only internal run. Now defaults to policy (matching i4's DROIDLeRobotDataset); mode is also threaded through get_action_droid_sft_dataset.
- max_num_tokens_after_packing=-1: uncaps the packed-sequence length (NANO default 45056) to match the internal droid_lerobot_8b run, so the full vision sequence is processed per step. Does not change the per-token loss; widens the effective vision context per step.
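The ~1/3 dilution in the mode="policy" item can be checked with a small calculation. It assumes the three joint-mode tasks are sampled uniformly, which the description does not state explicitly; the table of zeroed losses follows the text above.

```python
from fractions import Fraction

MODES = ("forward_dynamics", "inverse_dynamics", "policy")
# Which loss each mode zeroes, per the description above.
ZEROED = {"forward_dynamics": "action", "inverse_dynamics": "vision", "policy": None}


def active_fraction(loss_name, modes=MODES):
    """Fraction of samples on which loss_name is not zeroed (uniform sampling assumed)."""
    active = sum(1 for m in modes if ZEROED[m] != loss_name)
    return Fraction(active, len(modes))


joint_action = active_fraction("action")                      # joint mode
joint_vision = active_fraction("vision")                      # joint mode
policy_action = active_fraction("action", modes=("policy",))  # policy-only
print(joint_action, joint_vision, policy_action)  # 2/3 2/3 1
```

Under that assumption each loss is active on 2/3 of the samples in joint mode, i.e. reduced by 1/3 relative to the policy-only run.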