action dataloader: episode-shuffle stream (fix DROID grad-norm instability) #37

Merged: mli0603 merged 5 commits into NVIDIA:main from fwd4:droid-action-shuffle on Jun 22, 2026
fwd4 (Collaborator) commented Jun 12, 2026

Problem

The DROID action SFT dataloader trained with an unstable, slow-settling grad-norm and a noisy action-loss plateau compared with the internal reference. Root cause: the DROID action dataset is map-style and does not shuffle itself (unlike the iterable vision SFTDataset, which self-shuffles), and RankPartitionedDataLoader wraps it in a DataLoader with no shuffle, i.e. a SequentialSampler. Every rank then iterates the same consecutive, overlapping windows, so the all-reduced global batch effectively covers ~1 episode, which produces high gradient variance.

(Forward + gradients were verified numerically equivalent to the internal model on identical input, so this was a data-path issue, not the model/loss/optimizer.)
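The failure mode described above can be reproduced in a toy setting without any training code. The sketch below is illustrative only: the episode lengths, `world_size`, and per-rank batch size are made-up values, and `sequential_batches` is a hypothetical stand-in for what a sequential sampler yields, not the actual RankPartitionedDataLoader code.

```python
def make_flat_index(episode_lengths):
    """Map each flat sample index to its episode id (one entry per window)."""
    return [ep for ep, n in enumerate(episode_lengths) for _ in range(n)]


def sequential_batches(num_samples, batch_size):
    """Yield index batches in order 0..N-1, like a sequential sampler would."""
    for start in range(0, num_samples, batch_size):
        yield list(range(start, min(start + batch_size, num_samples)))


# Toy values (assumptions): 3 episodes of 500 windows, 4 ranks, 64 samples per rank.
episode_of = make_flat_index([500, 500, 500])
world_size, per_rank_batch = 4, 64

# Each rank builds the same sequential iterator over the same dataset,
# so every rank's first batch contains the same indices.
first_batches = [
    next(sequential_batches(len(episode_of), per_rank_batch))
    for _ in range(world_size)
]
global_batch = [i for batch in first_batches for i in batch]

episodes_in_global_batch = {episode_of[i] for i in global_batch}
unique_samples = len(set(global_batch))

print(f"global batch size: {len(global_batch)}")
print(f"unique samples:    {unique_samples}")
print(f"episodes covered:  {sorted(episodes_in_global_batch)}")
```

In this toy run the nominal global batch of 256 samples contains only 64 distinct windows, all from a single episode, which matches the "effectively ~1 episode" diagnosis.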

Fix

ActionIterableShuffleDataset (iterable_shuffle=True): an IterableDataset view of the map-style dataset. It shards episodes across rank × worker, shuffles the episode order, and streams samples sequentially within each episode. This gives decorrelated batches while keeping sequential reads, which preserves I/O locality and copy-on-write (COW) memory sharing. A plain shuffle=True/RandomSampler instead does random-access I/O, which led to ~11 min/iter and OOM from broken COW. The design mirrors the internal iterable dataset's per-worker episode assignment.

  • Adds DROIDLeRobotDataset.get_shuffle_blocks() (per-episode/segment flat-index blocks the iterable streams).
  • No DataLoader/sampler change needed — IterableDataset is handled natively (sampler=None).
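A minimal, framework-free sketch of the streaming order described above. It assumes `get_shuffle_blocks()` returns a list of per-episode lists of flat indices; the function name `episode_shuffle_stream`, its parameters, and the seeding scheme are hypothetical and are not taken from the PR. In a real IterableDataset, the worker id and worker count would typically come from `torch.utils.data.get_worker_info()`.

```python
import random


def episode_shuffle_stream(blocks, rank, world_size, worker_id, num_workers,
                           seed=0, epoch=0):
    """Yield flat indices: shuffled episode order, sequential within an episode."""
    # Every shard uses the same seed, so all shards agree on one permutation
    # and the strided slices below partition the episodes without overlap.
    order = list(range(len(blocks)))
    random.Random(seed + epoch).shuffle(order)

    shard = rank * num_workers + worker_id
    num_shards = world_size * num_workers
    for block_id in order[shard::num_shards]:
        # Sequential reads inside the episode keep I/O locality.
        yield from blocks[block_id]


# Toy data (assumption): 8 episodes of 5 windows each, 2 ranks x 2 workers.
blocks = [list(range(e * 5, (e + 1) * 5)) for e in range(8)]
streams = {
    (r, w): list(episode_shuffle_stream(blocks, r, 2, w, 2, seed=123))
    for r in range(2)
    for w in range(2)
}

for key, stream in streams.items():
    print(key, stream)
```

Each (rank, worker) shard sees a disjoint set of episodes, so a global batch assembled across ranks mixes several episodes while each worker still reads its own episode front to back.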

Validation (8192 global batch)

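| iter | metric | this fix | internal ref | no-shuffle |
|------|-----------|----------|--------------|------------|
| 100 | grad-norm | 2.9 | 4.7 | 21 |
| 450 | grad-norm | 1.7 | 1.9 | |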

Per-component action loss converges to ~0.0055 (matches internal ~0.005; the no-shuffle run plateaued noisily at 0.03–0.07). Builds on #24 (recipe + FusedAdam optimizer).

🤖 Generated with Claude Code


Added commits (recipe correctness)

  • mode="policy" default: DROIDLeRobotDataset defaulted to mode="joint" (a random forward_dynamics/inverse_dynamics/policy task per sample), so the policy recipe was silently training multi-task. inverse_dynamics zeros the vision loss and forward_dynamics zeros the action loss, diluting each per-task loss by ~1/3 vs the policy-only internal run. It now defaults to policy (matching i4's DROIDLeRobotDataset); mode is also threaded through get_action_droid_sft_dataset.
  • max_num_tokens_after_packing=-1: uncaps the packed-sequence length (NANO default 45056) to match the internal droid_lerobot_8b run, so the full vision sequence is processed per step. This does not change the per-token loss; it widens the effective vision context per step.
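The ~1/3 dilution in joint mode follows from simple counting. The sketch below assumes the three modes are sampled uniformly and that policy mode contributes both loss terms; neither assumption is stated explicitly in the description, and the dictionaries are illustrative rather than copied from the dataset code.

```python
from fractions import Fraction

MODES = ("forward_dynamics", "inverse_dynamics", "policy")

# Which loss terms each mode contributes, per the description above
# (policy contributing both terms is an assumption).
HAS_VISION_LOSS = {"forward_dynamics": True, "inverse_dynamics": False, "policy": True}
HAS_ACTION_LOSS = {"forward_dynamics": False, "inverse_dynamics": True, "policy": True}


def active_fraction(has_loss):
    """Fraction of samples with a nonzero loss term under uniform mode sampling."""
    return Fraction(sum(has_loss[m] for m in MODES), len(MODES))


vision_fraction = active_fraction(HAS_VISION_LOSS)
action_fraction = active_fraction(HAS_ACTION_LOSS)

print(f"vision loss active on {vision_fraction} of samples")
print(f"action loss active on {action_fraction} of samples")
```

Under these assumptions each loss term is nonzero on only 2/3 of the samples, so its batch average is about one third lower than in a policy-only run.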

fwd4 force-pushed the droid-action-shuffle branch from f786168 to 8eec346 on June 12, 2026 03:25
fwd4 requested review from lfengad, mli0603 and ychao-nvidia on June 12, 2026 03:38
mli0603 (Collaborator) commented Jun 12, 2026

LGTM

mli0603 enabled auto-merge (squash) on June 12, 2026 05:06

lfengad (Collaborator) left a comment

overall LGTM

Comment thread cosmos_framework/data/vfm/action/datasets/action_sft_dataset.py
lfengad previously approved these changes Jun 12, 2026
…ebased on main)

Rebased onto current main. main NVIDIA#34 upstreamed the DROID dataset (joint_pos,
use_state, keep-ranges filter, action_space) so droid_lerobot_dataset.py now
carries only the get_shuffle_blocks helper grafted onto main's version; NVIDIA#29's
recipe change (dropped /cluster override) is incorporated.

Remaining contribution: action_policy_droid_nano recipe (mode=policy,
lr=2e-4 @ 8192 global, max_num_tokens_after_packing=-1, scrubbed comments),
the episode-shuffle stream (action_sft_dataset.py), the multi-node-capable
SFT launcher (NNODES/NODE_RANK/MASTER_ADDR passthrough + EXTRA_TAIL_OVERRIDES),
and the post-train doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
fwd4 force-pushed the droid-action-shuffle branch from ae3db20 to f34031a on June 17, 2026 04:40
…None

joint_pos uses raw (un-normalized) joint actions, so DROIDLeRobotDataset sets
action_normalization=None — but _build_result called normalize_action()
unconditionally, which raises 'Unknown normalization method: None'. Guard it so
None means raw actions (caught by a 2-node sanity run on the rebased branch).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
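The guard described in the commit message above might look roughly like the following. This is a sketch under assumptions: the signature of `normalize_action`, the `"mean_std"` method name, and the wrapper `maybe_normalize_action` are hypothetical stand-ins, not the repository's actual code; only the error message text is taken from the commit message.

```python
def normalize_action(action, method, stats):
    """Hypothetical stand-in for the real helper; unknown methods raise."""
    if method == "mean_std":
        return [(a - stats["mean"]) / stats["std"] for a in action]
    raise ValueError(f"Unknown normalization method: {method}")


def maybe_normalize_action(action, method, stats=None):
    """Return raw actions when method is None; otherwise normalize them."""
    if method is None:
        # joint_pos uses raw (un-normalized) joint actions.
        return action
    return normalize_action(action, method, stats)


raw = maybe_normalize_action([1.0, 2.0], None)
scaled = maybe_normalize_action([1.0, 2.0], "mean_std", {"mean": 1.0, "std": 2.0})
print(raw, scaled)
```

Treating `None` as "pass through" keeps the raising behavior for genuinely unknown method names while making the raw-action configuration valid.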
…reference

The bare recipe trained with the NANO default loss_scale=1.0, weighting the vision
flow-matching loss 10x lower than the Cosmos3-Nano-Policy-DROID reference (which uses
10.0). Set it post-construction so the recipe reproduces without launcher overrides.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
Comment thread examples/launch_sft_action_policy_droid.sh Outdated
Comment thread examples/launch_sft_action_policy_droid.sh Outdated
Comment thread docs/action_policy_droid_posttrain.md Outdated
Comment thread docs/action_policy_droid_posttrain.md Outdated
Comment thread docs/action_policy_droid_posttrain.md Outdated
Comment thread docs/action_policy_droid_posttrain.md Outdated
Comment thread docs/action_policy_droid_posttrain.md Outdated
Comment thread docs/action_policy_droid_posttrain.md Outdated
fwd4 added a commit to fwd4/cosmos-framework that referenced this pull request Jun 20, 2026
…ks, structure)

Apply ychao-nvidia's review suggestions: rename DROID->Cosmos3-DROID (nvidia/Cosmos3-DROID
HF dataset) + Cosmos3-Nano is a 16B MoT, add HF links; restructure doc (## Prerequisites,
drop the obsolete 'Dataset: to be released' section + the redundant shuffle/filter blockquote);
recipe table (init=Cosmos3-Nano, keep_ranges_1_0_1.json blob link, global batch 64/rank x 128
ranks); convert-cmd arg order; EXTRA_TAIL_OVERRIDES formatting + TOML link; launcher DATASET_PATH
desc + dataset-check message -> Cosmos3-DROID; recipe filter comment -> keep_ranges_1_0_1.json.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
fwd4 added a commit to fwd4/cosmos-framework that referenced this pull request Jun 21, 2026
- Cosmos3-DROID dataset naming + HF links; doc restructured
  (Prerequisites / Inputs You Provide / Recipe / Full Reproduction / Checkpoints).
- Launcher: DATASET_PATH + EXTRA_DATASET_CHECK for the Cosmos3-DROID success dir.
- State multi-node GB200 validation at 8192 global batch (drop H200-specific notes).
- Describe max_samples_per_batch precisely: samples packed into each per-rank batch
  (num_workers x prefetch_factor workers decode in parallel to feed it); spell out
  global batch = max_samples_per_batch x world size x grad_accum_iter.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
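The global-batch formula spelled out in the commit message above can be checked with a one-line helper. The figures of 64 samples per rank and 128 ranks come from the recipe table mentioned in the earlier commit message; `grad_accum_iter=1` is an assumption chosen because it is consistent with the 8192 global batch, and the helper name is hypothetical.

```python
def global_batch_size(max_samples_per_batch, world_size, grad_accum_iter=1):
    """global batch = max_samples_per_batch x world size x grad_accum_iter."""
    return max_samples_per_batch * world_size * grad_accum_iter


# 64 samples per rank x 128 ranks, assuming no gradient accumulation.
batch = global_batch_size(max_samples_per_batch=64, world_size=128)
print(batch)
```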
fwd4 force-pushed the droid-action-shuffle branch from 86be70e to d5ab046 on June 21, 2026 08:16
fwd4 force-pushed the droid-action-shuffle branch from d5ab046 to 3f47229 on June 21, 2026 12:51

ychao-nvidia (Collaborator) left a comment

LGTM!

mli0603 merged commit 300faa1 into NVIDIA:main on Jun 22, 2026
7 checks passed

4 participants