Bump Megatron-LM to miles-main-20260622 (latest NVIDIA dev) #1466

Open
yueming-yuan wants to merge 5 commits into main from megatron-bump-20260622

@yueming-yuan

Collaborator

Rebases miles' custom Megatron-LM changes onto the latest NVIDIA/Megatron dev.

Megatron side (radixark/Megatron-LM:miles-main-20260622, off nvidia/dev)

  • 13 miles-specific PRs re-applied onto new dev (skipping parts already upstreamed).
  • DeepSeek-V4 (dsv4): re-applied PR #28's mHC 4-stream-residual slice (config + transformer_block/layer + p2p/schedules + mappings + Float16Module fp32 snapshot), reconciled with dev's evolved layer structure and its native dsv4_hybrid. Native infra (hash routing, sqrtsoftplus, dsa, input_ids) is reused.

Miles side (this PR)

  • docker/Dockerfile: MEGATRON_BRANCH -> miles-main-20260622.
  • dsv4 plugin/script: dev-native config names (csa_*, o_*, moe_n_hash_layers); kept dsv4_hc_* (mHC precision-branched).
  • dev API drift fixes: tokenizer import move, enable_gloo_process_groups -> use_gloo_process_groups, DeepSeekV4Attention(name=).

Validation

DSv4 4-layer prune via run-megatron vs old-megatron baseline: mHC + attention numerically exact (attn_output / 4-stream residual rel-L2 = 0); logprob mean within 8e-4 (residual diffs confined to dev-native MoE nondeterminism).
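
For reference, the two comparison metrics quoted above could be computed along these lines. This is a minimal sketch, not the validation script used for this PR: `rel_l2` and `mean_gap` are hypothetical helper names, inputs are flattened to plain Python lists, and reading "logprob mean within 8e-4" as the absolute difference of the two means is an assumption.

```python
import math


def rel_l2(candidate, reference):
    """Relative L2 error: ||candidate - reference|| / ||reference||."""
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference must have the same length")
    diff = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    norm = math.sqrt(sum(r ** 2 for r in reference))
    if norm == 0.0:
        # Avoid dividing by zero: only an exact match counts as zero error.
        return 0.0 if diff == 0.0 else math.inf
    return diff / norm


def mean_gap(candidate, reference):
    """Absolute difference between the two means (assumed reading of 'mean within tol')."""
    return abs(sum(candidate) / len(candidate) - sum(reference) / len(reference))
```

Under this reading, "numerically exact" corresponds to `rel_l2(new, old) == 0` on attn_output and on the 4-stream residual, and the logprob check corresponds to `mean_gap(new, old) <= 8e-4`.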

Megatron dev reimplemented PR #6's MTP-in-RL support natively:
- process_mtp_loss derives MTP labels from input_ids when labels is None (RL).
- config.mtp_detach_heads detaches output head + MTP embedding gradients.
So on the miles side: set config.mtp_detach_heads=True when enable_mtp_training is set,
and stop passing the now-unsupported mtp_kwargs to GPTModel.forward (passing labels=None
and letting Megatron derive labels from input_ids is equivalent to mtp_labels=batch['tokens']).
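
A minimal sketch of the miles-side change described above. The helpers `apply_mtp_settings` and `build_forward_kwargs` are hypothetical names used for illustration only, and a `SimpleNamespace` stands in for the real Megatron config object; only `mtp_detach_heads`, `enable_mtp_training`, `input_ids`, `labels`, and `batch['tokens']` come from this commit message.

```python
from types import SimpleNamespace


def apply_mtp_settings(config, enable_mtp_training):
    """Turn on head detaching only when MTP training is enabled."""
    if enable_mtp_training:
        config.mtp_detach_heads = True
    return config


def build_forward_kwargs(batch):
    """Hypothetical helper: RL-path forward inputs, with no mtp_kwargs.

    labels stays None so that MTP labels are derived from input_ids.
    """
    return {"input_ids": batch["tokens"], "labels": None}


# Stand-in config object; the real one is Megatron's transformer config.
config = apply_mtp_settings(
    SimpleNamespace(mtp_detach_heads=False), enable_mtp_training=True
)
kwargs = build_forward_kwargs({"tokens": [1, 2, 3]})
```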

Run miles' custom dsv4 attention on the bumped Megatron (miles-main-20260622):
- rename plugin/script config to dev-native names: csa_window_size, csa_compress_ratios,
  csa_compress_rotary_base, o_groups, o_lora_rank, moe_n_hash_layers (mbridge). Keep dsv4_hc_*
  (mHC precision-branched). Script uses --csa-compress-ratios "[..]" string form; drop
  --no-activation-func-clamp-shared-expert.
- dev API drift: tokenizer _vocab_size_with_padding moved to megatron.core.tokenizers;
  enable_gloo_process_groups -> use_gloo_process_groups; DeepSeekV4Attention accepts name=.
- run_megatron worker: build forward-only model without DDP (wrap_with_ddp=run_backward).
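
The argument rename in the list above can be illustrated with a small compatibility shim. This is a sketch only: `rename_gloo_kwarg` is a hypothetical helper, not code from this PR, and it assumes the arguments arrive as a plain dict.

```python
def rename_gloo_kwarg(kwargs):
    """Return a copy with enable_gloo_process_groups renamed to use_gloo_process_groups."""
    out = dict(kwargs)  # do not mutate the caller's dict
    if "enable_gloo_process_groups" in out:
        old_value = out.pop("enable_gloo_process_groups")
        # If the new name is already present, it takes precedence.
        out.setdefault("use_gloo_process_groups", old_value)
    return out
```

Callers that already pass `use_gloo_process_groups` are left unchanged, so the shim can be applied unconditionally.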

Validated against the old-megatron baseline (run-megatron, 4-layer prune): mHC+attention
numerically exact, logprob mean within 8e-4.

… on new Megatron

convert_hf_to_torch_dist auto-forces PP=world_size when run on more than one GPU, and the
bumped Megatron asserts that hash-MoE layers with PP>1 need an explicit
pipeline_model_parallel_layout. The 4-layer prune converts fine at PP=1 on 1 GPU (validated).
Full Flash/Pro use explicit multi-PP convert configs (their PP>1 hash-MoE convert layout is a
separate follow-up).
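
The constraint described above can be sketched as a pre-flight check. This is an illustration under assumptions, not Megatron's actual assertion: `check_convert_layout` and its parameters are hypothetical names, and the layout is treated as an opaque value that is either set or None.

```python
def check_convert_layout(pp_size, has_hash_moe_layers, pipeline_layout=None):
    """Raise if hash-MoE layers are combined with PP>1 and no explicit layout."""
    if pp_size < 1:
        raise ValueError("pp_size must be >= 1")
    if pp_size > 1 and has_hash_moe_layers and pipeline_layout is None:
        raise ValueError(
            "hash-MoE layers with PP>1 need an explicit pipeline_model_parallel_layout"
        )
    return True
```

With this shape, the 1-GPU, PP=1 conversion passes without a layout, while a multi-GPU run that auto-forces PP=world_size fails early unless a layout is supplied.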
… lora test mock

Matches the dev Megatron arg rename used in model.py (use_gloo_process_groups).