Bump Megatron-LM to miles-main-20260622 (latest NVIDIA dev) #1466
Open
yueming-yuan wants to merge 5 commits into
Megatron dev reimplemented PR #6's MTP-in-RL support natively:
- process_mtp_loss derives MTP labels from input_ids when labels is None (RL).
- config.mtp_detach_heads detaches output head + MTP embedding gradients.

So on the miles side: set config.mtp_detach_heads=True when enable_mtp_training is set, and stop passing the now-unsupported mtp_kwargs to GPTModel.forward (labels=None + input_ids derivation is equivalent to mtp_labels=batch['tokens']).
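The miles-side change described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `configure_mtp_for_rl` and `build_forward_kwargs` are hypothetical helper names, and a `SimpleNamespace` stands in for the real Megatron config object. Only `mtp_detach_heads`, `enable_mtp_training`, `labels`, `input_ids`, and `batch['tokens']` come from the commit message.

```python
from types import SimpleNamespace


def configure_mtp_for_rl(config, enable_mtp_training):
    """Hypothetical helper: turn on the dev-native detach flag for MTP training."""
    if enable_mtp_training:
        # Detaches output head + MTP embedding gradients (per the commit message).
        config.mtp_detach_heads = True
    return config


def build_forward_kwargs(batch):
    """Hypothetical helper: forward kwargs without the removed mtp_kwargs.

    With labels=None, the bumped Megatron is described as deriving MTP labels
    from input_ids, so no separate mtp_labels entry is passed.
    """
    return {"input_ids": batch["tokens"], "labels": None}


# Stand-in config object; the real one is a Megatron config instance.
config = configure_mtp_for_rl(
    SimpleNamespace(mtp_detach_heads=False), enable_mtp_training=True
)
kwargs = build_forward_kwargs({"tokens": [1, 2, 3]})
```

The point of the sketch is that the RL path no longer needs a miles-specific keyword: the flag moves into the config and the labels are left as None.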
Run miles' custom dsv4 attention on the bumped Megatron (miles-main-20260622):
- rename plugin/script config to dev-native names: csa_window_size, csa_compress_ratios, csa_compress_rotary_base, o_groups, o_lora_rank, moe_n_hash_layers (mbridge). Keep dsv4_hc_* (mHC precision-branched). Script uses --csa-compress-ratios "[..]" string form; drop --no-activation-func-clamp-shared-expert.
- dev API drift: tokenizer _vocab_size_with_padding moved to megatron.core.tokenizers; enable_gloo_process_groups -> use_gloo_process_groups; DeepSeekV4Attention accepts name=.
- run_megatron worker: build forward-only model without DDP (wrap_with_ddp=run_backward).

Validated against the old-megatron baseline (run-megatron, 4-layer prune): mHC+attention numerically exact, logprob mean within 8e-4.
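To illustrate what a "string form" list argument looks like, here is a minimal sketch of parsing such a flag with the standard library. It is an assumption that the value is a list of integers, the ratios shown are made up, and `parse_int_list` is a hypothetical helper; how Megatron itself parses `--csa-compress-ratios` is not shown in this PR text.

```python
import argparse
import ast


def parse_int_list(text):
    """Hypothetical helper: parse a string such as "[1, 4, 4]" into a list of ints."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise argparse.ArgumentTypeError(f"not a valid list literal: {text!r}") from exc
    is_int_list = isinstance(value, list) and all(
        isinstance(v, int) and not isinstance(v, bool) for v in value
    )
    if not is_int_list:
        raise argparse.ArgumentTypeError(f"expected a list of ints, got {text!r}")
    return value


parser = argparse.ArgumentParser()
parser.add_argument("--csa-compress-ratios", type=parse_int_list, default=None)

# Illustrative values only; the real ratios are elided in the commit message.
args = parser.parse_args(["--csa-compress-ratios", "[1, 4, 4, 1]"])
```

Passing the whole list as one quoted string keeps the shell invocation to a single argument, which is why the launch script has to quote it.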
… on new Megatron

convert_hf_to_torch_dist auto-forces PP=world_size for >1 GPU; the bumped Megatron asserts that hash-MoE layers with PP>1 need an explicit pipeline_model_parallel_layout. The 4-layer prune converts fine at PP=1 on 1 GPU (validated). Full Flash/Pro use explicit multi-PP convert configs (their PP>1 hash-MoE convert layout is a separate follow-up).
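The constraint described in this commit can be pictured with a small guard like the one below. This is only a sketch that mirrors the described behavior; `check_convert_layout` and its parameters are hypothetical and are not the converter's or Megatron's actual code.

```python
def check_convert_layout(world_size, has_hash_moe_layers, pipeline_layout=None):
    """Hypothetical guard mirroring the conversion constraint described above.

    Returns the pipeline-parallel size the converter would use, or raises if
    hash-MoE layers would be split across stages without an explicit layout.
    """
    # The converter is described as forcing PP = world_size on more than 1 GPU.
    pp_size = world_size if world_size > 1 else 1
    if has_hash_moe_layers and pp_size > 1 and pipeline_layout is None:
        raise ValueError(
            "hash-MoE layers with PP>1 need an explicit pipeline_model_parallel_layout"
        )
    return pp_size


# Single GPU: PP stays at 1, so no explicit layout is needed.
pp_single = check_convert_layout(world_size=1, has_hash_moe_layers=True)
```

On a single GPU the check passes trivially, which matches the note that the 4-layer prune converts at PP=1.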
… lora test mock

Matches the dev Megatron arg rename used in model.py (use_gloo_process_groups).
Rebases miles' custom Megatron-LM changes onto the latest NVIDIA/Megatron dev.

Megatron side (radixark/Megatron-LM:miles-main-20260622, off nvidia/dev)
- … dsv4): re-applied PR #28's mHC 4-stream-residual slice (config + transformer_block/layer + p2p/schedules + mappings + Float16Module fp32 snapshot), reconciled with dev's evolved layer structure and its native dsv4_hybrid. Native infra (hash routing, sqrtsoftplus, dsa, input_ids) reused.

Miles side (this PR)
- docker/Dockerfile: MEGATRON_BRANCH -> miles-main-20260622.
- Plugin/script config renamed to dev-native names (csa_*, o_*, moe_n_hash_layers); kept dsv4_hc_* (mHC precision-branched).
- Dev API drift: enable_ -> use_gloo_process_groups, DeepSeekV4Attention(name=).

Validation
- DSv4 4-layer prune via run-megatron vs the old-megatron baseline: mHC + attention numerically exact (attn_output / 4-stream residual rel-L2 = 0); logprob mean within 8e-4 (residual diffs confined to dev-native MoE nondeterminism).
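For readers unfamiliar with the metrics quoted in the validation note, the sketch below shows one common way to compute a relative L2 error and a gap between mean logprobs. It assumes "rel-L2" means ||test - ref|| / ||ref|| and that "logprob mean within 8e-4" refers to the absolute difference of the two means; the PR does not spell out either definition, and the helper names and numbers are illustrative.

```python
import math


def rel_l2(test, ref):
    """Relative L2 error ||test - ref|| / ||ref|| over two equal-length sequences."""
    num = math.sqrt(sum((t - r) ** 2 for t, r in zip(test, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    # Fall back to the absolute error if the reference is all zeros.
    return num / den if den > 0 else num


def mean_gap(test, ref):
    """Absolute difference between the means of two sequences."""
    return abs(sum(test) / len(test) - sum(ref) / len(ref))


# Made-up numbers, only to show the shape of the check.
attn_err = rel_l2([0.5, -1.25, 2.0], [0.5, -1.25, 2.0])
logprob_gap = mean_gap([-1.0005, -2.0005], [-1.0, -2.0])
```

A rel-L2 of exactly 0 means the two outputs are identical element for element, which is a stronger statement than matching within a tolerance.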