[MiniMax M3] Enable and Optimize the MiniMax M3 Eagle #1333
Merged
Can you put the accuracy and performance test results in a comment? Are they the same as in the recipe?
f7d612b to 9fc4833
Comment on lines +197 to +199
    model_path=amd/MiniMax-M3-MXFP4
    model_path=MiniMaxAI/MiniMax-M3-MXFP8
    BS=65
Comment on lines +665 to 667
    if aux_hidden_states:
        return hidden_states, aux_hidden_states
    return hidden_states
Comment on lines +242 to +251
    logits = tgemm.mm(x, self.weight, self.bias)  # [N, vocab/tp]
    if self.tp_size <= 1:
        return logits.argmax(dim=-1)
    # Pack (val, idx) as fp32 (idx < 2^24 is exact) and all-gather only the
    # per-rank reductions ([N, 2]) instead of the full logits.
    packed = lm_head_argmax_pack(logits, self.vocab_start_idx)
    gathered = get_tp_group().all_gather(packed, dim=0).view(self.tp_size, -1, 2)
    winner = gathered[:, :, 0].argmax(dim=0)  # [N] winning rank (ties -> lowest)
    token = gathered[:, :, 1].gather(0, winner.unsqueeze(0)).squeeze(0)  # [N] fp32
    return token.to(torch.long)
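The reduction in the snippet above can be simulated without GPUs or a process group. The following is a minimal pure-Python sketch, not the PR's implementation: `to_fp32`, `pack_rank_max`, and `distributed_argmax` are hypothetical names, the list-of-lists shard layout is an assumption, and a plain list stands in for the all-gather. It also checks the claim in the code comment that indices below 2^24 survive a round trip through fp32.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to fp32 precision via a 4-byte round trip."""
    return struct.unpack("f", struct.pack("f", x))[0]


def pack_rank_max(shard, vocab_start_idx):
    """Per-rank reduction: one (max value, global index) pair per row, as fp32."""
    packed = []
    for row in shard:
        # max() returns the first maximal element, so ties pick the lowest index.
        local = max(range(len(row)), key=row.__getitem__)
        packed.append((to_fp32(row[local]), to_fp32(float(vocab_start_idx + local))))
    return packed


def distributed_argmax(shards):
    """shards[r] is rank r's [N, vocab/tp] slice (hypothetical layout)."""
    shard_width = len(shards[0][0])
    # Stand-in for the all-gather: every rank contributes only its [N, 2] pairs.
    gathered = [pack_rank_max(s, r * shard_width) for r, s in enumerate(shards)]
    tokens = []
    for n in range(len(shards[0])):
        # Ties across ranks resolve to the lowest rank, as in the reviewed code.
        winner = max(range(len(gathered)), key=lambda r: gathered[r][n][0])
        tokens.append(int(gathered[winner][n][1]))
    return tokens


# Two rows of logits over a vocabulary of 6, split across 3 ranks.
shards = [
    [[0.1, 2.0], [4.0, 1.0]],   # rank 0 owns vocab ids 0..1
    [[-1.0, 0.5], [4.0, 0.0]],  # rank 1 owns vocab ids 2..3
    [[3.5, 0.0], [2.0, 4.0]],   # rank 2 owns vocab ids 4..5
]
result = distributed_argmax(shards)
```

Row 1 has the same maximum on all three ranks, so it exercises the tie-breaking rule: the lowest rank wins, which matches what a plain argmax over the full row would return.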
Bring the model-agnostic / draft-side MiniMax-M3 EAGLE3 work from wuhuikx/atom-m3-bf16-to-main (2f1c385). These files' pre-eagle base is byte-identical to current main, so they port as-is:

- eagle3_llama.py / eagle3_deepseek_mla.py: draft fusions (fused dual-RMSNorm + concat, fused group-RMSNorm aux, AR+RMSNorm fusion), compute_draft_token, replicated-embed option.
- fused_aux_rmsnorm.py (new): the fused RMSNorm kernels for the draft.
- lm_head_argmax.py (new) + embed_head.py: distributed greedy argmax (all-gather [N,2] per-rank maxima instead of full [N,vocab] logits).
- spec_decode/eagle.py: draft loop with distributed-argmax fast path, no-pre-concat aux, and Eagle3 MHA draft KV-cache transfer for PD disaggregation (from #1331).
- envs.py: ATOM_EAGLE_REPLICATE_EMBED.
- tests/test_lm_head_argmax.py (new, importorskip(aiter) for the no-aiter CI).

Target-side enablement (aux-hidden capture in minimax_m3, q>1 spec-verify metadata, prepare_mtp_decode) follows in Phase 2; note that eagle.py now references attn_metadata_builder.prepare_mtp_decode, which Phase 2 adds.

Mocked suite: 437 passed / 38 pre-existing failures / +1 new skip; no regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
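To make the communication saving of the distributed greedy argmax concrete, here is a small arithmetic sketch. The function name `gathered_elements` and the example sizes (4 tokens, a 200,000-entry vocabulary, TP=8) are illustrative assumptions, not figures from this PR or from the MiniMax-M3 configuration.

```python
def gathered_elements(num_tokens: int, vocab_size: int, tp_size: int):
    """Count elements moved by an all-gather under each scheme."""
    # Full logits: each of tp_size ranks contributes its [N, vocab/tp] slice.
    full = num_tokens * vocab_size
    # Packed maxima: each rank contributes one (val, idx) pair per token.
    packed = tp_size * num_tokens * 2
    return full, packed


# Hypothetical sizes chosen only for illustration.
full, packed = gathered_elements(num_tokens=4, vocab_size=200_000, tp_size=8)
ratio = full // packed
```

With these assumed sizes the packed scheme gathers 64 elements instead of 800,000. The packed volume grows with the number of ranks and tokens but is independent of the vocabulary size.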
Enable MiniMax-M3 EAGLE3 on current main's (Triton-sparse) M3 base, adapting the target side to main's API instead of wuhuikx's asm/gluon infra (absent on main).

aiter_attention.py:
- Add the generic block-paged MHA Eagle3 draft metadata: _mtp_prepare_decode_metadata_kernel + prepare_mtp_decode + fuse_mtp_decode_position_update (used by the migrated eagle.py for both Kimi and M3 drafts; not M3-sparse coupled).
- Replace the two "speculative decode not supported" NotImplementedError sites: route q>1 spec-verify through the sparse PREFILL path (make_sparse_prefill_metadata; per-query causal via cu_seqlens_q, which is now filled uniformly for q>1). prefix_lens is bound to a new persistent sparse_prefix_lens buffer so the CUDAGraph-captured sparse indexer reads live causal lengths on each replay.

minimax_m3.py: Eagle3 aux-hidden-state capture (Dynamo-safe, mirrors deepseek_v2): aux_hidden_state_layers, in-layer residual.clone() after the fused-allreduce norm, model forward returns (hidden, aux) tuple, set/get_eagle3_aux_hidden_state_layers on the ForCausalLM + VL-wrapper delegation.

model_runner.py: extend KV transfer regions with the Eagle3 draft pool for PD disaggregation (#1331).

scheduler.py: trim emitted spec tokens past the stop position (rejection sampler emits past EOS) so flexible-extract doesn't pick up leaked trailing tokens.

recipes/MiniMax-M3.md: full EAGLE3 section (with a note that the ASM-PA/fp8/MXFP8 specifics reflect the fully-optimized variant, not this Triton-sparse base).

Drop tests/test_lm_head_argmax.py (per request).

Note: the q>1 sparse-verify path is new on main and CUDAGraph-sensitive; it needs GPU validation (GSM8K + accept on TP4/TP8; confirm Kimi eagle unaffected).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
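The scheduler change described above (trimming speculative tokens emitted past the stop position) can be illustrated with a short sketch. This is not the code from scheduler.py: `trim_past_stop` is a hypothetical helper, and keeping the stop token itself in the output is an assumption made for the example.

```python
def trim_past_stop(spec_tokens, stop_token_ids):
    """Drop every token that follows the first stop token in an accepted run.

    Assumption: the stop token itself is kept; only the trailing tokens that
    the rejection sampler emitted after it are removed.
    """
    for i, tok in enumerate(spec_tokens):
        if tok in stop_token_ids:
            return spec_tokens[: i + 1]
    # No stop token in this step: the accepted tokens pass through unchanged.
    return spec_tokens


# Token id 2 plays the role of EOS here (hypothetical id).
trimmed = trim_past_stop([5, 7, 2, 9, 9], stop_token_ids={2})
untouched = trim_past_stop([5, 7, 9], stop_token_ids={2})
```

Without the trim, the two trailing `9` tokens in the first call would reach the output text, which is the kind of leak that can confuse an answer extractor reading the end of a generation.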
make lint happy Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
main's MiniMaxM3Attention (dense layers) does not set force_triton_attn in code and attention_mha has no block-128 guard, so on this base the dense attention is routed to Triton only via ATOM_FORCE_ATTN_TRITON=1 (the MXFP4 base section already sets it). The EAGLE section migrated from wuhuikx omitted it (wuhuikx set force_triton_attn=True in code instead), so the spec-verify dense attention (q=num_spec+1) fell into paged_attention_asm and aborted in get_heuristic_kernel (no bf16 block-128 ASM-PA kernel).

Add the env to the EAGLE launch and drop the stale MXFP8 model_path line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b481f3d to 3ee0b21
Comment on lines +197 to +198
    model_path=amd/MiniMax-M3-MXFP4
    model_path=MiniMaxAI/MiniMax-M3-MXFP8
@@ -672,9 +684,11 @@ def forward(
    hidden_states = intermediate_tensors["hidden_states"]
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
valarLip approved these changes Jun 25, 2026
yhl-amd approved these changes Jun 25, 2026
FP4 M3, 8k/1k, eagle model https://huggingface.co/Inferact/MiniMax-M3-EAGLE3