[MiniMax M3] Enable and Optimize the MiniMax M3 Eagle#1333

Merged
valarLip merged 10 commits into
mainfrom
zejun/enable_and_opt_minimax_m3_eagle_0623
Jun 25, 2026

@zejunchen-zejun

@zejunchen-zejun zejunchen-zejun commented Jun 24, 2026

Collaborator
  1. Enable EAGLE for MiniMax M3.
  2. Enable PD disaggregation for M3 EAGLE.
  3. Trim tokens emitted past EOS from the sequence before returning it to the frontend user.
  4. Add a fused kernel for prepare_mtp_decode in EAGLE.
  5. Add a fused kernel for the token-local argmax in EAGLE.
  6. Replicate the vocab embedding to reduce communication in EAGLE.
  7. Fuse the triple RMSNorm for aux hidden states in EAGLE.
  8. Fuse allreduce and RMSNorm in the Llama EAGLE draft.
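
As a rough illustration of item 6, the sketch below contrasts a vocab-parallel embedding lookup, which typically needs an all-reduce to combine per-rank partial results, with a replicated table where the lookup stays local. It is a pure-Python toy under stated assumptions: `make_table`, `vocab_parallel_lookup`, and `replicated_lookup` are hypothetical names, and the collective is only simulated by a sum over ranks. It does not reflect the actual ATOM code path.

```python
# Toy comparison of a vocab-parallel embedding lookup and a replicated one.
# All names are illustrative; the "all-reduce" is simulated in-process.

def make_table(vocab_size, hidden):
    """Deterministic toy embedding table: row t is [t, t + 0.5, t + 1.0, ...]."""
    return [[t + 0.5 * h for h in range(hidden)] for t in range(vocab_size)]


def vocab_parallel_lookup(table, token, tp_size):
    """Each rank owns a contiguous vocab shard. Ranks that do not own the
    token contribute zeros, and a sum over ranks combines the partials."""
    shard = len(table) // tp_size  # assumes vocab is divisible by tp_size
    hidden = len(table[0])
    partials = []
    for rank in range(tp_size):
        start, end = rank * shard, (rank + 1) * shard
        if start <= token < end:
            partials.append(table[token])
        else:
            partials.append([0.0] * hidden)
    combined = [sum(col) for col in zip(*partials)]  # simulated all-reduce
    return combined, 1  # one collective per lookup


def replicated_lookup(table, token):
    """Every rank holds the full table, so no collective is needed."""
    return list(table[token]), 0


table = make_table(vocab_size=8, hidden=4)
sharded, sharded_collectives = vocab_parallel_lookup(table, token=5, tp_size=4)
local, local_collectives = replicated_lookup(table, token=5)
print(sharded == local, sharded_collectives, local_collectives)
```

The trade-off in this toy model is memory for communication: each rank stores the whole table, and the per-step collective disappears.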

| Model | Mode | flexible_extract | strict_match | Acceptance rate | Status |
|---|---|---|---|---|---|
| MiniMax-M3-MXFP4 | EAGLE | 0.9462 | 0.9469 | 73.56% (90483/123000, avg_toks_fwd=3.21) | PASS |
| MiniMax-M3-MXFP4 | non-EAGLE | 0.9462 | 0.9469 | N/A | PASS |
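
The acceptance figures in the EAGLE row can be checked with a few lines of arithmetic. The rate follows directly from the reported counts. The relation used for `avg_toks_fwd` (one target token plus the accepted share of the draft tokens) and the value `num_spec = 3` are assumptions that happen to reproduce the reported 3.21; the PR does not state the number of speculative tokens.

```python
# Reported counts from the EAGLE row.
accepted, proposed = 90483, 123000

acceptance_rate = accepted / proposed
print(f"acceptance rate: {acceptance_rate:.2%}")

# ASSUMPTION: 3 draft tokens per step, and each forward pass yields
# 1 target token plus the accepted draft tokens on average.
num_spec = 3
avg_tokens_per_forward = 1 + num_spec * acceptance_rate
print(f"avg tokens per forward: {avg_tokens_per_forward:.2f}")
```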

FP4 M3, 8k/1k, eagle model https://huggingface.co/Inferact/MiniMax-M3-EAGLE3

| Concurrency | Non-Eagle Total tok/s | Eagle Total tok/s | Eagle Uplift |
|---|---|---|---|
| 4 | 4,688.40 | 7,653.56 | +63.24% |
| 8 | 7,795.90 | 11,146.00 | +42.97% |
| 16 | 11,890.78 | 16,928.43 | +42.37% |
| 32 | 17,195.82 | 21,132.49 | +22.89% |
| 64 | 23,466.78 | 26,857.80 | +14.45% |
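
The uplift column is the relative throughput gain of EAGLE over non-EAGLE at each concurrency. A short check that recomputes it from the two throughput columns (the helper name `uplift_pct` is illustrative only):

```python
# (concurrency, non-EAGLE total tok/s, EAGLE total tok/s) from the table.
rows = [
    (4, 4688.40, 7653.56),
    (8, 7795.90, 11146.00),
    (16, 11890.78, 16928.43),
    (32, 17195.82, 21132.49),
    (64, 23466.78, 26857.80),
]


def uplift_pct(baseline, eagle):
    """Relative gain of `eagle` over `baseline`, in percent."""
    return (eagle / baseline - 1.0) * 100.0


uplifts = [round(uplift_pct(base, eagle), 2) for _, base, eagle in rows]
for (concurrency, _, _), pct in zip(rows, uplifts):
    print(f"concurrency {concurrency:>2}: +{pct:.2f}%")
```

The gain shrinks as concurrency grows, from about 63% at 4 concurrent requests to about 14% at 64.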

@wuhuikx wuhuikx marked this pull request as ready for review June 24, 2026 05:33
@wuhuikx

wuhuikx commented Jun 24, 2026

Collaborator

Can you put the accuracy and performance test results into the comment? Are they the same as in the recipe?

Copilot AI left a comment

Contributor

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

@zejunchen-zejun zejunchen-zejun marked this pull request as draft June 24, 2026 09:26
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/enable_and_opt_minimax_m3_eagle_0623 branch 2 times, most recently from f7d612b to 9fc4833 Compare June 25, 2026 03:25
@zejunchen-zejun zejunchen-zejun marked this pull request as ready for review June 25, 2026 03:46
Copilot AI review requested due to automatic review settings June 25, 2026 03:46

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Comment thread recipes/MiniMax-M3.md
Comment on lines +197 to +199
model_path=amd/MiniMax-M3-MXFP4
model_path=MiniMaxAI/MiniMax-M3-MXFP8
BS=65
Comment thread atom/models/minimax_m3.py
Comment on lines +665 to 667
if aux_hidden_states:
    return hidden_states, aux_hidden_states
return hidden_states
Comment on lines +242 to +251
logits = tgemm.mm(x, self.weight, self.bias) # [N, vocab/tp]
if self.tp_size <= 1:
    return logits.argmax(dim=-1)
# Pack (val, idx) as fp32 — idx < 2^24 is exact — and all-gather only the
# per-rank reductions ([N, 2]) instead of the full logits.
packed = lm_head_argmax_pack(logits, self.vocab_start_idx)
gathered = get_tp_group().all_gather(packed, dim=0).view(self.tp_size, -1, 2)
winner = gathered[:, :, 0].argmax(dim=0) # [N] winning rank (ties -> lowest)
token = gathered[:, :, 1].gather(0, winner.unsqueeze(0)).squeeze(0) # [N] fp32
return token.to(torch.long)
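
The idea in the snippet above can be simulated without GPUs or a process group. The sketch below is a pure-Python stand-in, not the PR's kernel: `as_fp32`, `local_pack`, and `distributed_argmax` are hypothetical names, the all-gather is replaced by a list comprehension over ranks, and logits are plain Python floats. It shows that gathering one (value, index) pair per rank recovers the global argmax, and that integer indices below 2^24 survive a round trip through fp32.

```python
import struct


def as_fp32(x):
    """Round-trip a number through IEEE-754 single precision."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


def local_pack(shard_logits, vocab_start_idx):
    """Per-rank reduction: (max value, global index of that max stored as fp32)."""
    local_idx = max(range(len(shard_logits)), key=lambda i: shard_logits[i])
    return shard_logits[local_idx], as_fp32(vocab_start_idx + local_idx)


def distributed_argmax(logits, tp_size):
    """Global argmax over a vocab that is split evenly across tp_size ranks."""
    shard = len(logits) // tp_size  # assumes an even split
    # Simulated all-gather: one (val, idx) pair per rank, not a full shard.
    gathered = [
        local_pack(logits[r * shard:(r + 1) * shard], r * shard)
        for r in range(tp_size)
    ]
    # max() returns the first maximal element, so ties go to the lowest rank.
    winner = max(range(tp_size), key=lambda r: gathered[r][0])
    return int(gathered[winner][1])


logits = [0.1, 2.5, -1.0, 0.3, 4.0, 3.9, 4.0, -2.0]
print(distributed_argmax(logits, tp_size=4), logits.index(max(logits)))

# Integers below 2**24 are exactly representable in fp32; 2**24 + 1 is not.
print(as_fp32(2**24 - 1) == 2**24 - 1, as_fp32(2**24 + 1) == 2**24 + 1)
```

In this model each rank sends 2 numbers per token instead of its full vocab shard, which is where the communication saving comes from.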
zejunchen-zejun and others added 9 commits June 25, 2026 21:46
Bring the model-agnostic / draft-side MiniMax-M3 EAGLE3 work from
wuhuikx/atom-m3-bf16-to-main (2f1c385). These files' pre-eagle base is
byte-identical to current main, so they port as-is:

- eagle3_llama.py / eagle3_deepseek_mla.py: draft fusions (fused dual-RMSNorm
  +concat, fused group-RMSNorm aux, AR+RMSNorm fusion), compute_draft_token,
  replicated-embed option.
- fused_aux_rmsnorm.py (new): the fused RMSNorm kernels for the draft.
- lm_head_argmax.py (new) + embed_head.py: distributed greedy argmax (all-gather
  [N,2] per-rank maxima instead of full [N,vocab] logits).
- spec_decode/eagle.py: draft loop with distributed-argmax fast path, no-pre-concat
  aux, and Eagle3 MHA draft KV-cache transfer for PD disaggregation (from #1331).
- envs.py: ATOM_EAGLE_REPLICATE_EMBED.
- tests/test_lm_head_argmax.py (new, importorskip(aiter) for the no-aiter CI).
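
For readers unfamiliar with the group-RMSNorm fusion named in the list above, here is a minimal reference sketch of the math only. It assumes each aux hidden-state group is normalized by its own statistics and then concatenated; the exact layout, weights, and epsilon used by `fused_aux_rmsnorm.py` are not given in this PR text, so the function names and signatures below are hypothetical.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Standard RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def group_rms_norm_unfused(groups, weights, eps=1e-6):
    """One RMSNorm call per group, then concatenate the results."""
    out = []
    for x, w in zip(groups, weights):
        out.extend(rms_norm(x, w, eps))
    return out


def group_rms_norm_single_pass(flat, flat_weight, group_size, eps=1e-6):
    """Single loop over the concatenated buffer. Each group still uses its
    own mean square, which is what a fused kernel would have to preserve."""
    out = []
    for start in range(0, len(flat), group_size):
        chunk = flat[start:start + group_size]
        w = flat_weight[start:start + group_size]
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in chunk) / group_size + eps)
        out.extend(v * inv_rms * wi for v, wi in zip(chunk, w))
    return out


# Three aux hidden-state groups of width 4 (toy values).
groups = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.25, 2.0], [10.0, 0.0, -10.0, 5.0]]
weights = [[1.0] * 4, [0.5] * 4, [2.0] * 4]
flat = [v for g in groups for v in g]
flat_weight = [w for ws in weights for w in ws]

ref = group_rms_norm_unfused(groups, weights)
fused = group_rms_norm_single_pass(flat, flat_weight, group_size=4)
print(max(abs(a - b) for a, b in zip(ref, fused)))
```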

Target-side enablement (aux-hidden capture in minimax_m3, q>1 spec-verify
metadata, prepare_mtp_decode) follows in Phase 2; note eagle.py now references
attn_metadata_builder.prepare_mtp_decode which Phase 2 adds.

Mocked suite: 437 passed / 38 pre-existing failures / +1 new skip — no regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Enable MiniMax-M3 EAGLE3 on current main's (Triton-sparse) M3 base, adapting the
target side to main's API instead of wuhuikx's asm/gluon infra (absent on main).

aiter_attention.py:
- Add the generic block-paged MHA Eagle3 draft metadata: _mtp_prepare_decode_
  metadata_kernel + prepare_mtp_decode + fuse_mtp_decode_position_update (used by
  the migrated eagle.py for both Kimi and M3 drafts; not M3-sparse coupled).
- Replace the two "speculative decode not supported" NotImplementedError sites:
  route q>1 spec-verify through the sparse PREFILL path (make_sparse_prefill_
  metadata; per-query causal via cu_seqlens_q, which is now filled uniformly for
  q>1). prefix_lens is bound to a new persistent sparse_prefix_lens buffer so the
  CUDAGraph-captured sparse indexer reads live causal lengths on each replay.
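
A small sketch of the metadata shape described above, under stated assumptions: every request in the verify batch carries the same number of query tokens `q_len` (for example num_spec + 1), `cu_seqlens_q` is the usual cumulative offset array, and `prefix_lens` counts the tokens already cached before the verify window. The helper names are hypothetical and this is not the `aiter_attention.py` implementation.

```python
def uniform_cu_seqlens_q(batch_size, q_len):
    """Cumulative query offsets when every request has q_len query tokens."""
    return [i * q_len for i in range(batch_size + 1)]


def per_query_causal_lens(prefix_lens, q_len):
    """Number of keys each query token may attend to under causal masking.

    ASSUMPTION: query j (0-based) of a request sees its cached prefix plus
    the verify tokens 0..j, i.e. prefix_len + j + 1 keys.
    """
    return [[p + j + 1 for j in range(q_len)] for p in prefix_lens]


q_len = 4  # e.g. 3 draft tokens + 1
prefix_lens = [10, 7]
print(uniform_cu_seqlens_q(len(prefix_lens), q_len))
print(per_query_causal_lens(prefix_lens, q_len))
```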

minimax_m3.py: Eagle3 aux-hidden-state capture (Dynamo-safe, mirrors deepseek_v2):
aux_hidden_state_layers, in-layer residual.clone() after the fused-allreduce norm,
model forward returns (hidden, aux) tuple, set/get_eagle3_aux_hidden_state_layers
on the ForCausalLM + VL-wrapper delegation.

model_runner.py: extend KV transfer regions with the Eagle3 draft pool for PD
disaggregation (#1331).

scheduler.py: trim emitted spec tokens past the stop position (rejection sampler
emits past EOS) so flexible-extract doesn't pick up leaked trailing tokens.
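
The trimming step can be pictured as follows. This is a minimal sketch, not the `scheduler.py` code: `trim_past_stop` is a hypothetical helper, and keeping the stop token itself in the output is an assumption.

```python
def trim_past_stop(new_tokens, stop_token_ids):
    """Cut a step's emitted tokens at the first stop token.

    Returns (kept_tokens, finished). ASSUMPTION: the stop token is kept and
    everything after it is dropped.
    """
    for i, tok in enumerate(new_tokens):
        if tok in stop_token_ids:
            return new_tokens[: i + 1], True
    return new_tokens, False


# A speculative step accepted 5 tokens, but token 2 (EOS here) came third.
print(trim_past_stop([5, 9, 2, 7, 7], {2}))
print(trim_past_stop([1, 2, 3], {99}))
```

With speculative decoding a single step can emit several tokens, so the stop check has to look inside the accepted run rather than only at its last token.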

recipes/MiniMax-M3.md: full EAGLE3 section (with a note that the ASM-PA/fp8/MXFP8
specifics reflect the fully-optimized variant, not this Triton-sparse base).

Drop tests/test_lm_head_argmax.py (per request).

Note: the q>1 sparse-verify path is new on main and CUDAGraph-sensitive — needs
GPU validation (GSM8K + accept on TP4/TP8; confirm Kimi eagle unaffected).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
make lint happy

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
main's MiniMaxM3Attention (dense layers) does not set force_triton_attn in code
and attention_mha has no block-128 guard, so on this base the dense attention is
routed to Triton only via ATOM_FORCE_ATTN_TRITON=1 (the MXFP4 base section already
sets it). The EAGLE section migrated from wuhuikx omitted it (wuhuikx set
force_triton_attn=True in code instead), so the spec-verify dense attention
(q=num_spec+1) fell into paged_attention_asm and aborted in get_heuristic_kernel
(no bf16 block-128 ASM-PA kernel). Add the env to the EAGLE launch and drop the
stale MXFP8 model_path line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/enable_and_opt_minimax_m3_eagle_0623 branch from b481f3d to 3ee0b21 Compare June 25, 2026 13:50
Copilot AI review requested due to automatic review settings June 25, 2026 13:50

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment thread recipes/MiniMax-M3.md
Comment on lines +197 to +198
model_path=amd/MiniMax-M3-MXFP4
model_path=MiniMaxAI/MiniMax-M3-MXFP8
Comment thread atom/models/minimax_m3.py
@@ -672,9 +684,11 @@ def forward(
hidden_states = intermediate_tensors["hidden_states"]
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
@valarLip valarLip merged commit 6e565c5 into main Jun 25, 2026
20 of 31 checks passed
@valarLip valarLip deleted the zejun/enable_and_opt_minimax_m3_eagle_0623 branch June 25, 2026 14:52
5 participants