Add MiniMax-M3 (MXFP4/AttnFP8) model support #1317
Open
thpereir wants to merge 1 commit into
The in-tree MiniMax-M3 model already covers the BF16 checkpoint. This adds the small pieces the quantized amd/MiniMax-M3-MXFP4-AttnFP8 build needs, without disturbing the BF16 path.

- config.py: register the minimax_m3_vl multimodal wrapper and parse its text sub-config (which declares no model_type) with the base PretrainedConfig so every field is retained and no deepseek/MLA defaults leak in; stamp model_type=minimax_m3 from the top-level type. The quark quantization_config (already propagated from the root) and the original architectures are preserved, so loading resolves to the existing MiniMaxM3Sparse model. The BF16 checkpoint keeps its direct minimax_m3 model_type and is unaffected.
- linear.py: pad the MXFP4 Linear contraction dim to 256. The a4w4 asm GEMM reads K in 256-wide tiles, so an unaligned K (M3's shared-expert down_proj at TP=8, K=384) faults on GPU. LinearBase._pad_mxfp4_input_dim() zero-pads the fp4x2 weight, its e8m0 scale, and the activation up to 256-alignment; no-op when already aligned.
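The config.py handling described above can be sketched with plain dictionaries. This is an illustration only, not the code in this PR: the `resolve_text_config` helper, the `WRAPPER_TO_TEXT_TYPE` table, and the `text_config` key name are assumptions made for the example, and the real change works on PretrainedConfig objects rather than dicts.

```python
import copy

# Hypothetical table: wrapper model_type -> model_type of the text model it wraps.
WRAPPER_TO_TEXT_TYPE = {"minimax_m3_vl": "minimax_m3"}


def resolve_text_config(root: dict) -> dict:
    """Return the config dict the text model should be built from."""
    text_type = WRAPPER_TO_TEXT_TYPE.get(root.get("model_type"))
    if text_type is None:
        # Not a registered wrapper (e.g. the BF16 checkpoint, whose
        # model_type is already minimax_m3): pass through untouched.
        return root

    # Copy every field of the sub-config verbatim, so nothing is dropped
    # and no defaults from an unrelated config class are filled in.
    text = copy.deepcopy(root["text_config"])  # key name is an assumption

    # The sub-config declares no model_type, so stamp it from the wrapper.
    text["model_type"] = text_type

    # Carry the root quantization_config down if the sub-config lacks one
    # (assumption: the PR says this propagation already happens elsewhere).
    if "quantization_config" in root:
        text.setdefault(
            "quantization_config", copy.deepcopy(root["quantization_config"])
        )
    return text
```

Copying the sub-config as-is, instead of routing it through a model-specific config class, is what keeps unrelated defaults out: only model_type is written, and every other field comes from the checkpoint.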
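The alignment arithmetic behind the padding can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the body of LinearBase._pad_mxfp4_input_dim(): the helper names are hypothetical, the 32-element scale block is the usual MXFP4 layout and is assumed here, and ordinary floats stand in for packed fp4x2 values. It shows that K=384 rounds up to 512, what the padded weight and scale shapes become, and why zero-padding both operands leaves the dot product unchanged.

```python
TILE_K = 256    # K tile width the PR attributes to the a4w4 asm GEMM
MX_BLOCK = 32   # elements sharing one e8m0 scale (assumed MXFP4 layout)


def round_up(k: int, multiple: int = TILE_K) -> int:
    """Smallest multiple of `multiple` that is >= k."""
    return -(-k // multiple) * multiple


def padded_shapes(n: int, k: int) -> dict:
    """Shapes after padding K: fp4x2 packs two values per byte."""
    k_pad = round_up(k)
    return {
        "k_pad": k_pad,
        "weight_bytes": (n, k_pad // 2),
        "scales": (n, k_pad // MX_BLOCK),
    }


def zero_pad(row, k_pad: int) -> list:
    """Append zeros along K; returns an equal copy if already aligned."""
    return list(row) + [0.0] * (k_pad - len(row))


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


k = 384                        # shared-expert down_proj K at TP=8, per the PR
k_pad = round_up(k)            # 512
weights = [0.5 * (i % 7 - 3) for i in range(k)]
activations = [0.25 * (i % 5) for i in range(k)]

# Every padded position multiplies zero by zero, so the sum is unchanged.
assert dot(zero_pad(weights, k_pad), zero_pad(activations, k_pad)) == dot(
    weights, activations
)
```

Because the padded positions contribute zero to every output element, only the contraction dimension grows and the output shape of the GEMM is unaffected.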
Validated: GSM8K 93.9% at TP=8 (full 1319-question test set, 5-shot), matching the TP=1
baseline; lossless versus the aligned baseline (cosine 0.9934).
Motivation
Run the MiniMax-M3 MXFP4 checkpoint correctly on ATOM at TP=8.
Test Result
Tested on ATOM at TP=8:
lm-eval
Results (for reference, gsm8k at TP=1 gives ~0.9424):
origin/main (no fix)
Submission Checklist