feat(minimax-m3): support index cache #1354
Merged
Route MiniMax-M3 index Q/K through a separate projection and thread it through the attention stack so cached top-k layers can skip indexer work while preserving the non-cache path. Co-authored-by: Cursor <cursoragent@cursor.com>
Keep MiniMax-M3 index Q/K in the packed QKV projection so index-cache support only skips top-k work and does not require a separate aiter input ABI. Co-authored-by: Cursor <cursoragent@cursor.com>
Remove residual formatting-only changes from the packed index-cache refactor so the branch only carries functional sparse-attention updates. Co-authored-by: Cursor <cursoragent@cursor.com>
Drop temporary hit/miss logging and counters from the MiniMax-M3 top-k cache path now that the packed index-cache flow is settled. Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor
Pull request overview
Adds support for MiniMax-M3 “index cache” scheduling by allowing selected sparse-attention layers to reuse a previously computed index top‑k result, reducing per-layer indexing overhead and improving throughput in high-concurrency runs.
Changes:
- Introduces sparse-layer ordinal-based skip logic (`use_index_cache`, `index_topk_freq`, `index_topk_pattern`, `index_skip_topk_offset`) for MiniMax‑M3 sparse attention layers.
- Adds a shared per-runner top‑k cache state and wires it into `SparseMHAPagedAttentionImpl` so skip layers can reuse cached `(topk_idx, sparse_bt, sparse_ctx)`.
- Normalizes MiniMax‑M3 HF config fields into `text_config` and includes the new knobs in `Config.compute_hash()`.
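A minimal sketch of how an ordinal-based skip decision could look. The four knob names come from the change list above, but their exact semantics are not spelled out in this PR, so everything below is an assumption: the function name `skips_index_topk` is hypothetical, a pattern is assumed to be a string with one character per sparse layer where `"S"` means "skip", and the frequency rule is assumed to recompute top-k on every `index_topk_freq`-th sparse layer counted from the offset.

```python
from typing import Optional


def skips_index_topk(
    sparse_ordinal: int,
    use_index_cache: bool = False,
    index_topk_freq: int = 1,
    index_topk_pattern: Optional[str] = None,
    index_skip_topk_offset: int = 0,
) -> bool:
    """Return True if this sparse layer should reuse a cached top-k result.

    Assumed semantics (hypothetical, not taken from the PR):
      * If a pattern is given, it is read cyclically by sparse-layer ordinal
        and the character "S" marks a skip layer.
      * Otherwise, layers before the offset always compute top-k; from the
        offset onward, every index_topk_freq-th layer computes and the
        layers in between skip.
    """
    if not use_index_cache:
        return False
    if index_topk_pattern is not None:
        return index_topk_pattern[sparse_ordinal % len(index_topk_pattern)] == "S"
    if index_topk_freq <= 1 or sparse_ordinal < index_skip_topk_offset:
        return False
    return (sparse_ordinal - index_skip_topk_offset) % index_topk_freq != 0


# Which of the first eight sparse layers would skip with a frequency of 4.
plan = [skips_index_topk(i, use_index_cache=True, index_topk_freq=4) for i in range(8)]
```

Under these assumptions the first sparse layer always computes top-k when the frequency rule is used, so a skip layer never runs before the cache has been filled. A pattern that starts with a skip marker would not have that guarantee and would need validation.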
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| atom/models/minimax_m3.py | Computes per-sparse-layer ordinal and passes skip/caching controls into the sparse attention impl. |
| atom/model_ops/attentions/aiter_attention.py | Allocates and binds a shared sparse top‑k cache state dict on the model runner for index-cache mode. |
| atom/model_ops/attention_mha.py | Implements skip-index fast path in rope_cache and cached top‑k reuse in sparse prefill/decode. |
| atom/config.py | Propagates index-cache knobs into MiniMax‑M3 text_config and incorporates them into config hashing. |
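To illustrate why the new knobs would be folded into the config hash, here is a small sketch. It is not the actual `Config.compute_hash()` from `atom/config.py`, whose body is not shown in this PR; the helper name `hash_index_cache_knobs`, the use of SHA-256, and the JSON serialization are all assumptions. The point is only that two configs differing in an index-cache knob should not produce the same hash.

```python
import hashlib
import json

# Knob names as listed in the PR overview.
INDEX_CACHE_KNOBS = (
    "use_index_cache",
    "index_topk_freq",
    "index_topk_pattern",
    "index_skip_topk_offset",
)


def hash_index_cache_knobs(text_config: dict) -> str:
    """Hash only the index-cache knobs of a (hypothetical) text_config dict.

    Missing knobs are recorded as None so that "unset" and "set" differ.
    """
    factors = {name: text_config.get(name) for name in INDEX_CACHE_KNOBS}
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(factors, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


baseline = hash_index_cache_knobs({"use_index_cache": False})
enabled = hash_index_cache_knobs({"use_index_cache": True, "index_topk_freq": 4})
```

With a hash of this shape, anything keyed on the config hash is presumably invalidated when index-cache settings change, instead of being reused with stale skip behavior.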
Comment on lines +1082 to +1086

    if self.skip_index_topk:
        from atom.model_ops.triton_fused_qkv_norm_rope_cache import (
            triton_fused_norm_rope_cache,
        )
Comment on lines +1288 to +1290

    if index_q is None:
        raise RuntimeError("MiniMax-M3 index cache miss on a skip-index layer")
    topk_idx, sparse_bt, sparse_ctx = minimax_m3_index_topk(
Comment on lines +1360 to +1363

    if cached_topk is None:
        if index_q is None:
            raise RuntimeError("MiniMax-M3 index cache miss on a skip-index layer")
        topk_idx, sparse_bt, sparse_ctx = minimax_m3_index_topk_decode(
valarLip reviewed Jun 25, 2026

    cached_topk = self._load_cached_topk(sparse_metadata, topk_key)
    if cached_topk is None:
        if index_q is None:
            raise RuntimeError("MiniMax-M3 index cache miss on a skip-index layer")
Collaborator
Could this part be moved under the model itself? Not a must for this PR, but we do need a refact…
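The load-or-compute flow in the quoted snippet can be sketched in isolation as follows. This is a simplified stand-in, not the code in `attention_mha.py`: the class `TopKCacheState` and the function `resolve_topk` are hypothetical names, the cache key is treated as an opaque hashable, and an optional `compute_topk` callable plays the role that the presence of `index_q` plays in the real code (a skip-index layer has no index query and therefore cannot recompute).

```python
from typing import Any, Callable, Dict, Hashable, Optional, Tuple

# Stand-in for the cached triple (topk_idx, sparse_bt, sparse_ctx).
TopK = Tuple[Any, Any, Any]


class TopKCacheState:
    """Hypothetical shared top-k state, one instance per model runner."""

    def __init__(self) -> None:
        self._entries: Dict[Hashable, TopK] = {}

    def load(self, key: Hashable) -> Optional[TopK]:
        return self._entries.get(key)

    def store(self, key: Hashable, value: TopK) -> None:
        self._entries[key] = value


def resolve_topk(
    state: TopKCacheState,
    key: Hashable,
    compute_topk: Optional[Callable[[], TopK]] = None,
) -> TopK:
    """Return the cached top-k for key, computing and storing it on a miss.

    A miss with no way to compute (compute_topk is None, mirroring a
    skip-index layer where index_q is None) is an error.
    """
    cached = state.load(key)
    if cached is None:
        if compute_topk is None:
            raise RuntimeError("MiniMax-M3 index cache miss on a skip-index layer")
        cached = compute_topk()
        state.store(key, cached)
    return cached


state = TopKCacheState()
# A computing layer fills the cache; a later skip layer reuses the same object.
computed = resolve_topk(state, "step-0", lambda: ([3, 1, 2], [0], [3]))
reused = resolve_topk(state, "step-0")
```

Keeping the miss check in one helper like this is one possible direction for the refactor mentioned in the review comment, since the prefill and decode paths quoted earlier repeat the same check.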
valarLip approved these changes Jun 25, 2026
Support the index cache feature to improve MiniMax-M3 performance.
Motivation
Technical Details
Test Plan
MXFP4 and MXFP8 accuracy:
Test Result
MXFP4(conc=64,256):

MXFP8(conc=64,256):


Submission Checklist