feat(minimax-m3): support index cache by XiaobingSuper · Pull Request #1354 · ROCm/ATOM

XiaobingSuper · 2026-06-25T09:35:16Z

Support the index cache feature to improve MiniMax-M3 performance.

Motivation

Technical Details

Test Plan

MXFP4 and MXFP8 accuracy:

# Launch the server with the MXFP4 checkpoint.
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3}"
# unset HIP_VISIBLE_DEVICES
# unset ROCR_VISIBLE_DEVICES
# export ATOM_PROFILER_MORE=1
export ATOM_FORCE_ATTN_TRITON=1
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
# export ATOM_M3_SPARSE_USE_ASM_PA=1
export AITER_QUICK_REDUCE_CAST_BF16_TO_FP16=0

export ATOM_PROFILER_MORE=1

DEFAULT_HF_OVERRIDES='{"use_index_cache": true, "index_topk_freq": 4}'
HF_OVERRIDES="${HF_OVERRIDES:-${DEFAULT_HF_OVERRIDES}}"
HF_OVERRIDE_ARGS=()
if [[ -n "${HF_OVERRIDES}" ]]; then
  HF_OVERRIDE_ARGS=(--hf-overrides "${HF_OVERRIDES}")
fi
# Example: HF_OVERRIDES='{"use_index_cache": true, "index_topk_freq": 4}' bash run_atom.sh


# --torch-profiler-dir ./trace --mark-trace
  # --torch-profiler-dir /workdir/xiaobing/m3_trace --mark-trace \
python -m atom.entrypoints.openai_server \
  --model /shared/data/amd_int/models/MiniMax-M3-MXFP4 \
  --tensor-parallel-size 4 \
  --server-port 8005 \
  --trust-remote-code \
  --gpu-memory-utilization 0.8 \
  --kv_cache_dtype fp8 \
  --block-size 128 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --no-enable_prefix_caching \
  "${HF_OVERRIDE_ARGS[@]}" \
  --max-num-batched-tokens 32768 2>&1 | tee m3-mxfp4-server.log

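# Launch the server with the MXFP8 checkpoint (same environment, plus an online quantization config).
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3}"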
# unset HIP_VISIBLE_DEVICES
# unset ROCR_VISIBLE_DEVICES
# export ATOM_PROFILER_MORE=1
export ATOM_FORCE_ATTN_TRITON=1
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
# export ATOM_M3_SPARSE_USE_ASM_PA=1
export AITER_QUICK_REDUCE_CAST_BF16_TO_FP16=0

export ATOM_PROFILER_MORE=1

DEFAULT_HF_OVERRIDES='{"use_index_cache": true, "index_topk_freq": 4}'
HF_OVERRIDES="${HF_OVERRIDES:-${DEFAULT_HF_OVERRIDES}}"
HF_OVERRIDE_ARGS=()
if [[ -n "${HF_OVERRIDES}" ]]; then
  HF_OVERRIDE_ARGS=(--hf-overrides "${HF_OVERRIDES}")
fi
# Example: HF_OVERRIDES='{"use_index_cache": true, "index_topk_freq": 4}' bash run_atom.sh


# --torch-profiler-dir ./trace --mark-trace
  # --torch-profiler-dir /workdir/xiaobing/m3_trace --mark-trace \
python -m atom.entrypoints.openai_server \
  --model /shared/data/amd_int/models/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 4 \
  --server-port 8005 \
  --trust-remote-code \
  --gpu-memory-utilization 0.8 \
  --kv_cache_dtype fp8 \
  --block-size 128 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --no-enable_prefix_caching \
  "${HF_OVERRIDE_ARGS[@]}" \
  --online_quant_config '{"global_quant_config": "ptpc_fp8", "exclude_layer": ["lm_head", "model.embed_tokens", "vision_tower", "multi_modal_projector", "patch_merge_mlp", "*block_sparse_moe"]}' \
  --max-num-batched-tokens 32768 2>&1 | tee m3-mxfp8-server.log

# Evaluate GSM8K accuracy against the running server.
lm_eval \
  --model local-chat-completions \
  --apply_chat_template \
  --output_path /tmp/eval_out-5gtcCE \
  --log_samples \
  --tasks gsm8k \
  --model_args 'model=/shared/data/amd_int/models/MiniMax-M3-MXFP8,base_url=http://0.0.0.0:8005/v1/chat/completions,api_key=EMPTY,eos_string=</s>,max_retries=5,num_concurrent=64,timeout=1800,tokenized_requests=False,max_length=32768' \
  --gen_kwargs max_tokens=16384,temperature=0,top_p=1 2>&1 | tee m3-mxfp8-accuracy.log
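
The launch scripts above pass the index-cache knobs as a JSON string through `--hf-overrides`. A minimal sketch of sanity-checking such a string before starting the server is shown here; the helper name `parse_hf_overrides` and its validation rules are assumptions for illustration and are not part of ATOM.

```python
import json


def parse_hf_overrides(raw: str) -> dict:
    """Parse an --hf-overrides JSON string and sanity-check the index-cache knobs.

    Hypothetical helper: the knob names come from the launch scripts,
    but the checks below are assumed, not ATOM's actual behavior.
    """
    overrides = json.loads(raw)
    if not isinstance(overrides, dict):
        raise ValueError("hf-overrides must be a JSON object")

    use_cache = overrides.get("use_index_cache", False)
    if not isinstance(use_cache, bool):
        raise ValueError("use_index_cache must be true or false")

    freq = overrides.get("index_topk_freq", 1)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(freq, bool) or not isinstance(freq, int) or freq < 1:
        raise ValueError("index_topk_freq must be a positive integer")

    return overrides


overrides = parse_hf_overrides('{"use_index_cache": true, "index_topk_freq": 4}')
print(overrides)
```

Checking the string up front surfaces a malformed override before a long model load rather than after it.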

Test Result

MXFP4(conc=64,256):

MXFP8(conc=64,256):

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

Adds support for MiniMax-M3 “index cache” scheduling by allowing selected sparse-attention layers to reuse a previously computed index top‑k result, reducing per-layer indexing overhead and improving throughput in high-concurrency runs.

Changes:

Introduces sparse-layer ordinal–based skip logic (use_index_cache, index_topk_freq, index_topk_pattern, index_skip_topk_offset) for MiniMax‑M3 sparse attention layers.
Adds a shared per-runner top‑k cache state and wires it into SparseMHAPagedAttentionImpl so skip layers can reuse cached (topk_idx, sparse_bt, sparse_ctx).
Normalizes MiniMax‑M3 HF config fields into text_config and includes the new knobs in Config.compute_hash().
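
To make the ordinal-based skip logic concrete, here is a minimal sketch of how a frequency knob could split sparse layers into ones that recompute top-k and ones that reuse it. The rule used (recompute on every `index_topk_freq`-th sparse layer, starting from ordinal 0) is an assumption for illustration; the PR's actual rule, including `index_topk_pattern` and `index_skip_topk_offset`, is not modeled here, and the function names are hypothetical.

```python
from typing import List


def computes_topk(sparse_ordinal: int, index_topk_freq: int) -> bool:
    """Assumed rule: sparse layers whose ordinal is a multiple of the
    frequency recompute top-k; every other sparse layer reuses the cache."""
    if index_topk_freq < 1:
        raise ValueError("index_topk_freq must be >= 1")
    return sparse_ordinal % index_topk_freq == 0


def plan_layers(num_sparse_layers: int, index_topk_freq: int) -> List[str]:
    """Label each sparse layer, by ordinal, as 'compute' or 'reuse'."""
    return [
        "compute" if computes_topk(ordinal, index_topk_freq) else "reuse"
        for ordinal in range(num_sparse_layers)
    ]


plan = plan_layers(num_sparse_layers=8, index_topk_freq=4)
print(plan)
```

Under this assumed rule, `index_topk_freq=4` means one indexer pass serves four consecutive sparse layers, which is where the reduction in per-layer indexing work would come from.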

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| atom/models/minimax_m3.py | Computes per-sparse-layer ordinal and passes skip/caching controls into the sparse attention impl. |
| atom/model_ops/attentions/aiter_attention.py | Allocates and binds a shared sparse top‑k cache state dict on the model runner for index-cache mode. |
| atom/model_ops/attention_mha.py | Implements skip-index fast path in `rope_cache` and cached top‑k reuse in sparse prefill/decode. |
| atom/config.py | Propagates index-cache knobs into MiniMax‑M3 `text_config` and incorporates them into config hashing. |
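
The shared cache state described above can be pictured with a small sketch. It assumes a plain dict keyed by a string and uses Python lists in place of tensors; `TopkCache`, `resolve_topk`, and the key `"step-0"` are hypothetical names, and only the `(topk_idx, sparse_bt, sparse_ctx)` triple and the "cache miss on a skip-index layer" error come from the PR.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# (topk_idx, sparse_bt, sparse_ctx); real code would hold tensors here.
CachedTopk = Tuple[Any, Any, Any]


class TopkCache:
    """Hypothetical stand-in for the shared per-runner top-k cache state."""

    def __init__(self) -> None:
        self._state: Dict[str, CachedTopk] = {}

    def store(self, key: str, value: CachedTopk) -> None:
        self._state[key] = value

    def load(self, key: str) -> Optional[CachedTopk]:
        return self._state.get(key)


def resolve_topk(
    cache: TopkCache,
    key: str,
    compute_fn: Optional[Callable[[], CachedTopk]] = None,
    reuse: bool = True,
) -> CachedTopk:
    """Return cached top-k when reusing; otherwise compute and store it.

    A layer that must reuse but finds nothing cached, and has no way to
    compute (compute_fn is None), fails loudly instead of guessing.
    """
    cached = cache.load(key) if reuse else None
    if cached is not None:
        return cached
    if compute_fn is None:
        raise RuntimeError("index cache miss on a skip-index layer")
    result = compute_fn()
    cache.store(key, result)
    return result


cache = TopkCache()
# A compute layer refreshes the cache entry for this step.
first = resolve_topk(cache, "step-0", lambda: ([3, 1, 2], [0, 1], [128]), reuse=False)
# A skip layer has no indexer inputs and only reads the cached triple.
second = resolve_topk(cache, "step-0")
print(second)
```

Raising on a miss keeps a scheduling bug visible: a skip layer silently recomputing or running without indices would hide the problem.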

+        if self.skip_index_topk:
+            from atom.model_ops.triton_fused_qkv_norm_rope_cache import (
+                triton_fused_norm_rope_cache,
+            )
+


+            if index_q is None:
+                raise RuntimeError("MiniMax-M3 index cache miss on a skip-index layer")
+            topk_idx, sparse_bt, sparse_ctx = minimax_m3_index_topk(


+        if cached_topk is None:
+            if index_q is None:
+                raise RuntimeError("MiniMax-M3 index cache miss on a skip-index layer")
+            topk_idx, sparse_bt, sparse_ctx = minimax_m3_index_topk_decode(


valarLip · 2026-06-25T11:27:50Z

+        cached_topk = self._load_cached_topk(sparse_metadata, topk_key)
+        if cached_topk is None:
+            if index_q is None:
+                raise RuntimeError("MiniMax-M3 index cache miss on a skip-index layer")


Could we put this part under the model itself? Not a must in this PR, but we do need a refact…

XiaobingSuper and others added 5 commits June 25, 2026 04:11

feat(minimax-m3): split index cache projection

bf115d5

Route MiniMax-M3 index Q/K through a separate projection and thread it through the attention stack so cached top-k layers can skip indexer work while preserving the non-cache path. Co-authored-by: Cursor <cursoragent@cursor.com>

refactor(minimax-m3): keep indexer qk packed

c9e8546

Keep MiniMax-M3 index Q/K in the packed QKV projection so index-cache support only skips top-k work and does not require a separate aiter input ABI. Co-authored-by: Cursor <cursoragent@cursor.com>

chore(minimax-m3): drop leftover formatting noise

65e76c0

Remove residual formatting-only changes from the packed index-cache refactor so the branch only carries functional sparse-attention updates. Co-authored-by: Cursor <cursoragent@cursor.com>

code format

573094a

chore(minimax-m3): remove index cache debug logging

4d12345

Drop temporary hit/miss logging and counters from the MiniMax-M3 top-k cache path now that the packed index-cache flow is settled. Co-authored-by: Cursor <cursoragent@cursor.com>

XiaobingSuper changed the title ~~feat(minimax-m3): split index cache projection~~ feat(minimax-m3): support index cahce Jun 25, 2026

Merge branch 'main' into xiaobing/index_cache

dcb8497

XiaobingSuper changed the title ~~feat(minimax-m3): support index cahce~~ feat(minimax-m3): support index cache Jun 25, 2026

XiaobingSuper marked this pull request as ready for review June 25, 2026 10:44

XiaobingSuper requested review from Copilot, ganyi1996ppo and zejunchen-zejun June 25, 2026 10:44

Copilot started reviewing on behalf of XiaobingSuper June 25, 2026 10:45

XiaobingSuper requested a review from valarLip June 25, 2026 10:46

Copilot AI reviewed Jun 25, 2026

valarLip reviewed Jun 25, 2026

valarLip approved these changes Jun 25, 2026

valarLip merged commit f16bec5 into main Jun 25, 2026
41 of 46 checks passed

valarLip deleted the xiaobing/index_cache branch June 25, 2026 12:25

valarLip merged 6 commits into main from xiaobing/index_cache

3 participants
