feat(minimax-m3): support index cache #1354

Merged: valarLip merged 6 commits into main from xiaobing/index_cache on Jun 25, 2026

XiaobingSuper commented Jun 25, 2026

Contributor

Support the index cache feature to improve MiniMax-M3 performance.

Motivation

Technical Details

Test Plan

MXFP4 and MXFP8 accuracy:

export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3}"
# unset HIP_VISIBLE_DEVICES
# unset ROCR_VISIBLE_DEVICES
# export ATOM_PROFILER_MORE=1
export ATOM_FORCE_ATTN_TRITON=1
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
# export ATOM_M3_SPARSE_USE_ASM_PA=1
export AITER_QUICK_REDUCE_CAST_BF16_TO_FP16=0

export ATOM_PROFILER_MORE=1

DEFAULT_HF_OVERRIDES='{"use_index_cache": true, "index_topk_freq": 4}'
HF_OVERRIDES="${HF_OVERRIDES:-${DEFAULT_HF_OVERRIDES}}"
HF_OVERRIDE_ARGS=()
if [[ -n "${HF_OVERRIDES}" ]]; then
  HF_OVERRIDE_ARGS=(--hf-overrides "${HF_OVERRIDES}")
fi
# Example: HF_OVERRIDES='{"use_index_cache": true, "index_topk_freq": 4}' bash run_atom.sh


# --torch-profiler-dir ./trace --mark-trace
  # --torch-profiler-dir /workdir/xiaobing/m3_trace --mark-trace \
python -m atom.entrypoints.openai_server \
  --model /shared/data/amd_int/models/MiniMax-M3-MXFP4 \
  --tensor-parallel-size 4 \
  --server-port 8005 \
  --trust-remote-code \
  --gpu-memory-utilization 0.8 \
  --kv_cache_dtype fp8 \
  --block-size 128 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --no-enable_prefix_caching \
  "${HF_OVERRIDE_ARGS[@]}" \
  --max-num-batched-tokens 32768 2>&1 | tee m3-mxfp8-server.log

export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3}"
# unset HIP_VISIBLE_DEVICES
# unset ROCR_VISIBLE_DEVICES
# export ATOM_PROFILER_MORE=1
export ATOM_FORCE_ATTN_TRITON=1
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
# export ATOM_M3_SPARSE_USE_ASM_PA=1
export AITER_QUICK_REDUCE_CAST_BF16_TO_FP16=0

export ATOM_PROFILER_MORE=1

DEFAULT_HF_OVERRIDES='{"use_index_cache": true, "index_topk_freq": 4}'
HF_OVERRIDES="${HF_OVERRIDES:-${DEFAULT_HF_OVERRIDES}}"
HF_OVERRIDE_ARGS=()
if [[ -n "${HF_OVERRIDES}" ]]; then
  HF_OVERRIDE_ARGS=(--hf-overrides "${HF_OVERRIDES}")
fi
# Example: HF_OVERRIDES='{"use_index_cache": true, "index_topk_freq": 4}' bash run_atom.sh


# --torch-profiler-dir ./trace --mark-trace
  # --torch-profiler-dir /workdir/xiaobing/m3_trace --mark-trace \
python -m atom.entrypoints.openai_server \
  --model /shared/data/amd_int/models/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 4 \
  --server-port 8005 \
  --trust-remote-code \
  --gpu-memory-utilization 0.8 \
  --kv_cache_dtype fp8 \
  --block-size 128 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --no-enable_prefix_caching \
  "${HF_OVERRIDE_ARGS[@]}" \
  --online_quant_config '{"global_quant_config": "ptpc_fp8", "exclude_layer": ["lm_head", "model.embed_tokens", "vision_tower", "multi_modal_projector", "patch_merge_mlp", "*block_sparse_moe"]}' \
  --max-num-batched-tokens 32768 2>&1 | tee m3-mxfp8-server.log

lm_eval \
  --model local-chat-completions \
  --apply_chat_template \
  --output_path /tmp/eval_out-5gtcCE \
  --log_samples \
  --tasks gsm8k \
  --model_args 'model=/shared/data/amd_int/models/MiniMax-M3-MXFP8,base_url=http://0.0.0.0:8005/v1/chat/completions,api_key=EMPTY,eos_string=</s>,max_retries=5,num_concurrent=64,timeout=1800,tokenized_requests=False,max_length=32768' \
  --gen_kwargs max_tokens=16384,temperature=0,top_p=1 2>&1 | tee m3-mxfp8-accuracy.log
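
The `--hf-overrides` value used by the launch scripts above is a plain JSON object. As a minimal sketch, the same argument pair could be built in Python when driving the server from a test harness. The helper name `build_hf_override_args` is hypothetical and not part of ATOM; only the two keys and the flag name come from the scripts above.

```python
import json
from typing import List


def build_hf_override_args(use_index_cache: bool = True,
                           index_topk_freq: int = 4) -> List[str]:
    """Return the CLI argument pair for --hf-overrides (hypothetical helper)."""
    if index_topk_freq < 1:
        raise ValueError("index_topk_freq must be a positive integer")
    overrides = {
        "use_index_cache": use_index_cache,
        "index_topk_freq": index_topk_freq,
    }
    # json.dumps emits lowercase true/false, matching the shell default above.
    return ["--hf-overrides", json.dumps(overrides)]


args = build_hf_override_args()
print(args[1])
```

With the defaults, the printed JSON matches `DEFAULT_HF_OVERRIDES` in the shell scripts, so the list can be appended to a `subprocess` command line without extra quoting.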

Test Result

MXFP4 (conc=64, 256):
[image]
[image]

MXFP8 (conc=64, 256):
[image]
[image]

Submission Checklist

XiaobingSuper and others added 5 commits June 25, 2026 04:11
Route MiniMax-M3 index Q/K through a separate projection and thread it through the attention stack so cached top-k layers can skip indexer work while preserving the non-cache path.
Co-authored-by: Cursor <cursoragent@cursor.com>

Keep MiniMax-M3 index Q/K in the packed QKV projection so index-cache support only skips top-k work and does not require a separate aiter input ABI.
Co-authored-by: Cursor <cursoragent@cursor.com>

Remove residual formatting-only changes from the packed index-cache refactor so the branch only carries functional sparse-attention updates.
Co-authored-by: Cursor <cursoragent@cursor.com>

Drop temporary hit/miss logging and counters from the MiniMax-M3 top-k cache path now that the packed index-cache flow is settled.
Co-authored-by: Cursor <cursoragent@cursor.com>

XiaobingSuper changed the title from "feat(minimax-m3): split index cache projection" to "feat(minimax-m3): support index cahce" Jun 25, 2026
XiaobingSuper changed the title from "feat(minimax-m3): support index cahce" to "feat(minimax-m3): support index cache" Jun 25, 2026
XiaobingSuper marked this pull request as ready for review June 25, 2026 10:44
XiaobingSuper requested a review from valarLip June 25, 2026 10:46

Copilot AI left a comment

Contributor

Pull request overview

Adds support for MiniMax-M3 “index cache” scheduling by allowing selected sparse-attention layers to reuse a previously computed index top‑k result, reducing per-layer indexing overhead and improving throughput in high-concurrency runs.

Changes:

  • Introduces sparse-layer ordinal–based skip logic (use_index_cache, index_topk_freq, index_topk_pattern, index_skip_topk_offset) for MiniMax‑M3 sparse attention layers.
  • Adds a shared per-runner top‑k cache state and wires it into SparseMHAPagedAttentionImpl so skip layers can reuse cached (topk_idx, sparse_bt, sparse_ctx).
  • Normalizes MiniMax‑M3 HF config fields into text_config and includes the new knobs in Config.compute_hash().
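
The ordinal-based skip logic described above can be illustrated with a small sketch. The rule below is an assumption, not the PR's confirmed behavior: it supposes that every `index_topk_freq`-th sparse layer (counted by sparse-layer ordinal) recomputes top-k and the layers in between reuse the cached result. The function name `should_skip_index_topk` is hypothetical, and `index_topk_pattern` and `index_skip_topk_offset` are deliberately left out because their semantics are not stated here.

```python
def should_skip_index_topk(sparse_ordinal: int,
                           use_index_cache: bool,
                           index_topk_freq: int) -> bool:
    """Decide whether a sparse layer reuses cached top-k (assumed rule).

    sparse_ordinal counts only sparse-attention layers, starting at 0.
    """
    if not use_index_cache or index_topk_freq <= 1:
        # Cache disabled, or every layer recomputes: never skip.
        return False
    # Assumption: ordinals 0, freq, 2*freq, ... compute top-k; others skip.
    return sparse_ordinal % index_topk_freq != 0


# Which of the first eight sparse layers would skip with index_topk_freq=4.
plan = [should_skip_index_topk(i, True, 4) for i in range(8)]
print(plan)
```

Under this assumed rule, three out of every four sparse layers would skip the indexer when `index_topk_freq` is 4, which is where the per-layer savings would come from.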

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| atom/models/minimax_m3.py | Computes per-sparse-layer ordinal and passes skip/caching controls into the sparse attention impl. |
| atom/model_ops/attentions/aiter_attention.py | Allocates and binds a shared sparse top‑k cache state dict on the model runner for index-cache mode. |
| atom/model_ops/attention_mha.py | Implements skip-index fast path in rope_cache and cached top‑k reuse in sparse prefill/decode. |
| atom/config.py | Propagates index-cache knobs into MiniMax‑M3 text_config and incorporates them into config hashing. |
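
As a rough illustration of the shared top-k cache state, the sketch below shows one way a load-or-compute path could behave. It is a stand-alone sketch under stated assumptions: `TopKCache` and `load_or_compute` are hypothetical names, plain lists stand in for the `(topk_idx, sparse_bt, sparse_ctx)` tensors, and `compute` stands in for the real indexer call. Only the error message is taken from the PR's diff.

```python
from typing import Callable, Dict, Hashable, Optional, Tuple

# Stand-ins for the (topk_idx, sparse_bt, sparse_ctx) tensors.
TopK = Tuple[list, list, list]


class TopKCache:
    """Hypothetical shared cache, keyed per forward step or batch."""

    def __init__(self) -> None:
        self._state: Dict[Hashable, TopK] = {}

    def load_or_compute(self,
                        key: Hashable,
                        index_q: Optional[object],
                        compute: Callable[[object], TopK]) -> TopK:
        cached = self._state.get(key)
        if cached is not None:
            return cached  # skip layer, or repeat lookup: reuse the result
        if index_q is None:
            # A skip-index layer carries no index query, so it cannot recompute.
            raise RuntimeError(
                "MiniMax-M3 index cache miss on a skip-index layer")
        result = compute(index_q)
        self._state[key] = result
        return result


cache = TopKCache()
first = cache.load_or_compute("step0", "q", lambda q: ([0, 3], [1], [2]))
reused = cache.load_or_compute("step0", None, lambda q: ([], [], []))
print(first is reused)
```

A compute layer populates the entry and a later skip layer, which passes `index_q=None`, reads it back. A miss on a skip layer is treated as a hard error rather than a silent fallback.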

Comment on lines +1082 to +1086
    if self.skip_index_topk:
        from atom.model_ops.triton_fused_qkv_norm_rope_cache import (
            triton_fused_norm_rope_cache,
        )

Comment on lines +1288 to +1290
    if index_q is None:
        raise RuntimeError("MiniMax-M3 index cache miss on a skip-index layer")
    topk_idx, sparse_bt, sparse_ctx = minimax_m3_index_topk(

Comment on lines +1360 to +1363
    if cached_topk is None:
        if index_q is None:
            raise RuntimeError("MiniMax-M3 index cache miss on a skip-index layer")
        topk_idx, sparse_bt, sparse_ctx = minimax_m3_index_topk_decode(

    cached_topk = self._load_cached_topk(sparse_metadata, topk_key)
    if cached_topk is None:
        if index_q is None:
            raise RuntimeError("MiniMax-M3 index cache miss on a skip-index layer")

Collaborator

Could this part be put under the model itself? … It is not a must in this PR, but we do need a refact…

valarLip merged commit f16bec5 into main Jun 25, 2026
41 of 46 checks passed
valarLip deleted the xiaobing/index_cache branch June 25, 2026 12:25
3 participants