
feat(eplb): add per-layer expert-load statistics monitor for EP path#2

Open
JiaoliangYu wants to merge 97 commits into main from feat/eplb-expert-load-pass


@JiaoliangYu

Owner

Collect per-layer, per-expert token counts from the MORI EP dispatch output (dispatch_recv_token_num) into a windowed ExpertLoadMonitor. The monitor logs avg/max/balancedness and can emit a one-shot offline rebalance plan (hot/cold experts) via the offline_eplb_rebalance utility command.

Scope is statistics only: per-rank, no cross-rank all-reduce, and no actual expert weight remap/transfer yet. All gated behind ATOM_ENABLE_EPLB_LOAD_STATS (default off).
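
A minimal sketch of what a windowed per-layer monitor could look like, for illustration only. The class name `ExpertLoadMonitorSketch`, its methods, and the definition of balancedness as average load divided by peak load are assumptions; the PR description does not spell out the actual interface or formula.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class ExpertLoadMonitorSketch:
    """Hypothetical per-rank, windowed per-layer/per-expert token counter."""

    def __init__(self, num_layers: int, num_experts: int, window: int = 100) -> None:
        self.num_experts = num_experts
        # One bounded history of per-expert count vectors per layer;
        # the oldest step is dropped once the window is full.
        self._history: List[Deque[List[int]]] = [
            deque(maxlen=window) for _ in range(num_layers)
        ]

    def record(self, layer: int, recv_token_num: List[int]) -> None:
        if len(recv_token_num) != self.num_experts:
            raise ValueError("expected one token count per expert")
        self._history[layer].append(list(recv_token_num))

    def _totals(self, layer: int) -> List[int]:
        totals = [0] * self.num_experts
        for counts in self._history[layer]:
            for expert, count in enumerate(counts):
                totals[expert] += count
        return totals

    def stats(self, layer: int) -> Dict[str, float]:
        totals = self._totals(layer)
        avg = sum(totals) / self.num_experts
        peak = max(totals)
        # Assumed definition: 1.0 means perfectly balanced, lower means skewed.
        balancedness = avg / peak if peak else 1.0
        return {"avg": avg, "max": float(peak), "balancedness": balancedness}

    def hot_cold(self, layer: int, k: int = 1) -> Tuple[List[int], List[int]]:
        totals = self._totals(layer)
        experts = range(self.num_experts)
        hot = sorted(experts, key=lambda e: totals[e], reverse=True)[:k]
        cold = sorted(experts, key=lambda e: totals[e])[:k]
        return hot, cold
```

Because the sketch keeps only local counts, it matches the stated scope: no cross-rank reduction and no weight movement.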

valarLip and others added 30 commits June 14, 2026 23:31
* fix(v4): correct TBO + dp-attention accuracy regression

Two independent bugs collapsed DeepSeek-V4-Pro GSM8K accuracy under
`--enable-dp-attention --enable-tbo` (3-shot flexible-extract ~0.95 -> ~0.87,
and as low as ~0.57 at concurrency 1000):

1. ids-gather on a side stream (deepseek_v4.py)
   The V4 hash-routing input_ids all-gather was run via
   `_run_on_tbo_comm_stream` on a private side stream. The gathered ids must
   be row-aligned with the DP-gathered hidden/router output (both share the
   dp_group communicator and `_hash_topk` indexes them positionally via
   `tid2eid[ids[:num_tokens]]`). Running it on an independent stream desyncs
   the collective's placement relative to the per-ubatch hidden all-gathers
   and only synchronizes the forward-top stream, not the TBO ubatch stream
   that consumes it -> misaligned rows -> wrong expert routing. The
   corruption scales with batch size (0.87 at conc 65, 0.57 at conc 1000).
   Fix: run the gather INLINE on the compute stream (the ids tensor is tiny
   vs hidden, so no real overlap is lost). Do NOT wrap it in the TBO
   ping-pong -- an extra forward-top yield desyncs the ring and collapses
   accuracy to ~0.54.

2. uninitialized padding in pad_for_all_gather (moe.py)
   Padding rows for the uniform all-gather were allocated with `torch.empty`
   (garbage). Padded rows are all-gathered across DP ranks and fed straight
   into the aiter fused-MoE expert GEMM, where garbage leaks into real
   tokens' outputs (~0.7pp GSM8K drop at large batch). Fix: explicitly zero
   the pad rows.
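
A framework-free sketch of the second fix, assuming rows are plain Python lists. The real `pad_for_all_gather` works on tensors; the helper name `pad_rows_with_zeros` is hypothetical and only illustrates why the pad rows must hold a defined value.

```python
from typing import List


def pad_rows_with_zeros(rows: List[List[float]], target_rows: int) -> List[List[float]]:
    """Pad a [num_rows, width] matrix to target_rows using explicit zero rows."""
    if target_rows < len(rows):
        raise ValueError("target_rows must be >= the number of real rows")
    width = len(rows[0]) if rows else 0
    # Every pad row is created as zeros, never left uninitialized,
    # so whatever consumes the gathered buffer sees a benign value.
    padding = [[0.0] * width for _ in range(target_rows - len(rows))]
    return [list(row) for row in rows] + padding
```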

Also in moe.py: simplify reduce_scatter_with_unpadding to a dim-0 slice
(removes the deprecated list-indexing that triggered a PyTorch 2.9
UserWarning) and tidy pad_for_all_gather / all_gather_with_padding.

Add a DeepSeek-V4-Pro TBO+DPA nightly accuracy entry (num_concurrent=1000)
to models_accuracy.json. Local 1319-sample GSM8K 3-shot across runs:
0.9439 / 0.9484 / 0.9515 / 0.9530 / 0.9538 (mean ~0.950, baseline ~0.9522).

Update recipes/DeepSeek-V4.md accordingly.

* fix(v4): use dual-stream shared_experts.forward and drop pad-zeroing

- dual_stream_moe_forward: call shared_experts.forward(x) directly on the
  alt stream.
- pad_for_all_gather: drop the explicit pad-row zeroing now that DP
  all_gather goes through the IPC-registered (no-copy-in) path under graph
  capture (see ROCm/aiter#3713). conc1024 GSM8K stays within noise (0.9492).
* ci(mesh): add Atomesh accuracy and benchmark workflows

- Validate standalone-mode accuracy via Atomesh entrypoints.
- Add a mocker benchmark for PD routing scenarios with a topology and consumer-concurrency matrix.

* [ci][mesh] add Atomesh mocker benchmark dashboard

- Add a custom dashboard for Atomesh mocker benchmark results.
- Show throughput, latency, detailed performance data, commit links, and CI run links.
- Align the benchmark matrix with 1P1D, 2P1D, and 3P1D topologies across consumer concurrency levels.

* [ci] Skip unrelated ATOM, vLLM, and SGLang CI for mesh-only PRs.

* [ci][mesh] Enable mocker dashboard publishing workflow to run on zwan/feat-mesh-ci pushes.

* Polish Atomesh mocker dashboard legends

* [ci][mesh] fix atomesh standalone accuracy data source

* Revert 'Enable mocker dashboard publishing workflow to run on zwan/feat-mesh-ci pushes.'

* [ci][mesh] add logo and display theme for mesh mocker benchmark dashboard

* [ci][mesh] Polish Atomesh dashboard and accuracy data flow
…m registration (ROCm#1214)

* [Sglang_atom][Fix] Fix the issue where DPSK v3.2 cannot recognize atom model registration after upgrading SGLang to version 0.5.12.

* [sglang_atom][scripts] add dpsk v3.2 tp8 and dpep case
* [ATOM SGL] dsv4 init

* register

* fix acc

* fix profiler error

* using 0.5.12 sglang

* precheckin

* remove local scratch files from PR

Co-authored-by: Cursor <cursoragent@cursor.com>

* remove local curtest script from PR

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…g enabled (ROCm#1184)

* feat(server): add Anthropic Messages API endpoint (/v1/messages)

Enables Claude Code and other Anthropic-compatible tools to use ATOM
as a backend. Translates between Anthropic Messages format and ATOM's
internal OpenAI format.

Supports:
- Non-streaming and streaming responses
- System messages, multi-turn conversations
- Thinking/reasoning content separation (via ReasoningFilter)
- Anthropic SSE event format (message_start, content_block_delta, etc.)
- Tool definitions translation (Anthropic → OpenAI format)

Usage with Claude Code:
  ANTHROPIC_BASE_URL=http://localhost:8000 \
  ANTHROPIC_AUTH_TOKEN=dummy \
  ANTHROPIC_MODEL=MiniMax-M2.7 \
  claude

* fix(anthropic): fix streaming handler, reasoning filter, and Claude Code compat

- Fix ToolCallStreamParser integration: consume (event_type, data) tuples
  from process()/flush() instead of calling nonexistent get_content()/
  get_tool_calls() methods
- Fix cleanup_streaming_request() call with missing request_id argument
- Fix _build_sampling_params() missing ignore_eos, None top_k/top_p
- Init ReasoningFilter in state 1 when chat template ends with <think>,
  so thinking models like K2.6 have reasoning properly hidden
- Increase ReasoningFilter buffer threshold from 7 to 100 chars to avoid
  prematurely emitting thinking as visible content
- Add prompt truncation when input exceeds max_model_len
- Add cache_creation_input_tokens and cache_read_input_tokens to usage

* fix(anthropic): pass tool definitions to model via chat template

Claude Code sends tool schemas (WebSearch, Bash, etc.) in every request,
but the /v1/messages handler was hardcoding tools=None. The model never
saw tool definitions and couldn't generate proper tool_use calls.

Now converts and forwards request.tools via anthropic_to_openai_tools(),
enabling the model to use WebSearch, WebFetch, and other Claude Code tools.
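
A rough sketch of the tool-schema translation, assuming the public request shapes: Anthropic tools carry `name`, `description`, and `input_schema`, while OpenAI tools wrap a `function` object with `parameters`. The function name below is a hypothetical stand-in, not the actual `anthropic_to_openai_tools` implementation.

```python
from typing import Any, Dict, List, Optional


def anthropic_to_openai_tools_sketch(
    tools: Optional[List[Dict[str, Any]]],
) -> Optional[List[Dict[str, Any]]]:
    """Convert Anthropic-style tool definitions to OpenAI-style ones."""
    if not tools:
        # No tools in the request: keep the previous tools=None behavior.
        return None
    converted = []
    for tool in tools:
        converted.append(
            {
                "type": "function",
                "function": {
                    "name": tool["name"],
                    "description": tool.get("description", ""),
                    # Anthropic's input_schema is already JSON Schema.
                    "parameters": tool.get(
                        "input_schema", {"type": "object", "properties": {}}
                    ),
                },
            }
        )
    return converted
```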

* fix(anthropic): suppress thinking blocks, add signature support

- Skip streaming thinking blocks entirely to avoid Claude Code's
  signature verification rejection. Thinking still happens server-side
  but only the final answer is sent to the client.
- Add signature field to thinking content blocks and signature_delta
  SSE events for compatibility with Claude Code 2.1.143+.
- Add stream_signature_delta() helper function.

* fix(anthropic): strip attribution header, use model tool IDs

- Strip Claude Code's x-anthropic-billing-header from system prompt
  server-side (matches vLLM behavior) to preserve prefix caching
- Use model-native tool call IDs (functions.name:index) instead of
  random UUIDs, matching vLLM's kimi_k2 parser for multi-turn compat
- Remove unused uuid import from tool_parser
- Add tests for attribution header stripping

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
* [ATOM SGL] update fp8 prefill argument passing

* use simpler setting

* precheckin
…o to v4 flash and use tp4 (ROCm#1232)

* [atom-vllm] change ci case from v4 pro to v4 flash and use tp4

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add preflight check for atom-vllm ci

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add nccl handle error check

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
…rrupting MoE (ROCm#1229)

* fix(v4): zero-init all-gather padding to stop uninitialized memory corrupting MoE

pad_for_all_gather built the padding rows with torch.empty and never zeroed
them (the .zero_() was commented out), contradicting the function's own
docstring. Those uninitialized rows are all-gathered across DP ranks and fed
straight into the aiter fused-MoE expert GEMM, and the padded input_ids reach
tid2eid[ids] for V4 hash routing. Garbage there leaks into real tokens'
outputs.

Because the corruption is whatever happens to sit in freshly-allocated GPU
memory, the result is nondeterministic across machines/runs: locally it landed
at GSM8K ~0.95, but CI on a different SKU dropped to 0.9007 (TBO+DPA conc1000,
below the 0.93 threshold) and a local rerun crashed with a null-pointer GPU
memory access fault (garbage id -> out-of-range expert -> invalid weight ptr).
Restoring the zero fixes all three: padding hidden is benign and padding ids
route to expert 0.

With the pad guaranteed zero, the _hash_topk clamp band-aid is replaced by an
assert that input_ids length matches gating_output num_tokens, surfacing any
real DP-layout mismatch instead of silently masking it.

Also remove the _run_on_tbo_comm_stream side-stream helper: its only caller
(MoE.combine_outputs TP all-reduce) now runs inline, matching the ids-gather
which must stay inline to keep DP collective ordering aligned under TBO.
Rename compress_stream -> indexer_stream for accuracy.

Verified: V4-Pro TBO+DPA conc1000 GSM8K 3-shot = 0.9515 (flexible) / 0.9522
(strict), no GPU fault, drain clean.

* ci: TEMP run only DeepSeek-V4-Pro TBO+DPA conc1000 (revert before merge)

Flip every accuracy entry except the TBO+DPA conc1000 case to test_level
"off" so any trigger (pr/push/dispatch/schedule) runs only this one job,
to validate the pad zero-init fix in CI quickly.

DO NOT MERGE this commit — drop it before merging the PR.

* Fix TBO 1024c accuracy issue by removing CPU yield in collective op

(cherry picked from commit 9bf2d25)

* test(v4): disable pad zero-init for CI repro + print server cmd

- moe.py: temporarily comment out pad_for_all_gather zero-init to reproduce
  the uninitialized-padding behavior in CI (the CI gate already restricts the
  run to the V4-Pro TBO+DPA conc1000 case).
- deepseek_v4.py: restore the tid2eid[ids] clamp as a bounds guard for hash
  routing.
- atom_test.sh: print the full openai_server command line before launch so the
  CI log shows the exact server args.

Experiment on top of the pad zero-init fix — not for merge as-is.

* ci: restore full accuracy matrix (undo temp single-case gate)

Reverts the test_level "off" gate from 3662ac0 — all accuracy cases are
re-enabled at their original pr/main/nightly levels. The CI experiment that
needed only DeepSeek-V4-Pro TBO+DPA conc1000 is done.

* ci: lower gpt-oss-120b accuracy threshold to 0.87

Both gpt-oss-120b entries (1-GPU and 2-GPU) drop from 0.88 to 0.87 to absorb
run-to-run GSM8K variance. Other models unchanged.

* perf(v4): fuse _hash_topk into a single Triton kernel

The hash-routing custom_routing_function for V4's first layers ran
softplus+sqrt over every routed expert (n_routed_experts ~256-384) but kept
only topk (~6) of them, plus separate clamp / tid2eid gather / score gather /
renorm / scale ops.

triton_hash_topk.py fuses all of it into one kernel (one program per token):
id clamp, tid2eid[id] lookup, gating gather at the selected experts only,
sqrt(softplus(.)), optional renorm and scaling. When shared experts are fused
it writes directly into the first topk columns of the global topK buffer,
avoiding an extra copy.

Numerics match the PyTorch path (max|dw| ~1e-7 fp32 / ~5e-7 bf16 across OOB
ids, bf16, renorm on/off, sliced-buffer write). V4-Pro TBO+DPA conc1000 GSM8K
3-shot = 0.9522.
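
For orientation, a per-token reference of the steps the fused kernel is described as performing, written in plain Python. This is a sketch under assumptions: the function name, the argument layout (`tid2eid` as a table of expert-id lists), and the order of renorm and scaling are illustrative and not taken from triton_hash_topk.py.

```python
import math
from typing import List, Sequence, Tuple


def _softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def hash_topk_reference(
    token_id: int,
    tid2eid: Sequence[Sequence[int]],
    gating_row: Sequence[float],
    renormalize: bool = True,
    scale: float = 1.0,
) -> Tuple[List[int], List[float]]:
    """Route one token: clamp id, look up experts, score only those experts."""
    clamped = min(max(token_id, 0), len(tid2eid) - 1)
    expert_ids = list(tid2eid[clamped])
    # sqrt(softplus(.)) is evaluated for the selected experts only,
    # not for every routed expert.
    weights = [math.sqrt(_softplus(gating_row[e])) for e in expert_ids]
    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return expert_ids, [w * scale for w in weights]
```

The saving described above comes from the gather happening before the activation: with roughly 6 selected experts out of several hundred, almost all of the softplus and sqrt work disappears.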

* ci: print server cmd with [@] expansion to match actual invocation

Use ${ARRAY[@]} instead of ${ARRAY[*]} in the debug echo so the printed
command line reflects the same word-splitting/quoting as the real launch
that uses "${ARRAY[@]}" (addresses Copilot review).

---------

Co-authored-by: ZhangLirong-amd <Lirong.Zhang@amd.com>
…case (ROCm#1216)

* [atom-vllm benchmark MTP] refine benchmark command for atom-vllm MTP case

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add performance mode for glm4.7 mtp case and qwen3next mtp case

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add qwen3next mtp config

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* remove perf mode because it is useless

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* fix missing allreduce for glm4.7 mtp

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* align atom-vllm acc test

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add mtp accept ratio check

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
…Cm#1008)

* Release mi308 benchmark results to dashboard

* dashboard hardware default mi355x

* add atom-sglang mi308 ci

* Release mi308 benchmark results to dashboard

* dashboard hardware default mi355x

* Supplementary content for MI308

* Add steps: Print runner user

* dashboard: hardware single-select filter and dynamic header meta

* Modify models for MI308X

* Modify models for MI308

* Make sure python version for MI308

* Modify python3 version for MI308 with python-build-standalone

---------

Co-authored-by: root <root@hjbog-srdc-15.amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: let rtp use atom & qwen35_moe impl

* fix: log print too much

* fix: rtp+atom default moe ep

* fix: add rtp+atom attention_inputs.position_ids

* fix: qwen35 fp8_perblock pass and add some print

* fix: remove dump_tensor

* refactor: remove dump and refactor attention_backend

* refactor: remove redundant code

* refactor: kvcache and remove redundant code

* feat: some opt in positions & layer_group_map

* refactor: some optimizations and del redundant code

* fix: RTP Qwen35 skip_python_model

* feat: enable cuda graph for ATOM+RTP

* refactor: remove redundant code

* fix: non cuda graph long input crash

* test: cover RTP plugin import and seq_lens behavior

* test: ruff check

* [RTP]Refactor ATOM-RTPLLM Attention

* Refactor RTP prepare model entrance

* fix: ruff check

* fix: address RTP plugin review feedback

* fix: remove redundant RTP Qwen3.5 import aliases

* fix: qwen35 ruff check F401
* glm_moe_dsa: support GLM-5.2 IndexShare (FP8)

GLM-5.2 (glm_moe_dsa) extends the DeepSeek-V3.2-style DSA stack with
IndexShare: layers marked "shared" in `indexer_types` reuse the preceding
"full" layer's indexer/topk and carry no indexer weights of their own in
the checkpoint.
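
The IndexShare rule can be sketched as a small mapping from each layer to the layer whose indexer result it uses. The helper name `resolve_index_sources` is hypothetical; only the "full" and "shared" markers and the reuse-the-preceding-full-layer rule come from the description above.

```python
from typing import List


def resolve_index_sources(indexer_types: List[str]) -> List[int]:
    """For each layer, return the index of the layer that computes its topk."""
    sources: List[int] = []
    last_full = None
    for layer, kind in enumerate(indexer_types):
        if kind == "full":
            last_full = layer  # this layer owns indexer weights
        elif kind == "shared":
            if last_full is None:
                raise ValueError(f"layer {layer} is shared but has no preceding full layer")
        else:
            raise ValueError(f"unknown indexer type: {kind!r}")
        sources.append(last_full)
    return sources
```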

- models/deepseek_v2.py:
  - Make `indexer_types` the authoritative source for the per-layer
    indexer-skip decision (supersedes index_topk_pattern / index_topk_freq).
  - Honor `index_skip_topk_offset` in the freq-based fallback (default 1
    preserves existing DeepSeek behavior).
  - Reuse the cached topk for the MTP layer when
    `index_share_for_mtp_iteration` is set.
  - Do not build indexer weights for "shared" layers; otherwise their
    parameters load nothing from the checkpoint, stay at init values and
    corrupt the indexer (the forward and the index-cache binding already
    guard on `indexer is not None`).
- config.py: auto-enable `use_index_cache` for glm_moe_dsa when the model
  declares an IndexShare schedule, so serving works without passing an
  --hf-overrides flag.
- plugin/vllm/model_wrapper.py: re-apply the auto-enable after vLLM
  replaces ATOM's hf_config.

Validated on 8x MI355X (TP=8, FP8): native ATOM loads all weights with no
unloaded params and generates correctly for 1k/1k and 8k/1k inputs.

* docs: document GLM-5.2 (IndexShare) serving + add News entry

- recipes/GLM-5.md: add a GLM-5.2 (IndexShare) section with the TP8 serve
  command, configuration tips (bf16 KV, gpu-mem-util 0.8), and 8xMI355X
  perf baselines for 1k/1k and 8k/1k; add a pointer from the intro.
- README.md: add a News entry announcing GLM-5.2 FP8 support.

* docs: note GLM-5.2 in README Supported Models table

* style: black formatting for indexer_types skip return

* style: condense GLM-5.2 code comments

* refactor: move maybe_enable_glm_dsa_index_cache into deepseek_v2

Own the indexer-cache auto-enable in the model: call it once in
DeepseekV2ForCausalLM.__init__ (covers native + vLLM plugin) instead of
in config.get_hf_config and the vLLM wrapper.

* refactor: inline index-cache enable into _should_skip_index_topk

Drop maybe_enable_glm_dsa_index_cache; instead, when index_topk_freq > 1
(IndexShare) turn on use_index_cache directly in _should_skip_index_topk.
No model_type gating needed.

* refactor: gate index_topk_freq check under the use_index_cache branch

* refactor: drop redundant 'or 1' guard on index_topk_freq

* benchmark: add GLM-5.2-FP8 to dashboard (perf + accuracy)

Native-engine catalog entries for the nightly dashboard:
- models.json: TP8 FP8, kv_cache_dtype fp8, --gpu-memory-utilization 0.8
  (DSA index cache OOMs at default 0.9), conc up to 256.
- models_accuracy.json: gsm8k threshold 0.92 (measured 3-shot
  flexible-extract 0.9447 on 8x MI355X).
Co-authored-by: JiaoliangYu <jiaolyu@amd.com>
… to 0.81 (ROCm#1243)

Modify qwen3-next-80b-a3b-fp8 threshold from 0.83 to 0.81 (ROCm#1243)
…seek-r1-fp4-tp4-dp8-ep8 to tp8-dp8-ep8 (ROCm#1258)

* Modify sgl accuracy schedule time and change deepseek-r1-fp4-tp4-dp8-ep8 to tp8-dp8-ep8

* Remove duplicate cases
* ci: add actionlint workflow check

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* ci: fix existing actionlint findings

* ci: keep accuracy validation steps alias

* ci: remove stale SGLang accuracy inputs

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…schedule mode (ROCm#1268)

* Modify atom-sglang-benchmark model priority for schedule mode
* temp for docker

* feat: default PA BF16 ASM on gfx1250 unified attention

Route eligible non-SWA decode layers through pa_decode_bf16_asm on gfx1250 when ATOM_USE_UNIFIED_ATTN is enabled. ATOM_FORCE_ATTN_TRITON disables this path and keeps attention on Triton.

Keep mainline unified-attention SHUFFLE KV layout, remove the old ATOM_GPTOSS_USE_PA_DECODE_BF16_ASM env gate, and warn instead of asserting when ATOM_USE_UNIFIED_ATTN sees a non-default block size.

Validation: git diff --check origin/main...HEAD; python -m py_compile atom/model_ops/attention_mha.py atom/model_ops/attentions/aiter_attention.py atom/model_ops/attentions/triton_mha.py atom/utils/envs.py.

* fix: keep bf16 query for PA ASM decode

* Fix PA ASM cudagraph metadata refresh

* Zero PA ASM padded decode rows

* fix softmax scale

* set is_causal=True in ps_metadata

* Refactor PA ASM decode: unify metadata and merge bf16_asm into persistent_asm

- Switch pa_decode_bf16_asm metadata from get_ps_metadata_v1 to
  get_pa_metadata_v1, so it shares the persistent worker buffers used by
  pa_persistent_fwd. Drop the separate pa_decode_bf16_asm_* buffer set,
  set_pa_decode_bf16_asm_metadata, and the per-layer fallback metadata builder.
- Merge paged_attention_pa_decode_bf16_asm into paged_attention_persistent_asm:
  sinks is None -> pa_persistent_fwd, else -> pa_decode_bf16_asm. dispatch_backend
  routes both through persistent_asm.
- Enable persistent decode for block_size 256 and 1024 (was 1024 only).
- Guard paged_attention_asm against sinks (run_pa_fwd_asm has no sink support).
- Simplify q fp8 quant: use the fixed kv_scale_float for q/k/v dequant scale
  (pre-allocated tensor, CUDAGraph-safe) instead of a dynamic q.abs().max().
- Drop the CUDAGraph-capture safe-metadata prep and input validation helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Drop pa_decode_bf16_asm skip logging

Remove _log_pa_decode_bf16_asm_once, _skip_pa_decode_bf16_asm, the
_pa_decode_bf16_asm_log_keys set, and the now-unused logging import.
_should_dispatch_pa_decode_bf16_asm returns False directly for the
skipped cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Refactor PA ASM decode: clean up dispatch, remove dead code

- Flatten _dispatch_decode: sliding_window as first gate, then unified/triton/asm
- Inline _should_dispatch_pa_decode_bf16_asm logic, delete the method
- Remove _log_pa_decode_bf16_asm_once, _skip_pa_decode_bf16_asm, log keys set,
  unused logging import
- Remove pa_decode_bf16_asm_metadata field from AttentionMetaData
- Remove _pa_decode_bf16_asm_num_head_k (write-only)
- Remove gfx1250 guard from attention_mla.py fused FP8 GEMM path
- Clean up ATOM_GFX1250_FALLBACK env var and simplify env var docs

Co-Authored-By: Claude <noreply@anthropic.com>

* Remove obsolete gptoss PA ASM shuffle repro doc

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix PA ASM decode: handle fp8 query input and bf16 output

- Skip q quantization when rope_cache already produced fp8 query
- Allocate output as explicit bf16 (kernel requires bf16 output,
  empty_like inherited fp8 from q_5d)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: ahmed-bsod <Muhammad.Ahmed@amd.com>
Co-authored-by: hwang <hwang@ctheliosp-1b112-a43-1.amd.com>
Co-authored-by: HaonanWang98 <hwang@amd.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ainer run flags (ROCm#1281)

* run MI355 and MI308 GPU shards in parallel
…OCm#1272)

* feat: fuse V4 decode SWA cache-write into qk_norm_rope_maybe_quant

Thread the SWA ring scatter through the qk_norm+rope bridge so the V4
decode path no longer launches a standalone swa_write per layer. When
swa_kv is provided, the post-norm/rope KV row is written into
swa_kv[slot, pos % cache_size, :] (slot = state_slot_mapping[
batch_id_per_token[t]]) inside the same kernel:

- flydsl path: fuses the scatter into the qk_norm launch (no extra
  kernel, no [T, D] KV HBM round-trip), via the new swa_kv /
  state_slot_mapping / batch_id_per_token args on flydsl_qk_norm_rope_quant.
- Triton fallback: emits the existing swa_write as a separate launch
  (driven by swa_cu_seqlens_q + state_slot_mapping) so both backends have
  identical side effects.
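
The side effect both backends must produce can be written as a plain-Python reference, assuming nested lists in place of tensors. The function name and the separate `positions` argument are illustrative assumptions; the indexing expression follows the description above.

```python
from typing import List


def swa_ring_scatter(
    swa_kv: List[List[List[float]]],   # [num_slots][cache_size][dim]
    kv_rows: List[List[float]],        # [num_tokens][dim], post-norm/rope
    positions: List[int],              # absolute position of each token
    batch_id_per_token: List[int],
    state_slot_mapping: List[int],
) -> None:
    """Write each token's KV row into its sequence's ring buffer, in place."""
    cache_size = len(swa_kv[0])
    for t, row in enumerate(kv_rows):
        slot = state_slot_mapping[batch_id_per_token[t]]
        # The modulo wraps old positions around, giving a sliding window.
        swa_kv[slot][positions[t] % cache_size] = list(row)
```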

deepseek_v4.py decode deletes its standalone swa_write call and passes
the SWA args through the bridge instead; prefill is unchanged (still
writes its in-chunk SWA tail via swa_write after sparse_attn). BF16 only.

Requires the matching aiter change (ROCm/aiter#3776) for the flydsl
fused-scatter kernel support.

* ci: drop GLM-5-FP8 from benchmark matrix to stay under 256 cells

The nightly atom-benchmark grid had grown to 264 fully-expanded matrix
cells, exceeding GitHub Actions' hard limit of 256 configurations per
job. Remove the GLM-5-FP8 benchmark variant (superseded by GLM-5.2-FP8,
which is retained) and its workflow_dispatch checkbox (keeping it in sync
with the catalog prefixes). Matrix now resolves to 250 cells.

Accuracy validation (models_accuracy.json) and the dashboard color map
are left unchanged — GLM-5-FP8 stays covered there.

* fix: standardize V4 batch_id_per_token on int32 for fused SWA scatter

The fused decode SWA scatter loads batch_id_per_token at int32 width
(see ROCm/aiter#3793). The producers were int64, which raised
"batch_id_per_token must be 1-D int64" on the V4-Pro MTP decode path
(server failed to start -> accuracy job timed out).

Make all batch_id_per_token producers int32:
- v4_batch_id_per_token CpuGpuBuffer (model_runner path) int64 -> int32
- batch_id numpy sources (per-fwd + MTP draft) int64 -> int32
- sglang / vllm plugin bridge batch_id buffers + numpy sources -> int32

int32 indices are accepted by torch advanced-indexing (indexer meta) and
by the triton kernels (tl.load is dtype-agnostic); the explicit
.to(torch.int64) casts in csa_translate_pack / sglang remain and tolerate
int32 input. batch_id values are bounded by batch size, far below 2^31.

Validated end-to-end: DeepSeek-V4-Pro MTP3 GSM8K (3-shot) flexible
0.9477 / strict 0.9484, above the 0.94 CI threshold; decode drained
cleanly with no TypeError.
Replace the two-step indexer Q preparation (bf16 rope_rotate_activation +
separate get_hip_quant(per_1x128)) with the fused fp8 path: a single
rope_rotate_activation call that applies RoPE + Hadamard-rotate and writes
the fp8-quantized Q with its per-(token, head) block scale via out_scale.

The bf16 rotated Q is never read back, so quantizing it in-kernel avoids
materializing the intermediate. group_size = head_dim (128) => one scale
per (token, head). The fused kernel's fp8 quant matches
dynamic_per_group_scaled_quant_kernel.
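
A sketch of what "one scale per (token, head)" means when the group size equals the head dimension. It assumes a dynamic scale of max-abs divided by the FP8 maximum and uses 448.0 (the e4m3fn bound) as a placeholder; the actual kernel's FP8 variant and rounding are not shown, and the function name is hypothetical.

```python
from typing import List, Tuple

FP8_MAX = 448.0  # assumption: e4m3fn; other FP8 variants use a different bound


def quantize_per_head(
    q: List[List[List[float]]], fp8_max: float = FP8_MAX
) -> Tuple[List[List[List[float]]], List[List[float]]]:
    """q is [token][head][head_dim]; returns (scaled values, per-head scales)."""
    scaled, scales = [], []
    for token in q:
        token_scaled, token_scales = [], []
        for head in token:
            amax = max((abs(v) for v in head), default=0.0)
            # One scale covers the whole head because group_size == head_dim.
            scale = amax / fp8_max if amax > 0.0 else 1.0
            token_scales.append(scale)
            # Values are scaled and clamped only; rounding to the FP8 grid is omitted.
            token_scaled.append([max(-fp8_max, min(fp8_max, v / scale)) for v in head])
        scaled.append(token_scaled)
        scales.append(token_scales)
    return scaled, scales
```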

Verified on DeepSeek-V4-Pro: GSM8K 3-shot ~0.953-0.957 and 10-shot 0.9568
(baseline 0.9522 +/- 0.0059, no regression); conc-16 throughput
1644 tok/s (on par with baseline).
* Remove static scale calculation from forward function - saves 2 kernel launches

* Forward swiglu_limit through triton_kernel_moe_forward

triton_kernel_moe_forward did not accept or forward swiglu_limit,
causing triton_kernel_fused_experts to use its default of 7.0.
Models using standard routing got incorrect activation clamping.

* Raise NotImplementedError for SiLU + FP8 in triton MoE

The SiLU branch only handles FP4 and BF16 activations. FP8 silently
fell through to the BF16 path (moe_gemm_a16w4), ignoring the
activation scales entirely. Fail loudly instead.

* Guard triton path in apply() when ATOM_V4_TORCH_MOE is set

ATOM_V4_TORCH_MOE causes process_weights_after_loading to return
early before swizzling weights. Without a matching guard in apply(),
the triton path receives un-swizzled weights producing garbage output
or crashing on missing shared expert attributes.

* Fix activation type annotation from str to ActivationType

Both triton functions and the base class annotated activation as str
with default "silu". All callers pass ActivationType enum values and
the branch comparison uses ActivationType.Swiglu. A caller trusting
the annotation could pass a string, silently taking the wrong branch.
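
The failure mode is easy to reproduce with a stand-in enum. `ActivationTypeSketch` and `uses_swiglu_branch` below are hypothetical names for illustration; they are not the real `ActivationType` or the triton MoE code.

```python
from enum import Enum


class ActivationTypeSketch(Enum):
    Silu = 0
    Swiglu = 1


def uses_swiglu_branch(activation) -> bool:
    # Mirrors a branch written as `activation == ActivationType.Swiglu`.
    return activation == ActivationTypeSketch.Swiglu
```

A string such as "swiglu" never compares equal to an enum member, so a caller who trusts a `str` annotation falls into the other branch without any error.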

* Use function parameter for apply_router_weight_on_input in triton path

Both triton routing paths read layer.apply_router_weight_on_input
instead of the function parameter. Currently they match because the
caller passes self.apply_router_weight_on_input, but the function
argument was being silently ignored.

* Stash and apply biases for shared experts in dense GEMM path

process_weights_after_loading stashed shared expert weights and
scales but not biases. _apply_shared_experts_dense therefore dropped
biases for any future model combining has_bias=True with fused
shared experts.

* Fix swiglu_limit default to 7.0 to match inner kernel default

The wrapper triton_kernel_moe_forward defaulted swiglu_limit to 0.0
but the inner triton_kernel_fused_experts defaults to 7.0. For SwiGLU
models (GPT-OSS), limit=0.0 clamps activations to zero producing
garbage output.
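
To see why a limit of 0.0 is destructive, here is a scalar sketch of a clamped SwiGLU. The formula and the `alpha` constant follow the publicly available GPT-OSS reference as I understand it and should be read as assumptions; the triton kernel itself is not reproduced here.

```python
import math


def clamped_swiglu(gate: float, up: float, limit: float = 7.0, alpha: float = 1.702) -> float:
    """Scalar clamped SwiGLU: gate is capped above, up is clamped symmetrically."""
    gate = min(gate, limit)
    up = max(-limit, min(up, limit))
    sigmoid = 1.0 / (1.0 + math.exp(-alpha * gate))
    return gate * sigmoid * (up + 1.0)
```

With `limit=0.0` every positive gate is clamped to 0 and the linear branch is pinned to 0, so the output no longer depends on `up` at all.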

* Remove dead ATOM_V4_TORCH_MOE env var check

This debug escape hatch skipped weight swizzling in
process_weights_after_loading but had no matching guard in apply(),
silently feeding un-swizzled weights to triton kernels. Not used
anywhere else in the codebase.

* style: format moe.py for black check

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
gbyu-amd and others added 23 commits June 26, 2026 22:55
* fix online quant

* update comment

* format

---------

Co-authored-by: ganyi <ygan@amd.com>
…ig from JSON file (ROCm#1190)

* [atom-vllm nightly acc] remove config in workflow file and fetch config from JSON file

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* remove term name

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
* Modify Qwen3.5-35B-A3B-FP8 runner

* Replace jq with python3

* [fix](ci): mv plugin test to new node

* Modify jq with python3 for vllm-test and add runner name in actionlint

* [fix](ci): mv qwen3.5 test to mi355 node

* Add model cache mount path for vllm-test

* Add model cache mount for sglang-test

* Adapt model cache mount path for new runner

* Use host network

* Remove deepseek-r1-fp8-tp4 from sglang-test

* Align Kimi K2.5 PR CI with nightly settings

Co-authored-by: Cursor <cursoragent@cursor.com>

* Restore DeepSeek R1 FP8 TP4 SGLang CI

Co-authored-by: Cursor <cursoragent@cursor.com>

* Lower Kimi K2.5 PR accuracy threshold

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: perzhang <perzhang@amd.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>
Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…embed param naming (ROCm#1378)

* perf(server): cut event-loop work in streaming hot path

- Reuse engine-computed num_prompt_tokens in the stream response
  generators instead of re-encoding the prompt on the event loop at
  stream start (drops a redundant per-request tokenize).
- Run multimodal input prep (image download + HF processor) in a worker
  thread instead of synchronously on the event loop.
- Batch-decode a whole step's buffered stream chunks with one
  tokenizer.batch_decode in flush_stream_batch instead of one decode per
  seq on the output thread (one GIL-released call instead of N).
- Coalesce each request's finalization SSE messages (content/finish +
  usage + [DONE]) into a single send to cut socket-write syscalls when
  many requests finish simultaneously.
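
The last point can be sketched as building one payload instead of several. The helper name and the chunk shapes are assumptions for illustration; only the idea of sending content/finish, usage, and [DONE] in a single write comes from the commit message.

```python
import json
from typing import Any, Dict, Iterable


def coalesce_final_sse(chunks: Iterable[Dict[str, Any]]) -> bytes:
    """Join a request's final SSE events and the [DONE] marker into one payload."""
    parts = [f"data: {json.dumps(chunk)}\n\n" for chunk in chunks]
    parts.append("data: [DONE]\n\n")
    # One bytes object means one socket write instead of one per event.
    return "".join(parts).encode("utf-8")
```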

* perf(server): enable uvloop event loop; fix gpt-oss embed param naming

uvloop:
- Run uvicorn on uvloop (libuv) instead of the stdlib asyncio selector
  loop, with graceful fallback to the default loop if uvloop is absent.
  Under high streaming concurrency this cuts the event-loop cost of SSE
  socket I/O (sock.send / selector register-unregister): steady-state
  TPOT P99 8.50ms -> 8.18ms and frontend loop-scheduling delay roughly
  halved. Adds uvloop to dependencies.
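
One way to express the graceful fallback, as a sketch: pick the loop name based on whether uvloop is importable. `pick_event_loop` is a hypothetical helper; uvicorn's `loop` setting accepts names such as "uvloop" and "asyncio", but how ATOM actually wires this up is not shown in the commit message.

```python
import importlib.util


def pick_event_loop() -> str:
    """Return "uvloop" when the package is installed, else the stdlib loop."""
    if importlib.util.find_spec("uvloop") is not None:
        return "uvloop"
    return "asyncio"


# Possible usage (not executed here), assuming an ASGI `app` object:
#   uvicorn.run(app, host="0.0.0.0", port=8000, loop=pick_event_loop())
```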

gpt-oss:
- Register `embed_tokens` first (with `embedding` as the shared-storage
  alias) so it stays the primary, non-deduped name in named_parameters().
  The checkpoint stores `model.embed_tokens.weight`; with `embedding` as
  the primary name the load-completeness check falsely flagged
  `model.embedding.weight` as unloaded even though the weight is loaded
  via the alias. Byte-identical weights (GSM8K 0.8832, unchanged); the
  spurious "parameters were NOT loaded" warning is gone.

* feat(openai): support Qwen3 (qwen3_coder/qwen3_xml) tool-call format

ATOM's OpenAI/Anthropic servers previously only parsed the Kimi-K2 tool-call
token format (<|tool_calls_section_begin|>...), so Qwen3.5/Qwen3.6 tool calls --
emitted as qwen3_coder XML (<tool_call><function=NAME><parameter=...>) -- were
returned as plain text and never surfaced as structured tool_calls. Agent
frontends (qwen-code, OpenCode, etc.) therefore could not drive tools.

Add Qwen3 XML parsing alongside the Kimi format, auto-detected:

- tool_parser.py: parse <tool_call>/<function=>/<parameter=> into OpenAI
  tool_calls, with JSON-Schema type coercion of parameter values from the
  request's tools (the XML is typeless). Non-streaming + streaming (stream
  content, then buffer+parse the tool-call block -- robust against the
  partial-XML streaming edge cases seen in vLLM/SGLang). Kimi path unchanged.
- protocol.py: deserialize tool_calls[].function.arguments (a JSON string in
  OpenAI requests) to a mapping in to_template_dict, so multi-turn chat
  templates that iterate arguments.items() (Qwen, Hermes) render tool history
  instead of raising "Can only get item pairs from a mapping".
- serving_chat.py / api_server.py: thread the request's tools into the parsers
  for type coercion (default None preserves existing behavior).
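
As an illustration of the non-streaming path, the sketch below parses the qwen3_coder XML shape named above and coerces parameter values using the request's tool schemas. It is a simplified stand-in, not the code in tool_parser.py: `parse_tool_calls`, `_coerce`, and the regexes are hypothetical, and it assumes well-formed, fully buffered XML.

```python
import json
import re
from typing import Any, Optional

_TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
_FUNCTION = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
_PARAMETER = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def _coerce(raw: str, json_type: Optional[str]) -> Any:
    """Convert typeless XML text using the JSON-Schema type, if one is known."""
    text = raw.strip()
    try:
        if json_type == "integer":
            return int(text)
        if json_type == "number":
            return float(text)
        if json_type == "boolean":
            return {"true": True, "false": False}[text.lower()]
        if json_type in ("object", "array"):
            return json.loads(text)
    except (ValueError, KeyError):
        pass  # coercion failed: keep the raw string
    return text


def parse_tool_calls(text: str, tools: Optional[list] = None) -> list:
    """Return OpenAI-style tool_calls (without ids) found in `text`."""
    schemas = {}
    for tool in tools or []:
        fn = tool.get("function", {})
        schemas[fn.get("name")] = fn.get("parameters", {}).get("properties", {})
    calls = []
    for block in _TOOL_CALL.findall(text):
        for name, body in _FUNCTION.findall(block):
            props = schemas.get(name, {})
            args = {
                key: _coerce(value, props.get(key, {}).get("type"))
                for key, value in _PARAMETER.findall(body)
            }
            # OpenAI responses carry arguments as a JSON string.
            calls.append(
                {"type": "function",
                 "function": {"name": name, "arguments": json.dumps(args)}}
            )
    return calls
```

With `tools=None` every parameter stays a string, which mirrors the "default None preserves existing behavior" note above.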

Verified: Qwen3.6-27B BF16 served by ATOM drives qwen-code end-to-end on
gfx1151 -- write_file + run-shell tool calls execute and the agent reports the
program output.

* fix(openai): don't pass tools to the /v1/completions stream path

The previous commit's threading of request.tools matched the
stream_completion_response / stream_completion_response_fanout calls in the
/v1/completions handler too. CompletionRequest has no `tools` field, so
/v1/completions raised "AttributeError: 'CompletionRequest' object has no
attribute 'tools'" (HTTP 500). Tool calling only applies to chat; drop tools
from the text-completion stream calls.

* fix(openai): make tool-call ids unique across the conversation

The parser generated ids from a per-response index (call_0, call_1, ...), so the
first tool call in every assistant turn was call_0. OpenAI tool-call ids must be
unique across the whole conversation; agentic clients (e.g. qwen-code) dedupe by
id and silently ignore every repeat -> the tool never executes and the model
retries forever (endless tool-call loop on any multi-tool task). Use a random
call_<uuid> id at both the non-streaming and streaming emit sites.
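
A minimal sketch of the id scheme, assuming the hex form of a random UUID; the exact formatting used at the emit sites is not shown in this commit message, and `new_tool_call_id` is a hypothetical helper name.

```python
import uuid


def new_tool_call_id() -> str:
    """Return an id that does not depend on the call's position in a response."""
    # uuid4 is random, so ids stay distinct across turns and across requests.
    return f"call_{uuid.uuid4().hex}"
```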
…n concurrency (ROCm#1381)

The conc=1000 accuracy job intermittently failed: the server exhausted its
per-process open-file limit while accepting ~1000 concurrent connections
(plus the engine's DP-rank ZMQ and shared-memory fds), hitting EMFILE on
accept(). The default soft RLIMIT_NOFILE (~1024) is simply too low for that
connection count.

Root cause is that ATOM never raised its own fd soft limit. vLLM and SGLang
both call set_ulimit() at process startup for exactly this reason, and ATOM's
own mesh launch scripts already pass `--ulimit nofile=65536:524288` to docker
-- but plain `python -m atom.entrypoints.openai_server` launches (CI, ad-hoc)
inherit the daemon default and never bump it.

Add a set_ulimit() helper (raise soft -> min(65535, hard)) and call it at the
server entry point before the engine-core subprocesses are spawned, so the
raised limit is inherited. No-op when the soft limit is already high enough.

This is independent of the event-loop choice; it removes the fd ceiling that
turned ordinary high-concurrency load into dropped connections.
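
The helper described above might look roughly like this. It is a sketch under stated assumptions, not ATOM's implementation: it uses only the stdlib `resource` module (Unix only), and the handling of an unlimited hard limit and of `setrlimit` failures is my own choice.

```python
import resource


def set_ulimit(target_soft: int = 65535) -> int:
    """Raise the soft RLIMIT_NOFILE toward min(target_soft, hard); return the new soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unlimited hard limit does not cap the target.
    wanted = target_soft if hard == resource.RLIM_INFINITY else min(target_soft, hard)
    if soft != resource.RLIM_INFINITY and soft < wanted:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            pass  # assumption: keep the old limit rather than abort startup
    # No-op when the soft limit was already high enough.
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```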
* ci: validate accuracy catalogs against JSON Schema in pre-checks

Add a JSON Schema for the flat accuracy catalogs (models_accuracy.json,
oot_models_accuracy.json, sglang_models_accuracy.json) plus a
validate_catalog.py gate wired into the pre-checks (T0) workflow.

additionalProperties:false locks the current shape so typos / stray fields
fail CI; a semantic rule requires each entry to declare exactly one pass-bar
spelling (accuracy_threshold / accuracy_test_threshold). The existing
extraArgs/extra_args and threshold-name drift is tolerated for now and will be
normalized separately. Documented in benchmark/README.md.
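
The semantic rule (exactly one pass-bar spelling per entry) is the part a JSON Schema alone does not express cleanly, so here is a small stdlib-only sketch of it. It assumes the catalog is a flat list of dict entries; `pass_bar_errors` is a hypothetical name, not the function in validate_catalog.py.

```python
PASS_BAR_KEYS = ("accuracy_threshold", "accuracy_test_threshold")


def pass_bar_errors(entries: list) -> list:
    """Return one message per entry that does not declare exactly one pass bar."""
    errors = []
    for index, entry in enumerate(entries):
        present = [key for key in PASS_BAR_KEYS if key in entry]
        if len(present) != 1:
            errors.append(
                f"entry {index}: expected exactly one of {PASS_BAR_KEYS}, found {present}"
            )
    return errors
```

A gate script would exit non-zero when the returned list is non-empty.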

* ci: extract docker login into reusable docker-auth composite action

Replace the inline `echo $PASSWORD | docker login` steps in the ATOM-native
workflows (atom-test, atom-benchmark, atom-mmstar-ci, docker-release,
atomesh-accuracy-validation) with a shared .github/actions/docker-auth composite.

Credentials are passed via env instead of being interpolated into the run
command, removing the template-injection vector. The composite also supports an
explicit registry, image-derived registry, and a custom engine so the
vllm/sglang call sites can reuse it in a follow-up.

* ci: de-inline aiter wheel download into a shared script

Extract the ~163-line aiter wheel resolve+download block (byte-identical in
atom-test and atomesh-accuracy-validation) into
.github/scripts/download_aiter_wheel.sh; both workflows now call it
(net -326 inline lines).

Logic matches the previous inline block exactly. GITHUB_TOKEN is passed via env
instead of being interpolated into the run command, and the S3 / API /
workflow-id constants become overridable env defaults.

atom-mmstar-ci uses a simpler S3-only variant (no artifact fallback) and is
left for a follow-up.

* ci: de-inline aiter wheel install into a shared script

Extract the identical "Install aiter from wheel" block from atom-test and
atomesh-accuracy-validation into .github/scripts/install_aiter_wheel.sh.

Behavior matches the previous inline block (no outer set -e, so a missing wheel
still hits the explicit error+ls path). CONTAINER_NAME comes from the job env;
the wheel dir is an overridable env default (/tmp/aiter-whl).

atom-mmstar-ci uses a --no-deps variant from a different dir and is left for a
follow-up.

* ci: extract CI container startup into setup-gpu-container composite

Replace the identical ~60-line "Start CI container" steps in atom-test and
atomesh-accuracy-validation with a shared .github/actions/setup-gpu-container
composite. The three differences are inputs: network-host (atom-test sets host
networking), extra-run-flags (atomesh adds USE_ATOMESH_ENTRYPOINTS/ATOM_SERVER_PORT),
and the runner label that drives the --pull policy.

The assembled docker run command is byte-identical to the previous inline blocks
for both callers (verified with a stubbed docker). atom-mmstar-ci / docker-release
/ gpu-load-test use more divergent startup blocks and are left for a follow-up.

* ci: serialize gh-pages deploys with a shared concurrency group

All six workflows that push to the gh-pages branch (docs, deploy-pages,
atom-benchmark, atomesh-mocker-benchmark, atom-sglang-benchmark,
atom-vllm-benchmark) now run their deploy job under a shared concurrency group
(gh-pages-deploy, cancel-in-progress: false).

This serializes the fetch/checkout/commit/push dance so concurrent runs can no
longer race on the branch and drop each other's updates. Job-level
concurrency is independent of the existing workflow-level groups, so redundant-run
cancellation is unchanged.

* ci: bump artifact actions off deprecated Node 20 (@v4 -> @v7/@v8)

actions/upload-artifact@v4 and actions/download-artifact@v4 run on the
deprecated Node 20 runtime. Bump the remaining @v4 pins to the versions already
used elsewhere in the repo (upload-artifact@v7, download-artifact@v8), which run
on Node 24.

All affected download steps fetch a single named artifact to an explicit path,
so behavior is unchanged across the major bump; v4-v8 share the same artifact
backend.

* test: align per-req-cache and connector-metadata tests with current behavior

The per-req-cache tests asserted a removed design where stateful requests
deducted 'equiv blocks' from the KV pool and were tracked in a
per_req_cache_accounting dict. The current BlockManager sizes the state
tensor separately and excludes it from num_kvcache_blocks, so admission only
claims a free slot index with no extra paged-block cost. Rewrite the seven
stale tests to the slot-only model (can_allocate returns -1/hit-count, not
False/bool) and rename two to match what they now verify.

ConnectorMetadata._build_req_meta parses transfer params leniently via
dict.get, so a missing field yields None instead of raising KeyError. Update
the connector-metadata test accordingly.

* test: make non-unit disaggregation tests skip visibly off the unit path

test_proxy gains importorskip guards for its optional msgpack/quart deps, so
it runs where they are installed and skips with a reason otherwise instead of
erroring at collection.

test_transfer_engine and test_kv_connector_scheduler import the
kv_transfer_engine module that ROCm#690 split into the moriio subpackage; guard
them with importorskip so they skip visibly (with a reason pointing at the
needed path update) until the disaggregation owner refreshes them.

Delete test_kimi_k25: it exec-loads the real atom/config.py at import time,
which collides with conftest's atom package stub and cannot run under the
shared unit harness.

* test: remove obsolete mxfp4 swiglu source-introspection test

test_swiglu_branch_condition_no_bias_check asserted that
Mxfp4MoEMethod.process_weights_after_loading contains a literal
'layer.activation == ActivationType.Swiglu:' branch. That function was
refactored to route via use_triton vs the AITER shuffle path, so the branch
no longer exists in that form and the test had been @unittest.skip'd as
obsolete. Drop it; the sibling test_swiglu_branch_does_not_couple_bias_and_shuffle
still guards against the original coupled-condition regression.

* ci: add non-GPU unit test gate to pre-checks

Run the native unit suite on ubuntu-latest as part of Pre Checkin, alongside
black/ruff/validate-catalog. .github/scripts/run_unit_tests.sh centralizes the
scope: it runs tests/ minus tests/plugin (next-stage sglang/vllm/rtpllm work,
which also installs import-time sys.modules stubs that would pollute native
tests) and minus the GPU server integration test; P/D disaggregation tests
self-skip via importorskip guards. The job installs CPU torch + base deps,
emits a JUnit report, and uploads it as an artifact.

Locally: 464 passed, 2 skipped, 0 failed.

* test: fix unit gate failures on the non-GPU runner

The new pre-checks unit job failed on ubuntu (no aiter, no PIL) for two
reasons, both now fixed:

- test_api_server_helpers leaked stub modules. When the api_server import
  fails (PIL absent), the except branch reset _injected_modules to [] before
  the finally cleanup ran, so the injected stub for atom.model_engine.arg_utils
  was never popped from sys.modules. It then shadowed the real EngineArgs for
  test_arg_utils_spec (collected later), which failed with _StubEngineArgs /
  missing SpeculativeConfig. Drop the reset so finally always tears the stubs
  down, and pre-initialize _injected_modules so finally is safe if stub
  installation itself raises. Verified by blocking PIL locally: arg_utils tests
  pass, api_server tests skip cleanly.

- test_mxfp4_moe_has_bias loads atom.config / atom.model_ops.moe, which import
  the AITER GPU kernel library (no CPU build). Guard the module with
  pytest.importorskip('aiter') so it skips visibly off the non-GPU gate and
  runs in GPU CI.
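
The stub-leak fix in the first bullet comes down to a try/finally shape like the one below. This is an illustrative reduction with hypothetical names (`import_with_stubs`, `importer`), not the test module itself.

```python
import sys
import types


def import_with_stubs(stub_names, importer):
    """Install stub modules, run `importer`, and always remove the stubs again."""
    injected = []  # pre-initialized so `finally` is safe if stubbing itself raises
    try:
        for name in stub_names:
            if name not in sys.modules:
                sys.modules[name] = types.ModuleType(name)
                injected.append(name)
        return importer()
    except ImportError:
        # The caller treats None as "skip". `injected` is deliberately NOT
        # reset here; resetting it was what left the stubs behind.
        return None
    finally:
        for name in injected:
            sys.modules.pop(name, None)
```

Because cleanup lives only in `finally`, a failed import and a successful one leave `sys.modules` in the same state for tests collected later.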

* ci: checkout repo in download_aiter_wheel jobs

The download_aiter_wheel jobs in atom-test and atomesh-accuracy-validation
have no checkout step — the original inline bash ran from the YAML directly.
De-inlining the logic into .github/scripts/download_aiter_wheel.sh introduced a
dependency on the file being present on the runner, so the jobs failed with
'No such file or directory' (exit 127). Add actions/checkout@v6 to both jobs.

* ci: drop literal ${{ }} from docker-auth description

GitHub evaluates ${{ }} expressions in an action's description field, and the
secrets context is not available to composite actions. The description quoted
the inline secret-interpolation form verbatim with braces, so loading the
composite failed at runtime with 'Unrecognized named-value: secrets',
short-circuiting Docker Login in atom-test/atomesh. Reword without braces.

actionlint does not evaluate description expressions, so this only surfaced on
a real runner.

* ci: clone aiter with full history so its version isn't 0.0.0

The image build shallow-cloned aiter (git clone --depth 1), so its
setuptools_scm version fell back to 0.0.0 (no tags reachable), making the
baked-in aiter indistinguishable by version. Use --filter=blob:none instead:
full commit history + tags (so setuptools_scm computes a real version) while
deferring blob downloads to keep the clone fast. Submodule init is unaffected.

Native workflows only (atom-test, atomesh-accuracy-validation); the sglang/vllm
benchmark workflows have the same shallow clone but are out of scope for now.

* ci(benchmark): print the full benchmark command before running

Build the benchmark_serving invocation as a bash array and printf it right
after 'Running benchmark test', so the exact resolved command (model, ISL/OSL,
concurrency, extra args) is visible in the client log. Running the array
guarantees the printed command matches what executes.

* ci: notify Teams on nightly/release workflow failure

Add a workflow_run listener that posts a Teams message when a native scheduled
workflow fails (ATOM Test, ATOM Benchmark, Atomesh Accuracy Validation, Pre
Checkin, Nightly Docker Release). Single listener instead of per-workflow steps
— zero changes to the targets. Filtered to conclusion==failure and
event==schedule so only nightly/release runs notify, not PRs.

Posts an Adaptive Card (built with jq; run metadata passed via env to avoid
template injection) to a Teams 'Post to a channel when a webhook request is
received' Workflows webhook — classic O365 connector Incoming Webhooks were
retired in 2026. Requires a TEAMS_WEBHOOK_URL repo secret; until it's set the
job no-ops without failing. workflow_run fires from the default-branch copy, so
it activates after merge.

* fix(ci): unindent resolve_download_url python so the S3 fast-path works

The python3 -c body in download_aiter_wheel.sh indented its continuation lines
to match the bash block, putting leading whitespace inside the single-quoted
source -> 'IndentationError: unexpected indent'. resolve_download_url is called
under a non-set-e context (download_from_s3_manifest), so the error was swallowed
and the S3 manifest fast-path silently fell back to artifact enumeration every
run. Move the python body to column 0 (leading newline) so it parses.
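
The failure mode is easy to reproduce in isolation. The strings below are made-up stand-ins for the `python3 -c` body (the real script's contents are not shown here); `compiles` is a hypothetical helper.

```python
# Continuation lines indented to match the surrounding shell block.
INDENTED_BODY = "import json\n    print(json.dumps({'ok': True}))\n"

# Same statements at column 0, with a leading newline as in the fix.
FIXED_BODY = "\nimport json\nprint(json.dumps({'ok': True}))\n"


def compiles(source: str) -> bool:
    """Report whether `source` parses as a Python module."""
    try:
        compile(source, "<python -c>", "exec")
    except IndentationError:
        return False
    return True
```

`compiles(INDENTED_BODY)` is false because the second line is an unexpected indent, while `compiles(FIXED_BODY)` is true.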

* ci: serialize native accuracy-dashboard gh-pages pushes

The gh-pages serialization added the gh-pages-deploy concurrency group to the
docs/benchmark deployers but missed two native jobs that also auto-push to
gh-pages: atom-test 'Update accuracy dashboard' and atomesh 'Publish Atomesh
accuracy data'. Add the same group so their auto-push can't race the serialized
deploys on the gh-pages branch.
* [atom-vllm] enable DP/DP+EP/TP+EP for atom-vllm model

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* support kimi2.5 64 attn heads (ROCm#886)

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* make lint happy

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* fix

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* trim decode tensors for moe

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fix non triton routing expert mask in moe

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fold heads to 8

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* black

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* add enable dbo

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fix mha kv lens

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* bind single topk index buffer across sparse mla ubatches

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* add dsv32 dbo to nightly ci

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fix topk

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fix kimi k2.5 dbo

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fix

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fix

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* clean the code by removing dbo related code

* remove dbo test cases

* clean up DP+EP PR: remove dead code, fix stale docs/types

Review follow-ups (no behavior change):
- Remove unused get_mla_persistent_metadata_dtypes() in attention/layer_mla.py
  (defined, never referenced).
- Remove unused self.n_shared_experts in qwen3_next.py (only set, never read;
  all real uses go through config.n_shared_experts).
- AiterMlaPersistentMetadataForVllm fields -> torch.Tensor | None, matching the
  all-None disabled_mla_persistent_metadata() construction path.
- Update init_aiter_dist_from_vllm docstring to reflect it reuses TP/PP/DP/EP
  (not just TP), matching the earlier tp->dist rename.
- Fix misleading fold_factor docstring example (32 -> 8, the realized case).

Verified with black --check and ruff check in container guanbao_vllm_atom_0609.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(moe): replace config-time is_vllm() with Config flags (PR ROCm#1101 cleanup)

Move the three config-time is_vllm() branches out of native MoE files into
plugin-set Config flags, so the frontend decides policy in atom/plugin/config.py
instead of native code querying is_vllm() at the call site. Behavior-preserving:
defaults are native-correct and the vLLM builder sets the flags to its prior
values.

- moe_ep_flatten_tp_across_dp (default False): MoE EP computes ranks in the
  flattened DP x TP device space and disables fused shared experts. vLLM sets it
  from enable_expert_parallel. Replaces is_vllm() in topK.py
  (is_rocm_aiter_fusion_shared_expert_enabled_for_quant_config) and moe.py
  (FusedMoEParallelConfig.make).
- mori_max_tokens_per_dp_rank (default 16384): per-DP-rank MORI dispatch buffer
  size. vLLM sizes it to max_num_batched_tokens. Replaces is_vllm() in moe.py
  (_maybe_make_prepare_finalize).

Removes the native imports of is_vllm from moe.py and topK.py.
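
A rough sketch of the pattern in this commit: native code reads plain flags with native-correct defaults, and the frontend sets them. The two field names and defaults come from the message above; the `Config` shape and `build_vllm_config` are hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Defaults are the native-correct values quoted in the commit message.
    moe_ep_flatten_tp_across_dp: bool = False
    mori_max_tokens_per_dp_rank: int = 16384


def build_vllm_config(enable_expert_parallel: bool, max_num_batched_tokens: int) -> Config:
    """Hypothetical plugin-side builder: the frontend decides policy here."""
    return Config(
        moe_ep_flatten_tp_across_dp=enable_expert_parallel,
        mori_max_tokens_per_dp_rank=max_num_batched_tokens,
    )
```

Native call sites then branch on `config.moe_ep_flatten_tp_across_dp` rather than asking which frontend is running.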

* fix(v4-ep): stop trimming MORI dispatch buffer below received-token count

DeepSeek-V4-Flash with dp4+ep4 hit an illegal memory access in the AITER FlyDSL
fused_moe kernel under concurrent/prefill load.

Root cause: in FusedMoEModularKernel.forward the cudagraph-style trim shrank
dispatch_a1 to the decode bound graph_bs*topk*dp_size while fused_moe was still
driven by num_local_tokens=expert_num_tokens (the true received-token count). In
a mixed DP+EP batch a decoding rank (small graph_bs) still receives many tokens
via the all-to-all from prefilling peers, so the bound under-counts recv (e.g.
trim to 24 rows while recv=638) and the MoE kernel reads past the trimmed buffer
-> illegal memory access. EP-only, so pure tp4 is unaffected.

Fix: atom-vllm uses an exact-recv trim (trim_vllm_mori_dispatch_tensors), which
trims to the graph_bs*topk*ep bound only under a uniform FULL-cudagraph batch
(bound >= recv by construction), skips trimming during graph capture, and
otherwise trims to the exact received-token count. The native ATOM path keeps
the decode-bound trim.
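
The three trimming cases can be written down with plain integers. This is a sketch of the decision only, with a hypothetical function name and signature; the real trim_vllm_mori_dispatch_tensors operates on tensors. The values `graph_bs=1`, `topk=6`, `ep_size=4` are illustrative assumptions chosen to reproduce the 24-row bound quoted above.

```python
def rows_to_keep(
    buffer_rows: int,
    recv_tokens: int,
    graph_bs: int,
    topk: int,
    ep_size: int,
    *,
    uniform_full_cudagraph: bool,
    capturing: bool,
) -> int:
    """Return how many dispatch-buffer rows the MoE kernel may safely read."""
    if capturing:
        return buffer_rows  # no trimming during graph capture
    if uniform_full_cudagraph:
        # Bound is >= recv by construction only in this uniform case.
        return min(buffer_rows, graph_bs * topk * ep_size)
    return min(buffer_rows, recv_tokens)  # exact received-token count


# Mixed batch: the decode bound would be 24 rows while 638 tokens arrived.
decode_bound = 1 * 6 * 4
kept = rows_to_keep(16384, 638, 1, 6, 4, uniform_full_cudagraph=False, capturing=False)
```

Here `decode_bound` is smaller than the received count, which is the out-of-bounds condition, while `kept` covers every received row.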

The two frontend-specific MORI seams are kept out of native fused-moe code
(PR ROCm#1101 is_vllm() cleanup): native modular_kernel/mori_prepare_finalize expose
overridable methods (_maybe_trim_dispatch_output, _get_dispatch_config) with the
native default, and the atom-vllm behavior is injected by a plugin monkeypatch
(atom/plugin/vllm/mori_patch.py, applied from register.py). No is_vllm() and no
Config flag in native files. The MORI all-to-all buffer is sized from
moe.max_num_tokens instead of a dedicated Config field.

Originally validated dp4+ep4 eager: 0 crashes across a full gsm8k run (vs
immediate crash before); gsm8k 3-shot exact_match = 0.96.

---------

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
Co-authored-by: kliuae <17350011+kliuae@users.noreply.github.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: ganyi <ygan@amd.com>
…on (ROCm#1368)

MiniMax-M3 sparse attention reuses the unified KV cache and kv_scale for
K/V, so the fp8 per-token scales already travel with the KV blocks. It
keeps one extra per-token buffer, runner.sparse_attention_index_cache,
holding the indexer keys used for top-k block selection at decode time.
get_kv_transfer_tensors() never registered that buffer, so under PD
disaggregation the decode node ran top-k against a zero/stale index for
the prefilled tokens and attended to the wrong KV blocks. This is masked
for short prompts (the init+local+topk window already covers every block,
so selection is moot) but corrupts output once the context exceeds that
window.

Register the indexer-key cache as block-indexed transfer regions (one per
sparse layer, same physical-block striding as the KV cache), guarded by
getattr so non-sparse models and bf16 paths are unaffected.

Tested (latest image, 1P+1D TP4, fp8 KV via Triton attention): GSM8K
5-shot = 0.9401, i.e. no regression to M3 fp8 PD. Short-prompt GSM8K does
not exercise the long-context top-k path the buffer affects; that path is
covered by review, not this run.
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
…currency (ROCm#1394)

At high --max-concurrency each in-flight request holds a socket fd. The
default soft RLIMIT_NOFILE (~1024) is exhausted client-side (EMFILE on
socket()), so most requests fail before reaching the server and the run
reports only ~one concurrency-wave of successes (e.g. ~919/10240 at
conc=1024) while the server logs 200 OK for every request it actually
receives. The server already calls set_ulimit() at startup; call it in the
benchmark client too (soft is raised toward 65535, capped at the hard limit).

Co-authored-by: ZhangLirong-amd <ZhangLirong-amd@users.noreply.github.com>
…OCm#1318)

* [Feature] OFFLOAD: add LMCache CPU/NVMe KV-offload subsystem

Add a standalone KV-offload subsystem that offloads ATOM KV cache to
LMCache-backed CPU/NVMe storage and reloads it on cache hits, avoiding
prefill recompute for evicted prefixes.

- New atom/kv_transfer/offload package: LMCacheOffloadConnector, the
  ATOM<->LMCache GPU connector, a byte codec for ATOM's packed KV layout,
  a Triton staging kernel, plus metadata and config.
- Wire the connector into the disaggregation base/factory/types and
  aggregate per-worker finished/failed transfer states.
- Unit tests for the connector and byte-codec round-trip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [Feature] OFFLOAD: integrate KV offload load/save into engine and scheduler

Drive offload load/save from the engine loop and scheduler:
- Dispatch async KV load after connector metadata, poll worker transfer
  status, and advance idle KV transfer when no forward batch runs.
- Defer block free until a background D2H save has read the KV, and wake
  parked prefills for local recompute on a load miss (failed_recving).
- Handle chunked-prefill deferred output across the offload park/resume
  boundary so stale sampled tokens are dropped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [Frontend] Support max completion tokens in OpenAI API

Honor max_completion_tokens (and the max_tokens alias) in the OpenAI
completion/chat protocol and server so offload benchmarks can bound
generation length. Adds protocol and server-helper unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [Docs] Add design README for LMCache CPU/NVMe KV offload connector

Document the ATOM standalone lmcache_offload connector: design, module
map, scheduler/worker architecture, byte codec and AITER layout bridge,
MemoryObj/segment layout, completion protocol, reload decision and
chunk-alignment handoff, correctness/fp8/failure handling, the LMCache
reuse-vs-override boundary, configuration, benchmarks, and tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>

* [Feature] OFFLOAD: support MLA models (DeepSeek R1/V3, Kimi)

The byte codec assumed a block-major KV layout (tensor.shape[0] == block
count), which holds for MHA/GQA but not for MLA: MLA stores a single
per-layer latent cache (kv_lora_rank + qk_rope_head_dim, e.g. 576) viewed
token-major as (num_blocks * block_size, 1, latent), with no separate V or
scale tensors. So shape[0] is the token count and the codec computed a
per-token (not per-block) byte stride, corrupting the offloaded KV.

Both layouts share an identical contiguous byte layout (block b always
starts at b * bytes_per_block), so instead of branching we take the physical
block count explicitly and derive each segment's per-block stride as
segment_bytes / num_blocks. The Triton fused staging kernel is byte-addressed
and needs no change.
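
The stride arithmetic can be checked with integers alone. The helper names below are hypothetical, and the sizes (4 blocks, block size 16, 2-byte elements) are illustrative assumptions; only the latent width 576 comes from the text above.

```python
def per_block_stride(segment_bytes: int, num_blocks: int) -> int:
    """Bytes per physical block for one contiguous segment."""
    if num_blocks <= 0 or segment_bytes % num_blocks:
        raise ValueError("segment size must be a positive multiple of num_blocks")
    return segment_bytes // num_blocks


def block_byte_range(block: int, segment_bytes: int, num_blocks: int):
    """Block b always starts at b * bytes_per_block."""
    stride = per_block_stride(segment_bytes, num_blocks)
    return block * stride, (block + 1) * stride


# MLA token-major view: shape (num_blocks * block_size, 1, latent).
num_blocks, block_size, latent, itemsize = 4, 16, 576, 2
tokens = num_blocks * block_size            # this is shape[0], NOT the block count
segment_bytes = tokens * 1 * latent * itemsize

correct_stride = per_block_stride(segment_bytes, num_blocks)
wrong_stride = segment_bytes // tokens      # what dividing by shape[0] yields
```

Dividing by `shape[0]` gives a per-token stride, a factor of `block_size` too small, which is the corruption described above.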

- ATOMKVByteCodec: accept explicit num_blocks; per-block bytes from it;
  require contiguous + numel divisible by num_blocks (replaces the
  "same shape[0]" check). Falls back to shape[0] when num_blocks is None,
  preserving non-MLA behaviour.
- Thread num_physical_kvcache_blocks: model_runner.allocate_kv_cache ->
  forward_context.set_kv_cache_data -> connector.register_kv_caches ->
  codec. Optional num_blocks kwarg added to the connector base + mooncake +
  moriio impls (ignored there).
- build_lmcache_metadata: emit an MLA-shaped kv_shape (latent dim) for
  bookkeeping; storage stays opaque BINARY so use_mla remains False.
- Tests: MLA token-major block accounting + byte-identical round-trip.

Validated on DeepSeek-V3-5layer (real MLA, TP=2) end-to-end: offload save +
reload (cxs multi-round, round 2 hits cached:[~33k], no recompute).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [Docs] Update LMCache offload README for MLA support

Document MLA (DeepSeek R1/V3, Kimi) KV-offload support in the connector
README: token-major latent cache layout, explicit num_blocks threading,
and BINARY opaque storage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>

* [Enhancement] OFFLOAD: make unaligned handoff always-on; drop OFFLOAD_UNALIGNED_HANDOFF

Unaligned HBM-prefix loads now always take the handoff path (recompute the
misaligned head up to the next chunk boundary, then load the aligned remainder
from CPU) instead of being gated behind the OFFLOAD_UNALIGNED_HANDOFF env var.
The env read and the gate check in _maybe_start_unaligned_handoff are removed;
the min_load / boundary guards are unchanged.
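
The boundary arithmetic behind the handoff can be sketched as below. The function names, the return shape, and the exact meaning of the `min_load` guard are assumptions made for illustration; the connector's real checks live in _maybe_start_unaligned_handoff and are not reproduced here.

```python
def next_chunk_boundary(hbm_tokens: int, chunk: int) -> int:
    """Round the HBM-resident prefix up to the next multiple of `chunk`."""
    return -(-hbm_tokens // chunk) * chunk


def plan_cpu_load(hbm_tokens: int, cpu_hit_tokens: int, chunk: int, min_load: int):
    """Return (action, start_token) for loading the CPU-cached remainder."""
    start = next_chunk_boundary(hbm_tokens, chunk)
    remainder = cpu_hit_tokens - start
    # Assumed guard: skip when too little would be loaded after the boundary.
    if remainder < max(min_load, 1):
        return ("skip", None)
    if start == hbm_tokens:
        return ("load", start)      # already aligned: load directly
    return ("handoff", start)       # recompute [hbm_tokens, start), then load
```

For an HBM prefix of 6 tokens and a chunk of 4, the boundary is 8, so tokens 6 and 7 are recomputed before the aligned remainder is loaded.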

- connector.py: drop _allow_unaligned_handoff + OFFLOAD_UNALIGNED_HANDOFF read;
  handoff is now unconditional.
- README.md / README.zh-CN.md: remove the env var from the tuning table and the
  example commands, add a "removed" note, and document the always-on behaviour.
  Also track the previously-untracked zh-CN README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>

* [Bugfix] OFFLOAD: drop duplicate enable_chunked_prefill kwarg in test MockConfig

The rebase onto main left two enable_chunked_prefill keys in MockConfig's
defaults dict (our branch moved it up and set =True; main kept the old =False
line while adding hf_config after it). Python raises SyntaxError on a repeated
keyword in a dict() call, so conftest failed to import and the whole unit-test
suite aborted (exit 4).

Keep the single enable_chunked_prefill=True (our intended default; all tests
that depend on the value override it explicitly) plus main's hf_config stub.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>

* [Bugfix] OFFLOAD: drop stale unaligned-skip test left over from always-on handoff

Commit "make unaligned handoff always-on; drop OFFLOAD_UNALIGNED_HANDOFF"
removed _allow_unaligned_handoff from the connector but left three dead
references in the test file, including test_load_is_skipped_if_hbm_floor_is_
not_chunk_aligned, which encodes the old default-skip behaviour. With handoff
now unconditional, an unaligned HBM floor (hbm=6, chunk=4, min_load=0) takes
the handoff path (park to boundary 8, emit the load once cached reaches 8)
instead of skipping the CPU load, so that test's `lookup.cleared == ["654"]`
no longer holds and it failed in CI.

The scenario is already covered by the always-on tests:
- test_unaligned_hbm_handoff_prefills_boundary_then_emits_load (handoff path)
- test_unaligned_handoff_skips_if_boundary_remainder_is_too_small (the real
  skip+clear case, gated by min_load rather than alignment)

Delete the stale test and the three no-op _allow_unaligned_handoff assignments
that the connector no longer reads.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>

* [OFFLOAD] Remove Chinese README.zh-CN.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: yihonglie <hyi@amd.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nchmark (ROCm#1371)

* [atom benchmark] add indexer cache for M3
in atom benchmark

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* remove the WA QR env flag now that aiter has fixed the issue

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add online quant into recipe

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
* Modify 'clean up containers' logic for sglang accuracy

* Modify model cache path to adapt to hjbog-20
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
* open multistream for vllm

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* rm patch

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* add eager guard

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

---------

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
…hmark schedule (ROCm#1405)

* ci: reuse setup-gpu-container in bench & mmstar container start

De-inline the duplicated "Start CI container" docker run from
atom-bench-container and atom-mmstar-ci into setup-gpu-container, so the
container boilerplate lives in one place.

setup-gpu-container:
- add pull-policy input (explicit always/missing/never; falls back to the
  runner-based heuristic when empty)
- add disable-mmap input (default true; set false to skip ATOM_DISABLE_MMAP)
- make runner optional (default "")
- key the env-file by container name so concurrent containers don't clobber it
- drop the duplicated -v/-w in docker run

atom-bench-container: Start step now uses setup-gpu-container
(network-host=true, pull-policy=always, container-env -> extra-run-flags);
keeps its model-download step.

atom-mmstar-ci: Start step now uses setup-gpu-container; passes
disable-mmap=false to keep byte-for-byte parity (mmstar never set
ATOM_DISABLE_MMAP). MODEL_CACHE_MOUNT == setup-gpu's auto /models mount.

Behavior is preserved; the only runtime change is that mmstar's image pull now
hard-fails on error (--pull always vs the prior best-effort docker pull).

* ci(benchmark): advance nightly schedule to 00:12 Beijing (16:12 UTC)

Move the ATOM Benchmark nightly cron 48 min earlier (01:00 -> 00:12 Beijing,
17:00 -> 16:12 UTC).

* ci(accuracy): move base DeepSeek-R1-0528 to nightly; condense long _baseline_note

- DeepSeek-R1-0528 (base) test_level pr -> nightly (no longer per-PR)
- Trim the 4 over-long _baseline_note entries (online-quant, MiMo-V2-Flash,
  MiMo-V2-Flash MTP, V4-Pro TBO+DPA) to <=270 chars; keep all hard facts
  (baselines, run ids, thresholds, MTP tp/num-spec constraints).
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
Co-authored-by: JiaoliangYu <Jiaoliang.Yu@amd.com>
* [Fix] Fix DeepSeek-V4 DP + EP on gfx942 (MI308X)

* Using the existing API
@JiaoliangYu JiaoliangYu force-pushed the feat/eplb-expert-load-pass branch from 03e503b to 042faeb Compare June 30, 2026 07:03
@JiaoliangYu JiaoliangYu force-pushed the feat/eplb-expert-load-pass branch from 042faeb to e7231dc Compare June 30, 2026 07:18
JiaoliangYu added 2 commits June 30, 2026 16:27
bincount + boolean-mask indexing + `.any()` host-sync raise
"operation not permitted when stream is capturing" inside the decode
cudagraph. Replace with fixed-shape scatter_add_ so the module-A record
hook can run within graph capture (return value unchanged).