feat(eplb): add per-layer expert-load statistics monitor for EP path #2
Open
JiaoliangYu wants to merge 97 commits into
* fix(v4): correct TBO + dp-attention accuracy regression
Two independent bugs collapsed DeepSeek-V4-Pro GSM8K accuracy under
`--enable-dp-attention --enable-tbo` (3-shot flexible-extract ~0.95 -> ~0.87,
and as low as ~0.57 at concurrency 1000):
1. ids-gather on a side stream (deepseek_v4.py)
The V4 hash-routing input_ids all-gather was run via
`_run_on_tbo_comm_stream` on a private side stream. The gathered ids must be
row-aligned with the DP-gathered hidden/router output (both share the
dp_group communicator and `_hash_topk` indexes them positionally via
`tid2eid[ids[:num_tokens]]`). Running it on an independent stream desyncs the
collective's placement relative to the per-ubatch hidden all-gathers and only
synchronizes the forward-top stream, not the TBO ubatch stream that consumes
it -> misaligned rows -> wrong expert routing. The corruption scales with
batch size (0.87 at conc 65, 0.57 at conc 1000).
Fix: run the gather INLINE on the compute stream (the ids tensor is tiny vs
hidden, so no real overlap is lost). Do NOT wrap it in the TBO ping-pong -- an
extra forward-top yield desyncs the ring and collapses accuracy to ~0.54.
2. uninitialized padding in pad_for_all_gather (moe.py)
Padding rows for the uniform all-gather were allocated with `torch.empty`
(garbage). Padded rows are all-gathered across DP ranks and fed straight into
the aiter fused-MoE expert GEMM, where garbage leaks into real tokens' outputs
(~0.7pp GSM8K drop at large batch).
Fix: explicitly zero the pad rows.
Also in moe.py: simplify reduce_scatter_with_unpadding to a dim-0 slice
(removes the deprecated list-indexing that triggered a PyTorch 2.9
UserWarning) and tidy pad_for_all_gather / all_gather_with_padding.
Add a DeepSeek-V4-Pro TBO+DPA nightly accuracy entry (num_concurrent=1000) to
models_accuracy.json. Local 1319-sample GSM8K 3-shot across runs:
0.9439 / 0.9484 / 0.9515 / 0.9530 / 0.9538 (mean ~0.950, baseline ~0.9522).
Update recipes/DeepSeek-V4.md accordingly.
* fix(v4): use dual-stream shared_experts.forward and drop pad-zeroing - dual_stream_moe_forward: call shared_experts.forward(x) directly on the alt stream. - pad_for_all_gather: drop the explicit pad-row zeroing now that DP all_gather goes through the IPC-registered (no-copy-in) path under graph capture (see ROCm/aiter#3713). conc1024 GSM8K stays within noise (0.9492).
* ci(mesh): add Atomesh accuracy and benchmark workflows - Validate standalone-mode accuracy via Atomesh entrypoints. - Mocker benchmark to PD routing scenarios with topology and consumer concurrency matrix. * [ci][mesh] add Atomesh mocker benchmark dashboard - Add a custom dashboard for Atomesh mocker benchmark results. - Show throughput, latency, detailed performance data, commit links, and CI run links. - Align the benchmark matrix with 1P1D, 2P1D, and 3P1D topologies across consumer concurrency levels. * [ci] Skip unrelated ATOM, vLLM, and SGLang CI for mesh-only PRs. * [ci][mesh] Enable mocker dashboard publishing workflow to run on zwan/feat-mesh-ci pushes. * Polish Atomesh mocker dashboard legends * [ci][mesh] fix atomesh standalone accuracy data source * Revert 'Enable mocker dashboard publishing workflow to run on zwan/feat-mesh-ci pushes.' * [ci][mesh] add logo and display theme for mesh mocker benchmark dashboard * [ci][mesh] Polish Atomesh dashboard and accuracy data flow
…m registration (ROCm#1214) * [Sglang_atom][Fix] Fix the issue where DPSK v3.2 cannot recognize atom model registration after upgrading SGLang to version 0.5.12. * [sglang_atom][scripts] add dpsk v3.2 tp8 and dpep case
* [ATOM SGL] dsv4 init * register * fix acc * fix profiler error * using 0.5.12 sglang * precheckin * remove local scratch files from PR Co-authored-by: Cursor <cursoragent@cursor.com> * remove local curtest script from PR Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: zhuyuhua-v <yuhzhu@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
…g enabled (ROCm#1184) * feat(server): add Anthropic Messages API endpoint (/v1/messages) Enables Claude Code and other Anthropic-compatible tools to use ATOM as a backend. Translates between Anthropic Messages format and ATOM's internal OpenAI format. Supports: - Non-streaming and streaming responses - System messages, multi-turn conversations - Thinking/reasoning content separation (via ReasoningFilter) - Anthropic SSE event format (message_start, content_block_delta, etc.) - Tool definitions translation (Anthropic → OpenAI format) Usage with Claude Code: ANTHROPIC_BASE_URL=http://localhost:8000 \ ANTHROPIC_AUTH_TOKEN=dummy \ ANTHROPIC_MODEL=MiniMax-M2.7 \ claude * fix(anthropic): fix streaming handler, reasoning filter, and Claude Code compat - Fix ToolCallStreamParser integration: consume (event_type, data) tuples from process()/flush() instead of calling nonexistent get_content()/ get_tool_calls() methods - Fix cleanup_streaming_request() call with missing request_id argument - Fix _build_sampling_params() missing ignore_eos, None top_k/top_p - Init ReasoningFilter in state 1 when chat template ends with <think>, so thinking models like K2.6 have reasoning properly hidden - Increase ReasoningFilter buffer threshold from 7 to 100 chars to avoid prematurely emitting thinking as visible content - Add prompt truncation when input exceeds max_model_len - Add cache_creation_input_tokens and cache_read_input_tokens to usage * fix(anthropic): pass tool definitions to model via chat template Claude Code sends tool schemas (WebSearch, Bash, etc.) in every request, but the /v1/messages handler was hardcoding tools=None. The model never saw tool definitions and couldn't generate proper tool_use calls. Now converts and forwards request.tools via anthropic_to_openai_tools(), enabling the model to use WebSearch, WebFetch, and other Claude Code tools. 
* fix(anthropic): suppress thinking blocks, add signature support - Skip streaming thinking blocks entirely to avoid Claude Code's signature verification rejection. Thinking still happens server-side but only the final answer is sent to the client. - Add signature field to thinking content blocks and signature_delta SSE events for compatibility with Claude Code 2.1.143+. - Add stream_signature_delta() helper function. * fix(anthropic): strip attribution header, use model tool IDs - Strip Claude Code's x-anthropic-billing-header from system prompt server-side (matches vLLM behavior) to preserve prefix caching - Use model-native tool call IDs (functions.name:index) instead of random UUIDs, matching vLLM's kimi_k2 parser for multi-turn compat - Remove unused uuid import from tool_parser - Add tests for attribution header stripping --------- Co-authored-by: carlushuang <carlus.huang@amd.com>
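The tool-definition translation mentioned above (Anthropic -> OpenAI format) can be pictured with a minimal sketch. Only the function name `anthropic_to_openai_tools` comes from the commit message; the body below is an assumption based on the two public wire formats (Anthropic tools carry `input_schema`, OpenAI function tools carry `parameters`) and is not the code in this PR.

```python
from __future__ import annotations

from typing import Any


def anthropic_to_openai_tools(tools: list[dict[str, Any]] | None) -> list[dict[str, Any]] | None:
    """Map Anthropic tool definitions onto OpenAI function tools (illustrative only)."""
    if not tools:
        # Nothing to forward; the chat template then sees no tools.
        return None
    converted = []
    for tool in tools:
        converted.append(
            {
                "type": "function",
                "function": {
                    "name": tool["name"],
                    "description": tool.get("description", ""),
                    # Anthropic names the JSON Schema "input_schema";
                    # OpenAI names the same schema "parameters".
                    "parameters": tool.get(
                        "input_schema", {"type": "object", "properties": {}}
                    ),
                },
            }
        )
    return converted


example = anthropic_to_openai_tools(
    [
        {
            "name": "Bash",
            "description": "Run a shell command",
            "input_schema": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
            },
        }
    ]
)
```

Returning `None` for an empty list keeps the no-tools request path identical to the previous hardcoded `tools=None` behavior.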
* [ATOM SGL] update fp8 prefill argument passing * use simpler setting * precheckin
… on serve startup (ROCm#1221)
…o to v4 flash and use tp4 (ROCm#1232) * [atom-vllm] change ci case from v4 pro to v4 flash and use tp4 Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add preflight check for atom-vllm ci Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add nccl handle error check Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
…rrupting MoE (ROCm#1229) * fix(v4): zero-init all-gather padding to stop uninitialized memory corrupting MoE pad_for_all_gather built the padding rows with torch.empty and never zeroed them (the .zero_() was commented out), contradicting the function's own docstring. Those uninitialized rows are all-gathered across DP ranks and fed straight into the aiter fused-MoE expert GEMM, and the padded input_ids reach tid2eid[ids] for V4 hash routing. Garbage there leaks into real tokens' outputs. Because the corruption is whatever happens to sit in freshly-allocated GPU memory, the result is nondeterministic across machines/runs: locally it landed at GSM8K ~0.95, but CI on a different SKU dropped to 0.9007 (TBO+DPA conc1000, below the 0.93 threshold) and a local rerun crashed with a null-pointer GPU memory access fault (garbage id -> out-of-range expert -> invalid weight ptr). Restoring the zero fixes all three: padding hidden is benign and padding ids route to expert 0. With the pad guaranteed zero, the _hash_topk clamp band-aid is replaced by an assert that input_ids length matches gating_output num_tokens, surfacing any real DP-layout mismatch instead of silently masking it. Also remove the _run_on_tbo_comm_stream side-stream helper: its only caller (MoE.combine_outputs TP all-reduce) now runs inline, matching the ids-gather which must stay inline to keep DP collective ordering aligned under TBO. Rename compress_stream -> indexer_stream for accuracy. Verified: V4-Pro TBO+DPA conc1000 GSM8K 3-shot = 0.9515 (flexible) / 0.9522 (strict), no GPU fault, drain clean. * ci: TEMP run only DeepSeek-V4-Pro TBO+DPA conc1000 (revert before merge) Flip every accuracy entry except the TBO+DPA conc1000 case to test_level "off" so any trigger (pr/push/dispatch/schedule) runs only this one job, to validate the pad zero-init fix in CI quickly. DO NOT MERGE this commit — drop it before merging the PR. 
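The zero-init fix described above can be illustrated without a GPU. This is a plain-Python sketch, not the `pad_for_all_gather` in moe.py: the helper name `pad_rows_with_zeros` and the list-based "tensors" are hypothetical. It shows why zero padding is benign for hash routing: a padded id of 0 is always a valid index into the `tid2eid` table, whereas arbitrary garbage may not be.

```python
def pad_rows_with_zeros(rows, target_len, width):
    """Pad a rank's rows up to target_len using explicit zero rows.

    Hypothetical stand-in for the padding step; the real code works on
    torch tensors and all-gathers the result across DP ranks.
    """
    if len(rows) > target_len:
        raise ValueError("rank holds more rows than the uniform length")
    padding = [[0] * width for _ in range(target_len - len(rows))]
    return [list(r) for r in rows] + padding


# Two real tokens, padded to a uniform length of four rows.
hidden = pad_rows_with_zeros([[1.5, 2.5], [3.5, 4.5]], target_len=4, width=2)
# Token ids are padded the same way (width 1), so pad ids are all 0.
ids = [row[0] for row in pad_rows_with_zeros([[7], [9]], target_len=4, width=1)]

# A toy token-id -> expert-id table; id 0 is always in range.
tid2eid = {0: 0, 7: 3, 9: 1}
routed = [tid2eid[i] for i in ids]
```

With uninitialized padding the lookup in the last line could hit an id outside the table, which matches the out-of-range expert fault described in the commit message.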
* Fix TBO 1024c accuracy issue by removing CPU yield in collective op
(cherry picked from commit 9bf2d25)
* test(v4): disable pad zero-init for CI repro + print server cmd
- moe.py: temporarily comment out pad_for_all_gather zero-init to reproduce
the uninitialized-padding behavior in CI (the CI gate already restricts the
run to the V4-Pro TBO+DPA conc1000 case).
- deepseek_v4.py: restore the tid2eid[ids] clamp as a bounds guard for hash
routing.
- atom_test.sh: print the full openai_server command line before launch so
the CI log shows the exact server args.
Experiment on top of the pad zero-init fix; not for merge as-is.
* ci: restore full accuracy matrix (undo temp single-case gate)
Reverts the test_level "off" gate from 3662ac0; all accuracy cases are
re-enabled at their original pr/main/nightly levels. The CI experiment that
needed only DeepSeek-V4-Pro TBO+DPA conc1000 is done.
* ci: lower gpt-oss-120b accuracy threshold to 0.87
Both gpt-oss-120b entries (1-GPU and 2-GPU) drop from 0.88 to 0.87 to absorb
run-to-run GSM8K variance. Other models unchanged.
* perf(v4): fuse _hash_topk into a single Triton kernel
The hash-routing custom_routing_function for V4's first layers ran
softplus+sqrt over every routed expert (n_routed_experts ~256-384) but kept
only topk (~6) of them, plus separate clamp / tid2eid gather / score gather /
renorm / scale ops. triton_hash_topk.py fuses all of it into one kernel (one
program per token): id clamp, tid2eid[id] lookup, gating gather at the
selected experts only, sqrt(softplus(.)), optional renorm and scaling. When
shared experts are fused it writes directly into the first topk columns of
the global topK buffer, avoiding an extra copy.
Numerics match the PyTorch path (max|dw| ~1e-7 fp32 / ~5e-7 bf16 across OOB
ids, bf16, renorm on/off, sliced-buffer write). V4-Pro TBO+DPA conc1000 GSM8K
3-shot = 0.9522.
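The per-token work that the fused `_hash_topk` kernel performs can be written as a scalar reference. The sketch below is an assumption about the math only (clamp, table lookup, gather at the selected experts, sqrt(softplus), optional renorm, scale); the function name `hash_topk_reference`, its signature, and the order of renorm before scaling are illustrative and not taken from triton_hash_topk.py.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def hash_topk_reference(gating_row, token_id, tid2eid, renormalize=True, scale=1.0):
    """Return (expert_ids, weights) for one token.

    gating_row: router logits for every routed expert.
    tid2eid:    table mapping a token id to its fixed list of expert ids.
    """
    # Clamp out-of-range ids into the table instead of faulting.
    tid = min(max(token_id, 0), len(tid2eid) - 1)
    expert_ids = list(tid2eid[tid])
    # Score only the selected experts, not the full expert dimension.
    weights = [math.sqrt(softplus(gating_row[e])) for e in expert_ids]
    if renormalize:
        total = sum(weights)
        if total > 0.0:
            weights = [w / total for w in weights]
    return expert_ids, [w * scale for w in weights]


logits = [0.0, 1.0, -1.0, 2.0]
table = [[0, 1], [2, 3]]
ids_a, w_a = hash_topk_reference(logits, 5, table)  # id 5 clamps to row 1
ids_b, w_b = hash_topk_reference(logits, -3, table, renormalize=False, scale=2.0)
```

Scoring only the `topk` selected experts is the saving the commit describes: the activation runs over a handful of values per token instead of every routed expert.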
* ci: print server cmd with [@] expansion to match actual invocation Use ${ARRAY[@]} instead of ${ARRAY[*]} in the debug echo so the printed command line reflects the same word-splitting/quoting as the real launch that uses "${ARRAY[@]}" (addresses Copilot review). --------- Co-authored-by: ZhangLirong-amd <Lirong.Zhang@amd.com>
…case (ROCm#1216) * [atom-vllm benchmark MTP] refine benchmark command for atom-vllm MTP case Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add performance mode for glm4.7 mtp case and qwen3next mtp case Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add qwen3next mtp config Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * remove perf mode because it is useless Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * fix missing allreduce for glm4.7 mtp Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * align atom-vllm acc test Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add mtp accept ratio check Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
…Cm#1008) * Release mi308 benchmark results to dashboard * dashboard hardware default mi355x * add atom-sglang mi308 ci * Release mi308 benchmark results to dashboard * dashboard hardware default mi355x * Supplementary content for MI308 * Add steps: Print runner user * dashboard: hardware single-select filter and dynamic header meta * Modify models for MI308X * Modify models for MI308 * Make sure python version for MI308 * Modify python3 version for MI308 with python-build-standalone --------- Co-authored-by: root <root@hjbog-srdc-15.amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: let rtp use atom & qwen35_moe impl * fix: log print too much * fix: rtp+atom default moe ep * fix: add rtp+atom attention_inputs.position_ids * fix: qwen35 fp8_perblock pass and add some print * fix: remove dump_tensor * refactor: remove dump and refactor attention_backend * refactor: remove redundant code * refactor: kvcache and remove redundant code * feat: some opt in positions & layer_group_map * refactor: some optimizations and del redundant code * fix: RTP Qwen35 skip_python_model * feat: enable cuda graph for ATOM+RTP * refactor: remove redundant code * fix: non cuda graph long input crash * test:cover RTP plugin import and seq_lens behavior * test: ruff check * [RTP]Refactor ATOM-RTPLLM Attention * Refactor RTP prepare model entrance * fix: ruff check * fix: address RTP plugin review feedback * fix: remove redundant RTP Qwen3.5 import aliases * fix: qwen35 ruff check F401
* glm_moe_dsa: support GLM-5.2 IndexShare (FP8)
GLM-5.2 (glm_moe_dsa) extends the DeepSeek-V3.2-style DSA stack with
IndexShare: layers marked "shared" in `indexer_types` reuse the preceding
"full" layer's indexer/topk and carry no indexer weights of their own in
the checkpoint.
- models/deepseek_v2.py:
- Make `indexer_types` the authoritative source for the per-layer
indexer-skip decision (supersedes index_topk_pattern / index_topk_freq).
- Honor `index_skip_topk_offset` in the freq-based fallback (default 1
preserves existing DeepSeek behavior).
- Reuse the cached topk for the MTP layer when
`index_share_for_mtp_iteration` is set.
- Do not build indexer weights for "shared" layers; otherwise their
parameters load nothing from the checkpoint, stay at init values and
corrupt the indexer (the forward and the index-cache binding already
guard on `indexer is not None`).
- config.py: auto-enable `use_index_cache` for glm_moe_dsa when the model
declares an IndexShare schedule, so serving works without passing an
--hf-overrides flag.
- plugin/vllm/model_wrapper.py: re-apply the auto-enable after vLLM
replaces ATOM's hf_config.
Validated on 8x MI355X (TP=8, FP8): native ATOM loads all weights with no
unloaded params and generates correctly for 1k/1k and 8k/1k inputs.
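The IndexShare rule above ("shared" layers reuse the preceding "full" layer's indexer/topk) can be sketched as a small schedule resolver. This is illustrative only: `resolve_topk_sources` is a hypothetical helper, and the freq-based fallback (`index_topk_freq`, `index_skip_topk_offset`) is deliberately not modeled because its exact formula is not stated here.

```python
def resolve_topk_sources(indexer_types):
    """For each layer, return the index of the layer whose top-k it uses.

    "full" layers run their own indexer; "shared" layers reuse the most
    recent "full" layer and therefore need no indexer weights.
    """
    sources = []
    last_full = None
    for layer_idx, kind in enumerate(indexer_types):
        if kind == "full":
            last_full = layer_idx
        elif kind == "shared":
            if last_full is None:
                raise ValueError(f"layer {layer_idx} is shared but no full layer precedes it")
        else:
            raise ValueError(f"unknown indexer type {kind!r} at layer {layer_idx}")
        sources.append(last_full)
    return sources


def layers_needing_indexer_weights(indexer_types):
    # Only "full" layers should build (and load) indexer parameters.
    return [i for i, kind in enumerate(indexer_types) if kind == "full"]


schedule = ["full", "shared", "shared", "full", "shared"]
sources = resolve_topk_sources(schedule)
with_weights = layers_needing_indexer_weights(schedule)
```

Building indexer weights only for the layers in `with_weights` mirrors the fix above: a "shared" layer that allocated parameters would find nothing to load from the checkpoint.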
* docs: document GLM-5.2 (IndexShare) serving + add News entry
- recipes/GLM-5.md: add a GLM-5.2 (IndexShare) section with the TP8 serve
command, configuration tips (bf16 KV, gpu-mem-util 0.8), and 8xMI355X
perf baselines for 1k/1k and 8k/1k; add a pointer from the intro.
- README.md: add a News entry announcing GLM-5.2 FP8 support.
* docs: note GLM-5.2 in README Supported Models table
* style: black formatting for indexer_types skip return
* style: condense GLM-5.2 code comments
* refactor: move maybe_enable_glm_dsa_index_cache into deepseek_v2
Own the indexer-cache auto-enable in the model: call it once in
DeepseekV2ForCausalLM.__init__ (covers native + vLLM plugin) instead of
in config.get_hf_config and the vLLM wrapper.
* refactor: inline index-cache enable into _should_skip_index_topk
Drop maybe_enable_glm_dsa_index_cache; instead, when index_topk_freq > 1
(IndexShare) turn on use_index_cache directly in _should_skip_index_topk.
No model_type gating needed.
* refactor: gate index_topk_freq check under the use_index_cache branch
* refactor: drop redundant 'or 1' guard on index_topk_freq
* benchmark: add GLM-5.2-FP8 to dashboard (perf + accuracy)
Native-engine catalog entries for the nightly dashboard:
- models.json: TP8 FP8, kv_cache_dtype fp8, --gpu-memory-utilization 0.8
(DSA index cache OOMs at default 0.9), conc up to 256.
- models_accuracy.json: gsm8k threshold 0.92 (measured 3-shot
flexible-extract 0.9447 on 8x MI355X).
Co-authored-by: JiaoliangYu <jiaolyu@amd.com>
change model cache mount for AAC machine (ROCm#1247)
…seek-r1-fp4-tp4-dp8-ep8 to tp8-dp8-ep8 (ROCm#1258) * Modify sgl accuracy schedule time and change deepseek-r1-fp4-tp4-dp8-ep8 to tp8-dp8-ep8 * Remove duplicate cases
* ci: add actionlint workflow check * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * ci: fix existing actionlint findings * ci: keep accuracy validation steps alias * ci: remove stale SGLang accuracy inputs --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…schedule mode (ROCm#1268) * Modify atom-sglang-benchmark model priority for schedule mode
* temp for docker
* feat: default PA BF16 ASM on gfx1250 unified attention
Route eligible non-SWA decode layers through pa_decode_bf16_asm on gfx1250
when ATOM_USE_UNIFIED_ATTN is enabled. ATOM_FORCE_ATTN_TRITON disables this
path and keeps attention on Triton. Keep mainline unified-attention SHUFFLE KV
layout, remove the old ATOM_GPTOSS_USE_PA_DECODE_BF16_ASM env gate, and warn
instead of asserting when ATOM_USE_UNIFIED_ATTN sees a non-default block size.
Validation: git diff --check origin/main...HEAD; python -m py_compile
atom/model_ops/attention_mha.py atom/model_ops/attentions/aiter_attention.py
atom/model_ops/attentions/triton_mha.py atom/utils/envs.py.
* fix: keep bf16 query for PA ASM decode
* Fix PA ASM cudagraph metadata refresh
* Zero PA ASM padded decode rows
* fix softmax scale
* set is_causal=True in ps_metadata
* Refactor PA ASM decode: unify metadata and merge bf16_asm into persistent_asm
- Switch pa_decode_bf16_asm metadata from get_ps_metadata_v1 to
get_pa_metadata_v1, so it shares the persistent worker buffers used by
pa_persistent_fwd. Drop the separate pa_decode_bf16_asm_* buffer set,
set_pa_decode_bf16_asm_metadata, and the per-layer fallback metadata builder.
- Merge paged_attention_pa_decode_bf16_asm into
paged_attention_persistent_asm: sinks is None -> pa_persistent_fwd, else ->
pa_decode_bf16_asm. dispatch_backend routes both through persistent_asm.
- Enable persistent decode for block_size 256 and 1024 (was 1024 only).
- Guard paged_attention_asm against sinks (run_pa_fwd_asm has no sink support).
- Simplify q fp8 quant: use the fixed kv_scale_float for q/k/v dequant scale
(pre-allocated tensor, CUDAGraph-safe) instead of a dynamic q.abs().max().
- Drop the CUDAGraph-capture safe-metadata prep and input validation helpers.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Drop pa_decode_bf16_asm skip logging Remove _log_pa_decode_bf16_asm_once, _skip_pa_decode_bf16_asm, the _pa_decode_bf16_asm_log_keys set, and the now-unused logging import. _should_dispatch_pa_decode_bf16_asm returns False directly for the skipped cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Refactor PA ASM decode: clean up dispatch, remove dead code - Flatten _dispatch_decode: sliding_window as first gate, then unified/triton/asm - Inline _should_dispatch_pa_decode_bf16_asm logic, delete the method - Remove _log_pa_decode_bf16_asm_once, _skip_pa_decode_bf16_asm, log keys set, unused logging import - Remove pa_decode_bf16_asm_metadata field from AttentionMetaData - Remove _pa_decode_bf16_asm_num_head_k (write-only) - Remove gfx1250 guard from attention_mla.py fused FP8 GEMM path - Clean up ATOM_GFX1250_FALLBACK env var and simplify env var docs Co-Authored-By: Claude <noreply@anthropic.com> * Remove obsolete gptoss PA ASM shuffle repro doc Co-Authored-By: Claude <noreply@anthropic.com> * Fix PA ASM decode: handle fp8 query input and bf16 output - Skip q quantization when rope_cache already produced fp8 query - Allocate output as explicit bf16 (kernel requires bf16 output, empty_like inherited fp8 from q_5d) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: ahmed-bsod <Muhammad.Ahmed@amd.com> Co-authored-by: hwang <hwang@ctheliosp-1b112-a43-1.amd.com> Co-authored-by: HaonanWang98 <hwang@amd.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ainer run flags (ROCm#1281) * run MI355 and MI308 GPU shards in parallel
…OCm#1272) * feat: fuse V4 decode SWA cache-write into qk_norm_rope_maybe_quant Thread the SWA ring scatter through the qk_norm+rope bridge so the V4 decode path no longer launches a standalone swa_write per layer. When swa_kv is provided, the post-norm/rope KV row is written into swa_kv[slot, pos % cache_size, :] (slot = state_slot_mapping[ batch_id_per_token[t]]) inside the same kernel: - flydsl path: fuses the scatter into the qk_norm launch (no extra kernel, no [T, D] KV HBM round-trip), via the new swa_kv / state_slot_mapping / batch_id_per_token args on flydsl_qk_norm_rope_quant. - Triton fallback: emits the existing swa_write as a separate launch (driven by swa_cu_seqlens_q + state_slot_mapping) so both backends have identical side effects. deepseek_v4.py decode deletes its standalone swa_write call and passes the SWA args through the bridge instead; prefill is unchanged (still writes its in-chunk SWA tail via swa_write after sparse_attn). BF16 only. Requires the matching aiter change (ROCm/aiter#3776) for the flydsl fused-scatter kernel support. * ci: drop GLM-5-FP8 from benchmark matrix to stay under 256 cells The nightly atom-benchmark grid had grown to 264 fully-expanded matrix cells, exceeding GitHub Actions' hard limit of 256 configurations per job. Remove the GLM-5-FP8 benchmark variant (superseded by GLM-5.2-FP8, which is retained) and its workflow_dispatch checkbox (keeping it in sync with the catalog prefixes). Matrix now resolves to 250 cells. Accuracy validation (models_accuracy.json) and the dashboard color map are left unchanged — GLM-5-FP8 stays covered there. * fix: standardize V4 batch_id_per_token on int32 for fused SWA scatter The fused decode SWA scatter loads batch_id_per_token at int32 width (see ROCm/aiter#3793). The producers were int64, which raised "batch_id_per_token must be 1-D int64" on the V4-Pro MTP decode path (server failed to start -> accuracy job timed out). 
Make all batch_id_per_token producers int32: - v4_batch_id_per_token CpuGpuBuffer (model_runner path) int64 -> int32 - batch_id numpy sources (per-fwd + MTP draft) int64 -> int32 - sglang / vllm plugin bridge batch_id buffers + numpy sources -> int32 int32 indices are accepted by torch advanced-indexing (indexer meta) and by the triton kernels (tl.load is dtype-agnostic); the explicit .to(torch.int64) casts in csa_translate_pack / sglang remain and tolerate int32 input. batch_id values are bounded by batch size, far below 2^31. Validated end-to-end: DeepSeek-V4-Pro MTP3 GSM8K (3-shot) flexible 0.9477 / strict 0.9484, above the 0.94 CI threshold; decode drained cleanly with no TypeError.
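The ring scatter described above, `swa_kv[slot, pos % cache_size, :]` with `slot = state_slot_mapping[batch_id_per_token[t]]`, can be spelled out as a plain-Python reference. This is a sketch of the indexing only, using nested lists in place of tensors; `swa_ring_write` is a hypothetical name and says nothing about how the flydsl or Triton kernels are implemented.

```python
def swa_ring_write(swa_kv, kv_rows, positions, batch_id_per_token, state_slot_mapping):
    """Scatter each token's KV row into its sequence's sliding-window ring.

    swa_kv:              [num_slots][cache_size][dim] ring buffers, updated in place.
    kv_rows:             [num_tokens][dim] post-norm/rope KV rows.
    positions:           absolute position of each token in its sequence.
    batch_id_per_token:  which request each token belongs to.
    state_slot_mapping:  request index -> ring-buffer slot.
    """
    cache_size = len(swa_kv[0])
    for t, row in enumerate(kv_rows):
        slot = state_slot_mapping[batch_id_per_token[t]]
        # The ring index wraps, so old entries are overwritten once the
        # sequence grows past the window.
        swa_kv[slot][positions[t] % cache_size] = list(row)
    return swa_kv


# Two slots, window of 4 positions, 2-wide rows.
cache = [[[0.0, 0.0] for _ in range(4)] for _ in range(2)]
swa_ring_write(
    cache,
    kv_rows=[[1.0, 1.0], [2.0, 2.0]],
    positions=[5, 2],
    batch_id_per_token=[0, 1],
    state_slot_mapping=[1, 0],
)
```

The batch ids here are small non-negative integers bounded by the batch size, which is why the int32 width adopted in the follow-up fix is sufficient for them.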
Replace the two-step indexer Q preparation (bf16 rope_rotate_activation + separate get_hip_quant(per_1x128)) with the fused fp8 path: a single rope_rotate_activation call that applies RoPE + Hadamard-rotate and writes the fp8-quantized Q with its per-(token, head) block scale via out_scale. The bf16 rotated Q is never read back, so quantizing it in-kernel avoids materializing the intermediate. group_size = head_dim (128) => one scale per (token, head). The fused kernel's fp8 quant matches dynamic_per_group_scaled_quant_kernel. Verified on DeepSeek-V4-Pro: GSM8K 3-shot ~0.953-0.957 and 10-shot 0.9568 (baseline 0.9522 +/- 0.0059, no regression); conc-16 throughput 1644 tok/s (on par with baseline).
* Remove static scale calculation from forward function - saves 2 kernel launches * Forward swiglu_limit through triton_kernel_moe_forward triton_kernel_moe_forward did not accept or forward swiglu_limit, causing triton_kernel_fused_experts to use its default of 7.0. Models using standard routing got incorrect activation clamping. * Raise NotImplementedError for SiLU + FP8 in triton MoE The SiLU branch only handles FP4 and BF16 activations. FP8 silently fell through to the BF16 path (moe_gemm_a16w4), ignoring the activation scales entirely. Fail loudly instead. * Guard triton path in apply() when ATOM_V4_TORCH_MOE is set ATOM_V4_TORCH_MOE causes process_weights_after_loading to return early before swizzling weights. Without a matching guard in apply(), the triton path receives un-swizzled weights producing garbage output or crashing on missing shared expert attributes. * Fix activation type annotation from str to ActivationType Both triton functions and the base class annotated activation as str with default "silu". All callers pass ActivationType enum values and the branch comparison uses ActivationType.Swiglu. A caller trusting the annotation could pass a string, silently taking the wrong branch. * Use function parameter for apply_router_weight_on_input in triton path Both triton routing paths read layer.apply_router_weight_on_input instead of the function parameter. Currently they match because the caller passes self.apply_router_weight_on_input, but the function argument was being silently ignored. * Stash and apply biases for shared experts in dense GEMM path process_weights_after_loading stashed shared expert weights and scales but not biases. _apply_shared_experts_dense therefore dropped biases for any future model combining has_bias=True with fused shared experts. * Fix swiglu_limit default to 7.0 to match inner kernel default The wrapper triton_kernel_moe_forward defaulted swiglu_limit to 0.0 but the inner triton_kernel_fused_experts defaults to 7.0. 
For SwiGLU models (GPT-OSS), limit=0.0 clamps activations to zero producing garbage output. * Remove dead ATOM_V4_TORCH_MOE env var check This debug escape hatch skipped weight swizzling in process_weights_after_loading but had no matching guard in apply(), silently feeding un-swizzled weights to triton kernels. Not used anywhere else in the codebase. * style: format moe.py for black check --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* fix online quant * update comment * format --------- Co-authored-by: ganyi <ygan@amd.com>
…ig from JSON file (ROCm#1190) * [atom-vllm nightly acc] remove config in workflow file and fetch config from JSON file Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * remove term name Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
* Modify Qwen3.5-35B-A3B-FP8 runner * Replace jq with python3 * [fix](ci): mv plugin test to new node * Modify jq with python3 for vllm-test and add runner name in actionlint * [fix](ci): mv qwen3.5 test to mi355 node * Add model cache mount path for vllm-test * Add model cache mount for sglang-test * Adapt model cache mount path for new runner * Use host network * Remove deepseek-r1-fp8-tp4 from sglang-test * Align Kimi K2.5 PR CI with nightly settings Co-authored-by: Cursor <cursoragent@cursor.com> * Restore DeepSeek R1 FP8 TP4 SGLang CI Co-authored-by: Cursor <cursoragent@cursor.com> * Lower Kimi K2.5 PR accuracy threshold Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: perzhang <perzhang@amd.com> Co-authored-by: wuhuikx <hattie.wu@amd.com> Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
…embed param naming (ROCm#1378) * perf(server): cut event-loop work in streaming hot path - Reuse engine-computed num_prompt_tokens in the stream response generators instead of re-encoding the prompt on the event loop at stream start (drops a redundant per-request tokenize). - Run multimodal input prep (image download + HF processor) in a worker thread instead of synchronously on the event loop. - Batch-decode a whole step's buffered stream chunks with one tokenizer.batch_decode in flush_stream_batch instead of one decode per seq on the output thread (one GIL-released call instead of N). - Coalesce each request's finalization SSE messages (content/finish + usage + [DONE]) into a single send to cut socket-write syscalls when many requests finish simultaneously. * perf(server): enable uvloop event loop; fix gpt-oss embed param naming uvloop: - Run uvicorn on uvloop (libuv) instead of the stdlib asyncio selector loop, with graceful fallback to the default loop if uvloop is absent. Under high streaming concurrency this cuts the event-loop cost of SSE socket I/O (sock.send / selector register-unregister): steady-state TPOT P99 8.50ms -> 8.18ms and frontend loop-scheduling delay roughly halved. Adds uvloop to dependencies. gpt-oss: - Register `embed_tokens` first (with `embedding` as the shared-storage alias) so it stays the primary, non-deduped name in named_parameters(). The checkpoint stores `model.embed_tokens.weight`; with `embedding` as the primary name the load-completeness check falsely flagged `model.embedding.weight` as unloaded even though the weight is loaded via the alias. Byte-identical weights (GSM8K 0.8832, unchanged); the spurious "parameters were NOT loaded" warning is gone.
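The "uvloop with graceful fallback" behavior described above can be sketched independently of uvicorn. The helper names `pick_loop_factory` and `run` are hypothetical; the only library facts relied on are that `uvloop.new_event_loop()` and `asyncio.new_event_loop()` each return an event loop. How the actual server wires this into uvicorn is not shown here.

```python
import asyncio


def pick_loop_factory():
    """Return (name, factory), preferring uvloop when it is importable."""
    try:
        import uvloop  # optional dependency
    except ImportError:
        # Graceful fallback: the stdlib loop still works, just with more
        # per-socket overhead under heavy streaming concurrency.
        return "asyncio", asyncio.new_event_loop
    return "uvloop", uvloop.new_event_loop


def run(coro):
    """Run a coroutine to completion on whichever loop is available."""
    name, factory = pick_loop_factory()
    loop = factory()
    try:
        return name, loop.run_until_complete(coro)
    finally:
        loop.close()


async def _ping():
    await asyncio.sleep(0)
    return "pong"


loop_name, reply = run(_ping())
```

Selecting the factory at startup keeps the rest of the server unaware of which loop implementation is in use.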
) * feat(openai): support Qwen3 (qwen3_coder/qwen3_xml) tool-call format ATOM's OpenAI/Anthropic servers previously only parsed the Kimi-K2 tool-call token format (<|tool_calls_section_begin|>...), so Qwen3.5/Qwen3.6 tool calls -- emitted as qwen3_coder XML (<tool_call><function=NAME><parameter=...>) -- were returned as plain text and never surfaced as structured tool_calls. Agent frontends (qwen-code, OpenCode, etc.) therefore could not drive tools. Add Qwen3 XML parsing alongside the Kimi format, auto-detected: - tool_parser.py: parse <tool_call>/<function=>/<parameter=> into OpenAI tool_calls, with JSON-Schema type coercion of parameter values from the request's tools (the XML is typeless). Non-streaming + streaming (stream content, then buffer+parse the tool-call block -- robust against the partial-XML streaming edge cases seen in vLLM/SGLang). Kimi path unchanged. - protocol.py: deserialize tool_calls[].function.arguments (a JSON string in OpenAI requests) to a mapping in to_template_dict, so multi-turn chat templates that iterate arguments.items() (Qwen, Hermes) render tool history instead of raising "Can only get item pairs from a mapping". - serving_chat.py / api_server.py: thread the request's tools into the parsers for type coercion (default None preserves existing behavior). Verified: Qwen3.6-27B BF16 served by ATOM drives qwen-code end-to-end on gfx1151 -- write_file + run-shell tool calls execute and the agent reports the program output. * fix(openai): don't pass tools to the /v1/completions stream path The previous commit's threading of request.tools matched the stream_completion_response / stream_completion_response_fanout calls in the /v1/completions handler too. CompletionRequest has no `tools` field, so /v1/completions raised "AttributeError: 'CompletionRequest' object has no attribute 'tools'" (HTTP 500). Tool calling only applies to chat; drop tools from the text-completion stream calls. 
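The non-streaming half of the Qwen3 XML parsing described above can be approximated with two regular expressions plus schema-driven coercion. This is a hedged sketch, not the tool_parser.py in this PR: `parse_qwen3_tool_calls` and `_coerce` are hypothetical names, and real model output has partial-XML edge cases this does not handle.

```python
import json
import re

_FUNC_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
_PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def _coerce(raw, schema_type):
    """The XML is typeless, so use the JSON-Schema type from the request's tools."""
    text = raw.strip()
    try:
        if schema_type == "integer":
            return int(text)
        if schema_type == "number":
            return float(text)
        if schema_type == "boolean":
            return text.lower() == "true"
        if schema_type in ("object", "array"):
            return json.loads(text)
    except ValueError:
        pass  # fall back to the raw string if the value does not parse
    return text


def parse_qwen3_tool_calls(output, tools=None):
    """Extract OpenAI-style tool calls from qwen3_coder XML in model output."""
    schemas = {}
    for tool in tools or []:
        fn = tool.get("function", {})
        schemas[fn.get("name")] = fn.get("parameters", {}).get("properties", {})
    calls = []
    for name, body in _FUNC_RE.findall(output):
        props = schemas.get(name, {})
        args = {
            key: _coerce(raw, props.get(key, {}).get("type"))
            for key, raw in _PARAM_RE.findall(body)
        }
        # OpenAI responses carry arguments as a JSON string.
        calls.append(
            {"type": "function", "function": {"name": name, "arguments": json.dumps(args)}}
        )
    return calls


text = (
    "Checking.\n<tool_call>\n<function=get_weather>\n<parameter=city>\nParis\n"
    "</parameter>\n<parameter=days>\n3\n</parameter>\n</function>\n</tool_call>"
)
tools = [{"type": "function", "function": {"name": "get_weather", "parameters": {
    "type": "object",
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}}}}}]
typed = parse_qwen3_tool_calls(text, tools)
untyped = parse_qwen3_tool_calls(text)
```

Without the request's tools, `days` stays the string `"3"`, which is the default-`None` behavior the commit preserves.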
* fix(openai): make tool-call ids unique across the conversation The parser generated ids from a per-response index (call_0, call_1, ...), so the first tool call in every assistant turn was call_0. OpenAI tool-call ids must be unique across the whole conversation; agentic clients (e.g. qwen-code) dedupe by id and silently ignore every repeat -> the tool never executes and the model retries forever (endless tool-call loop on any multi-tool task). Use a random call_<uuid> id at both the non-streaming and streaming emit sites.
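A small simulation shows why per-response ids such as `call_0` break a client that dedupes by id, and how random ids avoid it. Both helpers are hypothetical: `new_tool_call_id` assumes the hex form of a UUID after the `call_` prefix (the commit only says `call_<uuid>`), and `run_unseen` is a toy stand-in for an agent client, not qwen-code's logic.

```python
import uuid


def new_tool_call_id() -> str:
    # Random id, unique across the whole conversation with overwhelming probability.
    return f"call_{uuid.uuid4().hex}"


def run_unseen(calls, seen):
    """Mimic a client that silently ignores ids it has already executed."""
    executed = []
    for call in calls:
        if call["id"] in seen:
            continue
        seen.add(call["id"])
        executed.append(call["id"])
    return executed


# Old scheme: every assistant turn restarts at call_0.
seen_old = set()
first_old = run_unseen([{"id": "call_0"}], seen_old)
second_old = run_unseen([{"id": "call_0"}], seen_old)  # dropped as a repeat

# New scheme: each emitted call gets a fresh id.
seen_new = set()
first_new = run_unseen([{"id": new_tool_call_id()}], seen_new)
second_new = run_unseen([{"id": new_tool_call_id()}], seen_new)
```

In the old scheme the second turn executes nothing, which is the endless retry loop the commit describes.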
…n concurrency (ROCm#1381) The conc=1000 accuracy job intermittently failed: the server exhausted its per-process open-file limit while accepting ~1000 concurrent connections (plus the engine's DP-rank ZMQ and shared-memory fds), hitting EMFILE on accept(). The default soft RLIMIT_NOFILE (~1024) is simply too low for that connection count. Root cause is that ATOM never raised its own fd soft limit. vLLM and SGLang both call set_ulimit() at process startup for exactly this reason, and ATOM's own mesh launch scripts already pass `--ulimit nofile=65536:524288` to docker -- but plain `python -m atom.entrypoints.openai_server` launches (CI, ad-hoc) inherit the daemon default and never bump it. Add a set_ulimit() helper (raise soft -> min(65535, hard)) and call it at the server entry point before the engine-core subprocesses are spawned, so the raised limit is inherited. No-op when the soft limit is already high enough. This is independent of the event-loop choice; it removes the fd ceiling that turned ordinary high-concurrency load into dropped connections.
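The rule stated above (raise soft to min(65535, hard), no-op when the soft limit is already high enough) can be sketched with the standard `resource` module. Only the name `set_ulimit` comes from the commit message. The body, the return value, and the handling of unlimited or rejected limits are assumptions of this sketch, not the code in the PR.

```python
import resource


def set_ulimit(target_soft: int = 65535) -> int:
    """Raise the soft RLIMIT_NOFILE toward target_soft, capped at the hard limit.

    Returns the soft limit in effect afterwards. Unix only.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    wanted = target_soft if hard == resource.RLIM_INFINITY else min(target_soft, hard)
    if soft >= wanted:
        return soft  # no-op: the soft limit is already high enough
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except (ValueError, OSError):
        return soft  # the platform refused the new limit; keep the old one
    return wanted


effective = set_ulimit()
```

Calling this before any child process is spawned matters because a child inherits the limits in effect at spawn time.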
* ci: validate accuracy catalogs against JSON Schema in pre-checks Add a JSON Schema for the flat accuracy catalogs (models_accuracy.json, oot_models_accuracy.json, sglang_models_accuracy.json) plus a validate_catalog.py gate wired into the pre-checks (T0) workflow. additionalProperties:false locks the current shape so typos / stray fields fail CI; a semantic rule requires each entry to declare exactly one pass-bar spelling (accuracy_threshold / accuracy_test_threshold). The existing extraArgs/extra_args and threshold-name drift is tolerated for now and will be normalized separately. Documented in benchmark/README.md. * ci: extract docker login into reusable docker-auth composite action Replace the inline `echo $PASSWORD | docker login` steps in the ATOM-native workflows (atom-test, atom-benchmark, atom-mmstar-ci, docker-release, atomesh-accuracy-validation) with a shared .github/actions/docker-auth composite. Credentials are passed via env instead of being interpolated into the run command, removing the template-injection vector. The composite also supports an explicit registry, image-derived registry, and a custom engine so the vllm/sglang call sites can reuse it in a follow-up. * ci: de-inline aiter wheel download into a shared script Extract the ~163-line aiter wheel resolve+download block (byte-identical in atom-test and atomesh-accuracy-validation) into .github/scripts/download_aiter_wheel.sh; both workflows now call it (net -326 inline lines). Logic matches the previous inline block exactly. GITHUB_TOKEN is passed via env instead of being interpolated into the run command, and the S3 / API / workflow-id constants become overridable env defaults. atom-mmstar-ci uses a simpler S3-only variant (no artifact fallback) and is left for a follow-up. * ci: de-inline aiter wheel install into a shared script Extract the identical "Install aiter from wheel" block from atom-test and atomesh-accuracy-validation into .github/scripts/install_aiter_wheel.sh. 
Behavior matches the previous inline block (no outer set -e, so a missing wheel still hits the explicit error+ls path). CONTAINER_NAME comes from the job env; the wheel dir is an overridable env default (/tmp/aiter-whl). atom-mmstar-ci uses a --no-deps variant from a different dir and is left for a follow-up. * ci: extract CI container startup into setup-gpu-container composite Replace the identical ~60-line "Start CI container" steps in atom-test and atomesh-accuracy-validation with a shared .github/actions/setup-gpu-container composite. The three differences are inputs: network-host (atom-test sets host networking), extra-run-flags (atomesh adds USE_ATOMESH_ENTRYPOINTS/ATOM_SERVER_PORT), and the runner label that drives the --pull policy. The assembled docker run command is byte-identical to the previous inline blocks for both callers (verified with a stubbed docker). atom-mmstar-ci / docker-release / gpu-load-test use more divergent startup blocks and are left for a follow-up. * ci: serialize gh-pages deploys with a shared concurrency group All six workflows that push to the gh-pages branch (docs, deploy-pages, atom-benchmark, atomesh-mocker-benchmark, atom-sglang-benchmark, atom-vllm-benchmark) now run their deploy job under a shared concurrency group (gh-pages-deploy, cancel-in-progress: false). This serializes the fetch/checkout/commit/push dance so concurrent runs can no longer race on the branch and drop each other's updates. Job-level concurrency is independent of the existing workflow-level groups, so redundant-run cancellation is unchanged. * ci: bump artifact actions off deprecated Node 20 (@v4 -> @v7/@v8) actions/upload-artifact@v4 and actions/download-artifact@v4 run on the deprecated Node 20 runtime. Bump the remaining @v4 pins to the versions already used elsewhere in the repo (upload-artifact@v7, download-artifact@v8), which run on Node 24. 
All affected download steps fetch a single named artifact to an explicit path, so behavior is unchanged across the major bump; v4-v8 share the same artifact backend. * test: align per-req-cache and connector-metadata tests with current behavior The per-req-cache tests asserted a removed design where stateful requests deducted 'equiv blocks' from the KV pool and were tracked in a per_req_cache_accounting dict. The current BlockManager sizes the state tensor separately and excludes it from num_kvcache_blocks, so admission only claims a free slot index with no extra paged-block cost. Rewrite the seven stale tests to the slot-only model (can_allocate returns -1/hit-count, not False/bool) and rename two to match what they now verify. ConnectorMetadata._build_req_meta parses transfer params leniently via dict.get, so a missing field yields None instead of raising KeyError. Update the connector-metadata test accordingly. * test: make non-unit disaggregation tests skip visibly off the unit path test_proxy gains importorskip guards for its optional msgpack/quart deps, so it runs where they are installed and skips with a reason otherwise instead of erroring at collection. test_transfer_engine and test_kv_connector_scheduler import the kv_transfer_engine module that ROCm#690 split into the moriio subpackage; guard them with importorskip so they skip visibly (with a reason pointing at the needed path update) until the disaggregation owner refreshes them. Delete test_kimi_k25: it exec-loads the real atom/config.py at import time, which collides with conftest's atom package stub and cannot run under the shared unit harness. * test: remove obsolete mxfp4 swiglu source-introspection test test_swiglu_branch_condition_no_bias_check asserted that Mxfp4MoEMethod.process_weights_after_loading contains a literal 'layer.activation == ActivationType.Swiglu:' branch. 
That function was refactored to route via use_triton vs the AITER shuffle path, so the branch no longer exists in that form and the test had been @unittest.skip'd as obsolete. Drop it; the sibling test_swiglu_branch_does_not_couple_bias_and_shuffle still guards against the original coupled-condition regression. * ci: add non-GPU unit test gate to pre-checks Run the native unit suite on ubuntu-latest as part of Pre Checkin, alongside black/ruff/validate-catalog. .github/scripts/run_unit_tests.sh centralizes the scope: it runs tests/ minus tests/plugin (next-stage sglang/vllm/rtpllm work, which also installs import-time sys.modules stubs that would pollute native tests) and minus the GPU server integration test; P/D disaggregation tests self-skip via importorskip guards. The job installs CPU torch + base deps, emits a JUnit report, and uploads it as an artifact. Locally: 464 passed, 2 skipped, 0 failed. * test: fix unit gate failures on the non-GPU runner The new pre-checks unit job failed on ubuntu (no aiter, no PIL) for two reasons, both now fixed: - test_api_server_helpers leaked stub modules. When the api_server import fails (PIL absent), the except branch reset _injected_modules to [] before the finally cleanup ran, so the injected stub for atom.model_engine.arg_utils was never popped from sys.modules. It then shadowed the real EngineArgs for test_arg_utils_spec (collected later), which failed with _StubEngineArgs / missing SpeculativeConfig. Drop the reset so finally always tears the stubs down, and pre-initialize _injected_modules so finally is safe if stub installation itself raises. Verified by blocking PIL locally: arg_utils tests pass, api_server tests skip cleanly. - test_mxfp4_moe_has_bias loads atom.config / atom.model_ops.moe, which import the AITER GPU kernel library (no CPU build). Guard the module with pytest.importorskip('aiter') so it skips visibly off the non-GPU gate and runs in GPU CI. 
* ci: checkout repo in download_aiter_wheel jobs The download_aiter_wheel jobs in atom-test and atomesh-accuracy-validation have no checkout step — the original inline bash ran from the YAML directly. De-inlining the logic into .github/scripts/download_aiter_wheel.sh introduced a dependency on the file being present on the runner, so the jobs failed with 'No such file or directory' (exit 127). Add actions/checkout@v6 to both jobs. * ci: drop literal ${{ }} from docker-auth description GitHub evaluates ${{ }} expressions in an action's description field, and the secrets context is not available to composite actions. The description quoted the inline secret-interpolation form verbatim with braces, so loading the composite failed at runtime with 'Unrecognized named-value: secrets', short-circuiting Docker Login in atom-test/atomesh. Reword without braces. actionlint does not evaluate description expressions, so this only surfaced on a real runner. * ci: clone aiter with full history so its version isn't 0.0.0 The image build shallow-cloned aiter (git clone --depth 1), so its setuptools_scm version fell back to 0.0.0 (no tags reachable), making the baked-in aiter indistinguishable by version. Use --filter=blob:none instead: full commit history + tags (so setuptools_scm computes a real version) while deferring blob downloads to keep the clone fast. Submodule init is unaffected. Native workflows only (atom-test, atomesh-accuracy-validation); the sglang/vllm benchmark workflows have the same shallow clone but are out of scope for now. * ci(benchmark): print the full benchmark command before running Build the benchmark_serving invocation as a bash array and printf it right after 'Running benchmark test', so the exact resolved command (model, ISL/OSL, concurrency, extra args) is visible in the client log. Running the array guarantees the printed command matches what executes. 
* ci: notify Teams on nightly/release workflow failure Add a workflow_run listener that posts a Teams message when a native scheduled workflow fails (ATOM Test, ATOM Benchmark, Atomesh Accuracy Validation, Pre Checkin, Nightly Docker Release). Single listener instead of per-workflow steps — zero changes to the targets. Filtered to conclusion==failure and event==schedule so only nightly/release runs notify, not PRs. Posts an Adaptive Card (built with jq; run metadata passed via env to avoid template injection) to a Teams 'Post to a channel when a webhook request is received' Workflows webhook — classic O365 connector Incoming Webhooks were retired in 2026. Requires a TEAMS_WEBHOOK_URL repo secret; until it's set the job no-ops without failing. workflow_run fires from the default-branch copy, so it activates after merge. * fix(ci): unindent resolve_download_url python so the S3 fast-path works The python3 -c body in download_aiter_wheel.sh indented its continuation lines to match the bash block, putting leading whitespace inside the single-quoted source -> 'IndentationError: unexpected indent'. resolve_download_url is called under a non-set-e context (download_from_s3_manifest), so the error was swallowed and the S3 manifest fast-path silently fell back to artifact enumeration every run. Move the python body to column 0 (leading newline) so it parses. * ci: serialize native accuracy-dashboard gh-pages pushes The gh-pages serialization added the gh-pages-deploy concurrency group to the docs/benchmark deployers but missed two native jobs that also auto-push to gh-pages: atom-test 'Update accuracy dashboard' and atomesh 'Publish Atomesh accuracy data'. Add the same group so their auto-push can't race the serialized deploys on the gh-pages branch.
* [atom-vllm] enable DP/DP+EP/TP+EP for atom-vllm model Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * support kimi2.5 64 attn heads (ROCm#886) Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * make lint happy Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * fix Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * trim decode tensors for moe Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fix non triton routing expert mask in moe Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fold heads to 8 Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * black Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * add enable dbo Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fix mha kv lens Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * bind single topk index buffer across sparse mla ubatches Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * add dsv32 dbo to nightly ci Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fix topk Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fix kimi k2.5 dbo Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fix Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fix Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * clean the code by removing dbo related code * remove dbo test cases * clean up DP+EP PR: remove dead code, fix stale docs/types Review follow-ups (no behavior change): - Remove unused get_mla_persistent_metadata_dtypes() in attention/layer_mla.py (defined, never referenced). - Remove unused self.n_shared_experts in qwen3_next.py (only set, never read; all real uses go through config.n_shared_experts). - AiterMlaPersistentMetadataForVllm fields -> torch.Tensor | None, matching the all-None disabled_mla_persistent_metadata() construction path. - Update init_aiter_dist_from_vllm docstring to reflect it reuses TP/PP/DP/EP (not just TP), matching the earlier tp->dist rename. - Fix misleading fold_factor docstring example (32 -> 8, the realized case). 
Verified with black --check and ruff check in container guanbao_vllm_atom_0609. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(moe): replace config-time is_vllm() with Config flags (PR ROCm#1101 cleanup) Move the three config-time is_vllm() branches out of native MoE files into plugin-set Config flags, so the frontend decides policy in atom/plugin/config.py instead of native code querying is_vllm() at the call site. Behavior-preserving: defaults are native-correct and the vLLM builder sets the flags to its prior values. - moe_ep_flatten_tp_across_dp (default False): MoE EP computes ranks in the flattened DP x TP device space and disables fused shared experts. vLLM sets it from enable_expert_parallel. Replaces is_vllm() in topK.py (is_rocm_aiter_fusion_shared_expert_enabled_for_quant_config) and moe.py (FusedMoEParallelConfig.make). - mori_max_tokens_per_dp_rank (default 16384): per-DP-rank MORI dispatch buffer size. vLLM sizes it to max_num_batched_tokens. Replaces is_vllm() in moe.py (_maybe_make_prepare_finalize). Removes the native imports of is_vllm from moe.py and topK.py. * fix(v4-ep): stop trimming MORI dispatch buffer below received-token count DeepSeek-V4-Flash with dp4+ep4 hit an illegal memory access in the AITER FlyDSL fused_moe kernel under concurrent/prefill load. Root cause: in FusedMoEModularKernel.forward the cudagraph-style trim shrank dispatch_a1 to the decode bound graph_bs*topk*dp_size while fused_moe was still driven by num_local_tokens=expert_num_tokens (the true received-token count). In a mixed DP+EP batch a decoding rank (small graph_bs) still receives many tokens via the all-to-all from prefilling peers, so the bound under-counts recv (e.g. trim to 24 rows while recv=638) and the MoE kernel reads past the trimmed buffer -> illegal memory access. EP-only, so pure tp4 is unaffected. 
Fix: atom-vllm uses an exact-recv trim (trim_vllm_mori_dispatch_tensors), which trims to the graph_bs*topk*ep bound only under a uniform FULL-cudagraph batch (bound >= recv by construction), skips trimming during graph capture, and otherwise trims to the exact received-token count. The native ATOM path keeps the decode-bound trim. The two frontend-specific MORI seams are kept out of native fused-moe code (PR ROCm#1101 is_vllm() cleanup): native modular_kernel/mori_prepare_finalize expose overridable methods (_maybe_trim_dispatch_output, _get_dispatch_config) with the native default, and the atom-vllm behavior is injected by a plugin monkeypatch (atom/plugin/vllm/mori_patch.py, applied from register.py). No is_vllm() and no Config flag in native files. The MORI all-to-all buffer is sized from moe.max_num_tokens instead of a dedicated Config field. Originally validated dp4+ep4 eager: 0 crashes across a full gsm8k run (vs immediate crash before); gsm8k 3-shot exact_match = 0.96. --------- Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Claude Opus 4 <noreply@anthropic.com> Co-authored-by: kliuae <17350011+kliuae@users.noreply.github.com> Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: Guanbao Yu <gyu@amd.com> Co-authored-by: ganyi <ygan@amd.com>
…on (ROCm#1368) MiniMax-M3 sparse attention reuses the unified KV cache and kv_scale for K/V, so the fp8 per-token scales already travel with the KV blocks. It keeps one extra per-token buffer, runner.sparse_attention_index_cache, holding the indexer keys used for top-k block selection at decode time. get_kv_transfer_tensors() never registered that buffer, so under PD disaggregation the decode node ran top-k against a zero/stale index for the prefilled tokens and attended to the wrong KV blocks. This is masked for short prompts (the init+local+topk window already covers every block, so selection is moot) but corrupts output once the context exceeds that window. Register the indexer-key cache as block-indexed transfer regions (one per sparse layer, same physical-block striding as the KV cache), guarded by getattr so non-sparse models and bf16 paths are unaffected. Tested (latest image, 1P+1D TP4, fp8 KV via Triton attention): GSM8K 5-shot = 0.9401, i.e. no regression to M3 fp8 PD. Short-prompt GSM8K does not exercise the long-context top-k path the buffer affects; that path is covered by review, not this run.
…nction comments. (ROCm#1393)
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
…currency (ROCm#1394) At high --max-concurrency each in-flight request holds a socket fd. The default soft RLIMIT_NOFILE (~1024) is exhausted client-side (EMFILE on socket()), so most requests fail before reaching the server and the run reports only ~one concurrency-wave of successes (e.g. ~919/10240 at conc=1024) while the server logs 200 OK for every request it actually receives. The server already calls set_ulimit() at startup; call it in the benchmark client too (soft is raised toward 65535, capped at the hard limit). Co-authored-by: ZhangLirong-amd <ZhangLirong-amd@users.noreply.github.com>
…OCm#1318) * [Feature] OFFLOAD: add LMCache CPU/NVMe KV-offload subsystem Add a standalone KV-offload subsystem that offloads ATOM KV cache to LMCache-backed CPU/NVMe storage and reloads it on cache hits, avoiding prefill recompute for evicted prefixes. - New atom/kv_transfer/offload package: LMCacheOffloadConnector, the ATOM<->LMCache GPU connector, a byte codec for ATOM's packed KV layout, a Triton staging kernel, plus metadata and config. - Wire the connector into the disaggregation base/factory/types and aggregate per-worker finished/failed transfer states. - Unit tests for the connector and byte-codec round-trip. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [Feature] OFFLOAD: integrate KV offload load/save into engine and scheduler Drive offload load/save from the engine loop and scheduler: - Dispatch async KV load after connector metadata, poll worker transfer status, and advance idle KV transfer when no forward batch runs. - Defer block free until a background D2H save has read the KV, and wake parked prefills for local recompute on a load miss (failed_recving). - Handle chunked-prefill deferred output across the offload park/resume boundary so stale sampled tokens are dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [Frontend] Support max completion tokens in OpenAI API Honor max_completion_tokens (and the max_tokens alias) in the OpenAI completion/chat protocol and server so offload benchmarks can bound generation length. Adds protocol and server-helper unit tests. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [Docs] Add design README for LMCache CPU/NVMe KV offload connector Document the ATOM standalone lmcache_offload connector: design, module map, scheduler/worker architecture, byte codec and AITER layout bridge, MemoryObj/segment layout, completion protocol, reload decision and chunk-alignment handoff, correctness/fp8/failure handling, the LMCache reuse-vs-override boundary, configuration, benchmarks, and tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: yihonglie <hyi@amd.com> * [Feature] OFFLOAD: support MLA models (DeepSeek R1/V3, Kimi) The byte codec assumed a block-major KV layout (tensor.shape[0] == block count), which holds for MHA/GQA but not for MLA: MLA stores a single per-layer latent cache (kv_lora_rank + qk_rope_head_dim, e.g. 576) viewed token-major as (num_blocks * block_size, 1, latent), with no separate V or scale tensors. So shape[0] is the token count and the codec computed a per-token (not per-block) byte stride, corrupting the offloaded KV. Both layouts share an identical contiguous byte layout (block b always starts at b * bytes_per_block), so instead of branching we take the physical block count explicitly and derive each segment's per-block stride as segment_bytes / num_blocks. The Triton fused staging kernel is byte-addressed and needs no change. - ATOMKVByteCodec: accept explicit num_blocks; per-block bytes from it; require contiguous + numel divisible by num_blocks (replaces the "same shape[0]" check). Falls back to shape[0] when num_blocks is None, preserving non-MLA behaviour. - Thread num_physical_kvcache_blocks: model_runner.allocate_kv_cache -> forward_context.set_kv_cache_data -> connector.register_kv_caches -> codec. Optional num_blocks kwarg added to the connector base + mooncake + moriio impls (ignored there). 
- build_lmcache_metadata: emit an MLA-shaped kv_shape (latent dim) for bookkeeping; storage stays opaque BINARY so use_mla remains False. - Tests: MLA token-major block accounting + byte-identical round-trip. Validated on DeepSeek-V3-5layer (real MLA, TP=2) end-to-end: offload save + reload (cxs multi-round, round 2 hits cached:[~33k], no recompute). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [Docs] Update LMCache offload README for MLA support Document MLA (DeepSeek R1/V3, Kimi) KV-offload support in the connector README: token-major latent cache layout, explicit num_blocks threading, and BINARY opaque storage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: yihonglie <hyi@amd.com> * [Enhancement] OFFLOAD: make unaligned handoff always-on; drop OFFLOAD_UNALIGNED_HANDOFF Unaligned HBM-prefix loads now always take the handoff path (recompute the misaligned head up to the next chunk boundary, then load the aligned remainder from CPU) instead of being gated behind the OFFLOAD_UNALIGNED_HANDOFF env var. The env read and the gate check in _maybe_start_unaligned_handoff are removed; the min_load / boundary guards are unchanged. - connector.py: drop _allow_unaligned_handoff + OFFLOAD_UNALIGNED_HANDOFF read; handoff is now unconditional. - README.md / README.zh-CN.md: remove the env var from the tuning table and the example commands, add a "removed" note, and document the always-on behaviour. Also track the previously-untracked zh-CN README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: yihonglie <hyi@amd.com> * [Bugfix] OFFLOAD: drop duplicate enable_chunked_prefill kwarg in test MockConfig The rebase onto main left two enable_chunked_prefill keys in MockConfig's defaults dict (our branch moved it up and set =True; main kept the old =False line while adding hf_config after it). 
Python raises SyntaxError on a repeated keyword in a dict() call, so conftest failed to import and the whole unit-test suite aborted (exit 4). Keep the single enable_chunked_prefill=True (our intended default; all tests that depend on the value override it explicitly) plus main's hf_config stub. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: yihonglie <hyi@amd.com> * [Bugfix] OFFLOAD: drop stale unaligned-skip test left over from always-on handoff Commit "make unaligned handoff always-on; drop OFFLOAD_UNALIGNED_HANDOFF" removed _allow_unaligned_handoff from the connector but left three dead references in the test file, including test_load_is_skipped_if_hbm_floor_is_ not_chunk_aligned, which encodes the old default-skip behaviour. With handoff now unconditional, an unaligned HBM floor (hbm=6, chunk=4, min_load=0) takes the handoff path (park to boundary 8, emit the load once cached reaches 8) instead of skipping the CPU load, so that test's `lookup.cleared == ["654"]` no longer holds and it failed in CI. The scenario is already covered by the always-on tests: - test_unaligned_hbm_handoff_prefills_boundary_then_emits_load (handoff path) - test_unaligned_handoff_skips_if_boundary_remainder_is_too_small (the real skip+clear case, gated by min_load rather than alignment) Delete the stale test and the three no-op _allow_unaligned_handoff assignments that the connector no longer reads. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: yihonglie <hyi@amd.com> * [OFFLOAD] Remove Chinese README.zh-CN.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Signed-off-by: yihonglie <hyi@amd.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nchmark (ROCm#1371)
* [atom benchmark] add indexer cache for M3 in atom benchmark
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
* remove wa qr env flag as aiter fixed the issue
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
* add online quant into recipe
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
---------
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
* Modify 'clean up containers' logic for sglang accuracy
* Modify model cache path to adapt to hjbog-20
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
* open multistream for vllm
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
* rm patch
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
* add eager guard
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
---------
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
…hmark schedule (ROCm#1405)
* ci: reuse setup-gpu-container in bench & mmstar container start
De-inline the duplicated "Start CI container" docker run from atom-bench-container and atom-mmstar-ci into setup-gpu-container, so the container boilerplate lives in one place.
setup-gpu-container:
- add pull-policy input (explicit always/missing/never; falls back to the runner-based heuristic when empty)
- add disable-mmap input (default true; set false to skip ATOM_DISABLE_MMAP)
- make runner optional (default "")
- key the env-file by container name so concurrent containers don't clobber it
- drop the duplicated -v/-w in docker run
atom-bench-container: Start step now uses setup-gpu-container (network-host=true, pull-policy=always, container-env -> extra-run-flags); keeps its model-download step.
atom-mmstar-ci: Start step now uses setup-gpu-container; passes disable-mmap=false to keep byte-for-byte parity (mmstar never set ATOM_DISABLE_MMAP). MODEL_CACHE_MOUNT == setup-gpu's auto /models mount.
Behavior preserved; only runtime change is mmstar's image pull now hard-fails on error (--pull always vs prior best-effort docker pull).
* ci(benchmark): advance nightly schedule to 00:12 Beijing (16:12 UTC)
Move the ATOM Benchmark nightly cron 48 min earlier (01:00 -> 00:12 Beijing, 17:00 -> 16:12 UTC).
* ci(accuracy): move base DeepSeek-R1-0528 to nightly; condense long _baseline_note
- DeepSeek-R1-0528 (base) test_level pr -> nightly (no longer per-PR)
- Trim the 4 over-long _baseline_note entries (online-quant, MiMo-V2-Flash, MiMo-V2-Flash MTP, V4-Pro TBO+DPA) to <=270 chars; keep all hard facts (baselines, run ids, thresholds, MTP tp/num-spec constraints).
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com> Co-authored-by: JiaoliangYu <Jiaoliang.Yu@amd.com>
* [Fix] Fix DeepSeek-V4 DP + EP on gfx942 (MI308X)
* Using the existing API
03e503b to 042faeb
042faeb to e7231dc
added 2 commits
June 30, 2026 16:27
`bincount`, boolean-mask indexing, and the `.any()` host sync each raise "operation not permitted when stream is capturing" inside the decode cudagraph. Replace them with a fixed-shape `scatter_add_` so the module-A record hook can run within graph capture (return value unchanged).
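A minimal sketch of the fixed-shape counting technique named above, assuming PyTorch is available. `count_tokens_per_expert` is a hypothetical name, and the convention that unused slots carry expert id `-1` is an assumption; the PR's actual hook is not shown in the commit message. The point is that every tensor has a shape known ahead of time and no value is read back on the host.

```python
import torch


def count_tokens_per_expert(expert_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Count tokens per expert without data-dependent shapes or host syncs.

    expert_ids: 1-D int64 tensor; entries < 0 mark unused slots (assumed).
    """
    counts = torch.zeros(num_experts, dtype=torch.int64, device=expert_ids.device)
    valid = expert_ids >= 0
    # Clamp instead of boolean-mask indexing so the index tensor keeps its shape.
    safe_ids = expert_ids.clamp(min=0)
    # Invalid slots add 0, so they do not change the count of expert 0.
    counts.scatter_add_(0, safe_ids, valid.to(torch.int64))
    return counts


ids = torch.tensor([0, 2, 2, -1, 1, 2])
counts = count_tokens_per_expert(ids, num_experts=4)
```

For valid ids this matches `torch.bincount(ids[ids >= 0], minlength=4)`, but that form relies on boolean-mask indexing, which the commit lists among the operations that fail during capture.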
Collect per-layer, per-expert token counts from the MORI EP dispatch output (dispatch_recv_token_num) into a windowed ExpertLoadMonitor. Logs avg/max/balancedness and can emit a one-shot offline rebalance plan (hot/cold experts) via the offline_eplb_rebalance utility command.
Scope is statistics only: per-rank, no cross-rank all-reduce, and no actual expert weight remap/transfer yet. All gated behind ATOM_ENABLE_EPLB_LOAD_STATS (default off).
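To make the windowed statistics concrete, here is a small pure-Python sketch for a single layer. `ExpertLoadWindow` is a hypothetical stand-in and is not the PR's `ExpertLoadMonitor`. Defining balancedness as average load divided by maximum load is an assumption, since the description does not give the formula, and the hot/cold selection is only illustrative.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


class ExpertLoadWindow:
    """Hypothetical per-layer window over per-expert token counts."""

    def __init__(self, num_experts: int, window_size: int = 100) -> None:
        self.num_experts = num_experts
        self.window: deque = deque(maxlen=window_size)  # oldest step drops out

    def record(self, counts: Sequence[int]) -> None:
        if len(counts) != self.num_experts:
            raise ValueError("expected one count per expert")
        self.window.append(list(counts))

    def totals(self) -> List[int]:
        return [sum(step[e] for step in self.window) for e in range(self.num_experts)]

    def summary(self) -> Dict[str, float]:
        totals = self.totals()
        avg = sum(totals) / self.num_experts
        peak = max(totals)
        # Assumed definition: 1.0 is perfectly balanced, lower is more skewed.
        balancedness = avg / peak if peak > 0 else 1.0
        return {"avg": avg, "max": float(peak), "balancedness": balancedness}

    def hot_cold(self, k: int = 1) -> Tuple[List[int], List[int]]:
        totals = self.totals()
        order = sorted(range(self.num_experts), key=lambda e: totals[e], reverse=True)
        return order[:k], order[-k:]


monitor = ExpertLoadWindow(num_experts=4, window_size=2)
monitor.record([4, 0, 0, 0])
skewed = monitor.summary()
hot, cold = monitor.hot_cold(k=1)
```

Because the window is bounded, a burst of skewed routing stops influencing the summary once enough newer steps have been recorded.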