[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP by carlushuang · Pull Request #1337 · ROCm/ATOM

carlushuang · 2026-06-24T06:45:59Z

[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5 (Strix Halo), with working MTP

Builds on the gfx1151 BF16 enablement (#1314) to add online INT8 W8A8 (Quark-style, no offline quant step) for the Qwen3.6 dense (27B) and MoE (35B-A3B) architectures, plus the fixes needed to make MTP speculative decoding work on the MoE draft. RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the right quantization target for this iGPU.

Precision split (chosen for quality): A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and the MoE experts; BF16 for the Gated-DeltaNet linear-attention (recurrent, quant-sensitive — int8 there produces garbage), the MoE router gate, lm_head, and embeddings; KV cache BF16. This keeps gsm8k at BF16-equivalent quality while halving weight bytes (the decode bottleneck on a bandwidth-bound iGPU).
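
To make the A8W8 scheme concrete, here is a minimal NumPy sketch of symmetric int8 quantization with one scale per token for activations and one per output channel for weights. It is a reference model of the arithmetic only, not the aiter kernel; the function names and the 1e-8 clamp are illustrative assumptions.

```python
import numpy as np

def quantize_rows_int8(m):
    """Symmetric int8 quantization with one fp32 scale per row."""
    scale = np.abs(m).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)  # guard against all-zero rows (assumed clamp)
    q = np.clip(np.rint(m / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def gemm_a8w8_reference(xq, x_scale, wq, w_scale):
    """int8 x int8 -> int32 accumulate, then dequantize with both scales."""
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T  # [tokens, out_features]
    return acc.astype(np.float32) * x_scale * w_scale.T

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256)).astype(np.float32)    # [tokens, in_features]
w = rng.standard_normal((128, 256)).astype(np.float32)  # [out_features, in_features]

xq, x_scale = quantize_rows_int8(x)  # dynamic, per token (row of x)
wq, w_scale = quantize_rows_int8(w)  # per output channel (row of w)

y_ref = x @ w.T
y_q = gemm_a8w8_reference(xq, x_scale, wq, w_scale)
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative error vs fp32: {rel_err:.4f}")
```

The weight tensor shrinks from 2 bytes per element (BF16) to 1 byte plus a small per-channel scale vector, which is where the halving of weight bytes comes from.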

What this enables

35B-A3B runs at all: there is no BF16 MoE kernel on gfx1151 (the asm fused-MoE is gfx9-only; ATOM's Triton MoE is MXFP4-weight-only). The int8 path uses aiter's moe_gemm_int8_smoothquant, which is the only int8-W8A8 grouped-GEMM that runs on RDNA3.5.
The 35B-A3B decodes ~6× faster than the BF16 27B baseline (only 3B active params per token, int8 weights, and MTP).

Changes

model_ops/linear.py — per_Token int8 branch routes to aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op (torch.ops.aiter.atom_gemm_a8w8_triton) so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
model_ops/moe.py — new Int8MoEMethod (int8 w13/w2 + per-channel fp32 scales) and an int8 branch in FusedMoE._online_quant.
model_ops/fused_moe_triton.py — triton_kernel_int8_moe_forward: matmul-ogs routing → per-token int8 quant → moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13 columns) → per-token int8 quant → gemm2 with scatter/combine.
models/qwen3_5_mtp.py + model_loader/loader.py — MTP-MoE drafter fix: the draft's fused expert weights (experts.gate_up_proj/down_proj) were silently dropped at load → 0% draft acceptance → MTP was pure overhead. Add the draft's fused-expert mapping (detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights), fix get_expert_mapping to use num_experts, and let the loader resolve load_fused_expert_weights_fn from the model. After the fix: acceptance 0 → 0.83.
entrypoints/openai/tool_parser.py — unique tool-call ids (call_<uuid> instead of a per-response call_0). Non-unique ids made agentic clients (qwen-code) dedupe every tool call after the first → endless tool-call loop. Extends feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support #1319.
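
The tool-call id change in the last item can be sketched as follows. This is an illustration under assumptions, not the code in tool_parser.py: the helper names are hypothetical, and whether the real ids use the hex or the hyphenated UUID form is not stated above.

```python
import uuid

def make_tool_call_id() -> str:
    """Return an id of the form call_<uuid>, unique across responses."""
    return f"call_{uuid.uuid4().hex}"

def build_tool_calls(parsed_calls):
    """Attach a fresh id to each parsed (name, arguments_json) pair."""
    return [
        {
            "id": make_tool_call_id(),
            "type": "function",
            "function": {"name": name, "arguments": arguments},
        }
        for name, arguments in parsed_calls
    ]

# Two separate responses that each contain a single tool call.
first = build_tool_calls([("read_file", '{"path": "a.txt"}')])
second = build_tool_calls([("read_file", '{"path": "b.txt"}')])
print(first[0]["id"], second[0]["id"])
```

With a per-response counter both calls would have been call_0, which is what led clients that deduplicate by id to discard the second one.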

Quality (gsm8k, 5-shot-equivalent, chat + thinking, greedy)

35B-A3B INT8 W8A8 = 0.84 — BF16-equivalent (int8 is faithful). MTP is lossless (accepts a draft token only when it matches the target's greedy argmax), so the MTP build has identical quality.
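
The acceptance rule behind the "lossless" claim can be sketched in a few lines. This is a simplified model of greedy speculative verification, not ATOM's implementation; the function name and token values are hypothetical.

```python
def verify_greedy(draft_tokens, target_argmax):
    """Return the tokens emitted in one speculative step.

    draft_tokens:  the k tokens proposed by the drafter.
    target_argmax: k + 1 tokens, the target model's greedy choice at each
                   draft position plus one bonus position after the last.
    """
    emitted = []
    for i, draft in enumerate(draft_tokens):
        if draft != target_argmax[i]:
            # Mismatch: discard the draft and emit the target's own token.
            emitted.append(target_argmax[i])
            return emitted
        emitted.append(draft)
    # Every draft token matched, so the bonus token comes for free.
    emitted.append(target_argmax[len(draft_tokens)])
    return emitted

# MTP-1 proposes a single draft token per step.
print(verify_greedy([42], [42, 7]))  # draft accepted: [42, 7]
print(verify_greedy([42], [13, 7]))  # draft rejected: [13]
```

Every emitted token is one the target model would have chosen greedily on its own, so the drafter affects only how many tokens each step yields, never which ones.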

Performance (gfx1151 / Radeon 8060S, bs=1)

Decode (single-stream, short context):

Model	Config	Decode tok/s
27B dense	INT8 W8A8	6.0
27B dense	INT8 W8A8 + MTP-1	9.4
35B-A3B	INT8 W8A8 + HIP graph	24.8
35B-A3B	INT8 W8A8 + MTP-1 + HIP graph	~35
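
As a rough reading of the 35B-A3B rows, the arithmetic below relates the MTP-1 speedup to draft acceptance. It assumes the 0.83 acceptance quoted under Changes also holds for this benchmark and that each MTP-1 step emits 1 + acceptance tokens on average; both are assumptions, not measurements from this table.

```python
acceptance = 0.83  # draft acceptance reported for the fixed MoE drafter
base_tps = 24.8    # 35B-A3B, INT8 W8A8 + HIP graph
mtp_tps = 35.0     # 35B-A3B, INT8 W8A8 + MTP-1 + HIP graph (reported as ~35)

tokens_per_step = 1 + acceptance             # average tokens emitted per step
observed_speedup = mtp_tps / base_tps        # measured throughput ratio
step_cost_ratio = tokens_per_step / observed_speedup

print(f"tokens per step:  {tokens_per_step:.2f}")
print(f"observed speedup: {observed_speedup:.2f}x")
print(f"implied MTP step cost vs plain decode step: {step_cost_ratio:.2f}x")
```

Under these assumptions an MTP-1 step costs about 1.3× a plain decode step (drafting plus verifying two positions), which is why a 1.83-token step yields roughly a 1.4× throughput gain rather than 1.83×.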

Long-context (35B-A3B INT8 W8A8 + MTP-1, bs=1):

Context	Prefill TTFT	Prefill tok/s	Decode (output) tok/s	Total tok/s
64K (60,016 tok)	85.4 s	703	23.4	661
128K (119,071 tok)	191.8 s	621	17.3	598

Decode tok/s falls with context (each step reads the growing KV); prefill is compute-bound (one-time prompt-ingestion cost). The hybrid model's KV is cheap (only the interleaved full-attn layers cache KV), so 128K fits easily — at gpu-memory-utilization 0.9 the KV pool holds ~2.1M tokens; the limit is --max-model-len, not memory.
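
The prefill column in the long-context table follows directly from prompt length divided by TTFT. The short check below reproduces it from the reported numbers; the dictionary layout is only for illustration.

```python
rows = {
    "64K": {"prompt_tokens": 60_016, "ttft_s": 85.4, "reported_tok_s": 703},
    "128K": {"prompt_tokens": 119_071, "ttft_s": 191.8, "reported_tok_s": 621},
}

def prefill_tok_per_s(prompt_tokens, ttft_s):
    """Prefill throughput: prompt tokens ingested per second of TTFT."""
    return prompt_tokens / ttft_s

for name, row in rows.items():
    computed = prefill_tok_per_s(row["prompt_tokens"], row["ttft_s"])
    print(f"{name}: computed {computed:.1f} tok/s, reported {row['reported_tok_s']}")
```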

Serve

ATOM_USE_UNIFIED_ATTN=1 \
python -m atom.entrypoints.openai_server --model Qwen/Qwen3.6-35B-A3B \
  --trust-remote-code -tp 1 --kv_cache_dtype bf16 --block-size 64 \
  --max-model-len 131072 --max-num-seqs 2 --gpu-memory-utilization 0.9 \
  --method mtp --num-speculative-tokens 1 \
  --online_quant_config '{"global_quant_config":"ptpc_i8","exclude_layer":["*linear_attn*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'

(Drop --method mtp ... for 35B if you don't want MTP; for the dense 27B MTP is a ~1.6× lossless win.)
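
To see what the exclude_layer patterns in the command above select, the sketch below applies them with Python's fnmatch-style globbing. That ATOM matches layer names this way is an assumption, and the layer names are hypothetical examples rather than names read from the checkpoint.

```python
import json
from fnmatch import fnmatchcase

config = json.loads(
    '{"global_quant_config":"ptpc_i8","exclude_layer":["*linear_attn*",'
    '"*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'
)

def is_excluded(layer_name, patterns):
    """True if any glob pattern matches the full layer name."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)

# Hypothetical layer names, for illustration only.
layers = [
    "model.layers.0.linear_attn.in_proj",
    "model.layers.3.self_attn.q_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.gate_up_proj",
    "lm_head",
    "model.embed_tokens",
]

for name in layers:
    excluded = is_excluded(name, config["exclude_layer"])
    precision = "bf16 (excluded)" if excluded else "int8 (ptpc_i8)"
    print(f"{name:45s} {precision}")
```

Note that "*mlp.gate" has no trailing wildcard, so under glob semantics it matches the router gate but not expert projections whose names merely contain "gate".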

Dependency

[OPUS]: arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151) aiter#3860 (carhuang/gfx1151_opus_fp8_guard) — arch-guard the gfx9-only fp8/bf8-cvt builtins. Required (shared with [gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention #1314). The int8 GEMM/MoE kernels (gemm_a8w8 Triton, moe_gemm_int8_smoothquant, per_token_quant_hip) are already upstream in aiter; no new aiter code is needed for the int8 path.
[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention #1314 (carhuang/support_gfx1151_qwen36) — the gfx1151 BF16 base enablement (arch gate, native Triton attention, GDN block_tables). Prerequisite.
feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support #1319 (carhuang/qwen3_xml_tool_parser) — qwen3_xml tool-call parsing; the unique-tool-call-id fix here extends it.

… on MoE

RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the quantization target for this iGPU. A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and MoE experts; BF16 for the GDN linear-attn (recurrent, quant-sensitive), router gate, lm_head, embeddings; KV cache BF16.

- model_ops/linear.py: per_Token int8 branch -> aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
- model_ops/moe.py + model_ops/fused_moe_triton.py: Int8MoEMethod + int8 branch in FusedMoE._online_quant, and triton_kernel_int8_moe_forward using aiter moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13). Enables 35B-A3B, which has no BF16 MoE kernel on gfx1151.
- models/qwen3_5_mtp.py + model_loader/loader.py: fix the MTP-MoE drafter so the draft's fused expert weights load (add detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights; get_expert_mapping uses num_experts; loader resolves load_fused_expert_weights_fn from the model). Draft acceptance 0 -> 0.83; MTP now a net win on the MoE model.
- model_ops/topK.py: keep the shared expert as a separate MLP on non-gfx9 so the routed MoE uses the portable Triton path.

carlushuang mentioned this pull request Jun 24, 2026

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention #1314

Open

carlushuang wants to merge 1 commit into carhuang/support_gfx1151_qwen36 from carhuang/gfx1151_int8_qwen36
