[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP #1337

Open
carlushuang wants to merge 1 commit into carhuang/support_gfx1151_qwen36 from carhuang/gfx1151_int8_qwen36

@carlushuang

Collaborator

[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5 (Strix Halo), with working MTP

Builds on the gfx1151 BF16 enablement (#1314) to add online INT8 W8A8 (Quark-style, no offline quant step) for the Qwen3.6 dense (27B) and MoE (35B-A3B) architectures, plus the fixes needed to make MTP speculative decoding work on the MoE draft. RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the right quantization target for this iGPU.

Precision split (chosen for quality): A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and the MoE experts; BF16 for the Gated-DeltaNet linear-attention (recurrent, quant-sensitive — int8 there produces garbage), the MoE router gate, lm_head, and embeddings; KV cache BF16. This keeps gsm8k at BF16-equivalent quality while halving weight bytes (the decode bottleneck on a bandwidth-bound iGPU).
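As a rough illustration of the A8W8 scheme described above, here is a plain-Python reference sketch. It is not the aiter Triton kernel. It assumes symmetric quantization clamped to ±127, one scale per activation row (token) and one scale per weight output channel; the helper names `quantize_rows_int8` and `gemm_a8w8_reference` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_rows_int8(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one fp scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in x:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0  # avoid dividing by zero
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def gemm_a8w8_reference(x: Matrix, w: Matrix) -> Matrix:
    """Compute y = x @ w.T, with w laid out as [out_features][in_features].

    Activations are quantized per token (row of x) at call time; weights are
    quantized per output channel (row of w). The integer dot product is
    rescaled by the product of the two scales.
    """
    xq, x_scales = quantize_rows_int8(x)
    wq, w_scales = quantize_rows_int8(w)
    out = []
    for i, x_row in enumerate(xq):
        out_row = []
        for j, w_row in enumerate(wq):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulate
            out_row.append(acc * x_scales[i] * w_scales[j])
        out.append(out_row)
    return out


x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w = [[1.0, 0.5, -0.5], [0.2, -0.4, 0.8]]
y_int8 = gemm_a8w8_reference(x, w)
y_ref = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
```

Comparing `y_int8` against `y_ref` on a few inputs gives a feel for the rounding error this scheme introduces on a single GEMM.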

What this enables

  • 35B-A3B runs at all: there is no BF16 MoE kernel on gfx1151 (the asm fused-MoE is gfx9-only; ATOM's Triton MoE is MXFP4-weight-only). The int8 path uses aiter's moe_gemm_int8_smoothquant, which is the only int8-W8A8 grouped-GEMM that runs on RDNA3.5.
  • The 35B-A3B decodes ~6× faster than the BF16 27B baseline (only 3B active params, int8 weights, plus MTP).

Changes

  • model_ops/linear.py: per_Token int8 branch routes to aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op (torch.ops.aiter.atom_gemm_a8w8_triton) so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
  • model_ops/moe.py: new Int8MoEMethod (int8 w13/w2 + per-channel fp32 scales) and an int8 branch in FusedMoE._online_quant.
  • model_ops/fused_moe_triton.py: triton_kernel_int8_moe_forward: matmul-ogs routing → per-token int8 quant → moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13 columns) → per-token int8 quant → gemm2 with scatter/combine.
  • models/qwen3_5_mtp.py + model_loader/loader.py: MTP-MoE drafter fix. The draft's fused expert weights (experts.gate_up_proj/down_proj) were silently dropped at load → 0% draft acceptance → MTP was pure overhead. Add the draft's fused-expert mapping (detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights), fix get_expert_mapping to use num_experts, and let the loader resolve load_fused_expert_weights_fn from the model. After the fix: acceptance 0 → 0.83.
  • entrypoints/openai/tool_parser.py: unique tool-call ids (call_<uuid> instead of a per-response call_0). Non-unique ids made agentic clients (qwen-code) dedupe every tool call after the first → endless tool-call loop. Extends #1319 (feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support).
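To make the tool-call id problem concrete, here is a minimal sketch of why a per-response `call_0` breaks a deduplicating client while `call_<uuid>` does not. The helper names and the client-side dedupe logic are hypothetical, and the exact uuid rendering (hex, no hyphens) is an assumption, not necessarily what tool_parser.py emits.

```python
import uuid


def make_tool_call_id() -> str:
    """Return an id of the form call_<uuid> (hex rendering is an assumption)."""
    return f"call_{uuid.uuid4().hex}"


def dedupe_by_id(calls):
    """Mimic a client that keeps only the first tool call seen for each id."""
    seen, kept = set(), []
    for call in calls:
        if call["id"] not in seen:
            seen.add(call["id"])
            kept.append(call)
    return kept


# Three separate responses, each carrying one tool call.
old_style = [{"id": "call_0", "name": "read_file"} for _ in range(3)]
new_style = [{"id": make_tool_call_id(), "name": "read_file"} for _ in range(3)]

kept_old = dedupe_by_id(old_style)  # later calls collide with the first
kept_new = dedupe_by_id(new_style)  # every call survives
```

With the old ids only the first call survives the dedupe, so a client of this kind never sees the later calls and can end up re-issuing them.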

Quality (gsm8k, 5-shot-equivalent, chat + thinking, greedy)

  • 35B-A3B INT8 W8A8 = 0.84 — BF16-equivalent (int8 is faithful). MTP is lossless (accepts a draft token only when it matches the target's greedy argmax), so the MTP build has identical quality.
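The losslessness claim follows from the acceptance rule itself. A minimal sketch of greedy draft verification, with hypothetical helper names and toy logits rather than the ATOM implementation:

```python
from typing import List, Sequence


def argmax(logits: Sequence[float]) -> int:
    """Index of the largest logit (first one wins on ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def verify_greedy(draft_tokens: List[int],
                  target_logits: List[Sequence[float]]) -> List[int]:
    """Return the tokens emitted for one target step.

    target_logits has len(draft_tokens) + 1 rows: row i is the target's
    output at the position where draft token i was proposed, and the last
    row supplies the extra token when every draft token is accepted.
    """
    emitted = []
    for i, draft_tok in enumerate(draft_tokens):
        target_tok = argmax(target_logits[i])
        emitted.append(target_tok)  # always the target's own greedy choice
        if target_tok != draft_tok:
            return emitted  # mismatch: drop the remaining draft tokens
    emitted.append(argmax(target_logits[len(draft_tokens)]))
    return emitted


def expected_tokens_per_step_mtp1(acceptance: float) -> float:
    """With one draft token, a step yields 1 token plus 1 more if accepted."""
    return 1.0 + acceptance


accepted = verify_greedy([2], [[0.0, 0.1, 0.9], [0.7, 0.2, 0.1]])  # two tokens
rejected = verify_greedy([1], [[0.0, 0.1, 0.9], [0.7, 0.2, 0.1]])  # one token
```

Every emitted token is the target's greedy argmax, so the output sequence matches plain greedy decoding regardless of draft quality. Treating the reported 0.83 as a per-draft-token acceptance probability (an assumption), `expected_tokens_per_step_mtp1(0.83)` gives about 1.83 tokens per target step, an upper bound on speedup before the drafter's own cost is counted.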

Performance (gfx1151 / Radeon 8060S, bs=1)

Decode (single-stream, short context):

| Model | Config | Decode tok/s |
| --- | --- | --- |
| 27B dense | INT8 W8A8 | 6.0 |
| 27B dense | INT8 W8A8 + MTP-1 | 9.4 |
| 35B-A3B | INT8 W8A8 + HIP graph | 24.8 |
| 35B-A3B | INT8 W8A8 + MTP-1 + HIP graph | ~35 |

Long-context (35B-A3B INT8 W8A8 + MTP-1, bs=1):

| Context | Prefill TTFT | Prefill tok/s | Decode (output) tok/s | Total tok/s |
| --- | --- | --- | --- | --- |
| 64K (60,016 tok) | 85.4 s | 703 | 23.4 | 661 |
| 128K (119,071 tok) | 191.8 s | 621 | 17.3 | 598 |

Decode tok/s falls with context (each step reads the growing KV); prefill is compute-bound (one-time prompt-ingestion cost). The hybrid model's KV is cheap (only the interleaved full-attn layers cache KV), so 128K fits easily — at gpu-memory-utilization 0.9 the KV pool holds ~2.1M tokens; the limit is --max-model-len, not memory.
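As a quick consistency check on the long-context numbers, the prefill column matches prompt tokens divided by TTFT. That definition is inferred from the figures rather than stated in the PR, and the function name is hypothetical:

```python
def prefill_tok_per_s(prompt_tokens: int, ttft_s: float) -> float:
    """Prefill throughput, assuming it is defined as prompt tokens / TTFT."""
    return prompt_tokens / ttft_s


# (prompt tokens, TTFT in seconds) taken from the long-context results above.
runs = {"64K": (60_016, 85.4), "128K": (119_071, 191.8)}

for name, (tokens, ttft) in runs.items():
    print(name, round(prefill_tok_per_s(tokens, ttft)))
# 64K 703
# 128K 621
```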

Serve

ATOM_USE_UNIFIED_ATTN=1 \
python -m atom.entrypoints.openai_server --model Qwen/Qwen3.6-35B-A3B \
  --trust-remote-code -tp 1 --kv_cache_dtype bf16 --block-size 64 \
  --max-model-len 131072 --max-num-seqs 2 --gpu-memory-utilization 0.9 \
  --method mtp --num-speculative-tokens 1 \
  --online_quant_config '{"global_quant_config":"ptpc_i8","exclude_layer":["*linear_attn*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'

(Drop --method mtp ... for 35B if you don't want MTP; for the dense 27B MTP is a ~1.6× lossless win.)
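To see which layers the `exclude_layer` patterns in the serve command would keep in BF16, here is a small sketch. It assumes the patterns are shell-style globs matched against full module names, approximated with Python's `fnmatch`; ATOM's actual matcher may differ. The layer names and the `stays_bf16` helper are hypothetical examples.

```python
import json
from fnmatch import fnmatchcase

# The JSON string passed to --online_quant_config in the command above.
config = json.loads(
    '{"global_quant_config":"ptpc_i8","exclude_layer":'
    '["*linear_attn*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'
)


def stays_bf16(layer_name: str, patterns) -> bool:
    """True if any exclude pattern matches, i.e. the layer is not quantized."""
    return any(fnmatchcase(layer_name, p) for p in patterns)


# Hypothetical module names, for illustration only.
layers = [
    "model.layers.0.linear_attn.in_proj",       # Gated-DeltaNet block
    "model.layers.0.mlp.gate",                  # MoE router gate
    "model.layers.0.mlp.experts.gate_up_proj",  # routed experts
    "model.layers.3.self_attn.q_proj",          # full-attention projection
    "lm_head",
]

decisions = {name: stays_bf16(name, config["exclude_layer"]) for name in layers}
```

Under these assumptions the trailing-anchored `*mlp.gate` pattern catches the router gate but not `gate_up_proj` or similar expert projections, which is why it has no trailing `*`.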

Dependency

… on MoE

RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the quantization
target for this iGPU. A8W8 (int8 weight + dynamic per-token int8 activation) for
all dense GEMMs and MoE experts; BF16 for the GDN linear-attn (recurrent,
quant-sensitive), router gate, lm_head, embeddings; KV cache BF16.

- model_ops/linear.py: per_Token int8 branch -> aiter Triton gemm_a8w8 on
  non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op so it is
  HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
- model_ops/moe.py + model_ops/fused_moe_triton.py: Int8MoEMethod + int8 branch
  in FusedMoE._online_quant, and triton_kernel_int8_moe_forward using aiter
  moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13).
  Enables 35B-A3B, which has no BF16 MoE kernel on gfx1151.
- models/qwen3_5_mtp.py + model_loader/loader.py: fix the MTP-MoE drafter so the
  draft's fused expert weights load (add detect_fused_expert_format /
  get_fused_expert_mapping / load_fused_expert_weights; get_expert_mapping uses
  num_experts; loader resolves load_fused_expert_weights_fn from the model).
  Draft acceptance 0 -> 0.83; MTP now a net win on the MoE model.
- model_ops/topK.py: keep the shared expert as a separate MLP on non-gfx9 so the
  routed MoE uses the portable Triton path.
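The "fused gated-SiLU via interleaved w13" step mentioned above can be pictured with a small plain-Python sketch. It assumes the gate and up projections are interleaved per output channel (even index = gate, odd index = up), so the activation can be applied to adjacent pairs of the gemm1 output. The real layout inside moe_gemm_int8_smoothquant may differ, and all function names here are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """w is [out_features][in_features]; returns w @ x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def interleave_w13(w_gate: Matrix, w_up: Matrix) -> Matrix:
    """Build w13 with output channels ordered g0, u0, g1, u1, ... (assumed)."""
    w13 = []
    for g_row, u_row in zip(w_gate, w_up):
        w13.append(g_row)
        w13.append(u_row)
    return w13


def gemm1_fused_gated_silu(w13: Matrix, x: List[float]) -> List[float]:
    """One GEMM, then silu(gate) * up on each adjacent (gate, up) pair."""
    h = matvec(w13, x)
    return [silu(h[2 * j]) * h[2 * j + 1] for j in range(len(h) // 2)]


w_gate = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]
w_up = [[-0.3, 0.6, 0.1], [0.7, -0.4, 0.2]]
x = [1.0, 2.0, -1.0]

fused = gemm1_fused_gated_silu(interleave_w13(w_gate, w_up), x)
# Unfused reference: two separate projections, then the gated activation.
reference = [silu(g) * u for g, u in zip(matvec(w_gate, x), matvec(w_up, x))]
```

Because each gate value sits next to its up value, the activation needs no second pass over a separate tensor, which is the point of interleaving.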