[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP #1337
Open
carlushuang wants to merge 1 commit into
… on MoE

RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the quantization target for this iGPU. A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and MoE experts; BF16 for the GDN linear-attn (recurrent, quant-sensitive), router gate, lm_head, embeddings; KV cache BF16.

- model_ops/linear.py: per_Token int8 branch -> aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
- model_ops/moe.py + model_ops/fused_moe_triton.py: Int8MoEMethod + int8 branch in FusedMoE._online_quant, and triton_kernel_int8_moe_forward using aiter moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13). Enables 35B-A3B, which has no BF16 MoE kernel on gfx1151.
- models/qwen3_5_mtp.py + model_loader/loader.py: fix the MTP-MoE drafter so the draft's fused expert weights load (add detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights; get_expert_mapping uses num_experts; loader resolves load_fused_expert_weights_fn from the model). Draft acceptance 0 -> 0.83; MTP now a net win on the MoE model.
- model_ops/topK.py: keep the shared expert as a separate MLP on non-gfx9 so the routed MoE uses the portable Triton path.
[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5 (Strix Halo), with working MTP
Builds on the gfx1151 BF16 enablement (#1314) to add online INT8 W8A8 (Quark-style, no offline quant step) for the Qwen3.6 dense (27B) and MoE (35B-A3B) architectures, plus the fixes needed to make MTP speculative decoding work on the MoE draft. RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the right quantization target for this iGPU.
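As a rough illustration of the arithmetic behind "online A8W8" (int8 weights quantized once at load, activations quantized per token at call time), here is a minimal pure-Python reference. It is a sketch, not the PR's kernel path: the function names are hypothetical, symmetric scaling to ±127 is an assumption, and per-output-channel weight scales are assumed for the dense case.

```python
# Illustrative reference only. Function names are hypothetical and the
# symmetric +/-127 scaling is an assumption, not taken from the PR code.

def quantize_rows_int8(rows):
    """Symmetric int8 quantization with one scale per row.

    Returns (q_rows, scales) such that row ~= [q * scale for q in q_row].
    """
    q_rows, scales = [], []
    for row in rows:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0  # all-zero row: any scale works
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w_q, w_scales):
    """Compute y = x @ W^T from int8 operands plus a floating-point rescale.

    x: [tokens][in] floats, quantized per token on every call ("dynamic").
    w_q, w_scales: [out][in] int8 weights and one scale per output channel,
    produced once when the weights are loaded (no offline quant step).
    """
    x_q, x_scales = quantize_rows_int8(x)
    out = []
    for xq_row, xs in zip(x_q, x_scales):
        out_row = []
        for wq_row, ws in zip(w_q, w_scales):
            acc = sum(a * b for a, b in zip(xq_row, wq_row))  # integer accumulate
            out_row.append(acc * xs * ws)  # undo both scales
        out.append(out_row)
    return out


x = [[0.5, -1.0, 0.25, 2.0], [0.0, 0.1, -0.2, 0.3]]
w = [[1.0, 0.5, -0.5, 0.25], [-2.0, 1.0, 0.0, 0.5], [0.1, 0.2, 0.3, 0.4]]

w_q, w_scales = quantize_rows_int8(w)  # once, at load time
y = w8a8_linear(x, w_q, w_scales)      # activations quantized per call
y_ref = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
```

Comparing `y` with `y_ref` shows the quantization error for this toy input; the real path runs the integer accumulate on WMMA and keeps the quant-sensitive layers in BF16.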
Precision split (chosen for quality): A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and the MoE experts; BF16 for the Gated-DeltaNet linear-attention (recurrent and quant-sensitive; int8 there produces garbage), the MoE router gate, lm_head, and embeddings; KV cache BF16. This keeps gsm8k at BF16-equivalent quality while halving weight bytes (the decode bottleneck on a bandwidth-bound iGPU).
What this enables
moe_gemm_int8_smoothquant, which is the only int8-W8A8 grouped-GEMM that runs on RDNA3.5.
Changes
- model_ops/linear.py: per_Token int8 branch routes to aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op (torch.ops.aiter.atom_gemm_a8w8_triton) so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
- model_ops/moe.py: new Int8MoEMethod (int8 w13/w2 + per-channel fp32 scales) and an int8 branch in FusedMoE._online_quant.
- model_ops/fused_moe_triton.py: triton_kernel_int8_moe_forward: matmul-ogs routing → per-token int8 quant → moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13 columns) → per-token int8 quant → gemm2 with scatter/combine.
- models/qwen3_5_mtp.py + model_loader/loader.py: MTP-MoE drafter fix. The draft's fused expert weights (experts.gate_up_proj / down_proj) were silently dropped at load → 0% draft acceptance → MTP was pure overhead. Add the draft's fused-expert mapping (detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights), fix get_expert_mapping to use num_experts, and let the loader resolve load_fused_expert_weights_fn from the model. After the fix: acceptance 0 → 0.83.
- entrypoints/openai/tool_parser.py: unique tool-call ids (call_<uuid> instead of a per-response call_0). Non-unique ids made agentic clients (qwen-code) dedupe every tool call after the first → endless tool-call loop. Extends "feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support" (#1319).
Quality (gsm8k, 5-shot-equivalent, chat + thinking, greedy)
Performance (gfx1151 / Radeon 8060S, bs=1)
Decode (single-stream, short context):
Long-context (35B-A3B INT8 W8A8 + MTP-1, bs=1):
Decode tok/s falls with context (each step reads the growing KV); prefill is compute-bound (one-time prompt-ingestion cost). The hybrid model's KV is cheap (only the interleaved full-attn layers cache KV), so 128K fits easily: at gpu-memory-utilization 0.9 the KV pool holds ~2.1M tokens; the limit is --max-model-len, not memory.
Serve
(Drop --method mtp ... for 35B if you don't want MTP; for the dense 27B, MTP is a ~1.6× lossless win.)
Dependency
- carhuang/gfx1151_opus_fp8_guard): arch-guard the gfx9-only fp8/bf8-cvt builtins. Required (shared with "[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention", #1314). The int8 GEMM/MoE kernels (gemm_a8w8 Triton, moe_gemm_int8_smoothquant, per_token_quant_hip) are already upstream in aiter; no new aiter code is needed for the int8 path.
- carhuang/support_gfx1151_qwen36): the gfx1151 BF16 base enablement (arch gate, native Triton attention, GDN block_tables). Prerequisite.
- carhuang/qwen3_xml_tool_parser): qwen3_xml tool-call parsing; the unique-tool-call-id fix here extends it.
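To make the tool-call id change in entrypoints/openai/tool_parser.py concrete, here is a small sketch of why a per-response call_0 breaks a client that deduplicates by id, and how call_<uuid> ids avoid it. The helper names and the client-side dedupe logic are hypothetical stand-ins, and using the hex form of a UUID4 is an assumption; the PR only states the call_<uuid> shape.

```python
import uuid


def make_tool_call_id() -> str:
    """Return an id of the form call_<uuid> (hex UUID4 is an assumption)."""
    return f"call_{uuid.uuid4().hex}"


def dedupe_by_id(tool_calls):
    """Hypothetical client behaviour: keep only the first call seen per id."""
    seen, kept = set(), []
    for call in tool_calls:
        if call["id"] not in seen:
            seen.add(call["id"])
            kept.append(call)
    return kept


# Before: every response restarts numbering at call_0, so ids collide across turns.
old_calls = [
    {"id": "call_0", "name": "read_file"},   # turn 1
    {"id": "call_0", "name": "write_file"},  # turn 2, same id again
]

# After: each tool call gets a fresh id.
new_calls = [{"id": make_tool_call_id(), "name": n} for n in ("read_file", "write_file")]

kept_old = dedupe_by_id(old_calls)  # the second call is dropped
kept_new = dedupe_by_id(new_calls)  # both calls survive
```

With colliding ids the client never sees the later tool call as new, which matches the looping behaviour described for qwen-code.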