[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention #1314
Open
carlushuang wants to merge 2 commits into
…/act + GDN prefill block_tables
Enable native-engine inference of the Qwen3.5/Qwen3.6 architectures
(Qwen3_5ForConditionalGeneration dense / Qwen3_5MoeForConditionalGeneration
MoE: GDN linear-attn + interleaved full-attn + MTP) on gfx1151 (Strix Halo /
Radeon 8060S, RDNA3.5). No new model code — these archs already exist; this is
arch-enablement only, mirroring the gfx1201 path (ATOM_USE_UNIFIED_ATTN=1).
aiter ships hand-written HIP kernels that emit gfx9-only instructions
(v_pk_mul_f32, packed fp8-cvt). Route the affected ops to their existing
portable implementations on non-gfx9 arches:
* atom/utils/arch.py: aiter_hip_kernels_supported() capability gate (gfx9).
* layernorm.py: GemmaRMSNorm.forward -> forward_native on non-gfx9
(fused_qk_rmsnorm_group_quant uses v_pk_mul_f32).
* activation.py: SiluAndMul -> forward_native on non-gfx9
(silu_and_mul activation kernel pulls aiter_opus_plus.h).
* attentions/gdn_attn.py: populate block_tables in prepare_prefill so the
hybrid model's interleaved full-attention layers work under
unified_attention (the GDN builder previously left it None; only
TritonMHAMetadataBuilder populated it).
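A minimal sketch of the gating pattern in the list above, under stated assumptions: the real aiter_hip_kernels_supported() in atom/utils/arch.py may detect the architecture differently, and the `arch` parameter, the prefix check, and the ToyNorm class are hypothetical stand-ins rather than ATOM code.

```python
def aiter_hip_kernels_supported(arch: str) -> bool:
    """Hypothetical gate: only gfx9-family targets run the hand-written HIP kernels.

    `arch` is a target name such as "gfx942" or "gfx1151"; how ATOM obtains
    it is not shown here.
    """
    # Drop feature suffixes such as ":sramecc+:xnack-" before matching.
    base = arch.split(":", 1)[0].lower()
    return base.startswith("gfx9")


class ToyNorm:
    """Stand-in for an op that has a fused path and a portable fallback."""

    def __init__(self, arch: str) -> None:
        # Decide once at construction so forward() is a single branch.
        self._use_hip = aiter_hip_kernels_supported(arch)

    def forward_native(self, xs):
        # Portable RMS normalisation in plain Python.
        eps = 1e-6
        mean_sq = sum(x * x for x in xs) / len(xs)
        scale = (mean_sq + eps) ** -0.5
        return [x * scale for x in xs]

    def forward_hip(self, xs):
        # Placeholder for the fused kernel, which this sketch does not provide.
        raise RuntimeError("fused HIP kernel not available in this sketch")

    def forward(self, xs):
        return self.forward_hip(xs) if self._use_hip else self.forward_native(xs)
```

With this shape, ToyNorm("gfx1151").forward(...) takes the portable path, while a gfx9 target would reach the fused kernel.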
Requires aiter built with fp8/bf8-cvt builtins arch-guarded for RDNA3.5 —
companion: ROCm/aiter PR (carhuang/gfx1151_opus_fp8_guard).
Verified: Qwen3.6-27B BF16 on gfx1151 generates correct output
(ATOM_USE_UNIFIED_ATTN=1, bf16 KV, --block-size 64, eager).
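To illustrate the block_tables item in the list above: paged attention indexes the KV cache through a per-sequence table of physical block ids, and a kernel wrapper that reads the table's strides fails as soon as it is handed None. The sketch is plain Python under loud assumptions; build_block_table, its padding value, and the list-of-lists layout are hypothetical and are not the ATOM metadata builder.

```python
from typing import List, Optional

PAD_BLOCK_ID = -1  # hypothetical padding value for unused table slots


def blocks_needed(seq_len: int, block_size: int = 64) -> int:
    """Number of KV blocks required for seq_len tokens (ceiling division)."""
    return -(-seq_len // block_size)


def build_block_table(seq_blocks: List[List[int]]) -> List[List[int]]:
    """Pad per-sequence block ids to a [num_seqs, max_blocks_per_seq] table."""
    width = max((len(blocks) for blocks in seq_blocks), default=0)
    return [blocks + [PAD_BLOCK_ID] * (width - len(blocks)) for blocks in seq_blocks]


def stride_error(table: Optional[object]) -> Optional[str]:
    """Return the error message raised when the table has no `stride` attribute."""
    try:
        table.stride  # a kernel wrapper would read strides like this
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling stride_error(None) returns the same message quoted for the full-attention crash, which is why populating the table during prefill is enough to unblock those layers.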
- atom/utils/arch.py: blank line after module docstring (black)
- atom/model_ops/activation.py: move _AITER_HIP_ACT_SUPPORTED constant below the import block (ruff E402)
[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 (Strix Halo) via native Triton attention
gfx1151 (AMD Ryzen AI MAX+ / Radeon 8060S, RDNA3.5) support for the Qwen3.5/3.6 architectures. These archs already exist in ATOM (Qwen3_5ForConditionalGeneration dense / Qwen3_5MoeForConditionalGeneration MoE: Gated-DeltaNet linear-attn + interleaved full-attn + MTP), so no new model code is needed: this is arch-enablement only, in the spirit of the gfx1201 path (ATOM_USE_UNIFIED_ATTN=1, attention/GEMM via Triton/hipBLASLt).

aiter ships hand-written HIP kernels that emit gfx9-only instructions (v_pk_mul_f32, packed fp8-cvt) which don't exist on RDNA3.5. Route the affected ops to their existing portable implementations on non-gfx9 arches:

* atom/utils/arch.py: aiter_hip_kernels_supported() capability gate (gfx9).
* model_ops/layernorm.py: GemmaRMSNorm.forward → forward_native on non-gfx9 (the aiter fused_qk_rmsnorm_group_quant kernel uses v_pk_mul_f32).
* model_ops/activation.py: SiluAndMul → forward_native on non-gfx9 (the silu_and_mul activation kernel pulls in aiter_opus_plus.h).
* model_ops/attentions/gdn_attn.py: populate block_tables in prepare_prefill so the hybrid model's interleaved full-attention layers work under unified_attention. The GDN metadata builder previously left block_tables=None (only TritonMHAMetadataBuilder populated it), so the full-attn layers crashed with 'NoneType' object has no attribute 'stride'.

Models verified (gfx1151, ATOM_USE_UNIFIED_ATTN=1, bf16 KV, --block-size 64, eager)
Serve
Notes
* Keep --max-num-seqs small: the GDN per-seq state cache is large (~73 MB/slot).
* Raise GTT (amd-ttm --set <GB>; reboot) so GTT exceeds the dedicated-VRAM carveout and the driver serves VRAM allocations from GTT. GTT is bounded by system RAM, so if the BIOS dedicates a large carveout (e.g. 64 GB, leaving only ~62 GB system) GTT cannot exceed the carveout. Set the BIOS UMA/dedicated-VRAM small (e.g. 512 MB) first, then grow GTT.

Performance (gfx1151 / Radeon 8060S, Qwen3.6-27B BF16, bf16 KV, in≈500 / out=64)
Enabling HIP/CUDA graphs (i.e. not passing --enforce-eager) cuts first-token latency ~12× (6.3 s → 0.54 s) with no throughput change; cudagraph capture for bs∈{1,2,4,8} costs ~1.2 s. Recommended to leave graphs on.

Decode is memory-bandwidth-bound, not compute-bound: reading the ~54 GB of BF16 weights per token over Strix Halo's ~256 GB/s LPDDR5X caps single-stream decode near ~4.6 tok/s, and we measure 4.38 (~95% of that roofline). Prefill runs ~950 tok/s (compute-bound, near the iGPU's BF16 peak). Consequently Triton-GEMM/attention tuning and ROCBLAS_USE_HIPBLASLT=1 yield no single-stream gain here (verified identical); the only lever for materially faster decode is reducing bytes/token: FP8 (~2×) or MXFP4 (~4×). Throughput scales with batch, bounded by the GDN per-seq state + KV memory.

Full-stack status (this PR + stacked follow-ups)
This is the umbrella PR for Qwen3.6 on gfx1151 (Radeon 8060S / RDNA3.5, Strix Halo). This PR lands the BF16 arch-enablement; the stacked PRs below add online INT8 W8A8, the 35B-A3B MoE path, MTP speculative decoding, and agentic tool-calling. All measured on a single Radeon 8060S, ROCm 7.13, ATOM_USE_UNIFIED_ATTN=1, bf16 KV, --block-size 64.

Best performance (measured, single Radeon 8060S)
Single-stream decode, short context:
35B-A3B INT8 W8A8 (out_proj int8) + MTP-1 + HIP graph — decode tok/s across batch × context (shared-prefix; aggregate, per-stream in parens):
Long-context decode reflects an attention-kernel fix: profiled against the measured LPDDR5X roofline (207 GB/s), the bf16 unified_attention decode kernel was running at only ~31% of bandwidth (parallelism-bound at bs=1, not bandwidth-bound) because each attention workgroup used num_warps=2. Raising it to 8 on gfx1151 (output bitwise-identical, max-relerr 0) lifts the kernel to ~59% of roofline (1.5–1.9×) and drives the long-context gains above vs the prior surface (64K bs1 26.0→34.3 +32%, 128K bs1 18.4→28.2 +53%, 256K bs1 11.2→22.2 +98%; bs8 up to +126%). Short-context (8K) is mostly unchanged because attention is a small share of decode there. Requires a small aiter change (see dependent PRs). 256K bs≥4 is KV-capacity-limited (preemption) → weak/non-monotonic scaling.

Notes: MTP-1 helps single-stream (low bs); at high bs it is net-neutral/negative (the draft competes), so for multi-user throughput drop MTP (no-MTP 35B-A3B hits ~160 tok/s aggregate at bs=8, short ctx). Decode tok/s falls with context (KV read grows; long-context already flash-decodes via unified_attention's 3D segmented path). Cold prefill TTFT scales with context (≈85 s at 64K, ≈3 min at 128K, ≈13 min at 256K); this is a one-time per-fresh-prompt cost, and prefix caching avoids re-paying it across turns. out_proj-int8 lifts short-context decode ~+11% (BF16 GDN out_proj → int8, quality-safe); the GDN in_proj stays BF16 (int8 there fails gsm8k, since it feeds the delta-net recurrence).

Quality: 35B-A3B INT8 W8A8 gsm8k = 0.84 (BF16-equivalent; MTP is lossless). Decode is memory-bandwidth-bound for the dense GEMM/MoE weight reads; long-context decode flash-decodes via unified_attention's 3D segmented path (128 KV-splits). The attention kernel itself was not occupancy-saturated at bs=1 (see the num_warps fix above).

Reproduce from scratch (verified clean-room build)
Verified by building aiter + ATOM from source in a fresh container and reproducing the decode matrix above within ~1% (8K bs1 40.7 vs 41.2, 64K bs1 34.4 vs 34.3, 128K bs1 28.2 vs 28.2).
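As a quick cross-check of the figures quoted in this description, the arithmetic behind "within ~1%" and the long-context gains can be recomputed. This is only a sketch over the numbers as printed; the dictionary names and helper functions are illustrative, and which figure of each pair is the rebuild is not stated, so the second figure is used as the reference by assumption.

```python
def rel_diff(first: float, second: float) -> float:
    """Relative difference of two throughput figures, using the second as reference."""
    return abs(first - second) / second


def gain_percent(before: float, after: float) -> int:
    """Percentage improvement from `before` to `after`, rounded to an integer."""
    return round((after / before - 1.0) * 100)


# Decode tok/s pairs quoted for the clean-room reproduction.
repro_pairs = {
    "8K bs1": (40.7, 41.2),
    "64K bs1": (34.4, 34.3),
    "128K bs1": (28.2, 28.2),
}
worst_repro_diff = max(rel_diff(a, b) for a, b in repro_pairs.values())

# Before/after decode tok/s quoted for the num_warps change.
long_context = {
    "64K bs1": (26.0, 34.3),
    "128K bs1": (18.4, 28.2),
    "256K bs1": (11.2, 22.2),
}
gains = {name: gain_percent(b, a) for name, (b, a) in long_context.items()}
```

The largest reproduction gap comes from the 8K pair at roughly 1.2%, and the recomputed gains match the +32%, +53% and +98% stated earlier.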
Base image (ROCm 7.13 gfx1151 PyTorch stack):
Build aiter (with the two dependency PRs above) and ATOM:
Serve 35B-A3B online INT8 W8A8 + MTP-1 (HIP graphs default-on). The full env matters:
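As a hedged companion to the serve step (not the serve command itself), a launcher script could verify the environment before starting the engine. The variable names and values are the ones this description calls required; the check_env helper and REQUIRED_ENV table are hypothetical and not part of ATOM.

```python
import os
from typing import Dict, List, Mapping

# Environment variables this description lists as required on gfx1151.
REQUIRED_ENV: Dict[str, str] = {
    "ATOM_USE_UNIFIED_ATTN": "1",
    "ATOM_USE_TRITON_MOE": "1",
    "HSA_OVERRIDE_GFX_VERSION": "11.5.1",
}


def check_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return one message per variable that is missing or has an unexpected value."""
    problems = []
    for name, expected in REQUIRED_ENV.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected!r}, got {actual!r}")
    return problems
```

An empty list means the three variables are set as described; anything else is worth fixing before launching, since a missing variable may otherwise only surface as a kernel failure at run time.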
Notes:
ATOM_USE_TRITON_MOE=1 and HSA_OVERRIDE_GFX_VERSION=11.5.1 are required (in addition to ATOM_USE_UNIFIED_ATTN=1). Do not add *mtp* to exclude_layer when --method mtp is on, or the draft MoE goes unquantized and falls to an MXFP4 path that asserts. 35B needs >64 GB GPU-visible memory via GTT (see the BF16 note above), which is a host/BIOS prerequisite. The decode matrix above was measured with greedy decoding (temperature=0), which does not exercise the sampling path; interactive/agentic use (temperature>0) additionally requires aiter#3919 (or ATOM's native-sampler fallback), otherwise the engine crashes on the first sampled token.

Dependent PRs
Stack: aiter#3860 → ATOM#1314 (this) → ATOM#1337. The other aiter PRs land independently.
* block_tables
* cpp_itfs arch allow-list (top-p/top-k sampling)
* num_warps=8 (~31%→59% of LPDDR5X roofline, 1.5–1.9×)
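The roofline figures quoted in this description can be cross-checked with a few lines of arithmetic. The sketch assumes the simplest bandwidth model, in which every decoded token streams all weights once; the function and variable names are illustrative. With the rounded inputs quoted earlier (54 GB, 256 GB/s) the ceiling comes out near 4.7 tok/s, slightly above the ~4.6 stated, which presumably reflects less rounded inputs.

```python
GB = 1e9


def decode_ceiling_tok_s(weight_bytes: float, bandwidth_bytes_s: float) -> float:
    """Upper bound on single-stream decode when each token reads all weights once."""
    return bandwidth_bytes_s / weight_bytes


# BF16 dense model: ~54 GB of weights over ~256 GB/s LPDDR5X.
bf16_ceiling = decode_ceiling_tok_s(54 * GB, 256 * GB)

# Halving bytes per token (e.g. FP8) doubles the ceiling under this model.
fp8_ceiling = decode_ceiling_tok_s(27 * GB, 256 * GB)

# Measured 4.38 tok/s against the quoted ~4.6 tok/s ceiling.
efficiency = 4.38 / 4.6

# Attention kernel: share of the measured 207 GB/s roofline before and after.
kernel_bw_before = 0.31 * 207  # GB/s achieved with num_warps=2
kernel_bw_after = 0.59 * 207   # GB/s achieved with num_warps=8
kernel_speedup = kernel_bw_after / kernel_bw_before
```

The efficiency ratio lands at about 95%, and the kernel speedup of about 1.9× sits at the top of the quoted 1.5–1.9× range.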