[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention by carlushuang · Pull Request #1314 · ROCm/ATOM

carlushuang · 2026-06-22T15:55:49Z

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 (Strix Halo) via native Triton attention

gfx1151 (AMD Ryzen AI MAX+ / Radeon 8060S, RDNA3.5) support for the Qwen3.5/3.6 architectures. These archs already exist in ATOM (Qwen3_5ForConditionalGeneration dense / Qwen3_5MoeForConditionalGeneration MoE: Gated-DeltaNet linear-attn + interleaved full-attn + MTP), so no new model code — this is arch-enablement only, in the spirit of the gfx1201 path (ATOM_USE_UNIFIED_ATTN=1, attention/GEMM via Triton/hipBLASLt).

aiter ships hand-written HIP kernels that emit gfx9-only instructions (v_pk_mul_f32, packed fp8-cvt) which don't exist on RDNA3.5. Route the affected ops to their existing portable implementations on non-gfx9 arches:

atom/utils/arch.py — aiter_hip_kernels_supported() capability gate (gfx9).
model_ops/layernorm.py — GemmaRMSNorm.forward → forward_native on non-gfx9 (the aiter fused_qk_rmsnorm_group_quant kernel uses v_pk_mul_f32).
model_ops/activation.py — SiluAndMul → forward_native on non-gfx9 (the silu_and_mul activation kernel pulls in aiter_opus_plus.h).
model_ops/attentions/gdn_attn.py — populate block_tables in prepare_prefill so the hybrid model's interleaved full-attention layers work under unified_attention. The GDN metadata builder previously left block_tables=None (only TritonMHAMetadataBuilder populated it), so the full-attn layers crashed with 'NoneType' object has no attribute 'stride'.

Scope: text path. Qwen3.5/3.6 are multimodal (VL) checkpoints (~1/3 of tensors are the model.visual.* ViT); this enablement and every number below cover the text path. The vision tower itself does run correctly on gfx1151 (verified: images perceived correctly) — but multimodal serving currently requires --no-enable_prefix_caching: ATOM's prefix cache collides image-placeholder tokens (all id 248056) across different images and crashes the runner. That's an arch-independent engine issue (needs multimodal-aware cache hashing), tracked as a separate follow-up.

Models verified (gfx1151, ATOM_USE_UNIFIED_ATTN=1, bf16 KV, --block-size 64, eager)

Qwen3.6-27B BF16 — coherent + correct (e.g. "60 km in 45 min" → 80 km/h).

Serve

ATOM_USE_UNIFIED_ATTN=1 \
python -m atom.entrypoints.openai_server --model Qwen/Qwen3.6-27B \
  --trust-remote-code -tp 1 --kv_cache_dtype bf16 --block-size 64 \
  --max-model-len 4096 --max-num-seqs 8 --gpu-memory-utilization 0.97

Notes

--max-num-seqs small: the GDN per-seq state cache is large (~73 MB/slot).
GEMM uses hipBLASLt ("torch solution"); no gfx1151 Triton GEMM tuning yet.
35B-A3B BF16 (~72 GB) needs >64 GB of GPU-visible memory. On unified-memory Strix Halo this comes from GTT: raise the TTM page limit (amd-ttm --set <GB>; reboot) so GTT exceeds the dedicated-VRAM carveout and the driver serves VRAM allocations from GTT. GTT is bounded by system RAM, so if the BIOS dedicates a large carveout (e.g. 64 GB, leaving only ~62 GB system) GTT cannot exceed it — set the BIOS UMA/dedicated-VRAM small (e.g. 512 MB) first, then grow GTT.

Performance (gfx1151 / Radeon 8060S, Qwen3.6-27B BF16, bf16 KV, in≈500 / out=64)

Mode	TTFT	Decode (single)	Throughput (batch 8)
`--enforce-eager`	~6.3 s	4.2 tok/s	~31 tok/s
HIP graphs (default)	0.54 s	4.38 tok/s	~30 tok/s

Enabling HIP/CUDA graphs (i.e. not passing --enforce-eager) cuts first-token latency ~12× (6.3 s → 0.54 s) with no throughput change; cudagraph capture for bs∈{1,2,4,8} costs ~1.2 s. Recommended to leave graphs on.

Decode is memory-bandwidth-bound, not compute-bound: reading the ~54 GB of BF16 weights per token over Strix Halo's ~256 GB/s LPDDR5X caps single-stream decode near ~4.6 tok/s, and we measure 4.38 (~95% of that roofline). Prefill runs ~950 tok/s (compute-bound, near the iGPU's BF16 peak). Consequently Triton-GEMM/attention tuning and ROCBLAS_USE_HIPBLASLT=1 yield no single-stream gain here (verified identical); the only lever for materially faster decode is reducing bytes/token — FP8 (~2×) or MXFP4 (~4×). Throughput scales with batch, bounded by the GDN per-seq state + KV memory.

Full-stack status (this PR + stacked follow-ups)

This is the umbrella PR for Qwen3.6 on gfx1151 (Radeon 8060S / RDNA3.5, Strix Halo). This PR lands the BF16 arch-enablement; the stacked PRs below add online INT8 W8A8, the 35B-A3B MoE path, MTP speculative decoding, and agentic tool-calling. All measured on a single Radeon 8060S, ROCm 7.13, ATOM_USE_UNIFIED_ATTN=1, bf16 KV, --block-size 64.

Best performance (measured, single Radeon 8060S)

Single-stream decode, short context:

Model	Precision	Config	Decode tok/s
27B dense	BF16	HIP graph	4.4
27B dense	INT8 W8A8	HIP graph	6.0
27B dense	INT8 W8A8	+ MTP-1	9.4
35B-A3B	INT8 W8A8	HIP graph	24.8
35B-A3B	INT8 W8A8	+ MTP-1	~41

35B-A3B INT8 W8A8 (out_proj int8) + MTP-1 + HIP graph — decode tok/s across batch × context (shared-prefix; aggregate, per-stream in parens):

context \ bs	1	2	4	8
8K	41.2	77.6 (38.8)	134.4 (33.6)	224.0 (28.0)
64K	34.3	58.7 (29.3)	94.4 (23.6)	141.1 (17.6)
128K	28.2	47.5 (23.8)	61.9 (15.5)	74.5 (9.3)
256K	22.2	28.2 (14.1)	26.0 (6.5)	41.4 (5.2)

Long-context decode reflects an attention-kernel fix: profiled against the measured LPDDR5X roofline (207 GB/s), the bf16 unified_attention decode kernel was running at only ~31% of bandwidth — parallelism-bound at bs=1, not bandwidth-bound — because each attention workgroup used num_warps=2. Raising it to 8 on gfx1151 (output bitwise-identical, max-relerr 0) lifts the kernel to ~59% of roofline (1.5–1.9×) and drives the long-context gains above vs the prior surface (64K bs1 26.0→34.3 +32%, 128K bs1 18.4→28.2 +53%, 256K bs1 11.2→22.2 +98%; bs8 up to +126%). Short-context (8K) is mostly unchanged because attention is a small share of decode there. Requires a small aiter change (see dependent PRs). 256K bs≥4 is KV-capacity-limited (preemption) → weak/non-monotonic scaling.

Notes: MTP-1 helps single-stream (low bs); at high bs it is net-neutral/negative (the draft competes), so for multi-user throughput drop MTP (no-MTP 35B-A3B hits ~160 tok/s aggregate at bs=8, short ctx). Decode tok/s falls with context (KV read grows; long-context already flash-decodes via unified_attention's 3D segmented path). Cold prefill TTFT scales with context (≈85 s at 64K, ≈3 min at 128K, ≈13 min at 256K) — a one-time per-fresh-prompt cost; prefix caching avoids re-paying it across turns. out_proj-int8 lifts short-context decode ~+11% (BF16 GDN out_proj → int8, quality-safe); the GDN in_proj stays BF16 (int8 there fails gsm8k — feeds the delta-net recurrence).
Quality: 35B-A3B INT8 W8A8 gsm8k = 0.84 (BF16-equivalent; MTP is lossless). Decode is memory-bandwidth-bound for the dense GEMM/MoE weight reads; long-context decode flash-decodes via unified_attention's 3D segmented path (128 KV-splits). The attention kernel itself was not occupancy-saturated at bs=1 (see the num_warps fix above).

Reproduce from scratch (verified clean-room build)

Verified by building aiter + ATOM from source in a fresh container and reproducing the decode matrix above within ~1% (8K bs1 40.7 vs 41.2, 64K bs1 34.4 vs 34.3, 128K bs1 28.2 vs 28.2).

Base image (ROCm 7.13 gfx1151 PyTorch stack):

docker run -d --name atom --device /dev/kfd --device /dev/dri --group-add video \
  --ipc host --network host --shm-size 16g --security-opt label=disable \
  -v $PWD:/work rocm/vllm:rocm7.13.0_gfx1151_ubuntu24.04_py3.13_pytorch_2.10.0_vllm_0.19.1 sleep infinity

Build aiter (with the two dependency PRs above) and ATOM:

# aiter — recursive for the composable_kernel submodule; keep the image's triton
git clone --recursive https://github.com/ROCm/aiter.git && cd aiter
#   apply aiter#3917 (INT8 GEMM config, required), aiter#3919 (sampler arch gate, required for temp>0), aiter#3915 (attention num_warps, perf)
GPU_ARCHS=gfx1151 AITER_USE_SYSTEM_TRITON=1 python3 setup.py develop && cd ..

# ATOM (#1337 branch = this PR's BF16 base + INT8 + MTP)
git clone -b carhuang/gfx1151_int8_qwen36 https://github.com/ROCm/ATOM.git && cd ATOM
pip install -e .   # pins transformers==5.2.0; the resulting vllm pin warning is benign (native engine)

Serve 35B-A3B online INT8 W8A8 + MTP-1 (HIP graphs default-on). The full env matters:

export GPU_ARCHS=gfx1151 HSA_OVERRIDE_GFX_VERSION=11.5.1 PREBUILD_KERNELS=0 \
       ATOM_USE_UNIFIED_ATTN=1 ATOM_USE_TRITON_MOE=1
python3 -m atom.entrypoints.openai_server --model <Qwen3.6-35B-A3B> --trust-remote-code \
  -tp 1 --kv_cache_dtype bf16 --block-size 64 --max-model-len 131072 --max-num-seqs 8 \
  --gpu-memory-utilization 0.9 --method mtp --num-speculative-tokens 1 \
  --online_quant_config '{"global_quant_config":"ptpc_i8","exclude_layer":["*in_proj*","*linear_attn.conv1d*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'

Notes: ATOM_USE_TRITON_MOE=1 and HSA_OVERRIDE_GFX_VERSION=11.5.1 are required (in addition to ATOM_USE_UNIFIED_ATTN=1). Do not add *mtp* to exclude_layer when --method mtp is on, or the draft MoE goes unquantized and falls to an MXFP4 path that asserts. 35B needs >64 GB GPU-visible memory via GTT (see the BF16 note above) — a host/BIOS prerequisite. The decode matrix above was measured with greedy decoding (temperature=0), which does not exercise the sampling path; interactive/agentic use (temperature>0) additionally requires aiter#3919 (or ATOM's native-sampler fallback), otherwise the engine crashes on the first sampled token.

Dependent PRs

Stack: aiter#3860 → ATOM#1314 (this) → ATOM#1337. The other aiter PRs land independently.

PR	Role	Status / need
ATOM #1314 (this)	gfx1151 BF16 base: arch gate, native Triton attention, GDN `block_tables`	—
ATOM #1337	online INT8 W8A8 dense+MoE, 35B-A3B, MTP-on-MoE drafter fix	stacked on #1314
ATOM #1319	qwen3_xml tool-calling + unique tool-call ids (agentic)	independent
aiter #3860	arch-guard gfx9-only fp8/bf8-cvt builtins	merged
aiter #3917	gfx1151 INT8 W8A8 GEMM default config	required for INT8
aiter #3919	gfx1151 in `cpp_itfs` arch allow-list (top-p/top-k sampling)	required for temp>0 / agentic
aiter #3915	attention `num_warps=8` (~31%→59% of LPDDR5X roofline, 1.5–1.9×)	perf, optional

…/act + GDN prefill block_tables Enable native-engine inference of the Qwen3.5/Qwen3.6 architectures (Qwen3_5ForConditionalGeneration dense / Qwen3_5MoeForConditionalGeneration MoE: GDN linear-attn + interleaved full-attn + MTP) on gfx1151 (Strix Halo / Radeon 8060S, RDNA3.5). No new model code — these archs already exist; this is arch-enablement only, mirroring the gfx1201 path (ATOM_USE_UNIFIED_ATTN=1). aiter ships hand-written HIP kernels that emit gfx9-only instructions (v_pk_mul_f32, packed fp8-cvt). Route the affected ops to their existing portable implementations on non-gfx9 arches: * atom/utils/arch.py: aiter_hip_kernels_supported() capability gate (gfx9). * layernorm.py: GemmaRMSNorm.forward -> forward_native on non-gfx9 (fused_qk_rmsnorm_group_quant uses v_pk_mul_f32). * activation.py: SiluAndMul -> forward_native on non-gfx9 (silu_and_mul activation kernel pulls aiter_opus_plus.h). * attentions/gdn_attn.py: populate block_tables in prepare_prefill so the hybrid models interleaved full-attention layers work under unified_attention (the GDN builder previously left it None; only TritonMHAMetadataBuilder populated it). Requires aiter built with fp8/bf8-cvt builtins arch-guarded for RDNA3.5 — companion: ROCm/aiter PR (carhuang/gfx1151_opus_fp8_guard). Verified: Qwen3.6-27B BF16 on gfx1151 generates correct output (ATOM_USE_UNIFIED_ATTN=1, bf16 KV, --block-size 64, eager).

- atom/utils/arch.py: blank line after module docstring (black) - atom/model_ops/activation.py: move _AITER_HIP_ACT_SUPPORTED constant below the import block (ruff E402)

carlushuang mentioned this pull request Jun 22, 2026

[OPUS]: arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151) ROCm/aiter#3860

Merged

style: satisfy black + ruff for gfx1151 changes

0224d70

- atom/utils/arch.py: blank line after module docstring (black) - atom/model_ops/activation.py: move _AITER_HIP_ACT_SUPPORTED constant below the import block (ruff E402)

carlushuang mentioned this pull request Jun 24, 2026

[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP #1337

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention#1314

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention#1314
carlushuang wants to merge 2 commits into
mainfrom
carhuang/support_gfx1151_qwen36

carlushuang commented Jun 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

carlushuang commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 (Strix Halo) via native Triton attention

Models verified (gfx1151, ATOM_USE_UNIFIED_ATTN=1, bf16 KV, --block-size 64, eager)

Serve

Notes

Performance (gfx1151 / Radeon 8060S, Qwen3.6-27B BF16, bf16 KV, in≈500 / out=64)

Full-stack status (this PR + stacked follow-ups)

Best performance (measured, single Radeon 8060S)

Reproduce from scratch (verified clean-room build)

Dependent PRs

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

carlushuang commented Jun 22, 2026 •

edited

Loading