Skip to content

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention#1314

Open
carlushuang wants to merge 2 commits into
mainfrom
carhuang/support_gfx1151_qwen36
Open

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention#1314
carlushuang wants to merge 2 commits into
mainfrom
carhuang/support_gfx1151_qwen36

Conversation

@carlushuang

@carlushuang carlushuang commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 (Strix Halo) via native Triton attention

gfx1151 (AMD Ryzen AI MAX+ / Radeon 8060S, RDNA3.5) support for the Qwen3.5/3.6 architectures. These archs already exist in ATOM (Qwen3_5ForConditionalGeneration dense / Qwen3_5MoeForConditionalGeneration MoE: Gated-DeltaNet linear-attn + interleaved full-attn + MTP), so no new model code — this is arch-enablement only, in the spirit of the gfx1201 path (ATOM_USE_UNIFIED_ATTN=1, attention/GEMM via Triton/hipBLASLt).

aiter ships hand-written HIP kernels that emit gfx9-only instructions (v_pk_mul_f32, packed fp8-cvt) which don't exist on RDNA3.5. Route the affected ops to their existing portable implementations on non-gfx9 arches:

  • atom/utils/arch.pyaiter_hip_kernels_supported() capability gate (gfx9).
  • model_ops/layernorm.pyGemmaRMSNorm.forwardforward_native on non-gfx9 (the aiter fused_qk_rmsnorm_group_quant kernel uses v_pk_mul_f32).
  • model_ops/activation.pySiluAndMulforward_native on non-gfx9 (the silu_and_mul activation kernel pulls in aiter_opus_plus.h).
  • model_ops/attentions/gdn_attn.py — populate block_tables in prepare_prefill so the hybrid model's interleaved full-attention layers work under unified_attention. The GDN metadata builder previously left block_tables=None (only TritonMHAMetadataBuilder populated it), so the full-attn layers crashed with 'NoneType' object has no attribute 'stride'.

Scope: text path. Qwen3.5/3.6 are multimodal (VL) checkpoints (~1/3 of tensors are the model.visual.* ViT); this enablement and every number below cover the text path. The vision tower itself does run correctly on gfx1151 (verified: images perceived correctly) — but multimodal serving currently requires --no-enable_prefix_caching: ATOM's prefix cache collides image-placeholder tokens (all id 248056) across different images and crashes the runner. That's an arch-independent engine issue (needs multimodal-aware cache hashing), tracked as a separate follow-up.

Models verified (gfx1151, ATOM_USE_UNIFIED_ATTN=1, bf16 KV, --block-size 64, eager)

  • Qwen3.6-27B BF16 — coherent + correct (e.g. "60 km in 45 min" → 80 km/h).

Serve

ATOM_USE_UNIFIED_ATTN=1 \
python -m atom.entrypoints.openai_server --model Qwen/Qwen3.6-27B \
  --trust-remote-code -tp 1 --kv_cache_dtype bf16 --block-size 64 \
  --max-model-len 4096 --max-num-seqs 8 --gpu-memory-utilization 0.97

Notes

  • --max-num-seqs small: the GDN per-seq state cache is large (~73 MB/slot).
  • GEMM uses hipBLASLt ("torch solution"); no gfx1151 Triton GEMM tuning yet.
  • 35B-A3B BF16 (~72 GB) needs >64 GB of GPU-visible memory. On unified-memory Strix Halo this comes from GTT: raise the TTM page limit (amd-ttm --set <GB>; reboot) so GTT exceeds the dedicated-VRAM carveout and the driver serves VRAM allocations from GTT. GTT is bounded by system RAM, so if the BIOS dedicates a large carveout (e.g. 64 GB, leaving only ~62 GB system) GTT cannot exceed it — set the BIOS UMA/dedicated-VRAM small (e.g. 512 MB) first, then grow GTT.

Performance (gfx1151 / Radeon 8060S, Qwen3.6-27B BF16, bf16 KV, in≈500 / out=64)

Mode TTFT Decode (single) Throughput (batch 8)
--enforce-eager ~6.3 s 4.2 tok/s ~31 tok/s
HIP graphs (default) 0.54 s 4.38 tok/s ~30 tok/s

Enabling HIP/CUDA graphs (i.e. not passing --enforce-eager) cuts first-token latency ~12× (6.3 s → 0.54 s) with no throughput change; cudagraph capture for bs∈{1,2,4,8} costs ~1.2 s. Recommended to leave graphs on.

Decode is memory-bandwidth-bound, not compute-bound: reading the ~54 GB of BF16 weights per token over Strix Halo's ~256 GB/s LPDDR5X caps single-stream decode near ~4.6 tok/s, and we measure 4.38 (~95% of that roofline). Prefill runs ~950 tok/s (compute-bound, near the iGPU's BF16 peak). Consequently Triton-GEMM/attention tuning and ROCBLAS_USE_HIPBLASLT=1 yield no single-stream gain here (verified identical); the only lever for materially faster decode is reducing bytes/token — FP8 (~2×) or MXFP4 (~4×). Throughput scales with batch, bounded by the GDN per-seq state + KV memory.


Full-stack status (this PR + stacked follow-ups)

This is the umbrella PR for Qwen3.6 on gfx1151 (Radeon 8060S / RDNA3.5, Strix Halo). This PR lands the BF16 arch-enablement; the stacked PRs below add online INT8 W8A8, the 35B-A3B MoE path, MTP speculative decoding, and agentic tool-calling. All measured on a single Radeon 8060S, ROCm 7.13, ATOM_USE_UNIFIED_ATTN=1, bf16 KV, --block-size 64.

Best performance (measured, single Radeon 8060S)

Single-stream decode, short context:

Model Precision Config Decode tok/s
27B dense BF16 HIP graph 4.4
27B dense INT8 W8A8 HIP graph 6.0
27B dense INT8 W8A8 + MTP-1 9.4
35B-A3B INT8 W8A8 HIP graph 24.8
35B-A3B INT8 W8A8 + MTP-1 ~41

35B-A3B INT8 W8A8 (out_proj int8) + MTP-1 + HIP graph — decode tok/s across batch × context (shared-prefix; aggregate, per-stream in parens):

context \ bs 1 2 4 8
8K 41.2 77.6 (38.8) 134.4 (33.6) 224.0 (28.0)
64K 34.3 58.7 (29.3) 94.4 (23.6) 141.1 (17.6)
128K 28.2 47.5 (23.8) 61.9 (15.5) 74.5 (9.3)
256K 22.2 28.2 (14.1) 26.0 (6.5) 41.4 (5.2)

Long-context decode reflects an attention-kernel fix: profiled against the measured LPDDR5X roofline (207 GB/s), the bf16 unified_attention decode kernel was running at only ~31% of bandwidth — parallelism-bound at bs=1, not bandwidth-bound — because each attention workgroup used num_warps=2. Raising it to 8 on gfx1151 (output bitwise-identical, max-relerr 0) lifts the kernel to ~59% of roofline (1.5–1.9×) and drives the long-context gains above vs the prior surface (64K bs1 26.0→34.3 +32%, 128K bs1 18.4→28.2 +53%, 256K bs1 11.2→22.2 +98%; bs8 up to +126%). Short-context (8K) is mostly unchanged because attention is a small share of decode there. Requires a small aiter change (see dependent PRs). 256K bs≥4 is KV-capacity-limited (preemption) → weak/non-monotonic scaling.

Notes: MTP-1 helps single-stream (low bs); at high bs it is net-neutral/negative (the draft competes), so for multi-user throughput drop MTP (no-MTP 35B-A3B hits ~160 tok/s aggregate at bs=8, short ctx). Decode tok/s falls with context (KV read grows; long-context already flash-decodes via unified_attention's 3D segmented path). Cold prefill TTFT scales with context (≈85 s at 64K, ≈3 min at 128K, ≈13 min at 256K) — a one-time per-fresh-prompt cost; prefix caching avoids re-paying it across turns. out_proj-int8 lifts short-context decode ~+11% (BF16 GDN out_proj → int8, quality-safe); the GDN in_proj stays BF16 (int8 there fails gsm8k — feeds the delta-net recurrence).
Quality: 35B-A3B INT8 W8A8 gsm8k = 0.84 (BF16-equivalent; MTP is lossless). Decode is memory-bandwidth-bound for the dense GEMM/MoE weight reads; long-context decode flash-decodes via unified_attention's 3D segmented path (128 KV-splits). The attention kernel itself was not occupancy-saturated at bs=1 (see the num_warps fix above).

Reproduce from scratch (verified clean-room build)

Verified by building aiter + ATOM from source in a fresh container and reproducing the decode matrix above within ~1% (8K bs1 40.7 vs 41.2, 64K bs1 34.4 vs 34.3, 128K bs1 28.2 vs 28.2).

Base image (ROCm 7.13 gfx1151 PyTorch stack):

docker run -d --name atom --device /dev/kfd --device /dev/dri --group-add video \
  --ipc host --network host --shm-size 16g --security-opt label=disable \
  -v $PWD:/work rocm/vllm:rocm7.13.0_gfx1151_ubuntu24.04_py3.13_pytorch_2.10.0_vllm_0.19.1 sleep infinity

Build aiter (with the two dependency PRs above) and ATOM:

# aiter — recursive for the composable_kernel submodule; keep the image's triton
git clone --recursive https://github.com/ROCm/aiter.git && cd aiter
#   apply aiter#3917 (INT8 GEMM config, required), aiter#3919 (sampler arch gate, required for temp>0), aiter#3915 (attention num_warps, perf)
GPU_ARCHS=gfx1151 AITER_USE_SYSTEM_TRITON=1 python3 setup.py develop && cd ..

# ATOM (#1337 branch = this PR's BF16 base + INT8 + MTP)
git clone -b carhuang/gfx1151_int8_qwen36 https://github.com/ROCm/ATOM.git && cd ATOM
pip install -e .   # pins transformers==5.2.0; the resulting vllm pin warning is benign (native engine)

Serve 35B-A3B online INT8 W8A8 + MTP-1 (HIP graphs default-on). The full env matters:

export GPU_ARCHS=gfx1151 HSA_OVERRIDE_GFX_VERSION=11.5.1 PREBUILD_KERNELS=0 \
       ATOM_USE_UNIFIED_ATTN=1 ATOM_USE_TRITON_MOE=1
python3 -m atom.entrypoints.openai_server --model <Qwen3.6-35B-A3B> --trust-remote-code \
  -tp 1 --kv_cache_dtype bf16 --block-size 64 --max-model-len 131072 --max-num-seqs 8 \
  --gpu-memory-utilization 0.9 --method mtp --num-speculative-tokens 1 \
  --online_quant_config '{"global_quant_config":"ptpc_i8","exclude_layer":["*in_proj*","*linear_attn.conv1d*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'

Notes: ATOM_USE_TRITON_MOE=1 and HSA_OVERRIDE_GFX_VERSION=11.5.1 are required (in addition to ATOM_USE_UNIFIED_ATTN=1). Do not add *mtp* to exclude_layer when --method mtp is on, or the draft MoE goes unquantized and falls to an MXFP4 path that asserts. 35B needs >64 GB GPU-visible memory via GTT (see the BF16 note above) — a host/BIOS prerequisite. The decode matrix above was measured with greedy decoding (temperature=0), which does not exercise the sampling path; interactive/agentic use (temperature>0) additionally requires aiter#3919 (or ATOM's native-sampler fallback), otherwise the engine crashes on the first sampled token.

Dependent PRs

Stack: aiter#3860ATOM#1314 (this) → ATOM#1337. The other aiter PRs land independently.

PR Role Status / need
ATOM #1314 (this) gfx1151 BF16 base: arch gate, native Triton attention, GDN block_tables
ATOM #1337 online INT8 W8A8 dense+MoE, 35B-A3B, MTP-on-MoE drafter fix stacked on #1314
ATOM #1319 qwen3_xml tool-calling + unique tool-call ids (agentic) independent
aiter #3860 arch-guard gfx9-only fp8/bf8-cvt builtins merged
aiter #3917 gfx1151 INT8 W8A8 GEMM default config required for INT8
aiter #3919 gfx1151 in cpp_itfs arch allow-list (top-p/top-k sampling) required for temp>0 / agentic
aiter #3915 attention num_warps=8 (~31%→59% of LPDDR5X roofline, 1.5–1.9×) perf, optional

…/act + GDN prefill block_tables

Enable native-engine inference of the Qwen3.5/Qwen3.6 architectures
(Qwen3_5ForConditionalGeneration dense / Qwen3_5MoeForConditionalGeneration
MoE: GDN linear-attn + interleaved full-attn + MTP) on gfx1151 (Strix Halo /
Radeon 8060S, RDNA3.5). No new model code — these archs already exist; this is
arch-enablement only, mirroring the gfx1201 path (ATOM_USE_UNIFIED_ATTN=1).

aiter ships hand-written HIP kernels that emit gfx9-only instructions
(v_pk_mul_f32, packed fp8-cvt). Route the affected ops to their existing
portable implementations on non-gfx9 arches:

  * atom/utils/arch.py: aiter_hip_kernels_supported() capability gate (gfx9).
  * layernorm.py: GemmaRMSNorm.forward -> forward_native on non-gfx9
    (fused_qk_rmsnorm_group_quant uses v_pk_mul_f32).
  * activation.py: SiluAndMul -> forward_native on non-gfx9
    (silu_and_mul activation kernel pulls aiter_opus_plus.h).
  * attentions/gdn_attn.py: populate block_tables in prepare_prefill so the
    hybrid models interleaved full-attention layers work under
    unified_attention (the GDN builder previously left it None; only
    TritonMHAMetadataBuilder populated it).

Requires aiter built with fp8/bf8-cvt builtins arch-guarded for RDNA3.5 —
companion: ROCm/aiter PR (carhuang/gfx1151_opus_fp8_guard).

Verified: Qwen3.6-27B BF16 on gfx1151 generates correct output
(ATOM_USE_UNIFIED_ATTN=1, bf16 KV, --block-size 64, eager).
- atom/utils/arch.py: blank line after module docstring (black)
- atom/model_ops/activation.py: move _AITER_HIP_ACT_SUPPORTED constant
  below the import block (ruff E402)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant