[ROCm][DSV4] Default to aiter mHC pre/post, add tilelang opt-in knob by Fangzhou-Ai · Pull Request #6 · Fangzhou-Ai/vllm

Fangzhou-Ai · 2026-05-29T06:12:44Z

Summary

Re-enable the aiter mHC (Manifold-Constrained Hyper-Connections) pre/post ops as the default ROCm path for DeepSeek V4, replacing the tilelang fused post+pre kernel from vllm-project#43679. The tilelang fused path is preserved as an opt-in via a new env knob:

VLLM_ROCM_USE_TILELANG_MHC (default False): set to 1 to use the tilelang fused post+pre kernel instead of the default aiter mHC pre/post ops. The knob only takes effect on ROCm when VLLM_ROCM_USE_AITER is enabled; on CUDA, tilelang is always used.
The previously proposed VLLM_ROCM_USE_AITER_MHC knob is removed in favor of this tilelang opt-in design (aiter is the default).
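The selection rule described by these two items can be sketched as a small pure function. This is only an illustration: `select_mhc_backend` and its arguments are hypothetical names, not code from this PR, and the behavior on ROCm without VLLM_ROCM_USE_AITER (falling back to tilelang) is an assumption.

```python
from typing import Mapping


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Treat "1"/"true" (case-insensitive) as enabled; everything else as off."""
    return env.get(name, "0").strip().lower() in ("1", "true")


def select_mhc_backend(is_rocm: bool, env: Mapping[str, str]) -> str:
    """Hypothetical helper: return "aiter" or "tilelang" for the mHC ops."""
    if not is_rocm:
        # Stated in the PR: on CUDA, tilelang is always used.
        return "tilelang"
    if not _flag(env, "VLLM_ROCM_USE_AITER"):
        # Assumption: without aiter enabled, only tilelang is available.
        return "tilelang"
    if _flag(env, "VLLM_ROCM_USE_TILELANG_MHC"):
        # Explicit opt-in to the tilelang fused post+pre kernel.
        return "tilelang"
    # Default on ROCm with aiter enabled.
    return "aiter"
```

Passing `os.environ` as `env` would apply the same rule to the live process environment.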

The aiter mHC kernels require aiter >= 0.1.14, which contains the race-condition fix for mhc_pre_gemm_sqrsum_kernel (ROCm/aiter@b639cb6); without it, results are wrong at large token counts. AITER_BRANCH in docker/Dockerfile.rocm_base is bumped from v0.1.13 to v0.1.14.
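A minimal sketch of how the version floor could be checked at runtime, using only the standard library. The helper names are hypothetical and this PR does not necessarily perform such a check; note that the simple parser treats a pre-release such as `0.1.14rc1` the same as `0.1.14`.

```python
from itertools import takewhile

# Minimum aiter release stated in this PR to carry the sqrsum race fix.
MIN_AITER_FOR_MHC = (0, 1, 14)


def parse_version(text: str) -> tuple:
    """Parse "v0.1.14" or "0.1.14.post1" into a tuple of leading integers."""
    core = text.strip()
    if core.startswith("v"):
        core = core[1:]
    parts = []
    for piece in core.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at the first non-numeric component, e.g. "post1"
        parts.append(int(digits))
    return tuple(parts)


def aiter_mhc_supported(version_text: str) -> bool:
    """Hypothetical gate: True if the aiter version meets the stated floor."""
    return parse_version(version_text) >= MIN_AITER_FOR_MHC
```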

The unfused aiter path applies hc_post inline and returns no residual streams, so the deferred hc_post in DeepseekV4Model.forward and the MTP layer is gated on `residual is not None` rather than on has_tilelang/is_cuda.
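The gating change can be illustrated with a toy model that uses plain lists instead of tensors. Everything here is a stand-in: `finish_forward` and `toy_hc_post` are hypothetical, and the element-wise add is not the real hc_post computation. The point is only that the decision depends on whether a residual was returned, not on which platform or backend is active.

```python
from typing import Callable, List, Optional

Vector = List[float]


def toy_hc_post(hidden: Vector, residual: Vector) -> Vector:
    """Placeholder for the deferred hc_post op (NOT the real math)."""
    return [h + r for h, r in zip(hidden, residual)]


def finish_forward(
    hidden: Vector,
    residual: Optional[Vector],
    hc_post: Callable[[Vector, Vector], Vector] = toy_hc_post,
) -> Vector:
    """Apply the deferred hc_post only when a residual stream exists."""
    if residual is not None:
        # Fused path: the last layer left hc_post pending.
        return hc_post(hidden, residual)
    # Unfused aiter path: hc_post was already applied inline.
    return hidden
```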

Why not a duplicate

This builds directly on top of vllm-project#43679 (tilelang mHC), which explicitly left a hook to re-enable aiter mHC "in later PRs" once an aiter version with the sqrsum fix was available. This is that follow-up.

Accuracy — gsm8k (DeepSeek-V4-Pro, no MTP, aiter mHC, num_fewshot=20)

Filter	Metric	Value	Stderr
flexible-extract	exact_match	0.9507	±0.0060
strict-match	exact_match	0.9515	±0.0059

Both scores fall within the vllm-project#43679 gate of 0.95 ± 0.01 (tilelang baseline there: 0.9553 / 0.9560). A large-token stress test confirmed that the v0.1.14 kernel fixes the sqrsum race (v0.1.13 regressed badly at high token counts).
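The gate comparison is simple enough to express directly. The sketch below uses the scores reported above; the function and dictionary names are illustrative only.

```python
GATE_CENTER = 0.95
GATE_TOLERANCE = 0.01


def within_gate(score: float) -> bool:
    """True if the score lies inside 0.95 +/- 0.01."""
    return abs(score - GATE_CENTER) <= GATE_TOLERANCE


# exact_match values reported in this PR and the tilelang baseline it cites.
SCORES = {
    "aiter flexible-extract": 0.9507,
    "aiter strict-match": 0.9515,
    "tilelang flexible-extract": 0.9553,
    "tilelang strict-match": 0.9560,
}

gate_results = {name: within_gate(value) for name, value in SCORES.items()}
```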

Performance — aiter mHC vs tilelang fused

Server: DeepSeek-V4-Pro, tp=8, --gpu-memory-utilization 0.8, --kv-cache-dtype fp8, --moe-backend triton_unfused, --compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE"}', VLLM_ROCM_USE_AITER=1, no MTP.
Bench: vllm bench serve random, --random-range-ratio 0.8, --num-prompts C*5, --num-warmups C*2 (InferenceX-aligned sweep). Output tok/s, higher is better.
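The per-concurrency sweep parameters can be derived as in the sketch below. The `--random-range-ratio`, `--num-prompts`, and `--num-warmups` flags come from the description above; `--dataset-name`, `--random-input-len`, `--random-output-len`, and `--max-concurrency` are assumptions about the CLI that should be checked against `vllm bench serve --help`. The helper name is hypothetical.

```python
from typing import List


def bench_args(concurrency: int, input_len: int, output_len: int) -> List[str]:
    """Build an argument list for one point of the sweep (illustrative)."""
    return [
        "vllm", "bench", "serve",
        "--dataset-name", "random",               # assumed flag spelling
        "--random-input-len", str(input_len),     # assumed flag spelling
        "--random-output-len", str(output_len),   # assumed flag spelling
        "--random-range-ratio", "0.8",
        "--max-concurrency", str(concurrency),    # assumed flag spelling
        "--num-prompts", str(concurrency * 5),    # C*5 as described
        "--num-warmups", str(concurrency * 2),    # C*2 as described
    ]


# The four points reported in the tables: 1k1k and 8k1k at C4 and C64.
SWEEP = [
    bench_args(c, input_len, 1024)
    for input_len in (1024, 8192)
    for c in (4, 64)
]
```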

1k1k (input 1024 / output 1024)

Concurrency	tilelang tok/s	aiter tok/s	Output gain	tilelang TPOT (ms)	aiter TPOT (ms)	TPOT reduction
C4	85.45	91.22	+6.8%	42.35	39.76	-6.1%
C64	760.36	796.07	+4.7%	73.30	70.10	-4.4%

8k1k (input 8192 / output 1024)

Concurrency	tilelang tok/s	aiter tok/s	Output gain	tilelang TPOT (ms)	aiter TPOT (ms)	TPOT reduction
C4	74.83	79.55	+6.3%	47.69	44.72	-6.2%
C64	521.70	559.19	+7.2%	108.05	101.03	-6.5%

aiter mHC is consistently faster than the tilelang fused kernel across both workloads and both concurrencies (+4.7% to +7.2% output throughput, -4.4% to -6.5% TPOT). The tilelang baseline measured here is in line with vllm-project#43679's published tilelang numbers (e.g. 1k1k C4: 85.45 tok/s here vs 82.57 tok/s there).
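The percentage columns follow directly from the raw throughput and TPOT figures. A short sketch that recomputes them from the table values (variable names are illustrative):

```python
def pct_change(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent, one decimal."""
    return round((new - baseline) / baseline * 100, 1)


# (workload, concurrency, tilelang tok/s, aiter tok/s, tilelang TPOT, aiter TPOT)
ROWS = [
    ("1k1k", "C4", 85.45, 91.22, 42.35, 39.76),
    ("1k1k", "C64", 760.36, 796.07, 73.30, 70.10),
    ("8k1k", "C4", 74.83, 79.55, 47.69, 44.72),
    ("8k1k", "C64", 521.70, 559.19, 108.05, 101.03),
]

# (workload, concurrency, output gain %, TPOT change %)
summary = [
    (workload, conc, pct_change(tl_tps, ai_tps), pct_change(tl_tpot, ai_tpot))
    for workload, conc, tl_tps, ai_tps, tl_tpot, ai_tpot in ROWS
]
```

Running this reproduces the Output gain and TPOT reduction columns in both tables.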

Path verification

aiter default: workers log [aiter] import [module_mhc] .../module_mhc.so; no tilelang fused mHC kernels compiled; generation correct.
VLLM_ROCM_USE_TILELANG_MHC=1: no aiter module_mhc; tilelang fused path used.
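One way to automate this check is to scan worker logs for the aiter module import line quoted above. This is a hedged sketch: the helper is hypothetical, it matches only the `[aiter] import [module_mhc]` substring taken from the log line above, and the absence of that line suggests, but does not prove, that the tilelang path was taken.

```python
from typing import Iterable

AITER_MHC_MARKER = "[aiter] import [module_mhc]"


def aiter_mhc_loaded(log_lines: Iterable[str]) -> bool:
    """True if any log line shows the aiter mHC module being imported."""
    return any(AITER_MHC_MARKER in line for line in log_lines)


def describe_mhc_path(log_lines: Iterable[str]) -> str:
    """Summarize which path the logs point to (illustrative labels)."""
    if aiter_mhc_loaded(log_lines):
        return "aiter"
    return "no aiter module_mhc seen (tilelang expected)"
```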

Build / measurement note

Benchmarks and gsm8k were measured on main + #43679 with matching prebuilt ROCm .so. The branch is rebased onto latest upstream/main (vllm-project#43854); the mHC Python code paths are identical between the two. A ROCm precompiled wheel for vllm-project#43854 was not available on pypi.amd.com/vllm-rocm, so the rebased branch could not be benchmarked with matching .so directly.

AI assistance (Cursor) was used to prepare this change; a human submitter has reviewed it.

Made with Cursor

Re-enable the aiter mHC (Manifold-Constrained Hyper-Connections) pre/post ops as the default ROCm path for DeepSeek V4, replacing the tilelang fused post+pre kernel from PR vllm-project#43679, which now becomes opt-in via VLLM_ROCM_USE_TILELANG_MHC (default False).

The aiter mHC kernels are faster than the tilelang fused kernel and require aiter >= 0.1.14 (sqrsum race-condition fix in mhc_pre_gemm_sqrsum_kernel), so bump AITER_BRANCH to v0.1.14 in Dockerfile.rocm_base.

The unfused aiter path applies hc_post inline and returns no residual streams, so the deferred hc_post in DeepseekV4Model.forward and the MTP layer is now gated on `residual is not None` rather than on has_tilelang/is_cuda.

Co-authored-by: Cursor <cursoragent@cursor.com>

Fangzhou-Ai wants to merge 1 commit into base-upstream-main-43854 from rocm-dsv4-aiter-mhc-default
