[ROCm][DSV4] Default to aiter mHC pre/post, add tilelang opt-in knob #6
Open
Fangzhou-Ai wants to merge 1 commit into
Re-enable the aiter manifold-constrained hyper-connections (mHC) pre/post ops as the default ROCm path for DeepSeek V4, replacing the tilelang fused post+pre kernel from PR vllm-project#43679, which now becomes opt-in via VLLM_ROCM_USE_TILELANG_MHC (default False).

The aiter mHC kernels are faster than the tilelang fused kernel and require aiter >= 0.1.14 (sqrsum race-condition fix in mhc_pre_gemm_sqrsum_kernel), so bump AITER_BRANCH to v0.1.14 in Dockerfile.rocm_base.

The unfused aiter path applies hc_post inline and returns no residual streams, so the deferred hc_post in DeepseekV4Model.forward and the MTP layer is now gated on `residual is not None` rather than has_tilelang/is_cuda.

Co-authored-by: Cursor <cursoragent@cursor.com>
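The aiter >= 0.1.14 requirement stated in the commit message could be guarded with a small version check. The following is only a sketch under stated assumptions: the helper names (`parse_release`, `aiter_supports_mhc`, `installed_aiter_ok`) are hypothetical and not part of this PR, and the distribution name `aiter` passed to `importlib.metadata` is an assumption about how the package is installed.

```python
from importlib import metadata

# Minimum aiter release carrying the sqrsum race-condition fix, per the commit message.
MIN_AITER = (0, 1, 14)


def parse_release(version: str) -> tuple:
    """Parse the leading numeric components of a version string.

    Tolerates a leading 'v' (as in the AITER_BRANCH tag) and stops at the
    first component that does not start with a digit (e.g. 'dev3').
    """
    parts = []
    for piece in version.lstrip("v").split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def aiter_supports_mhc(version: str) -> bool:
    """True if the given aiter version is at least 0.1.14."""
    return parse_release(version) >= MIN_AITER


def installed_aiter_ok(dist_name: str = "aiter") -> bool:
    """Check the installed package; 'aiter' as distribution name is an assumption."""
    try:
        return aiter_supports_mhc(metadata.version(dist_name))
    except metadata.PackageNotFoundError:
        return False
```

Note that this simplified parser ignores pre-release suffixes, so a development build of 0.1.14 would be treated as 0.1.14 itself.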
Summary
Re-enable the aiter manifold-constrained hyper-connections (mHC) pre/post ops as the default ROCm path for DeepSeek V4, replacing the tilelang fused post+pre kernel from vllm-project#43679. The tilelang fused path is preserved as an opt-in via a new env knob:
- `VLLM_ROCM_USE_TILELANG_MHC` (default `False`): set to `1` to use the tilelang fused post+pre kernel instead of the default aiter mHC pre/post ops. Only affects ROCm under `VLLM_ROCM_USE_AITER`; on CUDA, tilelang is always used.
- The `VLLM_ROCM_USE_AITER_MHC` knob is removed in favor of this tilelang opt-in design (aiter is the default).

The aiter mHC kernels require aiter >= 0.1.14, which contains the `mhc_pre_gemm_sqrsum_kernel` race-condition fix (ROCm/aiter@b639cb6); without it, results are wrong at large token counts. `AITER_BRANCH` in `docker/Dockerfile.rocm_base` is bumped `v0.1.13 -> v0.1.14`.

The unfused aiter path applies `hc_post` inline and returns no residual streams, so the deferred `hc_post` in `DeepseekV4Model.forward` and the MTP layer is gated on `residual is not None` rather than `has_tilelang`/`is_cuda`.

Why not a duplicate
This builds directly on top of vllm-project#43679 (tilelang mHC), which explicitly left a hook to re-enable aiter mHC "in later PRs" once an aiter version with the sqrsum fix was available. This is that follow-up.
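
For reviewers, the knob semantics and the `residual is not None` gating described in the Summary can be sketched roughly as below. This is an illustration only, not the code in this PR: `env_flag`, `select_mhc_backend`, and `needs_deferred_hc_post` are hypothetical names, the accepted truthy strings are an assumption, and the behaviour on ROCm without `VLLM_ROCM_USE_AITER` (falling back to tilelang) is also an assumption.

```python
import os
from typing import Mapping, Optional


def env_flag(name: str, default: bool = False,
             environ: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean env knob; treating '1'/'true' as truthy is an assumption."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true")


def select_mhc_backend(is_rocm: bool, use_aiter: bool,
                       use_tilelang_mhc: bool) -> str:
    """Return which mHC path the described knobs would select."""
    if not is_rocm:
        # On CUDA, tilelang is always used and the knob has no effect.
        return "tilelang"
    if use_aiter and not use_tilelang_mhc:
        # ROCm default under VLLM_ROCM_USE_AITER: unfused aiter pre/post ops.
        return "aiter"
    # Opt-in via VLLM_ROCM_USE_TILELANG_MHC=1, or (assumed) aiter disabled.
    return "tilelang"


def needs_deferred_hc_post(residual: Optional[object]) -> bool:
    """The unfused aiter path returns no residual streams (None), so the
    deferred hc_post only runs when a residual is actually present."""
    return residual is not None
```

Gating on the returned value rather than on `has_tilelang`/`is_cuda` keeps the caller independent of which backend produced the result.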

Accuracy: gsm8k (DeepSeek-V4-Pro, no MTP, aiter mHC, num_fewshot=20)
Within the vllm-project#43679 gate of `0.95 ± 0.01` (tilelang baseline there: 0.9553 / 0.9560). A large-token stress test confirmed the v0.1.14 kernel fixes the sqrsum race (v0.1.13 regressed badly at high token counts).

Performance: aiter mHC vs tilelang fused
Server: `DeepSeek-V4-Pro`, `tp=8`, `--gpu-memory-utilization 0.8`, `--kv-cache-dtype fp8`, `--moe-backend triton_unfused`, `--compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE"}'`, `VLLM_ROCM_USE_AITER=1`, no MTP.
Bench: `vllm bench serve` `random`, `--random-range-ratio 0.8`, `--num-prompts C*5`, `--num-warmups C*2` (InferenceX-aligned sweep). Output tok/s, higher is better.
1k1k (input 1024 / output 1024)
8k1k (input 8192 / output 1024)
aiter mHC is consistently faster than the tilelang fused kernel across both workloads and concurrencies (+4.7% to +7.2% throughput, -4.4% to -6.5% TPOT). The tilelang baseline was confirmed to match vllm-project#43679's published tilelang numbers (e.g. 1k1k C4 85.45 vs 82.57 tok/s).
Path verification
- Default (`VLLM_ROCM_USE_TILELANG_MHC` unset): `[aiter] import [module_mhc] .../module_mhc.so`; no tilelang fused mHC kernels compiled; generation correct.
- `VLLM_ROCM_USE_TILELANG_MHC=1`: no aiter `module_mhc`; tilelang fused path used.

Build / measurement note
Benchmarks and gsm8k were measured on `main + #43679` with a matching prebuilt ROCm `.so`. The branch is rebased onto the latest `upstream/main` (vllm-project#43854); the mHC Python code paths are identical between the two. A ROCm precompiled wheel for vllm-project#43854 was not available on `pypi.amd.com/vllm-rocm`, so the rebased branch could not be benchmarked directly with a matching `.so`.

AI assistance (Cursor) was used to prepare this change; a human submitter has reviewed it.
Made with Cursor