[ROCm][DSV4] Default to aiter mHC pre/post, add tilelang opt-in knob#6

Open
Fangzhou-Ai wants to merge 1 commit into base-upstream-main-43854 from rocm-dsv4-aiter-mhc-default

Conversation

@Fangzhou-Ai
Owner

Summary

Re-enable the aiter manifold-constrained hyper-connections (mHC) pre/post ops as the default ROCm path for DeepSeek V4, replacing the tilelang fused post+pre kernel from vllm-project#43679. The tilelang fused path is preserved as an opt-in via a new env knob:

  • VLLM_ROCM_USE_TILELANG_MHC (default False): set to 1 to use the tilelang fused post+pre kernel instead of the default aiter mHC pre/post ops. It takes effect only on ROCm with VLLM_ROCM_USE_AITER enabled; on CUDA, tilelang is always used.
  • The previously proposed VLLM_ROCM_USE_AITER_MHC knob is removed in favor of this tilelang opt-in design (aiter is the default).
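The selection rule described by these bullets can be sketched as follows. This is a minimal illustration, not the code in this PR: `env_flag` and `select_mhc_backend` are hypothetical names, the accepted truthy strings are an assumption, and falling back to tilelang on ROCm when VLLM_ROCM_USE_AITER is off is also an assumption (the description only states that the knob has no effect there).

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean env var; treats "1" and "true" as enabled (assumed convention)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true")


def select_mhc_backend(is_rocm: bool, use_aiter: bool, use_tilelang_mhc: bool) -> str:
    """Hypothetical helper mirroring the rule stated in the PR description."""
    if not is_rocm:
        # On CUDA, tilelang is always used.
        return "tilelang"
    if use_aiter and not use_tilelang_mhc:
        # ROCm + aiter enabled + no opt-in: aiter mHC pre/post is the default.
        return "aiter"
    # Either the tilelang opt-in is set, or aiter is disabled (assumed fallback).
    return "tilelang"


backend = select_mhc_backend(
    is_rocm=True,
    use_aiter=env_flag("VLLM_ROCM_USE_AITER"),
    use_tilelang_mhc=env_flag("VLLM_ROCM_USE_TILELANG_MHC"),
)
```

Keeping the decision in a pure function of three booleans makes each combination easy to unit-test without touching the process environment.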

The aiter mHC kernels require aiter >= 0.1.14, which contains the mhc_pre_gemm_sqrsum_kernel race-condition fix (ROCm/aiter@b639cb6); without it results are wrong at large token counts. AITER_BRANCH in docker/Dockerfile.rocm_base is bumped v0.1.13 -> v0.1.14.
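A guard for the minimum version could look like the sketch below. It is only an illustration of the `>= 0.1.14` requirement: `parse_version` and `aiter_mhc_is_safe` are hypothetical helpers, and how the installed aiter version string is obtained is left out on purpose.

```python
import re

# Minimum aiter release that contains the mhc_pre_gemm_sqrsum_kernel fix.
MIN_AITER_FOR_MHC = (0, 1, 14)


def parse_version(version: str) -> tuple:
    """Parse 'v0.1.14' or '0.1.14' into (0, 1, 14); any suffix is ignored."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def aiter_mhc_is_safe(version: str) -> bool:
    """True if this aiter version is new enough for the aiter mHC path."""
    # Tuple comparison is numeric, so 0.1.9 < 0.1.14 as intended.
    return parse_version(version) >= MIN_AITER_FOR_MHC
```

Comparing integer tuples rather than raw strings avoids the common mistake where "0.1.9" sorts after "0.1.14".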

The unfused aiter path applies hc_post inline and returns no residual streams, so the deferred hc_post in DeepseekV4Model.forward and the MTP layer is gated on `residual is not None` rather than on has_tilelang/is_cuda.
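The gating change can be illustrated with a toy sketch. Everything here is a stand-in: `finish_forward` and `toy_hc_post` are hypothetical, plain lists replace tensors, and the elementwise add is only a placeholder for the real hc_post math.

```python
from typing import Callable, List, Optional

Vec = List[float]  # stand-in for a tensor


def toy_hc_post(hidden: Vec, residual: Vec) -> Vec:
    """Placeholder for the real hc_post; NOT the actual computation."""
    return [h + r for h, r in zip(hidden, residual)]


def finish_forward(
    hidden: Vec,
    residual: Optional[Vec],
    hc_post: Callable[[Vec, Vec], Vec] = toy_hc_post,
) -> Vec:
    """Apply the deferred hc_post only when a residual stream was returned."""
    if residual is not None:
        # Fused path: the last layer left hc_post pending, so apply it here.
        return hc_post(hidden, residual)
    # Unfused aiter path: hc_post already ran inline, nothing is deferred.
    return hidden
```

Gating on the data that is actually present, rather than on backend flags, means the check stays correct whichever backend produced the layer output.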

Why not a duplicate

This builds directly on top of vllm-project#43679 (tilelang mHC), which explicitly left a hook to re-enable aiter mHC "in later PRs" once an aiter version with the sqrsum fix was available. This is that follow-up.

Accuracy — gsm8k (DeepSeek-V4-Pro, no MTP, aiter mHC, num_fewshot=20)

| Filter | Metric | Value | Stderr |
|---|---|---|---|
| flexible-extract | exact_match | 0.9507 | ±0.0060 |
| strict-match | exact_match | 0.9515 | ±0.0059 |

Within the vllm-project#43679 gate of 0.95 ± 0.01 (tilelang baseline there: 0.9553 / 0.9560). A large-token stress test confirmed the v0.1.14 kernel fixes the sqrsum race (v0.1.13 regressed badly at high token counts).
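The gate check is simple arithmetic; a small sketch using the numbers reported above (the helper name `within_gate` is hypothetical):

```python
GATE_CENTER = 0.95
GATE_TOLERANCE = 0.01


def within_gate(score: float) -> bool:
    """True if the score lies inside 0.95 +/- 0.01."""
    return abs(score - GATE_CENTER) <= GATE_TOLERANCE


# gsm8k exact_match values reported in this PR (aiter mHC).
aiter_scores = {"flexible-extract": 0.9507, "strict-match": 0.9515}
# Tilelang baseline values quoted from vllm-project#43679.
tilelang_scores = {"flexible-extract": 0.9553, "strict-match": 0.9560}

all_pass = all(within_gate(s) for s in aiter_scores.values())
```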

Performance — aiter mHC vs tilelang fused

Server: DeepSeek-V4-Pro, tp=8, --gpu-memory-utilization 0.8, --kv-cache-dtype fp8, --moe-backend triton_unfused, --compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE"}', VLLM_ROCM_USE_AITER=1, no MTP.
Bench: vllm bench serve random, --random-range-ratio 0.8, --num-prompts C*5, --num-warmups C*2 (InferenceX-aligned sweep). Output tok/s, higher is better.
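For one sweep point, the arguments implied by the bench description can be assembled roughly as below. This is a sketch under assumptions: `bench_args` is a hypothetical helper, the `--dataset-name`, `--random-input-len`, `--random-output-len`, and `--max-concurrency` flags are assumed to be how the random dataset and concurrency were set, and the model and server address arguments are omitted because the description does not give them.

```python
def bench_args(concurrency: int, input_len: int, output_len: int) -> list[str]:
    """Build a `vllm bench serve` argv for one concurrency level."""
    return [
        "vllm", "bench", "serve",
        "--dataset-name", "random",               # assumed flag
        "--random-input-len", str(input_len),     # assumed flag
        "--random-output-len", str(output_len),   # assumed flag
        "--random-range-ratio", "0.8",
        "--max-concurrency", str(concurrency),    # assumed flag
        "--num-prompts", str(concurrency * 5),    # C*5 from the description
        "--num-warmups", str(concurrency * 2),    # C*2 from the description
    ]


# The two workloads and two concurrency levels reported in this PR.
sweep = [
    bench_args(c, in_len, 1024)
    for in_len in (1024, 8192)
    for c in (4, 64)
]
```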

1k1k (input 1024 / output 1024)

| Concurrency | tilelang tok/s | aiter tok/s | Output gain | tilelang TPOT (ms) | aiter TPOT (ms) | TPOT reduction |
|---|---|---|---|---|---|---|
| C4 | 85.45 | 91.22 | +6.8% | 42.35 | 39.76 | -6.1% |
| C64 | 760.36 | 796.07 | +4.7% | 73.30 | 70.10 | -4.4% |

8k1k (input 8192 / output 1024)

| Concurrency | tilelang tok/s | aiter tok/s | Output gain | tilelang TPOT (ms) | aiter TPOT (ms) | TPOT reduction |
|---|---|---|---|---|---|---|
| C4 | 74.83 | 79.55 | +6.3% | 47.69 | 44.72 | -6.2% |
| C64 | 521.70 | 559.19 | +7.2% | 108.05 | 101.03 | -6.5% |

aiter mHC is consistently faster than the tilelang fused kernel across both workloads and concurrencies (+4.7% to +7.2% throughput, -4.4% to -6.5% TPOT). The tilelang baseline was confirmed to match vllm-project#43679's published tilelang numbers (e.g. 1k1k C4 85.45 vs 82.57 tok/s).
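The gain and reduction columns follow directly from the raw figures; a short sketch that recomputes them from the table values (`pct_change` is a hypothetical helper):

```python
def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs baseline, in percent, to one decimal."""
    return round(100.0 * (candidate / baseline - 1.0), 1)


# (workload, concurrency): (tilelang tok/s, aiter tok/s, tilelang TPOT ms, aiter TPOT ms)
results = {
    ("1k1k", "C4"): (85.45, 91.22, 42.35, 39.76),
    ("1k1k", "C64"): (760.36, 796.07, 73.30, 70.10),
    ("8k1k", "C4"): (74.83, 79.55, 47.69, 44.72),
    ("8k1k", "C64"): (521.70, 559.19, 108.05, 101.03),
}

for key, (tl_tps, ai_tps, tl_tpot, ai_tpot) in results.items():
    throughput_gain = pct_change(tl_tps, ai_tps)
    tpot_change = pct_change(tl_tpot, ai_tpot)
    print(key, f"{throughput_gain:+.1f}% tok/s", f"{tpot_change:+.1f}% TPOT")
```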

Path verification

  • aiter default: workers log [aiter] import [module_mhc] .../module_mhc.so; no tilelang fused mHC kernels compiled; generation correct.
  • VLLM_ROCM_USE_TILELANG_MHC=1: no aiter module_mhc; tilelang fused path used.
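A log check along these lines can confirm which path a worker took. It is a sketch only: `aiter_mhc_loaded` is a hypothetical helper, it keys on the log fragment quoted above, and the absence of that fragment is not by itself proof that the tilelang kernels ran.

```python
# Fragment of the worker log line quoted in the path verification above.
AITER_MHC_MARKER = "[aiter] import [module_mhc]"


def aiter_mhc_loaded(log_lines) -> bool:
    """True if any log line shows the aiter mHC module being imported."""
    return any(AITER_MHC_MARKER in line for line in log_lines)


# Illustrative input; the path is elided exactly as in the description.
example_log = [
    "worker starting",
    "[aiter] import [module_mhc] .../module_mhc.so",
]
print(aiter_mhc_loaded(example_log))
```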

Build / measurement note

Benchmarks and gsm8k were measured on main + #43679 with matching prebuilt ROCm .so. The branch is rebased onto latest upstream/main (vllm-project#43854); the mHC Python code paths are identical between the two. A ROCm precompiled wheel for vllm-project#43854 was not available on pypi.amd.com/vllm-rocm, so the rebased branch could not be benchmarked with matching .so directly.

AI assistance (Cursor) was used to prepare this change; a human submitter has reviewed it.

Made with Cursor

Re-enable the aiter manifold-constrained hyper-connections (mHC) pre/post ops
as the default ROCm path for DeepSeek V4, replacing the tilelang fused
post+pre kernel from PR vllm-project#43679, which now becomes opt-in via
VLLM_ROCM_USE_TILELANG_MHC (default False). The aiter mHC kernels are faster
than the tilelang fused kernel and require aiter >= 0.1.14 (sqrsum
race-condition fix in mhc_pre_gemm_sqrsum_kernel), so bump AITER_BRANCH to
v0.1.14 in Dockerfile.rocm_base.

The unfused aiter path applies hc_post inline and returns no residual
streams, so the deferred hc_post in DeepseekV4Model.forward and the MTP layer
is now gated on `residual is not None` rather than has_tilelang/is_cuda.

Co-authored-by: Cursor <cursoragent@cursor.com>