[ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc by tjtanaa · Pull Request #43679 · vllm-project/vllm

tjtanaa · 2026-05-26T14:26:36Z

Purpose

A recent TileLang PR, tile-ai/tilelang#2195, added vendor-free compilation across the CUDA and ROCm wheels, so on ROCm we now pip install the tilelang wheel directly from PyPI.

This PR follows the approach used in sglang (https://github.com/sgl-project/sglang/blob/c47f0e7cdde48ddc718e3c6ee8bc87bebee2e8ff/python/sglang/srt/layers/mhc.py#L88) and adds an ENABLE_PDL control so that it can be set to False on unsupported platforms such as ROCm.

With this change, the TileLang kernel is compatible with both CUDA and ROCm.
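The ENABLE_PDL gating described above can be sketched as a single boolean derived from the platform and the compute capability. This is a minimal sketch, not the code from this PR: `resolve_enable_pdl` and its parameters are hypothetical names, and the `major >= 9` threshold is assumed from the `is_arch_support_pdl` snippet quoted in the review thread.

```python
from typing import Optional


def resolve_enable_pdl(is_rocm: bool, cuda_major: Optional[int]) -> bool:
    """Decide whether kernels should be built with PDL enabled.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    if is_rocm:
        # PDL is treated as unsupported on ROCm, so it is always off there.
        return False
    if cuda_major is None:
        # Capability could not be determined; stay on the safe side.
        return False
    # Assumed threshold, mirroring the `major >= 9` check quoted in review.
    return cuda_major >= 9


ENABLE_PDL_ROCM = resolve_enable_pdl(is_rocm=True, cuda_major=None)
ENABLE_PDL_SM90 = resolve_enable_pdl(is_rocm=False, cuda_major=9)
ENABLE_PDL_SM80 = resolve_enable_pdl(is_rocm=False, cuda_major=8)
```

Keeping the decision in one place means each kernel only reads the resulting flag and does not need its own platform check.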

It replaces the slow torch inference kernel whenever TileLang is supported on the platform. I would like to keep the unfused path, both as the torch fallback when TileLang is not available and as preparation for re-enabling the aiter MHC, so that in later PRs developers can easily validate whether separate aiter MHC post and pre kernels are faster than the TileLang fused post and pre MHC.
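The fallback behaviour described above can be sketched as a cached availability probe plus a two-way choice. This is only an illustration under stated assumptions: `has_module` and `pick_mhc_path` are hypothetical helpers, and the returned strings are labels for the two paths, not identifiers from vLLM.

```python
import functools
import importlib.util


@functools.cache
def has_module(name: str) -> bool:
    """Return True if top-level module `name` is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_mhc_path(tilelang_available: bool) -> str:
    """Hypothetical dispatcher: fused TileLang kernel, else the unfused torch path."""
    if tilelang_available:
        return "tilelang_fused_post_pre"
    return "torch_unfused_post_pre"


# The probe result is cached, so repeated layer construction pays for it once.
selected_path = pick_mhc_path(has_module("tilelang"))
```

Because the probe uses `find_spec`, checking availability does not itself import TileLang or touch the GPU.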

Test Plan

The lm_eval gsm8k score must be around 0.95 both without MTP and with MTP. With MTP, the mean acceptance length on gsm8k must be normal (> 2.6).
Performance gain over the torch mhc used before this PR.
Kernel test cases: tests/kernels/test_mhc_kernels.py

======================= 51 passed, 38 warnings in 21.15s =======================

Test Result

experimental details

No MTP server command

#!/bin/bash

rm -rf /root/.cache/vllm

VLLM_ROCM_USE_AITER=1 \
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --host localhost \
  --port 8001 \
  --dtype auto \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --gpu-memory-utilization 0.6 \
  --moe-backend triton_unfused \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --kv-cache-dtype fp8_e4m3 \
  --compilation-config '{"mode":3,"cudagraph_mode": "FULL_AND_PIECEWISE"}'

MTP command

#!/bin/bash

rm -rf /root/.cache/vllm

VLLM_ROCM_USE_AITER=1 \
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --host localhost \
  --port 8001 \
  --dtype auto \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --gpu-memory-utilization 0.6 \
  --moe-backend triton_unfused \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --kv-cache-dtype fp8_e4m3 \
  --compilation-config '{"mode":3,"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --speculative_config '{"method":"mtp","num_speculative_tokens":2}'

Server benchmark script

#!/bin/bash
set -euo pipefail

# After you have launched the server with the command in @launchdeepseekv4graph.sh

BASE_URL=${BASE_URL:-http://127.0.0.1:8001}
RESULT_DIR=${RESULT_DIR:-./tilelangmhcmtppr-bench-results}
RESULT_PREFIX=${RESULT_PREFIX:-ds-v4-pro-mtpbeforepr}
CONCURRENCIES=${CONCURRENCIES:-"1 2 4 8 16 32 64"}
INPUT_LEN=${INPUT_LEN:-1024}
OUTPUT_LEN=${OUTPUT_LEN:-1024}

for C in ${CONCURRENCIES}; do
  NUM_PROMPTS=$((C * 5))
  NUM_WARMUPS=$((C * 1))

  vllm bench serve \
    --backend openai-chat \
    --base-url "${BASE_URL}" \
    --endpoint /v1/chat/completions \
    --model deepseek-ai/DeepSeek-V4-Pro \
    --dataset-name random \
    --input-len "${INPUT_LEN}" \
    --output-len "${OUTPUT_LEN}" \
    --num-prompts "${NUM_PROMPTS}" \
    --request-rate inf \
    --max-concurrency "${C}" \
    --num-warmups "${NUM_WARMUPS}" \
    --ignore-eos \
    --percentile-metrics ttft,tpot,itl,e2el \
    --metric-percentiles 50,90,99 \
    --save-result \
    --result-dir "${RESULT_DIR}" \
    --result-filename "${RESULT_PREFIX}-C${C}.json" \
    --metadata concurrency="${C}" workload=random_${INPUT_LEN}_${OUTPUT_LEN} num_prompts="${NUM_PROMPTS}"
done

Lmeval command (NOTE: numshot 8 and numshot 20 must both give a score of around 0.95 (±0.01) for both strict-match and flexible-extract; otherwise there may be an accuracy issue)

#!/bin/bash

MODEL=deepseek-ai/DeepSeek-V4-Pro
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048,max_length=1048576,timeout=60000 --batch_size auto --tasks gsm8k --num_fewshot 20 \
  --output_path ./results_deepseekv4pro_numshot20 \
  --log_samples \
| tee lmeval_deepseekv4pro_numshot20.log

Lm eval score

DeepSeek V4 Pro no MTP

local-completions ({'model': 'deepseek-ai/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048, 'max_length': 1048576, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 20, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9553|±  |0.0057|
|     |       |strict-match    |    20|exact_match|↑  |0.9560|±  |0.0056|

DeepSeek V4 Pro with MTP

local-completions ({'model': 'deepseek-ai/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048, 'max_length': 1048576, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 20, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9530|±  |0.0058|
|     |       |strict-match    |    20|exact_match|↑  |0.9538|±  |0.0058|
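The acceptance criterion from the note above (around 0.95, within ±0.01) can be checked mechanically against the four reported scores. This is a small sketch; the dictionary layout and the helper name `within_tolerance` are illustrative choices, and the numbers are copied from the two tables above.

```python
TARGET = 0.95
TOLERANCE = 0.01

# 20-shot gsm8k exact_match values copied from the tables above.
scores = {
    ("no MTP", "flexible-extract"): 0.9553,
    ("no MTP", "strict-match"): 0.9560,
    ("MTP", "flexible-extract"): 0.9530,
    ("MTP", "strict-match"): 0.9538,
}


def within_tolerance(value: float, target: float = TARGET, tol: float = TOLERANCE) -> bool:
    """True if `value` lies in [target - tol, target + tol]."""
    return abs(value - target) <= tol


# Any entry listed here would indicate a possible accuracy issue.
failures = {key: value for key, value in scores.items() if not within_tolerance(value)}
```

With the reported values, `failures` is empty, so both configurations meet the stated criterion.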

Acceptance score: [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.71, Accepted throughput: 371.60 tokens/s, Drafted throughput: 435.60 tokens/s, Accepted: 3716 tokens, Drafted: 4356 tokens, Per-position acceptance rate: 0.962, 0.744, Avg Draft acceptance rate: 85.3%
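The numbers in this log line are consistent with each other, which can be verified with a few lines of arithmetic. The sketch below assumes that the mean acceptance length counts one token per verification step plus the accepted draft tokens, and that each draft proposes 2 tokens (from `num_speculative_tokens` in the MTP command); variable names are illustrative.

```python
accepted_tokens = 3716
drafted_tokens = 4356
num_speculative_tokens = 2  # from the --speculative_config above
per_position_acceptance = [0.962, 0.744]

# Share of drafted tokens that were accepted.
draft_acceptance_rate = accepted_tokens / drafted_tokens

# Number of draft rounds, assuming every round proposes the full 2 tokens.
num_drafts = drafted_tokens / num_speculative_tokens

# Two equivalent views of the mean acceptance length (under the assumption above).
mean_len_from_counts = 1 + accepted_tokens / num_drafts
mean_len_from_positions = 1 + sum(per_position_acceptance)

meets_threshold = mean_len_from_counts > 2.6
```

Both views round to the logged 2.71, and the acceptance rate rounds to the logged 85.3%.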

Performance gain

no MTP (Before PR - torch mhc/triton mhc and after PR - tilelang mhc)

Throughput and Latency

|Concurrency|torch/triton mhc output (tok/s)|tilelang mhc output (tok/s)|Output gain|torch/triton mhc TPOT (ms)|tilelang mhc TPOT (ms)|TPOT reduction|torch/triton mhc E2E (ms)|tilelang mhc E2E (ms)|E2E reduction|
|-----------|------:|------:|------:|------:|------:|------:|---------:|---------:|------:|
|C1 |19.21 |23.07 |20.10%|51.19 |42.54 |16.89%|53303.41 |44380.85 |16.74%|
|C2 |36.60 |44.06 |20.38%|53.02 |44.00 |17.01%|55940.87 |46471.69 |16.93%|
|C4 |68.97 |82.57 |19.72%|54.92 |45.49 |17.17%|59375.61 |49595.27 |16.47%|
|C8 |125.24|147.89|18.09%|59.98 |50.29 |16.16%|65368.30 |55358.54 |15.31%|
|C16|214.02|249.19|16.43%|69.72 |59.43 |14.76%|76465.79 |65676.14 |14.11%|
|C32|326.42|372.72|14.18%|90.88 |79.60 |12.41%|100198.06|87751.37 |12.42%|
|C64|452.21|507.12|12.14%|130.79|116.54|10.89%|144510.73|128875.31|10.82%|
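The gain and reduction columns follow from the before/after values in each row. The sketch below recomputes the C1 row under the assumption that "gain" is the relative increase and "reduction" the relative decrease; the helper names are illustrative, and because the table values are rounded, recomputed percentages can differ from the listed ones in the last digit.

```python
def gain_pct(before: float, after: float) -> float:
    """Relative increase in percent, for higher-is-better metrics."""
    return (after / before - 1.0) * 100.0


def reduction_pct(before: float, after: float) -> float:
    """Relative decrease in percent, for lower-is-better metrics."""
    return (1.0 - after / before) * 100.0


# C1 row of the no-MTP table above.
c1_output_gain = gain_pct(19.21, 23.07)               # listed as 20.10%
c1_tpot_reduction = reduction_pct(51.19, 42.54)       # listed as 16.89%
c1_e2e_reduction = reduction_pct(53303.41, 44380.85)  # listed as 16.74%
```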

With MTP (Before PR - torch mhc/triton mhc and after PR - tilelang mhc)

Throughput, Latency, and Spec Decode

|Concurrency|torch/triton mhc output (tok/s)|tilelang mhc output (tok/s)|Output gain|torch/triton mhc TPOT (ms)|tilelang mhc TPOT (ms)|TPOT reduction|torch/triton mhc E2E (ms)|tilelang mhc E2E (ms)|E2E reduction|torch/triton mhc acceptance (%)|tilelang mhc acceptance (%)|Acceptance delta|
|-----------|------:|------:|------:|------:|------:|------:|---------:|---------:|------:|-----:|-----:|--------:|
|C1 |37.93 |46.12 |21.58%|26.15 |21.48 |17.85%|26994.91 |22204.13 |17.75%|50.43|50.97|+0.53 pp|
|C2 |64.72 |86.09 |33.01%|28.42 |22.49 |20.86%|29496.56 |23380.52 |20.73%|45.63|49.55|+3.92 pp|
|C4 |117.25|157.67|34.47%|31.11 |24.58 |20.98%|32363.05 |25642.82 |20.77%|45.29|52.22|+6.93 pp|
|C8 |229.31|277.67|21.09%|31.76 |27.22 |14.30%|33004.85 |28315.66 |14.21%|47.57|46.89|-0.68 pp|
|C16|389.97|457.91|17.42%|36.14 |31.99 |11.49%|37611.65 |33271.31 |11.54%|51.05|48.14|-2.90 pp|
|C32|251.95|261.86|3.93% |122.89|117.56|4.33% |126816.42|121267.94|4.38% |48.94|48.47|-0.47 pp|
|C64|821.74|882.68|7.42% |69.54 |63.82 |8.23% |72430.96 |66495.85 |8.19% |48.64|48.24|-0.40 pp|

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

gemini-code-assist

Code Review

This pull request enables TileLang-based Multi-Head Latent Attention (MHC) kernels on ROCm (AMD GPUs) by adapting the existing CUDA implementations. It introduces platform-aware checks for Programmatic Dependent Launch (PDL) support, caches TileLang availability, and integrates TileLang operators into the DeepSeek V4 and MTP models. Critical feedback highlights that hardcoding a warp size of 32 in reduction logic will cause incorrect results on ROCm, where the wavefront size is 64; a platform-aware WARP_SIZE constant should be defined and used instead. Additionally, querying device capability via torch.cuda.current_device() at import time should be replaced with a stateless query to prevent eager CUDA initialization.

gemini-code-assist · 2026-05-26T14:28:16Z

+        lane = tid % 32
+        warp_id = tid // 32
+        num_warps = n_thr // 32
+        warp_acc = T.alloc_shared((num_warps, block_m, tile_n), T.float32)
+        warp_sqr = T.alloc_shared((num_warps, block_m), T.float32)


On AMD GPUs (ROCm), the hardware wavefront/warp size is 64, not 32. Hardcoding 32 for warp/lane index calculations will cause incorrect reduction results and double-counting when using T.warp_reduce_sum on ROCm. Please use WARP_SIZE instead of 32.

Suggested change

-        lane = tid % 32
-        warp_id = tid // 32
-        num_warps = n_thr // 32
+        lane = tid % WARP_SIZE
+        warp_id = tid // WARP_SIZE
+        num_warps = n_thr // WARP_SIZE
         warp_acc = T.alloc_shared((num_warps, block_m, tile_n), T.float32)
         warp_sqr = T.alloc_shared((num_warps, block_m), T.float32)

The AMD CDNA hardware wavefront size is 64, but TileLang's HIP warp_reduce_sum intentionally preserves 32-lane logical warp semantics. The installed TileLang source says this directly in:

tilelang/src/tl_templates/hip/reduce.h

It uses:

__shfl_xor(value, 16, 32)
...
__shfl_xor(value, 1, 32)

See https://github.com/tile-ai/tilelang/blob/23d91c584dd98810b1acf91ec83bb1587dadf3c2/src/tl_templates/hip/reduce.h#L161 and https://github.com/tile-ai/tilelang/blob/23d91c584dd98810b1acf91ec83bb1587dadf3c2/src/tl_templates/hip/reduce.h#L171. The comment there explains that on CDNA wave64, width=32 splits the wavefront into two independent 32-lane logical groups, exactly for kernels that assume CUDA-like 32-lane warp behavior.
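The effect of a width-32 XOR shuffle on a 64-lane wavefront can be illustrated with a plain Python model. This is a simulation of the butterfly pattern with masks 16, 8, 4, 2, 1, not TileLang or HIP code; `butterfly_reduce_sum` is a hypothetical name, and the model assumes each lane simply adds the value held by lane `i ^ mask`.

```python
def butterfly_reduce_sum(values, width):
    """Simulate an XOR-shuffle sum reduction over logical groups of `width` lanes."""
    lanes = list(values)
    mask = width // 2
    while mask >= 1:
        # Each lane adds the value of its partner lane (index ^ mask).
        # Since mask < width, the partner is always inside the same group.
        lanes = [lanes[i] + lanes[i ^ mask] for i in range(len(lanes))]
        mask //= 2
    return lanes


wavefront = list(range(64))  # one value per lane of a 64-lane wavefront
result = butterfly_reduce_sum(wavefront, width=32)

# Index math from the kernel: thread 40 is lane 8 of logical warp 1.
tid = 40
lane, warp_id = tid % 32, tid // 32
```

In this model the lower 32 lanes all end up with the sum of values 0..31 and the upper 32 lanes with the sum of values 32..63, so each 32-lane group reduces independently, which is what the `tid % 32` and `tid // 32` indexing in the kernel expects.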

gemini-code-assist · 2026-05-26T14:28:16Z

+    @classmethod
+    def is_arch_support_pdl(cls) -> bool:
+        try:
+            device = torch.cuda.current_device()
+            major, _ = torch.cuda.get_device_capability(device)
+        except Exception:
+            return False
+        return major >= 9


Calling torch.cuda.current_device() at import time eagerly initializes the CUDA context, which can break multi-processing, Ray, and distributed setups in vLLM. Since is_arch_support_pdl is called at the module level of vllm/_tilelang_ops.py during import, we should query the device capability statelessly using cls.get_device_capability(0) to avoid eager CUDA initialization.

Suggested change

     @classmethod
     def is_arch_support_pdl(cls) -> bool:
         try:
-            device = torch.cuda.current_device()
-            major, _ = torch.cuda.get_device_capability(device)
+            capability = cls.get_device_capability(0)
+            if capability is None:
+                return False
+            major = capability.major
         except Exception:
             return False
         return major >= 9

I don't think this is necessary, as it is only triggered once and all machines on the market have homogeneous GPUs within a single node.
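For reference, the two concerns in this thread (query only once, and do not query at import time) can both be met by deferring the check to first use and caching it. This is a sketch of one possible pattern, not what the PR implements: `make_pdl_check` and the injected `probe` callable are hypothetical, and the probe stands in for whatever capability query the platform provides.

```python
import functools
from typing import Callable, Optional, Tuple

Capability = Optional[Tuple[int, int]]


def make_pdl_check(probe: Callable[[], Capability]) -> Callable[[], bool]:
    """Build a lazily evaluated, cached PDL support check around `probe`."""

    @functools.cache
    def is_arch_support_pdl() -> bool:
        try:
            capability = probe()  # runs on first call only, not at import time
        except Exception:
            return False
        if capability is None:
            return False
        major, _minor = capability
        return major >= 9

    return is_arch_support_pdl


# Usage with a fake probe that records how often it is called.
calls = []


def fake_probe() -> Capability:
    calls.append(1)
    return (9, 0)


check = make_pdl_check(fake_probe)
calls_before_first_use = len(calls)
first, second = check(), check()
```

Nothing is queried when `check` is created; the probe runs once on the first call and the cached result is reused afterwards.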

tjtanaa · 2026-05-26T14:41:44Z

-        if current_platform.is_rocm():
-            return self._forward_rocm(
+        if not self.has_tilelang:
+            return self._forward_unfused_post_pre(


I would like to keep the unfused path, both as the torch fallback when TileLang is not available and as preparation for re-enabling the aiter MHC, so that in later PRs developers can easily validate whether separate aiter MHC post and pre kernels are faster than the TileLang fused post and pre MHC.

tjtanaa · 2026-05-26T14:42:25Z

        )

+    @classmethod
+    def is_arch_support_pdl(cls) -> bool:


This is from https://github.com/sgl-project/sglang/blob/c47f0e7cdde48ddc718e3c6ee8bc87bebee2e8ff/sgl-kernel/python/sgl_kernel/utils.py#L58


mergify · 2026-05-26T17:28:10Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tjtanaa · 2026-05-26T17:28:10Z

-                self.head_dim,
-            )
+            self._use_cutedsl_sparse_compressor = has_cutedsl()
+            if self._use_cutedsl_sparse_compressor:


This is a bugfix for the "cutlass not found" import error introduced in PR #43584.

@tjtanaa I think the bug should be fixed in #43710

I have removed my fix. Thanks for fixing the import issue.


mergify · 2026-05-27T04:31:09Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


benenzhu · 2026-05-27T09:20:36Z

WOW, tilelang in vllm now.


AndreasKaratzas · 2026-05-28T01:00:17Z

 # amd-quark: required for Quark quantization on ROCm 
 # To be consistent with test_quark.py
 amd-quark>=0.8.99
+tilelang>=0.1.10


Nice test and even nicer feature. I am wondering whether, for CI purposes and to avoid regressions from future versions, we should pin the version. Maybe we can add it to rocm.in too. We can also do that in a follow-up PR. Let me know if you agree.

Let's do it in a follow-up PR. I would like to land this PR and let @WoosukKwon continue with the restructuring of the mhc kernels.

Thanks. I have pinned it to an exact version.


…o tilelangmhc

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

Co-authored-by: Cursor <cursoragent@cursor.com>


Re-enable the aiter multi-head-consensus (mHC) pre/post ops as the preferred ROCm path for DeepSeek V4. PR vllm-project#43679 added a tilelang fused post+pre mHC kernel and left a hook to switch back to the (faster) aiter mHC kernels once an aiter release with the sqrsum race-condition fix was available; this is that follow-up.

Dispatch is now purely capability based (no new env knobs):

aiter mHC pre/post -> tilelang fused post+pre -> torch/triton reference

The aiter kernels are used whenever aiter is available on a supported ROCm device (``is_aiter_found_and_supported()``) and the hidden size is a multiple of 256; otherwise we fall back to the tilelang fused kernel, and finally to the torch/triton reference implementation. On CUDA the tilelang path is unchanged.

The aiter mHC kernels require aiter >= 0.1.14, which contains the sqrsum race-condition fix in ``mhc_pre_gemm_sqrsum_kernel`` (ROCm/aiter@b639cb6); without it results are wrong at large token counts. AITER_BRANCH in docker/Dockerfile.rocm_base is bumped v0.1.13 -> v0.1.14.

The unfused aiter path applies hc_post inline and returns no residual streams, so the deferred hc_post in DeepseekV4Model.forward and the MTP layer is gated on ``residual is not None`` rather than has_tilelang/is_cuda.

Signed-off-by: Fangzhou Ai <fangzhou.ai@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
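The capability-based dispatch order in this commit message can be sketched as a small pure function. This is an illustration only: `select_mhc_backend`, its arguments, and the returned labels are hypothetical, with `aiter_supported` standing in for the result of ``is_aiter_found_and_supported()``.

```python
def select_mhc_backend(aiter_supported: bool, hidden_size: int, has_tilelang: bool) -> str:
    """Pick an mHC backend in the order: aiter -> tilelang -> torch/triton."""
    # aiter is preferred, but only when the hidden size is a multiple of 256.
    if aiter_supported and hidden_size % 256 == 0:
        return "aiter_pre_post"
    # Next choice: the fused TileLang post+pre kernel.
    if has_tilelang:
        return "tilelang_fused_post_pre"
    # Last resort: the torch/triton reference implementation.
    return "torch_triton_reference"


# Example: aiter is present but the hidden size is not a multiple of 256.
fallback = select_mhc_backend(aiter_supported=True, hidden_size=1000, has_tilelang=True)
```

Keeping the choice free of environment variables, as the commit message describes, means the selected path depends only on what the installed stack supports.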

…ject#43679) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: Xiaoran Chen <xiaoran@fb.com>

tjtanaa added 8 commits May 25, 2026 21:11

use tilelang mhc on rocm

9676182

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

clean up mhc test

5be8268

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

add tilelang to requirements.txt

122ad44

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

add tilelang_hc_pre_norm_gemm

63d5159

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

fix prefix commit; support fuse and unfused codepath

c966a7d

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

use tilelang mhc in mtp as well

d2f3c6d

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

clean up code

9111eee

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

clean up code

fbb43d7

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

tjtanaa requested review from WoosukKwon, mgoin, tlrmchlsmth, yewentao256 and zyongye as code owners May 26, 2026 14:26

mergify Bot added ci/build nvidia rocm Related to AMD ROCm labels May 26, 2026

github-project-automation Bot added this to NVIDIA and AMD May 26, 2026

github-project-automation Bot moved this to Todo in AMD May 26, 2026

gemini-code-assist Bot reviewed May 26, 2026


tjtanaa changed the title ~~[ROCm][DSV4] Enable Tilelang MHC replacement of Torch Tilelang~~ [ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc May 26, 2026

tjtanaa commented May 26, 2026


tjtanaa added 2 commits May 26, 2026 09:49

clean up code

ed3db32

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

bugfix functionality

d0534f7

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

tjtanaa commented May 26, 2026


mergify Bot added the needs-rebase label May 26, 2026

resolve merge conflict

4610096

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

mergify Bot removed the needs-rebase label May 26, 2026

tjtanaa mentioned this pull request May 27, 2026

[Performance]: Deepseek-V4 Support and Optimization on ROCm Backend #41820

Open

22 tasks

Merge remote-tracking branch 'origin/main' into tilelangmhc

6656eb3

mergify Bot added the needs-rebase label May 27, 2026

sync main

dc01802

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

mergify Bot removed the needs-rebase label May 27, 2026

tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label May 27, 2026

Merge branch 'main' into tilelangmhc

77e91d6

zyongye approved these changes May 28, 2026


github-project-automation Bot moved this to Ready in NVIDIA May 28, 2026

add adaptation citation

97c342a

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

tjtanaa requested a review from AndreasKaratzas as a code owner May 28, 2026 00:42

tjtanaa added 3 commits May 27, 2026 19:47

remove unnecessary citation

5bc61a4

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

remove unnecessary citation

685bede

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

Merge branch 'main' into tilelangmhc

b1af57d

tjtanaa enabled auto-merge (squash) May 28, 2026 00:49

AndreasKaratzas approved these changes May 28, 2026


tjtanaa added 2 commits May 27, 2026 21:09

pin tilelang version rather than relax

8f40bc1

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

Merge branch 'tilelangmhc' of https://github.com/EmbeddedLLM/vllm int…

2c539fb

…o tilelangmhc

tjtanaa merged commit 0ba46d4 into vllm-project:main May 28, 2026
72 checks passed

github-project-automation Bot moved this from Todo to Done in AMD May 28, 2026

github-project-automation Bot moved this from Ready to Done in NVIDIA May 28, 2026

khluu pushed a commit that referenced this pull request May 28, 2026

[ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc (#43679)

a147dd0

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

Fangzhou-Ai pushed a commit to Fangzhou-Ai/vllm that referenced this pull request May 29, 2026

WIP: aiter mHC reenable on vllm-project#43679 cherry-pick (fallback)

90f1c6a

Co-authored-by: Cursor <cursoragent@cursor.com>

This was referenced May 29, 2026

[ROCm][DSV4] Default to aiter mHC pre/post, add tilelang opt-in knob Fangzhou-Ai/vllm#6

Open

[ROCm][DSV4] Use aiter mHC pre/post as the default ROCm path #43950

Open
