[ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc #43679

Merged
tjtanaa merged 20 commits into vllm-project:main from EmbeddedLLM:tilelangmhc
May 28, 2026

@tjtanaa
Member

@tjtanaa tjtanaa commented May 26, 2026

Purpose

A recent TileLang PR (tile-ai/tilelang#2195) added vendor-free compilation across the CUDA and ROCm wheels, so on ROCm we now pip install the tilelang wheel directly from PyPI.

This PR follows the approach used in sglang https://github.com/sgl-project/sglang/blob/c47f0e7cdde48ddc718e3c6ee8bc87bebee2e8ff/python/sglang/srt/layers/mhc.py#L88 and adds an ENABLE_PDL control so that it can be set to False on unsupported platforms such as ROCm.
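A minimal sketch of how such a platform-aware ENABLE_PDL switch could be resolved. `PlatformInfo` and `resolve_enable_pdl` are hypothetical names used only for illustration; the `major >= 9` threshold is taken from the `is_arch_support_pdl` snippet quoted in the review thread of this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlatformInfo:
    """Hypothetical stand-in for a platform query object."""
    is_rocm: bool
    cuda_major: Optional[int]  # compute capability major version, None if unknown


def resolve_enable_pdl(platform: PlatformInfo) -> bool:
    """Return True only where Programmatic Dependent Launch may be used."""
    if platform.is_rocm:
        # PDL is treated as unsupported on ROCm, so the flag is forced off.
        return False
    # On CUDA, mirror the `major >= 9` check quoted in the review thread.
    return platform.cuda_major is not None and platform.cuda_major >= 9


ENABLE_PDL = resolve_enable_pdl(PlatformInfo(is_rocm=True, cuda_major=None))
print(ENABLE_PDL)  # False
```

Resolving the flag once and passing it to the kernel builder keeps the kernel source identical on both platforms; only the launch attribute differs.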

The TileLang kernel is now compatible with both CUDA and ROCm.

This replaces the slow torch inference kernel when tilelang is supported on the platform. I would like to keep the unfused path, as well as the torch fallback path for when tilelang is not available. This prepares a path to re-enable the aiter MHC and lets developers easily validate, in later PRs, whether separate aiter MHC post and pre kernels are faster than the tilelang fused post and pre MHC.

Test Plan

  1. The lm_eval gsm8k score must be around 0.95 both without and with MTP. The MTP mean acceptance length on gsm8k must be normal (> 2.6).

  2. Perf gain over using torch mhc (before the PR)

  3. kernel test case tests/kernels/test_mhc_kernels.py

======================= 51 passed, 38 warnings in 21.15s =======================

Test Result

experimental details

No MTP server command

#!/bin/bash

rm -rf /root/.cache/vllm

VLLM_ROCM_USE_AITER=1 \
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --host localhost \
  --port 8001 \
  --dtype auto \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --gpu-memory-utilization 0.6 \
  --moe-backend triton_unfused \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --kv-cache-dtype fp8_e4m3 \
  --compilation-config '{"mode":3,"cudagraph_mode": "FULL_AND_PIECEWISE"}'

MTP command

#!/bin/bash

rm -rf /root/.cache/vllm

VLLM_ROCM_USE_AITER=1 \
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --host localhost \
  --port 8001 \
  --dtype auto \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --gpu-memory-utilization 0.6 \
  --moe-backend triton_unfused \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --kv-cache-dtype fp8_e4m3 \
  --compilation-config '{"mode":3,"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --speculative_config '{"method":"mtp","num_speculative_tokens":2}'

Server benchmark script

#!/bin/bash
set -euo pipefail

# After you have launched the server with the command in @launchdeepseekv4graph.sh

BASE_URL=${BASE_URL:-http://127.0.0.1:8001}
RESULT_DIR=${RESULT_DIR:-./tilelangmhcmtppr-bench-results}
RESULT_PREFIX=${RESULT_PREFIX:-ds-v4-pro-mtpbeforepr}
CONCURRENCIES=${CONCURRENCIES:-"1 2 4 8 16 32 64"}
INPUT_LEN=${INPUT_LEN:-1024}
OUTPUT_LEN=${OUTPUT_LEN:-1024}

for C in ${CONCURRENCIES}; do
  NUM_PROMPTS=$((C * 5))
  NUM_WARMUPS=$((C * 1))

  vllm bench serve \
    --backend openai-chat \
    --base-url "${BASE_URL}" \
    --endpoint /v1/chat/completions \
    --model deepseek-ai/DeepSeek-V4-Pro \
    --dataset-name random \
    --input-len "${INPUT_LEN}" \
    --output-len "${OUTPUT_LEN}" \
    --num-prompts "${NUM_PROMPTS}" \
    --request-rate inf \
    --max-concurrency "${C}" \
    --num-warmups "${NUM_WARMUPS}" \
    --ignore-eos \
    --percentile-metrics ttft,tpot,itl,e2el \
    --metric-percentiles 50,90,99 \
    --save-result \
    --result-dir "${RESULT_DIR}" \
    --result-filename "${RESULT_PREFIX}-C${C}.json" \
    --metadata concurrency="${C}" workload=random_${INPUT_LEN}_${OUTPUT_LEN} num_prompts="${NUM_PROMPTS}"
done

lm_eval command (NOTE: num_fewshot 8 and num_fewshot 20 must both give a score of around 0.95 (±0.01) for both strict-match and flexible-extract; otherwise there may be an accuracy issue)

#!/bin/bash

MODEL=deepseek-ai/DeepSeek-V4-Pro
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048,max_length=1048576,timeout=60000 --batch_size auto --tasks gsm8k --num_fewshot 20 \
  --output_path ./results_deepseekv4pro_numshot20 \
  --log_samples \
| tee lmeval_deepseekv4pro_numshot20.log
Lm eval score

DeepSeek V4 Pro no MTP

local-completions ({'model': 'deepseek-ai/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048, 'max_length': 1048576, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 20, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9553|±  |0.0057|
|     |       |strict-match    |    20|exact_match|↑  |0.9560|±  |0.0056|

DeepSeek V4 Pro with MTP

local-completions ({'model': 'deepseek-ai/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048, 'max_length': 1048576, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 20, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9530|±  |0.0058|
|     |       |strict-match    |    20|exact_match|↑  |0.9538|±  |0.0058|
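The acceptance criterion from the NOTE above (0.95 ± 0.01 for both filters) can be checked mechanically. The sketch below hard-codes the four scores reported in the two tables; the helper name `within_gate` is an illustrative assumption, not part of lm_eval.

```python
TARGET = 0.95
TOLERANCE = 0.01

# Scores copied from the two gsm8k tables above (num_fewshot 20).
reported = {
    ("no MTP", "flexible-extract"): 0.9553,
    ("no MTP", "strict-match"): 0.9560,
    ("MTP", "flexible-extract"): 0.9530,
    ("MTP", "strict-match"): 0.9538,
}


def within_gate(score: float, target: float = TARGET, tolerance: float = TOLERANCE) -> bool:
    """True when the score lies inside target +/- tolerance."""
    return abs(score - target) <= tolerance


failures = [key for key, score in reported.items() if not within_gate(score)]
print(failures)  # [] means every configuration is inside the gate
```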

Acceptance score: [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.71, Accepted throughput: 371.60 tokens/s, Drafted throughput: 435.60 tokens/s, Accepted: 3716 tokens, Drafted: 4356 tokens, Per-position acceptance rate: 0.962, 0.744, Avg Draft acceptance rate: 85.3%
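The numbers in this log line are related by simple arithmetic. The sketch below recomputes them from the logged token counts; the definition of mean acceptance length (one target-model token per step plus the accepted draft tokens) is an assumption that happens to reproduce the logged value, not a quote of the vLLM source.

```python
# Values copied from the SpecDecoding log line above.
accepted_tokens = 3716
drafted_tokens = 4356
num_speculative_tokens = 2  # from --speculative_config in the MTP command
per_position_acceptance = [0.962, 0.744]

# Share of drafted tokens that were accepted.
draft_acceptance_rate = accepted_tokens / drafted_tokens

# Each draft step proposes num_speculative_tokens tokens.
num_drafts = drafted_tokens / num_speculative_tokens

# Assumed definition: 1 token per step plus the accepted draft tokens per step.
mean_acceptance_length = 1 + accepted_tokens / num_drafts

# The same quantity from the per-position acceptance rates.
mean_from_positions = 1 + sum(per_position_acceptance)

print(f"{draft_acceptance_rate:.1%}")    # 85.3%
print(f"{mean_acceptance_length:.2f}")   # 2.71
print(f"{mean_from_positions:.2f}")      # 2.71
```

A mean acceptance length of 2.71 is above the 2.6 threshold stated in the test plan.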

Performance gain

no MTP (Before PR - torch mhc/triton mhc and after PR - tilelang mhc)

Throughput and Latency

|Concurrency|torch/triton mhc output tok/s|tilelang mhc output tok/s|Output gain|torch/triton mhc TPOT ms|tilelang mhc TPOT ms|TPOT reduction|torch/triton mhc E2E ms|tilelang mhc E2E ms|E2E reduction|
|-----------|----:|----:|----:|----:|----:|----:|----:|----:|----:|
|C1|19.21|23.07|20.10%|51.19|42.54|16.89%|53303.41|44380.85|16.74%|
|C2|36.60|44.06|20.38%|53.02|44.00|17.01%|55940.87|46471.69|16.93%|
|C4|68.97|82.57|19.72%|54.92|45.49|17.17%|59375.61|49595.27|16.47%|
|C8|125.24|147.89|18.09%|59.98|50.29|16.16%|65368.30|55358.54|15.31%|
|C16|214.02|249.19|16.43%|69.72|59.43|14.76%|76465.79|65676.14|14.11%|
|C32|326.42|372.72|14.18%|90.88|79.60|12.41%|100198.06|87751.37|12.42%|
|C64|452.21|507.12|12.14%|130.79|116.54|10.89%|144510.73|128875.31|10.82%|
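The gain and reduction columns appear to be plain relative changes against the torch/triton baseline. The formulas below are an assumption about how the columns were derived; they reproduce the reported C1 values to within rounding of the displayed inputs.

```python
def output_gain_pct(before: float, after: float) -> float:
    """Relative throughput increase over the baseline, in percent."""
    return (after - before) / before * 100.0


def reduction_pct(before: float, after: float) -> float:
    """Relative latency decrease against the baseline, in percent."""
    return (before - after) / before * 100.0


# C1 row of the no-MTP table: (baseline, tilelang)
c1_gain = output_gain_pct(19.21, 23.07)      # table reports 20.10%
c1_tpot = reduction_pct(51.19, 42.54)        # table reports 16.89%
c1_e2e = reduction_pct(53303.41, 44380.85)   # table reports 16.74%

print(f"{c1_gain:.2f} {c1_tpot:.2f} {c1_e2e:.2f}")
```

Small differences in the last digit are expected because the table was presumably computed from unrounded measurements.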

With MTP (Before PR - torch mhc/triton mhc and after PR - tilelang mhc)

Throughput, Latency, and Spec Decode

|Concurrency|torch/triton mhc output tok/s|tilelang mhc output tok/s|Output gain|torch/triton mhc TPOT ms|tilelang mhc TPOT ms|TPOT reduction|torch/triton mhc E2E ms|tilelang mhc E2E ms|E2E reduction|torch/triton mhc acceptance %|tilelang mhc acceptance %|Acceptance delta|
|-----------|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
|C1|37.93|46.12|21.58%|26.15|21.48|17.85%|26994.91|22204.13|17.75%|50.43|50.97|+0.53 pp|
|C2|64.72|86.09|33.01%|28.42|22.49|20.86%|29496.56|23380.52|20.73%|45.63|49.55|+3.92 pp|
|C4|117.25|157.67|34.47%|31.11|24.58|20.98%|32363.05|25642.82|20.77%|45.29|52.22|+6.93 pp|
|C8|229.31|277.67|21.09%|31.76|27.22|14.30%|33004.85|28315.66|14.21%|47.57|46.89|-0.68 pp|
|C16|389.97|457.91|17.42%|36.14|31.99|11.49%|37611.65|33271.31|11.54%|51.05|48.14|-2.90 pp|
|C32|251.95|261.86|3.93%|122.89|117.56|4.33%|126816.42|121267.94|4.38%|48.94|48.47|-0.47 pp|
|C64|821.74|882.68|7.42%|69.54|63.82|8.23%|72430.96|66495.85|8.19%|48.64|48.24|-0.40 pp|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

tjtanaa added 8 commits May 25, 2026 21:11
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enables TileLang-based MHC kernels on ROCm (AMD GPUs) by adapting the existing CUDA implementations. It introduces platform-aware checks for Programmatic Dependent Launch (PDL) support, caches TileLang availability, and integrates TileLang operators into the DeepSeek V4 and MTP models. Critical feedback highlights that hardcoding a warp size of 32 in reduction logic will cause incorrect results on ROCm, where the wavefront size is 64; a platform-aware WARP_SIZE constant should be defined and used instead. Additionally, querying device capability via torch.cuda.current_device() at import time should be replaced with a stateless query to prevent eager CUDA initialization.

Comment thread vllm/_tilelang_ops.py
Comment on lines +688 to +692
lane = tid % 32
warp_id = tid // 32
num_warps = n_thr // 32
warp_acc = T.alloc_shared((num_warps, block_m, tile_n), T.float32)
warp_sqr = T.alloc_shared((num_warps, block_m), T.float32)
Contributor

critical

On AMD GPUs (ROCm), the hardware wavefront/warp size is 64, not 32. Hardcoding 32 for warp/lane index calculations will cause incorrect reduction results and double-counting when using T.warp_reduce_sum on ROCm. Please use WARP_SIZE instead of 32.

Suggested change
- lane = tid % 32
- warp_id = tid // 32
- num_warps = n_thr // 32
- warp_acc = T.alloc_shared((num_warps, block_m, tile_n), T.float32)
- warp_sqr = T.alloc_shared((num_warps, block_m), T.float32)
+ lane = tid % WARP_SIZE
+ warp_id = tid // WARP_SIZE
+ num_warps = n_thr // WARP_SIZE
+ warp_acc = T.alloc_shared((num_warps, block_m, tile_n), T.float32)
+ warp_sqr = T.alloc_shared((num_warps, block_m), T.float32)

Member Author

The AMD CDNA hardware wavefront size is 64, but TileLang's HIP warp_reduce_sum intentionally preserves 32-lane logical warp semantics. The installed TileLang source states this directly in:

tilelang/src/tl_templates/hip/reduce.h

That file uses:

__shfl_xor(value, 16, 32)
...
__shfl_xor(value, 1, 32)

The comments at https://github.com/tile-ai/tilelang/blob/23d91c584dd98810b1acf91ec83bb1587dadf3c2/src/tl_templates/hip/reduce.h#L161 and https://github.com/tile-ai/tilelang/blob/23d91c584dd98810b1acf91ec83bb1587dadf3c2/src/tl_templates/hip/reduce.h#L171 explain that on CDNA wave64, width=32 splits the wavefront into two independent 32-lane logical groups, precisely for kernels that assume CUDA-like 32-lane warp behavior.
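To make the width=32 argument concrete, here is a small pure-Python emulation of an XOR-butterfly reduction over a 64-lane wavefront. It is an illustrative sketch, not TileLang's HIP code: `shfl_xor` and `warp_reduce_sum` are hypothetical helpers that only model the documented semantics of a shuffle restricted to `width`-sized lane segments.

```python
from typing import List


def shfl_xor(values: List[int], lane_mask: int, width: int) -> List[int]:
    """Each lane reads the value of (lane XOR lane_mask) inside its own width-sized segment."""
    out = []
    for lane in range(len(values)):
        segment_base = lane - (lane % width)
        src = (lane % width) ^ lane_mask
        out.append(values[segment_base + src])
    return out


def warp_reduce_sum(values: List[int], width: int = 32) -> List[int]:
    """Butterfly sum with offsets width/2, ..., 1 (16, 8, 4, 2, 1 for width=32)."""
    offset = width // 2
    while offset >= 1:
        partner = shfl_xor(values, offset, width)
        values = [a + b for a, b in zip(values, partner)]
        offset //= 2
    return values


wavefront = list(range(64))          # one value per lane of a wave64
reduced = warp_reduce_sum(wavefront, width=32)

print(reduced[0], reduced[31])       # 496 496   (sum of lanes 0..31)
print(reduced[32], reduced[63])      # 1520 1520 (sum of lanes 32..63)
```

Because the two 32-lane halves never exchange data, indexing with `tid % 32` and `tid // 32` stays consistent with the reduction: each half behaves like its own logical warp.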

Comment thread vllm/platforms/cuda.py
Comment on lines +595 to +602
    @classmethod
    def is_arch_support_pdl(cls) -> bool:
        try:
            device = torch.cuda.current_device()
            major, _ = torch.cuda.get_device_capability(device)
        except Exception:
            return False
        return major >= 9
Contributor

high

Calling torch.cuda.current_device() at import time eagerly initializes the CUDA context, which can break multi-processing, Ray, and distributed setups in vLLM. Since is_arch_support_pdl is called at the module level of vllm/_tilelang_ops.py during import, we should query the device capability statelessly using cls.get_device_capability(0) to avoid eager CUDA initialization.

Suggested change
-     @classmethod
-     def is_arch_support_pdl(cls) -> bool:
-         try:
-             device = torch.cuda.current_device()
-             major, _ = torch.cuda.get_device_capability(device)
-         except Exception:
-             return False
-         return major >= 9
+     @classmethod
+     def is_arch_support_pdl(cls) -> bool:
+         try:
+             capability = cls.get_device_capability(0)
+             if capability is None:
+                 return False
+             major = capability.major
+         except Exception:
+             return False
+         return major >= 9
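Independent of which query is used, one general way to keep a capability check off the import path is to defer it to first use and cache the result. The sketch below illustrates that pattern only; it is not what this PR implements, and `make_pdl_check` and `fake_query` are hypothetical names with the real device query replaced by an injected callable.

```python
from functools import lru_cache
from typing import Callable, List, Optional, Tuple

Capability = Optional[Tuple[int, int]]  # (major, minor), or None if unknown


def make_pdl_check(query_capability: Callable[[], Capability]) -> Callable[[], bool]:
    """Build a lazily evaluated, cached PDL support check."""

    @lru_cache(maxsize=None)
    def is_arch_support_pdl() -> bool:
        try:
            capability = query_capability()
        except Exception:
            return False
        if capability is None:
            return False
        major, _minor = capability
        return major >= 9

    return is_arch_support_pdl


calls: List[int] = []


def fake_query() -> Capability:
    """Stand-in for a real device query; records how often it runs."""
    calls.append(1)
    return (9, 0)


check = make_pdl_check(fake_query)
queries_before_first_use = len(calls)   # the query has not run yet
first = check()
second = check()                        # served from the cache
print(queries_before_first_use, first, second, len(calls))
```

With this shape, importing the module never touches the device; the cost is paid once, at the first call.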

Member Author

I don't think this is necessary: it is only triggered once, and all machines on the market have homogeneous GPUs within a single node.

Comment thread vllm/_tilelang_ops.py Outdated
@tjtanaa tjtanaa changed the title from "[ROCm][DSV4] Enable Tilelang MHC replacement of Torch Tilelang" to "[ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc" May 26, 2026
if current_platform.is_rocm():
    return self._forward_rocm(
if not self.has_tilelang:
    return self._forward_unfused_post_pre(
Member Author

I would like to keep the unfused path, as well as the torch fallback path for when tilelang is not available. This prepares a path to re-enable the aiter MHC and lets developers easily validate, in later PRs, whether separate aiter MHC post and pre kernels are faster than the tilelang fused post and pre MHC.

Comment thread vllm/platforms/cuda.py
)

@classmethod
def is_arch_support_pdl(cls) -> bool:
Member Author

tjtanaa added 2 commits May 26, 2026 09:49
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@mergify
Contributor

mergify Bot commented May 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Comment thread vllm/models/deepseek_v4/compressor.py Outdated
self.head_dim,
)
self._use_cutedsl_sparse_compressor = has_cutedsl()
if self._use_cutedsl_sparse_compressor:
Member Author

This is a bugfix for the "cutlass not found" import error introduced in PR #43584.

Collaborator

@tjtanaa I think the bug should be fixed in #43710

Member Author

I have removed my fix. Thanks for fixing the import issue.

@mergify mergify Bot added the needs-rebase label May 26, 2026
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@mergify mergify Bot removed the needs-rebase label May 26, 2026
@mergify
Contributor

mergify Bot commented May 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 27, 2026
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@mergify mergify Bot removed the needs-rebase label May 27, 2026
@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label May 27, 2026
@benenzhu
Contributor

WOW, tilelang in vllm now.

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA May 28, 2026
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@tjtanaa tjtanaa requested a review from AndreasKaratzas as a code owner May 28, 2026 00:42
tjtanaa added 3 commits May 27, 2026 19:47
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@tjtanaa tjtanaa enabled auto-merge (squash) May 28, 2026 00:49
Comment thread requirements/rocm.txt Outdated
# amd-quark: required for Quark quantization on ROCm
# To be consistent with test_quark.py
amd-quark>=0.8.99
tilelang>=0.1.10
Member

Nice test and even nicer feature. I am wondering whether, for CI purposes, we should pin the version to avoid regressions from future versions. Maybe we can add it to rocm.in too. We can also do that in a follow-up PR; let me know if you agree.

Member Author

Let's do it in a follow-up PR. I would like to land this PR and let @WoosukKwon continue with the restructuring of the mhc kernels.

Member Author

Thanks. I have pinned tilelang to an exact version.

tjtanaa added 2 commits May 27, 2026 21:09
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@tjtanaa tjtanaa merged commit 0ba46d4 into vllm-project:main May 28, 2026
72 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD May 28, 2026
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA May 28, 2026
khluu pushed a commit that referenced this pull request May 28, 2026
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Fangzhou-Ai pushed a commit to Fangzhou-Ai/vllm that referenced this pull request May 29, 2026
Co-authored-by: Cursor <cursoragent@cursor.com>
Fangzhou-Ai pushed a commit to Fangzhou-Ai/vllm that referenced this pull request May 29, 2026
Re-enable the aiter multi-head-consensus (mHC) pre/post ops as the preferred
ROCm path for DeepSeek V4. PR vllm-project#43679 added a tilelang fused post+pre mHC
kernel and left a hook to switch back to the (faster) aiter mHC kernels once
an aiter release with the sqrsum race-condition fix was available; this is
that follow-up.

Dispatch is now purely capability based (no new env knobs):

  aiter mHC pre/post  ->  tilelang fused post+pre  ->  torch/triton reference

The aiter kernels are used whenever aiter is available on a supported ROCm
device (``is_aiter_found_and_supported()``) and the hidden size is a multiple
of 256; otherwise we fall back to the tilelang fused kernel, and finally to
the torch/triton reference implementation. On CUDA the tilelang path is
unchanged.

The aiter mHC kernels require aiter >= 0.1.14, which contains the sqrsum
race-condition fix in ``mhc_pre_gemm_sqrsum_kernel`` (ROCm/aiter@b639cb6);
without it results are wrong at large token counts. AITER_BRANCH in
docker/Dockerfile.rocm_base is bumped v0.1.13 -> v0.1.14.

The unfused aiter path applies hc_post inline and returns no residual
streams, so the deferred hc_post in DeepseekV4Model.forward and the MTP layer
is gated on ``residual is not None`` rather than has_tilelang/is_cuda.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: vLLM Contributor <contributor@vllm.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
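The capability-based dispatch order described in the commit message above can be sketched as a small selection function. This is an illustration under stated assumptions: `select_mhc_backend` and its boolean inputs are hypothetical; only the ordering and the multiple-of-256 hidden-size condition come from the commit message.

```python
def select_mhc_backend(aiter_supported: bool, has_tilelang: bool, hidden_size: int) -> str:
    """Pick the mHC implementation in the order aiter -> tilelang -> reference."""
    # aiter pre/post kernels: require aiter support and a hidden size that is a multiple of 256.
    if aiter_supported and hidden_size % 256 == 0:
        return "aiter"
    # Otherwise fall back to the tilelang fused post+pre kernel when it is available.
    if has_tilelang:
        return "tilelang"
    # Last resort: the torch/triton reference implementation.
    return "reference"


print(select_mhc_backend(True, True, 7168))    # aiter (7168 is a multiple of 256)
print(select_mhc_backend(True, True, 7000))    # tilelang
print(select_mhc_backend(False, False, 7168))  # reference
```

Keeping the decision in one pure function makes it easy to unit test the fallback order without any GPU present.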
Fangzhou-Ai added a commit to Fangzhou-Ai/vllm that referenced this pull request May 29, 2026
Signed-off-by: Fangzhou Ai <fangzhou.ai@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
sphinx07 pushed a commit to sphinx07/vllm-mgenai-gptoss that referenced this pull request May 29, 2026
…ject#43679)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Xiaoran Chen <xiaoran@fb.com>
Labels

ci/build nvidia ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

Status: Done
Status: Done

5 participants