[BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels by benenzhu · Pull Request #2195 · tile-ai/tilelang

benenzhu · 2026-05-13T07:10:08Z

This closes #1922

Reviewer note: Files under 3rdparty/hip-headers/include/hip/ are vendored verbatim from Triton's third_party/amd/backend/include/hip/ and do not need review.

Summary

Currently USE_ROCM=ON cannot configure on a build host that has no ROCm runtime (e.g. an NV-only machine producing a cross-target wheel) because TVM's find_rocm() macro hard-requires a real libamdhip64:

CMake Error at 3rdparty/tvm/cmake/modules/ROCM.cmake:32 (message):
  Cannot find ROCM, USE_ROCM=ON

With TILELANG_USE_HIP_STUBS=ON the runtime library isn't actually needed at build time — only the public HIP headers. This PR vendors them under 3rdparty/hip-headers/include/hip/ (sourced from Triton) so no system ROCm install is required for the build.

Changes

Vendored HIP headers under 3rdparty/hip-headers/include/hip/, copied verbatim from Triton. HSA headers are intentionally not vendored: src/backend/rocm/stubs/hip.cc already gates <hsa/hsa.h> behind __has_include with a forward-decl fallback for the only two HSA symbols used (hsa_init / hsa_shut_down).
CMakeLists.txt
- New TILELANG_HIP_INCLUDE_DIR cache var to override the HIP header location.
- Refactor backend env-var auto-selection so USE_CUDA=ON USE_ROCM=ON set together both take effect (the previous if/elseif chain only honored the first match).
src/backend/rocm/CMakeLists.txt
- Fallback when find_rocm() fails: if TILELANG_USE_HIP_STUBS=ON and HIP headers can be located, manually set ROCM_FOUND=TRUE, ROCM_INCLUDE_DIRS, and route ROCM_HIPHCC_LIBRARY=hip_stub so TVM's ROCM.cmake is satisfied.
- HIP header resolution order: TILELANG_HIP_INCLUDE_DIR cmake var → env var → /opt/rocm/include → vendored 3rdparty/hip-headers/include (default fallback). Most build environments need zero manual configuration.
src/backend/rocm/codegen/rt_mod_hip.cc
- Drop unused #include <hip/hiprtc.h> (no hiprtc symbols referenced) so we don't need to vendor hiprtc.h (which Triton also omits).
src/backend/rocm/stubs/hiprtc.cc
- Stub function signatures changed from const char *const * to const char ** to match the real HIPRTC API.
pyproject.toml
- Linux wheels now build with both USE_CUDA=ON USE_ROCM=ON via [tool.cibuildwheel.linux], producing a single fat wheel that runs on either CUDA or ROCm hosts. Windows / macOS targets unchanged.
- Vendored headers added to sdist include (build-time only; not mapped into the runtime wheel).
.github/workflows/ci.yml
- Self-hosted NV CUDA job now sets USE_ROCM=ON, so a regression in the ROCm-on-NV build path is caught by regular PR CI rather than only by the release-time dist workflow.
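
The header resolution order listed above can be modelled outside CMake. The following Python sketch is an illustration only, not the PR's CMake code. The function names `resolve_hip_include_dir` and `default_candidates` are hypothetical, and so is the marker header used to decide whether a directory is a usable HIP include root, since the description does not say which file the CMake logic probes or what the environment variable is called.

```python
from pathlib import Path
from typing import Iterable, List, Optional

# Hypothetical marker: a directory counts as a HIP include root if it holds
# this header. The PR does not state which file the real CMake logic checks.
MARKER = Path("hip") / "hip_runtime_api.h"


def resolve_hip_include_dir(candidates: Iterable[Optional[str]]) -> Optional[str]:
    """Return the first candidate directory that contains the marker header."""
    for candidate in candidates:
        if candidate and (Path(candidate) / MARKER).is_file():
            return candidate
    return None


def default_candidates(
    cmake_var: Optional[str], env_var: Optional[str], repo_root: str
) -> List[Optional[str]]:
    """Build the candidate list in the order given in the PR description."""
    return [
        cmake_var,            # TILELANG_HIP_INCLUDE_DIR cache variable
        env_var,              # environment override (name not given in the PR)
        "/opt/rocm/include",  # system ROCm install
        str(Path(repo_root) / "3rdparty" / "hip-headers" / "include"),  # vendored
    ]
```

Under these assumptions, a host with no overrides and no `/opt/rocm` falls through to the vendored directory, which is the zero-configuration case the description refers to.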

Test plan

Built fat wheel on a 4090 (NV-only) host with USE_CUDA=ON USE_ROCM=ON pip wheel . -v — vendored headers picked up automatically, no manual TILELANG_HIP_INCLUDE_DIR needed.
Same build with explicit TILELANG_HIP_INCLUDE_DIR=<system-rocm-include> as a manual override.
Installed the produced wheel on the 4090 (NV) host and ran a CUDA example end-to-end — no regression.
Installed the same wheel on an MI355X (AMD ROCm) host and ran a HIP example end-to-end.

Summary by CodeRabbit

New Features
- Added support for fat wheels with both CUDA and ROCm backends compiled together.
- Improved ROCm cross-compilation support for systems without a local ROCm runtime.
Build & Infrastructure
- Enhanced CI/build configuration for ROCm backend testing.
- Added vendor-supplied HIP headers for improved backend compatibility.

TVM's find_rocm() requires a real libamdhip64 on the build host, so USE_ROCM=ON currently FATAL_ERRORs on machines that only have HIP headers installed (e.g. an NV-only CI machine producing a cross-target wheel). With TILELANG_USE_HIP_STUBS=ON the runtime library isn't actually needed at build time, only the public HIP/HSA headers. This adds a fallback path: when find_rocm() fails but stubs are enabled and HIP headers are reachable (auto-detected at /opt/rocm/include or pointed to via the new TILELANG_HIP_INCLUDE_DIR cache var), pretend ROCM_FOUND=TRUE and route linking through tilelang's hip_stub target so TVM's ROCM.cmake is satisfied. The vendored stub header is unaffected (it remains private to src/backend/rocm/stubs/ as before); this path relies on header-only ROCm dev packages such as hip-runtime-amd-dev and hsa-rocr-dev, which install without a GPU or driver.

github-actions · 2026-05-13T07:10:19Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2026-05-13T07:10:26Z

📝 Walkthrough

Vendors comprehensive HIP/AMD headers and updates CMake, CI, packaging, and build configuration to enable header-only ROCm compilation on hosts without ROCm runtime and to build CUDA+ROCm fat wheels. Adds device intrinsics, atomic operations, warp synchronization, vector/texture/surface APIs, runtime kernel-launch helpers, and driver types. Updates backend selection logic, includes vendored headers in sdist, configures fat-wheel builds, and updates CI to enforce ROCm compilation paths.

Changes

Complete HIP headers and fat-wheel ROCm build foundation

| Layer / File(s) | Summary |
|---|---|
| HIP platform detection and common definitions: `3rdparty/hip-headers/include/hip/hip_common.h`, `.../host_defines.h`, `.../hip_version.h`, `.../hip_vector_types.h`, `.../library_types.h`, `.../hip_runtime.h` | Defines platform macros (`__HIP_PLATFORM_AMD__`/`__HIP_PLATFORM_NVIDIA__`), compiler-specific attributes, versioning, host typedefs, and routes vector/runtime headers to platform-specific implementations. |
| AMD HIP device-side intrinsics and builtins: `.../amd_detail/amd_channel_descriptor.h`, `.../amd_device_functions.h`, `.../device_library_decls.h`, `.../hip_assert.h` | Adds `hipCreateChannelDesc` and template specializations, device-side bit/conversion/clock/fence/sync intrinsics, local memcpy/memset, device-library OCKL declarations, and device assertion helpers for HIP/Clang. |
| Vector types and math function declarations: `.../amd_detail/amd_hip_vector_types.h`, `.../amd_math_functions.h`, `.../math_fwd.h`, `.../hip_fp16_math_fwd.h`, `.../hip_ldg.h` | Implements HIP vector types (native ext_vector_type or fallback unions), make_* constructors, rank mapping, amd_mixed_dot helpers, and forwards OCML/FP16/`__ldg` device math functions. |
| Atomic operations and warp/block synchronization: `.../amd_hip_atomic.h`, `.../amd_hip_unsafe_atomics.h`, `.../amd_warp_functions.h`, `.../amd_warp_sync_functions.h` | Implements CAS-based atomic wrappers (agent/system scopes), unsafe/safe FP atomics with target-specific fast paths, LDS/register permute/swizzle helpers, warp vote/shuffle/lane intrinsics, and masked warp-sync reduction operations over explicit 64-bit masks. |
| Texture and surface read/write operations: `.../texture_fetch_functions.h`, `.../texture_indirect_functions.h`, `.../amd_surface_functions.h`, `.../ockl_image.h` | Adds texture fetch/sample/gather/LOD/grad helpers with type traits and element mapping, texture-object indirect wrappers, surface 1D/2D/3D/cubemap/layered read/write templates, and OCKL image load/store/sample/gather prototypes. |
| HIP runtime kernel launch and public driver types: `.../amd_hip_runtime.h`, `.../amd_hip_runtime_pt_api.h`, `.../driver_types.h`, `.../texture_types.h`, `.../surface_types.h`, `.../amd_hip_gl_interop.h`, `.../hip_deprecated.h`, `.../hip_runtime_prof.h`, `.../channel_descriptor.h` | Provides HIP-clang kernel launch helpers, device-builtin accessors, per-thread default-stream API remaps, comprehensive driver/resource/array/memcpy/texture/surface type descriptors, OpenGL interop, legacy device properties, and runtime profiling interfaces. |
| CMake backend selection and ROCm header discovery: `CMakeLists.txt`, `src/backend/rocm/CMakeLists.txt` | Refactors backend env-driven selection to independently check USE_CUDA/USE_ROCM/USE_METAL; adds `TILELANG_HIP_INCLUDE_DIR` cache variable; implements header-fallback search (override/env/system/vendored) when ROCm runtime absent, configuring hip_stub and setting ROCM_FOUND for stub builds. |
| Fat-wheel packaging and CI ROCm enforcement: `pyproject.toml`, `.github/workflows/ci.yml` | Includes vendored HIP headers in sdist (build-time only); configures cibuildwheel Linux environment with `USE_CUDA=ON` and `USE_ROCM=ON` for fat wheels; appends `USE_ROCM=ON` to CI CUDA job environment to exercise ROCm compilation path. |
| Runtime header and version metadata alignment: `src/backend/rocm/codegen/rt_mod_hip.cc`, `version_provider.py` | Switches runtime include from `hip/hiprtc.h` to `hip/hip_runtime.h`; adjusts ROCm version labeling to apply only when `USE_ROCM=ON` and `USE_CUDA` is disabled. |
| Python typing modernization: `tilelang/autotuner/grouped_compile.py` | Updates `CompileUnitResult` type alias to use `Optional[...]` instead of union notation for optional fields. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

tile-ai/tilelang#2114: Similar ROCm header discovery and hip_stub linking adjustments in CMake.
tile-ai/tilelang#1867: Introduces ROCm lazy-load hip/hiprtc stubs tied to hip_stub usage.
tile-ai/tilelang#1858: HIP codegen lowering to warp-sync builtins added by these vendored headers.

Suggested labels

enhancement

Suggested reviewers

LeiWang1999

Bunny builds with nimble craft,
ROCm headers bundled in my pack.
CUDA, ROCm—wheels roll wide,
hip_stub guides my rabbit stride.
Textures shimmer, warps align,
Sync and atomics hop in time! 🥕✨

Vendor Triton's HIP headers under 3rdparty/hip-headers/include/hip so USE_ROCM=ON can build on hosts without any ROCm install (cibuildwheel manylinux containers, NV-only dev machines).

HSA headers are not vendored because hip.cc already gates <hsa/hsa.h> behind __has_include with a forward-decl fallback for the two HSA symbols we use.

CMake fallback chain when find_rocm() fails: TILELANG_HIP_INCLUDE_DIR -> env var -> /opt/rocm/include -> vendored.

Drop unused #include <hip/hiprtc.h> from rt_mod_hip.cc (no hiprtc symbols referenced) so we don't need to vendor hiprtc.h, which Triton also omits.

Linux wheels now build with both USE_CUDA=ON and USE_ROCM=ON via [tool.cibuildwheel.linux] env, producing a single wheel that runs on either CUDA or ROCm hosts. Windows / macOS targets are unchanged.

Add USE_ROCM=ON to the self-hosted NV CI job so a regression that breaks the ROCm-on-NV build path is caught by regular PR CI, not just by the release-time dist workflow.

hip_runtime_api.h transitively #includes <hip/amd_detail/host_defines.h> and other amd_detail headers. The first vendoring pass only copied the top-level hip/ files, breaking the build with:

  fatal error: hip/amd_detail/host_defines.h: No such file or directory

Add the full hip/amd_detail/ subtree (25 files) from the same Triton source. nvidia_detail/ is intentionally not vendored: every nvidia_detail include sits behind `__HIP_PLATFORM_NVIDIA__`, which we never define. HSA is also still not needed (no hsa references in the amd_detail set).

TVM's src/runtime/rocm/rocm_device_api.cc unconditionally #includes <hsa/hsa.h>, so the previous "no HSA headers" assumption only held for tilelang's own stubs (which gate the include behind __has_include) and broke the build for the TVM submodule:

  /root/tilelang/3rdparty/tvm/src/runtime/rocm/rocm_device_api.cc:26:10: fatal error: hsa/hsa.h: No such file or directory

Vendor hsa.h verbatim from Triton's third_party/amd/backend/include/hsa/. The header is self-contained (only includes <stddef.h>, <stdint.h>, <stdbool.h>), so the other 6 hsa_*.h files are not needed.

Link-time remains unchanged: ROCM_HSA_LIBRARY stays NOTFOUND and the only two HSA symbols actually referenced (hsa_init / hsa_shut_down) are exported by hip_stub and lazy-loaded at run time.

Line 21 evaluates `JITKernel | None` at module import (PEP 604 union), which only works on Python 3.10+. The file's `from __future__ import annotations` defers function-signature evaluation but not module-level type-alias assignments, so cp39 wheels imported by the dist.yml smoke test fail with:

  TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Use typing.Optional instead. Pre-existing latent bug introduced by tile-ai#2159; surfaced now because this PR touches pyproject.toml/CMakeLists which triggers the dist workflow.
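
The failure mode described above can be reproduced in isolation. This is a minimal sketch, assuming a stand-in `JITKernel` class and a made-up tuple shape for `CompileUnitResult`; the real alias in `tilelang/autotuner/grouped_compile.py` may look different. The helper `pep604_union_works` is hypothetical and exists only to show that the `|` expression is evaluated eagerly, exactly as a module-level alias assignment would evaluate it.

```python
import sys
from typing import Optional, Tuple


class JITKernel:
    """Stand-in for the real class; only its identity matters here."""


def pep604_union_works() -> bool:
    """Evaluate `JITKernel | None` eagerly, as a module-level alias would."""
    try:
        _ = JITKernel | None  # raises TypeError on Python < 3.10
    except TypeError:
        return False
    return True


# Portable spelling: typing.Optional can be evaluated at import time on older
# interpreters too. The tuple shape below is hypothetical, not the real alias.
CompileUnitResult = Tuple[Optional[JITKernel], Optional[str]]
```

On Python 3.9 `pep604_union_works()` returns `False` while the `Optional[...]` alias still imports cleanly, which is why switching spellings is enough to unblock cp39 wheels.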

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

CMakeLists.txt (1)

397-438: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve env-provided SDK paths instead of coercing to ON.

USE_CUDA/USE_ROCM support path values, but this block converts any truthy env value to ON, discarding explicit paths from ENV{USE_CUDA} / ENV{USE_ROCM}.

Proposed fix

   if(DEFINED ENV{USE_CUDA})
     set(_tilelang_backend_env_selected ON)
-    if($ENV{USE_CUDA})
-      set(USE_CUDA ON)
-    else()
-      set(USE_CUDA OFF)
-    endif()
+    set(USE_CUDA "$ENV{USE_CUDA}")
   endif()

   if(DEFINED ENV{USE_ROCM})
     set(_tilelang_backend_env_selected ON)
-    if($ENV{USE_ROCM})
-      set(USE_ROCM ON)
-    else()
-      set(USE_ROCM OFF)
-    endif()
+    set(USE_ROCM "$ENV{USE_ROCM}")
   endif()

   if(DEFINED ENV{USE_METAL})
     set(_tilelang_backend_env_selected ON)
-    if($ENV{USE_METAL})
-      set(USE_METAL ON)
-    else()
-      set(USE_METAL OFF)
-    endif()
+    set(USE_METAL "$ENV{USE_METAL}")
   endif()
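
The intent of this suggestion, combined with the independent per-backend checks the PR introduces, can be modelled outside CMake. The Python sketch below is an illustration under stated assumptions, not a translation of the actual `CMakeLists.txt`: `select_backends` is a hypothetical name, and the set of false-like spellings goes beyond the `0` and `OFF` named in the review by borrowing a few of CMake's false constants.

```python
from typing import Dict, Mapping

# Explicit "off" spellings. The review comment names "0" and "OFF"; the rest
# are an assumption borrowed from CMake's false constants.
FALSE_LIKE = {"", "0", "OFF", "NO", "FALSE", "N"}
BACKENDS = ("USE_CUDA", "USE_ROCM", "USE_METAL")


def select_backends(env: Mapping[str, str]) -> Dict[str, str]:
    """Resolve every backend variable on its own instead of first-match-wins."""
    selected: Dict[str, str] = {}
    for name in BACKENDS:
        if name not in env:
            continue  # unset: leave the default selection logic in charge
        raw = env[name]
        if raw.strip().upper() in FALSE_LIKE:
            selected[name] = "OFF"
        else:
            selected[name] = raw  # keep "ON" or an SDK path exactly as given
    return selected
```

Because each variable is handled in its own loop iteration, `USE_CUDA=ON USE_ROCM=ON` yields both backends, and a value such as `/usr/local/cuda` survives unchanged.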

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CMakeLists.txt` around lines 397 - 438, The env-handling blocks for USE_CUDA,
USE_ROCM (and similarly USE_METAL) currently coerce any truthy ENV{...} to "ON"
and throw away path values; change them to preserve the raw ENV value when
defined by setting USE_CUDA (and USE_ROCM/USE_METAL) to the actual
$ENV{USE_CUDA} string instead of always "ON", and only map explicit false-like
values ("0" or "OFF") to OFF; update the branches that set
USE_CUDA/USE_ROCM/USE_METAL and the _tilelang_backend_env_selected logic so
environment-provided SDK paths are used as-is while still supporting explicit
OFF/0 toggles, leaving the default selection logic
(TILELANG_CUDA_TOOLKIT_AVAILABLE / APPLE) unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@3rdparty/hip-headers/include/hip/amd_detail/amd_device_functions.h`:
- Around line 82-90: The bitmask construction in __fns64 and __fns32 uses (1 <<
base) which is UB for large bases; change the mask shift to use unsigned-width
literals and clamp the shift amount to the valid width before shifting (e.g.,
use ( (__hip_uint64_t)1 << clamped_base ) for __fns64 and ( (__hip_uint32_t)1u
<< clamped_base ) for __fns32), where clamped_base = min(max(base, 0),
WAVEFRONT_SIZE-1) (or mask base with the width bits) so no signed shifts or
overflow occur; update all occurrences in the __fns64 and __fns32
implementations (including temp_mask &= ... branches) to use these unsigned,
clamped shifts.

In `@3rdparty/hip-headers/include/hip/amd_detail/amd_surface_functions.h`:
- Around line 376-384: The surfCubemapLayeredwrite overload incorrectly declares
its first parameter as T* data while all other surf*write overloads use a value
parameter T data; change the signature of surfCubemapLayeredwrite from T* data
to T data so that the call to __hipMapTo<float4::Native_vec_>(data) and the
pattern with other functions (e.g., other surf*write overloads) match; update
the parameter type in the function declaration and any references inside
surfCubemapLayeredwrite (keeping the rest of the body, including
__hipGetPixelAddr, int2 coords, and __ockl_image_store_lod_CM usage, unchanged).

In `@3rdparty/hip-headers/include/hip/amd_detail/amd_warp_sync_functions.h`:
- Around line 499-521: The reduction lambdas in __reduce_or_sync,
__reduce_and_sync and __reduce_xor_sync (and their extra-types variants for int,
long long, unsigned long long) use logical operators instead of bitwise ops and
a malformed XOR expression; replace lhs || rhs with lhs | rhs in
__reduce_or_sync, lhs && rhs with lhs & rhs in __reduce_and_sync, and the XOR
lambda with lhs ^ rhs in __reduce_xor_sync, and apply equivalent fixes to the
corresponding int/long long/unsigned long long reduction lambdas so the manual
reduction tree matches the __ockl_wfred_* bitwise intrinsics.

In `@3rdparty/hip-headers/include/hip/amd_detail/hip_runtime_prof.h`:
- Line 39: The enum constant name kHipHipVdiMemcpyHostToDevice is incorrectly
duplicated with an extra "Hip" and should be renamed to
kHipVdiMemcpyHostToDevice to match the surrounding pattern (e.g.,
kHipVdiMemcpyDeviceToHost, kHipVdiMemcpyDeviceToDevice); update the identifier
in the enum declaration in hip_runtime_prof.h (symbol:
kHipHipVdiMemcpyHostToDevice → kHipVdiMemcpyHostToDevice) and search the
codebase for any occurrences to replace them (adjust any comments or
documentation strings nearby as needed).

In `@3rdparty/hip-headers/include/hip/amd_detail/texture_indirect_functions.h`:
- Around line 388-400: In tex3DGrad, the incoming gradient parameter dPdx is
unused and both gradx and grady are incorrectly initialized from dPdy; fix by
initializing gradx from dPdx and grady from dPdy so the call to
__ockl_image_sample_grad_3D receives the correct x and y gradients. Update the
assignments to gradx and grady in the tex3DGrad function (after
TEXTURE_OBJECT_PARAMETERS_INIT) and ensure the rest of the call to
__ockl_image_sample_grad_3D(i, s, get_native_vector(coords),
get_native_vector(gradx), get_native_vector(grady)) remains unchanged.
- Around line 200-205: The pointer overload of tex2Dgather mistakenly calls
texCubemapLayered<T> instead of the intended tex2Dgather lookup; in the function
template tex2Dgather(T* ptr, hipTextureObject_t textureObject, float x, float y,
int comp = 0) replace the call to texCubemapLayered<T>(...) with the correct
device sampling function tex2Dgather<T>(textureObject, x, y, comp) so the
pointer overload returns the proper tex2Dgather result.
- Around line 128-133: The pointer overload of the tex2DLayered template
mistakenly calls tex1DLayered<T>(...) instead of invoking the 2D variant; update
the body of the static __device__ __hip_img_chk__ void tex2DLayered(...)
overload to assign *ptr using tex2DLayered<T>(textureObject, x, y, layer) (i.e.,
replace the tex1DLayered call with the tex2DLayered call) so the correct 2D
layered fetch is performed.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c0bf7f8-2cbf-4b77-9eae-67857a1c7d50

📥 Commits

Reviewing files that changed from the base of the PR and between bcb2da3 and ef4d508.

📒 Files selected for processing (45)

.github/workflows/ci.yml
3rdparty/hip-headers/include/hip/amd_detail/amd_channel_descriptor.h
3rdparty/hip-headers/include/hip/amd_detail/amd_device_functions.h
3rdparty/hip-headers/include/hip/amd_detail/amd_hip_atomic.h
3rdparty/hip-headers/include/hip/amd_detail/amd_hip_common.h
3rdparty/hip-headers/include/hip/amd_detail/amd_hip_gl_interop.h
3rdparty/hip-headers/include/hip/amd_detail/amd_hip_runtime.h
3rdparty/hip-headers/include/hip/amd_detail/amd_hip_runtime_pt_api.h
3rdparty/hip-headers/include/hip/amd_detail/amd_hip_unsafe_atomics.h
3rdparty/hip-headers/include/hip/amd_detail/amd_hip_vector_types.h
3rdparty/hip-headers/include/hip/amd_detail/amd_math_functions.h
3rdparty/hip-headers/include/hip/amd_detail/amd_surface_functions.h
3rdparty/hip-headers/include/hip/amd_detail/amd_warp_functions.h
3rdparty/hip-headers/include/hip/amd_detail/amd_warp_sync_functions.h
3rdparty/hip-headers/include/hip/amd_detail/device_library_decls.h
3rdparty/hip-headers/include/hip/amd_detail/hip_api_trace.hpp
3rdparty/hip-headers/include/hip/amd_detail/hip_assert.h
3rdparty/hip-headers/include/hip/amd_detail/hip_fp16_math_fwd.h
3rdparty/hip-headers/include/hip/amd_detail/hip_ldg.h
3rdparty/hip-headers/include/hip/amd_detail/hip_prof_str.h
3rdparty/hip-headers/include/hip/amd_detail/hip_runtime_prof.h
3rdparty/hip-headers/include/hip/amd_detail/host_defines.h
3rdparty/hip-headers/include/hip/amd_detail/math_fwd.h
3rdparty/hip-headers/include/hip/amd_detail/ockl_image.h
3rdparty/hip-headers/include/hip/amd_detail/texture_fetch_functions.h
3rdparty/hip-headers/include/hip/amd_detail/texture_indirect_functions.h
3rdparty/hip-headers/include/hip/channel_descriptor.h
3rdparty/hip-headers/include/hip/driver_types.h
3rdparty/hip-headers/include/hip/hip_common.h
3rdparty/hip-headers/include/hip/hip_deprecated.h
3rdparty/hip-headers/include/hip/hip_runtime.h
3rdparty/hip-headers/include/hip/hip_runtime_api.h
3rdparty/hip-headers/include/hip/hip_texture_types.h
3rdparty/hip-headers/include/hip/hip_vector_types.h
3rdparty/hip-headers/include/hip/hip_version.h
3rdparty/hip-headers/include/hip/library_types.h
3rdparty/hip-headers/include/hip/linker_types.h
3rdparty/hip-headers/include/hip/surface_types.h
3rdparty/hip-headers/include/hip/texture_types.h
3rdparty/hip-headers/include/hsa/hsa.h
CMakeLists.txt
pyproject.toml
src/backend/rocm/CMakeLists.txt
src/backend/rocm/codegen/rt_mod_hip.cc
src/backend/rocm/stubs/hiprtc.cc

💤 Files with no reviewable changes (1)

src/backend/rocm/codegen/rt_mod_hip.cc

Real ROCm /opt/rocm/include/hip/hiprtc.h declares hiprtcCreateProgram and hiprtcCompileProgram with `const char *const *` parameters. An earlier change in this PR flipped our stub to `const char **` (claiming it matched the real API), but on a host with a real ROCm install the stub's extern "C" definitions then conflict with the real header:

  error: conflicting declaration of C function 'hiprtcResult hiprtcCreateProgram(_hiprtcProgram**, const char*, const char*, int, const char**, const char**)'
  note: previous declaration 'hiprtcResult hiprtcCreateProgram(..., const char* const*, const char* const*)'

Switch both the fallback declarations and the function definitions back to `const char *const *` so the stub compiles whether <hip/hiprtc.h> comes from the real ROCm install or from the __has_include fallback.

zhangnju · 2026-05-13T10:04:59Z

The latest ROCm release is 7.2, so we may want to enable 7.2 here.

Yes, Triton uses this file too. It only affects the build: generated kernels are compiled by hipcc against the system's HIP headers, so this should be fine.

I have tested that the wheel built this way runs in MI355 ROCm 7.2 Docker containers.

The wheel name was flipping from +cuXXX to +rocm on linux because the backend selection chain in dynamic_metadata checked USE_ROCM before USE_CUDA, so the new fat wheel (USE_CUDA=ON USE_ROCM=ON) labelled itself as a ROCm wheel. Only emit the rocm tag when USE_ROCM=ON without USE_CUDA=ON. The fat wheel now keeps the historical +cuXXX.gitYYY naming, which preserves drop-in upgrade behaviour for existing CUDA-pinned installs. ROCm-only builds still get +rocm.
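
The labelling rule described above amounts to a small decision function. The sketch below is hypothetical: `backend_local_tag` is not a function from `version_provider.py`, and the way the `cuXXX` suffix is derived from a CUDA version string is an assumption made for illustration.

```python
from typing import Optional


def backend_local_tag(use_cuda: bool, use_rocm: bool, cuda_version: str = "") -> Optional[str]:
    """Pick the backend part of the wheel's local version label."""
    if use_rocm and not use_cuda:
        return "rocm"  # ROCm-only build
    if use_cuda:
        # Fat (CUDA+ROCm) and CUDA-only builds share the historical cu tag.
        # Assumed format: "cu" followed by the version digits without dots.
        return "cu" + cuda_version.replace(".", "")
    return None  # neither backend enabled: no backend tag
```

Checking the ROCm-only case first and keying it on the absence of CUDA is what keeps a fat wheel from being labelled as a ROCm wheel.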

benenzhu added 5 commits May 13, 2026 04:19

zz

bc32bc2

zz

2f950fd

zz

63ad2bb

zz

42e2cad

benenzhu added 2 commits May 13, 2026 07:13

merge

5429965

merge

dce7ae8

benenzhu marked this pull request as draft May 13, 2026 07:15

benenzhu added 2 commits May 13, 2026 07:16

merge

a46172c

benenzhu changed the title ~~[BugFix] Allow USE_ROCM=ON wheel builds without a ROCm runtime on host~~ [BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels May 13, 2026

benenzhu added 2 commits May 13, 2026 07:58

benenzhu marked this pull request as ready for review May 13, 2026 08:47

coderabbitai Bot reviewed May 13, 2026

benenzhu changed the title ~~[BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels~~ [DRAFT][BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels May 13, 2026

merge

064fe97

zhangnju reviewed May 13, 2026

Comment thread .github/workflows/ci.yml

zhangnju reviewed May 13, 2026

benenzhu added 2 commits May 13, 2026 10:52

merge

107aaab

benenzhu changed the title ~~[DRAFT][BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels~~ [BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels May 13, 2026

Update ci.yml

139c51d

PicoCreator mentioned this pull request May 14, 2026

[BUG] T.While is broken for target="c" / CPU mode #2202

Closed

2 tasks

zhangnju self-requested a review May 14, 2026 09:02

zhangnju approved these changes May 14, 2026

LeiWang1999 merged commit f3704ec into tile-ai:main May 18, 2026
13 checks passed

tjtanaa mentioned this pull request May 26, 2026

[ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc vllm-project/vllm#43679

Merged

4 tasks
