[BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels#2195
Conversation
TVM's find_rocm() requires a real libamdhip64 on the build host, so USE_ROCM=ON currently FATAL_ERRORs on machines that only have HIP headers installed (e.g. an NV-only CI machine producing a cross-target wheel). With TILELANG_USE_HIP_STUBS=ON the runtime library isn't actually needed at build time, only the public HIP/HSA headers. This adds a fallback path: when find_rocm() fails but stubs are enabled and HIP headers are reachable (auto-detected at /opt/rocm/include or pointed to via the new TILELANG_HIP_INCLUDE_DIR cache var), pretend ROCM_FOUND=TRUE and route linking through tilelang's hip_stub target so TVM's ROCM.cmake is satisfied. The vendored stub header is unaffected (it remains private to src/backend/rocm/stubs/ as before); this path relies on header-only ROCm dev packages such as hip-runtime-amd-dev and hsa-rocr-dev, which install without a GPU or driver.
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
📝 WalkthroughWalkthroughVendors comprehensive HIP/AMD headers and updates CMake, CI, packaging, and build configuration to enable header-only ROCm compilation on hosts without ROCm runtime and to build CUDA+ROCm fat wheels. Adds device intrinsics, atomic operations, warp synchronization, vector/texture/surface APIs, runtime kernel-launch helpers, and driver types. Updates backend selection logic, includes vendored headers in sdist, configures fat-wheel builds, and updates CI to enforce ROCm compilation paths. ChangesComplete HIP headers and fat-wheel ROCm build foundation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
Vendor Triton's HIP headers under 3rdparty/hip-headers/include/hip so USE_ROCM=ON can build on hosts without any ROCm install (cibuildwheel manylinux containers, NV-only dev machines). HSA headers are not vendored because hip.cc already gates <hsa/hsa.h> behind __has_include with a forward-decl fallback for the two HSA symbols we use. CMake fallback chain when find_rocm() fails: TILELANG_HIP_INCLUDE_DIR -> env var -> /opt/rocm/include -> vendored. Drop unused #include <hip/hiprtc.h> from rt_mod_hip.cc (no hiprtc symbols referenced) so we don't need to vendor hiprtc.h, which Triton also omits. Linux wheels now build with both USE_CUDA=ON and USE_ROCM=ON via [tool.cibuildwheel.linux] env, producing a single wheel that runs on either CUDA or ROCm hosts. Windows / macOS targets are unchanged. Add USE_ROCM=ON to the self-hosted NV CI job so a regression that breaks the ROCm-on-NV build path is caught by regular PR CI, not just by the release-time dist workflow.
hip_runtime_api.h transitively #includes <hip/amd_detail/host_defines.h> and other amd_detail headers. The first vendoring pass only copied the top-level hip/ files, breaking the build with: fatal error: hip/amd_detail/host_defines.h: No such file or directory Add the full hip/amd_detail/ subtree (25 files) from the same Triton source. nvidia_detail/ is intentionally not vendored: every nvidia_detail include sits behind `__HIP_PLATFORM_NVIDIA__`, which we never define. HSA is also still not needed (no hsa references in the amd_detail set).
TVM's src/runtime/rocm/rocm_device_api.cc unconditionally #includes
<hsa/hsa.h>, so the previous "no HSA headers" assumption only held for
tilelang's own stubs (which gate the include behind __has_include) and
broke the build for the TVM submodule:
/root/tilelang/3rdparty/tvm/src/runtime/rocm/rocm_device_api.cc:26:10:
fatal error: hsa/hsa.h: No such file or directory
Vendor hsa.h verbatim from Triton's third_party/amd/backend/include/hsa/.
The header is self-contained (only includes <stddef.h>, <stdint.h>,
<stdbool.h>), so the other 6 hsa_*.h files are not needed. Link-time
remains unchanged: ROCM_HSA_LIBRARY stays NOTFOUND and the only two HSA
symbols actually referenced (hsa_init / hsa_shut_down) are exported by
hip_stub and lazy-loaded at run time.
Line 21 evaluates `JITKernel | None` at module import (PEP 604 union), which only works on Python 3.10+. The file's `from __future__ import annotations` defers function-signature evaluation but not module-level type-alias assignments, so cp39 wheels imported by the dist.yml smoke test fail with: TypeError: unsupported operand type(s) for |: 'type' and 'NoneType' Use typing.Optional instead. Pre-existing latent bug introduced by tile-ai#2159; surfaced now because this PR touches pyproject.toml/CMakeLists which triggers the dist workflow.
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
CMakeLists.txt (1)
397-438:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve env-provided SDK paths instead of coercing to
ON.
USE_CUDA/USE_ROCMsupport path values, but this block converts any truthy env value toON, discarding explicit paths fromENV{USE_CUDA}/ENV{USE_ROCM}.Proposed fix
if(DEFINED ENV{USE_CUDA}) set(_tilelang_backend_env_selected ON) - if($ENV{USE_CUDA}) - set(USE_CUDA ON) - else() - set(USE_CUDA OFF) - endif() + set(USE_CUDA "$ENV{USE_CUDA}") endif() if(DEFINED ENV{USE_ROCM}) set(_tilelang_backend_env_selected ON) - if($ENV{USE_ROCM}) - set(USE_ROCM ON) - else() - set(USE_ROCM OFF) - endif() + set(USE_ROCM "$ENV{USE_ROCM}") endif() if(DEFINED ENV{USE_METAL}) set(_tilelang_backend_env_selected ON) - if($ENV{USE_METAL}) - set(USE_METAL ON) - else() - set(USE_METAL OFF) - endif() + set(USE_METAL "$ENV{USE_METAL}") endif()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CMakeLists.txt` around lines 397 - 438, The env-handling blocks for USE_CUDA, USE_ROCM (and similarly USE_METAL) currently coerce any truthy ENV{...} to "ON" and throw away path values; change them to preserve the raw ENV value when defined by setting USE_CUDA (and USE_ROCM/USE_METAL) to the actual $ENV{USE_CUDA} string instead of always "ON", and only map explicit false-like values ("0" or "OFF") to OFF; update the branches that set USE_CUDA/USE_ROCM/USE_METAL and the _tilelang_backend_env_selected logic so environment-provided SDK paths are used as-is while still supporting explicit OFF/0 toggles, leaving the default selection logic (TILELANG_CUDA_TOOLKIT_AVAILABLE / APPLE) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@3rdparty/hip-headers/include/hip/amd_detail/amd_device_functions.h`:
- Around line 82-90: The bitmask construction in __fns64 and __fns32 uses (1 <<
base) which is UB for large bases; change the mask shift to use unsigned-width
literals and clamp the shift amount to the valid width before shifting (e.g.,
use ( (__hip_uint64_t)1 << clamped_base ) for __fns64 and ( (__hip_uint32_t)1u
<< clamped_base ) for __fns32), where clamped_base = min(max(base, 0),
WAVEFRONT_SIZE-1) (or mask base with the width bits) so no signed shifts or
overflow occur; update all occurrences in the __fns64 and __fns32
implementations (including temp_mask &= ... branches) to use these unsigned,
clamped shifts.
In `@3rdparty/hip-headers/include/hip/amd_detail/amd_surface_functions.h`:
- Around line 376-384: The surfCubemapLayeredwrite overload incorrectly declares
its first parameter as T* data while all other surf*write overloads use a value
parameter T data; change the signature of surfCubemapLayeredwrite from T* data
to T data so that the call to __hipMapTo<float4::Native_vec_>(data) and the
pattern with other functions (e.g., other surf*write overloads) match; update
the parameter type in the function declaration and any references inside
surfCubemapLayeredwrite (keeping the rest of the body, including
__hipGetPixelAddr, int2 coords, and __ockl_image_store_lod_CM usage, unchanged).
In `@3rdparty/hip-headers/include/hip/amd_detail/amd_warp_sync_functions.h`:
- Around line 499-521: The reduction lambdas in __reduce_or_sync,
__reduce_and_sync and __reduce_xor_sync (and their extra-types variants for int,
long long, unsigned long long) use logical operators instead of bitwise ops and
a malformed XOR expression; replace lhs || rhs with lhs | rhs in
__reduce_or_sync, lhs && rhs with lhs & rhs in __reduce_and_sync, and the XOR
lambda with lhs ^ rhs in __reduce_xor_sync, and apply equivalent fixes to the
corresponding int/long long/unsigned long long reduction lambdas so the manual
reduction tree matches the __ockl_wfred_* bitwise intrinsics.
In `@3rdparty/hip-headers/include/hip/amd_detail/hip_runtime_prof.h`:
- Line 39: The enum constant name kHipHipVdiMemcpyHostToDevice is incorrectly
duplicated with an extra "Hip" and should be renamed to
kHipVdiMemcpyHostToDevice to match the surrounding pattern (e.g.,
kHipVdiMemcpyDeviceToHost, kHipVdiMemcpyDeviceToDevice); update the identifier
in the enum declaration in hip_runtime_prof.h (symbol:
kHipHipVdiMemcpyHostToDevice → kHipVdiMemcpyHostToDevice) and search the
codebase for any occurrences to replace them (adjust any comments or
documentation strings nearby as needed).
In `@3rdparty/hip-headers/include/hip/amd_detail/texture_indirect_functions.h`:
- Around line 388-400: In tex3DGrad, the incoming gradient parameter dPdx is
unused and both gradx and grady are incorrectly initialized from dPdy; fix by
initializing gradx from dPdx and grady from dPdy so the call to
__ockl_image_sample_grad_3D receives the correct x and y gradients. Update the
assignments to gradx and grady in the tex3DGrad function (after
TEXTURE_OBJECT_PARAMETERS_INIT) and ensure the rest of the call to
__ockl_image_sample_grad_3D(i, s, get_native_vector(coords),
get_native_vector(gradx), get_native_vector(grady)) remains unchanged.
- Around line 200-205: The pointer overload of tex2Dgather mistakenly calls
texCubemapLayered<T> instead of the intended tex2Dgather lookup; in the function
template tex2Dgather(T* ptr, hipTextureObject_t textureObject, float x, float y,
int comp = 0) replace the call to texCubemapLayered<T>(...) with the correct
device sampling function tex2Dgather<T>(textureObject, x, y, comp) so the
pointer overload returns the proper tex2Dgather result.
- Around line 128-133: The pointer overload of the tex2DLayered template
mistakenly calls tex1DLayered<T>(...) instead of invoking the 2D variant; update
the body of the static __device__ __hip_img_chk__ void tex2DLayered(...)
overload to assign *ptr using tex2DLayered<T>(textureObject, x, y, layer) (i.e.,
replace the tex1DLayered call with the tex2DLayered call) so the correct 2D
layered fetch is performed.
---
Outside diff comments:
In `@CMakeLists.txt`:
- Around line 397-438: The env-handling blocks for USE_CUDA, USE_ROCM (and
similarly USE_METAL) currently coerce any truthy ENV{...} to "ON" and throw away
path values; change them to preserve the raw ENV value when defined by setting
USE_CUDA (and USE_ROCM/USE_METAL) to the actual $ENV{USE_CUDA} string instead of
always "ON", and only map explicit false-like values ("0" or "OFF") to OFF;
update the branches that set USE_CUDA/USE_ROCM/USE_METAL and the
_tilelang_backend_env_selected logic so environment-provided SDK paths are used
as-is while still supporting explicit OFF/0 toggles, leaving the default
selection logic (TILELANG_CUDA_TOOLKIT_AVAILABLE / APPLE) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5c0bf7f8-2cbf-4b77-9eae-67857a1c7d50
📒 Files selected for processing (45)
.github/workflows/ci.yml3rdparty/hip-headers/include/hip/amd_detail/amd_channel_descriptor.h3rdparty/hip-headers/include/hip/amd_detail/amd_device_functions.h3rdparty/hip-headers/include/hip/amd_detail/amd_hip_atomic.h3rdparty/hip-headers/include/hip/amd_detail/amd_hip_common.h3rdparty/hip-headers/include/hip/amd_detail/amd_hip_gl_interop.h3rdparty/hip-headers/include/hip/amd_detail/amd_hip_runtime.h3rdparty/hip-headers/include/hip/amd_detail/amd_hip_runtime_pt_api.h3rdparty/hip-headers/include/hip/amd_detail/amd_hip_unsafe_atomics.h3rdparty/hip-headers/include/hip/amd_detail/amd_hip_vector_types.h3rdparty/hip-headers/include/hip/amd_detail/amd_math_functions.h3rdparty/hip-headers/include/hip/amd_detail/amd_surface_functions.h3rdparty/hip-headers/include/hip/amd_detail/amd_warp_functions.h3rdparty/hip-headers/include/hip/amd_detail/amd_warp_sync_functions.h3rdparty/hip-headers/include/hip/amd_detail/device_library_decls.h3rdparty/hip-headers/include/hip/amd_detail/hip_api_trace.hpp3rdparty/hip-headers/include/hip/amd_detail/hip_assert.h3rdparty/hip-headers/include/hip/amd_detail/hip_fp16_math_fwd.h3rdparty/hip-headers/include/hip/amd_detail/hip_ldg.h3rdparty/hip-headers/include/hip/amd_detail/hip_prof_str.h3rdparty/hip-headers/include/hip/amd_detail/hip_runtime_prof.h3rdparty/hip-headers/include/hip/amd_detail/host_defines.h3rdparty/hip-headers/include/hip/amd_detail/math_fwd.h3rdparty/hip-headers/include/hip/amd_detail/ockl_image.h3rdparty/hip-headers/include/hip/amd_detail/texture_fetch_functions.h3rdparty/hip-headers/include/hip/amd_detail/texture_indirect_functions.h3rdparty/hip-headers/include/hip/channel_descriptor.h3rdparty/hip-headers/include/hip/driver_types.h3rdparty/hip-headers/include/hip/hip_common.h3rdparty/hip-headers/include/hip/hip_deprecated.h3rdparty/hip-headers/include/hip/hip_runtime.h3rdparty/hip-headers/include/hip/hip_runtime_api.h3rdparty/hip-headers/include/hip/hip_texture_types.h3rdparty/hip-headers/include/hip/hip_vector_types.h3rdparty/hip-headers/include/hip/hip_version.h3rdparty/hip-headers/include/hip/library_types.h3rdparty/hip-headers/include/hip/linker_types.h3rdparty/hip-headers/include/hip/surface_types.h3rdparty/hip-headers/include/hip/texture_types.h3rdparty/hip-headers/include/hsa/hsa.hCMakeLists.txtpyproject.tomlsrc/backend/rocm/CMakeLists.txtsrc/backend/rocm/codegen/rt_mod_hip.ccsrc/backend/rocm/stubs/hiprtc.cc
💤 Files with no reviewable changes (1)
- src/backend/rocm/codegen/rt_mod_hip.cc
Real ROCm /opt/rocm/include/hip/hiprtc.h declares hiprtcCreateProgram
and hiprtcCompileProgram with `const char *const *` parameters. An
earlier change in this PR flipped our stub to `const char **` (claiming
it matched the real API), but on a host with a real ROCm install the
stub's extern "C" definitions then conflict with the real header:
error: conflicting declaration of C function 'hiprtcResult
hiprtcCreateProgram(_hiprtcProgram**, const char*, const char*, int,
const char**, const char**)'
note: previous declaration 'hiprtcResult hiprtcCreateProgram(..., const
char* const*, const char* const*)'
Switch both the fallback declarations and the function definitions back
to `const char *const *` so the stub compiles whether <hip/hiprtc.h>
comes from the real ROCm install or from the __has_include fallback.
There was a problem hiding this comment.
the latest version of rocm release is 7.2, and we may enable 7.2 here
There was a problem hiding this comment.
Yeah, triton uses this file too. That's only effect build. Generated kernel will compile by hipcc and will use system's hip headers. So should be fine.
There was a problem hiding this comment.
Have tested the whl build by this can run in mi355 rocm 7.2 dockers
The wheel name was flipping from +cuXXX to +rocm on linux because the backend selection chain in dynamic_metadata checked USE_ROCM before USE_CUDA, so the new fat wheel (USE_CUDA=ON USE_ROCM=ON) labelled itself as a ROCm wheel. Only emit the rocm tag when USE_ROCM=ON without USE_CUDA=ON. The fat wheel now keeps the historical +cuXXX.gitYYY naming, which preserves drop-in upgrade behaviour for existing CUDA-pinned installs. ROCm-only builds still get +rocm.
This closes #1922
cc @zhangnju
Summary
Currently
USE_ROCM=ONcannot configure on a build host that has no ROCm runtime (e.g. an NV-only machine producing a cross-target wheel) because TVM'sfind_rocm()macro hard-requires a reallibamdhip64:With
TILELANG_USE_HIP_STUBS=ONthe runtime library isn't actually needed at build time — only the public HIP headers. This PR vendors them under3rdparty/hip-headers/include/hip/(sourced from Triton) so no system ROCm install is required for the build.Changes
3rdparty/hip-headers/include/hip/, copied verbatim from Triton. HSA headers are intentionally not vendored:src/backend/rocm/stubs/hip.ccalready gates<hsa/hsa.h>behind__has_includewith a forward-decl fallback for the only two HSA symbols used (hsa_init/hsa_shut_down).CMakeLists.txtTILELANG_HIP_INCLUDE_DIRcache var to override the HIP header location.USE_CUDA=ON USE_ROCM=ONset together both take effect (the previous if/elseif chain only honored the first match).src/backend/rocm/CMakeLists.txtfind_rocm()fails: ifTILELANG_USE_HIP_STUBS=ONand HIP headers can be located, manually setROCM_FOUND=TRUE,ROCM_INCLUDE_DIRS, and routeROCM_HIPHCC_LIBRARY=hip_stubso TVM'sROCM.cmakeis satisfied.TILELANG_HIP_INCLUDE_DIRcmake var → env var →/opt/rocm/include→ vendored3rdparty/hip-headers/include(default fallback). Most build environments need zero manual configuration.src/backend/rocm/codegen/rt_mod_hip.cc#include <hip/hiprtc.h>(no hiprtc symbols referenced) so we don't need to vendorhiprtc.h(which Triton also omits).src/backend/rocm/stubs/hiprtc.ccconst char *const *toconst char **to match the real HIPRTC API.pyproject.tomlUSE_CUDA=ON USE_ROCM=ONvia[tool.cibuildwheel.linux], producing a single fat wheel that runs on either CUDA or ROCm hosts. Windows / macOS targets unchanged.include(build-time only; not mapped into the runtime wheel)..github/workflows/ci.ymlUSE_ROCM=ON, so a regression in the ROCm-on-NV build path is caught by regular PR CI rather than only by the release-time dist workflow.Test plan
USE_CUDA=ON USE_ROCM=ON pip wheel . -v— vendored headers picked up automatically, no manualTILELANG_HIP_INCLUDE_DIRneeded.TILELANG_HIP_INCLUDE_DIR=<system-rocm-include>as a manual override.Summary by CodeRabbit
New Features
Build & Infrastructure