Skip to content

[BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels#2195

Merged
LeiWang1999 merged 17 commits into
tile-ai:mainfrom
benenzhu:fix_rocm_wheel_build_without_runtime
May 18, 2026
Merged

[BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels#2195
LeiWang1999 merged 17 commits into
tile-ai:mainfrom
benenzhu:fix_rocm_wheel_build_without_runtime

Conversation

@benenzhu
Copy link
Copy Markdown
Contributor

@benenzhu benenzhu commented May 13, 2026

This closes #1922

cc @zhangnju

Reviewer note: Files under 3rdparty/hip-headers/include/hip/ are vendored verbatim from Triton's third_party/amd/backend/include/hip/ and do not need review.

Summary

Currently USE_ROCM=ON cannot configure on a build host that has no ROCm runtime (e.g. an NV-only machine producing a cross-target wheel) because TVM's find_rocm() macro hard-requires a real libamdhip64:

CMake Error at 3rdparty/tvm/cmake/modules/ROCM.cmake:32 (message):
  Cannot find ROCM, USE_ROCM=ON

With TILELANG_USE_HIP_STUBS=ON the runtime library isn't actually needed at build time — only the public HIP headers. This PR vendors them under 3rdparty/hip-headers/include/hip/ (sourced from Triton) so no system ROCm install is required for the build.

Changes

  • Vendored HIP headers under 3rdparty/hip-headers/include/hip/, copied verbatim from Triton. HSA headers are intentionally not vendored: src/backend/rocm/stubs/hip.cc already gates <hsa/hsa.h> behind __has_include with a forward-decl fallback for the only two HSA symbols used (hsa_init / hsa_shut_down).
  • CMakeLists.txt
    • New TILELANG_HIP_INCLUDE_DIR cache var to override the HIP header location.
    • Refactor backend env-var auto-selection so USE_CUDA=ON USE_ROCM=ON set together both take effect (the previous if/elseif chain only honored the first match).
  • src/backend/rocm/CMakeLists.txt
    • Fallback when find_rocm() fails: if TILELANG_USE_HIP_STUBS=ON and HIP headers can be located, manually set ROCM_FOUND=TRUE, ROCM_INCLUDE_DIRS, and route ROCM_HIPHCC_LIBRARY=hip_stub so TVM's ROCM.cmake is satisfied.
    • HIP header resolution order: TILELANG_HIP_INCLUDE_DIR cmake var → env var → /opt/rocm/include → vendored 3rdparty/hip-headers/include (default fallback). Most build environments need zero manual configuration.
  • src/backend/rocm/codegen/rt_mod_hip.cc
    • Drop unused #include <hip/hiprtc.h> (no hiprtc symbols referenced) so we don't need to vendor hiprtc.h (which Triton also omits).
  • src/backend/rocm/stubs/hiprtc.cc
    • Stub function signatures changed from const char *const * to const char ** to match the real HIPRTC API.
  • pyproject.toml
    • Linux wheels now build with both USE_CUDA=ON USE_ROCM=ON via [tool.cibuildwheel.linux], producing a single fat wheel that runs on either CUDA or ROCm hosts. Windows / macOS targets unchanged.
    • Vendored headers added to sdist include (build-time only; not mapped into the runtime wheel).
  • .github/workflows/ci.yml
    • Self-hosted NV CUDA job now sets USE_ROCM=ON, so a regression in the ROCm-on-NV build path is caught by regular PR CI rather than only by the release-time dist workflow.

Test plan

  • Built fat wheel on a 4090 (NV-only) host with USE_CUDA=ON USE_ROCM=ON pip wheel . -v — vendored headers picked up automatically, no manual TILELANG_HIP_INCLUDE_DIR needed.
  • Same build with explicit TILELANG_HIP_INCLUDE_DIR=<system-rocm-include> as a manual override.
  • Installed the produced wheel on the 4090 (NV) host and ran a CUDA example end-to-end — no regression.
  • Installed the same wheel on an MI355X (AMD ROCm) host and ran a HIP example end-to-end.

Summary by CodeRabbit

  • New Features

    • Added support for fat wheels with both CUDA and ROCm backends compiled together.
    • Improved ROCm cross-compilation support for systems without a local ROCm runtime.
  • Build & Infrastructure

    • Enhanced CI/build configuration for ROCm backend testing.
    • Added vendor-supplied HIP headers for improved backend compatibility.

Review Change Stack

benenzhu added 5 commits May 13, 2026 04:19
TVM's find_rocm() requires a real libamdhip64 on the build host, so
USE_ROCM=ON currently FATAL_ERRORs on machines that only have HIP
headers installed (e.g. an NV-only CI machine producing a cross-target
wheel). With TILELANG_USE_HIP_STUBS=ON the runtime library isn't
actually needed at build time, only the public HIP/HSA headers.

This adds a fallback path: when find_rocm() fails but stubs are enabled
and HIP headers are reachable (auto-detected at /opt/rocm/include or
pointed to via the new TILELANG_HIP_INCLUDE_DIR cache var), pretend
ROCM_FOUND=TRUE and route linking through tilelang's hip_stub target so
TVM's ROCM.cmake is satisfied. The vendored stub header is unaffected
(it remains private to src/backend/rocm/stubs/ as before); this path
relies on header-only ROCm dev packages such as hip-runtime-amd-dev and
hsa-rocr-dev, which install without a GPU or driver.
@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 13, 2026

📝 Walkthrough

Walkthrough

Vendors comprehensive HIP/AMD headers and updates CMake, CI, packaging, and build configuration to enable header-only ROCm compilation on hosts without ROCm runtime and to build CUDA+ROCm fat wheels. Adds device intrinsics, atomic operations, warp synchronization, vector/texture/surface APIs, runtime kernel-launch helpers, and driver types. Updates backend selection logic, includes vendored headers in sdist, configures fat-wheel builds, and updates CI to enforce ROCm compilation paths.

Changes

Complete HIP headers and fat-wheel ROCm build foundation

Layer / File(s) Summary
HIP platform detection and common definitions
3rdparty/hip-headers/include/hip/hip_common.h, .../host_defines.h, .../hip_version.h, .../hip_vector_types.h, .../library_types.h, .../hip_runtime.h
Defines platform macros (__HIP_PLATFORM_AMD__/__HIP_PLATFORM_NVIDIA__), compiler-specific attributes, versioning, host typedefs, and routes vector/runtime headers to platform-specific implementations.
AMD HIP device-side intrinsics and builtins
.../amd_detail/amd_channel_descriptor.h, .../amd_device_functions.h, .../device_library_decls.h, .../hip_assert.h
Adds hipCreateChannelDesc and template specializations, device-side bit/conversion/clock/fence/sync intrinsics, local memcpy/memset, device-library OCKL declarations, and device assertion helpers for HIP/Clang.
Vector types and math function declarations
.../amd_detail/amd_hip_vector_types.h, .../amd_math_functions.h, .../math_fwd.h, .../hip_fp16_math_fwd.h, .../hip_ldg.h
Implements HIP vector types (native ext_vector_type or fallback unions), make_* constructors, rank mapping, amd_mixed_dot helpers, and forwards OCML/FP16/__ldg device math functions.
Atomic operations and warp/block synchronization
.../amd_hip_atomic.h, .../amd_hip_unsafe_atomics.h, .../amd_warp_functions.h, .../amd_warp_sync_functions.h
Implements CAS-based atomic wrappers (agent/system scopes), unsafe/safe FP atomics with target-specific fast paths, LDS/register permute/swizzle helpers, warp vote/shuffle/lane intrinsics, and masked warp-sync reduction operations over explicit 64-bit masks.
Texture and surface read/write operations
.../texture_fetch_functions.h, .../texture_indirect_functions.h, .../amd_surface_functions.h, .../ockl_image.h
Adds texture fetch/sample/gather/LOD/grad helpers with type traits and element mapping, texture-object indirect wrappers, surface 1D/2D/3D/cubemap/layered read/write templates, and OCKL image load/store/sample/gather prototypes.
HIP runtime kernel launch and public driver types
.../amd_hip_runtime.h, .../amd_hip_runtime_pt_api.h, .../driver_types.h, .../texture_types.h, .../surface_types.h, .../amd_hip_gl_interop.h, .../hip_deprecated.h, .../hip_runtime_prof.h, .../channel_descriptor.h
Provides HIP-clang kernel launch helpers, device-builtin accessors, per-thread default-stream API remaps, comprehensive driver/resource/array/memcpy/texture/surface type descriptors, OpenGL interop, legacy device properties, and runtime profiling interfaces.
CMake backend selection and ROCm header discovery
CMakeLists.txt, src/backend/rocm/CMakeLists.txt
Refactors backend env-driven selection to independently check USE_CUDA/USE_ROCM/USE_METAL; adds TILELANG_HIP_INCLUDE_DIR cache variable; implements header-fallback search (override/env/system/vendored) when ROCm runtime absent, configuring hip_stub and setting ROCM_FOUND for stub builds.
Fat-wheel packaging and CI ROCm enforcement
pyproject.toml, .github/workflows/ci.yml
Includes vendored HIP headers in sdist (build-time only); configures cibuildwheel Linux environment with USE_CUDA=ON and USE_ROCM=ON for fat wheels; appends USE_ROCM=ON to CI CUDA job environment to exercise ROCm compilation path.
Runtime header and version metadata alignment
src/backend/rocm/codegen/rt_mod_hip.cc, version_provider.py
Switches runtime include from hip/hiprtc.h to hip/hip_runtime.h; adjusts ROCm version labeling to apply only when USE_ROCM=ON and USE_CUDA is disabled.
Python typing modernization
tilelang/autotuner/grouped_compile.py
Updates CompileUnitResult type alias to use Optional[...] instead of union notation for optional fields.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • LeiWang1999

Bunny builds with nimble craft,
ROCm headers bundled in my pack.
CUDA, ROCm—wheels roll wide,
hip_stub guides my rabbit stride.
Textures shimmer, warps align,
Sync and atomics hop in time! 🥕✨

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

@benenzhu benenzhu marked this pull request as draft May 13, 2026 07:15
benenzhu added 2 commits May 13, 2026 07:16
Vendor Triton's HIP headers under 3rdparty/hip-headers/include/hip
so USE_ROCM=ON can build on hosts without any ROCm install (cibuildwheel
manylinux containers, NV-only dev machines). HSA headers are not vendored
because hip.cc already gates <hsa/hsa.h> behind __has_include with a
forward-decl fallback for the two HSA symbols we use.

CMake fallback chain when find_rocm() fails:
TILELANG_HIP_INCLUDE_DIR -> env var -> /opt/rocm/include -> vendored.

Drop unused #include <hip/hiprtc.h> from rt_mod_hip.cc (no hiprtc symbols
referenced) so we don't need to vendor hiprtc.h, which Triton also omits.

Linux wheels now build with both USE_CUDA=ON and USE_ROCM=ON via
[tool.cibuildwheel.linux] env, producing a single wheel that runs on
either CUDA or ROCm hosts. Windows / macOS targets are unchanged.

Add USE_ROCM=ON to the self-hosted NV CI job so a regression that breaks
the ROCm-on-NV build path is caught by regular PR CI, not just by the
release-time dist workflow.
@benenzhu benenzhu changed the title [BugFix] Allow USE_ROCM=ON wheel builds without a ROCm runtime on host [BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels May 13, 2026
benenzhu added 2 commits May 13, 2026 07:58
hip_runtime_api.h transitively #includes <hip/amd_detail/host_defines.h>
and other amd_detail headers. The first vendoring pass only copied the
top-level hip/ files, breaking the build with:
  fatal error: hip/amd_detail/host_defines.h: No such file or directory

Add the full hip/amd_detail/ subtree (25 files) from the same Triton
source. nvidia_detail/ is intentionally not vendored: every nvidia_detail
include sits behind `__HIP_PLATFORM_NVIDIA__`, which we never define.
HSA is also still not needed (no hsa references in the amd_detail set).
TVM's src/runtime/rocm/rocm_device_api.cc unconditionally #includes
<hsa/hsa.h>, so the previous "no HSA headers" assumption only held for
tilelang's own stubs (which gate the include behind __has_include) and
broke the build for the TVM submodule:

  /root/tilelang/3rdparty/tvm/src/runtime/rocm/rocm_device_api.cc:26:10:
    fatal error: hsa/hsa.h: No such file or directory

Vendor hsa.h verbatim from Triton's third_party/amd/backend/include/hsa/.
The header is self-contained (only includes <stddef.h>, <stdint.h>,
<stdbool.h>), so the other 6 hsa_*.h files are not needed. Link-time
remains unchanged: ROCM_HSA_LIBRARY stays NOTFOUND and the only two HSA
symbols actually referenced (hsa_init / hsa_shut_down) are exported by
hip_stub and lazy-loaded at run time.
@benenzhu benenzhu marked this pull request as ready for review May 13, 2026 08:47
Line 21 evaluates `JITKernel | None` at module import (PEP 604 union),
which only works on Python 3.10+. The file's `from __future__ import
annotations` defers function-signature evaluation but not module-level
type-alias assignments, so cp39 wheels imported by the dist.yml smoke
test fail with:

  TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Use typing.Optional instead. Pre-existing latent bug introduced by
tile-ai#2159; surfaced now because this PR touches pyproject.toml/CMakeLists
which triggers the dist workflow.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
CMakeLists.txt (1)

397-438: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve env-provided SDK paths instead of coercing to ON.

USE_CUDA/USE_ROCM support path values, but this block converts any truthy env value to ON, discarding explicit paths from ENV{USE_CUDA} / ENV{USE_ROCM}.

Proposed fix
   if(DEFINED ENV{USE_CUDA})
     set(_tilelang_backend_env_selected ON)
-    if($ENV{USE_CUDA})
-      set(USE_CUDA ON)
-    else()
-      set(USE_CUDA OFF)
-    endif()
+    set(USE_CUDA "$ENV{USE_CUDA}")
   endif()

   if(DEFINED ENV{USE_ROCM})
     set(_tilelang_backend_env_selected ON)
-    if($ENV{USE_ROCM})
-      set(USE_ROCM ON)
-    else()
-      set(USE_ROCM OFF)
-    endif()
+    set(USE_ROCM "$ENV{USE_ROCM}")
   endif()

   if(DEFINED ENV{USE_METAL})
     set(_tilelang_backend_env_selected ON)
-    if($ENV{USE_METAL})
-      set(USE_METAL ON)
-    else()
-      set(USE_METAL OFF)
-    endif()
+    set(USE_METAL "$ENV{USE_METAL}")
   endif()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CMakeLists.txt` around lines 397 - 438, The env-handling blocks for USE_CUDA,
USE_ROCM (and similarly USE_METAL) currently coerce any truthy ENV{...} to "ON"
and throw away path values; change them to preserve the raw ENV value when
defined by setting USE_CUDA (and USE_ROCM/USE_METAL) to the actual
$ENV{USE_CUDA} string instead of always "ON", and only map explicit false-like
values ("0" or "OFF") to OFF; update the branches that set
USE_CUDA/USE_ROCM/USE_METAL and the _tilelang_backend_env_selected logic so
environment-provided SDK paths are used as-is while still supporting explicit
OFF/0 toggles, leaving the default selection logic
(TILELANG_CUDA_TOOLKIT_AVAILABLE / APPLE) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@3rdparty/hip-headers/include/hip/amd_detail/amd_device_functions.h`:
- Around line 82-90: The bitmask construction in __fns64 and __fns32 uses (1 <<
base) which is UB for large bases; change the mask shift to use unsigned-width
literals and clamp the shift amount to the valid width before shifting (e.g.,
use ( (__hip_uint64_t)1 << clamped_base ) for __fns64 and ( (__hip_uint32_t)1u
<< clamped_base ) for __fns32), where clamped_base = min(max(base, 0),
WAVEFRONT_SIZE-1) (or mask base with the width bits) so no signed shifts or
overflow occur; update all occurrences in the __fns64 and __fns32
implementations (including temp_mask &= ... branches) to use these unsigned,
clamped shifts.

In `@3rdparty/hip-headers/include/hip/amd_detail/amd_surface_functions.h`:
- Around line 376-384: The surfCubemapLayeredwrite overload incorrectly declares
its first parameter as T* data while all other surf*write overloads use a value
parameter T data; change the signature of surfCubemapLayeredwrite from T* data
to T data so that the call to __hipMapTo<float4::Native_vec_>(data) and the
pattern with other functions (e.g., other surf*write overloads) match; update
the parameter type in the function declaration and any references inside
surfCubemapLayeredwrite (keeping the rest of the body, including
__hipGetPixelAddr, int2 coords, and __ockl_image_store_lod_CM usage, unchanged).

In `@3rdparty/hip-headers/include/hip/amd_detail/amd_warp_sync_functions.h`:
- Around line 499-521: The reduction lambdas in __reduce_or_sync,
__reduce_and_sync and __reduce_xor_sync (and their extra-types variants for int,
long long, unsigned long long) use logical operators instead of bitwise ops and
a malformed XOR expression; replace lhs || rhs with lhs | rhs in
__reduce_or_sync, lhs && rhs with lhs & rhs in __reduce_and_sync, and the XOR
lambda with lhs ^ rhs in __reduce_xor_sync, and apply equivalent fixes to the
corresponding int/long long/unsigned long long reduction lambdas so the manual
reduction tree matches the __ockl_wfred_* bitwise intrinsics.

In `@3rdparty/hip-headers/include/hip/amd_detail/hip_runtime_prof.h`:
- Line 39: The enum constant name kHipHipVdiMemcpyHostToDevice is incorrectly
duplicated with an extra "Hip" and should be renamed to
kHipVdiMemcpyHostToDevice to match the surrounding pattern (e.g.,
kHipVdiMemcpyDeviceToHost, kHipVdiMemcpyDeviceToDevice); update the identifier
in the enum declaration in hip_runtime_prof.h (symbol:
kHipHipVdiMemcpyHostToDevice → kHipVdiMemcpyHostToDevice) and search the
codebase for any occurrences to replace them (adjust any comments or
documentation strings nearby as needed).

In `@3rdparty/hip-headers/include/hip/amd_detail/texture_indirect_functions.h`:
- Around line 388-400: In tex3DGrad, the incoming gradient parameter dPdx is
unused and both gradx and grady are incorrectly initialized from dPdy; fix by
initializing gradx from dPdx and grady from dPdy so the call to
__ockl_image_sample_grad_3D receives the correct x and y gradients. Update the
assignments to gradx and grady in the tex3DGrad function (after
TEXTURE_OBJECT_PARAMETERS_INIT) and ensure the rest of the call to
__ockl_image_sample_grad_3D(i, s, get_native_vector(coords),
get_native_vector(gradx), get_native_vector(grady)) remains unchanged.
- Around line 200-205: The pointer overload of tex2Dgather mistakenly calls
texCubemapLayered<T> instead of the intended tex2Dgather lookup; in the function
template tex2Dgather(T* ptr, hipTextureObject_t textureObject, float x, float y,
int comp = 0) replace the call to texCubemapLayered<T>(...) with the correct
device sampling function tex2Dgather<T>(textureObject, x, y, comp) so the
pointer overload returns the proper tex2Dgather result.
- Around line 128-133: The pointer overload of the tex2DLayered template
mistakenly calls tex1DLayered<T>(...) instead of invoking the 2D variant; update
the body of the static __device__ __hip_img_chk__ void tex2DLayered(...)
overload to assign *ptr using tex2DLayered<T>(textureObject, x, y, layer) (i.e.,
replace the tex1DLayered call with the tex2DLayered call) so the correct 2D
layered fetch is performed.

---

Outside diff comments:
In `@CMakeLists.txt`:
- Around line 397-438: The env-handling blocks for USE_CUDA, USE_ROCM (and
similarly USE_METAL) currently coerce any truthy ENV{...} to "ON" and throw away
path values; change them to preserve the raw ENV value when defined by setting
USE_CUDA (and USE_ROCM/USE_METAL) to the actual $ENV{USE_CUDA} string instead of
always "ON", and only map explicit false-like values ("0" or "OFF") to OFF;
update the branches that set USE_CUDA/USE_ROCM/USE_METAL and the
_tilelang_backend_env_selected logic so environment-provided SDK paths are used
as-is while still supporting explicit OFF/0 toggles, leaving the default
selection logic (TILELANG_CUDA_TOOLKIT_AVAILABLE / APPLE) unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c0bf7f8-2cbf-4b77-9eae-67857a1c7d50

📥 Commits

Reviewing files that changed from the base of the PR and between bcb2da3 and ef4d508.

📒 Files selected for processing (45)
  • .github/workflows/ci.yml
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_channel_descriptor.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_device_functions.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_hip_atomic.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_hip_common.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_hip_gl_interop.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_hip_runtime.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_hip_runtime_pt_api.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_hip_unsafe_atomics.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_hip_vector_types.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_math_functions.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_surface_functions.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_warp_functions.h
  • 3rdparty/hip-headers/include/hip/amd_detail/amd_warp_sync_functions.h
  • 3rdparty/hip-headers/include/hip/amd_detail/device_library_decls.h
  • 3rdparty/hip-headers/include/hip/amd_detail/hip_api_trace.hpp
  • 3rdparty/hip-headers/include/hip/amd_detail/hip_assert.h
  • 3rdparty/hip-headers/include/hip/amd_detail/hip_fp16_math_fwd.h
  • 3rdparty/hip-headers/include/hip/amd_detail/hip_ldg.h
  • 3rdparty/hip-headers/include/hip/amd_detail/hip_prof_str.h
  • 3rdparty/hip-headers/include/hip/amd_detail/hip_runtime_prof.h
  • 3rdparty/hip-headers/include/hip/amd_detail/host_defines.h
  • 3rdparty/hip-headers/include/hip/amd_detail/math_fwd.h
  • 3rdparty/hip-headers/include/hip/amd_detail/ockl_image.h
  • 3rdparty/hip-headers/include/hip/amd_detail/texture_fetch_functions.h
  • 3rdparty/hip-headers/include/hip/amd_detail/texture_indirect_functions.h
  • 3rdparty/hip-headers/include/hip/channel_descriptor.h
  • 3rdparty/hip-headers/include/hip/driver_types.h
  • 3rdparty/hip-headers/include/hip/hip_common.h
  • 3rdparty/hip-headers/include/hip/hip_deprecated.h
  • 3rdparty/hip-headers/include/hip/hip_runtime.h
  • 3rdparty/hip-headers/include/hip/hip_runtime_api.h
  • 3rdparty/hip-headers/include/hip/hip_texture_types.h
  • 3rdparty/hip-headers/include/hip/hip_vector_types.h
  • 3rdparty/hip-headers/include/hip/hip_version.h
  • 3rdparty/hip-headers/include/hip/library_types.h
  • 3rdparty/hip-headers/include/hip/linker_types.h
  • 3rdparty/hip-headers/include/hip/surface_types.h
  • 3rdparty/hip-headers/include/hip/texture_types.h
  • 3rdparty/hip-headers/include/hsa/hsa.h
  • CMakeLists.txt
  • pyproject.toml
  • src/backend/rocm/CMakeLists.txt
  • src/backend/rocm/codegen/rt_mod_hip.cc
  • src/backend/rocm/stubs/hiprtc.cc
💤 Files with no reviewable changes (1)
  • src/backend/rocm/codegen/rt_mod_hip.cc

Comment thread 3rdparty/hip-headers/include/hip/amd_detail/amd_device_functions.h
Comment thread 3rdparty/hip-headers/include/hip/amd_detail/amd_surface_functions.h
Comment thread 3rdparty/hip-headers/include/hip/amd_detail/hip_runtime_prof.h
Real ROCm /opt/rocm/include/hip/hiprtc.h declares hiprtcCreateProgram
and hiprtcCompileProgram with `const char *const *` parameters. An
earlier change in this PR flipped our stub to `const char **` (claiming
it matched the real API), but on a host with a real ROCm install the
stub's extern "C" definitions then conflict with the real header:

  error: conflicting declaration of C function 'hiprtcResult
    hiprtcCreateProgram(_hiprtcProgram**, const char*, const char*, int,
    const char**, const char**)'
  note: previous declaration 'hiprtcResult hiprtcCreateProgram(..., const
    char* const*, const char* const*)'

Switch both the fallback declarations and the function definitions back
to `const char *const *` so the stub compiles whether <hip/hiprtc.h>
comes from the real ROCm install or from the __has_include fallback.
@benenzhu benenzhu changed the title [BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels [DRAFT][BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels May 13, 2026
Comment thread .github/workflows/ci.yml
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the latest version of rocm release is 7.2, and we may enable 7.2 here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, triton uses this file too. That's only effect build. Generated kernel will compile by hipcc and will use system's hip headers. So should be fine.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have tested the whl build by this can run in mi355 rocm 7.2 dockers

benenzhu added 2 commits May 13, 2026 10:52
The wheel name was flipping from +cuXXX to +rocm on linux because the
backend selection chain in dynamic_metadata checked USE_ROCM before
USE_CUDA, so the new fat wheel (USE_CUDA=ON USE_ROCM=ON) labelled itself
as a ROCm wheel.

Only emit the rocm tag when USE_ROCM=ON without USE_CUDA=ON. The fat
wheel now keeps the historical +cuXXX.gitYYY naming, which preserves
drop-in upgrade behaviour for existing CUDA-pinned installs. ROCm-only
builds still get +rocm.
@benenzhu benenzhu changed the title [DRAFT][BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels [BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels May 13, 2026
@LeiWang1999 LeiWang1999 merged commit f3704ec into tile-ai:main May 18, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Question] Test Tilelang whl in Rocm with CUDA build wheel.

3 participants