Add RDNA gfx1151 ROCm target support (#2127)
Conversation
👋 Hi! Thank you for contributing to the TileLang project. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the repository settings.
📝 Walkthrough

Adds RDNA device support and RDNA-aware target utilities, introduces int8 DP4A intrinsics with an AMD builtin fallback, forwards target-derived mcpu to HIP compilation, expands test gating for RDNA/CDNA/CUDA, and guards optional IR annotations.

Changes
- RDNA Architecture
- ROCm Target Utilities & Tests
- HIP/AMD DP4A Intrinsics
- Compilation / Tooling Integration
- Carver Template & Policies
- IR/Transform Robustness
- Testing Decorators & Test Scope
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant TargetUtil as tilelang.utils.target
    participant CarverArch as tilelang.carver.arch
    participant RDNAClass as RDNA
    participant ROCmDevice as ROCm Device
    User->>TargetUtil: determine_target("auto" / Target)
    TargetUtil->>TargetUtil: normalize_rocm_arch, extract mcpu, add mtriple/thread_warp_size
    TargetUtil-->>CarverArch: enriched Target
    User->>CarverArch: get_arch(target)
    CarverArch->>CarverArch: target_is_rdna(target)?
    alt RDNA
        CarverArch->>RDNAClass: instantiate RDNA(target)
        RDNAClass->>ROCmDevice: bind device_0 via tvm.runtime.rocm(0)
        ROCmDevice-->>RDNAClass: device handle
        RDNAClass-->>CarverArch: RDNA instance
    else
        CarverArch-->>User: CDNA/CUDA arch
    end
    CarverArch-->>User: arch object
```
```mermaid
sequenceDiagram
    participant Builder as tilelang.engine.lower / libgen
    participant TargetUtil as tilelang.utils.target
    participant HIPcc as tilelang.contrib.hipcc
    participant HIPImpl as HIP Compiler / runtime
    Builder->>TargetUtil: target_get_mcpu(target)
    alt mcpu available
        Builder->>HIPcc: compile_hip(source, arch=mcpu)
        HIPcc->>HIPImpl: invoke compile with --offload-arch=mcpu
    else
        Builder->>HIPcc: compile_hip(source, arch=fallback_rocm_arch)
        HIPcc->>HIPImpl: invoke compile with detected arch
    end
    HIPImpl-->>HIPcc: compiled binary
    HIPcc-->>Builder: result
```
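The arch-selection fallback in this flow can be sketched in a few lines. `pick_offload_arch` is an illustrative helper name, not the actual API; the real logic goes through `target_get_mcpu` and `compile_hip`:

```python
# Sketch of the compile-arch selection above: prefer the target-derived mcpu,
# otherwise fall back to a detected ROCm arch. pick_offload_arch is an
# illustrative name, not the actual TileLang API.

def pick_offload_arch(target_mcpu, detected_arch="gfx942"):
    """Return the value passed to hipcc via --offload-arch."""
    for candidate in (target_mcpu, detected_arch):
        if candidate:
            # Strip ROCm feature suffixes, e.g. "gfx1151:sramecc+:xnack-"
            base = str(candidate).split(":", maxsplit=1)[0]
            if base.startswith("gfx"):
                return base
    return None

print(pick_offload_arch("gfx1151:sramecc+:xnack-"))  # gfx1151
print(pick_offload_arch(None))                       # gfx942
```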
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tilelang/contrib/hipcc.py (1)
19-27: Use the canonical ROCm arch parser instead of duplicating it. This helper reimplements logic already centralized in `tilelang/utils/target.py` (`target_get_mcpu` / `normalize_rocm_arch`). Keeping two parsers risks drift.

♻️ Suggested simplification
```diff
+from tilelang.utils.target import target_get_mcpu
-
-def _target_mcpu(target):
-    try:
-        mcpu = target.attrs.get("mcpu")
-    except AttributeError:
-        return None
-    if mcpu is None:
-        return None
-    arch = str(mcpu).strip().split(":", maxsplit=1)[0]
-    return arch if arch.startswith("gfx") else None
```

And update the callback usage:

```diff
-    hsaco = compile_hip(code, target_format="hsaco", arch=_target_mcpu(target))
+    hsaco = compile_hip(code, target_format="hsaco", arch=target_get_mcpu(target))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilelang/contrib/hipcc.py` around lines 19 - 27: Replace the duplicated ROCm arch parsing logic in tilelang/contrib/hipcc.py with the canonical utilities from tilelang/utils/target.py: import and call target_get_mcpu and/or normalize_rocm_arch instead of the local parser, remove the duplicated functions/branches in hipcc.py, and adjust any place that consumed the old local parser output (including callback invocation) to accept the canonical function's return shape; ensure the callback is called with the normalized ROCm arch value returned by normalize_rocm_arch/target_get_mcpu so behavior remains identical.

tilelang/carver/roller/policy/tensorcore.py (1)
266-270: Minor cleanup: reuse `np.prod(space)` once. Good feasibility check. You can avoid duplicate computation by storing the product once.
Refactor sketch
```diff
-    if np.prod(space) < warps:
+    space_prod = int(np.prod(space))
+    if space_prod < warps:
         return None
-    factors = factorize(np.prod(space) // warps)
+    factors = factorize(space_prod // warps)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilelang/carver/roller/policy/tensorcore.py` around lines 266 - 270, The feasibility check currently calls np.prod(space) twice; compute it once, store it in a local variable (e.g., total = np.prod(space)) and replace both occurrences with that variable to avoid duplicate work; update the scope where np.prod(space) is used (the feasibility check block) so all references use the new local variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/tl_templates/hip/common.h`:
- Around line 149-159: The DP4A template is unsafe for non-4-byte types because
it uses fixed 4-byte __builtin_memcpy into int temporaries; add compile-time
checks inside DP4A to prevent instantiation with incompatible types by adding
static_asserts that sizeof(InDatatype) == 4 and sizeof(OutDatatype) == 4 (and
optionally that std::is_trivially_copyable<InDatatype>::value and
std::is_trivially_copyable<OutDatatype>::value) so only 4-byte,
trivially-copyable operand/result types can be used; keep the checks in the DP4A
template (which calls tl_dp4a and uses __builtin_memcpy) to fail fast at compile
time.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bb907b2d-7064-4071-adf6-72f2346e742f
📒 Files selected for processing (11)
- src/tl_templates/hip/common.h
- testing/python/target/test_tilelang_rocm_target.py
- tilelang/carver/__init__.py
- tilelang/carver/arch/__init__.py
- tilelang/carver/arch/rdna.py
- tilelang/carver/roller/policy/tensorcore.py
- tilelang/carver/template/base.py
- tilelang/contrib/hipcc.py
- tilelang/engine/lower.py
- tilelang/jit/adapter/libgen.py
- tilelang/utils/target.py
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/tl_templates/hip/common.h (1)
149-163: ⚠️ Potential issue | 🔴 Critical — Fix the DP4A carrier-size check.

`__builtin_memcpy(&a_int, a, sizeof(a_int))` and the matching `b`/`c` copies read 4 bytes, so `sizeof(InDatatype) == 1` lets this template read past the pointed-to object. Require a 4-byte packed carrier here, or shrink the memcpy width to match the asserted size.

🔧 Proposed fix
```diff
 template <typename InDatatype, typename OutDatatype>
 TL_DEVICE void DP4A(const InDatatype *a, const InDatatype *b, OutDatatype *c) {
-  static_assert(sizeof(InDatatype) == 1,
-                "DP4A expects a pointer to packed int8 lanes");
+  static_assert(sizeof(InDatatype) == sizeof(int),
+                "DP4A expects a 4-byte packed carrier");
   static_assert(sizeof(OutDatatype) == sizeof(int),
                 "DP4A expects 4-byte accumulator/output type");
```

Run the following script to inspect DP4A call sites and confirm the carrier types they use:
```shell
#!/bin/bash
set -euo pipefail
rg -nP -C 2 '\bDP4A\s*(<|\()' src testing
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tl_templates/hip/common.h` around lines 149 - 163, The DP4A template currently asserts sizeof(InDatatype) == 1 but then memcpy's 4 bytes from the pointers (a/b/c), so change the carrier-size check to require a 4-byte packed carrier: replace the static_assert for InDatatype with sizeof(InDatatype) == 4 (update the error message accordingly) so the template's memcpy calls (using a_int/b_int/c_int) are safe; keep the OutDatatype/int-size assert as-is and leave the memcpy widths unchanged. Also run the provided rg script to verify all DP4A call sites use 4-byte carrier types.
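For intuition, the operation DP4A wraps can be modeled in plain Python: four signed 8-bit lanes packed into one 32-bit word per operand, multiplied lane-wise and accumulated into a 32-bit result. This is a reference model of the semantics under discussion, not the HIP intrinsic itself, and it illustrates why the carrier must be exactly 4 bytes:

```python
# Reference model of DP4A semantics: c + dot(a_lanes, b_lanes), where a and b
# each pack four int8 lanes into one 32-bit carrier. Not the HIP intrinsic.
import struct

def dp4a(a_packed: int, b_packed: int, c: int) -> int:
    # Reinterpret each 4-byte carrier as four signed 8-bit lanes.
    a_lanes = struct.unpack("4b", struct.pack("<I", a_packed & 0xFFFFFFFF))
    b_lanes = struct.unpack("4b", struct.pack("<I", b_packed & 0xFFFFFFFF))
    return c + sum(x * y for x, y in zip(a_lanes, b_lanes))

def pack_i8x4(lanes):
    """Pack four int8 values into one little-endian 32-bit word."""
    return struct.unpack("<I", struct.pack("4b", *lanes))[0]

a = pack_i8x4([1, 2, 3, 4])
b = pack_i8x4([5, 6, 7, 8])
print(dp4a(a, b, 10))  # 1*5 + 2*6 + 3*7 + 4*8 + 10 = 80
```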
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tilelang/carver/roller/policy/tensorcore.py`:
- Around line 266-270: The code currently computes factors from space_prod //
warps which truncates remainders; change it to first require exact divisibility
by checking space_prod % warps == 0 and return None if not, then call factorize
on the exact quotient (space_prod // warps) — update the logic around the
variables space_prod and warps and the call to factorize to enforce this
divisibility check before factorization.
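A runnable sketch of the guard this prompt describes (function names follow the review text; `math.prod` stands in for `np.prod`, and `factorize` is a simple illustrative prime factorization, not the project's helper):

```python
# Feasibility check with an explicit divisibility guard: compute the product
# once and reject configurations that don't divide evenly, instead of letting
# floor division silently truncate.
import math

def factorize(n: int) -> list:
    """Prime factorization, smallest factors first (illustrative helper)."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def warp_factors(space, warps):
    space_prod = math.prod(space)
    if space_prod < warps or space_prod % warps != 0:
        return None  # infeasible or non-divisible configuration
    return factorize(space_prod // warps)

print(warp_factors([4, 4, 2], 8))  # [2, 2]  (32 // 8 = 4)
print(warp_factors([3, 2], 4))     # None    (6 % 4 != 0)
```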
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: df20e00d-b6ab-4cfa-813d-270f143be6a1
📒 Files selected for processing (6)
- src/op/parallel.cc
- src/tl_templates/hip/common.h
- src/transform/layout_reducer.cc
- testing/python/target/test_tilelang_rocm_target.py
- tilelang/carver/roller/policy/tensorcore.py
- tilelang/contrib/hipcc.py
Took a look at the failing tests and fixed w/ separate commits:
Validation:
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tilelang/carver/roller/policy/tensorcore.py`:
- Around line 215-219: In score_block_size, the current warp count uses ceiling
division which overestimates warps for non-aligned n; change the computation in
tensorcore.score_block_size to use floor division (compute warps = n //
self.arch.warp_size) when checking getattr(self.arch, "platform", None) ==
"RDNA" so underfilled blocks do not appear as full 8-warp occupancy; keep the
rest of the return tuple logic ((0 if warps == 8 else 1, abs(warps - 8),
*base_score)) unchanged.
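The scoring tweak described above can be sketched as follows (illustrative standalone function, not the actual policy class; the `base_score` tail of the real return tuple is omitted):

```python
# RDNA-aware block-size scoring with floor division: an underfilled block
# (n not a multiple of warp_size) must not score as full 8-warp occupancy.

def score_block_size(n: int, warp_size: int = 32):
    warps = n // warp_size  # floor, not ceiling division
    # Lower tuples sort first: prefer exactly 8 warps, then closeness to 8.
    return (0 if warps == 8 else 1, abs(warps - 8))

print(score_block_size(256))  # (0, 0): exactly 8 full warps
print(score_block_size(250))  # (1, 1): only 7 full warps, not rounded up to 8
```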
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: eabb4224-6f88-4935-9b78-a5706998c4c3
📒 Files selected for processing (6)
- testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py
- testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py
- testing/python/amd/test_tilelang_gfx950_copy_async.py
- testing/python/kernel/test_tilelang_kernel_gemm.py
- tilelang/carver/roller/policy/tensorcore.py
- tilelang/testing/__init__.py
✅ Files skipped from review due to trivial changes (1)
- testing/python/amd/test_tilelang_gfx950_copy_async.py
zhangnju left a comment:

Thanks for your contribution.
zhangnju left a comment:

Thanks for your contributions to AMD TileLang. This PR adds first-class support for AMD RDNA GPUs (specifically gfx1151/Strix Halo) in TileLang. The work is well-structured across the layers: target detection, device modeling, and code generation. But because some of this PR's code will also be used by RDNA4, we may need to check whether it works well for RDNA4 too, e.g. by auditing for hardcoded values.
@lhl Thanks for your contributions! Would you mind resolving those comments?

I'll take a look soon. I can add guards to make sure it only supports the GPUs I have (gfx1100, gfx1151), but note: I don't have access to RDNA4 hardware, so beyond checking that it isn't affected, I won't be doing any RDNA4 optimization and will leave that for those who want to do it. I'll scope specifically to RDNA3/3.5.
@LeiWang1999 scope is guarded to gfx11 and all feedback is resolved now |
Actionable comments posted: 3
🧹 Nitpick comments (2)
testing/python/target/test_tilelang_rocm_target.py (2)
67-83: 💤 Low value — The `torch` monkeypatches in `test_carver_routes_rdna_without_instantiating_device` don't affect `get_arch`.

`torch.version.hip`, `torch.cuda.is_available`, and `torch.mps.is_available` are only consulted by `auto_infer_current_arch`, not by `get_arch(Target(...))`, which is called directly on line 81. These three patches are dead in this test and could be removed, leaving only the `arch_mod.RDNA` monkeypatch that actually prevents the device probe.

🧹 Proposed cleanup
```diff
 def test_carver_routes_rdna_without_instantiating_device(monkeypatch):
-    import torch
-
-    monkeypatch.setattr(torch.version, "hip", None, raising=False)
-    monkeypatch.setattr(torch.cuda, "is_available", lambda: False)
-    if hasattr(torch, "mps"):
-        monkeypatch.setattr(torch.mps, "is_available", lambda: False, raising=False)
-
     import tilelang.carver.arch as arch_mod

     def fake_rdna(target):
         return ("rdna", target)

     monkeypatch.setattr(arch_mod, "RDNA", fake_rdna)
     arch = arch_mod.get_arch(Target("hip -mcpu=gfx1151"))
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/target/test_tilelang_rocm_target.py` around lines 67 - 83, The test function test_carver_routes_rdna_without_instantiating_device contains irrelevant monkeypatches for torch.version.hip, torch.cuda.is_available, and torch.mps.is_available that do not affect get_arch; remove the three monkeypatch.setattr calls (the torch.version.hip, torch.cuda.is_available, and torch.mps.is_available lines) and keep only the monkeypatch of arch_mod.RDNA and the call to arch_mod.get_arch(Target("hip -mcpu=gfx1151")) so the test solely prevents device probing via the RDNA stub.
19-27: 💤 Low value — `test_normalize_rocm_arch_strips_feature_suffix` mixes two distinct API surfaces.

The function tests both `normalize_rocm_arch` and `rocm_warp_size_for_arch` in the same test case. Consider splitting into separate test functions so failures are easier to diagnose. This is a low-priority style concern.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/target/test_tilelang_rocm_target.py` around lines 19 - 27, Split the mixed assertions in test_normalize_rocm_arch_strips_feature_suffix into two focused tests: keep only normalize_rocm_arch assertions in a renamed test (e.g., test_normalize_rocm_arch) that asserts normalize_rocm_arch("gfx1151:sramecc+:xnack-")== "gfx1151", normalize_rocm_arch("gfx942")=="gfx942", normalize_rocm_arch("") is None, and normalize_rocm_arch("sm_90") is None; move the rocm_warp_size_for_arch assertions into a separate test (e.g., test_rocm_warp_size_for_arch) that asserts rocm_warp_size_for_arch("gfx1151")==32, rocm_warp_size_for_arch("gfx1030")==32, rocm_warp_size_for_arch("gfx1200")==32, and rocm_warp_size_for_arch("gfx942")==64 so failures pinpoint the correct API surface.
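The behavior these tests pin down can be sketched as follows. This is a best-guess reimplementation inferred from the asserted values in the review comment, not the actual `tilelang/utils/target.py` source:

```python
# Inferred sketch of the two helpers exercised by the tests above.

def normalize_rocm_arch(arch):
    """Strip feature suffixes like ':sramecc+:xnack-'; None for non-gfx input."""
    if not arch:
        return None
    base = str(arch).strip().split(":", maxsplit=1)[0]
    return base if base.startswith("gfx") else None

def rocm_warp_size_for_arch(arch):
    """RDNA (gfx10/gfx11/gfx12) uses wave32; gfx9 (CDNA/GCN) uses wave64."""
    base = normalize_rocm_arch(arch)
    if base and base[3:5] in ("10", "11", "12"):
        return 32
    return 64

print(normalize_rocm_arch("gfx1151:sramecc+:xnack-"))  # gfx1151
print(rocm_warp_size_for_arch("gfx942"))               # 64
```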
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/tl_templates/hip/common.h`:
- Around line 129-136: The TL_AMDGPU_HAS_SDOT4 feature-guard incorrectly
includes __gfx1010__, which lacks dot1-insts and causes use of
__builtin_amdgcn_sdot4 on unsupported targets; update the preprocessor check
around the TL_AMDGPU_HAS_SDOT4 macro by removing the defined(__gfx1010__) entry
so that __gfx1010__ falls through to the fallback implementation (modify the
`#if/`#endif block that defines TL_AMDGPU_HAS_SDOT4 accordingly).
In `@tilelang/carver/arch/__init__.py`:
- Around line 22-25: The current branch that handles HIP targets silently maps
any RDNA target not equal to generation 11 into CDNA, causing wrong defaults;
update the HIP branch in __init__.py that checks target.kind.name == "hip" so
that after calling target_is_rdna(target) and reading
target_get_rdna_generation(target) you explicitly reject unsupported RDNA
generations (e.g. raise a ValueError with a clear message naming the detected
generation) instead of returning CDNA(target); if you prefer to keep a fallback,
add a clear inline comment and a warning log before returning CDNA, but the
preferred fix is to raise an explicit error when generation != 11 so RDNA
targets like gfx1030/gfx1200 are not silently misconfigured for CDNA defaults.
In `@tilelang/carver/arch/rdna.py`:
- Around line 45-47: The method get_avaliable_tensorintrin_shapes currently sets
self.available_tensor_instructions to a tuple (TensorInstruction("wmma", [16,
16]),) which contradicts the declared type list[TensorInstruction] | None;
change the assignment to a list (e.g. [TensorInstruction("wmma", [16, 16])]) so
callers can treat available_tensor_instructions as a list (support
isinstance(..., list) and .append()), and return [t.shape for t in
self.available_tensor_instructions] as before; keep references to the attribute
available_tensor_instructions and the TensorInstruction constructor in the fix.
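The fix amounts to storing a list rather than a tuple so the attribute matches its declared type and supports `append()`. A minimal sketch, with class shapes assumed for illustration (the misspelled method name is kept as it appears in the source):

```python
# Minimal sketch of the rdna.py fix: available_tensor_instructions is stored
# as a list (matching list[TensorInstruction] | None), not a tuple. The class
# shapes here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class TensorInstruction:
    name: str
    shape: list

class RDNA:
    def __init__(self):
        self.available_tensor_instructions = None  # list[TensorInstruction] | None

    def get_avaliable_tensorintrin_shapes(self):
        # A list, not a tuple, so callers can isinstance(..., list) and append().
        self.available_tensor_instructions = [TensorInstruction("wmma", [16, 16])]
        return [t.shape for t in self.available_tensor_instructions]

r = RDNA()
print(r.get_avaliable_tensorintrin_shapes())  # [[16, 16]]
```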
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 61d1e99c-e77f-4ed3-a77e-1cd816f3aa3e
📒 Files selected for processing (5)
- src/tl_templates/hip/common.h
- testing/python/target/test_tilelang_rocm_target.py
- tilelang/carver/arch/__init__.py
- tilelang/carver/arch/rdna.py
- tilelang/utils/target.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tilelang/utils/target.py
@LeiWang1999 @zhangnju FYI all manual and automated issues are reviewed/resolved. This patch is tightly scoped to gfx11. |
I saw there was a recent PR adding RDNA3/3.5 support, including Strix Halo, but I had some problems running things with it, and it turns out I had to fix a few things to get it working:

- Normalize ROCm arch strings that carry feature suffixes, e.g. `gfx1151:sramecc+:xnack-`.
- Set `mcpu=gfx1151`, `mtriple=amdgcn-amd-amdhsa-hcc`, and `thread_warp_size=32` for gfx10/gfx11/gfx12 (`thread_warp_size=64` for gfx9) on the target.
- Forward `mcpu` through HIP compilation paths so generated kernels compile for the actual ROCm target arch.
- Add an `RDNA` carver device model and route RDNA HIP targets away from the CDNA path.
- `DP4A` support: `__builtin_amdgcn_sudot4` for RDNA3/RDNA3.5/RDNA4 targets including `gfx1151`, with a fallback to `__builtin_amdgcn_sdot4` on supported older AMD targets.

Tested on:
- gfx1151
- Python 3.12.11 in the therock conda/mamba environment
- 124126594639c279
- 0.23.dev0 from the TileLang submodule build
- PyTorch 2.10.0+rocm7.13.0a20260417, torch.version.hip: 7.13.26154
- ROCm 7.13.26154-ca4b97ef2c, AMD clang 23.0.0git

Smoke Test:
M=N=K=512, int8 x int8 -> int32

Results after this patch:
- 0.058230 ms, 4.610 TOPS
- 0.055946 ms, 4.798 TOPS
- 0.055425 ms, 4.843 TOPS
- 0.055946 ms, 4.798 TOPS
Tests Passed:
- `pre-commit install --install-hooks`
- `pre-commit run --all-files`
- `python3 -m pytest testing/python/target/test_tilelang_rocm_target.py -q` (6 passed)
- `python -m py_compile ...`
- `ruff check ...`
- `ruff format --check ...`
- `git diff --check`
- `__builtin_amdgcn_sudot4` with `--offload-arch=gfx1151`

Fails (CDNA tests improperly scoped for RDNA...):
- `__builtin_amdgcn_mfma_f32_16x16x16f16` needs target feature mai-insts
- testing/python/debug/test_tilelang_debug_print.py::test_debug_print_buffer

Fixed scope for CDNA tests: I updated the tests that improperly target RDNA to be CDNA or, in one case, gfx950-only. Tests pass for the architectures they're meant for.
Summary by CodeRabbit
New Features
Bug Fixes
Tests