Skip to content

[MetaxGPU][feature] Add MACA GEMM compiler-path layout support#90

Open
VitalyAnkh wants to merge 4 commits into
tile-ai:devfrom
VitalyAnkh:clean-hgemm-164
Open

[MetaxGPU][feature] Add MACA GEMM compiler-path layout support#90
VitalyAnkh wants to merge 4 commits into
tile-ai:devfrom
VitalyAnkh:clean-hgemm-164

Conversation

@VitalyAnkh

@VitalyAnkh VitalyAnkh commented May 20, 2026

Copy link
Copy Markdown
Collaborator

Hi maintainers,

This PR restores the MACA intrinsic package that the compiler-generated hgemm path depends on and keeps the TileLang side aligned with the paired TileOps C500 compiler route. The handwritten MACA C kernel remains a reference only; the optimized path stays on the TileLang compiler and lowering route.

Summary

  • Restores the MACA intrinsics package and the root-level aliases needed by the language tests and MACA GEMM lowering.
  • Keeps the WGMMA and TCGEN05 emitter surface available from tilelang.intrinsics.
  • Preserves the layout, builtin, barrier, and codegen support used by the compiler-generated C500 hgemm path.
  • Adds a guarded WSM lowering path: unsupported WSM template contracts fall back to direct gemm_ss, while the supported path uses an explicitly sized 4-stage workspace.

Rebase update

  • Rebased onto the latest upstream dev.
  • Kept upstream tirx and reduce batch-source changes while preserving the MACA GEMM template route.
  • Addressed the WSM contract review finding by preventing unsupported template instantiations and aligning the workspace size with the template footprint.

Validation

  • Source-level MACA GEMM contract tests pass: 5 passed.
  • Production-shape hgemm sweep passed correctness on all 8 covered shapes with the paired TileOps PR.
  • Current measured throughput range: 172.076209 to 205.316929 TFLOPS.
  • Minimum A100-relative ratio in the sweep: 89.68%.
  • A representative compiler-path WSM fallback guard passed correctness.

Notes

@github-actions

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai

coderabbitai Bot commented May 20, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds MACA GEMM support end-to-end: new A/B layout factory and fragment refactor, TVM/TIR builtin intrinsics and Python/FFI bindings, device-side barrier and cp.async helpers, GEMM template updates including a WSM kernel, codegen emission and cp-async sizing, lowering helpers for permuted layouts, k-packing in MMA lowering, and tests.

Changes

MACA GEMM Architecture and Intrinsics

Layer / File(s) Summary
Combined checkpoint
src/layout/gemm_layouts.cc, src/layout/layout.h, src/layout/layout.cc, tilelang/layout/swizzle.py, tilelang/layout/__init__.py, src/op/builtin.h, src/op/builtin.cc, tilelang/language/tir/op.py, tilelang/language/tir/ir.py, tilelang/language/tir/ir.pyi, src/target/codegen_maca.cc, src/backend/maca/codegen/codegen_maca.cc, src/tl_templates/maca/barrier.h, src/tl_templates/maca/copy.h, src/tl_templates/maca/common.h, src/tl_templates/maca/gemm.h, src/tl_templates/maca/gemm_wsm.h, src/transform/lower_tile_op.cc, tilelang/maca/op/gemm/gemm_mma.py, tilelang/language/gemm_op.py, tilelang/tileop/gemm/gemm_base.py, tilelang/maca/intrinsics/layout/mma_layout.py, tilelang/maca/intrinsics/macro/mma_macro_generator.py, tilelang/contrib/mxcc.py, src/backend/*/op/reduce.cc, src/tl_templates/*/reduce.h, testing/maca/*
Introduces makeMacaGemmABLayout and refactors makeGemmFragmentCMACA; registers MACA builtin intrinsics and tl_gemm_wsm; adds device-side barrier and cp.async helpers and wrappers; adds gemm_wsm kernel and updates gemm.h (remove_swizzle, pipelined staging); implements cp-async sizing and codegen emission; generalizes permuted-layout index rewriting in lowering; adds k-pack-driven MMA lowering and optional template WSM path; adds batch-offset AllReduce helpers and run_batch_offset in templates; exposes Python/FFI wrappers; and adds MACA-specific tests and mxcc flag parsing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Suggested reviewers

  • Five-HZ

Poem

🐰 I hopped through code with tiny paws,
Rewrote layouts, fixed some laws,
Barriers hum, cp.async sings,
GEMM tiles dance on nimble wings.
K-packed kernels now take flight — hooray for clever flaws!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 19.92% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title '[MetaxGPU][feature] Add MACA GEMM compiler-path layout support' directly reflects the main change: restoring MACA compiler-path layout and intrinsics support for the GEMM lowering pipeline.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tilelang/tileop/gemm/gemm_maca_mma.py (2)

129-130: 💤 Low value

Replace lambda assignments with def statements per linter guidance.

Static analysis (E731) flagged these lambda assignments. Using def is more idiomatic Python.

Proposed fix
         if use_template and self.is_gemm_ss():
-            shared_layout_a = lambda buf: make_maca_gemm_ab_layout(buf, 1 if self.trans_A else 2)
-            shared_layout_b = lambda buf: make_maca_gemm_ab_layout(buf, 2 if self.trans_B else 1)
+            def shared_layout_a(buf):
+                return make_maca_gemm_ab_layout(buf, 1 if self.trans_A else 2)
+
+            def shared_layout_b(buf):
+                return make_maca_gemm_ab_layout(buf, 2 if self.trans_B else 1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/tileop/gemm/gemm_maca_mma.py` around lines 129 - 130, The two lambda
assignments shared_layout_a and shared_layout_b should be replaced with proper
function definitions to satisfy the linter (E731); define functions (e.g. def
shared_layout_a(buf): return make_maca_gemm_ab_layout(buf, 1 if self.trans_A
else 2) and def shared_layout_b(buf): return make_maca_gemm_ab_layout(buf, 2 if
self.trans_B else 1)) so callers remain unchanged and references to
make_maca_gemm_ab_layout, self.trans_A and self.trans_B are preserved.

28-35: 💤 Low value

Consider wrapping the int conversion for a clearer error message.

If TILELANG_MACA_GEMM_K_PACK contains non-numeric content, int(value) raises a generic ValueError. Wrapping would provide a clearer diagnostic.

Proposed fix
 def _get_maca_gemm_k_pack(default: int = 1) -> int:
     value = os.environ.get("TILELANG_MACA_GEMM_K_PACK")
     if value is None:
         return default
-    k_pack = int(value)
+    try:
+        k_pack = int(value)
+    except ValueError:
+        raise ValueError(
+            f"TILELANG_MACA_GEMM_K_PACK must be an integer, got {value!r}"
+        ) from None
     if k_pack < 1:
         raise ValueError(f"TILELANG_MACA_GEMM_K_PACK must be >= 1, got {k_pack}")
     return k_pack
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/tileop/gemm/gemm_maca_mma.py` around lines 28 - 35, The integer
conversion in _get_maca_gemm_k_pack currently calls int(value) directly so
non-numeric env values raise a generic ValueError; wrap the conversion in a
try/except around int(value) (or name it raw and then parse) and on exception
raise a clearer ValueError that includes the env var name
TILELANG_MACA_GEMM_K_PACK and the offending value (optionally include the
original exception message) before the existing k_pack < 1 check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/tl_templates/maca/barrier.h`:
- Around line 42-47: The inline asm in the synchronization primitives
(mbarrier_init, mbarrier_try_wait, mbarrier_wait, mbarrier_arrive (both
overloads), mbarrier_expect_tx, mbarrier_arrive_expect_tx,
mbarrier_cp_async_arrive, mbarrier_cp_async_arrive_noinc, fence_proxy_async,
fence_barrier_init) must declare the "memory" clobber so the compiler cannot
reorder loads/stores across these asm volatile blocks; update each asm
volatile(...) invocation to include "memory" in the clobber list (preserving
existing input/output constraints such as the "r" operands and any existing
clobbers) so the asm acts as a compiler-level memory barrier.

---

Nitpick comments:
In `@tilelang/tileop/gemm/gemm_maca_mma.py`:
- Around line 129-130: The two lambda assignments shared_layout_a and
shared_layout_b should be replaced with proper function definitions to satisfy
the linter (E731); define functions (e.g. def shared_layout_a(buf): return
make_maca_gemm_ab_layout(buf, 1 if self.trans_A else 2) and def
shared_layout_b(buf): return make_maca_gemm_ab_layout(buf, 2 if self.trans_B
else 1)) so callers remain unchanged and references to make_maca_gemm_ab_layout,
self.trans_A and self.trans_B are preserved.
- Around line 28-35: The integer conversion in _get_maca_gemm_k_pack currently
calls int(value) directly so non-numeric env values raise a generic ValueError;
wrap the conversion in a try/except around int(value) (or name it raw and then
parse) and on exception raise a clearer ValueError that includes the env var
name TILELANG_MACA_GEMM_K_PACK and the offending value (optionally include the
original exception message) before the existing k_pack < 1 check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a73a170a-a045-4edd-b421-4eab667c7d4e

📥 Commits

Reviewing files that changed from the base of the PR and between c8b4cae and f884cc8.

📒 Files selected for processing (19)
  • src/layout/gemm_layouts.cc
  • src/layout/layout.cc
  • src/layout/layout.h
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/target/codegen_maca.cc
  • src/tl_templates/maca/barrier.h
  • src/tl_templates/maca/common.h
  • src/tl_templates/maca/copy.h
  • src/tl_templates/maca/gemm.h
  • testing/maca/language/test_tilelang_language_access_ptr_codegen.py
  • testing/maca/tilelibrary/test_tilelang_maca_gemm_template_contract.py
  • tilelang/contrib/mxcc.py
  • tilelang/language/tir/ir.py
  • tilelang/language/tir/ir.pyi
  • tilelang/language/tir/op.py
  • tilelang/layout/__init__.py
  • tilelang/layout/swizzle.py
  • tilelang/tileop/gemm/gemm_maca_mma.py

Comment thread src/tl_templates/maca/barrier.h
@VitalyAnkh VitalyAnkh force-pushed the clean-hgemm-164 branch 2 times, most recently from f1a3a0d to c23ba70 Compare May 20, 2026 11:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/maca/op/gemm/gemm_mma.py (1)

306-318: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: T.clear(C_buf) inside the loop will discard accumulated results.

In _gemm_srr, the T.clear(C_buf) is placed inside the for ki loop. This means the accumulator will be cleared on every iteration, discarding all previously computed partial products. This differs from _gemm_ssr and _gemm_rsr where the clear is correctly placed before the loop.

Proposed fix
             `@T.prim_func`
             def _gemm_srr() -> None:
                 ...
                 A_local = T.alloc_local((warp_rows * local_size_a * k_pack), in_dtype)
 
+                if clear_accum:
+                    T.clear(C_buf)
                 for ki in T.serial(0, (block_K // macro_size_k)):
-                    if clear_accum:
-                        T.clear(C_buf)
                     # Load A into fragment
                     mma_emitter.ldmatrix_a(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/maca/op/gemm/gemm_mma.py` around lines 306 - 318, The accumulator
clear call is inside the ki loop in _gemm_srr which resets C_buf every iteration
and discards partial products; move the conditional clear_accum / T.clear(C_buf)
out of the for ki in T.serial(0, (block_K // macro_size_k)) loop so C_buf is
cleared once before the loop begins (preserve the existing clear_accum boolean
check), leaving the subsequent mma_emitter.ldmatrix_a(...) and
mma_emitter.mma(...) calls unchanged.
🧹 Nitpick comments (5)
tilelang/intrinsics/tcgen05_macro_generator.py (1)

278-304: 💤 Low value

Consider extracting duplicated access_ptr_from helper.

The access_ptr_from function is duplicated verbatim in both tcgen05mma_ss and tcgen05mma_ts. Extract it as a class method or module-level function to reduce duplication.

Also applies to: 473-499

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/intrinsics/tcgen05_macro_generator.py` around lines 278 - 304, The
duplicated helper access_ptr_from (used in tcgen05mma_ss and tcgen05mma_ts)
should be extracted to a single module-level function (or a shared class method)
so both generators call the same implementation; create a top-level def
access_ptr_from(buffer_or_load_or_region, access_type="r") that preserves the
existing logic (Buffer, BufferLoad handling with offset/stride computation,
BufferRegion handling, and the same error raising), replace the duplicated
blocks in tcgen05mma_ss and tcgen05mma_ts with calls to this new function, and
ensure any local names (Buffer, BufferLoad, BufferRegion, tvm) remain in scope
or are imported so behavior is unchanged.
tilelang/intrinsics/mma_sp_layout.py (1)

147-154: 💤 Low value

Type hint for dtype is inconsistent with implementation.

The type hint dtype: Literal["float16", "int8"] doesn't match the actual implementation which handles dtype_bits == 32 (line 157). Consider updating the type hint to include "float32" or use a more general type.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/intrinsics/mma_sp_layout.py` around lines 147 - 154, The dtype type
hint on get_ldmatrix_offset_b is incorrect for the implemented logic: update the
annotation for the dtype parameter in get_ldmatrix_offset_b to include "float32"
(e.g., Literal["float16","int8","float32"]) or replace the Literal with a
broader type (like str or an enum) so it matches the branch that checks for
dtype_bits == 32 and the float32 handling in the function body.
tilelang/intrinsics/wgmma_macro_generator.py (2)

500-502: 💤 Low value

Simplify redundant boolean logic.

The current logic with [False] prepended and any() is equivalent to just checking not transposed. The [False] has no effect on the any() result.

Proposed simplification
-        is_sr_conditions = [False]
-        is_sr_conditions.append(not transposed)
-        is_sr_axis_order = any(is_sr_conditions)
+        is_sr_axis_order = not transposed
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/intrinsics/wgmma_macro_generator.py` around lines 500 - 502, Replace
the redundant construction of is_sr_conditions and the any() call with a direct
boolean based on transposed: the current block that creates is_sr_conditions =
[False]; is_sr_conditions.append(not transposed); is_sr_axis_order =
any(is_sr_conditions) should be simplified to set is_sr_axis_order directly from
not transposed (i.e., is_sr_axis_order = not transposed) so remove the temporary
list and any() usage; update any dependent logic/comments in the function
wgmma_macro_generator.py to reflect the simplified variable.

165-176: 💤 Low value

Typo in method name: _determinate_swizzle_mode should be _determine_swizzle_mode.

Proposed fix
-    def _determinate_swizzle_mode(self, buffer: Buffer, layout: Layout) -> SwizzleMode:
+    def _determine_swizzle_mode(self, buffer: Buffer, layout: Layout) -> SwizzleMode:

Also update the call sites at lines 202, 203, and 363.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/intrinsics/wgmma_macro_generator.py` around lines 165 - 176, Rename
the misspelled method `_determinate_swizzle_mode` to `_determine_swizzle_mode`
and update every call site that references `_determinate_swizzle_mode` to use
`_determine_swizzle_mode` instead; specifically change the function definition
and all invocations (e.g., the calls currently referencing
`_determinate_swizzle_mode` around where swizzle mode is resolved) to the
corrected name so references (and any imports/uses within the same module)
remain consistent and tests/imports continue to work.
tilelang/intrinsics/wmma_macro_generator.py (1)

196-212: 💤 Low value

Unused macro parameter A_shared_buf.

The inner macro _warp_ldmatrix_a declares A_shared_buf as a parameter but never uses it. The actual buffer access uses A_buf from the outer scope. Consider removing the unused parameter or using it consistently.

Proposed fix
         `@T.macro`
-        def _warp_ldmatrix_a(A_local_buf, A_shared_buf, ki, thread_binding, rk=0):
+        def _warp_ldmatrix_a(A_local_buf, ki, thread_binding, rk=0):
             tx, _, warp_m = self.extract_thread_binding(thread_binding)
             ...
 
-        return _warp_ldmatrix_a(A_local_buf, A_shared_buf, ki, thread_binding, rk)
+        return _warp_ldmatrix_a(A_local_buf, ki, thread_binding, rk)

Same applies to _warp_ldmatrix_b at lines 236-251.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/intrinsics/wmma_macro_generator.py` around lines 196 - 212, The
macro parameter A_shared_buf in _warp_ldmatrix_a is unused (the body reads A_buf
from outer scope); either remove A_shared_buf from the macro signature and from
its invocation (return line) or change the macro body to consistently use
A_shared_buf instead of A_buf; apply the same fix for the symmetric macro
_warp_ldmatrix_b (remove unused B_shared_buf or switch uses to B_shared_buf) and
update any calls to _warp_ldmatrix_a/_warp_ldmatrix_b accordingly so signatures
match their invocations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tilelang/intrinsics/maca_mma_macro_generator.py`:
- Around line 500-502: The ValueError in the else branch incorrectly states
"k_dim must be 0" while valid k_dim values handled are 4, 16, and 32; update the
raise in tilelang/intrinsics/maca_mma_macro_generator.py (the branch that
currently raises ValueError(f"k_dim must be 0 currently but got {k_dim}")) to a
clearer message that lists the allowed values and shows the actual value, e.g.
raise ValueError(f"k_dim must be one of {{4, 16, 32}} but got {k_dim}").
- Line 782: Remove the debugging print statement that emits kernel compilation
noise: delete the call to print(self.a_preshuffle) in the MacaMmaMacroGenerator
(or the method where self.a_preshuffle is used) so stdout is not polluted during
kernel compilation; ensure no other side-effects rely on that print and run
tests to confirm behavior unchanged.

In `@tilelang/intrinsics/mfma_layout.py`:
- Around line 113-128: The function shared_16x64_to_local_64x16_layout_B is a
copy of the _A variant and uses the wrong index mapping; replace its body so it
follows the established _B pattern: compute thread_id as j + (i // 16) * 16 and
local as i % 16, then return thread_id, local (update
shared_16x64_to_local_64x16_layout_B accordingly to match the intended mapping
and be consistent with shared_16x64_to_local_64x16_layout_A and other *_B
variants).

In `@tilelang/intrinsics/mfma_macro_generator.py`:
- Line 835: Remove the leftover debug print that emits self.a_preshuffle in
mfma_macro_generator.py: locate the print(self.a_preshuffle) call (inside the
MFMA macro generation code where self.a_preshuffle is referenced) and delete it;
if silent diagnostics are needed instead, replace it with a debug-level logging
call (e.g., using the module/class logger) rather than printing to stdout.

In `@tilelang/intrinsics/tcgen05_macro_generator.py`:
- Around line 22-27: The SwizzleMode IntEnum currently has incorrect numeric
mappings; update the SwizzleMode enum in tcgen05_macro_generator.py so the
members match the TCGen05 PTX spec: set NONE = 0, SWIZZLE_32B = 1, SWIZZLE_64B =
2, and SWIZZLE_128B = 3 (these correspond to bits 61–63 of the shared-memory
descriptor). Modify the class SwizzleMode to use these exact integer values so
code referencing SwizzleMode uses the correct PTX swizzle-mode encodings.

In `@tilelang/intrinsics/utils.py`:
- Around line 96-116: The function get_mma_micro_size currently types its dtype
as Literal["float16", "int8"] but the implementation and docstring also accept
"float8_e4m3" and "float8_e5m2"; update the type hint on get_mma_micro_size to
include those FP8 literals (e.g.,
Literal["float16","int8","float8_e4m3","float8_e5m2"]) or broaden to
str/Union[...] so the annotation matches the handled values, and ensure the
docstring stays consistent with the new annotation.

---

Outside diff comments:
In `@tilelang/maca/op/gemm/gemm_mma.py`:
- Around line 306-318: The accumulator clear call is inside the ki loop in
_gemm_srr which resets C_buf every iteration and discards partial products; move
the conditional clear_accum / T.clear(C_buf) out of the for ki in T.serial(0,
(block_K // macro_size_k)) loop so C_buf is cleared once before the loop begins
(preserve the existing clear_accum boolean check), leaving the subsequent
mma_emitter.ldmatrix_a(...) and mma_emitter.mma(...) calls unchanged.

---

Nitpick comments:
In `@tilelang/intrinsics/mma_sp_layout.py`:
- Around line 147-154: The dtype type hint on get_ldmatrix_offset_b is incorrect
for the implemented logic: update the annotation for the dtype parameter in
get_ldmatrix_offset_b to include "float32" (e.g.,
Literal["float16","int8","float32"]) or replace the Literal with a broader type
(like str or an enum) so it matches the branch that checks for dtype_bits == 32
and the float32 handling in the function body.

In `@tilelang/intrinsics/tcgen05_macro_generator.py`:
- Around line 278-304: The duplicated helper access_ptr_from (used in
tcgen05mma_ss and tcgen05mma_ts) should be extracted to a single module-level
function (or a shared class method) so both generators call the same
implementation; create a top-level def access_ptr_from(buffer_or_load_or_region,
access_type="r") that preserves the existing logic (Buffer, BufferLoad handling
with offset/stride computation, BufferRegion handling, and the same error
raising), replace the duplicated blocks in tcgen05mma_ss and tcgen05mma_ts with
calls to this new function, and ensure any local names (Buffer, BufferLoad,
BufferRegion, tvm) remain in scope or are imported so behavior is unchanged.

In `@tilelang/intrinsics/wgmma_macro_generator.py`:
- Around line 500-502: Replace the redundant construction of is_sr_conditions
and the any() call with a direct boolean based on transposed: the current block
that creates is_sr_conditions = [False]; is_sr_conditions.append(not
transposed); is_sr_axis_order = any(is_sr_conditions) should be simplified to
set is_sr_axis_order directly from not transposed (i.e., is_sr_axis_order = not
transposed) so remove the temporary list and any() usage; update any dependent
logic/comments in the function wgmma_macro_generator.py to reflect the
simplified variable.
- Around line 165-176: Rename the misspelled method `_determinate_swizzle_mode`
to `_determine_swizzle_mode` and update every call site that references
`_determinate_swizzle_mode` to use `_determine_swizzle_mode` instead;
specifically change the function definition and all invocations (e.g., the calls
currently referencing `_determinate_swizzle_mode` around where swizzle mode is
resolved) to the corrected name so references (and any imports/uses within the
same module) remain consistent and tests/imports continue to work.

In `@tilelang/intrinsics/wmma_macro_generator.py`:
- Around line 196-212: The macro parameter A_shared_buf in _warp_ldmatrix_a is
unused (the body reads A_buf from outer scope); either remove A_shared_buf from
the macro signature and from its invocation (return line) or change the macro
body to consistently use A_shared_buf instead of A_buf; apply the same fix for
the symmetric macro _warp_ldmatrix_b (remove unused B_shared_buf or switch uses
to B_shared_buf) and update any calls to _warp_ldmatrix_a/_warp_ldmatrix_b
accordingly so signatures match their invocations.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 05f94a50-4354-446b-99ad-ef5a91391483

📥 Commits

Reviewing files that changed from the base of the PR and between 877b80c and c23ba70.

📒 Files selected for processing (22)
  • src/target/codegen_maca.cc
  • src/tl_templates/maca/barrier.h
  • src/tl_templates/maca/copy.h
  • src/tl_templates/maca/gemm.h
  • testing/maca/tilelibrary/test_tilelang_maca_gemm_template_contract.py
  • tilelang/intrinsics/__init__.py
  • tilelang/intrinsics/maca_mma_macro_generator.py
  • tilelang/intrinsics/mfma_layout.py
  • tilelang/intrinsics/mfma_macro_generator.py
  • tilelang/intrinsics/mma_layout.py
  • tilelang/intrinsics/mma_macro_generator.py
  • tilelang/intrinsics/mma_sm70_layout.py
  • tilelang/intrinsics/mma_sm70_macro_generator.py
  • tilelang/intrinsics/mma_sp_layout.py
  • tilelang/intrinsics/mma_sp_macro_generator.py
  • tilelang/intrinsics/tcgen05_macro_generator.py
  • tilelang/intrinsics/utils.py
  • tilelang/intrinsics/wgmma_macro_generator.py
  • tilelang/intrinsics/wmma_layout.py
  • tilelang/intrinsics/wmma_macro_generator.py
  • tilelang/layout/swizzle.py
  • tilelang/maca/op/gemm/gemm_mma.py

Comment thread tilelang/maca/intrinsics/macro/mma_macro_generator.py
Comment thread tilelang/intrinsics/maca_mma_macro_generator.py Outdated
Comment thread tilelang/intrinsics/mfma_layout.py
Comment thread tilelang/intrinsics/mfma_macro_generator.py Outdated
Comment thread tilelang/intrinsics/tcgen05_macro_generator.py Outdated
Comment thread tilelang/intrinsics/utils.py Outdated
@VitalyAnkh VitalyAnkh force-pushed the clean-hgemm-164 branch 3 times, most recently from e8a7191 to 17a2e8f Compare May 20, 2026 13:09

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (1)
tilelang/contrib/mxcc.py (1)

20-44: ⚡ Quick win

Document the TILELANG_MXCC_FLAGS environment variable.

The function docstring doesn't mention that additional compiler flags can be injected via the TILELANG_MXCC_FLAGS environment variable. Users won't discover this feature without reading the implementation.

📝 Proposed docstring update
 def compile_maca(code, target_format="mcbin", arch=None, options=None, path_target=None, verbose=False):
     """Compile maca code with MXCC from env.
 
     Parameters
     ----------
     code : str
         The maca code.
 
     target_format : str
         The target format of mxcc compiler.
 
     arch : str
         The maca architecture.
 
     options : str or list of str
         The additional options.
 
     path_target : str, optional
         Output file.
 
+    Environment Variables
+    ---------------------
+    TILELANG_MXCC_FLAGS : str, optional
+        Additional MXCC command-line flags, parsed with shell-like syntax.
+        These flags are appended after explicit options but before output specs.
+
     Return
     ------
     cubin : bytearray
         The bytearray of the fatbin
     """
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/contrib/mxcc.py` around lines 20 - 44, The docstring for
compile_maca is missing documentation for the TILELANG_MXCC_FLAGS environment
variable; update the compile_maca function docstring to mention that additional
MXCC compiler flags can be supplied via the TILELANG_MXCC_FLAGS environment
variable (e.g., a space-separated string or list-equivalent), describe its
effect on the options passed to mxcc, and include an example or note about
precedence/format so users can discover and use this feature.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/target/codegen_maca.cc`:
- Around line 1666-1685: The code emits PrintExpr(op->args[2]) as a template
parameter for tl::cp_async_gs, which can produce invalid C++ if the size isn't a
compile-time IntImm; update the ptx_cp_async handling (the block guarded by
op->op.same_as(builtin::ptx_cp_async())) to first normalize/validate op->args[2]
as an IntImm using the same helper used elsewhere (call
GetTileLangCPAsyncTransferBytes on op->args[2] to obtain a validated integer
byte count), then use that integer value (as a literal string) when emitting
both tl::cp_async_gs<...>(...) and tl::cp_async_gs_conditional<...>(...), and
emit a clear error/check if GetTileLangCPAsyncTransferBytes fails; apply the
same pattern consistently wherever ptx_cp_async is handled (e.g., the other
backends).

In `@src/tl_templates/maca/barrier.h`:
- Around line 68-75: The inline assembly in the __forceinline__ barrier helper
defines non-unique labels LAB_WAIT and DONE which will clash when the function
is instantiated in multiple translation units; update the labels inside the asm
block (the ones referenced by the branch instructions in the mbarrier loop) to
use unique local labels such as LAB_WAIT_%= and DONE_%= (or switch to numeric
local labels like 1: with 1b/1f references) so each inlined instance gets
distinct label names and avoids duplicate-label assembler errors.

In `@src/tl_templates/maca/gemm_wsm.h`:
- Around line 76-83: The code always seeds C_f32 from accum, ignoring the
template flag clear_accum; change the logic in the gemm_ss_wsm routine so that
before the current loop that assigns C_f32 from accum you check clear_accum and
if true initialize C_f32 to zero (or skip loading accum entirely), otherwise
proceed to load accum into C_f32 as currently done; reference the C_f32 array,
accum pointer, and the clear_accum template parameter to implement the
conditional initialization so first-use outputs are not contaminated by stale
fragments.

In `@src/tl_templates/maca/gemm.h`:
- Around line 96-100: The remove_swizzle helper is checking sizeof(A_type) but
is used for the B fragment (e.g., called with tCrB.layout()), which strips the
wrong layout when A and B have different widths; change the constexpr condition
to inspect sizeof(B_type) instead of sizeof(A_type) so that
ComposedLayout<Args...> const &layout returns layout.layout_b() only when B_type
is 2 bytes (and otherwise returns layout), updating the remove_swizzle
implementation accordingly to reference B_type for the swizzle decision so
gemm() receives the correct B layout.

In `@testing/maca/language/test_tilelang_language_access_ptr_codegen.py`:
- Around line 168-169: The two MACA tests (e.g., the test function
test_maca_bsm_intrinsics_codegen and the similar test at lines ~206-207) are
unconditionally running; gate them with the project’s MACA availability marker
so they are skipped when MACA isn’t present. Locate the test definitions
(test_maca_bsm_intrinsics_codegen and the other MACA-targeted test) and add the
same pytest marker used elsewhere in this file (e.g., the MACA availability
skip/marker) so the tests are automatically skipped when the project’s MACA
flag/marker indicates MACA is unavailable.

In `@tilelang/contrib/mxcc.py`:
- Around line 92-94: Handle malformed TILELANG_MXCC_FLAGS by catching
shlex.split ValueError around the call that appends to cmd: when reading
extra_env_flags and calling shlex.split(extra_env_flags) (the code that mutates
cmd using variables cmd and extra_env_flags), wrap the split in try/except
ValueError and raise a clear, user-facing error (or exit with a descriptive
message) that includes the original exception text and the offending
extra_env_flags value so users see "malformed TILELANG_MXCC_FLAGS: <details>"
instead of a raw traceback.

In `@tilelang/language/gemm_op.py`:
- Around line 159-160: The gemm function added a new public parameter
annotations but the docstring Args: section for gemm does not mention it; update
the gemm docstring to document the annotations parameter (type, purpose,
default/optional behavior) alongside the existing arguments—reference the gemm
function and the annotations parameter in your description so users know what
values (e.g., dict | None) are expected and how annotations affect behavior;
keep the style consistent with the other Args entries in the existing docstring.

In `@tilelang/tileop/gemm/gemm_base.py`:
- Around line 168-170: The annotations property can return None and cause
callers like self.annotations.get(...) to crash; change the property
(annotations) to always return a dict by retrieving getattr(self.gemm_node,
"annotations", None) and normalizing it to an empty dict when falsy (or
converting to dict if needed) so callers can safely call .get() on the result.

---

Nitpick comments:
In `@tilelang/contrib/mxcc.py`:
- Around line 20-44: The docstring for compile_maca is missing documentation for
the TILELANG_MXCC_FLAGS environment variable; update the compile_maca function
docstring to mention that additional MXCC compiler flags can be supplied via the
TILELANG_MXCC_FLAGS environment variable (e.g., a space-separated string or
list-equivalent), describe its effect on the options passed to mxcc, and include
an example or note about precedence/format so users can discover and use this
feature.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e95d2cba-a6d9-4555-bb17-7bf0713dd149

📥 Commits

Reviewing files that changed from the base of the PR and between c23ba70 and e8a7191.

📒 Files selected for processing (26)
  • src/backend/maca/codegen/codegen_maca.cc
  • src/layout/gemm_layouts.cc
  • src/layout/layout.cc
  • src/layout/layout.h
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/target/codegen_maca.cc
  • src/tl_templates/maca/barrier.h
  • src/tl_templates/maca/common.h
  • src/tl_templates/maca/copy.h
  • src/tl_templates/maca/gemm.h
  • src/tl_templates/maca/gemm_wsm.h
  • src/transform/lower_tile_op.cc
  • testing/maca/language/test_tilelang_language_access_ptr_codegen.py
  • testing/maca/tilelibrary/test_tilelang_maca_gemm_template_contract.py
  • tilelang/contrib/mxcc.py
  • tilelang/language/gemm_op.py
  • tilelang/language/tir/ir.py
  • tilelang/language/tir/ir.pyi
  • tilelang/language/tir/op.py
  • tilelang/layout/__init__.py
  • tilelang/layout/swizzle.py
  • tilelang/maca/intrinsics/layout/mma_layout.py
  • tilelang/maca/intrinsics/macro/mma_macro_generator.py
  • tilelang/maca/op/gemm/gemm_mma.py
  • tilelang/tileop/gemm/gemm_base.py
✅ Files skipped from review due to trivial changes (1)
  • src/tl_templates/maca/common.h

Comment thread src/target/codegen_maca.cc
Comment thread src/tl_templates/maca/barrier.h
Comment thread src/tl_templates/maca/gemm_wsm.h
Comment thread src/tl_templates/maca/gemm.h Outdated
Comment thread testing/maca/language/test_tilelang_language_access_ptr_codegen.py
Comment thread tilelang/contrib/mxcc.py Outdated
Comment thread tilelang/language/gemm_op.py Outdated
Comment thread tilelang/tileop/gemm/gemm_base.py Outdated
@VitalyAnkh VitalyAnkh force-pushed the clean-hgemm-164 branch 3 times, most recently from 530db0f to 6f05550 Compare May 21, 2026 08:32

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tilelang/maca/op/gemm/gemm_mma.py (2)

286-298: 💤 Low value

Document the WSM buffer size constant.

The 0x8000 (32 KB) allocation is a magic number. A named constant or brief comment would clarify why this specific size is required for the WSM workspace.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/maca/op/gemm/gemm_mma.py` around lines 286 - 298, The WSM workspace
allocation uses a magic constant (0x8000) in _gemm_ss_wsm_template which should
be clarified; replace the literal with a named constant (e.g., WSM_SIZE_BYTES or
WSM_WORKSPACE_SIZE = 0x8000) or add an inline comment explaining “32KB required
for WSM workspace per TL gemm backend” so readers understand the size choice and
update the T.alloc_shared call to use that constant (refer to WSM and
_gemm_ss_wsm_template to locate the change).

137-150: 💤 Low value

Environment variable is read in both infer_layout and lower — ensure consistency.

_get_maca_gemm_k_pack() is called at line 137 (in infer_layout) and again at line 210 (in lower). If the environment variable changes between these calls—unlikely in practice, but possible—the layout and lowering could use different k_pack values, causing mismatched buffer shapes or loop bounds. Consider caching k_pack once at construction or passing it explicitly from infer_layout to lower.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/maca/op/gemm/gemm_mma.py` around lines 137 - 150, infer_layout calls
_get_maca_gemm_k_pack(self.k_pack) and lower calls it again, risking
inconsistency if the env var changes; to fix, read and cache k_pack once (e.g.,
store in self._cached_k_pack at construction or the first call) and replace
direct calls to _get_maca_gemm_k_pack(...) in both infer_layout and lower with
the cached value, or pass the resolved k_pack from infer_layout into lower via
the operator state so both use the identical k_pack; update references in
methods named infer_layout and lower and any helper uses like
_make_maca_gemm_emitter to consume the cached/passed k_pack.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/tl_templates/maca/gemm_wsm.h`:
- Around line 11-143: The template instantiation silently permits unsupported
configurations; add compile-time guards inside gemm_ss_wsm to fail fast:
static_assert that trans_A==false and trans_B==false, num_warp_m==1 and
num_warp_n==1 (the hardwired warp layout), kPack==8 (used as K/8),
AStrideElements%8==0 (used to compute lda_vec), and that the implementation
expects a 4-stage schedule (Stage==4) / 4x4 accumulators (i.e., require any
template flags you depend on to match the hardcoded schedule); reference the
function name gemm_ss_wsm and the symbols ALdgOffset, BLdgOffset, WSM_Ldg,
lda_vec, K/8, and C_f32 when adding these static_asserts so the checks are
colocated with the hardcoded logic.

---

Nitpick comments:
In `@tilelang/maca/op/gemm/gemm_mma.py`:
- Around line 286-298: The WSM workspace allocation uses a magic constant
(0x8000) in _gemm_ss_wsm_template which should be clarified; replace the literal
with a named constant (e.g., WSM_SIZE_BYTES or WSM_WORKSPACE_SIZE = 0x8000) or
add an inline comment explaining “32KB required for WSM workspace per TL gemm
backend” so readers understand the size choice and update the T.alloc_shared
call to use that constant (refer to WSM and _gemm_ss_wsm_template to locate the
change).
- Around line 137-150: infer_layout calls _get_maca_gemm_k_pack(self.k_pack) and
lower calls it again, risking inconsistency if the env var changes; to fix, read
and cache k_pack once (e.g., store in self._cached_k_pack at construction or the
first call) and replace direct calls to _get_maca_gemm_k_pack(...) in both
infer_layout and lower with the cached value, or pass the resolved k_pack from
infer_layout into lower via the operator state so both use the identical k_pack;
update references in methods named infer_layout and lower and any helper uses
like _make_maca_gemm_emitter to consume the cached/passed k_pack.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 81ab369b-85af-4a84-84d9-bb2cbf160cc2

📥 Commits

Reviewing files that changed from the base of the PR and between e8a7191 and 530db0f.

📒 Files selected for processing (29)
  • src/backend/common/op/reduce.h
  • src/backend/cuda/op/reduce.cc
  • src/backend/maca/codegen/codegen_maca.cc
  • src/backend/maca/op/reduce.cc
  • src/backend/rocm/op/reduce.cc
  • src/layout/gemm_layouts.cc
  • src/layout/layout.cc
  • src/layout/layout.h
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/target/codegen_maca.cc
  • src/tl_templates/cuda/reduce.h
  • src/tl_templates/hip/reduce.h
  • src/tl_templates/maca/barrier.h
  • src/tl_templates/maca/copy.h
  • src/tl_templates/maca/gemm.h
  • src/tl_templates/maca/gemm_wsm.h
  • src/tl_templates/maca/reduce.h
  • src/transform/lower_tile_op.cc
  • testing/maca/language/test_tilelang_language_access_ptr_codegen.py
  • testing/maca/tilelibrary/test_tilelang_maca_gemm_template_contract.py
  • tilelang/contrib/mxcc.py
  • tilelang/language/gemm_op.py
  • tilelang/layout/__init__.py
  • tilelang/layout/swizzle.py
  • tilelang/maca/intrinsics/layout/mma_layout.py
  • tilelang/maca/intrinsics/macro/mma_macro_generator.py
  • tilelang/maca/op/gemm/gemm_mma.py
  • tilelang/tileop/gemm/gemm_base.py
✅ Files skipped from review due to trivial changes (2)
  • src/layout/layout.h
  • tilelang/maca/intrinsics/macro/mma_macro_generator.py

Comment thread src/tl_templates/maca/gemm_wsm.h
@VitalyAnkh

VitalyAnkh commented May 21, 2026

Copy link
Copy Markdown
Collaborator Author

Update: I traced the remaining MACA-3.6 failure to a stale test expectation in testing/python/language/test_tilelang_language_reduce.py. The MACA/ROCm/CUDA batched reduce lowerers intentionally emit run_batch_offset(...) via MakeBatchAllReduceOffset(...), so the test now checks run_batch_offset(...) instead of run_batch(...) for the batch>1 cases.

@Five-HZ

Five-HZ commented May 26, 2026

Copy link
Copy Markdown
Collaborator

@regression-perf

@github-actions

Copy link
Copy Markdown

Performance Regression Test Report

Triggered by: @Five-HZ
Workflow run: https://github.com/tile-ai/tilelang-metax/actions/runs/26441256554

Results

File Original Latency Current Latency Speedup
example_mla_decode 5.20567 5.21702 0.997825
example_tilelang_nsa_decode 0.00920458 0.00922229 0.99808
example_group_per_split_token_cast_to_fp8 0.0220335 0.0220747 0.998133
example_fusedmoe_tilelang 0.42456 0.424945 0.999096
example_tilelang_sparse_gqa_decode_varlen_mask 0.060374 0.0604147 0.999326
example_convolution_autotune 10.0543 10.0609 0.99935
example_mhc_post 0.163303 0.16339 0.999466
example_gemv 0.595079 0.595356 0.999536
example_tilelang_nsa_fwd 0.0106652 0.0106692 0.999623
example_tilelang_gemm_splitk 4.50126 4.50252 0.999721
example_tilelang_gemm_splitk_vectorize_atomicadd 4.50349 4.50436 0.999808
example_convolution 3.06745 3.0679 0.999853
example_mhc_pre 0.85 0.85008 0.999906
example_topk 0.0343541 0.0343543 0.999994
example_tilelang_block_sparse_attn 0.0245258 0.0245216 1.00017
example_tilelang_sparse_gqa_decode_varlen_indice 0.0375994 0.0375901 1.00025
example_per_token_cast_to_fp8 0.015333 0.0153054 1.0018
example_blocksparse_gemm 0.100857 0.100663 1.00192
block_sparse_attn_tilelang 0.0240627 0.024015 1.00199
example_elementwise_add 0.154497 0.15312 1.00899

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@ventijing

Copy link
Copy Markdown
Collaborator

@regression-perf

@VitalyAnkh VitalyAnkh force-pushed the clean-hgemm-164 branch 2 times, most recently from 7b0c0a7 to da1471d Compare June 4, 2026 21:00
Expose MACA GEMM A/B and C fragment layouts to Python and wire the dense TileLang GEMM template path through MACA-specific layouts.

Teach MACA codegen to emit tl_gemm calls and derive tl.ptx_cp_async byte widths from access-pointer element types, then add the template header pieces needed by the compiler path.

Validation: git diff --cached --check; ./.venv/bin/python -m py_compile tilelang/layout/__init__.py tilelang/layout/swizzle.py tilelang/tileop/gemm/gemm_maca_mma.py testing/maca/tilelibrary/test_tilelang_maca_gemm_template_contract.py testing/maca/language/test_tilelang_language_access_ptr_codegen.py
@ventijing

Copy link
Copy Markdown
Collaborator

@regression-perf

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

Performance Regression Test Report

Triggered by: @ventijing
Workflow run: https://github.com/tile-ai/tilelang-metax/actions/runs/27008443327

Results

File Original Latency Current Latency Speedup
example_mha_sink_bwd_bhsd_sliding_window 0.897085 1.01439 0.884363
example_gqa_sink_bwd_bhsd 0.202594 0.228141 0.888023
example_elementwise_add 0.152906 0.153654 0.995134
example_blocksparse_gemm 0.0975296 0.0979488 0.99572
example_tilelang_nsa_fwd 0.0111904 0.0112258 0.996849
example_gemm_autotune 0.0567091 0.0568491 0.997538
example_mha_bwd_bshd 0.499893 0.500889 0.998011
example_tilelang_nsa_decode 0.00943447 0.00944995 0.998362
example_mha_bwd_bhsd 0.479775 0.480454 0.998587
example_group_per_split_token_cast_to_fp8 0.0226073 0.0226213 0.999379
example_topk 0.0343284 0.0343468 0.999465
example_fusedmoe_tilelang 0.42742 0.427592 0.999598
example_mhc_pre 0.84892 0.849248 0.999614
example_tilelang_gemm_splitk 4.49813 4.49983 0.999621
example_mha_fwd_bshd 0.23063 0.230682 0.999772
example_gqa_decode 0.238383 0.238435 0.999781
example_mha_sink_fwd_bhsd 0.109592 0.109608 0.999858
example_per_token_cast_to_fp8 0.0155742 0.0155762 0.999867
example_mla_decode 1.60495 1.60511 0.999903
example_mha_sink_fwd_bhsd_sliding_window 0.0937527 0.0937614 0.999907
example_gemm 0.121945 0.121955 0.999917
example_mha_inference 0.344391 0.344418 0.999923
example_tilelang_gemm_splitk_vectorize_atomicadd 4.50054 4.50057 0.999995
example_gemm_intrinsics 0.0889883 0.0889808 1.00008
example_mha_fwd_varlen 0.529413 0.529368 1.00009
example_convolution_autotune 9.94512 9.94421 1.00009
example_gqa_bwd 0.834074 0.833936 1.00017
example_convolution 3.10785 3.10725 1.00019
example_tilelang_sparse_gqa_decode_varlen_indice 0.0290714 0.0290653 1.00021
block_sparse_attn_tilelang 0.0240057 0.0239987 1.00029
example_gemv 0.589107 0.5889 1.00035
example_gqa_fwd_bshd 1.40891 1.40786 1.00075
example_linear_attn_fwd 0.191473 0.191287 1.00097
example_tilelang_block_sparse_attn 0.0245623 0.0245371 1.00102
example_mhc_post 0.163138 0.162926 1.0013
example_tilelang_sparse_gqa_decode_varlen_mask 0.059703 0.0596097 1.00157
example_mha_fwd_bhsd 0.0494599 0.0493702 1.00182
example_linear_attn_bwd 0.828576 0.820832 1.00944
example_gqa_sink_bwd_bhsd_sliding_window 0.0955768 0.0943757 1.01273
example_gqa_bwd_tma_reduce_varlen 1.44662 1.41724 1.02073
example_mha_sink_bwd_bhsd 1.62801 1.55977 1.04375

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@VitalyAnkh

Copy link
Copy Markdown
Collaborator Author

Updated the branch with a targeted fix for the attention-sink backward regression.

Root cause: standalone tl::access_ptr / address_of layout remapping was also catching global-memory AtomicAdd destinations. For the affected backward kernels, that re-expanded an already valid logical global index into extra carry arithmetic in the generated dQ address. Global-memory access pointers now stay on the normal recursive BufferLoad path, while the special remap path remains in place for shared/non-global pointers.

Validation:

  • example_mha_sink_bwd_bhsd_sliding_window: 0.910 ms, back to the pre-regression range.
  • example_gqa_sink_bwd_bhsd: 0.234 ms, back to the pre-regression range.
  • Adjacent attention-sink backward cases also compiled and ran successfully.
  • Added MACA codegen coverage that checks the generated global AtomicAdd address form directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants