[MetaxGPU][testing] fix the maca async copy oob defect#111
Conversation
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
📝 WalkthroughWalkthroughThis PR extends TileLang's MACA async copy support by relaxing barrier argument requirements in codegen, adding flexible barrier-type annotation tracking during buffer flattening, adapting memory access legalization to MACA-specific argument patterns, implementing a vectorization optimization that fuses loops into single memcpy_async operations, and validating all changes with comprehensive MACA-targeted tests. ChangesMACA Async Copy Vectorization
🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
testing/python/language/test_tilelang_language_access_ptr_codegen.py (1)
8-9:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winSystemic decorator-target mismatch across all test functions.
All seven test functions in this file (
test_access_ptr_cp_async_codegen,test_vectorized_cp_async_num_elems_codegen,test_vectorized_int4_cp_async_num_elems_codegen,test_async_copy_tileop_lowers_to_cp_async,test_async_copy_tileop_rejects_invalid_cp_async_scope,test_parallel_simt_copy_respects_enable_async_copy_config, andtest_async_copy_oob_lowers_to_predicated_cp_async_without_wait) are decorated with@tilelang.testing.requires_cudabut compile fortarget="maca". This decorator-target mismatch will cause all tests to skip on MACA-only systems and potentially fail on CUDA-only systems where MACA is unavailable. The shared root cause is that the test requirements were not updated when retargeting from CUDA to MACA. Replace@tilelang.testing.requires_cudawith the appropriate MACA availability decorator (e.g.,@tilelang.testing.requires_maca) across all affected test functions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py` around lines 8 - 9, Replace the incorrect CUDA availability decorator with the MACA one across the tests: for each test function named test_access_ptr_cp_async_codegen, test_vectorized_cp_async_num_elems_codegen, test_vectorized_int4_cp_async_num_elems_codegen, test_async_copy_tileop_lowers_to_cp_async, test_async_copy_tileop_rejects_invalid_cp_async_scope, test_parallel_simt_copy_respects_enable_async_copy_config, and test_async_copy_oob_lowers_to_predicated_cp_async_without_wait, change the decorator from `@tilelang.testing.requires_cuda` to the MACA availability decorator (e.g., `@tilelang.testing.requires_maca`) so the tests target="maca" run only when MACA is available.src/transform/legalize_safe_memory_access.cc (1)
558-603:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftTwo issues in RewriteCPAsync for MACA.
Weak argument-count check (line 564): Same issue as
GetCPAsyncSourceBuffer—MACA requires at least 3 arguments (dst, src, bytes), not 2.Lost
mbarargument (lines 589-598): Whensafe_value == 0, the code constructs a predicated call from[args[0], args[1], args[2]]and appends the combined predicate. For MACA, this produces[dst, src, bytes, predicate], but the originalmbaratargs[3]is dropped. The correct predicated MACA layout per Context snippet 2 is[dst, src, bytes, mbar, predicate].MACA may not support hardware-predicated copies the same way CUDA does. Consider skipping the predicated path entirely for MACA and always using the
IfThenElsewrapper (lines 600-602).♻️ Proposed fixes
Fix 1: Restore minimum argument count
- ICHECK_GE(call->args.size(), 2U) - << "cp.async expects at least 2 arguments, but got " << call->args; + ICHECK_GE(call->args.size(), 3U) + << "cp.async expects at least 3 arguments (dst, src, size), but got " << call->args;Fix 2: Disable predicated path for MACA
// Predicated cp.async zero-fills on the false path... - if (analyzer_->CanProveEqual(safe_value, make_zero(dst_dtype))) { + if (call->op != tl::maca_memcpy_async() && + analyzer_->CanProveEqual(safe_value, make_zero(dst_dtype))) { PrimExpr predicate = existing_predicate.defined()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/transform/legalize_safe_memory_access.cc` around lines 558 - 603, In RewriteCPAsync: (1) tighten the argument-count check to require at least 3 args (ICHECK_GE(..., 3U)) to match GetCPAsyncSourceBuffer's expectation (dst, src, bytes). (2) Avoid dropping the mbar argument for MACA by skipping the short-circuit predicated-call path when the original call carries an mbar (i.e., args.size() > 3 / mbar present) or when the op is MACA; in that case always return the IfThenElse wrapper (keep BufferStore fallback). Ensure any code that builds the predicated Call (the Array new_args{call->args[0], call->args[1], call->args[2]} + predicate) is only executed for non-MACA calls so args[3] (mbar) is not lost.
🧹 Nitpick comments (9)
testing/python/language/test_tilelang_language_access_ptr_codegen.py (9)
169-170: 💤 Low valueUpdate function name and docstring to reference MACA.
The function name mentions "lowers_to_predicated_cp_async_without_wait" and the docstring mentions "cp.async", but the test validates MACA conditional copy behavior.
📝 Suggested rename
-def test_async_copy_oob_lowers_to_predicated_cp_async_without_wait(): - """Check T.async_copy supports OOB via predicated cp.async and does not auto-wait.""" +def test_maca_async_copy_oob_lowers_to_conditional_without_wait(): + """Check T.maca_async_copy supports OOB via conditional copy and does not auto-wait."""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py` around lines 169 - 170, Rename the test function test_async_copy_oob_lowers_to_predicated_cp_async_without_wait to reflect MACA behavior (e.g., test_async_copy_oob_lowers_to_predicated_maca_without_wait) and update its docstring to mention MACA conditional copy rather than "cp.async" so the name and description accurately state that the test validates MACA conditional copy behavior; adjust any internal references to the old function name if used elsewhere.
142-142: 💤 Low valueUpdate docstring to reference MACA async copy.
The docstring mentions "auto cp.async rewriting" but the test validates MACA async copy behavior.
📝 Suggested update
- """Check `tl.enable_async_copy=False` disables auto cp.async rewriting.""" + """Check `tl.enable_async_copy=False` disables auto MACA async copy rewriting."""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py` at line 142, The docstring incorrectly says "auto cp.async rewriting" but the test in testing/python/language/test_tilelang_language_access_ptr_codegen.py is asserting MACA async copy behavior controlled by tl.enable_async_copy; update the triple-quoted docstring to reference "MACA async copy" (or mention tl.enable_async_copy disabling MACA async copy) so it accurately describes what the test validates (e.g., "Check tl.enable_async_copy=False disables MACA async copy").
86-87: 💤 Low valueUpdate function name and docstring to reflect MACA target.
The function name mentions "lowers_to_cp_async" and the docstring mentions "CPAsync path", but the test validates MACA
memcpy_asynclowering.📝 Suggested rename
-def test_async_copy_tileop_lowers_to_cp_async(): - """Check T.async_copy always uses CPAsync path and does not auto-wait.""" +def test_maca_async_copy_tileop_lowers_to_memcpy_async(): + """Check T.maca_async_copy lowers to memcpy_async and does not auto-wait."""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py` around lines 86 - 87, Rename the test function test_async_copy_tileop_lowers_to_cp_async to reflect the MACA target (e.g., test_async_copy_tileop_lowers_to_maca_memcpy_async) and update its docstring to say it validates MACA memcpy_async lowering and non-auto-wait behavior instead of "CPAsync path"; modify the function definition name and the triple-quoted string at the top of that function to reference "MACA memcpy_async" and the correct expected behavior.
63-63: 💤 Low valueUpdate docstring to reference MACA async copy.
The docstring mentions "tl.ptx_cp_async" but the test validates MACA behavior.
📝 Suggested update
- """Check subbyte tl.ptx_cp_async derives PTX bytes from logical element counts.""" + """Check subbyte MACA async copy derives bytes from logical element counts."""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py` at line 63, Update the test docstring that currently references "tl.ptx_cp_async" to accurately describe the behavior being tested (MACA async copy); locate the docstring in testing/python/language/test_tilelang_language_access_ptr_codegen.py (the triple-quoted string beginning "Check subbyte tl.ptx_cp_async derives PTX bytes from logical element counts.") and replace it with a concise description mentioning MACA async copy (e.g., "Check subbyte MACA async copy derives PTX bytes from logical element counts.") so the docstring matches the test intent.
9-10: 💤 Low valueUpdate function name and docstring to reflect MACA target.
The function name
test_access_ptr_cp_async_codegenand docstring still reference CUDAcp.async, but the test now validates MACAmemcpy_asynccodegen.📝 Suggested rename
-def test_access_ptr_cp_async_codegen(): - """Smoke-test codegen for T.access_ptr -> tl.access_ptr -> tvm_access_ptr -> cp.async.""" +def test_access_ptr_maca_async_codegen(): + """Smoke-test codegen for T.access_ptr with MACA memcpy_async."""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py` around lines 9 - 10, Rename the test function test_access_ptr_cp_async_codegen and update its docstring to reflect the MACA target: change the function name to something like test_access_ptr_maca_memcpy_async_codegen and update the docstring string to describe that the smoke-test validates T.access_ptr -> tl.access_ptr -> tvm_access_ptr -> MACA memcpy_async codegen (replace any "cp.async" mentions with "memcpy_async" or "MACA memcpy_async"). Ensure the new name and docstring are used consistently in the file so the test reflects the MACA target.
110-111: 💤 Low valueUpdate function name and docstring to reference MACA.
The function name mentions "invalid_cp_async_scope" and the docstring mentions "T.async_copy", but the test validates MACA async copy scope restrictions.
📝 Suggested rename
-def test_async_copy_tileop_rejects_invalid_cp_async_scope(): - """Check T.async_copy rejects non global->shared patterns.""" +def test_maca_async_copy_tileop_rejects_invalid_scope(): + """Check T.maca_async_copy rejects non global->shared patterns."""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py` around lines 110 - 111, Rename the test function test_async_copy_tileop_rejects_invalid_cp_async_scope to a name that references MACA (e.g., test_maca_async_copy_rejects_invalid_scope) and update its docstring to describe that the test validates MACA async copy scope restrictions (replace "T.async_copy" and "non global->shared patterns" wording with a concise description of MACA async copy scope rejection). Ensure any internal references in the test body or asserts that mention the old name or T.async_copy are updated to match the MACA terminology.
57-57: 💤 Low valueUpdate assertion message to reference MACA.
The message mentions "cp.async bytes" but should reference MACA async copy.
📝 Suggested update
- assert "memcpy_async<8>" in src, "Expected vectorized cp.async bytes to fold into memcpy_async<8>" + assert "memcpy_async<8>" in src, "Expected vectorized MACA async copy to fold into memcpy_async<8>"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py` at line 57, Update the assertion message in the test assertion that checks for "memcpy_async<8>" so it references MACA instead of "cp.async bytes"; specifically modify the failure message in the assert statement in test_tilelang_language_access_ptr_codegen.py (the line containing assert "memcpy_async<8>" in src) to say something like "Expected vectorized MACA async copy to fold into memcpy_async<8>" so the message correctly mentions MACA.
38-38: 💤 Low valueUpdate docstring to reference MACA async copy.
The docstring mentions "vectorized tl.ptx_cp_async" but the test validates MACA
memcpy_asyncbehavior.📝 Suggested update
- """check vectorized tl.ptx_cp_async widens logical element counts.""" + """Check vectorized MACA async copy widens logical element counts."""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py` at line 38, Update the module/test docstring to reference MACA's memcpy_async rather than "vectorized tl.ptx_cp_async": replace the current docstring "check vectorized tl.ptx_cp_async widens logical element counts." with a concise description mentioning MACA memcpy_async (e.g., "check MACA memcpy_async widens logical element counts") so the docstring accurately reflects the behavior validated by the test.
122-122: 💤 Low valueUpdate comment to reference MACA async copy.
The comment mentions "cp.async" but should reference MACA async copy.
📝 Suggested update
- # shared->shared cannot use cp.async and should fail for async_copy. + # shared->shared cannot use MACA async copy and should fail.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py` at line 122, Update the inline test comment that currently reads "shared->shared cannot use cp.async and should fail for async_copy." to reference MACA async copy instead of "cp.async" (e.g., "shared->shared cannot use MACA async copy and should fail for async_copy."). Locate the comment in the test_tilelang_language_access_ptr_codegen.py test where the shared->shared async_copy expectation is documented and replace the text "cp.async" with "MACA async copy" so the comment accurately reflects the mechanism under test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/maca/codegen/codegen_maca.cc`:
- Around line 1576-1578: The code declares std::string mbar inside the if
(tl::maca_memcpy_async()) block but uses mbar later, causing a scope/contract
bug; update codegen_maca.cc by (1) moving the declaration of mbar (and any
dependent variables) before the tl::maca_memcpy_async() conditional so it is
always in scope, (2) tighten the argument check ICHECK_GE(op->args.size(), 4) if
the transform always expects a 4th barrier arg, or alternatively implement
explicit handling for 3-arg calls by reading the barrier handle from annotations
or a well-defined fallback provider (e.g., a guaranteed "bar" identifier created
earlier); ensure maca_memcpy_async code path (and uses of op->args[3] / mbar)
consistently reflect the chosen contract so compilation and runtime behavior are
correct.
In `@src/transform/flatten_buffer.cc`:
- Around line 306-326: The fallback handling for maca_memcpy_async only extracts
a barrier Var when args[3] is a BufferLoadNode, so if args[3] is a Var-typed
PrimExpr barrier_var stays undefined; update the args[3] branch to also check
for a VarNode (e.g. call->args[3].as<VarNode>()) and set barrier_var =
Downcast<Var>(call->args[3]) when present so that the subsequent
maca_barrier_type_map_.Set(...) sees the Var; keep the existing BufferLoad
extraction (call->args[3].as<BufferLoad>()->buffer->data) and ensure both paths
populate barrier_var before the barrier_type lookup.
In `@src/transform/legalize_safe_memory_access.cc`:
- Around line 531-537: Restore the CPAsync argument-count guard to require at
least 3 arguments (dst, src, bytes) by changing the ICHECK to require >= 3, and
ensure the function extracts the source buffer using the source-argument index
helper (use GetCPAsyncSrcPtrArgIndex(call) to fetch the src pointer and call
GetRequiredAccessPtrInfo on that argument) so GetCPAsyncSourceBuffer returns
src_info.base_load->buffer rather than using the wrong arg index.
- Around line 420-432: The two helper functions GetCPAsyncSrcPtrArgIndex and
GetCPAsyncDstPtrArgIndex have their return values inverted for calls to
tl::maca_memcpy_async(); swap the indices for that op so
GetCPAsyncSrcPtrArgIndex returns 1 for tl::maca_memcpy_async() and
GetCPAsyncDstPtrArgIndex returns 0 for tl::maca_memcpy_async(), ensuring callers
like CollectCPAsyncConditions, GetCPAsyncSourceBuffer and RewriteCPAsync operate
on the correct src/dst buffers.
In `@src/transform/lower_tile_op.cc`:
- Around line 414-417: The matcher in lower_tile_op.cc currently rejects
tl.tileop.maca_async_copy calls that lack a "barrier" annotation and always
emits a 4-argument tl::maca_memcpy_async; change the logic to allow calls
without the "barrier" annotation (remove the annotations.count("barrier")
requirement around call_node) and when constructing the replacement emit the
3-argument tl::maca_memcpy_async(dst, src, bytes) if the "barrier" annotation is
absent or the 4-argument form tl::maca_memcpy_async(dst, src, bytes, barrier)
only when annotations.count("barrier") is true; apply the same fix in both
matching/replacement sites (around call_node /
Op::Get("tl.tileop.maca_async_copy") and the later block at lines 464-482) so
vectorized MACA copies without barriers are widened to the 3-arg async copy
path.
- Around line 337-350: The contiguity check in FlattenToLinearOffset (and the
fast-path that builds tl::access_ptr / emits maca_memcpy_async) uses the
original Buffer indices but codegen later rewrites loads via buffer_remap_ /
layout_map_, so you must either perform the base-index + contiguity check on the
remapped buffer and the indices after layout->Forward(...) (i.e., call
layout_map_ / buffer_remap_ to obtain the physical flattened buffer and remapped
indices and then run FlattenToLinearOffset on that), or conservatively disable
this optimization whenever the src or dst buffer appears in layout_map_ /
buffer_remap_ (add a guard checking membership in layout_map_ before folding
into maca_memcpy_async); update the checks at the other affected regions (the
blocks referenced around lines 366-395 and 441-468) to use the same
remapped-buffer contiguity logic or the conservative guard.
---
Outside diff comments:
In `@src/transform/legalize_safe_memory_access.cc`:
- Around line 558-603: In RewriteCPAsync: (1) tighten the argument-count check
to require at least 3 args (ICHECK_GE(..., 3U)) to match
GetCPAsyncSourceBuffer's expectation (dst, src, bytes). (2) Avoid dropping the
mbar argument for MACA by skipping the short-circuit predicated-call path when
the original call carries an mbar (i.e., args.size() > 3 / mbar present) or when
the op is MACA; in that case always return the IfThenElse wrapper (keep
BufferStore fallback). Ensure any code that builds the predicated Call (the
Array new_args{call->args[0], call->args[1], call->args[2]} + predicate) is only
executed for non-MACA calls so args[3] (mbar) is not lost.
In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py`:
- Around line 8-9: Replace the incorrect CUDA availability decorator with the
MACA one across the tests: for each test function named
test_access_ptr_cp_async_codegen, test_vectorized_cp_async_num_elems_codegen,
test_vectorized_int4_cp_async_num_elems_codegen,
test_async_copy_tileop_lowers_to_cp_async,
test_async_copy_tileop_rejects_invalid_cp_async_scope,
test_parallel_simt_copy_respects_enable_async_copy_config, and
test_async_copy_oob_lowers_to_predicated_cp_async_without_wait, change the
decorator from `@tilelang.testing.requires_cuda` to the MACA availability
decorator (e.g., `@tilelang.testing.requires_maca`) so the tests target="maca" run
only when MACA is available.
---
Nitpick comments:
In `@testing/python/language/test_tilelang_language_access_ptr_codegen.py`:
- Around line 169-170: Rename the test function
test_async_copy_oob_lowers_to_predicated_cp_async_without_wait to reflect MACA
behavior (e.g., test_async_copy_oob_lowers_to_predicated_maca_without_wait) and
update its docstring to mention MACA conditional copy rather than "cp.async" so
the name and description accurately state that the test validates MACA
conditional copy behavior; adjust any internal references to the old function
name if used elsewhere.
- Line 142: The docstring incorrectly says "auto cp.async rewriting" but the
test in testing/python/language/test_tilelang_language_access_ptr_codegen.py is
asserting MACA async copy behavior controlled by tl.enable_async_copy; update
the triple-quoted docstring to reference "MACA async copy" (or mention
tl.enable_async_copy disabling MACA async copy) so it accurately describes what
the test validates (e.g., "Check tl.enable_async_copy=False disables MACA async
copy").
- Around line 86-87: Rename the test function
test_async_copy_tileop_lowers_to_cp_async to reflect the MACA target (e.g.,
test_async_copy_tileop_lowers_to_maca_memcpy_async) and update its docstring to
say it validates MACA memcpy_async lowering and non-auto-wait behavior instead
of "CPAsync path"; modify the function definition name and the triple-quoted
string at the top of that function to reference "MACA memcpy_async" and the
correct expected behavior.
- Line 63: Update the test docstring that currently references "tl.ptx_cp_async"
to accurately describe the behavior being tested (MACA async copy); locate the
docstring in
testing/python/language/test_tilelang_language_access_ptr_codegen.py (the
triple-quoted string beginning "Check subbyte tl.ptx_cp_async derives PTX bytes
from logical element counts.") and replace it with a concise description
mentioning MACA async copy (e.g., "Check subbyte MACA async copy derives PTX
bytes from logical element counts.") so the docstring matches the test intent.
- Around line 9-10: Rename the test function test_access_ptr_cp_async_codegen
and update its docstring to reflect the MACA target: change the function name to
something like test_access_ptr_maca_memcpy_async_codegen and update the
docstring string to describe that the smoke-test validates T.access_ptr ->
tl.access_ptr -> tvm_access_ptr -> MACA memcpy_async codegen (replace any
"cp.async" mentions with "memcpy_async" or "MACA memcpy_async"). Ensure the new
name and docstring are used consistently in the file so the test reflects the
MACA target.
- Around line 110-111: Rename the test function
test_async_copy_tileop_rejects_invalid_cp_async_scope to a name that references
MACA (e.g., test_maca_async_copy_rejects_invalid_scope) and update its docstring
to describe that the test validates MACA async copy scope restrictions (replace
"T.async_copy" and "non global->shared patterns" wording with a concise
description of MACA async copy scope rejection). Ensure any internal references
in the test body or asserts that mention the old name or T.async_copy are
updated to match the MACA terminology.
- Line 57: Update the assertion message in the test assertion that checks for
"memcpy_async<8>" so it references MACA instead of "cp.async bytes";
specifically modify the failure message in the assert statement in
test_tilelang_language_access_ptr_codegen.py (the line containing assert
"memcpy_async<8>" in src) to say something like "Expected vectorized MACA async
copy to fold into memcpy_async<8>" so the message correctly mentions MACA.
- Line 38: Update the module/test docstring to reference MACA's memcpy_async
rather than "vectorized tl.ptx_cp_async": replace the current docstring "check
vectorized tl.ptx_cp_async widens logical element counts." with a concise
description mentioning MACA memcpy_async (e.g., "check MACA memcpy_async widens
logical element counts") so the docstring accurately reflects the behavior
validated by the test.
- Line 122: Update the inline test comment that currently reads "shared->shared
cannot use cp.async and should fail for async_copy." to reference MACA async
copy instead of "cp.async" (e.g., "shared->shared cannot use MACA async copy and
should fail for async_copy."). Locate the comment in the
test_tilelang_language_access_ptr_codegen.py test where the shared->shared
async_copy expectation is documented and replace the text "cp.async" with "MACA
async copy" so the comment accurately reflects the mechanism under test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 96cb46ee-c275-454b-be68-8c064a9199a6
📒 Files selected for processing (5)
src/maca/codegen/codegen_maca.ccsrc/transform/flatten_buffer.ccsrc/transform/legalize_safe_memory_access.ccsrc/transform/lower_tile_op.cctesting/python/language/test_tilelang_language_access_ptr_codegen.py
| ICHECK_GE(op->args.size(), 3) | ||
| << "maca_memcpy_async expects at least 3 arguments (dst_access_ptr, " | ||
| "src_access_ptr, bytes)"; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Find all constructions of tl::maca_memcpy_async calls and verify argument counts
ast-grep --pattern $'Call($$$, tl::maca_memcpy_async(), $ARGS, $$$)'Repository: tile-ai/tilelang-metax
Length of output: 474
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Inspect the codegen area around the reported lines
echo "---- codegen_maca.cc (context) ----"
sed -n '1540,1625p' src/maca/codegen/codegen_maca.cc | nl -ba | sed -n '1,110p'
# 2) Find all tl::maca_memcpy_async call sites (text search)
echo "---- rg tl::maca_memcpy_async ----"
rg -n "tl::maca_memcpy_async\\(" -S src | head -n 200
# 3) For each file that mentions it, print a small surrounding context
echo "---- context around each occurrence ----"
rg -n "tl::maca_memcpy_async\\(" -S src | cut -d: -f1 | sort -u | while read -r f; do
echo "## $f"
# print first few hits with context
rg -n "tl::maca_memcpy_async\\(" -S "$f" -n | head -n 20 | while read -r line; do
ln=$(echo "$line" | cut -d: -f2)
start=$((ln-8))
end=$((ln+12))
sed -n "${start},${end}p" "$f" | nl -ba | sed -n "${start},${end}p" | head -n 30
echo "----"
done
done
# 4) Re-run ast-grep more systematically to extract all arg lists
# (This only reports files/lines where the pattern matches, but we rely on printed context above for full args.)
echo "---- ast-grep pattern occurrences ----"
ast-grep --pattern $'Call($$$, tl::maca_memcpy_async(), $ARGS, $$$)' 2>/dev/null || trueRepository: tile-ai/tilelang-metax
Length of output: 163
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Print context around the reported lines (with line numbers via awk)
echo "---- codegen_maca.cc context 1540-1625 ----"
awk 'NR>=1540 && NR<=1625 {printf "%6d:%s\n", NR, $0}' src/maca/codegen/codegen_maca.cc
# 2) Locate call sites
echo "---- occurrences: tl::maca_memcpy_async( ----"
rg -n "tl::maca_memcpy_async\\(" -S src || true
# 3) For each file that contains it, print a bit more context around occurrences
echo "---- context around each call site (small window) ----"
rg -n "tl::maca_memcpy_async\\(" -S src | cut -d: -f1 | sort -u | while read -r f; do
echo "## $f"
rg -n "tl::maca_memcpy_async\\(" -S "$f" -n | while read -r line; do
ln=$(echo "$line" | cut -d: -f2)
# Print +/- 10 lines around each match
start=$((ln-10))
end=$((ln+20))
awk -v s="$start" -v e="$end" 'NR>=s && NR<=e {printf "%6d:%s\n", NR, $0}' "$f" | sed -n "1,60p"
echo "----"
done
done
# 4) If ast-grep exists, run the original pattern again (just to see if it matches current AST)
command -v ast-grep >/dev/null 2>&1 && echo "ast-grep found" || echo "ast-grep not found"
ast-grep --pattern $'Call($$$, tl::maca_memcpy_async(), $ARGS, $$$)' 2>/dev/null || trueRepository: tile-ai/tilelang-metax
Length of output: 5717
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "---- Inspect exact braces/scope in codegen_maca.cc around 1570-1605 ----"
sed -n '1568,1605p' src/maca/codegen/codegen_maca.cc | nl -ba | head -n 200
echo "---- Find all 'std::string mbar' and 'mbar =' in codegen_maca.cc ----"
rg -n "std::string mbar|mbar\\s*=\\s*\"bar\"|mbar\\s*=|mbar\\b" src/maca/codegen/codegen_maca.cc
echo "---- Inspect tl::maca_memcpy_async constructions (lower_tile_op.cc) ----"
sed -n '430,520p' src/transform/lower_tile_op.cc
echo "---- Inspect tl::maca_memcpy_async constructions (lower_maca_memcpy_async.cc) ----"
sed -n '300,420p' src/transform/lower_maca_memcpy_async.cc
echo "---- Inspect any other constructions in src/transform where code constructs tl::maca_memcpy_async ----"
rg -n "Call\\(.*maca_memcpy_async" -S src/transform
echo "---- ast-grep matches (broader) for maca_memcpy_async call expressions ----"
command -v ast-grep >/dev/null 2>&1 && echo "ast-grep found" || echo "ast-grep not found"
ast-grep --pattern $'Call($$$, tl::maca_memcpy_async(), $$$)' 2>/dev/null || true
ast-grep --pattern $'Call($$$, tvm::tl::maca_memcpy_async(), $$$)' 2>/dev/null || trueRepository: tile-ai/tilelang-metax
Length of output: 200
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "---- codegen_maca.cc: locate mbarrier_name_ definition/initialization ----"
rg -n "mbarrier_name_" src/maca/codegen/codegen_maca.cc
echo "---- codegen_maca.cc: create_barriers emission around 1604-1614 ----"
sed -n '1590,1630p' src/maca/codegen/codegen_maca.cc
echo "---- codegen_maca.cc: exact snippet 1568-1598 (braces + mbar usage) ----"
sed -n '1568,1598p' src/maca/codegen/codegen_maca.cc
echo "---- codegen_maca.cc: all occurrences of string literal \"bar\" ----"
rg -n "\"bar\"" src/maca/codegen/codegen_maca.cc || true
echo "---- codegen_maca.cc: all occurrences of mbar assignment ----"
rg -n "mbar\\s*=" src/maca/codegen/codegen_maca.cc || true
echo "---- transform: where tl::maca_memcpy_async call expressions are constructed (Call(... tl::maca_memcpy_async() ...)) ----"
rg -n "Call\\(.*maca_memcpy_async\\(" -S src/transform || true
echo "---- transform: inspect lower_tile_op.cc around existing maca_memcpy_async construction ----"
sed -n '440,520p' src/transform/lower_tile_op.cc
echo "---- transform: inspect lower_maca_memcpy_async.cc around existing maca_memcpy_async construction ----"
sed -n '320,400p' src/transform/lower_maca_memcpy_async.ccRepository: tile-ai/tilelang-metax
Length of output: 9892
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "---- codegen_maca.cc: print 1550-1605 with line numbers (awk) ----"
awk 'NR>=1550 && NR<=1605 {printf "%6d:%s\n", NR, $0}' src/maca/codegen/codegen_maca.cc
echo "---- transform: exact construction of maca_memcpy_async calls (rg around matches) ----"
# show Evaluate(Call(... tl::maca_memcpy_async() ...)) blocks (limited context)
rg -n "maca_memcpy_async\\(\\)" src/transform -S
echo "---- context for each file containing maca_memcpy_async() in transform ----"
for f in $(rg -l "maca_memcpy_async\\(\\)" src/transform -S); do
echo "## $f"
rg -n -C 5 "maca_memcpy_async\\(\\)" "$f" -S
done
echo "---- transform: also find tvm::tl::maca_memcpy_async() (in case of namespace variants) ----"
rg -n "tvm::tl::maca_memcpy_async\\(\\)" src/transform -S || true
for f in $(rg -l "tvm::tl::maca_memcpy_async\\(\\)" src/transform -S || true); do
echo "## $f"
rg -n -C 5 "tvm::tl::maca_memcpy_async\\(\\)" "$f" -S
done
echo "---- codegen_maca.cc: find any other usage/decl of variable named mbar ----"
rg -n "\\bmbar\\b" src/maca/codegen/codegen_maca.cc -SRepository: tile-ai/tilelang-metax
Length of output: 10452
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "---- codegen_maca.cc: raw lines 1568-1605 (no nl) ----"
sed -n '1568,1605p' src/maca/codegen/codegen_maca.cc
echo "---- flatten_buffer.cc: maca_memcpy_async handling around barrier var ----"
sed -n '260,360p' src/transform/flatten_buffer.cc
echo "---- vectorize_loop.cc: MutateMACAMemcpyAsyncExpr_ around args.size==4/5 ----"
sed -n '740,840p' src/transform/vectorize_loop.cc
echo "---- legalize_safe_memory_access.cc: how it treats tl::maca_memcpy_async barrier/predicate ----"
sed -n '520,590p' src/transform/legalize_safe_memory_access.ccRepository: tile-ai/tilelang-metax
Length of output: 11573
Fix maca_memcpy_async codegen scope/contract mismatch
- In
src/maca/codegen/codegen_maca.cc(around 1576–1596),std::string mbaris declared inside theif (tl::maca_memcpy_async())block but is used after the block, which will not compile. - Existing transform constructions pass a 4th barrier argument (args size 4, or 5 with predicate), so the relaxed
ICHECK_GE(op->args.size(), 3)/ fallbackmbar = "bar"path is either untested or incorrect; if 3-arg calls are intended, codegen should read the barrier handle from annotations (or otherwise guarantee the emittedbaridentifier exists).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/maca/codegen/codegen_maca.cc` around lines 1576 - 1578, The code declares
std::string mbar inside the if (tl::maca_memcpy_async()) block but uses mbar
later, causing a scope/contract bug; update codegen_maca.cc by (1) moving the
declaration of mbar (and any dependent variables) before the
tl::maca_memcpy_async() conditional so it is always in scope, (2) tighten the
argument check ICHECK_GE(op->args.size(), 4) if the transform always expects a
4th barrier arg, or alternatively implement explicit handling for 3-arg calls by
reading the barrier handle from annotations or a well-defined fallback provider
(e.g., a guaranteed "bar" identifier created earlier); ensure maca_memcpy_async
code path (and uses of op->args[3] / mbar) consistently reflect the chosen
contract so compilation and runtime behavior are correct.
| ICHECK_GE(call->args.size(), 2) | ||
| << "maca_memcpy_async must have at least src and dst arguments"; | ||
|
|
||
| Var barrier_var; | ||
| if (call->annotations.count("barrier")) { | ||
| auto bar_attr = call->annotations.at("barrier"); | ||
| if (auto bar_load = bar_attr.as<BufferLoadNode>()) { | ||
| barrier_var = bar_load->buffer->data; | ||
| } else if (auto bar_var_node = bar_attr.as<VarNode>()) { | ||
| barrier_var = Downcast<Var>(bar_attr); | ||
| } | ||
| } else if (call->args.size() >= 4 && call->args[3].as<BufferLoadNode>()) { | ||
| barrier_var = call->args[3].as<BufferLoad>().value()->buffer->data; | ||
| } | ||
|
|
||
| if (barrier_var.defined() && call->annotations.count("barrier_type")) { | ||
| auto barrier_type = | ||
| Downcast<StringImm>(call->annotations.at("barrier_type")); | ||
| maca_barrier_type_map_.Set(barrier_var, barrier_type); | ||
| } | ||
| } |
There was a problem hiding this comment.
❓ Verification inconclusive
🏁 Script executed:
#!/bin/bash
# Description: Check how barrier arguments are constructed in maca_memcpy_async calls
ast-grep --pattern $'Call($$$, tl::maca_memcpy_async(), [$$$, $BARRIER, $$$], $$$)'
rg -nC3 'maca_memcpy_async.*args.*barrier' --type=cppRepository: tile-ai/tilelang-metax
Length of output: 48
Handle Var barriers in maca_memcpy_async argument fallback (args[3])
In src/transform/flatten_buffer.cc (306-326), annotations["barrier"] supports both BufferLoadNode and VarNode, but the fallback from call->args[3] only handles BufferLoadNode. If the barrier passed as args[3] is a Var-typed PrimExpr, barrier_var stays undefined and the maca_barrier_type_map_ entry won’t be recorded.
🛡️ Proposed fix
- } else if (call->args.size() >= 4 && call->args[3].as<BufferLoadNode>()) {
- barrier_var = call->args[3].as<BufferLoad>().value()->buffer->data;
+ } else if (call->args.size() >= 4) {
+ if (auto bar_load = call->args[3].as<BufferLoadNode>()) {
+ barrier_var = bar_load->buffer->data;
+ } else if (auto bar_var_node = call->args[3].as<VarNode>()) {
+ barrier_var = GetRef<Var>(bar_var_node);
+ }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ICHECK_GE(call->args.size(), 2) | |
| << "maca_memcpy_async must have at least src and dst arguments"; | |
| Var barrier_var; | |
| if (call->annotations.count("barrier")) { | |
| auto bar_attr = call->annotations.at("barrier"); | |
| if (auto bar_load = bar_attr.as<BufferLoadNode>()) { | |
| barrier_var = bar_load->buffer->data; | |
| } else if (auto bar_var_node = bar_attr.as<VarNode>()) { | |
| barrier_var = Downcast<Var>(bar_attr); | |
| } | |
| } else if (call->args.size() >= 4 && call->args[3].as<BufferLoadNode>()) { | |
| barrier_var = call->args[3].as<BufferLoad>().value()->buffer->data; | |
| } | |
| if (barrier_var.defined() && call->annotations.count("barrier_type")) { | |
| auto barrier_type = | |
| Downcast<StringImm>(call->annotations.at("barrier_type")); | |
| maca_barrier_type_map_.Set(barrier_var, barrier_type); | |
| } | |
| } | |
| ICHECK_GE(call->args.size(), 2) | |
| << "maca_memcpy_async must have at least src and dst arguments"; | |
| Var barrier_var; | |
| if (call->annotations.count("barrier")) { | |
| auto bar_attr = call->annotations.at("barrier"); | |
| if (auto bar_load = bar_attr.as<BufferLoadNode>()) { | |
| barrier_var = bar_load->buffer->data; | |
| } else if (auto bar_var_node = bar_attr.as<VarNode>()) { | |
| barrier_var = Downcast<Var>(bar_attr); | |
| } | |
| } else if (call->args.size() >= 4) { | |
| if (auto bar_load = call->args[3].as<BufferLoadNode>()) { | |
| barrier_var = bar_load->buffer->data; | |
| } else if (auto bar_var_node = call->args[3].as<VarNode>()) { | |
| barrier_var = GetRef<Var>(bar_var_node); | |
| } | |
| } | |
| if (barrier_var.defined() && call->annotations.count("barrier_type")) { | |
| auto barrier_type = | |
| Downcast<StringImm>(call->annotations.at("barrier_type")); | |
| maca_barrier_type_map_.Set(barrier_var, barrier_type); | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/transform/flatten_buffer.cc` around lines 306 - 326, The fallback
handling for maca_memcpy_async only extracts a barrier Var when args[3] is a
BufferLoadNode, so if args[3] is a Var-typed PrimExpr barrier_var stays
undefined; update the args[3] branch to also check for a VarNode (e.g.
call->args[3].as<VarNode>()) and set barrier_var = Downcast<Var>(call->args[3])
when present so that the subsequent maca_barrier_type_map_.Set(...) sees the
Var; keep the existing BufferLoad extraction
(call->args[3].as<BufferLoad>()->buffer->data) and ensure both paths populate
barrier_var before the barrier_type lookup.
| int GetCPAsyncSrcPtrArgIndex(const Call &call) const { | ||
| if (call->op == tl::maca_memcpy_async()) { | ||
| return 0; | ||
| } | ||
| return 1; | ||
| } | ||
|
|
||
| int GetCPAsyncDstPtrArgIndex(const Call &call) const { | ||
| if (call->op == tl::maca_memcpy_async()) { | ||
| return 1; | ||
| } | ||
| return 0; | ||
| } |
There was a problem hiding this comment.
Critical: MACA source/destination indices are reversed.
The helper functions return incorrect argument indices for tl::maca_memcpy_async(). According to the codegen contract (Context snippet 1, codegen_maca.cc:1573-1596) and the lowering logic (Context snippet 2), MACA uses:
args[0]= dst_access_ptrargs[1]= src_access_ptr
But the helpers claim:
- Source is at index 0 (line 422) → should be 1
- Destination is at index 1 (line 429) → should be 0
This causes downstream callers to extract the wrong buffers: CollectCPAsyncConditions checks the destination buffer instead of the source for boundary conditions, GetCPAsyncSourceBuffer returns the destination buffer, and RewriteCPAsync applies safe-value fallback to the source instead of the destination.
🐛 Proposed fix: swap the return values
int GetCPAsyncSrcPtrArgIndex(const Call &call) const {
if (call->op == tl::maca_memcpy_async()) {
- return 0;
+ return 1;
}
return 1;
}
int GetCPAsyncDstPtrArgIndex(const Call &call) const {
if (call->op == tl::maca_memcpy_async()) {
- return 1;
+ return 0;
}
return 0;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| int GetCPAsyncSrcPtrArgIndex(const Call &call) const { | |
| if (call->op == tl::maca_memcpy_async()) { | |
| return 0; | |
| } | |
| return 1; | |
| } | |
| int GetCPAsyncDstPtrArgIndex(const Call &call) const { | |
| if (call->op == tl::maca_memcpy_async()) { | |
| return 1; | |
| } | |
| return 0; | |
| } | |
| int GetCPAsyncSrcPtrArgIndex(const Call &call) const { | |
| if (call->op == tl::maca_memcpy_async()) { | |
| return 1; | |
| } | |
| return 1; | |
| } | |
| int GetCPAsyncDstPtrArgIndex(const Call &call) const { | |
| if (call->op == tl::maca_memcpy_async()) { | |
| return 0; | |
| } | |
| return 0; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/transform/legalize_safe_memory_access.cc` around lines 420 - 432, The two
helper functions GetCPAsyncSrcPtrArgIndex and GetCPAsyncDstPtrArgIndex have
their return values inverted for calls to tl::maca_memcpy_async(); swap the
indices for that op so GetCPAsyncSrcPtrArgIndex returns 1 for
tl::maca_memcpy_async() and GetCPAsyncDstPtrArgIndex returns 0 for
tl::maca_memcpy_async(), ensuring callers like CollectCPAsyncConditions,
GetCPAsyncSourceBuffer and RewriteCPAsync operate on the correct src/dst
buffers.
| Buffer GetCPAsyncSourceBuffer(const Call &call) { | ||
| ICHECK_GE(call->args.size(), 3U) | ||
| << "cp.async expects at least 3 arguments, but got " << call->args; | ||
| AccessPtrInfo src_info = | ||
| GetRequiredAccessPtrInfo(call->args[kCPAsyncSrcPtrArg], "cp.async"); | ||
| ICHECK_GE(call->args.size(), 2U) | ||
| << "cp.async expects at least 2 arguments, but got " << call->args; | ||
| AccessPtrInfo src_info = GetRequiredAccessPtrInfo( | ||
| call->args[GetCPAsyncSrcPtrArgIndex(call)], "cp.async"); | ||
| return src_info.base_load->buffer; | ||
| } |
There was a problem hiding this comment.
Relax argument-count check too far for MACA.
The check now requires only >= 2 arguments, but MACA requires at least 3 (dst, src, bytes) per the codegen contract. A malformed 2-argument call would pass this check but fail later.
Additionally, this function extracts the wrong buffer due to the index bug flagged in lines 420-432.
🛡️ Proposed fix: restore minimum argument count
- ICHECK_GE(call->args.size(), 2U)
- << "cp.async expects at least 2 arguments, but got " << call->args;
+ ICHECK_GE(call->args.size(), 3U)
+ << "cp.async expects at least 3 arguments (dst, src, size), but got " << call->args;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Buffer GetCPAsyncSourceBuffer(const Call &call) { | |
| ICHECK_GE(call->args.size(), 3U) | |
| << "cp.async expects at least 3 arguments, but got " << call->args; | |
| AccessPtrInfo src_info = | |
| GetRequiredAccessPtrInfo(call->args[kCPAsyncSrcPtrArg], "cp.async"); | |
| ICHECK_GE(call->args.size(), 2U) | |
| << "cp.async expects at least 2 arguments, but got " << call->args; | |
| AccessPtrInfo src_info = GetRequiredAccessPtrInfo( | |
| call->args[GetCPAsyncSrcPtrArgIndex(call)], "cp.async"); | |
| return src_info.base_load->buffer; | |
| } | |
| Buffer GetCPAsyncSourceBuffer(const Call &call) { | |
| ICHECK_GE(call->args.size(), 3U) | |
| << "cp.async expects at least 3 arguments (dst, src, size), but got " << call->args; | |
| AccessPtrInfo src_info = GetRequiredAccessPtrInfo( | |
| call->args[GetCPAsyncSrcPtrArgIndex(call)], "cp.async"); | |
| return src_info.base_load->buffer; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/transform/legalize_safe_memory_access.cc` around lines 531 - 537, Restore
the CPAsync argument-count guard to require at least 3 arguments (dst, src,
bytes) by changing the ICHECK to require >= 3, and ensure the function extracts
the source buffer using the source-argument index helper (use
GetCPAsyncSrcPtrArgIndex(call) to fetch the src pointer and call
GetRequiredAccessPtrInfo on that argument) so GetCPAsyncSourceBuffer returns
src_info.base_load->buffer rather than using the wrong arg index.
| Optional<PrimExpr> FlattenToLinearOffset(const Buffer &buffer, | ||
| const Array<PrimExpr> &indices) { | ||
| Array<PrimExpr> physical = buffer.OffsetOf(indices); | ||
| Buffer flattened = buffer.GetFlattenedBuffer(); | ||
| if (physical.empty() || physical.size() != flattened->shape.size()) { | ||
| return Optional<PrimExpr>(); | ||
| } | ||
|
|
||
| PrimExpr linear = physical[0]; | ||
| for (size_t i = 1; i < physical.size(); ++i) { | ||
| linear = linear * flattened->shape[i] + physical[i]; | ||
| } | ||
| return analyzer_->Simplify(linear); | ||
| } |
There was a problem hiding this comment.
Check contiguity after layout remap, not on the pre-remap buffer.
This fast path proves delta == 1 against src/dst as they appear in the original BufferRegion, then builds tl::access_ptr from those same indices. But later in this pass the BufferLoad inside that tl::access_ptr is rewritten through buffer_remap_ / layout_map_. For shared buffers with a swizzled or otherwise remapped layout, consecutive logical lanes are not guaranteed to stay consecutive after that rewrite, so folding them into one contiguous maca_memcpy_async can write the wrong shared-memory addresses.
Please either:
- run the base-index + contiguity check on the remapped buffer and
layout->Forward(...)indices that will actually reach codegen, or - conservatively disable this optimization whenever either side participates in
layout_map_until that physical-address check is in place.
Also applies to: 366-395, 441-468
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/transform/lower_tile_op.cc` around lines 337 - 350, The contiguity check
in FlattenToLinearOffset (and the fast-path that builds tl::access_ptr / emits
maca_memcpy_async) uses the original Buffer indices but codegen later rewrites
loads via buffer_remap_ / layout_map_, so you must either perform the base-index
+ contiguity check on the remapped buffer and the indices after
layout->Forward(...) (i.e., call layout_map_ / buffer_remap_ to obtain the
physical flattened buffer and remapped indices and then run
FlattenToLinearOffset on that), or conservatively disable this optimization
whenever the src or dst buffer appears in layout_map_ / buffer_remap_ (add a
guard checking membership in layout_map_ before folding into maca_memcpy_async);
update the checks at the other affected regions (the blocks referenced around
lines 366-395 and 441-468) to use the same remapped-buffer contiguity logic or
the conservative guard.
| if (call_node == nullptr || | ||
| !call_node->op.same_as(Op::Get("tl.tileop.maca_async_copy")) || | ||
| !call_node->annotations.count("barrier")) { | ||
| return Optional<Stmt>(); |
There was a problem hiding this comment.
Allow barrier-less MACA copies through this widening path.
The matcher currently bails out unless the tile-op call carries a barrier annotation, and the replacement always emits a 4-arg tl::maca_memcpy_async. That diverges from the downstream contract added in this stack, where maca_memcpy_async is valid with only (dst, src, bytes). As written, any vectorized MACA copy that intentionally omits the barrier will fall back to the per-lane path instead of being upgraded here.
Suggested change
- if (call_node == nullptr ||
- !call_node->op.same_as(Op::Get("tl.tileop.maca_async_copy")) ||
- !call_node->annotations.count("barrier")) {
+ if (call_node == nullptr ||
+ !call_node->op.same_as(Op::Get("tl.tileop.maca_async_copy"))) {
return Optional<Stmt>();
}
...
- PrimExpr barrier = Downcast<PrimExpr>(call_node->annotations.at("barrier"));
PrimExpr dst_access_ptr =
MakeTLAccessPtr(dst, dst_base.value(), lanes, /*rw_mask=*/2);
PrimExpr src_access_ptr =
MakeTLAccessPtr(src, src_base.value(), lanes, /*rw_mask=*/1);
+ Array<PrimExpr> async_args = {dst_access_ptr, src_access_ptr,
+ IntImm(DataType::Int(32), total_bytes)};
+ if (auto barrier = call_node->annotations.Get("barrier")) {
+ async_args.push_back(Downcast<PrimExpr>(barrier.value()));
+ }
...
- return Evaluate(Call(dst->dtype, tl::maca_memcpy_async(),
- {dst_access_ptr, src_access_ptr,
- IntImm(DataType::Int(32), total_bytes), barrier},
+ return Evaluate(Call(dst->dtype, tl::maca_memcpy_async(), async_args,
{{"barrier_type", StringImm(barrier_type)}}));Also applies to: 464-482
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/transform/lower_tile_op.cc` around lines 414 - 417, The matcher in
lower_tile_op.cc currently rejects tl.tileop.maca_async_copy calls that lack a
"barrier" annotation and always emits a 4-argument tl::maca_memcpy_async; change
the logic to allow calls without the "barrier" annotation (remove the
annotations.count("barrier") requirement around call_node) and when constructing
the replacement emit the 3-argument tl::maca_memcpy_async(dst, src, bytes) if
the "barrier" annotation is absent or the 4-argument form
tl::maca_memcpy_async(dst, src, bytes, barrier) only when
annotations.count("barrier") is true; apply the same fix in both
matching/replacement sites (around call_node /
Op::Get("tl.tileop.maca_async_copy") and the later block at lines 464-482) so
vectorized MACA copies without barriers are widened to the 3-arg async copy
path.
Replace with Maca's asynchronous copy command, and fix the issues with OOB and upgrading vectorized copies to asynchronous copies
Summary by CodeRabbit
New Features
Tests