[Fix issue #1023]: T.tile.atomic_add with slice syntax + vector lane variables#1034
Conversation
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
There was a problem hiding this comment.
Code Review
This pull request introduces several updates to the Ascend backend to support vectorized offsets and multi-rank data transfers. Key changes include handling multi-lane data types in compute_valid_extent, allowing mismatched ranks between source and destination regions in AscendAtomicAdd, and extracting scalar base offsets from RampNode during code generation. Review feedback suggests refactoring the duplicated compute_valid_extent logic into a shared helper and adding a verification step to ensure total element counts match when source and destination ranks differ.
| if (remaining.dtype().lanes() > 1) { | ||
| return extent; | ||
| } |
| << "tl.ascend_atomic_add requires src and dst regions to have the same " | ||
| "rank, got src " | ||
| << src_extents.size() << " and dst " << dst_extents.size(); | ||
|
|
There was a problem hiding this comment.
The rank equality check between src_extents and dst_extents has been removed to support cases where the source and destination regions have different ranks (e.g., 2D to 4D). While this is necessary for the described fix, it is important to ensure that the total number of elements in both regions still matches. Consider adding a check to verify that src_len == dst_len after they are computed (around line 664) to prevent potential out-of-bounds access or mismatched data transfers.
|
Could you add repro_atomic_add_issue.py to the test? |
感谢 review。repro_atomic_add_issue.py 目前只是一个临时的复现脚本,主要作用是验证编译是否通过,并把 tensor 具体元素打印出来供人工排查,并不符合我们测试框架的规范。 不过,PR #991 中会带一个更规范的 example,其中会用相同的方式使用 atomic_add,并加入可自动验证的逻辑(例如对比预期输出)。等到那个 PR 合入后,我们自然就可以在仓库上通过运行该 example 来覆盖这个问题。 所以建议暂时不直接把 repro_atomic_add_issue.py 放进测试目录,而是等 #991 合入后,用其中的 example 来补充测试。如您觉得合适,我们也可以专门为这种使用方式撰写一个test用例。 |
|
非常棒的修复,使我的算子可以有更多选择! |
Thanks for your recognition. If you encounter any bugs in the future, please feel free to open an issue and @ me. |
|
After verification, the |
I really appreciate you taking the time to test it and leave this feedback. |
|
是否涉及编程手册改动? |
感谢review!不涉及手册改动。本次修改是 AscendAtomicAdd::Lower 内部的编译逻辑修复——向量化 pass 引入的 T.Ramp 在多维度 offset 计算、stride 推导和边界检查中未被正确处理。API 接口、参数类型、使用方式均无变化,现有手册中 dst 已支持BufferRegion(region/slice语法)。仅修复了当 region 索引中含有会被向量化的 loop 变量时编译器内部崩溃/生成错误 DMA 参数的问题。 这是编译器内部的bug fix,API没有变化,因此不需要改动变成手册。 |
[Fix issue #1023] Fix T.tile.atomic_add with slice syntax + vector lane variables
Summary:
When
dQ[i_b, cid*sC + vid*sC : ..., i_h, ...]is used as the dst ofT.tile.atomic_add, thevidvariable gets vectorized byAscendLowerParallelToVector, introducingT.Ramp(SIMD lane types) into buffer indices, extents, offset computations, and validity-check expressions. Four downstream components could not handle this.The vectorization scrambles the dst extents: a 32-element contiguous range along the S dimension becomes
extent=1with a 32-lane Ramp, while the neighbouring H dimension (originallyextent=1) absorbs the 32 — turning[1, 32, 1, 64]into[1, 1, 32, 64]. This breakscompute_strideNandfind_active_dim_indices, which rely on extent values to identify row/column dimensions and compute the inter-row stride.The fixes are applied in two layers: compile-time (allow the IR to lower without crashing) and runtime-correctness (make the lowered code produce correct DMA parameters so the kernel passes on NPU).
Changes:
src/op/ascend.cc—AscendAtomicAdd::LowerRemove incorrect rank-equality check between src and dst regions. A 2D UB source and a 4D global destination slice naturally have different ranks; only require each side to match its own buffer rank.
compute_valid_extent: whenmin_valis aRamp,shape - min_valproduces a vector type. Detect this and return the scalar extent directly, avoidingSelecton vector lanes.Scalarize indices before
OffsetOf: extractRampNode::basefrom every index before computing the flat buffer offset. Ramp lane counts already contribute to the access extent viadst_len; they must not participate in multi-dimensional flat-offset arithmetic where lanes from different dimensions (e.g. 32 vs 64) collide inElemOffset→BinaryOpMatchTypes.Compute
dst_lenaccounting for Ramp lanes: when a dimension carries a Ramp index, multiply the region extent by the Ramp lane count to obtain the true number of elements accessed.Compute effective extents for boundary checks: when a dimension has
extent == 1but the index isRamp(base, stride, N), replace the extent withN(the true element count). When the extent already equals the full count (e.g.extent=64with a 64-lane Ramp), keep it unchanged — previously64 * 64 = 4096inflatedvalidColand caused out-of-bounds DMA on larger shapes.Identify row/col from Ramp lanes instead of scrambled extents: when two dimensions carry Ramp indices with distinct lane counts, treat the smaller-lane Ramp as the row dimension and the larger-lane Ramp as the column dimension. Use these to drive
validRow/validColboundary computation andstrideN.Compute
strideNfrom buffer shape: when Ramp dimensions are identified, compute the inter-row stride as the product of all buffer shape dimensions to the right of the row dimension. Only fall back tocompute_strideN(dst, dst_extents)when no Ramps are present.src/target/codegen_ascend.cc—CopyCodegenAscendLowerParallelToVectormay introduce Ramp expressions intoaccess_ptroffset fields. AscendC'sGlobalTensor::operator[]expects a scalar start offset, so extractRampNode::baseat codegen time. Per-element strides are already handled byDataCopyExtParams/ DMA.src/transform/ascend_storage_rewrite.cc—OnArrayAccessindex_lanes > 1 && value_dtype.lanes() == 1), allow the mismatch instead of asserting. The scalar value is broadcast across lanes by the hardware.Test:
repro_atomic_add_issue.py— 2D UB src → 4D GM dst with slice syntax containingvid. Compiles and runs correctly on NPU.examples/linear_attention/example_linear_attn_bwd.py— full linear attention backward kernel usingT.tile.atomic_addfor dQ accumulation. Verified on Ascend NPU with three configurations: