Skip to content

[Fix issue #1023]: T.tile.atomic_add with slice syntax + vector lane variables#1034

Merged
fuhouyu-hw merged 3 commits into
tile-ai:ascendc_ptofrom
ShareableXue:atomic_add
May 19, 2026
Merged

[Fix issue #1023]: T.tile.atomic_add with slice syntax + vector lane variables#1034
fuhouyu-hw merged 3 commits into
tile-ai:ascendc_ptofrom
ShareableXue:atomic_add

Conversation

@ShareableXue
Copy link
Copy Markdown
Contributor

@ShareableXue ShareableXue commented May 16, 2026

[Fix issue #1023] Fix T.tile.atomic_add with slice syntax + vector lane variables

Summary:

  • When dQ[i_b, cid*sC + vid*sC : ..., i_h, ...] is used as the dst of T.tile.atomic_add, the vid variable gets vectorized by AscendLowerParallelToVector, introducing T.Ramp (SIMD lane types) into buffer indices, extents, offset computations, and validity-check expressions. Four downstream components could not handle this.

  • The vectorization scrambles the dst extents: a 32-element contiguous range along the S dimension becomes extent=1 with a 32-lane Ramp, while the neighbouring H dimension (originally extent=1) absorbs the 32 — turning [1, 32, 1, 64] into [1, 1, 32, 64]. This breaks compute_strideN and find_active_dim_indices, which rely on extent values to identify row/column dimensions and compute the inter-row stride.

  • The fixes are applied in two layers: compile-time (allow the IR to lower without crashing) and runtime-correctness (make the lowered code produce correct DMA parameters so the kernel passes on NPU).

Changes:

src/op/ascend.ccAscendAtomicAdd::Lower

  • Remove incorrect rank-equality check between src and dst regions. A 2D UB source and a 4D global destination slice naturally have different ranks; only require each side to match its own buffer rank.

  • compute_valid_extent: when min_val is a Ramp, shape - min_val produces a vector type. Detect this and return the scalar extent directly, avoiding Select on vector lanes.

  • Scalarize indices before OffsetOf: extract RampNode::base from every index before computing the flat buffer offset. Ramp lane counts already contribute to the access extent via dst_len; they must not participate in multi-dimensional flat-offset arithmetic where lanes from different dimensions (e.g. 32 vs 64) collide in ElemOffsetBinaryOpMatchTypes.

  • Compute dst_len accounting for Ramp lanes: when a dimension carries a Ramp index, multiply the region extent by the Ramp lane count to obtain the true number of elements accessed.

  • Compute effective extents for boundary checks: when a dimension has extent == 1 but the index is Ramp(base, stride, N), replace the extent with N (the true element count). When the extent already equals the full count (e.g. extent=64 with a 64-lane Ramp), keep it unchanged — previously 64 * 64 = 4096 inflated validCol and caused out-of-bounds DMA on larger shapes.

  • Identify row/col from Ramp lanes instead of scrambled extents: when two dimensions carry Ramp indices with distinct lane counts, treat the smaller-lane Ramp as the row dimension and the larger-lane Ramp as the column dimension. Use these to drive validRow / validCol boundary computation and strideN.

  • Compute strideN from buffer shape: when Ramp dimensions are identified, compute the inter-row stride as the product of all buffer shape dimensions to the right of the row dimension. Only fall back to compute_strideN(dst, dst_extents) when no Ramps are present.

src/target/codegen_ascend.ccCopyCodegen

  • AscendLowerParallelToVector may introduce Ramp expressions into access_ptr offset fields. AscendC's GlobalTensor::operator[] expects a scalar start offset, so extract RampNode::base at codegen time. Per-element strides are already handled by DataCopyExtParams / DMA.

src/transform/ascend_storage_rewrite.ccOnArrayAccess

  • When a buffer is indexed with vector lanes (Ramp) but the stored value is scalar (index_lanes > 1 && value_dtype.lanes() == 1), allow the mismatch instead of asserting. The scalar value is broadcast across lanes by the hardware.

Test:

repro_atomic_add_issue.py — 2D UB src → 4D GM dst with slice syntax containing vid. Compiles and runs correctly on NPU.

examples/linear_attention/example_linear_attn_bwd.py — full linear attention backward kernel using T.tile.atomic_add for dQ accumulation. Verified on Ascend NPU with three configurations:

B S H D Result
1 128 8 64 Passed
2 1024 16 128 Passed
8 1024 32 128 Passed

@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several updates to the Ascend backend to support vectorized offsets and multi-rank data transfers. Key changes include handling multi-lane data types in compute_valid_extent, allowing mismatched ranks between source and destination regions in AscendAtomicAdd, and extracting scalar base offsets from RampNode during code generation. Review feedback suggests refactoring the duplicated compute_valid_extent logic into a shared helper and adding a verification step to ensure total element counts match when source and destination ranks differ.

Comment thread src/op/ascend.cc
Comment on lines +576 to +578
if (remaining.dtype().lanes() > 1) {
return extent;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for compute_valid_extent is duplicated here from AscendCopy::Lower (line 333). To improve maintainability and reduce code duplication, consider refactoring this into a shared helper function or a static method within the file.

Comment thread src/op/ascend.cc
<< "tl.ascend_atomic_add requires src and dst regions to have the same "
"rank, got src "
<< src_extents.size() << " and dst " << dst_extents.size();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The rank equality check between src_extents and dst_extents has been removed to support cases where the source and destination regions have different ranks (e.g., 2D to 4D). While this is necessary for the described fix, it is important to ensure that the total number of elements in both regions still matches. Consider adding a check to verify that src_len == dst_len after they are computed (around line 664) to prevent potential out-of-bounds access or mismatched data transfers.

@benyang0506
Copy link
Copy Markdown
Collaborator

Could you add repro_atomic_add_issue.py to the test?

@ShareableXue
Copy link
Copy Markdown
Contributor Author

Could you add repro_atomic_add_issue.py to the test?

感谢 review。repro_atomic_add_issue.py 目前只是一个临时的复现脚本,主要作用是验证编译是否通过,并把 tensor 具体元素打印出来供人工排查,并不符合我们测试框架的规范。

不过,PR #991 中会带一个更规范的 example,其中会用相同的方式使用 atomic_add,并加入可自动验证的逻辑(例如对比预期输出)。等到那个 PR 合入后,我们自然就可以在仓库上通过运行该 example 来覆盖这个问题。

所以建议暂时不直接把 repro_atomic_add_issue.py 放进测试目录,而是等 #991 合入后,用其中的 example 来补充测试。如您觉得合适,我们也可以专门为这种使用方式撰写一个test用例。

@handsomeRobotSK
Copy link
Copy Markdown
Contributor

非常棒的修复,使我的算子可以有更多选择!

@ShareableXue
Copy link
Copy Markdown
Contributor Author

非常棒的修复,使我的算子可以有更多选择!

Thanks for your recognition. If you encounter any bugs in the future, please feel free to open an issue and @ me.

@ArmandAlbert
Copy link
Copy Markdown
Collaborator

After verification, the T.tile.atomic_add interface works correctly in the copying of linear_attn_bwd without any issues. Thanks for the fix!😊
#991

@ShareableXue
Copy link
Copy Markdown
Contributor Author

After verification, the T.tile.atomic_add interface works correctly in the copying of linear_attn_bwd without any issues. Thanks for the fix!😊 #991

I really appreciate you taking the time to test it and leave this feedback.
I'll go ahead and merge the PR. Thanks again for your collaboration on this! 😊

@fuhouyu-hw
Copy link
Copy Markdown
Collaborator

是否涉及编程手册改动?

@ShareableXue
Copy link
Copy Markdown
Contributor Author

是否涉及编程手册改动?

感谢review!不涉及手册改动。本次修改是 AscendAtomicAdd::Lower 内部的编译逻辑修复——向量化 pass 引入的 T.Ramp 在多维度 offset 计算、stride 推导和边界检查中未被正确处理。API 接口、参数类型、使用方式均无变化,现有手册中 dst 已支持BufferRegion(region/slice语法)。仅修复了当 region 索引中含有会被向量化的 loop 变量时编译器内部崩溃/生成错误 DMA 参数的问题。

这是编译器内部的bug fix,API没有变化,因此不需要改动变成手册。

Copy link
Copy Markdown
Collaborator

@fuhouyu-hw fuhouyu-hw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@fuhouyu-hw fuhouyu-hw merged commit 1255d4f into tile-ai:ascendc_pto May 19, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants