[Fix issue #1023]: T.tile.atomic_add with slice syntax + vector lane variables by ShareableXue · Pull Request #1034 · tile-ai/tilelang-ascend

ShareableXue · 2026-05-16T03:46:45Z

[Fix issue #1023] Fix T.tile.atomic_add with slice syntax + vector lane variables

Summary:

When dQ[i_b, cid*sC + vid*sC : ..., i_h, ...] is used as the dst of T.tile.atomic_add, the vid variable gets vectorized by AscendLowerParallelToVector, introducing T.Ramp (SIMD lane types) into buffer indices, extents, offset computations, and validity-check expressions. Four downstream components could not handle this.
The vectorization scrambles the dst extents: a 32-element contiguous range along the S dimension becomes extent=1 with a 32-lane Ramp, while the neighbouring H dimension (originally extent=1) absorbs the 32 — turning [1, 32, 1, 64] into [1, 1, 32, 64]. This breaks compute_strideN and find_active_dim_indices, which rely on extent values to identify row/column dimensions and compute the inter-row stride.
The fixes are applied in two layers: compile-time (allow the IR to lower without crashing) and runtime-correctness (make the lowered code produce correct DMA parameters so the kernel passes on NPU).
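
To make the extent scrambling concrete, here is a minimal plain-Python sketch. The `Ramp` class and the `covered` helper are hypothetical stand-ins (not TVM or TileLang APIs) that only mirror the base/stride/lanes fields of a Ramp index, and the start value 64 is arbitrary. It assumes each lane of a Ramp starts its own range of `extent` elements. Under that assumption a scalar start with extent 32 and a 32-lane Ramp with extent 1 touch the same elements, which is why extent values alone stop identifying the row dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ramp:
    """Toy vector index: base, base + stride, ... (`lanes` values)."""
    base: int
    stride: int
    lanes: int

    def elements(self):
        return [self.base + i * self.stride for i in range(self.lanes)]


def covered(index, extent):
    """Indices touched along one dimension by (index, extent)."""
    if isinstance(index, Ramp):
        # Assumption: every lane starts its own range of `extent` elements.
        return sorted({e + k for e in index.elements() for k in range(extent)})
    return list(range(index, index + extent))


before = covered(64, 32)              # scalar start, extent 32
after = covered(Ramp(64, 1, 32), 1)   # 32-lane Ramp, extent 1
print(before == after, len(after))
```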

Changes:

`src/op/ascend.cc` — `AscendAtomicAdd::Lower`

Remove incorrect rank-equality check between src and dst regions. A 2D UB source and a 4D global destination slice naturally have different ranks; only require each side to match its own buffer rank.
compute_valid_extent: when min_val is a Ramp, shape - min_val produces a vector type. Detect this and return the scalar extent directly, avoiding Select on vector lanes.
Scalarize indices before OffsetOf: extract RampNode::base from every index before computing the flat buffer offset. Ramp lane counts already contribute to the access extent via dst_len; they must not participate in multi-dimensional flat-offset arithmetic where lanes from different dimensions (e.g. 32 vs 64) collide in ElemOffset → BinaryOpMatchTypes.
Compute dst_len accounting for Ramp lanes: when a dimension carries a Ramp index, multiply the region extent by the Ramp lane count to obtain the true number of elements accessed.
Compute effective extents for boundary checks: when a dimension has extent == 1 but the index is Ramp(base, stride, N), replace the extent with N (the true element count). When the extent already equals the full count (e.g. extent=64 with a 64-lane Ramp), keep it unchanged — previously 64 * 64 = 4096 inflated validCol and caused out-of-bounds DMA on larger shapes.
Identify row/col from Ramp lanes instead of scrambled extents: when two dimensions carry Ramp indices with distinct lane counts, treat the smaller-lane Ramp as the row dimension and the larger-lane Ramp as the column dimension. Use these to drive validRow / validCol boundary computation and strideN.
Compute strideN from buffer shape: when Ramp dimensions are identified, compute the inter-row stride as the product of all buffer shape dimensions to the right of the row dimension. Only fall back to compute_strideN(dst, dst_extents) when no Ramps are present.
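
The index handling described above can be sketched in plain Python. This is an illustrative model under stated assumptions, not the C++ code in `AscendAtomicAdd::Lower`: the `Ramp` class and every helper name below are hypothetical, the buffer is assumed to be row-major, and the index values are made up for a `[B, S, H, D] = [1, 128, 8, 64]` buffer with the scrambled extents `[1, 1, 32, 64]` from the summary.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Ramp:
    """Toy vector index: base, base + stride, ... (`lanes` values)."""
    base: int
    stride: int
    lanes: int


def lanes_of(index):
    return index.lanes if isinstance(index, Ramp) else 1


def scalar_base(index):
    # Only the base takes part in flat-offset arithmetic.
    return index.base if isinstance(index, Ramp) else index


def flat_offset(shape, indices):
    # Row-major offset computed from scalarized indices (assumed layout).
    offset = 0
    for dim, index in zip(shape, indices):
        offset = offset * dim + scalar_base(index)
    return offset


def effective_extents(indices, extents):
    # extent == 1 with an N-lane Ramp becomes N; anything else is kept,
    # so extent 64 with a 64-lane Ramp stays 64 rather than 64 * 64.
    result = []
    for index, extent in zip(indices, extents):
        n = lanes_of(index)
        result.append(n if (n > 1 and extent == 1) else extent)
    return result


def row_col_dims(indices):
    # Two Ramp dimensions with distinct lane counts:
    # fewer lanes -> row dimension, more lanes -> column dimension.
    ramps = sorted((lanes_of(i), d) for d, i in enumerate(indices) if lanes_of(i) > 1)
    if len(ramps) != 2 or ramps[0][0] == ramps[1][0]:
        return None  # caller would fall back to the extent-based path
    return ramps[0][1], ramps[1][1]


def stride_n(shape, row_dim):
    # Inter-row stride: product of the shape dimensions right of the row dim.
    return prod(shape[row_dim + 1:])


shape = (1, 128, 8, 64)
indices = [0, Ramp(32, 1, 32), 3, Ramp(0, 1, 64)]
extents = [1, 1, 32, 64]
row, col = row_col_dims(indices)
eff = effective_extents(indices, extents)
print(row, col, eff[row], eff[col], stride_n(shape, row), flat_offset(shape, indices))
# prints: 1 3 32 64 512 16576
```

In this model the row dimension is S and the column dimension is D, so the stride between consecutive rows is H * D = 512 elements, independent of the scrambled extent on H.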

`src/target/codegen_ascend.cc` — `CopyCodegen`

AscendLowerParallelToVector may introduce Ramp expressions into access_ptr offset fields. AscendC's GlobalTensor::operator[] expects a scalar start offset, so extract RampNode::base at codegen time. Per-element strides are already handled by DataCopyExtParams / DMA.

`src/transform/ascend_storage_rewrite.cc` — `OnArrayAccess`

When a buffer is indexed with vector lanes (Ramp) but the stored value is scalar (index_lanes > 1 && value_dtype.lanes() == 1), allow the mismatch instead of asserting. The scalar value is broadcast across lanes by the hardware.
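
As a rough illustration of the relaxed check, the sketch below models the lane rule in plain Python. The function name `lanes_compatible` is hypothetical, and the sketch assumes the pre-existing rule was simple lane-count equality; the real check in `OnArrayAccess` may involve more conditions.

```python
def lanes_compatible(index_lanes: int, value_lanes: int) -> bool:
    """Return True if an access with these lane counts is accepted."""
    if index_lanes == value_lanes:
        # Assumed original rule: index and value lane counts agree.
        return True
    # Relaxation described above: vector index, scalar stored value.
    return index_lanes > 1 and value_lanes == 1


print(lanes_compatible(32, 1), lanes_compatible(32, 4))
```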

Test:

repro_atomic_add_issue.py — 2D UB src → 4D GM dst with slice syntax containing vid. Compiles and runs correctly on NPU.

examples/linear_attention/example_linear_attn_bwd.py — full linear attention backward kernel using T.tile.atomic_add for dQ accumulation. Verified on Ascend NPU with three configurations:

| B | S | H | D | Result |
|---|---|---|---|---|
| 1 | 128 | 8 | 64 | Passed |
| 2 | 1024 | 16 | 128 | Passed |
| 8 | 1024 | 32 | 128 | Passed |

github-actions · 2026-05-16T03:46:55Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

gemini-code-assist

Code Review

This pull request introduces several updates to the Ascend backend to support vectorized offsets and multi-rank data transfers. Key changes include handling multi-lane data types in compute_valid_extent, allowing mismatched ranks between source and destination regions in AscendAtomicAdd, and extracting scalar base offsets from RampNode during code generation. Review feedback suggests refactoring the duplicated compute_valid_extent logic into a shared helper and adding a verification step to ensure total element counts match when source and destination ranks differ.

gemini-code-assist · 2026-05-16T03:48:03Z

+    if (remaining.dtype().lanes() > 1) {
+      return extent;
+    }


The logic for compute_valid_extent is duplicated here from AscendCopy::Lower (line 333). To improve maintainability and reduce code duplication, consider refactoring this into a shared helper function or a static method within the file.
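
One possible shape for such a shared helper, modeled in plain Python rather than C++: the early return mirrors the lane check in the diff above, while the scalar branch (clamping the extent to the remaining elements) is an assumption about what the existing code does, not something the PR text confirms. The `lanes` parameter stands in for `remaining.dtype().lanes()`.

```python
def compute_valid_extent(shape: int, min_val: int, extent: int, lanes: int = 1) -> int:
    """Number of in-bounds elements for a region starting at min_val.

    `lanes` models the lane count of the start index (1 for a scalar).
    """
    if lanes > 1:
        # Vector start index: skip the element-wise comparison and
        # return the scalar extent directly, as in the diff.
        return extent
    remaining = shape - min_val
    # Assumed scalar behavior: clamp the extent to what is left in the buffer.
    return min(extent, remaining)


print(compute_valid_extent(128, 120, 32), compute_valid_extent(128, 120, 1, lanes=32))
```

Both `AscendCopy::Lower` and `AscendAtomicAdd::Lower` could then call one definition instead of carrying separate copies.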

gemini-code-assist · 2026-05-16T03:48:03Z

-      << "tl.ascend_atomic_add requires src and dst regions to have the same "
-         "rank, got src "
-      << src_extents.size() << " and dst " << dst_extents.size();



The rank equality check between src_extents and dst_extents has been removed to support cases where the source and destination regions have different ranks (e.g., 2D to 4D). While this is necessary for the described fix, it is important to ensure that the total number of elements in both regions still matches. Consider adding a check to verify that src_len == dst_len after they are computed (around line 664) to prevent potential out-of-bounds access or mismatched data transfers.
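
A minimal sketch of the suggested check, in plain Python with a hypothetical helper name. It uses the pre-vectorization destination extents `[1, 32, 1, 64]` from the PR summary and assumes a `[32, 64]` source tile; in the real lowering, `dst_len` may also need to account for Ramp lanes before the comparison.

```python
from math import prod


def check_same_length(src_extents, dst_extents):
    """Raise if the two regions cover a different number of elements."""
    src_len, dst_len = prod(src_extents), prod(dst_extents)
    if src_len != dst_len:
        raise ValueError(
            f"element count mismatch: src {src_len} vs dst {dst_len}"
        )
    return src_len


# 2D source tile vs 4D destination slice: ranks differ, lengths agree.
print(check_same_length([32, 64], [1, 32, 1, 64]))
```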

benyang0506 · 2026-05-18T03:15:17Z

Could you add repro_atomic_add_issue.py to the test?

ShareableXue · 2026-05-18T07:12:30Z

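> Could you add repro_atomic_add_issue.py to the test?

Thanks for the review. repro_atomic_add_issue.py is currently just a temporary reproduction script. Its main purpose is to verify that compilation succeeds and to print the tensor's elements for manual inspection, so it does not meet the conventions of our test framework.

However, PR #991 will bring a more standard example that uses atomic_add in the same way and adds automatically verifiable logic (for example, comparing against the expected output). Once that PR is merged, running that example in the repository will naturally cover this issue.

So I suggest not putting repro_atomic_add_issue.py directly into the test directory for now, and instead using the example from #991 to supplement the tests after it is merged. If you think it appropriate, we can also write a dedicated test case for this usage.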

handsomeRobotSK · 2026-05-18T08:10:44Z

Great fix! It gives my operators more options!

ShareableXue · 2026-05-18T08:12:16Z

> Great fix! It gives my operators more options!

Thanks for your recognition. If you encounter any bugs in the future, please feel free to open an issue and @ me.

ArmandAlbert · 2026-05-19T02:18:29Z

After verification, the T.tile.atomic_add interface works correctly in the copying of linear_attn_bwd without any issues. Thanks for the fix!😊
#991

ShareableXue · 2026-05-19T03:02:31Z

> After verification, the T.tile.atomic_add interface works correctly in the copying of linear_attn_bwd without any issues. Thanks for the fix!😊 #991

I really appreciate you taking the time to test it and leave this feedback.
I'll go ahead and merge the PR. Thanks again for your collaboration on this! 😊

fuhouyu-hw · 2026-05-19T09:13:54Z

Does this involve changes to the programming manual?

ShareableXue · 2026-05-19T09:26:57Z

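> Does this involve changes to the programming manual?

Thanks for the review! No manual changes are involved. This change fixes the compilation logic inside AscendAtomicAdd::Lower: the T.Ramp introduced by the vectorization pass was not handled correctly in multi-dimensional offset computation, stride derivation, and boundary checks. The API, parameter types, and usage are all unchanged, and the existing manual already documents that dst supports BufferRegion (region/slice syntax). The fix only addresses the case where a region index contains a loop variable that gets vectorized, which made the compiler crash internally or generate wrong DMA parameters.

This is a compiler-internal bug fix with no API change, so the programming manual does not need to be updated.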

fuhouyu-hw

/approve

[fix] Fix T.tile.atomic_add with slice syntax + vector lane variables

9f950c6

gemini-code-assist Bot reviewed May 16, 2026

Shareable added 2 commits May 18, 2026 15:00

[Fix] fix the issue tile-ai#1023. Make sure linear attention bwd pass.

56a0589

[chore] clang-format for src/op/ascend.cc

a21abf2

benyang0506 approved these changes May 18, 2026

fuhouyu-hw approved these changes May 19, 2026

fuhouyu-hw merged commit 1255d4f into tile-ai:ascendc_pto May 19, 2026
6 checks passed
