
[AscendNPU-IR][Expert] Fix FA reduce block_n_tail bug#851

Open
Wamdus3 wants to merge 1 commit into tile-ai:npuir from WCY2000:fix_fa

Conversation

@Wamdus3

@Wamdus3 Wamdus3 commented Apr 17, 2026

Fixed FA reduce block_n_tail bug

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request updates the T.reduce calls in examples/flash_attn_npuir.py to include a size parameter, intended to handle tail data during max and sum reductions. However, the review feedback indicates that the current library infrastructure in tilelang (both the Python API and the C++ backend) does not support these keyword arguments or the size parameter, meaning the underlying implementation must be updated to accommodate these changes.

```diff
  T.vmul(cross_kernel_f32_N, acc_c_scale, cross_kernel_f32_N)
  T.reduce(
-     cross_kernel_f32_N, scores_max, dims=[1], reduce_mode="max"
+     cross_kernel_f32_N, scores_max, dims=[1], reduce_mode="max", size=[real_m, tail_size_n]
```
Contributor


Severity: high

The addition of the size parameter is logically correct for fixing the tail bug by limiting the reduction to valid data. However, there is a significant discrepancy between this call and the library definition in tilelang/language/reduce.py. The current T.reduce implementation expects (buffer, out, reduce_type, dim, clear) and does not support keyword arguments like dims, reduce_mode, or size. Furthermore, the C++ implementation in src/op/reduce.cc (specifically the ReduceOp constructor and Lower function) lacks the logic to handle a dynamic size argument. Please ensure the library infrastructure is updated to support this extended API and to utilize the size bounds during lowering.
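To make the tail bug concrete, here is a minimal pure-Python sketch (all names are hypothetical illustrations, not the tilelang API): a row-wise max reduction over a tile whose trailing columns hold stale padding returns the wrong maximum unless the reduction is bounded to the valid region, which is what the `size=[real_m, tail_size_n]` argument is meant to express.

```python
# Hypothetical sketch of the tail bug -- not tilelang API.
# Only the first tail_size_n columns of the tile hold valid data;
# the remaining columns contain stale values from a previous block.
def reduce_max(tile, size=None):
    """Row-wise max; `size` = [rows, cols] bounds the valid region."""
    rows = size[0] if size else len(tile)
    cols = size[1] if size else len(tile[0])
    return [max(tile[i][:cols]) for i in range(rows)]

tile = [
    [1.0, 2.0, 9e9, 9e9],  # columns 2-3 are stale padding
    [3.0, 4.0, 9e9, 9e9],
]

# Unbounded, the stale padding wins the max reduction:
assert reduce_max(tile) == [9e9, 9e9]

# Bounded to the valid region, the result is correct:
assert reduce_max(tile, size=[2, 2]) == [2.0, 4.0]
```

In flash attention the row maximum feeds the softmax rescaling, so a padding-corrupted max silently poisons every later accumulation step.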


```diff
  T.reduce(
-     cross_kernel_f32_N, scores_sum, dims=[1], reduce_mode="sum"
+     cross_kernel_f32_N, scores_sum, dims=[1], reduce_mode="sum", size=[real_m, tail_size_n]
```
Contributor


Severity: high

Similar to the change at line 210, this T.reduce call uses an API that is not currently supported by the provided reduce.py and reduce.cc files. While the logic of passing size=[real_m, tail_size_n] correctly addresses the issue of stale data in the padding area during the sum reduction, the underlying operator must be updated to accept and implement these bounds.
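The sum reduction fails the same way: stale values in the padding columns get accumulated into `scores_sum`, inflating the softmax denominator. A short sketch under the same hypothetical names (not tilelang API):

```python
# Hypothetical sketch for the sum reduction -- not tilelang API.
def reduce_sum(tile, size=None):
    """Row-wise sum over the valid [rows, cols] region."""
    rows = size[0] if size else len(tile)
    cols = size[1] if size else len(tile[0])
    return [sum(tile[i][:cols]) for i in range(rows)]

tile = [
    [0.5, 0.5, 7.0, 7.0],  # columns 2-3 are stale padding
]

# Unbounded: stale padding inflates the softmax denominator.
assert reduce_sum(tile) == [15.0]

# Bounded to the valid tail region, the sum is correct.
assert reduce_sum(tile, size=[1, 2]) == [1.0]
```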

@baberrehman added the AscendNPU-IR label (Want to merge into the npuir branch) on Apr 17, 2026
```diff
  T.vmul(cross_kernel_f32_N, acc_c_scale, cross_kernel_f32_N)
  T.reduce(
-     cross_kernel_f32_N, scores_max, dims=[1], reduce_mode="max"
+     cross_kernel_f32_N, scores_max, dims=[1], reduce_mode="max", size=[real_m, tail_size_n]
```
Collaborator


We previously removed size because slice syntax was more precise; why is size being reinstated here?


Labels

AscendNPU-IR Want to merge into the npuir branch


3 participants