Port single-core scan from pto-kernels#113

Merged: learning-chip merged 7 commits into huawei-csl:main from vloncar:scan on May 11, 2026

Collaborator

vloncar commented Apr 20, 2026

This is mostly a straight port of the scan algorithm from huawei-csl/pto-kernels#101. I took a different approach to synchronization: instead of TPUSH/TPOP semantics, I relied on explicit sync_set/sync_wait calls. As a QoL improvement, the FFTS commands are now inserted automatically; the interface for that can be polished further. A few other small interface changes were needed to expose the scalar ops and the sync ops.

The implementation follows the one from pto-kernels, but I ran into issues with the running-sum calculation, which was initialized outside the loop and then updated inside it. The compiler could not determine that the loop executes multiple times, so it removed running_sum completely. As a workaround, we now store it in a tile. Furthermore, I could not figure out how to use pto.record_wait_pair to sync on the scalar pipe, and we cannot pass PIPE_S directly, so as a workaround I used a barrier. We should look into exposing the scalar pipe in the sync interface in PTOAS.
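For readers unfamiliar with the running-sum carry, here is a minimal host-side reference sketch in plain Python. It is not the kernel code from this PR. The function name `tiled_inclusive_scan` and the one-element list standing in for the "tile" that holds the running sum are illustrative assumptions; it also assumes an inclusive scan, which the PR does not state.

```python
from itertools import accumulate


def tiled_inclusive_scan(values, tile_size):
    """Inclusive prefix sum computed tile by tile with a carried running sum."""
    # One-element buffer standing in for the tile that holds the running sum
    # (assumption: mirrors the workaround of keeping the carry in a tile
    # rather than in a scalar initialized outside the loop).
    running_sum = [0]
    out = []
    for start in range(0, len(values), tile_size):
        tile = values[start:start + tile_size]
        # Scan within the tile, then offset by the total of all previous tiles.
        local = list(accumulate(tile))
        out.extend(x + running_sum[0] for x in local)
        # Carry this tile's total into the next iteration.
        running_sum[0] += local[-1]
    return out


result = tiled_inclusive_scan([1, 2, 3, 4, 5, 6, 7], tile_size=3)
```

The result should match a plain sequential prefix sum over the whole input, which makes this a convenient oracle when checking a device implementation.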

Collaborator Author

vloncar commented Apr 20, 2026

Updated to main (PTOAS 0.27).

Examples all passing:

==============================================================================
short summary info
==============================================================================
PASSED aot/activations/geglu_dynamic_multicore [25.54s]
PASSED aot/activations/relu_dynamic_multicore [13.02s]
PASSED aot/batch_matmul/matmul_dynbatch_multicore [11.07s]
PASSED aot/batch_matmul/matmul_dynbatch_multicore_2buf [10.61s]
PASSED aot/batch_matmul/matmul_dynbatch_multicore_opt [12.76s]
PASSED aot/elementwise/add_dynamic_multicore [20.69s]
PASSED aot/fast_hadamard [48.15s]
PASSED aot/fast_inverse/basic_dense [35.09s]
PASSED aot/fast_inverse/block_inversion [20.89s]
PASSED aot/matmul_optimization_guide [109.13s]
PASSED aot/matmul_optimization_guide/experimental [62.10s]
PASSED aot/print_tile [11.86s]
PASSED aot/simple_static/add_static_multicore [11.05s]
PASSED aot/simple_static/matmul_static_singlecore [10.89s]
PASSED aot/topk [19.40s]
PASSED aot/tpushpop/mix-kernel_mlir [39.93s]
PASSED jit/add_dynamic_multicore [9.42s]
PASSED jit/add_static_multicore [19.73s]
PASSED jit/matmul_dynamic_multicore [10.36s]
PASSED jit/scan [10.42s]
==============================================================================
20 passed, 0 failed in 512.13s

Tests all passing:

========================================================== test session starts ===========================================================
platform linux -- Python 3.11.15, pytest-9.0.3, pluggy-1.6.0
rootdir: /mounted_home/repo/pto-dsl
configfile: pytest.ini
plugins: anyio-4.13.0
collected 626 items                                                                                                                      

frontend/test_add_dynamic_ir.py .                                                                                                  [  0%]
frontend/test_add_ir.py .                                                                                                          [  0%]
frontend/test_caller_gen.py ....                                                                                                   [  0%]
frontend/test_compile.py ........                                                                                                  [  2%]
frontend/test_matmul_dynamic_ir.py .                                                                                               [  2%]
frontend/test_multifunc_ir.py ...                                                                                                  [  2%]
frontend/test_sub_ir.py .                                                                                                          [  3%]
npu/cvt_dynamic_multicore/test_cvt.py ........................                                                                     [  6%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [  7%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [  8%]
npu/gather_dynamic_multicore/test_gather_dynamic.py .....                                                                          [  9%]
npu/mrgsort_dynamic_multicore/test_mrgsort.py ........                                                                             [ 10%]
npu/sort32_dynamic_multicore/test_tsort32.py ........                                                                              [ 11%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 12%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 13%]
npu/gather_dynamic_multicore/test_gather_dynamic.py .....                                                                          [ 14%]
npu/sort32_dynamic_multicore/test_tsort32.py ........                                                                              [ 15%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 15%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 17%]
npu/gather_dynamic_multicore/test_gather_dynamic.py .....                                                                          [ 17%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 18%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 19%]
npu/gather_dynamic_multicore/test_gather_dynamic.py .....                                                                          [ 20%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 20%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 21%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 22%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 23%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 23%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 25%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 25%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 26%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 27%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 28%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 28%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 29%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 30%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 31%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py sss                                                                [ 31%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 33%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 33%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 34%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ...                                                                [ 35%]
npu/elementwise_unary_dynamic_multicore/test_unary_builder.py .......                                                              [ 36%]
npu/elementwise_binary_dynamic_multicore/test_binary_builder.py ............ssssss...ssssss...ssssss...                            [ 42%]
npu/expand_dynamic_multicore/test_expand.py ...................................................................................... [ 56%]
..........................................................................................                                         [ 70%]
npu/quant_dynamic_multicore/test_quant_dynamic.py ................                                                                 [ 73%]
npu/reduce_dynamic_multicore/test_reduce.py ...................................................................................... [ 86%]
..................................................................................                                                 [100%]

============================================== 605 passed, 21 skipped in 130.20s (0:02:10) ===============================================

Comment on lines +179 to +191
pto.store(cTile, svOut)
pto.record_wait_pair("STORE_ACC", "LOAD", 2)

pto.sync_set(pto.PIPE_FIX, 0)
pto.sync_wait(pto.PIPE_MTE3, 1)

with pto.vector_section():
    tvOut_vec = pto.as_tensor(
        tensor_type,
        ptr=y_ptr,
        shape=[total_len // cTILE_SIZE, cTILE_SIZE],
        strides=[cTILE_SIZE, c1],
    )
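To make the `shape`/`strides` pair in the excerpt above concrete, here is a small sketch in plain Python of how a row-major 2D view maps back to flat offsets. The helper names `tile_view_shape_strides` and `flat_offset` are hypothetical, the sizes used are made up for illustration, and it assumes `c1` is the constant 1.

```python
def tile_view_shape_strides(total_len, tile_size):
    """Shape and strides for viewing a flat buffer as rows of one tile each."""
    if total_len % tile_size != 0:
        raise ValueError("total_len must be a multiple of tile_size")
    shape = [total_len // tile_size, tile_size]
    strides = [tile_size, 1]  # row-major: next row is tile_size elements away
    return shape, strides


def flat_offset(index, strides):
    """Element offset into the flat buffer for a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


# Illustrative sizes only; the PR does not state the actual values.
shape, strides = tile_view_shape_strides(total_len=4096, tile_size=1024)
offset = flat_offset((2, 5), strides)
```

With these assumed sizes, row 2, column 5 of the view refers to flat element 2 * 1024 + 5.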
Collaborator


So the with pto.vector_section() syntax actually works for mix kernels, as long as we use manual sync with explicit sync_set/sync_wait calls. This syntax is quite different from the entry + subfunction + push/pop + auto-sync style in #98, so we now have two styles for writing mix kernels. They can co-exist for now, but we will need to unify them in the future.
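As a rough host-side analogy for the manual set/wait style, the sketch below uses Python's `threading.Event` as a stand-in for a hardware sync flag. It does not use the PTO API, and the names `run_handshake`, `producer`, and `consumer` are hypothetical. It only illustrates the ordering guarantee: the consumer does not read the buffer until the producer has signaled.

```python
import threading


def run_handshake(data):
    """Producer fills a buffer and sets a flag; consumer waits, then reads."""
    buffer = []
    ready = threading.Event()  # plays the role of a sync flag
    result = []

    def producer():
        buffer.extend(x * 2 for x in data)  # stand-in for the producer-side store
        ready.set()                         # analogous to a "set" on the flag

    def consumer():
        ready.wait()                        # analogous to a "wait" on the flag
        result.append(sum(buffer))          # stand-in for the consumer-side work

    # Start the consumer first to show that the wait really blocks it.
    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


total = run_handshake([1, 2, 3])
```

A push/pop style would hide the flag behind a queue-like interface; the explicit version keeps the flag visible, which is more verbose but makes the ordering easier to audit.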

@learning-chip learning-chip merged commit 5176daa into huawei-csl:main May 11, 2026
5 checks passed
2 participants