feat: Add PTO-DSL GEMM performance kernel with Python build and validation by Crystal-wzy · Pull Request #116 · hw-native-sys/pto-isa

Crystal-wzy · 2026-05-07T08:35:30Z

Summary

- Add `kernels/python/gemm_performance/` with a PTO-DSL GEMM kernel (`gemm_performance.py`) targeting A2/A3, featuring 2D multi-core tiling, L1 double buffering, L0 ping-pong buffers, and a four-stage LOAD→EXTRACT→MATMUL→STORE pipeline
- Add `run_gemm.py` runner with multiple shape presets, auto-build (ptoas + bisheng), Torch-NPU correctness validation, and benchmark with optional `torch.matmul` baseline comparison
- Add `compile.sh` for standalone `.pto` → `.cpp` → `.so` build flow
- Add bilingual README (English and Chinese) documenting operator description, tiling parameters, prerequisites, build/run instructions, and available presets
- Register `python/gemm_performance/` in the top-level `kernels/` README

Testing

- [x] Correctness test passes on A2/A3 hardware (`python3 run_gemm.py`)
- [x] Benchmark mode reports expected TFLOPS (`python3 run_gemm.py --benchmark`)
- [x] Pre-commit hooks pass (ruff, pyright, markdownlint)
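
For readers checking the benchmark figure, a minimal sketch of the usual GEMM throughput convention follows. It assumes an M×K by K×N product with each multiply-add counted as 2 FLOPs; whether `run_gemm.py` computes TFLOPS exactly this way is not confirmed by this PR page, and the function name and timing value are hypothetical.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of one (M x K) @ (K x N) matmul in TFLOPS.

    Assumes the common convention of 2 FLOPs per multiply-add.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * m * n * k
    return flops / seconds / 1e12


# Hypothetical measurement: a 4096 x 4096 x 4096 GEMM that took 1 ms.
print(f"{gemm_tflops(4096, 4096, 4096, 1e-3):.1f} TFLOPS")
```

Comparing this number against the `torch.matmul` baseline only makes sense when both runs use the same shape and dtype.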

gemini-code-assist

Code Review

This pull request introduces a PTO-DSL implementation of a GEMM performance kernel for Ascend A2/A3 processors, along with a C++ wrapper, build scripts, and a Python-based validation and benchmarking suite. The kernel features a four-stage pipeline with double buffering and multi-core partitioning. Feedback identifies a critical issue where the Python script fails to output the actual IR, as well as opportunities to improve build flexibility by making the NPU architecture configurable and ensuring broader DSL compatibility by avoiding list arguments in synchronization primitives.

gemini-code-assist · 2026-05-07T08:37:04Z

+
+
+if __name__ == "__main__":
+    print(build())


The build() function returns the GemmPerformance kernel function object decorated with @to_ir_module. To generate and print the actual PTO IR (which is what compile.sh expects to capture), you must invoke the returned function. Printing the function object itself will only output its string representation (e.g., <function GemmPerformance at ...>), resulting in an invalid .pto file and causing the build to fail.

Suggested change

-    print(build())
+    print(build()())
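
The distinction the reviewer describes can be illustrated without the PTO-DSL installed. The sketch below uses `to_ir_module_stub`, a hypothetical stand-in for `@to_ir_module` that simply returns a string when called; the real decorator's behavior is assumed from the review comment, not verified here.

```python
def to_ir_module_stub(fn):
    """Hypothetical stand-in for @to_ir_module: calling the result yields IR text."""
    def wrapper():
        return f"// IR for {fn.__name__}\n"
    return wrapper


def build():
    @to_ir_module_stub
    def GemmPerformance():
        pass
    return GemmPerformance


kernel = build()
as_object = str(kernel)  # string form of a function object, starts with "<function"
as_ir = kernel()         # the text that would be redirected into the .pto file

print(as_object)
print(as_ir)
```

With the stub, only the second form produces text that a downstream tool could parse, which is the failure mode the review comment warns about.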

gemini-code-assist · 2026-05-07T08:37:04Z

+PTO_LIB_PATH=${PTO_LIB_PATH:-$(cd "${SCRIPT_DIR}/../../.." && pwd)}
+
+bisheng -fPIC -shared -xcce -O2 -std=c++17 \
+    --npu-arch=dav-2201 \


The NPU architecture is hardcoded to dav-2201 (which corresponds to the A2 platform). Since this kernel is intended to support both A2 and A3 (as stated in the README), hardcoding the architecture might prevent the compiler from applying A3-specific optimizations or could lead to compatibility issues on A3 hardware. Consider making the architecture configurable via an environment variable.

Suggested change

-    --npu-arch=dav-2201 \
+    --npu-arch=${NPU_ARCH:-dav-2201} \
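
Since `run_gemm.py` also drives the bisheng build, the same override could be mirrored on the Python side. The following is a sketch only: the helper name `bisheng_flags` is hypothetical, the flag list is copied from the `compile.sh` excerpt above, and `example-arch` is a placeholder rather than a real architecture name.

```python
import os

DEFAULT_NPU_ARCH = "dav-2201"  # the value currently hardcoded in compile.sh


def bisheng_flags(env=None):
    """Return the bisheng flags from compile.sh with a configurable --npu-arch."""
    env = os.environ if env is None else env
    # An unset or empty NPU_ARCH falls back to the default,
    # matching the shell expansion ${NPU_ARCH:-dav-2201}.
    arch = env.get("NPU_ARCH") or DEFAULT_NPU_ARCH
    return ["-fPIC", "-shared", "-xcce", "-O2", "-std=c++17", f"--npu-arch={arch}"]


print(bisheng_flags({}))                            # default architecture
print(bisheng_flags({"NPU_ARCH": "example-arch"}))  # placeholder override
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.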

gemini-code-assist · 2026-05-07T08:37:05Z

+            pto.record_event("MOV_M2L", "LOAD", event_id=[0, 1])
+            pto.record_event("MATMUL", "MOV_M2L", event_id=[0, 1])


Passing a list to event_id in pto.record_event may not be supported by all versions of the PTO-DSL or might lead to unexpected IR generation if the backend expects a single integer constant. It is safer and more explicit to record each event individually to ensure both synchronization slots are correctly initialized.

pto.record_event("MOV_M2L", "LOAD", event_id=0)
pto.record_event("MOV_M2L", "LOAD", event_id=1)
pto.record_event("MATMUL", "MOV_M2L", event_id=0)
pto.record_event("MATMUL", "MOV_M2L", event_id=1)

gemini-code-assist · 2026-05-07T08:37:05Z

+            pto.wait_event("MOV_M2L", "LOAD", event_id=[0, 1])
+            pto.wait_event("MATMUL", "MOV_M2L", event_id=[0, 1])


Similar to the initialization at the start of the kernel, pto.wait_event should ideally be called for each event ID individually if the DSL does not explicitly support list inputs for synchronization primitives.

Suggested change

-            pto.wait_event("MOV_M2L", "LOAD", event_id=[0, 1])
-            pto.wait_event("MATMUL", "MOV_M2L", event_id=[0, 1])
+            pto.wait_event("MOV_M2L", "LOAD", event_id=0)
+            pto.wait_event("MOV_M2L", "LOAD", event_id=1)
+            pto.wait_event("MATMUL", "MOV_M2L", event_id=0)
+            pto.wait_event("MATMUL", "MOV_M2L", event_id=1)
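
If the per-ID calls feel repetitive, one option is a small helper that expands a sequence of IDs into individual calls. This is a sketch under assumptions: `for_each_event_id` is a hypothetical name, the fake recorder stands in for `pto.record_event` / `pto.wait_event`, and it is assumed that a plain Python loop is acceptable where the DSL traces the kernel body.

```python
from typing import Callable, Iterable


def for_each_event_id(sync_fn: Callable, src: str, dst: str,
                      event_ids: Iterable[int]) -> None:
    """Call sync_fn once per event ID, always passing a single integer."""
    for event_id in event_ids:
        sync_fn(src, dst, event_id=event_id)


# Fake primitive that records its arguments instead of emitting IR.
calls = []


def fake_record_event(src, dst, event_id):
    calls.append((src, dst, event_id))


for_each_event_id(fake_record_event, "MOV_M2L", "LOAD", (0, 1))
for_each_event_id(fake_record_event, "MATMUL", "MOV_M2L", (0, 1))
print(calls)
```

In the kernel, `fake_record_event` would be replaced by the real primitive, so each synchronization slot still gets its own explicit call.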

zhoubot · 2026-05-08T01:34:30Z

Triage review (2026-05-08): this PR is in good merge shape. GitHub reports it as clean against main, and all four CI checks pass (Pre-commit, Docs build, CPU SIM smoke, CPU SIM full ST). The scope is also self-contained under kernels/python/gemm_performance/ plus the kernel README index updates.

Two pre-merge checks I recommend: confirm the build artifacts listed in the README (gemm_performance.pto, generated .cpp, .so) are not accidentally committed by a normal local run, and keep the hardware validation note in the PR body because CI cannot prove the A2/A3 Torch-NPU path. Otherwise this looks ready for normal maintainer review/merge.


gemini-code-assist Bot reviewed May 7, 2026


Crystal-wzy force-pushed the main branch from 84e8594 to d939e8a Compare May 7, 2026 08:41

Crystal-wzy closed this May 8, 2026

Crystal-wzy force-pushed the main branch from 415d4cb to 4e2de63 Compare May 8, 2026 07:57

Crystal-wzy reopened this May 8, 2026

Crystal-wzy changed the title ~~feat: Add PTO-DSL GEMM performance kernel with Python validation runner~~ feat: Add PTO-DSL GEMM performance kernel with Python build and validation May 8, 2026

Crystal-wzy force-pushed the main branch from fc27cc7 to 5aeace7 Compare May 8, 2026 08:13

Crystal-wzy force-pushed the main branch 2 times, most recently from 5a61598 to 7705674 Compare May 13, 2026 01:42

Crystal-wzy force-pushed the main branch from 7705674 to 794cef9 Compare May 13, 2026 02:03

Crystal-wzy wants to merge 1 commit into hw-native-sys:main from Crystal-wzy:main

