feat: Add PTO-DSL GEMM performance kernel with Python build and validation #116

Open
Crystal-wzy wants to merge 1 commit into hw-native-sys:main from Crystal-wzy:main
Crystal-wzy (Collaborator) commented on May 7, 2026

Summary

  • Add kernels/python/gemm_performance/ with a PTO-DSL GEMM kernel
    (gemm_performance.py) targeting A2/A3, featuring 2D multi-core tiling,
    L1 double buffering, L0 ping-pong buffers, and a four-stage
    LOAD→EXTRACT→MATMUL→STORE pipeline
  • Add run_gemm.py runner with multiple shape presets, auto-build
    (ptoas + bisheng), Torch-NPU correctness validation, and benchmark
    with optional torch.matmul baseline comparison
  • Add compile.sh for standalone .pto → .cpp → .so build flow
  • Add bilingual README (English and Chinese) documenting operator
    description, tiling parameters, prerequisites, build/run instructions,
    and available presets
  • Register python/gemm_performance/ in the top-level kernels/ README
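
To make the tiling terms above concrete, here is a minimal Python sketch of how a 2D multi-core partition and a ping-pong buffer index could be computed. It is not taken from gemm_performance.py: the function names, the tile sizes, and the row-major ordering of tiles across cores are all assumptions chosen for illustration.

```python
from math import ceil


def tile_for_core(core_id: int, m: int, n: int, tile_m: int, tile_n: int):
    """Map a flat core index to the (row, col) origin of its output tile.

    Assumes tiles are handed out to cores in row-major order (hypothetical).
    """
    tiles_m = ceil(m / tile_m)  # number of tiles along M
    tiles_n = ceil(n / tile_n)  # number of tiles along N
    if core_id >= tiles_m * tiles_n:
        return None  # this core has no tile to compute
    row, col = divmod(core_id, tiles_n)
    return row * tile_m, col * tile_n


def ping_pong_slot(k_step: int) -> int:
    """Alternate between buffer 0 and buffer 1 on successive K steps."""
    return k_step % 2


# A 512x512 output split into 128x256 tiles forms a 4x2 grid of tiles.
print(tile_for_core(3, 512, 512, 128, 256))    # (128, 256)
print([ping_pong_slot(k) for k in range(4)])   # [0, 1, 0, 1]
```

The two-slot index is what lets one buffer be loaded while the other is being consumed, which is the idea behind both the L1 double buffering and the L0 ping-pong buffers named in the summary.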

Testing

  • Correctness test passes on A2/A3 hardware (python3 run_gemm.py)
  • Benchmark mode reports expected TFLOPS (python3 run_gemm.py --benchmark)
  • Pre-commit hooks pass (ruff, pyright, markdownlint)
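
For readers unfamiliar with how a GEMM benchmark arrives at a TFLOPS figure, the sketch below uses the conventional operation count of 2·M·N·K for an (M×K)·(K×N) product. How run_gemm.py actually computes its number is not shown in this PR, so the function name is hypothetical and the shape and timing are made-up inputs, not measured results.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an (M x K) @ (K x N) matmul in TFLOPS.

    Counts one multiply and one add per inner-product term: 2 * M * N * K.
    """
    flops = 2 * m * n * k
    return flops / seconds / 1e12


# Illustrative inputs only: a 4096^3 problem assumed to take 1 ms.
print(f"{gemm_tflops(4096, 4096, 4096, 1e-3):.1f} TFLOPS")
```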

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a PTO-DSL implementation of a GEMM performance kernel for Ascend A2/A3 processors, along with a C++ wrapper, build scripts, and a Python-based validation and benchmarking suite. The kernel features a four-stage pipeline with double buffering and multi-core partitioning. Feedback identifies a critical issue where the Python script fails to output the actual IR, as well as opportunities to improve build flexibility by making the NPU architecture configurable and ensuring broader DSL compatibility by avoiding list arguments in synchronization primitives.



if __name__ == "__main__":
    print(build())

high

The build() function returns the GemmPerformance kernel function object decorated with @to_ir_module. To generate and print the actual PTO IR (which is what compile.sh expects to capture), you must invoke the returned function. Printing the function object itself will only output its string representation (e.g., <function GemmPerformance at ...>), resulting in an invalid .pto file and causing the build to fail.

Suggested change
-    print(build())
+    print(build()())
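
The difference between the two lines can be reproduced in plain Python. The sketch below uses a hypothetical stand-in decorator, `to_ir_module_stub`, that mimics the behaviour the reviewer describes (a wrapper that emits IR text only when called). Whether the real `@to_ir_module` behaves this way is the reviewer's claim and is not verified here.

```python
def to_ir_module_stub(fn):
    """Hypothetical stand-in: calling the returned wrapper yields IR text."""

    def emit_ir():
        return f"// IR for {fn.__name__}\n"

    return emit_ir


def build():
    @to_ir_module_stub
    def GemmPerformance():
        pass

    return GemmPerformance  # returns the wrapper, not the IR text


as_object = str(build())  # string form of a function object
as_ir = build()()         # the emitted IR text

print(as_object.startswith("<function"))  # True
print(as_ir, end="")                      # // IR for GemmPerformance
```

If compile.sh redirects the script's standard output into a .pto file, only the second form would put IR text there under this assumption.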

PTO_LIB_PATH=${PTO_LIB_PATH:-$(cd "${SCRIPT_DIR}/../../.." && pwd)}

bisheng -fPIC -shared -xcce -O2 -std=c++17 \
--npu-arch=dav-2201 \

medium

The NPU architecture is hardcoded to dav-2201 (which corresponds to the A2 platform). Since this kernel is intended to support both A2 and A3 (as stated in the README), hardcoding the architecture might prevent the compiler from applying A3-specific optimizations or could lead to compatibility issues on A3 hardware. Consider making the architecture configurable via an environment variable.

Suggested change
-  --npu-arch=dav-2201 \
+  --npu-arch=${NPU_ARCH:-dav-2201} \
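
Since the PR also builds from Python (run_gemm.py's auto-build), the same default-with-override idea could be mirrored there. The sketch below is an assumption about how that might look, not code from the PR: `bisheng_flags` is a hypothetical helper, it returns only the flags visible in the excerpt above, and `dav-xxxx` in the usage lines is a placeholder rather than a real architecture name.

```python
import os


def bisheng_flags(env=None) -> list[str]:
    """Return the compiler flags from the excerpt, with a configurable arch.

    Mirrors the shell form ${NPU_ARCH:-dav-2201}: the default applies when
    NPU_ARCH is unset or empty.
    """
    env = os.environ if env is None else env
    npu_arch = env.get("NPU_ARCH") or "dav-2201"
    return [
        "-fPIC", "-shared", "-xcce", "-O2", "-std=c++17",
        f"--npu-arch={npu_arch}",
    ]


print(bisheng_flags({})[-1])                        # --npu-arch=dav-2201
print(bisheng_flags({"NPU_ARCH": "dav-xxxx"})[-1])  # --npu-arch=dav-xxxx
```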

Comment on lines +194 to +195
pto.record_event("MOV_M2L", "LOAD", event_id=[0, 1])
pto.record_event("MATMUL", "MOV_M2L", event_id=[0, 1])

medium

Passing a list to event_id in pto.record_event may not be supported by all versions of the PTO-DSL or might lead to unexpected IR generation if the backend expects a single integer constant. It is safer and more explicit to record each event individually to ensure both synchronization slots are correctly initialized.

            pto.record_event("MOV_M2L", "LOAD", event_id=0)
            pto.record_event("MOV_M2L", "LOAD", event_id=1)
            pto.record_event("MATMUL", "MOV_M2L", event_id=0)
            pto.record_event("MATMUL", "MOV_M2L", event_id=1)
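
If writing out every call by hand becomes repetitive, a small helper can expand a list of IDs into individual calls. This is a sketch only: `for_each_event_id` is a hypothetical name, and the fake recorder stands in for `pto.record_event` so the example runs without the PTO-DSL installed. It also assumes that an ordinary Python loop around a DSL call is unrolled at trace time, which the PR does not confirm.

```python
from typing import Callable, Iterable, Union


def for_each_event_id(sync_fn: Callable[..., None], src: str, dst: str,
                      event_ids: Union[int, Iterable[int]]) -> None:
    """Call sync_fn once per event ID, always passing a single integer."""
    ids = [event_ids] if isinstance(event_ids, int) else list(event_ids)
    for event_id in ids:
        sync_fn(src, dst, event_id=event_id)


# Stand-in for pto.record_event that just logs its arguments.
calls = []


def fake_record_event(src, dst, event_id):
    calls.append((src, dst, event_id))


for_each_event_id(fake_record_event, "MOV_M2L", "LOAD", [0, 1])
for_each_event_id(fake_record_event, "MATMUL", "MOV_M2L", [0, 1])
print(calls)
```

The logged calls match the four explicit lines suggested above, in the same order.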

Comment on lines +320 to +321
pto.wait_event("MOV_M2L", "LOAD", event_id=[0, 1])
pto.wait_event("MATMUL", "MOV_M2L", event_id=[0, 1])

medium

Similar to the initialization at the start of the kernel, pto.wait_event should ideally be called for each event ID individually if the DSL does not explicitly support list inputs for synchronization primitives.

Suggested change
- pto.wait_event("MOV_M2L", "LOAD", event_id=[0, 1])
- pto.wait_event("MATMUL", "MOV_M2L", event_id=[0, 1])
+ pto.wait_event("MOV_M2L", "LOAD", event_id=0)
+ pto.wait_event("MOV_M2L", "LOAD", event_id=1)
+ pto.wait_event("MATMUL", "MOV_M2L", event_id=0)
+ pto.wait_event("MATMUL", "MOV_M2L", event_id=1)
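
Why both slots need a record at the start and a wait at the end can be seen in a toy counter model. The sketch below is not the PTO runtime and makes no claim about its real semantics: `ToyEvents` and `run` are hypothetical, and the model simply treats each record as a token that a later wait on the same (source, destination, ID) triple consumes.

```python
from collections import Counter


class ToyEvents:
    """Toy counter model of record/wait pairs; not the PTO runtime."""

    def __init__(self):
        self.pending = Counter()

    def record(self, src, dst, event_id):
        self.pending[(src, dst, event_id)] += 1

    def wait(self, src, dst, event_id):
        key = (src, dst, event_id)
        if self.pending[key] == 0:
            raise RuntimeError(f"wait on {key} would block forever")
        self.pending[key] -= 1


def run(steps: int) -> ToyEvents:
    ev = ToyEvents()
    for slot in (0, 1):                     # prime both slots up front
        ev.record("MOV_M2L", "LOAD", slot)
    for step in range(steps):
        slot = step % 2                     # ping-pong between the two slots
        ev.wait("MOV_M2L", "LOAD", slot)    # slot is free to be refilled
        ev.record("MOV_M2L", "LOAD", slot)  # slot is released again
    for slot in (0, 1):                     # drain both slots at the end
        ev.wait("MOV_M2L", "LOAD", slot)
    return ev


leftover = sum(run(steps=5).pending.values())
print(leftover)  # 0
```

In this model, skipping the initial record for either slot makes the first wait on that slot fail, and skipping the final waits leaves tokens outstanding, which is the symmetry the two review comments point at.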

zhoubot (Collaborator) commented on May 8, 2026

Triage review (2026-05-08): this PR is in good merge shape. GitHub reports it as clean against main, and all four CI checks pass (Pre-commit, Docs build, CPU SIM smoke, CPU SIM full ST). The scope is also self-contained under kernels/python/gemm_performance/ plus the kernel README index updates.

Two pre-merge checks I recommend: confirm the build artifacts listed in the README (gemm_performance.pto, generated .cpp, .so) are not accidentally committed by a normal local run, and keep the hardware validation note in the PR body because CI cannot prove the A2/A3 Torch-NPU path. Otherwise this looks ready for normal maintainer review/merge.
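
The first check could be scripted along these lines. This is a sketch under assumptions, not part of the PR: `committed_artifacts` and `tracked_files` are hypothetical helpers, only the `.pto` and `.so` suffixes are flagged, and generated `.cpp` files are left out because their names are not given here and the directory also contains a hand-written C++ wrapper.

```python
import subprocess
from pathlib import PurePosixPath

# Suffixes treated as build outputs in this sketch (an assumption).
ARTIFACT_SUFFIXES = (".pto", ".so")


def committed_artifacts(tracked_paths,
                        prefix="kernels/python/gemm_performance/"):
    """Return tracked paths under prefix that look like build outputs."""
    return [
        p for p in tracked_paths
        if p.startswith(prefix)
        and PurePosixPath(p).suffix in ARTIFACT_SUFFIXES
    ]


def tracked_files():
    """List files tracked by git in the current repository."""
    out = subprocess.run(["git", "ls-files"], check=True,
                         capture_output=True, text=True)
    return out.stdout.splitlines()


# Example with a hand-written list; example.so is a hypothetical file name.
sample = [
    "kernels/python/gemm_performance/run_gemm.py",
    "kernels/python/gemm_performance/gemm_performance.pto",
    "kernels/python/gemm_performance/example.so",
]
print(committed_artifacts(sample))
```

Inside a checkout, `committed_artifacts(tracked_files())` returning an empty list after a normal local run would suggest nothing was committed by accident.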

Crystal-wzy closed this on May 8, 2026
Crystal-wzy reopened this on May 8, 2026
Crystal-wzy changed the title from "feat: Add PTO-DSL GEMM performance kernel with Python validation runner" to "feat: Add PTO-DSL GEMM performance kernel with Python build and validation" on May 8, 2026
Crystal-wzy pushed a commit to Crystal-wzy/pto-isa that referenced this pull request May 13, 2026
…ation (hw-native-sys#116)

Crystal-wzy force-pushed the main branch 2 times, most recently from 5a61598 to 7705674 on May 13, 2026 01:42