hw-native-sys · Crystal-wzy · May 18, 2026
diff --git a/kernels/README.md b/kernels/README.md
@@ -23,6 +23,8 @@ Most kernel subdirectories are **self-contained mini-projects** (kernel + host +
     - `manual/a5/matmul_mxfp8_performance/`: MXFP8 matrix multiplication example
   - `manual/common/`: Cross-platform kernels
     - `manual/common/flash_atten/`: Flash-Attention kernel (A2/A3/A5)
+- `python/`: PTO-DSL kernel sources and lightweight Python-side validation runners
+  - `python/gemm_performance/`: PTO-DSL GEMM kernel build script and Torch-NPU validation/benchmark runner
 - `custom/`: Examples/scaffolding for custom kernel/operator extensions
 
 ## Notes

diff --git a/kernels/README_zh.md b/kernels/README_zh.md
@@ -23,6 +23,8 @@
     - `manual/a5/matmul_mxfp8_performance/`：MXFP8 矩阵乘法示例
   - `manual/common/`：跨平台 kernels
     - `manual/common/flash_atten/`：Flash-Attention kernel（A2/A3/A5）
+- `python/`：PTO-DSL kernel 源码与轻量级 Python 侧验证/性能脚本
+  - `python/gemm_performance/`：PTO-DSL GEMM kernel 的构建脚本，以及基于 Torch-NPU 的验证/benchmark 入口
 - `custom/`：自定义 kernel / operator 扩展的示例与脚手架
 
 ## 备注

diff --git a/kernels/python/gemm_performance/README.md b/kernels/python/gemm_performance/README.md
@@ -0,0 +1,128 @@
+# PTO-DSL GEMM Performance Kernel
+
+## Overview
+
+This directory contains a PTO-DSL implementation of the A2/A3 GEMM performance kernel plus a Python runner for correctness checks and benchmarking.
+
+## Supported AI Processors
+
+- A2/A3
+
+## Directory Layout
+
+```text
+kernels/python/gemm_performance/
+├── gemm_performance.py      # PTO-DSL source that emits PTO IR
+├── compile.sh               # Builds .pto -> .cpp -> .so
+├── caller.cpp               # Shared-library launcher wrapper
+├── run_gemm.py              # Torch-NPU correctness test and benchmark runner
+├── gemm_performance.pto     # Generated PTO IR (build artifact)
+├── gemm_performance.cpp     # Generated C++ kernel (build artifact)
+└── gemm_kernel.so           # Compiled shared library (build artifact)
+```
+
+## Operator Description
+
+The kernel computes:
+
+```text
+C = A x B
+```
+
+- `A`: `m×k`, `float16`, `ND`
+- logical `B`: `k×n`, `float16`
+- GM storage for `B`: transposed `DN` layout (`n×k`)
+- `C`: `m×n`, `float32`, `ND`
+
+The Python runner follows that contract explicitly: it creates a logical `B[k, n]`, transposes it before passing it to the kernel, and uses the non-transposed tensor for the Torch reference.
+
+## Default Shape And Tiling
+
+- `m = 6144`
+- `k = 6144`
+- `n = 6144`
+- `singleCoreM = 1536`
+- `singleCoreK = 6144`
+- `singleCoreN = 1024`
+- `baseM = 128`
+- `baseK = 64`
+- `baseN = 256`
+- `stepKa = 4`
+- `stepKb = 4`
+- `blockDim = 24`
+
+## Prerequisites
+
+Before building or running, ensure the following are available:
+
+- Ascend CANN toolchain environment
+- `bisheng`
+- `ptoas`
+- Python modules for MLIR/PTO (`mlir.ir`, `mlir.dialects.pto`)
+- `torch` and `torch_npu`
+
+`compile.sh` tries to source a local Ascend environment (`/usr/local/Ascend/ascend-toolkit/set_env.sh`, then `/usr/local/Ascend/cann-8.5.0/set_env.sh`) when `bisheng` is not already on `PATH`. It also honors:
+
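+- `PTOAS_ROOT`: prepends `${PTOAS_ROOT}/bin` and `${PTOAS_ROOT}` to `PATH`, and `${PTOAS_ROOT}/lib` to `LD_LIBRARY_PATH`
+- `PTO_LIB_PATH`: overrides the repository root whose `include/` directory is passed to `bisheng` (default: three directories above this one)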
+
+## Build
+
+```bash
+cd ${git_clone_path}/kernels/python/gemm_performance
+bash compile.sh
+```
+
+Expected build steps:
+
+1. Generate `gemm_performance.pto` by running `gemm_performance.py`
+2. Assemble the PTO IR into `gemm_performance.cpp` with `ptoas`
+3. Compile `caller.cpp` into `gemm_kernel.so` with `bisheng`
+
+## Run
+
+Correctness test:
+
+```bash
+cd ${git_clone_path}/kernels/python/gemm_performance
+python3 run_gemm.py
+```
+
+Correctness test plus benchmark:
+
+```bash
+python3 run_gemm.py --benchmark
+```
+
+Benchmark plus matching Torch-NPU baseline:
+
+```bash
+python3 run_gemm.py --torch-npu
+```
+
+This baseline uses a single `torch.matmul` call with fp16 inputs and fp16 output, whereas the PTO-DSL kernel writes `float32` output.
+
+Use a specific shared library:
+
+```bash
+python3 run_gemm.py --lib ./gemm_kernel.so
+```
+
+Select the device with either of these environment variables:
+
+- `PTODSL_TEST_DEVICE_ID`
+- `TASK_DEVICE`
+
+If neither is set, the runner defaults to `npu:0`.
+
+If `gemm_kernel.so` is missing, `run_gemm.py` automatically invokes `compile.sh`.
+
+## Expected Output
+
+Correctness mode prints a `PASS` or `FAIL` line for the configured shape and, when every check passes, ends with:
+
+```text
+Result: ALL PASS
+```
+
+Benchmark mode additionally prints average latency and TFLOPS.
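
The operand layout contract described in this README (logical `B[k, n]`, passed to the kernel transposed as `n×k`, and used untransposed for the reference) can be illustrated with a small pure-Python sketch. This is not the code in `run_gemm.py`: the helper names `transpose`, `matmul`, and `matmul_b_transposed` are hypothetical, and plain lists stand in for Torch tensors.

```python
def transpose(mat):
    """Return the transpose of a row-major list-of-lists matrix."""
    return [list(col) for col in zip(*mat)]


def matmul(a, b):
    """Reference product C = A x B, with A as m x k and logical B as k x n."""
    k = len(b)
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def matmul_b_transposed(a, b_t):
    """Same product, but B arrives transposed (n x k), mirroring the kernel input."""
    return [[sum(x * y for x, y in zip(row, b_row)) for b_row in b_t] for row in a]


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # A: 2 x 3
    b = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # logical B: 3 x 2
    b_t = transpose(b)                        # stored form: 2 x 3
    # Both paths must agree: the transposed operand is only a storage choice.
    assert matmul_b_transposed(a, b_t) == matmul(a, b)
    print(matmul(a, b))
```

The point of the sketch is that the transposed operand changes only how `B` is laid out in memory, not the mathematical result, which is why the reference can keep using the untransposed tensor.
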
diff --git a/kernels/python/gemm_performance/README_zh.md b/kernels/python/gemm_performance/README_zh.md
@@ -0,0 +1,128 @@
+# PTO-DSL GEMM 性能 Kernel
+
+## 概览
+
+本目录包含一个面向 A2/A3 的 PTO-DSL GEMM 性能 kernel，以及用于正确性验证和 benchmark 的 Python 运行脚本。
+
+## 支持的 AI 处理器
+
+- A2/A3
+
+## 目录结构
+
+```text
+kernels/python/gemm_performance/
+├── gemm_performance.py      # 生成 PTO IR 的 PTO-DSL 源码
+├── compile.sh               # 执行 .pto -> .cpp -> .so 的构建脚本
+├── caller.cpp               # 共享库 launcher 包装层
+├── run_gemm.py              # 基于 Torch-NPU 的正确性测试和 benchmark 入口
+├── gemm_performance.pto     # 生成的 PTO IR（构建产物）
+├── gemm_performance.cpp     # 生成的 C++ kernel（构建产物）
+└── gemm_kernel.so           # 编译后的共享库（构建产物）
+```
+
+## 算子说明
+
+该 kernel 计算：
+
+```text
+C = A x B
+```
+
+- `A`：`m×k`，`float16`，`ND`
+- 逻辑上的 `B`：`k×n`，`float16`
+- `B` 在 GM 中的存储：转置后的 `DN` 布局（`n×k`）
+- `C`：`m×n`，`float32`，`ND`
+
+Python runner 明确遵循这一定义：先构造逻辑上的 `B[k, n]`，传给 kernel 前先转置，再使用未转置的 `B` 参与 Torch 参考计算。
+
+## 默认形状与 Tiling
+
+- `m = 6144`
+- `k = 6144`
+- `n = 6144`
+- `singleCoreM = 1536`
+- `singleCoreK = 6144`
+- `singleCoreN = 1024`
+- `baseM = 128`
+- `baseK = 64`
+- `baseN = 256`
+- `stepKa = 4`
+- `stepKb = 4`
+- `blockDim = 24`
+
+## 依赖条件
+
+在构建或运行前，需要具备：
+
+- Ascend CANN 工具链环境
+- `bisheng`
+- `ptoas`
+- MLIR/PTO Python 模块（`mlir.ir`、`mlir.dialects.pto`）
+- `torch` 与 `torch_npu`
+
+`compile.sh` 在 `PATH` 中找不到 `bisheng` 时，会尝试自动加载本地 Ascend 环境；同时它也支持：
+
+- `PTOAS_ROOT`：将 PTO assembler 的二进制和库目录添加到 `PATH` / `LD_LIBRARY_PATH` 的最前面
+- `PTO_LIB_PATH`：覆盖 `include/` 所使用的仓库根目录
+
+## 构建
+
+```bash
+cd ${git_clone_path}/kernels/python/gemm_performance
+bash compile.sh
+```
+
+预期构建步骤：
+
+1. 由 `gemm_performance.py` 生成 `gemm_performance.pto`
+2. 将 PTO IR 汇编为 `gemm_performance.cpp`
+3. 将 `caller.cpp` 编译成 `gemm_kernel.so`
+
+## 运行
+
+仅做正确性测试：
+
+```bash
+cd ${git_clone_path}/kernels/python/gemm_performance
+python3 run_gemm.py
+```
+
+正确性测试加 benchmark：
+
+```bash
+python3 run_gemm.py --benchmark
+```
+
+benchmark 并附带同 shape 的 Torch-NPU baseline：
+
+```bash
+python3 run_gemm.py --torch-npu
+```
+
+该 baseline 使用单次 `torch.matmul`，输入为 fp16，输出也为 fp16。
+
+指定共享库路径：
+
+```bash
+python3 run_gemm.py --lib ./gemm_kernel.so
+```
+
+设备可通过以下环境变量指定：
+
+- `PTODSL_TEST_DEVICE_ID`
+- `TASK_DEVICE`
+
+如果两者都未设置，runner 默认使用 `npu:0`。
+
+如果 `gemm_kernel.so` 不存在，`run_gemm.py` 会自动调用 `compile.sh`。
+
+## 预期输出
+
+正确性模式会打印当前 shape 的 `PASS` / `FAIL`，并最终输出：
+
+```text
+Result: ALL PASS
+```
+
+开启 benchmark 后，还会额外打印平均时延和 TFLOPS。
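
The default shape and tiling values listed in the README can be sanity-checked with a few lines of arithmetic. The sketch below assumes each core handles exactly one `singleCoreM × singleCoreN` output block and counts a GEMM as `2·m·k·n` floating-point operations. Both are assumptions made for illustration, not statements about how `run_gemm.py` or the kernel is written, and the function names are hypothetical.

```python
# Values copied from the README's default shape and tiling list.
M, K, N = 6144, 6144, 6144
SINGLE_CORE_M, SINGLE_CORE_N = 1536, 1024
BASE_M, BASE_K, BASE_N = 128, 64, 256
BLOCK_DIM = 24


def core_blocks(m, n, single_core_m, single_core_n):
    """Count singleCoreM x singleCoreN output blocks (requires exact division)."""
    assert m % single_core_m == 0 and n % single_core_n == 0
    return (m // single_core_m) * (n // single_core_n)


def base_tiles_per_core(single_core_m, single_core_n, base_m, base_n):
    """Count baseM x baseN output tiles inside one per-core block."""
    return (single_core_m // base_m) * (single_core_n // base_n)


def gemm_tflops(m, k, n, latency_s):
    """Throughput in TFLOPS, counting one multiply and one add per MAC."""
    return 2.0 * m * k * n / latency_s / 1e12


if __name__ == "__main__":
    # 4 blocks along M times 6 blocks along N matches blockDim = 24.
    print(core_blocks(M, N, SINGLE_CORE_M, SINGLE_CORE_N) == BLOCK_DIM)
    print(base_tiles_per_core(SINGLE_CORE_M, SINGLE_CORE_N, BASE_M, BASE_N))
    print(K // BASE_K)  # number of baseK slices along the reduction axis
```

Under these assumptions the defaults are self-consistent: the output splits into 4 × 6 = 24 per-core blocks, which equals `blockDim`.
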
diff --git a/kernels/python/gemm_performance/compile.sh b/kernels/python/gemm_performance/compile.sh
@@ -0,0 +1,62 @@
+#!/bin/bash
+set -e
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+cd "${SCRIPT_DIR}"
+
+# Load a local Ascend environment when the compiler is not already on PATH.
+if ! command -v bisheng >/dev/null 2>&1; then
+    if [[ -f /usr/local/Ascend/ascend-toolkit/set_env.sh ]]; then
+        # shellcheck disable=SC1091
+        source /usr/local/Ascend/ascend-toolkit/set_env.sh || true
+    elif [[ -f /usr/local/Ascend/cann-8.5.0/set_env.sh ]]; then
+        # shellcheck disable=SC1091
+        source /usr/local/Ascend/cann-8.5.0/set_env.sh || true
+    fi
+fi
+
+if [[ -n "${PTOAS_ROOT:-}" ]]; then
+    export PATH="${PTOAS_ROOT}/bin:${PTOAS_ROOT}:${PATH}"
+    export LD_LIBRARY_PATH="${PTOAS_ROOT}/lib:${LD_LIBRARY_PATH:-}"
+fi
+
+if ! command -v bisheng >/dev/null 2>&1; then
+    echo "[ERROR] Missing executable: bisheng" >&2
+    echo "Source the Ascend CANN environment or add bisheng to PATH before running compile.sh." >&2
+    exit 1
+fi
+
+# Remove previous build artifacts.
+rm -f gemm_performance.pto gemm_performance.cpp gemm_kernel.so
+
+if ! python3 -c "import mlir.ir; from mlir.dialects import pto" >/dev/null 2>&1; then
+    echo "[ERROR] Missing Python MLIR/PTO modules." >&2
+    echo "Ensure the MLIR Python environment is available before running compile.sh." >&2
+    exit 1
+fi
+
+if ! command -v ptoas >/dev/null 2>&1; then
+    echo "[ERROR] Missing executable: ptoas" >&2
+    echo "Ensure ptoas is on PATH before running compile.sh." >&2
+    exit 1
+fi
+
+# Step 1: Generate PTO IR from Python DSL
+echo "[1/3] Generating IR..."
+python3 "${SCRIPT_DIR}/gemm_performance.py" > gemm_performance.pto
+
+# Step 2: Assemble PTO IR to C++
+echo "[2/3] Assembling with ptoas..."
+ptoas --enable-insert-sync gemm_performance.pto -o gemm_performance.cpp
+
+# Step 3: Compile to shared library
+echo "[3/3] Compiling with bisheng..."
+PTO_LIB_PATH=${PTO_LIB_PATH:-$(cd "${SCRIPT_DIR}/../../.." && pwd)}
+
+bisheng -fPIC -shared -xcce -O2 -std=c++17 \
+    --npu-arch=dav-2201 \
+    -I"${PTO_LIB_PATH}/include" \
+    "${SCRIPT_DIR}/caller.cpp" \
+    -o "${SCRIPT_DIR}/gemm_kernel.so"
+
+echo "Done! Run: python3 run_gemm.py"
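
The README states that `run_gemm.py` invokes `compile.sh` when `gemm_kernel.so` is missing. The runner itself is not shown in this diff, so the following is only a minimal sketch of how such a build-on-demand check could look. The function name `ensure_kernel_library` and its `build_cmd` parameter are hypothetical, and the demo uses a stand-in command in place of `bash compile.sh` so that it runs without the Ascend toolchain.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def ensure_kernel_library(lib_path, build_cmd):
    """Run build_cmd in the library's directory if the library is missing."""
    lib_path = Path(lib_path)
    if not lib_path.exists():
        # check=True raises CalledProcessError when the build command fails.
        subprocess.run(build_cmd, check=True, cwd=lib_path.parent)
    if not lib_path.exists():
        raise FileNotFoundError(f"build did not produce {lib_path}")
    return lib_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lib = Path(tmp) / "gemm_kernel.so"
        # Stand-in for ["bash", "compile.sh"]: creates an empty file only.
        fake_build = [sys.executable, "-c", "open('gemm_kernel.so', 'wb').close()"]
        print(ensure_kernel_library(lib, fake_build).name)
```

In a real checkout the call would presumably be `ensure_kernel_library("gemm_kernel.so", ["bash", "compile.sh"])`, run from `kernels/python/gemm_performance/`.
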