2 changes: 2 additions & 0 deletions kernels/README.md
@@ -23,6 +23,8 @@ Most kernel subdirectories are **self-contained mini-projects** (kernel + host +
- `manual/a5/matmul_mxfp8_performance/`: MXFP8 matrix multiplication example
- `manual/common/`: Cross-platform kernels
- `manual/common/flash_atten/`: Flash-Attention kernel (A2/A3/A5)
- `python/`: PTO-DSL kernel sources and lightweight Python-side validation runners
- `python/gemm_performance/`: PTO-DSL GEMM kernel build script and Torch-NPU validation/benchmark runner
- `custom/`: Examples/scaffolding for custom kernel/operator extensions

## Notes
2 changes: 2 additions & 0 deletions kernels/README_zh.md
@@ -23,6 +23,8 @@
- `manual/a5/matmul_mxfp8_performance/`:MXFP8 矩阵乘法示例
- `manual/common/`:跨平台 kernels
- `manual/common/flash_atten/`:Flash-Attention kernel(A2/A3/A5)
- `python/`:PTO-DSL kernel 源码与轻量级 Python 侧验证/性能脚本
- `python/gemm_performance/`:PTO-DSL GEMM kernel 的构建脚本,以及基于 Torch-NPU 的验证/benchmark 入口
- `custom/`:自定义 kernel / operator 扩展的示例与脚手架

## 备注
128 changes: 128 additions & 0 deletions kernels/python/gemm_performance/README.md
@@ -0,0 +1,128 @@
# PTO-DSL GEMM Performance Kernel

## Overview

This directory contains a PTO-DSL implementation of the A2/A3 GEMM performance kernel plus a Python runner for correctness checks and benchmarking.

## Supported AI Processors

- A2/A3

## Directory Layout

```text
kernels/python/gemm_performance/
├── gemm_performance.py # PTO-DSL source that emits PTO IR
├── compile.sh # Builds .pto -> .cpp -> .so
├── caller.cpp # Shared-library launcher wrapper
├── run_gemm.py # Torch-NPU correctness test and benchmark runner
├── gemm_performance.pto # Generated PTO IR (build artifact)
├── gemm_performance.cpp # Generated C++ kernel (build artifact)
└── gemm_kernel.so # Compiled shared library (build artifact)
```

## Operator Description

The kernel computes:

```text
C = A x B
```

- `A`: `m×k`, `float16`, `ND`
- logical `B`: `k×n`, `float16`
- GM storage for `B`: transposed `DN` layout (`n×k`)
- `C`: `m×n`, `float32`, `ND`

The Python runner follows that contract explicitly: it creates a logical `B[k, n]`, transposes it before passing it to the kernel, and uses the non-transposed tensor for the Torch reference.
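The layout contract above can be illustrated with a small pure-Python sketch. It is not taken from `run_gemm.py`; the helper names (`transpose`, `matmul_reference`, `matmul_b_stored_nk`) are hypothetical, and plain lists stand in for the `float16` tensors. It only shows that a consumer reading `B` in its stored `n×k` form produces the same `C` as the reference computed from the logical `k×n` tensor.

```python
# Illustrative only: helper names are hypothetical, not part of run_gemm.py.

def transpose(mat):
    """Return the transpose of a row-major list-of-lists matrix."""
    return [list(col) for col in zip(*mat)]


def matmul_reference(a, b):
    """Reference C = A x B with logical B of shape k x n."""
    k = len(b)
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def matmul_b_stored_nk(a, b_nk):
    """C = A x B where B is supplied in its stored n x k (transposed) form."""
    # Each output element is a dot product of a row of A and a row of b_nk.
    return [[sum(x * y for x, y in zip(row, b_row)) for b_row in b_nk]
            for row in a]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # m=2, k=3
b = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # logical B: k=3, n=2
b_nk = transpose(b)                        # stored form: n=2, k=3

c_ref = matmul_reference(a, b)             # uses the non-transposed B
c_kernel_view = matmul_b_stored_nk(a, b_nk)
print(c_ref == c_kernel_view)
```

With Torch tensors, the equivalent of `transpose(b)` would typically be a transposed, contiguous copy of `B`; how the runner actually does this is not shown here.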

## Default Shape And Tiling

- `m = 6144`
- `k = 6144`
- `n = 6144`
- `singleCoreM = 1536`
- `singleCoreK = 6144`
- `singleCoreN = 1024`
- `baseM = 128`
- `baseK = 64`
- `baseN = 256`
- `stepKa = 4`
- `stepKb = 4`
- `blockDim = 24`
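
The defaults above are mutually consistent, which a short arithmetic sketch can confirm. The interpretation is an assumption inferred from the numbers, not a statement from the kernel source: each core is taken to own one `singleCoreM × singleCoreN` block of `C`, and that block is covered by `baseM × baseN` tiles while `k` advances in `baseK` steps. The variable names below are hypothetical.

```python
# Sanity-check the default shape and tiling (interpretation is an assumption).
m, k, n = 6144, 6144, 6144
single_core_m, single_core_k, single_core_n = 1536, 6144, 1024
base_m, base_k, base_n = 128, 64, 256
block_dim = 24

# Every tiling level should divide the level above it evenly.
assert m % single_core_m == 0 and n % single_core_n == 0
assert single_core_m % base_m == 0 and single_core_n % base_n == 0
assert single_core_k % base_k == 0

cores_m = m // single_core_m            # 4 blocks along m
cores_n = n // single_core_n            # 6 blocks along n
core_blocks = cores_m * cores_n         # 24, which matches blockDim

base_tiles_per_core = (single_core_m // base_m) * (single_core_n // base_n)  # 12 * 4
k_steps = single_core_k // base_k       # 96 baseK-sized steps along k

print(core_blocks == block_dim, base_tiles_per_core, k_steps)
```

If the shape is changed, the same divisibility checks are a quick way to catch a tiling that no longer covers the problem evenly.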

## Prerequisites

Before building or running, ensure the following are available:

- Ascend CANN toolchain environment
- `bisheng`
- `ptoas`
- Python modules for MLIR/PTO (`mlir.ir`, `mlir.dialects.pto`)
- `torch` and `torch_npu`

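When `bisheng` is not already on `PATH`, `compile.sh` tries to source a local Ascend environment: `/usr/local/Ascend/ascend-toolkit/set_env.sh` first, then `/usr/local/Ascend/cann-8.5.0/set_env.sh`. It also honors:

- `PTOAS_ROOT`: prepends `${PTOAS_ROOT}/bin` and `${PTOAS_ROOT}` to `PATH`, and `${PTOAS_ROOT}/lib` to `LD_LIBRARY_PATH`
- `PTO_LIB_PATH`: overrides the repo root whose `include/` directory is passed to `bisheng` (defaults to the directory three levels above this one)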

## Build

```bash
cd ${git_clone_path}/kernels/python/gemm_performance
bash compile.sh
```

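The script first removes any previous build artifacts, then runs these steps:

1. Generate `gemm_performance.pto` from `gemm_performance.py`
2. Assemble the PTO IR into `gemm_performance.cpp` with `ptoas --enable-insert-sync`
3. Compile `caller.cpp` into `gemm_kernel.so` with `bisheng`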

## Run

Correctness test:

```bash
cd ${git_clone_path}/kernels/python/gemm_performance
python3 run_gemm.py
```

Correctness test plus benchmark:

```bash
python3 run_gemm.py --benchmark
```

Benchmark plus matching Torch-NPU baseline:

```bash
python3 run_gemm.py --torch-npu
```

This baseline uses a single `torch.matmul` call with fp16 inputs and an fp16 output, whereas the PTO-DSL kernel writes a `float32` output.

Use a specific shared library:

```bash
python3 run_gemm.py --lib ./gemm_kernel.so
```

Select the device with either of these environment variables:

- `PTODSL_TEST_DEVICE_ID`
- `TASK_DEVICE`

If neither is set, the runner defaults to `npu:0`.
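
A minimal sketch of this device selection is shown below. It is not copied from `run_gemm.py`: the function name `resolve_device` is hypothetical, and two details are assumptions, namely that `PTODSL_TEST_DEVICE_ID` takes precedence when both variables are set, and that a value may be either a bare index (`3`) or a full device string (`npu:3`).

```python
import os


def resolve_device(env=None):
    """Pick the NPU device string from the environment (hypothetical helper).

    Assumptions: PTODSL_TEST_DEVICE_ID wins over TASK_DEVICE, and a value
    may be a bare index or an already-qualified "npu:<id>" string.
    """
    env = os.environ if env is None else env
    for name in ("PTODSL_TEST_DEVICE_ID", "TASK_DEVICE"):
        value = env.get(name, "").strip()
        if value:
            return value if value.startswith("npu:") else f"npu:{value}"
    # Neither variable is set: fall back to the documented default.
    return "npu:0"


print(resolve_device({"TASK_DEVICE": "2"}))
```

Passing the environment as a dictionary keeps the helper easy to test without mutating `os.environ`.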

If `gemm_kernel.so` is missing, `run_gemm.py` automatically invokes `compile.sh`.
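
The build-on-demand behavior could look roughly like the sketch below. The function name `ensure_library` and its arguments are hypothetical; the runner's real implementation may differ, for example in how it reports a failed build.

```python
import subprocess
from pathlib import Path


def ensure_library(lib_path, script_dir):
    """Return the shared library path, building it first if it is missing.

    Hypothetical helper: runs compile.sh from script_dir when lib_path does
    not exist, and fails loudly if the build does not produce the library.
    """
    lib = Path(lib_path)
    if not lib.exists():
        script = Path(script_dir) / "compile.sh"
        # check=True raises CalledProcessError when the build script fails.
        subprocess.run(["bash", str(script)], check=True, cwd=str(script_dir))
    if not lib.exists():
        raise FileNotFoundError(f"build did not produce {lib}")
    return lib
```

A caller would typically pass the directory that contains `run_gemm.py` as `script_dir` and `gemm_kernel.so` inside it as `lib_path`.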

## Expected Output

Correctness mode prints a `PASS` or `FAIL` line for the configured shape and, when every check passes, ends with:

```text
Result: ALL PASS
```

Benchmark mode additionally prints average latency and TFLOPS.
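
For reference, a GEMM throughput figure is commonly derived from the average latency as sketched below. This assumes the usual `2 × m × n × k` FLOP count (one multiply and one add per inner-product term) and a latency measured in milliseconds; the exact formula and units used by `run_gemm.py` are not shown here, and the function names are hypothetical.

```python
def gemm_flops(m, n, k):
    """Floating-point operations for C = A x B, counting multiply and add."""
    return 2 * m * n * k


def tflops(m, n, k, avg_latency_ms):
    """Convert an average latency in milliseconds to TFLOPS."""
    if avg_latency_ms <= 0:
        raise ValueError("latency must be positive")
    seconds = avg_latency_ms * 1e-3
    return gemm_flops(m, n, k) / seconds / 1e12


# Example with the default shape and a made-up latency of 2 ms.
print(f"{tflops(6144, 6144, 6144, 2.0):.2f} TFLOPS")
```

The 2 ms latency is a placeholder for illustration only, not a measured result.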
128 changes: 128 additions & 0 deletions kernels/python/gemm_performance/README_zh.md
@@ -0,0 +1,128 @@
# PTO-DSL GEMM 性能 Kernel

## 概览

本目录包含一个面向 A2/A3 的 PTO-DSL GEMM 性能 kernel,以及用于正确性验证和 benchmark 的 Python 运行脚本。

## 支持的 AI 处理器

- A2/A3

## 目录结构

```text
kernels/python/gemm_performance/
├── gemm_performance.py # 生成 PTO IR 的 PTO-DSL 源码
├── compile.sh # 执行 .pto -> .cpp -> .so 的构建脚本
├── caller.cpp # 共享库 launcher 包装层
├── run_gemm.py # 基于 Torch-NPU 的正确性测试和 benchmark 入口
├── gemm_performance.pto # 生成的 PTO IR(构建产物)
├── gemm_performance.cpp # 生成的 C++ kernel(构建产物)
└── gemm_kernel.so # 编译后的共享库(构建产物)
```

## 算子说明

该 kernel 计算:

```text
C = A x B
```

- `A`:`m×k`,`float16`,`ND`
- 逻辑上的 `B`:`k×n`,`float16`
- `B` 在 GM 中的存储:转置后的 `DN` 布局(`n×k`)
- `C`:`m×n`,`float32`,`ND`

Python runner 明确遵循这一定义:先构造逻辑上的 `B[k, n]`,传给 kernel 前先转置,再使用未转置的 `B` 参与 Torch 参考计算。

## 默认形状与 Tiling

- `m = 6144`
- `k = 6144`
- `n = 6144`
- `singleCoreM = 1536`
- `singleCoreK = 6144`
- `singleCoreN = 1024`
- `baseM = 128`
- `baseK = 64`
- `baseN = 256`
- `stepKa = 4`
- `stepKb = 4`
- `blockDim = 24`

## 依赖条件

在构建或运行前,需要具备:

- Ascend CANN 工具链环境
- `bisheng`
- `ptoas`
- MLIR/PTO Python 模块(`mlir.ir`、`mlir.dialects.pto`)
- `torch` 与 `torch_npu`

`compile.sh` 在 `PATH` 中找不到 `bisheng` 时,会尝试自动加载本地 Ascend 环境;同时它也支持:

- `PTOAS_ROOT`:将 PTO assembler 的二进制和库目录前置到 `PATH` / `LD_LIBRARY_PATH`
- `PTO_LIB_PATH`:覆盖 `include/` 所使用的仓库根目录

## 构建

```bash
cd ${git_clone_path}/kernels/python/gemm_performance
bash compile.sh
```

预期构建步骤:

1. 由 `gemm_performance.py` 生成 `gemm_performance.pto`
2. 将 PTO IR 汇编为 `gemm_performance.cpp`
3. 将 `caller.cpp` 编译成 `gemm_kernel.so`

## 运行

仅做正确性测试:

```bash
cd ${git_clone_path}/kernels/python/gemm_performance
python3 run_gemm.py
```

正确性测试加 benchmark:

```bash
python3 run_gemm.py --benchmark
```

benchmark 并附带同 shape 的 Torch-NPU baseline:

```bash
python3 run_gemm.py --torch-npu
```

该 baseline 使用单次 `torch.matmul`,输入为 fp16,输出也为 fp16。

指定共享库路径:

```bash
python3 run_gemm.py --lib ./gemm_kernel.so
```

设备可通过以下环境变量指定:

- `PTODSL_TEST_DEVICE_ID`
- `TASK_DEVICE`

如果两者都未设置,runner 默认使用 `npu:0`。

如果 `gemm_kernel.so` 不存在,`run_gemm.py` 会自动调用 `compile.sh`。

## 预期输出

正确性模式会打印当前 shape 的 `PASS` / `FAIL`,并最终输出:

```text
Result: ALL PASS
```

开启 benchmark 后,还会额外打印平均时延和 TFLOPS。
62 changes: 62 additions & 0 deletions kernels/python/gemm_performance/compile.sh
@@ -0,0 +1,62 @@
#!/bin/bash
set -e

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
cd "${SCRIPT_DIR}"

# Load a local Ascend environment when the compiler is not already on PATH.
if ! command -v bisheng >/dev/null 2>&1; then
    if [[ -f /usr/local/Ascend/ascend-toolkit/set_env.sh ]]; then
        # shellcheck disable=SC1091
        source /usr/local/Ascend/ascend-toolkit/set_env.sh || true
    elif [[ -f /usr/local/Ascend/cann-8.5.0/set_env.sh ]]; then
        # shellcheck disable=SC1091
        source /usr/local/Ascend/cann-8.5.0/set_env.sh || true
    fi
fi

if [[ -n "${PTOAS_ROOT:-}" ]]; then
    export PATH="${PTOAS_ROOT}/bin:${PTOAS_ROOT}:${PATH}"
    export LD_LIBRARY_PATH="${PTOAS_ROOT}/lib:${LD_LIBRARY_PATH:-}"
fi

if ! command -v bisheng >/dev/null 2>&1; then
    echo "[ERROR] Missing executable: bisheng"
    echo "Source Ascend CANN environment or add bisheng to PATH before running compile.sh."
    exit 1
fi

# Remove previous build artifacts.
rm -f gemm_performance.pto gemm_performance.cpp gemm_kernel.so

if ! python3 -c "import mlir.ir; from mlir.dialects import pto" >/dev/null 2>&1; then
    echo "[ERROR] Missing Python MLIR/PTO modules."
    echo "Ensure the MLIR Python environment is available before running compile.sh."
    exit 1
fi

if ! command -v ptoas >/dev/null 2>&1; then
    echo "[ERROR] Missing executable: ptoas"
    echo "Ensure ptoas is on PATH before running compile.sh."
    exit 1
fi

# Step 1: Generate PTO IR from Python DSL
echo "[1/3] Generating IR..."
python3 "${SCRIPT_DIR}/gemm_performance.py" > gemm_performance.pto

# Step 2: Assemble PTO IR to C++
echo "[2/3] Assembling with ptoas..."
ptoas --enable-insert-sync gemm_performance.pto -o gemm_performance.cpp

# Step 3: Compile to shared library
echo "[3/3] Compiling with bisheng..."
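# Default to the repo root (three levels up) unless PTO_LIB_PATH is already set.
PTO_LIB_PATH=${PTO_LIB_PATH:-$(cd "${SCRIPT_DIR}/../../.." && pwd)}

# Quote the include path so a repo root containing spaces does not split the argument.
bisheng -fPIC -shared -xcce -O2 -std=c++17 \
    --npu-arch=dav-2201 \
    -I"${PTO_LIB_PATH}/include" \
    "${SCRIPT_DIR}/caller.cpp" \
    -o "${SCRIPT_DIR}/gemm_kernel.so"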

echo "Done! Run: python3 run_gemm.py"