Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
b269e2b
feat: init
fiskrt Mar 24, 2026
e821d3d
feat: up isa version, need etc.. for mix-kernels
fiskrt Mar 24, 2026
a83b06d
feat: add simpler cpp case
fiskrt Mar 26, 2026
0ad3cee
feat: rename
fiskrt Mar 26, 2026
4e03435
feat: add mlir example[ WIP ]
fiskrt Mar 26, 2026
da15c6d
feat: add gitignore
fiskrt Mar 26, 2026
041625f
feat: simple bidirectional transfer working in mlir
fiskrt Mar 27, 2026
5137291
feat: now does simple add
fiskrt Mar 27, 2026
6210864
feat: clean working version simple
fiskrt Mar 27, 2026
f9d8812
feat: clean working version simple
fiskrt Mar 27, 2026
d2bf0ca
wip: add transfer ops to dsl
fiskrt Mar 27, 2026
6055f69
feat: docker add compiled cpp and bindings
fiskrt Mar 27, 2026
8ea9459
feat: use classes instead
fiskrt Mar 27, 2026
3c7dbd9
WIP: add builder with multiple funcs
fiskrt Apr 7, 2026
834ca8f
feat: add type arg to const() api
fiskrt Apr 7, 2026
422f5f2
WIP: in decorated function we allow multiple functions
fiskrt Apr 7, 2026
2362f7e
WIP: simplify ir.py
fiskrt Apr 7, 2026
c7b31f4
use new ptodsl api for builder
fiskrt Apr 7, 2026
0597c0a
feat: remove files
fiskrt Apr 7, 2026
57e30c0
test: add old and new
fiskrt Apr 7, 2026
7182e8a
feat: remove docs
fiskrt Apr 7, 2026
48b1eb3
fix: arith import in builder
fiskrt Apr 7, 2026
0c865a1
test: compare to MLIR pybindings
fiskrt Apr 7, 2026
ea019ed
fix: names
fiskrt Apr 8, 2026
f928163
fix: naming
fiskrt Apr 8, 2026
8ae8635
feat: deuglify the wrappers
fiskrt Apr 8, 2026
62db364
feat: add more examples v2c, c2v,
fiskrt Apr 8, 2026
1d83124
feat: add ffts address (needed for bidir comm)
fiskrt Apr 9, 2026
e32afba
feat: unmangle kernel name
fiskrt Apr 9, 2026
8029599
feat: add ffts functionality to api
fiskrt Apr 9, 2026
10bfb1b
feat: add bidir example
fiskrt Apr 9, 2026
56527f4
chore: docker ptoas ver and pto-isa
fiskrt Apr 9, 2026
9d6c1e0
Merge remote-tracking branch 'origin/main' into feat-mix-kernel-gb
fiskrt Apr 9, 2026
d40da05
chore: black
fiskrt Apr 9, 2026
7c2a4a0
chore: black
fiskrt Apr 9, 2026
2779e56
feat: gitignore
fiskrt Apr 9, 2026
49eae78
feat: move files and cleanup
fiskrt Apr 9, 2026
8ca2c8f
test: add ptoas test
fiskrt Apr 9, 2026
d048958
chore: update pto-isa version in ci
fiskrt Apr 13, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ jobs:
RELEASE_VER: 0.24
RELEASE_TAG: v0.24
CLI_DIR: /installers/ptoas-cli
PTOISA_COMMIT: 2ee948ef636863ed149f176d5327d9db5f349bb6
PTOISA_COMMIT: a8c3fbf42a2f4a0f609f64e138dda62deefddb8e

steps:
- name: Install system packages
Expand Down
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -6,3 +6,5 @@ __pycache__
extra-info

*.ptodsl_jit

msprof_res/
4 changes: 2 additions & 2 deletions docker/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,8 @@ RUN pip install --no-cache-dir \
ipython jupyterlab matplotlib pandas

# certain operations need latest isa header, not CANN 8.5.0 default
# header on 2026/04/01
ARG PTOISA_COMMIT=2ee948ef636863ed149f176d5327d9db5f349bb6
# header on 2026/04/08
ARG PTOISA_COMMIT=a8c3fbf42a2f4a0f609f64e138dda62deefddb8e
Comment thread
fiskrt marked this conversation as resolved.
WORKDIR /sources
RUN git clone https://gitcode.com/cann/pto-isa.git \
&& cd pto-isa && git checkout $PTOISA_COMMIT
Expand Down
1 change: 1 addition & 0 deletions examples/aot/tpushpop/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
build_artifacts/
80 changes: 80 additions & 0 deletions examples/aot/tpushpop/mix-kernel_mlir/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
# TPush / TPop mixed-kernel examples

Small examples of tile FIFO communication between Cube (`AIC`) and Vector (`AIV`).

```bash
python run.py c2v
python run.py v2c
python run.py bidi
```

`python run.py` defaults to `c2v`.

Files:

- `kernels/` has the Python builders.
- `build_artifacts/` gets generated MLIR, generated C++, and the `.so`.
- `gm_slot_buffer` is the GM backing store for the pipe.
- `caller.cpp` sets the FFTS base before launching the generated kernel.

Core idea:

- `aic_initialize_pipe` / `aiv_initialize_pipe` lower to matching `TPipe<...>` objects.
- `gm_slot_buffer` is the shared GM slot memory used by that `TPipe`.
- `tpush_to_aiv` / `tpush_to_aic` lower to `TPUSH(pipe, tile)`.
- `tpop_from_aic` / `tpop_from_aiv` lower to `TPOP(pipe, tile)`.
- `tfree_from_aic` / `tfree_from_aiv` lower to `TFREE(pipe)` and release the consumed slot.

## C2V

Cube sends. Vector receives.

This example computes `X @ X` on Cube, sends the accumulator tile to Vector, then Vector stores it to GM.

```text
Cube: load X -> matmul -> tpush_to_aiv
Vector: tpop_from_aic -> store Y -> tfree_from_aic
```

Pipe wiring:

- Vector owns the consumer buffer: `reserve_buffer("c2v_fifo", location="VEC")`
- Cube imports it: `import_reserved_buffer("c2v_fifo", peer_func="@vector_kernel")`
- Both sides initialize with `dir_mask = 1`

## V2C

Vector sends. Cube receives.

This example loads `X` on Vector, sends that tile to Cube, then Cube stores it to GM.

```text
Vector: load X -> tpush_to_aic
Cube: tpop_from_aiv -> store Y -> tfree_from_aiv
```

Pipe wiring:

- Cube owns the consumer buffer: `reserve_buffer("v2c_fifo", location="MAT")`
- Vector imports it: `import_reserved_buffer("v2c_fifo", peer_func="@cube_kernel")`
- Both sides initialize with `dir_mask = 2`

## BIDI

Both directions are enabled.

This example sends `X @ X` from Cube to Vector. Vector doubles it and sends it back. Cube receives the returned tile and stores it to GM.

```text
Cube: matmul -> tpush_to_aiv
Vector: tpop_from_aic -> add -> tpush_to_aic -> tfree_from_aic
Cube: tpop_from_aiv -> store Y -> tfree_from_aiv
```

Pipe wiring:

- Vector reserves `c2v_fifo`; Cube imports it
- Cube reserves `v2c_fifo`; Vector imports it
- Both sides initialize with `dir_mask = 3`

For `dir_mask = 3`, allocate FIFO backing for both directions. `run.py` uses `8 KiB`.
89 changes: 89 additions & 0 deletions examples/aot/tpushpop/mix-kernel_mlir/c2v.mlir
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
// Bidirectional pipe example.
//
// This reduced version only uses the C2V pipe:
// - `c2v_fifo`: cube/kernel `@cube_kernel` pushes to vector/kernel `@vector_kernel`
//
// `gm_slot_buffer` is the GM-backed slot storage for these pipes. The reserve/import
// ops connect each side of the same named FIFO, and `aic/aiv_initialize_pipe`
// binds those FIFO endpoints to the shared GM slot buffer plus each side's local
// consumer buffer.
//
// End-to-end data flow:
// - Cube loads one input matrix `X` from GM.
// - Cube computes `Y = X @ X`.
// - Cube sends that accumulator tile to vector over `c2v_fifo`.
// - Vector pops the tile and stores it to GM as output matrix `Y`.
//
// What is transferred:
// - Cube -> Vector: one full `16 x 16` `f32` accumulator tile `Y = X @ X`
// sent with `pto.tpush_to_aiv` using `split = 0` (no split). Vector receives
// that same logical `16 x 16` tile with `pto.tpop_from_aic` in a vector tile
// type/layout, then stores it to the GM output buffer.
//
// Shape summary:
// - All transferred tiles are `rows=16, cols=16, dtype=f32`
// - Cube-produced C2V tile: `loc=acc`, `blayout=col_major`, `slayout=row_major`
// - Vector-consumed tile after C2V pop: `loc=vec`, `blayout=row_major`, `slayout=none_box`
module {

func.func @call_both(%gm_slot_buffer: !pto.ptr<f32>, %gm_x: !pto.ptr<f32>, %gm_y: !pto.ptr<f32>) attributes {pto.entry} {
func.call @cube_kernel(%gm_slot_buffer, %gm_x) : (!pto.ptr<f32>, !pto.ptr<f32>) -> ()
func.call @vector_kernel(%gm_slot_buffer, %gm_y) : (!pto.ptr<f32>, !pto.ptr<f32>) -> ()
return
}

func.func @cube_kernel(%gm_slot_buffer: !pto.ptr<f32>, %gm_x: !pto.ptr<f32>) attributes {pto.kernel_kind = #pto.kernel_kind<cube>} {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c16 = arith.constant 16 : index
%c2v_import = pto.import_reserved_buffer {
name = "c2v_fifo",
peer_func = @vector_kernel
} -> i32
%c0_i32 = arith.constant 0 : i32
pto.aic_initialize_pipe {dir_mask = 1, slot_size = 1024}
(gm_slot_buffer = %gm_slot_buffer : !pto.ptr<f32>,
c2v_consumer_buf = %c2v_import : i32,
v2c_consumer_buf = %c0_i32 : i32)

%x_mat_tile = pto.alloc_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>
%x_left_tile = pto.alloc_tile : !pto.tile_buf<loc=left, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>
%x_right_tile = pto.alloc_tile : !pto.tile_buf<loc=right, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=col_major, fractal=512, pad=0>
%acc_tile = pto.alloc_tile : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=1024, pad=0>
%gm_x_view = pto.make_tensor_view %gm_x, shape = [%c16, %c16], strides = [%c16, %c1] : !pto.tensor_view<?x?xf32>
%gm_x_tile_view = pto.partition_view %gm_x_view, offsets = [%c0, %c0], sizes = [%c16, %c16] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<16x16xf32>
pto.tload ins(%gm_x_tile_view : !pto.partition_tensor_view<16x16xf32>) outs(%x_mat_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>)
pto.tmov ins(%x_mat_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>) outs(%x_left_tile : !pto.tile_buf<loc=left, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>)
pto.tmov ins(%x_mat_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>) outs(%x_right_tile : !pto.tile_buf<loc=right, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=col_major, fractal=512, pad=0>)
pto.tmatmul ins(%x_left_tile, %x_right_tile : !pto.tile_buf<loc=left, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>, !pto.tile_buf<loc=right, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=col_major, fractal=512, pad=0>) outs(%acc_tile : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=1024, pad=0>)
pto.tpush_to_aiv(%acc_tile : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=1024, pad=0>) {split = 0}
return
}

func.func @vector_kernel(%gm_slot_buffer: !pto.ptr<f32>, %gm_y: !pto.ptr<f32>)
attributes {pto.kernel_kind = #pto.kernel_kind<vector>} {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c16 = arith.constant 16 : index
%c2v_local = pto.reserve_buffer {
name = "c2v_fifo",
size = 4096,
location = #pto.address_space<vec>,
auto = true
} -> i32
%c0_i32 = arith.constant 0 : i32
pto.aiv_initialize_pipe {dir_mask = 1, slot_size = 1024}
(gm_slot_buffer = %gm_slot_buffer : !pto.ptr<f32>,
c2v_consumer_buf = %c2v_local : i32,
v2c_consumer_buf = %c0_i32 : i32)

%gm_y_view = pto.make_tensor_view %gm_y, shape = [%c16, %c16], strides = [%c16, %c1] : !pto.tensor_view<?x?xf32>
%gm_y_tile_view = pto.partition_view %gm_y_view, offsets = [%c0, %c0], sizes = [%c16, %c16] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<16x16xf32>
%recv_tile = pto.tpop_from_aic {split = 0}
-> !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>
pto.tstore ins(%recv_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(%gm_y_tile_view : !pto.partition_tensor_view<16x16xf32>)
pto.tfree_from_aic {split = 0}
return
}

}
28 changes: 28 additions & 0 deletions examples/aot/tpushpop/mix-kernel_mlir/caller.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
#ifndef KERNEL_CPP
#error "KERNEL_CPP must be defined at compile time."
#endif

#include <cstdint>

extern "C" int rtGetC2cCtrlAddr(uint64_t *ctrlAddr, uint32_t *ctrlLen);

#include KERNEL_CPP

extern "C" void call_kernel(
uint32_t blockDim,
void *stream,
uint8_t *gmSlotBuffer,
uint8_t *x,
uint8_t *y)
{
void *fftsAddr = nullptr;
uint32_t fftsLen = 0;
(void)rtGetC2cCtrlAddr(reinterpret_cast<uint64_t *>(&fftsAddr), &fftsLen);
(void)fftsLen;

call_both<<<blockDim, nullptr, stream>>>(
(__gm__ int64_t *)fftsAddr,
(__gm__ float *)gmSlotBuffer,
(__gm__ float *)x,
(__gm__ float *)y);
}
45 changes: 45 additions & 0 deletions examples/aot/tpushpop/mix-kernel_mlir/compile.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ARTIFACT_DIR="${SCRIPT_DIR}/build_artifacts"
MODE="${TPUSHPOP_MODE:-c2v}"
BUILDER_PATH="${SCRIPT_DIR}/kernels/${MODE}_builder.py"
MLIR_GEN_PATH="${ARTIFACT_DIR}/${MODE}_gen.mlir"
GENERATED_CPP="${ARTIFACT_DIR}/${MODE}.cpp"
LIB_PATH="${ARTIFACT_DIR}/tpushpop_mlir_lib.so"

case "${MODE}" in
c2v|c2v_add|v2c|bidi) ;;
*)
echo "Unknown TPUSHPOP_MODE: ${MODE}" >&2
exit 2
;;
esac

mkdir -p "${ARTIFACT_DIR}"
rm -f "${GENERATED_CPP}" "${LIB_PATH}"

python "${BUILDER_PATH}" > "${MLIR_GEN_PATH}"
ptoas --pto-arch=a3 --enable-insert-sync "${MLIR_GEN_PATH}" > "${GENERATED_CPP}"
# add extern "C" to function so kernel name is not mangled
perl -0pi -e 's/\b__global__ AICORE void call_both\(/extern "C" __global__ AICORE void call_both(/' "${GENERATED_CPP}"

bisheng \
-I/sources/pto-isa/include/ \
-fPIC -shared -D_FORTIFY_SOURCE=2 -O2 -std=c++17 -g \
-Wno-macro-redefined -Wno-ignored-attributes -fstack-protector-strong \
-xcce -Xhost-start -Xhost-end \
-mllvm -cce-aicore-stack-size=0x8000 \
-mllvm -cce-aicore-function-stack-size=0x8000 \
-mllvm -cce-aicore-record-overflow=true \
-mllvm -cce-aicore-addr-transform \
-mllvm -cce-aicore-dcci-insert-for-scalar=false \
--npu-arch=dav-2201 -DMEMORY_BASE \
-std=gnu++17 \
-DKERNEL_CPP="\"${GENERATED_CPP}\"" \
"${SCRIPT_DIR}/caller.cpp" \
-o "${LIB_PATH}"

echo "Generated ${GENERATED_CPP}."
echo "Built ${LIB_PATH}."
Loading
Loading