feat: Cross-core comm with `TPUSH`/`TPOP` by fiskrt · Pull Request #98 · huawei-csl/pto-dsl

fiskrt · 2026-03-24T16:41:52Z

Cross-core comm with `TPUSH`/`TPOP`

Three example implementations of cross-core transfer:

v2c (vector to cube)
c2v (cube to vector)
bidi (bidirectional)

TODO:

Currently the pto dialect only supports cross-core transfers when the cube and vector code live in separate func.funcs (see the MLIR examples below). So far we have been using pto.section.vector/cube, so how can we unify this in the DSL?
Make a Python example using the MLIR bindings and the pto_general.py wrappers, rather than pure MLIR.

NOTE:

We don't just wrap the pto-isa TPUSH/TPOP/TFREE instructions; instead we wrap the higher-level ops in the pto dialect, such as _pto.TPushToAivOp.

A simple matmul -> transfer result to aiv -> store to GM from aiv:

We pass the kernel x, y, and fifo, where x is the 16x16 fp32 input tensor, y has the same shape and receives the result, and fifo is the GM buffer used for cross-core communication (%gm_slot_buffer in the MLIR below).

The steps are:

Using x in GM, do a regular matmul so that r1 = x @ x ends up in an Acc tile, which is then transferred to the aiv using pto.push_to_aiv (through the GM buffer fifo, which was allocated before the program started).

When TPUSH is applied to a TileType::Acc tile that resides in L0C, it uses FIXP to write the tile back to fifo in GM.

The TPOP instruction waits for the other core using WAIT_FLAG_DEVI and also loads the data from the fifo buffer into UB memory.
Finally, a regular pto.tstore is executed on the aiv, which writes r1 to GM in the y slot.
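
The push / pop / free handshake in the steps above can be mimicked on the host with a toy model. This is only a sketch in plain Python using threads and semaphores; it does not reflect the pto dialect or the hardware implementation. `SlotFifo`, its method names, `run_pipeline`, and the four-slot depth are hypothetical and chosen purely for illustration.

```python
import threading


class SlotFifo:
    """Toy ring buffer with a fixed number of slots (hypothetical model)."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.free = threading.Semaphore(num_slots)  # slots the producer may fill
        self.ready = threading.Semaphore(0)         # slots the consumer may read
        self.head = 0  # next slot to write
        self.tail = 0  # next slot to read

    def push(self, tile):
        """Producer side (plays the role of TPUSH): wait for a free slot, fill it."""
        self.free.acquire()
        self.slots[self.head] = tile
        self.head = (self.head + 1) % len(self.slots)
        self.ready.release()

    def pop(self):
        """Consumer side (plays the role of TPOP): wait for data, return the tile."""
        self.ready.acquire()
        return self.slots[self.tail]

    def free_slot(self):
        """Consumer side (plays the role of TFREE): hand the slot back."""
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.free.release()


def matmul(a, b):
    """Plain nested-list matrix multiply, standing in for the cube matmul."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def cube_kernel(fifo, x):
    fifo.push(matmul(x, x))          # r1 = x @ x, then push to the vector side


def vector_kernel(fifo, y):
    tile = fifo.pop()                # blocks until the cube side has pushed
    for i, row in enumerate(tile):   # "store" the tile into y
        y[i][:] = row
    fifo.free_slot()                 # release the slot for reuse


def run_pipeline(x):
    n = len(x)
    y = [[0] * n for _ in range(n)]
    fifo = SlotFifo(num_slots=4)
    consumer = threading.Thread(target=vector_kernel, args=(fifo, y))
    producer = threading.Thread(target=cube_kernel, args=(fifo, x))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return y


# 16x16 input: 2 * identity, so x @ x is 4 * identity.
x = [[2 if i == j else 0 for j in range(16)] for i in range(16)]
y = run_pipeline(x)
```

The consumer thread is started first on purpose: it blocks in `pop` until the producer has pushed, which is the ordering guarantee the real TPOP provides by waiting on the other core.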

caller function

  func.func @call_both(%gm_slot_buffer: !pto.ptr<f32>, %gm_x: !pto.ptr<f32>, %gm_y: !pto.ptr<f32>) attributes {pto.entry} {
    func.call @cube_kernel(%gm_slot_buffer, %gm_x) : (!pto.ptr<f32>, !pto.ptr<f32>) -> ()
    func.call @vector_kernel(%gm_slot_buffer, %gm_y) : (!pto.ptr<f32>, !pto.ptr<f32>) -> ()
    return
  }

cube kernel

  func.func @cube_kernel(%gm_slot_buffer: !pto.ptr<f32>, %gm_x: !pto.ptr<f32>) attributes {pto.kernel_kind = #pto.kernel_kind<cube>} {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c16 = arith.constant 16 : index
    %c2v_import = pto.import_reserved_buffer {
      name = "c2v_fifo",
      peer_func = @vector_kernel
    } -> i32
    %c0_i32 = arith.constant 0 : i32
    pto.aic_initialize_pipe {dir_mask = 1, slot_size = 1024}
      (gm_slot_buffer = %gm_slot_buffer : !pto.ptr<f32>,
       c2v_consumer_buf = %c2v_import : i32,
       v2c_consumer_buf = %c0_i32 : i32)

    %x_mat_tile = pto.alloc_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>
    %x_left_tile = pto.alloc_tile : !pto.tile_buf<loc=left, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>
    %x_right_tile = pto.alloc_tile : !pto.tile_buf<loc=right, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=col_major, fractal=512, pad=0>
    %acc_tile = pto.alloc_tile : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=1024, pad=0>
    %gm_x_view = pto.make_tensor_view %gm_x, shape = [%c16, %c16], strides = [%c16, %c1] : !pto.tensor_view<?x?xf32>
    %gm_x_tile_view = pto.partition_view %gm_x_view, offsets = [%c0, %c0], sizes = [%c16, %c16] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<16x16xf32>
    pto.tload ins(%gm_x_tile_view : !pto.partition_tensor_view<16x16xf32>) outs(%x_mat_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>)
    pto.tmov ins(%x_mat_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>) outs(%x_left_tile : !pto.tile_buf<loc=left, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>)
    pto.tmov ins(%x_mat_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>) outs(%x_right_tile : !pto.tile_buf<loc=right, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=col_major, fractal=512, pad=0>)
    pto.tmatmul ins(%x_left_tile, %x_right_tile : !pto.tile_buf<loc=left, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>, !pto.tile_buf<loc=right, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=col_major, fractal=512, pad=0>) outs(%acc_tile : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=1024, pad=0>)
    pto.tpush_to_aiv(%acc_tile : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=1024, pad=0>) {split = 0}
    return
  }

vector kernel

  func.func @vector_kernel(%gm_slot_buffer: !pto.ptr<f32>, %gm_y: !pto.ptr<f32>)
      attributes {pto.kernel_kind = #pto.kernel_kind<vector>} {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c16 = arith.constant 16 : index
    %c2v_local = pto.reserve_buffer {
      name = "c2v_fifo",
      size = 4096,
      location = #pto.address_space<vec>,
      auto = true
    } -> i32
    %c0_i32 = arith.constant 0 : i32
    pto.aiv_initialize_pipe {dir_mask = 1, slot_size = 1024}
      (gm_slot_buffer = %gm_slot_buffer : !pto.ptr<f32>,
       c2v_consumer_buf = %c2v_local : i32,
       v2c_consumer_buf = %c0_i32 : i32)

    %gm_y_view = pto.make_tensor_view %gm_y, shape = [%c16, %c16], strides = [%c16, %c1] : !pto.tensor_view<?x?xf32>
    %gm_y_tile_view = pto.partition_view %gm_y_view, offsets = [%c0, %c0], sizes = [%c16, %c16] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<16x16xf32>
    %recv_tile = pto.tpop_from_aic {split = 0}
      -> !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>
    pto.tstore ins(%recv_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(%gm_y_tile_view : !pto.partition_tensor_view<16x16xf32>)
    pto.tfree_from_aic {split = 0}
    return
  }
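
The sizes in the two kernels appear consistent with one slot holding exactly one 16x16 f32 tile. The arithmetic below is a sketch that assumes both slot_size and the reserve_buffer size are byte counts, which the PR does not state explicitly; `tile_bytes` is a hypothetical helper.

```python
def tile_bytes(rows: int, cols: int, elem_bytes: int) -> int:
    """Bytes needed for a dense rows x cols tile."""
    return rows * cols * elem_bytes


F32_BYTES = 4

# One 16x16 f32 tile, compared with slot_size = 1024 in the initialize_pipe ops.
slot_size = tile_bytes(16, 16, F32_BYTES)

# size = 4096 in pto.reserve_buffer, assumed here to be in bytes as well.
reserved_bytes = 4096
num_slots = reserved_bytes // slot_size

print(slot_size, num_slots)
```

Under that assumption the reserved consumer buffer would hold four tiles.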

NOTE:

We should probably not let both vector cores write in this example, since they both write to the same location. Options: do the store only for vid == 0, or use split=1 and let each aiv write half.
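
A minimal sketch of the second option, assuming that "half" means a contiguous block of rows per vector core. How split=1 actually divides the tile is not stated in this PR, so `rows_for_core` and its row-wise partition are hypothetical.

```python
def rows_for_core(vid: int, rows: int = 16, num_cores: int = 2) -> range:
    """Rows that vector core `vid` would store, under a row-wise split."""
    if not 0 <= vid < num_cores:
        raise ValueError(f"vid must be in [0, {num_cores}), got {vid}")
    per_core = rows // num_cores
    start = vid * per_core
    # The last core also takes any remainder rows.
    stop = rows if vid == num_cores - 1 else start + per_core
    return range(start, stop)


core0_rows = rows_for_core(0)
core1_rows = rows_for_core(1)
```

With the defaults, the two ranges are disjoint and together cover all 16 rows, so neither core overwrites the other's output.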

API update

Existing code is unaffected; this adds the possibility of having multiple func.funcs in a module.

Currently for a single function we can do:

@to_ir_module(meta_data=meta_data)
def kernel(arg0: "ptr_ty") -> None:
    ...

to_ir_module(...) creates the module and function.

Multi function support

To support mixed kernels, where we have one entrypoint function plus separate vector/cube functions, we add a module=True option: the outer decorated function then defines the module, while the inner functions become the MLIR functions.

@to_ir_module(meta_data=meta_data, module=True)
def module():
    @pto.func(...)
    def f1(...): ...

    @pto.func(entry=True)
    def f2(...): ...

learning-chip · 2026-04-07T12:41:03Z

Will merge this after #102 so we have a newer ptoas version.

learning-chip · 2026-04-13T07:28:54Z

Currently the pto dialect only supports cross core transfers when cube/vector are separate func.funcs, see mlir examples below. So far we have been using pto.section.vector/cube so how can we unify this in the dsl?

Does func.func(kind=...) without pto.section.vector/cube also generate proper #ifdef guards so that the generated cpp source can be compiled by bisheng directly?
For pure vec/cube functions, can we also use func.func(kind=...) without pto.section.vector/cube? If so maybe we can sunset with pto.section to save one indentation.

fiskrt · 2026-04-13T09:23:21Z

Does func.func(kind=...) without pto.section.vector/cube also generate proper #ifdef guards so that the generated cpp source can be compiled by bisheng directly?

Yes, ptoas generates the guards from each function's pto.kernel_kind attribute (e.g. pto.kernel_kind = #pto.kernel_kind<vector>).

For pure vec/cube functions, can we also use func.func(kind=...) without pto.section.vector/cube? If so maybe we can sunset with pto.section to save one indentation.

We could do that in a future PR. It should be a simple change, but we would still have one nesting level because of the outer module.

@to_ir_module(meta_data=meta_data, module=True)
def module():
    @pto.func(kernel="cube")
    def cube_kernel():
        ...

But we should probably make the module implicit in the future since it's always needed.
But then we might have to pass in meta_data to each function separately...

MirkoDeVita98

lgtm

fiskrt added 11 commits March 24, 2026 15:58

feat: init

b269e2b

feat: up isa version, need etc.. for mix-kernels

e821d3d

feat: add simpler cpp case

a83b06d

feat: rename

0ad3cee

feat: add mlir example[ WIP ]

4e03435

feat: add gitignore

da15c6d

feat: simple bidirectional transfer working in mlir

041625f

feat: now does simple add

5137291

feat: clean working version simple

6210864

feat: clean working version simple

f9d8812

wip: add transfer ops to dsl

d2bf0ca

fiskrt changed the title ~~Feat mix kernel gb~~ feat: Cross-core comm with TPUSH/TPOP Mar 27, 2026

fiskrt added 11 commits March 27, 2026 16:08

feat: docker add compiled cpp and bindings

6055f69

feat: use classes instead

8ea9459

WIP: add builder with multiple funcs

3c7dbd9

feat: add type arg to const() api

834ca8f

WIP: in decorated function we allow multiple functions

422f5f2

WIP: simplify ir.py

2362f7e

use new ptodsl api for builder

c7b31f4

feat: remove files

0597c0a

test: add old and new

57e30c0

feat: remove docs

7182e8a

fix: arith import in builder

48b1eb3

learning-chip reviewed Apr 7, 2026

Comment thread tests/frontend/test_multifunc_ir.py Outdated

fiskrt added 5 commits April 7, 2026 14:22

test: compare to MLIR pybindings

0c865a1

fix: names

ea019ed

fix: naming

f928163

feat: deuglify the wrappers

8ae8635

feat: add more examples v2c, c2v,

62db364

fiskrt added 4 commits April 9, 2026 08:18

feat: add ffts address (needed for bidir comm)

1d83124

feat: unmangle kernel name

e32afba

feat: add ffts functionality to api

8029599

feat: add bidir example

10bfb1b

fiskrt marked this pull request as ready for review April 9, 2026 08:41

fiskrt added 7 commits April 9, 2026 09:09

chore: docker ptoas ver and pto-isa

56527f4

Merge remote-tracking branch 'origin/main' into feat-mix-kernel-gb

9d6c1e0

chore: black

d40da05

chore: black

7c2a4a0

feat: gitignore

2779e56

feat: move files and cleanup

49eae78

test: add ptoas test

8ca2c8f

learning-chip reviewed Apr 13, 2026

Comment thread tests/frontend/test_multifunc_ir.py

learning-chip reviewed Apr 13, 2026

Comment thread examples/aot/tpushpop/mix-kernel_mlir/kernels/v2c_builder.py

learning-chip reviewed Apr 13, 2026

Comment thread docker/Dockerfile

fiskrt mentioned this pull request Apr 13, 2026

Manual on-device CI logs #47

Open

chore: update pto-isa version in ci

d048958

learning-chip requested a review from MirkoDeVita98 April 13, 2026 13:49

MirkoDeVita98 approved these changes Apr 13, 2026

MirkoDeVita98 merged commit 99270b4 into huawei-csl:main Apr 13, 2026
5 checks passed

fiskrt mentioned this pull request Apr 13, 2026

feat: remove entry=True arg to decorator pto.func #110

Merged

learning-chip mentioned this pull request Apr 13, 2026

Frontend: auto-discover and compile-test DSL kernels with ptoas & bisheng #108

Merged

learning-chip mentioned this pull request May 11, 2026

Port single-core scan from pto-kernels #113

Merged

feat: Cross-core comm with `TPUSH`/`TPOP` #98
MirkoDeVita98 merged 39 commits into huawei-csl:main from fiskrt:feat-mix-kernel-gb

3 participants
