feat: Cross-core comm with TPUSH/TPOP #98

Merged
MirkoDeVita98 merged 39 commits into huawei-csl:main from fiskrt:feat-mix-kernel-gb
Apr 13, 2026
@fiskrt
Collaborator

@fiskrt fiskrt commented Mar 24, 2026

Cross-core comm with TPUSH/TPOP

Three example implementations of cross-core transfers:

  • v2c (vector to cube)
  • c2v (cube to vector)
  • bidi (bidirectional)

TODO:

  • Currently the pto dialect only supports cross-core transfers when cube/vector are separate func.funcs; see the MLIR examples below. So far we have been using pto.section.vector/cube, so how can we unify this in the DSL?
  • Make a Python example using the MLIR bindings and the pto_general.py wrappers, rather than pure MLIR.

NOTE:

  • We don't just wrap the pto-isa TPUSH/TPOP/TFREE instructions, but rather the higher-level ops in the pto dialect, such as _pto.TPushToAivOp.

A simple matmul -> transfer result to aiv -> store to GM from aiv:

We pass the kernel x, y, and fifo, where x is the 16x16 fp32 input tensor, y is a tensor of the same shape where the result will be stored, and fifo is the GM buffer used for cross-core communication.

The steps are:

  1. Using x in GM, do a regular matmul so that r1=x@x ends up in an Acc tile, which can be transferred to the aiv using pto.tpush_to_aiv (through the GM buffer fifo that we allocated before program start). When TPUSH is used on a TileType::Acc that resides in L0C, it uses FIXP to write back to fifo in GM.

  2. The TPOP instruction handles waiting for the other core using WAIT_FLAG_DEVI, as well as loading from the fifo buffer into UB memory.

  3. Finally, the regular pto.tstore is executed from the aiv, which writes r1 into GM in the y slot.
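The three steps above can be modelled in plain Python. This is only a toy sketch of the push/pop/free handshake, not the pto API: SlotFifo, cube_kernel, vector_kernel and matmul are hypothetical stand-ins, and where real hardware would wait on a flag the toy simply raises an error.

```python
# Toy model of the cube -> vector handoff. NOT the pto API; all names are illustrative.
from collections import deque

N = 16  # tile is N x N, as in the example


def matmul(a, b):
    """Plain-Python matrix product, standing in for pto.tmatmul."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


class SlotFifo:
    """Hypothetical stand-in for the GM slot buffer shared by both cores."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.ready = deque()                  # slots filled by the producer
        self.free = deque(range(num_slots))   # slots the producer may fill

    def push(self, tile):
        # Producer side (cube): claim a free slot and publish the tile.
        if not self.free:
            raise RuntimeError("no free slot; consumer has not freed one yet")
        slot = self.free.popleft()
        self.slots[slot] = tile
        self.ready.append(slot)

    def pop(self):
        # Consumer side (vector): take the oldest published slot.
        if not self.ready:
            raise RuntimeError("nothing to pop; real hardware would wait here")
        slot = self.ready.popleft()
        return slot, self.slots[slot]

    def release(self, slot):
        # Consumer side: hand the slot back so the producer can reuse it.
        self.slots[slot] = None
        self.free.append(slot)


def cube_kernel(fifo, x):
    fifo.push(matmul(x, x))      # step 1: r1 = x @ x, then push


def vector_kernel(fifo, y):
    slot, tile = fifo.pop()      # step 2: receive the tile
    for i in range(N):           # step 3: store the tile into y
        y[i][:] = tile[i]
    fifo.release(slot)           # free the slot, mirroring tfree_from_aic


x = [[2.0 if i == j else 0.0 for j in range(N)] for i in range(N)]  # 2 * identity
y = [[0.0] * N for _ in range(N)]
fifo = SlotFifo(num_slots=1)
cube_kernel(fifo, x)
vector_kernel(fifo, y)
```

With x set to twice the identity, y ends up as four times the identity, and the single slot is back on the free list once the consumer releases it.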

caller function
  func.func @call_both(%gm_slot_buffer: !pto.ptr<f32>, %gm_x: !pto.ptr<f32>, %gm_y: !pto.ptr<f32>) attributes {pto.entry} {
    func.call @cube_kernel(%gm_slot_buffer, %gm_x) : (!pto.ptr<f32>, !pto.ptr<f32>) -> ()
    func.call @vector_kernel(%gm_slot_buffer, %gm_y) : (!pto.ptr<f32>, !pto.ptr<f32>) -> ()
    return
  }
cube kernel
  func.func @cube_kernel(%gm_slot_buffer: !pto.ptr<f32>, %gm_x: !pto.ptr<f32>) attributes {pto.kernel_kind = #pto.kernel_kind<cube>} {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c16 = arith.constant 16 : index
    %c2v_import = pto.import_reserved_buffer {
      name = "c2v_fifo",
      peer_func = @vector_kernel
    } -> i32
    %c0_i32 = arith.constant 0 : i32
    pto.aic_initialize_pipe {dir_mask = 1, slot_size = 1024}
      (gm_slot_buffer = %gm_slot_buffer : !pto.ptr<f32>,
       c2v_consumer_buf = %c2v_import : i32,
       v2c_consumer_buf = %c0_i32 : i32)

    %x_mat_tile = pto.alloc_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>
    %x_left_tile = pto.alloc_tile : !pto.tile_buf<loc=left, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>
    %x_right_tile = pto.alloc_tile : !pto.tile_buf<loc=right, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=col_major, fractal=512, pad=0>
    %acc_tile = pto.alloc_tile : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=1024, pad=0>
    %gm_x_view = pto.make_tensor_view %gm_x, shape = [%c16, %c16], strides = [%c16, %c1] : !pto.tensor_view<?x?xf32>
    %gm_x_tile_view = pto.partition_view %gm_x_view, offsets = [%c0, %c0], sizes = [%c16, %c16] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<16x16xf32>
    pto.tload ins(%gm_x_tile_view : !pto.partition_tensor_view<16x16xf32>) outs(%x_mat_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>)
    pto.tmov ins(%x_mat_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>) outs(%x_left_tile : !pto.tile_buf<loc=left, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>)
    pto.tmov ins(%x_mat_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>) outs(%x_right_tile : !pto.tile_buf<loc=right, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=col_major, fractal=512, pad=0>)
    pto.tmatmul ins(%x_left_tile, %x_right_tile : !pto.tile_buf<loc=left, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=512, pad=0>, !pto.tile_buf<loc=right, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=col_major, fractal=512, pad=0>) outs(%acc_tile : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=1024, pad=0>)
    pto.tpush_to_aiv(%acc_tile : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=col_major, slayout=row_major, fractal=1024, pad=0>) {split = 0}
    return
  }
vector kernel
  func.func @vector_kernel(%gm_slot_buffer: !pto.ptr<f32>, %gm_y: !pto.ptr<f32>)
      attributes {pto.kernel_kind = #pto.kernel_kind<vector>} {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c16 = arith.constant 16 : index
    %c2v_local = pto.reserve_buffer {
      name = "c2v_fifo",
      size = 4096,
      location = #pto.address_space<vec>,
      auto = true
    } -> i32
    %c0_i32 = arith.constant 0 : i32
    pto.aiv_initialize_pipe {dir_mask = 1, slot_size = 1024}
      (gm_slot_buffer = %gm_slot_buffer : !pto.ptr<f32>,
       c2v_consumer_buf = %c2v_local : i32,
       v2c_consumer_buf = %c0_i32 : i32)

    %gm_y_view = pto.make_tensor_view %gm_y, shape = [%c16, %c16], strides = [%c16, %c1] : !pto.tensor_view<?x?xf32>
    %gm_y_tile_view = pto.partition_view %gm_y_view, offsets = [%c0, %c0], sizes = [%c16, %c16] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<16x16xf32>
    %recv_tile = pto.tpop_from_aic {split = 0}
      -> !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>
    pto.tstore ins(%recv_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16, v_row=16, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(%gm_y_tile_view : !pto.partition_tensor_view<16x16xf32>)
    pto.tfree_from_aic {split = 0}
    return
  }
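As a sanity check on the constants used in the two kernels, the arithmetic below assumes that slot_size and the reserve_buffer size are both byte counts. The PR does not state the units, so treat this as an assumption rather than a documented fact.

```python
# Assumption: slot_size and size are expressed in bytes.
ROWS, COLS = 16, 16       # tile shape from the example
BYTES_PER_F32 = 4         # size of one fp32 element
SLOT_SIZE = 1024          # slot_size passed to the initialize_pipe ops
RESERVED_SIZE = 4096      # size passed to pto.reserve_buffer

tile_bytes = ROWS * COLS * BYTES_PER_F32        # bytes in one 16x16 fp32 tile
slots_in_reserved = RESERVED_SIZE // SLOT_SIZE  # whole slots in the reserved buffer
```

Under that assumption, one 16x16 fp32 tile fills exactly one slot, and the reserved buffer has room for four such slots.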
NOTE:

  • We should probably not allow both vector cores to write in this example, since they both write to the same location. Either do the store only for vid==0, or use split=1 and let each aiv write half.
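The two options in the note can be sketched in plain Python. This is a toy model, not pto code: store_guarded and store_split are hypothetical helpers, two vector cores are assumed, and splitting by rows is an assumption about how the halves would be divided.

```python
# Toy model of the two ways to avoid duplicate writes. NOT pto code.
N = 16        # tile is N x N, as in the example
NUM_AIV = 2   # assumed number of vector cores sharing the result


def store_guarded(y, tile, vid):
    """Option 1: only vector core 0 writes the whole tile."""
    if vid != 0:
        return
    for i in range(N):
        y[i][:] = tile[i]


def store_split(y, tile, vid):
    """Option 2: each vector core writes its own block of rows (assumed row split)."""
    rows_per_core = N // NUM_AIV
    start = vid * rows_per_core
    for i in range(start, start + rows_per_core):
        y[i][:] = tile[i]


tile = [[float(i * N + j) for j in range(N)] for i in range(N)]
y_guarded = [[0.0] * N for _ in range(N)]
y_split = [[0.0] * N for _ in range(N)]
for vid in range(NUM_AIV):           # run both "cores" one after the other
    store_guarded(y_guarded, tile, vid)
    store_split(y_split, tile, vid)
```

Both variants leave y equal to the tile, and in both each element of y is written by exactly one core.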

API update

Existing code is unchanged; this PR only adds the possibility of having multiple func.funcs in a module.

Currently for a single function we can do:

@to_ir_module(meta_data=meta_data)
def kernel(arg0: "ptr_ty") -> None:
    ...

to_ir_module(...) creates the module and function.

Multi function support

To support mixed kernels, where we have one entry-point function and one vector/cube function, we add the option module=True. The outer decorated function then becomes the module, while the inner functions become the MLIR functions.

@to_ir_module(meta_data=meta_data, module=True)
def module():
    @pto.func(...)
    def f1(...): ...

    @pto.func(entry=True)
    def f2(...): ...

@fiskrt fiskrt changed the title from "Feat mix kernel gb" to "feat: Cross-core comm with TPUSH/TPOP" on Mar 27, 2026
Comment thread tests/frontend/test_multifunc_ir.py Outdated
@learning-chip
Collaborator

Will merge this after #102 so we have a newer ptoas version.

@fiskrt fiskrt marked this pull request as ready for review April 9, 2026 08:41
@learning-chip
Collaborator

learning-chip commented Apr 13, 2026

  • Currently the pto dialect only supports cross core transfers when cube/vector are separate func.funcs, see mlir examples below. So far we have been using pto.section.vector/cube so how can we unify this in the dsl?
  • Does func.func(kind=...) without pto.section.vector/cube also generate proper #ifdef guards so that the generated cpp source can be compiled by bisheng directly?
  • For pure vec/cube functions, can we also use func.func(kind=...) without pto.section.vector/cube? If so maybe we can sunset with pto.section to save one indentation.

Comment thread tests/frontend/test_multifunc_ir.py
Comment thread examples/aot/tpushpop/mix-kernel_mlir/kernels/v2c_builder.py
Comment thread docker/Dockerfile
@fiskrt
Collaborator Author

fiskrt commented Apr 13, 2026

  • Does func.func(kind=...) without pto.section.vector/cube also generate proper #ifdef guards so that the generated cpp source can be compiled by bisheng directly?

Yes, ptoas generates the guards for each function based on its pto.kernel_kind = #pto.kernel_kind<vector> attribute.

  • For pure vec/cube functions, can we also use func.func(kind=...) without pto.section.vector/cube? If so maybe we can sunset with pto.section to save one indentation.

We could do that in a future PR. It should be a simple change, but we would still have one nesting level because of the outer module.

@to_ir_module(meta_data=meta_data, module=True)
def module():
    @pto.func(kernel="cube")
    def cube_kernel():
        ...

But we should probably make the module implicit in the future since it's always needed.
But then we might have to pass in meta_data to each function separately...

Collaborator

@MirkoDeVita98 MirkoDeVita98 left a comment

lgtm

@MirkoDeVita98 MirkoDeVita98 merged commit 99270b4 into huawei-csl:main Apr 13, 2026
5 checks passed
3 participants