[Feature] Expose address-based FIFO slot ops to decouple logical tile from compute subtile #637

@chenshengxin2026

Context

pto.tpush / pto.tpop / pto.tfree in the PTO dialect today have tile-payload semantics:
one tile is one FIFO transaction, and the tile itself is the payload. The $tile operand is
constrained to PTODpsType (TileBufType).

The underlying FIFO runtime, however, also supports an address-based slot model:

  1. TALLOC reserves a producer slot and writes the slot's GM base address into a
    GlobalTensor descriptor.
  2. The producer issues multiple stores into subviews of the slot (e.g., column subtiles),
    each touching a different sub-region of the same slot.
  3. A single TPUSH commits the slot.
  4. The consumer side mirrors with TPOP returning a slot descriptor, multiple TLOADs
    over subviews, and a single TFREE.

The dialect cannot express step (2) today: there is no op form whose pipe-side operand
is a GlobalTensor / tensor view, and no pto.talloc op at all.
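The four-step slot protocol above can be modeled on the host as a toy FIFO. This is only a sketch for illustration: the `SlotFifo` class, its method names, and the use of a slot index in place of a GM base address are all assumptions, not the PTO runtime or dialect API.

```python
# Toy host-side model of the address-based slot protocol.
# All names here are hypothetical; this is not the PTO runtime API.
from collections import deque


class SlotFifo:
    def __init__(self, num_slots, slot_width):
        self.slot_width = slot_width
        self.slots = [[None] * slot_width for _ in range(num_slots)]
        self.free = deque(range(num_slots))  # slots the producer may reserve
        self.ready = deque()                 # committed slots awaiting the consumer

    def talloc(self):
        # Reserve a producer slot; the index stands in for the slot's GM base address.
        return self.free.popleft()

    def store(self, slot, col_offset, values):
        # Write one subtile into a sub-region of an already reserved slot.
        assert col_offset + len(values) <= self.slot_width
        self.slots[slot][col_offset:col_offset + len(values)] = values

    def tpush(self, slot):
        # Commit the whole slot once, after all subtile stores.
        self.ready.append(slot)

    def tpop(self):
        # Consumer side: obtain the next committed slot.
        return self.ready.popleft()

    def load(self, slot, col_offset, width):
        # Read one subtile out of a popped slot.
        return self.slots[slot][col_offset:col_offset + width]

    def tfree(self, slot):
        # Release the slot back to the producer after all loads.
        self.free.append(slot)


fifo = SlotFifo(num_slots=2, slot_width=8)

slot = fifo.talloc()                 # step 1
fifo.store(slot, 0, [1, 2, 3, 4])    # step 2: first column subtile
fifo.store(slot, 4, [5, 6, 7, 8])    # step 2: second column subtile
fifo.tpush(slot)                     # step 3: single commit

got = fifo.tpop()                    # step 4: mirror on the consumer side
left = fifo.load(got, 0, 4)
right = fifo.load(got, 4, 4)
fifo.tfree(got)
```

The point of the model is that the number of stores and loads per slot is independent of the number of FIFO transactions, which is exactly what the tile-payload ops cannot express.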

Concrete user impact

hw-native-sys/pto-isa#117 ports the
Flash Attention performance kernel to the PTO Python DSL. The PR is forced to set
S1_TILE = 32 with QK_PRELOAD = 2 in
kernels/python/flash_atten-v1/kernels/fa_builder.py because the larger logical tile
that the C++ kernel uses (TILE_S1 = 256) does not compile through ptoas today.

The C++ kernel solves this by decoupling logical tile width from Cube subtile width:

  • TILE_S1 = 256 is the FIFO slot width.
  • CUBE_S1 = 128 is the Cube matmul subtile width.
  • Per slot the Cube produces two CUBE_S1-wide subtiles into the same logical slot
    and commits once.
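The width arithmetic behind this decoupling can be sketched in a few lines. The helper `subtile_ranges` is a hypothetical name introduced here for illustration; only the constants TILE_S1 and CUBE_S1 come from the C++ kernel described above.

```python
# Column ranges that each Cube subtile covers inside one FIFO slot.
# subtile_ranges is an illustrative helper, not part of the kernel or DSL.
TILE_S1 = 256  # FIFO slot width
CUBE_S1 = 128  # Cube matmul subtile width


def subtile_ranges(slot_width, subtile_width):
    if slot_width % subtile_width != 0:
        raise ValueError("slot width must be a multiple of the subtile width")
    return [(off, off + subtile_width) for off in range(0, slot_width, subtile_width)]


ranges = subtile_ranges(TILE_S1, CUBE_S1)
print(ranges)  # [(0, 128), (128, 256)]
```

Each range corresponds to one store into the same slot, followed by a single commit for the whole slot.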

That decoupling requires the address-based slot pattern. The DSL cannot represent it
because tpush is tile-payload (one tile equals one FIFO transaction), so S1_TILE is
forced to be the FIFO slot width, the Cube tile width, and the Vec softmax tile width
all at once.

The DSL is therefore boxed into one of two compromises, both of which lose performance:

Where the dialect is closed off

include/PTO/IR/PTOOps.td (TPushOp / TPopOp): the $tile operand is typed
PTODpsType, with no overload accepting a GlobalTensor or tensor view; no
pto.talloc op is defined.

lib/PTO/IR/PTO.cpp (TPushOp::verify / TPopOp::verify): pipe-role inference in
getPipe() is driven by the tile's address space (ACC → producer, VEC/MAT →
consumer). This is structurally tied to the tile-payload assumption. When the verifier
cannot map the tile to a pipe role it emits:

  • 'pto.tpush' op tile type must map to a supported producer pipe
  • 'pto.tpop' op tile type and target arch must map to a supported consumer pipe
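To show why this inference leaves no room for a GlobalTensor operand, here is a rough Python model built only from the description above. It is a simplification, not the actual verifier: the function name `infer_pipe_role`, the string labels for address spaces, and the "GM" label for a GlobalTensor are assumptions, and the target-arch input mentioned in the tpop diagnostic is omitted.

```python
# Simplified model of address-space-driven pipe-role inference.
# The mapping follows the issue text (ACC -> producer, VEC/MAT -> consumer).
# The real getPipe() lives in C++ and also considers the target arch.
ROLE_BY_ADDRESS_SPACE = {"ACC": "producer", "VEC": "consumer", "MAT": "consumer"}

EXPECTED_ROLE = {"pto.tpush": "producer", "pto.tpop": "consumer"}

ERRORS = {
    "pto.tpush": "'pto.tpush' op tile type must map to a supported producer pipe",
    "pto.tpop": "'pto.tpop' op tile type and target arch must map to a supported consumer pipe",
}


def infer_pipe_role(op_name, address_space):
    role = ROLE_BY_ADDRESS_SPACE.get(address_space)
    if role != EXPECTED_ROLE[op_name]:
        raise ValueError(ERRORS[op_name])
    return role


print(infer_pipe_role("pto.tpush", "ACC"))  # producer
print(infer_pipe_role("pto.tpop", "VEC"))   # consumer

# A GlobalTensor-style operand ("GM" is a hypothetical label) maps to no role.
try:
    infer_pipe_role("pto.tpush", "GM")
except ValueError as err:
    gm_error = str(err)
    print(gm_error)
```

Under this model, an address-based op form would need the pipe role supplied some other way than through the operand's address space.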

Linked PR

  • hw-native-sys/pto-isa#117
    PTO-DSL Flash Attention v1, currently pinned at S1_TILE = 32 because of the
    limitation described above.

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions