Context
pto.tpush / pto.tpop / pto.tfree in the PTO dialect today have tile-payload semantics:
one tile is one FIFO transaction, and the tile itself is the payload. The $tile operand is
constrained to PTODpsType (TileBufType).
The underlying FIFO runtime, however, also supports an address-based slot model:
1. TALLOC reserves a producer slot and writes the slot's GM base address into a
   GlobalTensor descriptor.
2. The producer issues multiple stores into subviews of the slot (e.g., column subtiles),
   each touching a different sub-region of the same slot.
3. A single TPUSH commits the slot.
4. The consumer side mirrors with TPOP returning a slot descriptor, multiple TLOADs
   over subviews, and a single TFREE.
The dialect cannot express step (2) today: there is no op form whose pipe-side operand
is a GlobalTensor / tensor view, and no pto.talloc op at all.
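The four steps above can be sketched as a toy Python model. This is only an illustration of the slot lifecycle: the class `SlotFifo`, its method names, and the flat list standing in for GM are all hypothetical and are not the PTO runtime or DSL API.

```python
from collections import deque


class SlotFifo:
    """Toy model of an address-based slot FIFO (hypothetical, not the PTO API)."""

    def __init__(self, num_slots, slot_elems):
        self.slot_elems = slot_elems
        self.gm = [0] * (num_slots * slot_elems)  # stand-in for global memory
        self.free = deque(range(num_slots))       # slots the producer may reserve
        self.ready = deque()                      # committed slots, FIFO order

    def talloc(self):
        """Step 1: reserve a slot and return its base address."""
        return self.free.popleft() * self.slot_elems

    def tstore(self, base, offset, values):
        """Step 2: store into one sub-region of a reserved slot."""
        assert offset + len(values) <= self.slot_elems
        self.gm[base + offset:base + offset + len(values)] = values

    def tpush(self, base):
        """Step 3: commit the whole slot as a single FIFO transaction."""
        self.ready.append(base)

    def tpop(self):
        """Step 4a: the consumer receives the base address of the oldest slot."""
        return self.ready.popleft()

    def tload(self, base, offset, count):
        """Step 4b: load one sub-region of a popped slot."""
        return self.gm[base + offset:base + offset + count]

    def tfree(self, base):
        """Step 4c: hand the slot back to the producer."""
        self.free.append(base // self.slot_elems)


fifo = SlotFifo(num_slots=2, slot_elems=8)

# Producer: one allocation, two sub-region stores, one commit.
slot = fifo.talloc()
fifo.tstore(slot, 0, [1, 2, 3, 4])
fifo.tstore(slot, 4, [5, 6, 7, 8])
fifo.tpush(slot)

# Consumer: one pop, two sub-region loads, one free.
got = fifo.tpop()
left = fifo.tload(got, 0, 4)
right = fifo.tload(got, 4, 4)
fifo.tfree(got)
```

The point of the model is that the number of stores and loads per slot is independent of the number of FIFO transactions, which is exactly what the tile-payload form cannot express.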
Concrete user impact
hw-native-sys/pto-isa#117 ports the
Flash Attention performance kernel to the PTO Python DSL. The PR is forced to set
S1_TILE = 32 with QK_PRELOAD = 2 in
kernels/python/flash_atten-v1/kernels/fa_builder.py because the larger logical tile
that the C++ kernel uses (TILE_S1 = 256) does not compile through ptoas today.
The C++ kernel solves this by decoupling logical tile width from Cube subtile width:
- TILE_S1 = 256 is the FIFO slot width.
- CUBE_S1 = 128 is the Cube matmul subtile width.
- Per slot the Cube produces two CUBE_S1-wide subtiles into the same logical slot
  and commits once.
That decoupling requires the address-based slot pattern. In the DSL it cannot be
represented because tpush is tile-payload (one tile equals one FIFO transaction),
so S1_TILE is forced to serve simultaneously as the FIFO slot width, the Cube tile
width, and the Vec softmax tile width.
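As a small illustration of the decoupling, the column ranges that each Cube subtile would cover inside one slot can be computed as follows. The helper `subtile_columns` is hypothetical; only the two width constants come from the C++ kernel described above.

```python
TILE_S1 = 256  # FIFO slot width (from the C++ kernel)
CUBE_S1 = 128  # Cube matmul subtile width (from the C++ kernel)


def subtile_columns(slot_width, subtile_width):
    """Return the [start, stop) column range of each subtile in one slot."""
    assert slot_width % subtile_width == 0, "subtiles must tile the slot exactly"
    return [(c, c + subtile_width) for c in range(0, slot_width, subtile_width)]


# Address-based slot: two stores per slot, one commit.
decoupled = subtile_columns(TILE_S1, CUBE_S1)

# Tile-payload: slot width and Cube tile width are the same number,
# so there is exactly one store per commit (S1_TILE = 32 in the DSL port).
coupled = subtile_columns(32, 32)

print(decoupled)  # [(0, 128), (128, 256)]
print(coupled)    # [(0, 32)]
```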
The DSL is therefore boxed into one of two compromises, both of which lose performance:
- S1_TILE = 32 (the path PR #117 actually takes): compiles, but PV/GU iterations
  scale as S1/32 instead of S1/256. The PV payload size is fixed at
  S0 × HEAD × fp32 = 64 KiB regardless of S1 tile width, so this multiplies PV pipe
  traffic and GU rescale frequency by 8× at the same logical S1 span. PR #117's own
  benchmark numbers (910B2, head_dim=128, S0=128, causal=False) against the manual C++
  baseline:
  [benchmark table not recovered; the only surviving column header is "S1_TILE=32 (us)"]
  The crossover at S1 ≥ 2048 is where PV/GU amplification overtakes Cube/Vec overlap.
- S1_TILE = 256 (direct): ptoas memory planner emits
  right overflow, requires 1048576 bits while 524288 bits avaliable!
  (from lib/PTO/Transforms/PTOPlanMemory.cpp). A 128×256 fp16 K tile and a
  256×128 fp16 V tile are each 64 KiB; both live in RIGHT (64 KiB total) at the
  same time, so 128 KiB is requested.
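The figures quoted in the two compromises can be checked with a few lines of arithmetic. This is only a sanity check of the numbers stated above; the helper `tile_bytes` is hypothetical, and the tile shapes and the RIGHT capacity are taken from the text and the planner's error message.

```python
KIB = 1024
FP16_BYTES = 2
FP32_BYTES = 4

S0, HEAD = 128, 128            # benchmark configuration quoted above
RIGHT_CAPACITY_BITS = 524288   # "524288 bits" from the planner error message


def tile_bytes(rows, cols, elem_bytes):
    """Size in bytes of a dense rows x cols tile."""
    return rows * cols * elem_bytes


# Compromise 1: the PV payload is fixed, so a smaller S1 tile means more pushes.
pv_payload = tile_bytes(S0, HEAD, FP32_BYTES)
amplification = 256 // 32      # S1/32 iterations versus S1/256 iterations

# Compromise 2: K and V tiles are live in RIGHT at the same time.
k_tile = tile_bytes(128, 256, FP16_BYTES)
v_tile = tile_bytes(256, 128, FP16_BYTES)
requested_bits = (k_tile + v_tile) * 8

print(pv_payload // KIB)                      # 64
print(amplification)                          # 8
print(requested_bits, RIGHT_CAPACITY_BITS)    # 1048576 524288
```

The requested size is exactly twice the RIGHT capacity, which matches the overflow reported by the planner.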
Where the dialect is closed off
- include/PTO/IR/PTOOps.td (TPushOp / TPopOp): the $tile operand is typed
  PTODpsType, with no overload accepting a GlobalTensor or tensor view; no
  pto.talloc op is defined.
- lib/PTO/IR/PTO.cpp (TPushOp::verify / TPopOp::verify): pipe-role inference in
  getPipe() is driven by the tile's address space (ACC → producer, VEC/MAT →
  consumer). This is structurally tied to the tile-payload assumption. When the verifier
  cannot map the tile to a pipe role it emits:
  'pto.tpush' op tile type must map to a supported producer pipe
  'pto.tpop' op tile type and target arch must map to a supported consumer pipe
Linked PR
- hw-native-sys/pto-isa#117: PTO-DSL Flash Attention v1, currently pinned at
  S1_TILE = 32 because of the limitation described above.