[Feature] Expose address-based FIFO slot ops to decouple logical tile from compute subtile #637

## Context

`pto.tpush` / `pto.tpop` / `pto.tfree` in the PTO dialect today have **tile-payload semantics**:
one tile is one FIFO transaction, and the tile itself is the payload. The `$tile` operand is
constrained to `PTODpsType` (`TileBufType`).

The underlying FIFO runtime, however, also supports an **address-based slot model**:

1. `TALLOC` reserves a producer slot and writes the slot's GM base address into a
   `GlobalTensor` descriptor.
2. The producer issues multiple stores into subviews of the slot (e.g., column subtiles),
   each touching a different sub-region of the same slot.
3. A single `TPUSH` commits the slot.
4. The consumer side mirrors this: `TPOP` returns a slot descriptor, the consumer issues
   multiple `TLOAD`s over subviews, and a single `TFREE` releases the slot.
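
The four steps above can be modeled in a few lines of plain Python. This is only a toy sketch to make the slot lifecycle concrete: `SlotFifo`, `store_cols`, and `load_cols` are hypothetical names, nested lists stand in for GM, and the small sizes are placeholders. It is not the PTO runtime or DSL API.

```python
class SlotFifo:
    """Toy model of an address-based slot FIFO; not the PTO runtime API."""

    def __init__(self, num_slots, rows, cols):
        # Each slot is a rows x cols buffer standing in for a GM region.
        self.slots = [[[0.0] * cols for _ in range(rows)] for _ in range(num_slots)]
        self.head = 0        # next slot the producer reserves
        self.tail = 0        # next slot the consumer pops
        self.reserved = 0    # slots allocated and not yet freed
        self.committed = 0   # slots pushed and not yet popped

    def talloc(self):
        """Step 1: reserve a producer slot and hand back its storage."""
        if self.reserved == len(self.slots):
            raise RuntimeError("no free slot")
        slot = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.reserved += 1
        return slot

    def tpush(self):
        """Step 3: a single commit makes the whole slot visible."""
        self.committed += 1

    def tpop(self):
        """Step 4a: the consumer receives the oldest committed slot."""
        if self.committed == 0:
            raise RuntimeError("no committed slot")
        slot = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        self.committed -= 1
        return slot

    def tfree(self):
        """Step 4c: release the slot for reuse."""
        self.reserved -= 1


def store_cols(slot, col0, subtile):
    """Step 2: write one column subtile into a sub-region of the slot."""
    for row, sub_row in zip(slot, subtile):
        row[col0:col0 + len(sub_row)] = sub_row


def load_cols(slot, col0, width):
    """Step 4b: read one column subview back out of the slot."""
    return [row[col0:col0 + width] for row in slot]


ROWS, SLOT_COLS, SUB_COLS = 4, 8, 4  # small stand-ins for the real tile sizes
fifo = SlotFifo(num_slots=2, rows=ROWS, cols=SLOT_COLS)

# Producer: one talloc, two subtile stores, one tpush.
slot = fifo.talloc()
for j in range(SLOT_COLS // SUB_COLS):
    subtile = [[float(j + 1)] * SUB_COLS for _ in range(ROWS)]
    store_cols(slot, j * SUB_COLS, subtile)
fifo.tpush()

# Consumer: one tpop, two subview loads, one tfree.
view = fifo.tpop()
parts = [load_cols(view, j * SUB_COLS, SUB_COLS) for j in range(SLOT_COLS // SUB_COLS)]
fifo.tfree()
```

The point of the model is that the number of stores and loads per slot is independent of the number of commits, which is exactly what tile-payload `pto.tpush` cannot express.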

The dialect cannot express step (2) today: there is no op form whose pipe-side operand
is a `GlobalTensor` / tensor view, and no `pto.talloc` op at all.

## Concrete user impact

[hw-native-sys/pto-isa#117](https://github.com/hw-native-sys/pto-isa/pull/117) ports the
Flash Attention performance kernel to the PTO Python DSL. The PR is forced to set
`S1_TILE = 32` with `QK_PRELOAD = 2` in
`kernels/python/flash_atten-v1/kernels/fa_builder.py` because the larger logical tile
that the C++ kernel uses (`TILE_S1 = 256`) does not compile through ptoas today.

The C++ kernel solves this by **decoupling logical tile width from Cube subtile width**:

- `TILE_S1 = 256` is the FIFO slot width.
- `CUBE_S1 = 128` is the Cube matmul subtile width.
- Per slot the Cube produces two `CUBE_S1`-wide subtiles into the same logical slot
  and commits once.

That decoupling requires the address-based slot pattern. The DSL cannot represent it
because `pto.tpush` has tile-payload semantics (one tile is one FIFO transaction), so
`S1_TILE` is forced to serve as the FIFO slot width, the Cube tile width, and the
Vec softmax tile width all at once.
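
To illustrate the difference, the sketch below enumerates a producer-side schedule for a given S1 span. `producer_schedule` and `count` are hypothetical helpers written for this issue, not DSL functions. The coupled case is modeled by setting the slot width equal to the Cube width, which is what tile-payload semantics imply.

```python
def producer_schedule(s1, slot_width, cube_width):
    """Return (op, slot_index, col_start, col_end) tuples for one S1 span."""
    assert s1 % slot_width == 0 and slot_width % cube_width == 0
    ops = []
    for slot in range(s1 // slot_width):
        ops.append(("talloc", slot, 0, slot_width))
        # One store per Cube subtile, each into a different column range.
        for col in range(0, slot_width, cube_width):
            ops.append(("store", slot, col, col + cube_width))
        ops.append(("tpush", slot, 0, slot_width))  # single commit per slot
    return ops


def count(ops, name):
    return sum(1 for op in ops if op[0] == name)


# Decoupled, as in the C++ kernel: TILE_S1 = 256, CUBE_S1 = 128.
decoupled = producer_schedule(s1=1024, slot_width=256, cube_width=128)
# Coupled, as forced in the DSL today: S1_TILE = 32 plays both roles.
coupled = producer_schedule(s1=1024, slot_width=32, cube_width=32)

print(count(decoupled, "store"), count(decoupled, "tpush"))
print(count(coupled, "store"), count(coupled, "tpush"))
```

In the decoupled schedule the commit count follows the slot width while the store count follows the Cube width; in the coupled schedule the two counts are necessarily equal.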

The DSL is therefore boxed into one of two compromises, both of which lose performance:

- **`S1_TILE = 32`** (the path PR #117 actually takes): compiles, but PV/GU iterations
  scale as `S1/32` instead of `S1/256`. The PV payload size is fixed at
  `S0 × HEAD × fp32 = 64 KiB` regardless of S1 tile width, so this multiplies PV pipe
  traffic and GU rescale frequency by 8× at the same logical S1 span. PR #117's own
  benchmark numbers (910B2, head_dim=128, S0=128, causal=False) against the manual C++
  baseline:

  | S1   | Manual C++ (us) | DSL `S1_TILE=32` (us) | Speedup (C++ time / DSL time) |
  |-----:|----------------:|----------------------:|------------------------------:|
  | 1024 |          39.380 |                 25.74 |   1.53× |
  | 2048 |          57.220 |                 66.05 |   0.87× |
  | 8192 |          90.440 |                160.09 |   0.56× |

  The DSL falls behind somewhere between `S1 = 1024` and `S1 = 2048`, which is where
  PV/GU amplification overtakes the Cube/Vec overlap.

- **`S1_TILE = 256`** (direct): the ptoas memory planner fails with
  `right overflow, requires 1048576 bits while 524288 bits avaliable!`
  (from `lib/PTO/Transforms/PTOPlanMemory.cpp`). A `128×256` fp16 K tile and a
  `256×128` fp16 V tile are each 64 KiB; both live in `RIGHT` (64 KiB total) at the
  same time, so 128 KiB (1048576 bits) is requested against 64 KiB (524288 bits)
  of capacity.
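
The figures quoted in both bullets can be re-derived with a short calculation. The sketch below uses only numbers stated above (S0 = 128, head_dim = 128, the tile shapes, and the benchmark timings); the helper names are made up for illustration.

```python
KIB = 1024
FP16_BYTES, FP32_BYTES = 2, 4
S0, HEAD = 128, 128

# PV payload per iteration is independent of the S1 tile width.
PV_PAYLOAD_BYTES = S0 * HEAD * FP32_BYTES


def pv_traffic_bytes(s1, s1_tile):
    """Total PV pipe traffic for one S1 span: one payload per S1 tile."""
    return (s1 // s1_tile) * PV_PAYLOAD_BYTES


# Same logical span, S1_TILE = 32 versus S1_TILE = 256.
amplification = pv_traffic_bytes(8192, 32) // pv_traffic_bytes(8192, 256)


def tile_bytes(rows, cols, elem_bytes):
    return rows * cols * elem_bytes


# RIGHT buffer footprint at S1_TILE = 256: K and V tiles live at the same time.
k_tile = tile_bytes(128, 256, FP16_BYTES)
v_tile = tile_bytes(256, 128, FP16_BYTES)
requested_bits = (k_tile + v_tile) * 8
RIGHT_CAPACITY_BITS = 524288  # from the planner error message

# Speedup column of the table: manual C++ time divided by DSL time.
timings_us = {1024: (39.380, 25.74), 2048: (57.220, 66.05), 8192: (90.440, 160.09)}
speedups = {s1: round(cpp / dsl, 2) for s1, (cpp, dsl) in timings_us.items()}

print(PV_PAYLOAD_BYTES // KIB, amplification)
print(requested_bits, RIGHT_CAPACITY_BITS)
print(speedups)
```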

## Where the dialect is closed off

`include/PTO/IR/PTOOps.td` — `TPushOp` / `TPopOp`: the `$tile` operand is typed
`PTODpsType`, with no overload accepting a `GlobalTensor` or tensor view; no
`pto.talloc` op is defined.

`lib/PTO/IR/PTO.cpp` — `TPushOp::verify` / `TPopOp::verify`: pipe-role inference in
`getPipe()` is driven by the tile's address space (`ACC` → producer, `VEC`/`MAT` →
consumer). This is structurally tied to the tile-payload assumption. When the verifier
cannot map the tile to a pipe role it emits:

- `'pto.tpush' op tile type must map to a supported producer pipe`
- `'pto.tpop' op tile type and target arch must map to a supported consumer pipe`
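
As a rough illustration of why this inference does not extend to slot descriptors, the sketch below mimics the mapping described above. It is not the code in `lib/PTO/IR/PTO.cpp`: `infer_pipe_role` and `verify_tpush` are hypothetical, address spaces are plain strings, the target-arch dependence of `pto.tpop` is ignored, and `"GM"` as the address space of a `GlobalTensor` operand is an assumption. Only the diagnostic text is taken from the verifier.

```python
# Assumed mapping, following the description above:
# ACC -> producer, VEC / MAT -> consumer.
PRODUCER_SPACES = {"ACC"}
CONSUMER_SPACES = {"VEC", "MAT"}


def infer_pipe_role(address_space):
    """Return "producer", "consumer", or None when no role can be inferred."""
    if address_space in PRODUCER_SPACES:
        return "producer"
    if address_space in CONSUMER_SPACES:
        return "consumer"
    return None


def verify_tpush(address_space):
    """Return a diagnostic string on failure, or None on success."""
    if infer_pipe_role(address_space) != "producer":
        return "'pto.tpush' op tile type must map to a supported producer pipe"
    return None


# A tile in ACC maps to a producer pipe.
ok = verify_tpush("ACC")
# A GM-resident slot descriptor (assumed) carries no pipe role of its own,
# so address-space-driven inference has nothing to work with.
err = verify_tpush("GM")
print(ok, err)
```

Under these assumptions, an address-based form would need the pipe role to come from somewhere other than the operand's address space, for example from the pipe operand itself.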

## Linked PR

- [hw-native-sys/pto-isa#117](https://github.com/hw-native-sys/pto-isa/pull/117) —
  PTO-DSL Flash Attention v1, currently pinned at `S1_TILE = 32` because of the
  limitation described above.

