feat(api): support address-based pipe slot model + tfree(entry, split) by chenshengxin2026 · Pull Request #8 · PTO-ISA/pto-dsl

chenshengxin2026 · 2026-05-09T11:23:02Z

Summary

Extend the Python frontend so kernels can express the address-based pipe slot model (gm_slot_tensor init + entry-carrying tfree) that the dialect and ptoas already accept (after the "TALLOC/TPUSH/TPOP/TFREE GlobalTensor" change). All changes are purely additive — every existing call site keeps working unchanged.

This is the long-term API surface intended to replace the local workarounds in hw-native-sys/pto-isa#117 (text canonicalization + PTODSL_SKIP_VERIFY=1).

Companion issue with full motivation: huawei-csl/pto-dsl#137.

Changes

`ptodsl/api/pto_general.py`

- aic_initialize_pipe / aiv_initialize_pipe: accept gm_slot_tensor= (a !pto.tensor_view<...>) and local_slot_num=. When gm_slot_tensor is provided, the legacy gm_slot_buffer / c2v_consumer_buf / v2c_consumer_buf operands are not required. The implementation falls back to the generic Operation.create path while the installed mlir.dialects.pto._pto_ops_gen binding still predates the new operand shape.
- talloc_to_aic / talloc_to_aiv: emit the address-based slot allocation ops, which return a tensor_view that subsequent tpush/tfree consume.
- tfree_from_aic(entry, split=...) / tfree_from_aiv(entry, split=...): new entry+split overload that carries the popped tensor_view to the free op. Existing tfree(split, id=...) callers keep working; the overload is dispatched by argument shape. Without this operand, address-based TFREE lowered to a pipe-only no-op in ptoas-generated C++; carrying the entry restores real free notifications and unblocks long sequences (S1>=4096) that previously hung on slot exhaustion.
- bitcast(result_type, src): thin wrapper around _pto.BitcastOp (e.g. for reinterpreting an !pto.ptr<f32> region as !pto.ptr<f16> when one GM FIFO carries multiple dtypes).
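The operand rule for the pipe-init functions can be illustrated with a small standalone sketch. It is not the code from this PR: `select_pipe_operands` is a hypothetical helper, it returns a plain dict instead of building an MLIR op, and it assumes that the legacy form requires `gm_slot_buffer` and that `gm_slot_tensor` takes precedence when both are given.

```python
from typing import Any, Dict, Optional


def select_pipe_operands(
    *,
    gm_slot_tensor: Optional[Any] = None,
    local_slot_num: Optional[int] = None,
    gm_slot_buffer: Optional[Any] = None,
    c2v_consumer_buf: Optional[Any] = None,
    v2c_consumer_buf: Optional[Any] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: choose the operand set for a pipe-init op."""
    if gm_slot_tensor is not None:
        # Address-based form: the legacy operands are not required.
        attrs = {}
        if local_slot_num is not None:
            attrs["local_slot_num"] = local_slot_num
        return {
            "form": "address_based",
            "operands": {"gm_slot_tensor": gm_slot_tensor},
            "attrs": attrs,
        }

    # Legacy form. Assumption: gm_slot_buffer is mandatory here.
    if gm_slot_buffer is None:
        raise ValueError("pass gm_slot_tensor= or the legacy gm_slot_buffer=")
    operands = {"gm_slot_buffer": gm_slot_buffer}
    for name, value in (
        ("c2v_consumer_buf", c2v_consumer_buf),
        ("v2c_consumer_buf", v2c_consumer_buf),
    ):
        if value is not None:
            operands[name] = value
    return {"form": "legacy", "operands": operands, "attrs": {}}
```

A caller using the new form would pass only `gm_slot_tensor=` (plus `local_slot_num=` if needed), while an unchanged legacy call site keeps passing the buffer operands.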

`ptodsl/api/type_def.py`

TensorType now accepts shape=[d0, d1, ...] for static-shape views, alongside the existing rank=N dynamic form. Static shape is required by gm_slot_tensor pipe init because the lowered C++ runtime templates on a concrete pto::Shape<...>.
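To make the difference between the two forms concrete, here is a minimal sketch of a type description that accepts either a static shape or a dynamic rank. `ViewSpec` and `make_view_spec` are hypothetical names, not the `TensorType` implementation. Rendering dynamic dimensions as `?` follows the usual MLIR convention and is an assumption about this dialect, as is the rule that exactly one of `rank=` and `shape=` may be passed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ViewSpec:
    """Hypothetical stand-in for a tensor-view type description."""

    dtype: str
    dims: Tuple[Optional[int], ...]  # None marks a dynamic dimension

    @property
    def is_static(self) -> bool:
        return all(d is not None for d in self.dims)

    def render(self) -> str:
        # Assumption: dynamic dims print as "?", as in builtin MLIR types.
        body = "x".join("?" if d is None else str(d) for d in self.dims)
        return f"!pto.tensor_view<{body}x{self.dtype}>"


def make_view_spec(
    dtype: str,
    *,
    rank: Optional[int] = None,
    shape: Optional[Sequence[int]] = None,
) -> ViewSpec:
    # Assumption: callers pass exactly one of rank= or shape=.
    if (rank is None) == (shape is None):
        raise ValueError("pass exactly one of rank= or shape=")
    if shape is not None:
        dims = tuple(int(d) for d in shape)
        if not dims or any(d <= 0 for d in dims):
            raise ValueError("static shape needs positive dimensions")
        return ViewSpec(dtype, dims)
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return ViewSpec(dtype, (None,) * rank)
```

Only a spec whose `is_static` is true carries the concrete dimensions that a template such as `pto::Shape<...>` would need.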

`ptodsl/compiler/ir.py`

New PTODSL_SKIP_VERIFY env knob that suppresses the post-build ir_module.operation.verify(). Transitional — the proper long-term fix is to update the dialect ODS / Python binding so the new gm_slot_tensor init shape verifies cleanly, after which this env knob can be removed. Included so kernels can be authored against the new API without waiting for the binding refresh.
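The knob amounts to an environment-gated verification step. The following is a rough sketch of that pattern rather than the code in `ptodsl/compiler/ir.py`: `finalize_module` and `skip_verify_requested` are hypothetical names, and treating only the value `1` as "skip" is an assumption based on the `PTODSL_SKIP_VERIFY=1` usage mentioned in the summary.

```python
import os
from typing import Mapping, Optional


def skip_verify_requested(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the (assumed) opt-out value "1" is set."""
    env = os.environ if environ is None else environ
    return env.get("PTODSL_SKIP_VERIFY", "") == "1"


def finalize_module(ir_module, environ: Optional[Mapping[str, str]] = None):
    """Run the post-build verifier unless the env knob suppresses it."""
    if not skip_verify_requested(environ):
        # Same call the PR description names for the post-build check.
        ir_module.operation.verify()
    return ir_module
```

Keeping the check in one small function makes the escape hatch easy to delete once the binding accepts the new init shape.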

`ptodsl/api/pto.py`

Re-export bitcast, talloc_to_aic, talloc_to_aiv.

Backwards compatibility

- Legacy pto.aic_initialize_pipe(..., gm_slot_buffer=..., c2v_consumer_buf=..., v2c_consumer_buf=...) keeps working.
- Legacy pto.tfree_from_aic(split, id=...) keeps working. The new tfree_from_aic(entry, split, id=...) form is selected only when split is passed positionally, so the single-argument split=... shape still hits the legacy _pto.TFreeFromAicOp path.
- Legacy pto.TensorType(rank=N, dtype=...) keeps working.
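The dispatch rule for tfree can be sketched as a plain Python function. This is an illustration under stated assumptions, not the PR's implementation: `tfree_from_aic_sketch` is a hypothetical name, it returns a description of the chosen path instead of emitting an op, and it follows the rule above that the entry form is chosen only when `split` arrives as a second positional argument. How a mixed call such as one positional argument plus `split=` is resolved is not spelled out here, so the sketch rejects it rather than guessing.

```python
def tfree_from_aic_sketch(*args, split=None, id=None):
    """Hypothetical dispatcher that mirrors the described overload rule."""
    if len(args) == 2:
        # New form: (entry, split) with split passed positionally.
        if split is not None:
            raise TypeError("split passed both positionally and by keyword")
        entry, split_value = args
        return {"path": "entry", "entry": entry, "split": split_value, "id": id}
    if len(args) == 1 and split is None:
        # Legacy form: the single positional argument is the split.
        return {"path": "legacy", "split": args[0], "id": id}
    if len(args) == 0 and split is not None:
        # Legacy form with split given by keyword.
        return {"path": "legacy", "split": split, "id": id}
    # Assumption: anything else is treated as ambiguous in this sketch.
    raise TypeError("ambiguous tfree call shape")
```

Dispatching on argument count keeps existing call sites on the legacy path without any change on the caller's side.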

Validation

Used end-to-end by kernels/python/flash_atten/kernels/fa_builder.py in hw-native-sys/pto-isa#117:

- IR: pto.tfree_from_aic(%entry : !pto.tensor_view<...>) {id, split}, pto.aic_initialize_pipe {...}(gm_slot_tensor = %qk_slot : !pto.tensor_view<128x256xf32>), pto.talloc_to_aic{...} -> !pto.tensor_view<...>.
- ptoas lowering: address-based TPipe<..., 8, 8, false> (LocalSlotNum=8 caveat is on the ptoas side, tracked separately) and TFREE<Pipe, GlobalTensor, ...>(pipe, tensor).
- Runs end-to-end at S1=8192 / 32 tiles on 910B2 (the case that hung with pipe-only TFREE):
```
s1=8192  tiles=32  fa=21.90us  err=1.48e-05
```

Related

- huawei-csl/pto-dsl#137, "Frontend cannot express address-based pipe slot model (gm_slot_tensor + tfree(entry))": companion issue with full root-cause analysis.
- hw-native-sys/pto-isa#117, "feat(flash-attn): add Python DSL Flash Attention example under kernels/python/flash_atten": PTO-DSL Flash Attention v2 kernel; consumer of this API.
- A companion PR/issue against hw-native-sys/PTOAS is filed separately for the gm_slot_tensor + local_slot_num lowering and the verifier message that prevents writing local_slot_num on the IR side.

Test plan

- python3 -m py_compile ptodsl/api/pto_general.py ptodsl/api/pto.py ptodsl/api/type_def.py ptodsl/compiler/ir.py
- Existing pto-dsl examples (legacy API path) unchanged.
- Downstream kernel kernels/python/flash_atten builds end-to-end through ptoas + bisheng and runs on 910B2.

Extend the Python frontend so kernels can express the address-based pipe shape that ptoas and pto-isa already accept (PR #606 / hw-native-sys/pto-isa "update TALLOC/TPUSH/TPOP/TFREE to support push or pop GlobalTensor"):

API additions (ptodsl/api/pto_general.py, ptodsl/api/pto.py):
- aic_initialize_pipe / aiv_initialize_pipe: accept gm_slot_tensor (!pto.tensor_view<...>) instead of the legacy gm_slot_buffer + c2v/v2c_consumer_buf triplet, plus a local_slot_num attribute that mirrors the C++ TPipe template arg. The c2v_consumer_buf and v2c_consumer_buf operands become optional so kernels using the address-based form do not need to supply them. Falls back to the generic Operation.create path while the installed mlir.dialects.pto._pto_ops_gen binding still predates the new operand shape.
- talloc_to_aic / talloc_to_aiv: emit the address-based slot allocation ops (returns a tensor_view that subsequent tpush/tfree consume).
- tfree_from_aic(entry, split=...) / tfree_from_aiv(entry, split=...): new entry+split overload that carries the popped tensor_view to the free op. Existing tfree(split, id=...) callers keep working unchanged. This is the missing operand that turned address-based TFREE into a pipe-only no-op in ptoas-generated C++; carrying the entry restores real free notifications and unblocks long sequences (S1>=4096) that previously hung on slot exhaustion.
- bitcast: thin wrapper around _pto.BitcastOp (e.g. for cross-cast reinterpreting a !pto.ptr<f32> region as !pto.ptr<f16>).

Type system (ptodsl/api/type_def.py):
- TensorType now accepts shape=[d0, d1, ...] for static-shape views, alongside the existing rank=N dynamic form. Static shape is required by gm_slot_tensor pipe init because the lowered C++ runtime templates on a concrete pto::Shape<...>.

Compiler (ptodsl/compiler/ir.py):
- New PTODSL_SKIP_VERIFY env knob that suppresses the post-build ir_module.operation.verify(). This is a transitional escape hatch while the installed mlir-dialect verifier still rejects the address-based gm_slot_tensor init shape; it is meant to be removed once the dialect verifier/binding catches up to ptoas.

Backwards compatibility:
- All legacy callers (gm_slot_buffer + c2v/v2c, split-only tfree, rank-only TensorType) keep working without modification. The new API surface is purely additive.

Motivating downstream:
- hw-native-sys/pto-isa#117 (PTO-DSL Flash Attention performance kernel) consumes this API to align with the manual fa_performance_kernel.cpp.

chenshengxin2026 · 2026-05-18T06:53:23Z

Closing this PR because its motivating FA use case is now covered by huawei-csl/pto-dsl mainline through the handle-based L2G2L tensor-entry path.

The original motivation was to express the manual Flash Attention layout where the FIFO/logical slot uses TILE_S1=256 while cube compute remains on CUBE_S1=128 subtiles. Current mainline implements that through initialize_l2g2l_pipe + declare_global + talloc/tpop_into/tpush/tfree(entry=...), and examples/aot/flash_attention/140tflops/fa_dsl_builder.py demonstrates the full pattern.

That means address-based FIFO remains the right abstraction, but this PR's legacy aic/aiv_initialize_pipe(gm_slot_tensor=...) + talloc_to_* + tfree_from_*(entry) surface is no longer required to unblock the FA TILE_S1=256 / CUBE_S1=128 design. Any remaining legacy-id-based pipe or local_slot_num/local_addr alignment work can be tracked separately with a narrower scope.

chenshengxin2026 closed this May 18, 2026

chenshengxin2026 wants to merge 1 commit into PTO-ISA:main from chenshengxin2026:feat/address-based-slot-api
