feat(api): support address-based pipe slot model + tfree(entry, split) #8

Closed
chenshengxin2026 wants to merge 1 commit into PTO-ISA:main from chenshengxin2026:feat/address-based-slot-api

@chenshengxin2026 chenshengxin2026 commented May 9, 2026

Summary

Extend the Python frontend so kernels can express the address-based pipe slot model (gm_slot_tensor init + entry-carrying tfree) that the dialect and ptoas already accept (after the "TALLOC/TPUSH/TPOP/TFREE GlobalTensor" change). All changes are purely additive — every existing call site keeps working unchanged.

This is the long-term API surface intended to replace the local workarounds in hw-native-sys/pto-isa#117 (text canonicalization + PTODSL_SKIP_VERIFY=1).

Companion issue with full motivation: huawei-csl#137.

Changes

ptodsl/api/pto_general.py

  • aic_initialize_pipe / aiv_initialize_pipe: accept gm_slot_tensor= (a !pto.tensor_view<...>) and local_slot_num=. When gm_slot_tensor is provided, the legacy gm_slot_buffer / c2v_consumer_buf / v2c_consumer_buf operands are not required. Falls back to the generic Operation.create path while the installed mlir.dialects.pto._pto_ops_gen binding still predates the new operand shape.
  • talloc_to_aic / talloc_to_aiv: emit the address-based slot allocation ops (return a tensor_view that subsequent tpush/tfree consume).
  • tfree_from_aic(entry, split=...) / tfree_from_aiv(entry, split=...): new entry+split overload that carries the popped tensor_view to the free op. Existing tfree(split, id=...) callers keep working — overload dispatched by argument shape. This is the missing operand that turned address-based TFREE into a pipe-only no-op in ptoas-generated C++; carrying the entry restores real free notifications and unblocks long sequences (S1>=4096) that previously hung on slot exhaustion.
  • bitcast(result_type, src): thin wrapper around _pto.BitcastOp (e.g. for reinterpreting an !pto.ptr<f32> region as !pto.ptr<f16> when one GM FIFO carries multiple dtypes).
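The "overload dispatched by argument shape" behaviour can be illustrated with a small self-contained sketch. It is not the code from this PR: `TensorView` is a hypothetical stand-in for an MLIR value of type `!pto.tensor_view<...>`, the returned dictionaries stand in for the emitted ops, and dispatching on the type of the first argument is an assumption about how the two forms could be told apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorView:
    """Hypothetical stand-in for a value of type !pto.tensor_view<...>."""
    name: str


def tfree_from_aic(entry_or_split, split=None, *, id=0):
    """Toy dispatcher: describes which op form would be emitted."""
    if isinstance(entry_or_split, TensorView):
        # New form: tfree_from_aic(entry, split=...) carries the popped entry.
        if split is None:
            raise TypeError("the entry form requires split=...")
        return {"form": "entry", "entry": entry_or_split.name,
                "split": split, "id": id}
    if split is not None:
        raise TypeError("the legacy form takes split as its only positional argument")
    # Legacy form: tfree_from_aic(split, id=...) has no entry operand.
    return {"form": "legacy", "split": entry_or_split, "id": id}


legacy = tfree_from_aic(1, id=3)                      # split value is a placeholder
address_based = tfree_from_aic(TensorView("%entry"), split=1)
```

Keeping both forms behind one function name means existing call sites need no edits, at the cost of a runtime type check on the first argument.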

ptodsl/api/type_def.py

  • TensorType now accepts shape=[d0, d1, ...] for static-shape views, alongside the existing rank=N dynamic form. Static shape is required by gm_slot_tensor pipe init because the lowered C++ runtime templates on a concrete pto::Shape<...>.
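A minimal sketch of a type constructor that accepts either a static `shape=` or a dynamic `rank=` is shown below. `TensorTypeSketch` and its `make`/`render` methods are hypothetical names, not the `TensorType` in `ptodsl/api/type_def.py`. The static rendering follows the `!pto.tensor_view<128x256xf32>` form quoted under Validation; using `?` for dynamic dimensions is an assumption borrowed from common MLIR notation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TensorTypeSketch:
    dtype: str
    dims: Tuple[Optional[int], ...]  # None marks a dynamic dimension

    @classmethod
    def make(cls, dtype, rank=None, shape=None):
        # Exactly one of rank= (dynamic) or shape= (static) must be given.
        if (rank is None) == (shape is None):
            raise ValueError("pass exactly one of rank= or shape=")
        if shape is not None:
            dims = tuple(int(d) for d in shape)
            if not dims or any(d <= 0 for d in dims):
                raise ValueError("shape must list positive extents")
            return cls(dtype, dims)
        if rank < 1:
            raise ValueError("rank must be at least 1")
        return cls(dtype, (None,) * rank)

    @property
    def is_static(self):
        return all(d is not None for d in self.dims)

    def render(self):
        dims = "x".join("?" if d is None else str(d) for d in self.dims)
        return f"!pto.tensor_view<{dims}x{self.dtype}>"


static_view = TensorTypeSketch.make("f32", shape=[128, 256])
dynamic_view = TensorTypeSketch.make("f32", rank=2)
```

Only the static form carries concrete extents, which is what a C++ template such as `pto::Shape<...>` needs at compile time.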

ptodsl/compiler/ir.py

  • New PTODSL_SKIP_VERIFY env knob that suppresses the post-build ir_module.operation.verify(). Transitional — the proper long-term fix is to update the dialect ODS / Python binding so the new gm_slot_tensor init shape verifies cleanly, after which this env knob can be removed. Included so kernels can be authored against the new API without waiting for the binding refresh.
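The gating logic for such an environment knob could look roughly like the sketch below. `FakeModule`, `should_skip_verify`, and `finalize_module` are hypothetical names for illustration, and treating only the value `1` as "skip" is an assumption based on the `PTODSL_SKIP_VERIFY=1` usage mentioned in the Summary.

```python
import os


class FakeModule:
    """Hypothetical stand-in for an IR module with a verify() method."""

    def __init__(self, valid):
        self.valid = valid

    def verify(self):
        if not self.valid:
            raise RuntimeError("IR verification failed")
        return True


def should_skip_verify(environ=None):
    # Assumption: only the exact value "1" disables verification.
    env = os.environ if environ is None else environ
    return env.get("PTODSL_SKIP_VERIFY", "") == "1"


def finalize_module(module, environ=None):
    # Verification stays on by default; the env knob is an explicit opt-out.
    if not should_skip_verify(environ):
        module.verify()
    return module


unverifiable = FakeModule(valid=False)
skipped = finalize_module(unverifiable, {"PTODSL_SKIP_VERIFY": "1"})
```

Passing the environment as a parameter keeps the check testable without mutating `os.environ`.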

ptodsl/api/pto.py

  • Re-export bitcast, talloc_to_aic, talloc_to_aiv.

Backwards compatibility

  • Legacy pto.aic_initialize_pipe(..., gm_slot_buffer=..., c2v_consumer_buf=..., v2c_consumer_buf=...) keeps working.
  • Legacy pto.tfree_from_aic(split, id=...) keeps working — the new tfree_from_aic(entry, split, id=...) form is selected only when split is passed positionally, so the single-arg split=... shape still hits the legacy _pto.TFreeFromAicOp path.
  • Legacy pto.TensorType(rank=N, dtype=...) keeps working.

Validation

Used end-to-end by kernels/python/flash_atten/kernels/fa_builder.py in hw-native-sys/pto-isa#117:

  • IR: pto.tfree_from_aic(%entry : !pto.tensor_view<...>) {id, split}, pto.aic_initialize_pipe {...}(gm_slot_tensor = %qk_slot : !pto.tensor_view<128x256xf32>), pto.talloc_to_aic{...} -> !pto.tensor_view<...>.
  • ptoas lowering: address-based TPipe<..., 8, 8, false> (LocalSlotNum=8 caveat is on the ptoas side, tracked separately) and TFREE<Pipe, GlobalTensor, ...>(pipe, tensor).
  • Runs end-to-end at S1=8192 / 32 tiles on 910B2 (the case that hung with pipe-only TFREE):
    s1=8192  tiles=32  fa=21.90us  err=1.48e-05
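Why a free that does not carry the entry eventually stalls can be shown with a toy slot pool. This is a simplified model, not the PTO runtime: `SlotFifo` and `run` are hypothetical, the producer and consumer are interleaved in one loop, and the point at which the toy stalls is not meant to reproduce the actual hang threshold on hardware.

```python
from collections import deque


class SlotFifo:
    """Toy fixed-capacity slot pool (hypothetical, single-threaded)."""

    def __init__(self, num_slots):
        self.free_slots = deque(range(num_slots))
        self.in_flight = deque()

    def talloc(self):
        if not self.free_slots:
            return None  # a real producer would block here
        slot = self.free_slots.popleft()
        self.in_flight.append(slot)
        return slot

    def tpop(self):
        return self.in_flight.popleft()

    def tfree(self, entry=None):
        # Without an entry there is no slot to hand back: the free is a no-op.
        if entry is not None:
            self.free_slots.append(entry)


def run(num_tiles, num_slots, carry_entry):
    """Return how many tiles were processed before the producer stalled."""
    fifo = SlotFifo(num_slots)
    for tile in range(num_tiles):
        if fifo.talloc() is None:
            return tile
        entry = fifo.tpop()
        fifo.tfree(entry if carry_entry else None)
    return num_tiles


stalled_after = run(num_tiles=32, num_slots=8, carry_entry=False)
completed = run(num_tiles=32, num_slots=8, carry_entry=True)
```

In this model the no-op free exhausts the pool once every slot has been allocated, while the entry-carrying free recycles slots and lets all tiles complete.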
    

Related

Test plan

  • python3 -m py_compile ptodsl/api/pto_general.py ptodsl/api/pto.py ptodsl/api/type_def.py ptodsl/compiler/ir.py
  • Existing pto-dsl examples (legacy API path) unchanged.
  • Downstream kernel kernels/python/flash_atten builds end-to-end through ptoas + bisheng and runs on 910B2.

Extend the Python frontend so kernels can express the address-based pipe
shape that ptoas and pto-isa already accept (PR #606 / hw-native-sys/pto-isa
"update TALLOC/TPUSH/TPOP/TFREE to support push or pop GlobalTensor"):

API additions (ptodsl/api/pto_general.py, ptodsl/api/pto.py):
- aic_initialize_pipe / aiv_initialize_pipe: accept gm_slot_tensor
  (!pto.tensor_view<...>) instead of the legacy gm_slot_buffer +
  c2v/v2c_consumer_buf triplet, plus a local_slot_num attribute that
  mirrors the C++ TPipe template arg. The c2v_consumer_buf and
  v2c_consumer_buf operands become optional so kernels using the
  address-based form do not need to supply them. Falls back to the
  generic Operation.create path while the installed
  mlir.dialects.pto._pto_ops_gen binding still predates the new operand
  shape.
- talloc_to_aic / talloc_to_aiv: emit the address-based slot allocation
  ops (each returns a tensor_view that subsequent tpush/tfree consume).
- tfree_from_aic(entry, split=...) / tfree_from_aiv(entry, split=...):
  new entry+split overload that carries the popped tensor_view to the
  free op. Existing tfree(split, id=...) callers keep working unchanged.
  This is the missing operand that turned address-based TFREE into a
  pipe-only no-op in ptoas-generated C++; carrying the entry restores
  real free notifications and unblocks long sequences (S1>=4096) that
  previously hung on slot exhaustion.
- bitcast: thin wrapper around _pto.BitcastOp (e.g. for cross-cast
  reinterpreting a !pto.ptr<f32> region as !pto.ptr<f16>).

Type system (ptodsl/api/type_def.py):
- TensorType now accepts shape=[d0, d1, ...] for static-shape views,
  alongside the existing rank=N dynamic form. Static shape is required
  by gm_slot_tensor pipe init because the lowered C++ runtime templates
  on a concrete pto::Shape<...>.

Compiler (ptodsl/compiler/ir.py):
- New PTODSL_SKIP_VERIFY env knob that suppresses the post-build
  ir_module.operation.verify(). This is a transitional escape hatch
  while the installed mlir-dialect verifier still rejects the
  address-based gm_slot_tensor init shape; it is meant to be removed
  once the dialect verifier/binding catches up to ptoas.

Backwards compatibility:
- All legacy callers (gm_slot_buffer + c2v/v2c, split-only tfree,
  rank-only TensorType) keep working without modification. The new API
  surface is purely additive.

Motivating downstream:
- hw-native-sys/pto-isa#117 (PTO-DSL Flash Attention performance kernel)
  consumes this API to align with the manual fa_performance_kernel.cpp.
@chenshengxin2026
Author

Closing this PR because its motivating FA use case is now covered by huawei-csl/pto-dsl mainline through the handle-based L2G2L tensor-entry path.

The original motivation was to express the manual Flash Attention layout where the FIFO/logical slot uses TILE_S1=256 while cube compute remains on CUBE_S1=128 subtiles. Current mainline implements that through initialize_l2g2l_pipe + declare_global + talloc/tpop_into/tpush/tfree(entry=...), and examples/aot/flash_attention/140tflops/fa_dsl_builder.py demonstrates the full pattern.
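The relationship between the two tile sizes can be sketched as plain index arithmetic. `slot_subtiles` is a hypothetical helper, not part of pto-dsl; it only assumes that each `TILE_S1`-wide FIFO slot is consumed as consecutive `CUBE_S1`-wide subtiles along S1.

```python
TILE_S1 = 256  # width of one FIFO/logical slot along S1
CUBE_S1 = 128  # width of one cube-compute subtile along S1


def slot_subtiles(s1, tile_s1=TILE_S1, cube_s1=CUBE_S1):
    """Return, per slot, the [start, stop) ranges of its cube subtiles."""
    if s1 % tile_s1 or tile_s1 % cube_s1:
        raise ValueError("s1 must divide into slots, and slots into subtiles")
    plan = []
    for slot in range(s1 // tile_s1):
        base = slot * tile_s1
        plan.append([(base + off, base + off + cube_s1)
                     for off in range(0, tile_s1, cube_s1)])
    return plan


plan = slot_subtiles(8192)
```

With these sizes, an S1 of 8192 yields 32 slots of two subtiles each, which is consistent with the `s1=8192  tiles=32` run reported in the PR description.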

That means address-based FIFO remains the right abstraction, but this PR's legacy aic/aiv_initialize_pipe(gm_slot_tensor=...) + talloc_to_* + tfree_from_*(entry) surface is no longer required to unblock the FA TILE_S1=256 / CUBE_S1=128 design. Any remaining legacy-id-based pipe or local_slot_num/local_addr alignment work can be tracked separately with a narrower scope.
