feat(api): support address-based pipe slot model + tfree(entry, split)#8
chenshengxin2026 wants to merge 1 commit into
Extend the Python frontend so kernels can express the address-based pipe shape that ptoas and pto-isa already accept (PR #606 / hw-native-sys/pto-isa "update TALLOC/TPUSH/TPOP/TFREE to support push or pop GlobalTensor"):

API additions (ptodsl/api/pto_general.py, ptodsl/api/pto.py):
- aic_initialize_pipe / aiv_initialize_pipe: accept gm_slot_tensor (!pto.tensor_view<...>) instead of the legacy gm_slot_buffer + c2v/v2c_consumer_buf triplet, plus a local_slot_num attribute that mirrors the C++ TPipe template arg. The c2v_consumer_buf and v2c_consumer_buf operands become optional so kernels using the address-based form do not need to supply them. Falls back to the generic Operation.create path while the installed mlir.dialects.pto._pto_ops_gen binding still predates the new operand shape.
- talloc_to_aic / talloc_to_aiv: emit the address-based slot allocation ops (returns a tensor_view that subsequent tpush/tfree consume).
- tfree_from_aic(entry, split=...) / tfree_from_aiv(entry, split=...): new entry+split overload that carries the popped tensor_view to the free op. Existing tfree(split, id=...) callers keep working unchanged. This is the missing operand that turned address-based TFREE into a pipe-only no-op in ptoas-generated C++; carrying the entry restores real free notifications and unblocks long sequences (S1>=4096) that previously hung on slot exhaustion.
- bitcast: thin wrapper around _pto.BitcastOp (e.g. for cross-cast reinterpreting a !pto.ptr<f32> region as !pto.ptr<f16>).

Type system (ptodsl/api/type_def.py):
- TensorType now accepts shape=[d0, d1, ...] for static-shape views, alongside the existing rank=N dynamic form. Static shape is required by gm_slot_tensor pipe init because the lowered C++ runtime templates on a concrete pto::Shape<...>.

Compiler (ptodsl/compiler/ir.py):
- New PTODSL_SKIP_VERIFY env knob that suppresses the post-build ir_module.operation.verify(). This is a transitional escape hatch while the installed mlir-dialect verifier still rejects the address-based gm_slot_tensor init shape; it is meant to be removed once the dialect verifier/binding catches up to ptoas.

Backwards compatibility:
- All legacy callers (gm_slot_buffer + c2v/v2c, split-only tfree, rank-only TensorType) keep working without modification. The new API surface is purely additive.

Motivating downstream:
- hw-native-sys/pto-isa#117 (PTO-DSL Flash Attention performance kernel) consumes this API to align with the manual fa_performance_kernel.cpp.
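The PTODSL_SKIP_VERIFY escape hatch described in the commit message could be gated roughly as follows. This is a minimal sketch, not the code in ptodsl/compiler/ir.py: the helper names `verification_enabled` and `finalize_module` are hypothetical, and the set of values treated as "not set" is an assumption (the PR only mentions `PTODSL_SKIP_VERIFY=1`).

```python
import os


def verification_enabled(env=None):
    """Return False when PTODSL_SKIP_VERIFY is set to a truthy value."""
    env = os.environ if env is None else env
    # Assumption: "", "0" and "false" mean "do not skip"; any other value skips.
    value = env.get("PTODSL_SKIP_VERIFY", "").strip().lower()
    return value in ("", "0", "false")


def finalize_module(ir_module, env=None):
    """Hypothetical post-build hook: run the verifier unless the knob is set."""
    if verification_enabled(env):
        # ir_module.operation.verify() is the call named in the commit message.
        ir_module.operation.verify()
    return ir_module
```

Passing `env` explicitly keeps the gate testable without mutating the process environment; in normal use it would read `os.environ`.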
Closing this PR because its motivating FA use case is now covered by huawei-csl/pto-dsl mainline through the handle-based L2G2L tensor-entry path. The original motivation was to express the manual Flash Attention layout where the FIFO/logical slot uses … That means address-based FIFO remains the right abstraction, but this PR's legacy …
Summary
Extend the Python frontend so kernels can express the address-based pipe slot model (gm_slot_tensor init + entry-carrying tfree) that the dialect and ptoas already accept (after the "TALLOC/TPUSH/TPOP/TFREE GlobalTensor" change). All changes are purely additive: every existing call site keeps working unchanged.

This is the long-term API surface intended to replace the local workarounds in hw-native-sys/pto-isa#117 (text canonicalization + PTODSL_SKIP_VERIFY=1).

Companion issue with full motivation: huawei-csl#137.
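To make the "existing call sites keep working" claim concrete for tfree, here is a minimal, self-contained sketch of dispatch by argument shape. It is not the ptodsl implementation: it returns a plain dict instead of emitting an op, and the exact routing rules are an assumption pieced together from the signatures quoted in this PR (`tfree(split, id=...)` for the legacy form, `tfree_from_aic(entry, split=...)` for the new one).

```python
def tfree_from_aic(*args, split=None, id=None):
    """Illustrative stand-in that only reports which form a call selects.

    `id` shadows the builtin on purpose, to match the keyword used in the PR.
    """
    if len(args) == 2 and split is None:
        # New form, both positional: tfree_from_aic(entry, split, id=...)
        entry, split = args
        return {"form": "entry", "entry": entry, "split": split, "id": id}
    if len(args) == 1 and split is not None:
        # New form with keyword split: tfree_from_aic(entry, split=...)
        return {"form": "entry", "entry": args[0], "split": split, "id": id}
    if len(args) == 1:
        # Legacy form: tfree_from_aic(split, id=...)
        return {"form": "legacy", "entry": None, "split": args[0], "id": id}
    if not args and split is not None:
        # Legacy form with keyword split: tfree_from_aic(split=..., id=...)
        return {"form": "legacy", "entry": None, "split": split, "id": id}
    raise TypeError("expected (split), (split=...), (entry, split) or (entry, split=...)")
```

Under these assumed rules a call such as `tfree_from_aic(1, id=0)` stays on the legacy path, while `tfree_from_aic(entry, 1, id=0)` carries the entry through to the free op.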
Changes
ptodsl/api/pto_general.py
- aic_initialize_pipe / aiv_initialize_pipe: accept gm_slot_tensor= (a !pto.tensor_view<...>) and local_slot_num=. When gm_slot_tensor is provided, the legacy gm_slot_buffer / c2v_consumer_buf / v2c_consumer_buf operands are not required. Falls back to the generic Operation.create path while the installed mlir.dialects.pto._pto_ops_gen binding still predates the new operand shape.
- talloc_to_aic / talloc_to_aiv: emit the address-based slot allocation ops (return a tensor_view that subsequent tpush / tfree consume).
- tfree_from_aic(entry, split=...) / tfree_from_aiv(entry, split=...): new entry+split overload that carries the popped tensor_view to the free op. Existing tfree(split, id=...) callers keep working; the overload is dispatched by argument shape. This is the missing operand that turned address-based TFREE into a pipe-only no-op in ptoas-generated C++; carrying the entry restores real free notifications and unblocks long sequences (S1>=4096) that previously hung on slot exhaustion.
- bitcast(result_type, src): thin wrapper around _pto.BitcastOp (e.g. for reinterpreting an !pto.ptr<f32> region as !pto.ptr<f16> when one GM FIFO carries multiple dtypes).

ptodsl/api/type_def.py
- TensorType now accepts shape=[d0, d1, ...] for static-shape views, alongside the existing rank=N dynamic form. Static shape is required by gm_slot_tensor pipe init because the lowered C++ runtime templates on a concrete pto::Shape<...>.

ptodsl/compiler/ir.py
- PTODSL_SKIP_VERIFY env knob that suppresses the post-build ir_module.operation.verify(). Transitional: the proper long-term fix is to update the dialect ODS / Python binding so the new gm_slot_tensor init shape verifies cleanly, after which this env knob can be removed. Included so kernels can be authored against the new API without waiting for the binding refresh.

ptodsl/api/pto.py
- bitcast, talloc_to_aic, talloc_to_aiv.

Backwards compatibility
- pto.aic_initialize_pipe(..., gm_slot_buffer=..., c2v_consumer_buf=..., v2c_consumer_buf=...) keeps working.
- pto.tfree_from_aic(split, id=...) keeps working. The new tfree_from_aic(entry, split, id=...) form is selected only when split is passed positionally, so the single-arg split=... shape still hits the legacy _pto.TFreeFromAicOp path.
- pto.TensorType(rank=N, dtype=...) keeps working.

Validation
Used end-to-end by kernels/python/flash_atten/kernels/fa_builder.py in hw-native-sys/pto-isa#117:
- pto.tfree_from_aic(%entry : !pto.tensor_view<...>) {id, split}, pto.aic_initialize_pipe {...}(gm_slot_tensor = %qk_slot : !pto.tensor_view<128x256xf32>), pto.talloc_to_aic {...} -> !pto.tensor_view<...>.
- TPipe<..., 8, 8, false> (LocalSlotNum=8 caveat is on the ptoas side, tracked separately) and TFREE<Pipe, GlobalTensor, ...>(pipe, tensor).
- S1=8192 / 32 tiles on 910B2 (the case that hung with pipe-only TFREE):

Related
hw-native-sys/PTOAS is filed separately for the gm_slot_tensor + local_slot_num lowering and the verifier message that prevents writing local_slot_num on the IR side.

Test plan
- python3 -m py_compile ptodsl/api/pto_general.py ptodsl/api/pto.py ptodsl/api/type_def.py ptodsl/compiler/ir.py
- kernels/python/flash_atten builds end-to-end through ptoas + bisheng and runs on 910B2.
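
The py_compile step in the test plan can also be driven from Python, which makes it easier to collect every failure in one pass. The sketch below uses only the standard-library `py_compile` module; the helper name `compile_check` is hypothetical and not part of this PR.

```python
import py_compile


def compile_check(paths):
    """Byte-compile each file and return {path: error message} for failures."""
    failures = {}
    for path in paths:
        try:
            # doraise=True turns syntax errors into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as exc:
            failures[path] = exc.msg
    return failures
```

Run from the repository root, it would be called with the four files listed in the test plan (ptodsl/api/pto_general.py, ptodsl/api/pto.py, ptodsl/api/type_def.py, ptodsl/compiler/ir.py); an empty result means all of them compiled. A missing file raises an OSError rather than being recorded as a failure.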