Component
PTO Dialect / ODS (include/PTO/IR) and lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp.
Description
The PTO-DSL FlashAttention v2 example in PR #117 kernels/python/flash_atten-v2/ is structurally aligned with the manual reference kernels/manual/common/flash_atten/fa_performance_kernel.cpp: TILE_S1 = 256, CUBE_S1 = 128, kTileFactor = 2, address-based slot model on all three pipes (pto.aic_initialize_pipe(gm_slot_tensor=...) / talloc_to_aiv / tpush_to_aiv / tpop_from_aic / tfree_from_aic).
Single-call correctness (atol = rtol = 1e-3 against fp32 reference, fresh process per length, on A3):
| S1 | NUM_TILES | status | max_err |
|---|---|---|---|
| 1024 | 4 | PASSED | 4.43e-05 |
| 2048 | 8 | PASSED | 2.72e-05 |
| 4096 | 16 | aicore timeout | |
| 8192 | 32 | aicore timeout | |
The manual C++ reference at the same case_float_H_128_S0_128_S1_8192 shape runs to completion. The DSL-generated kernel differs from the manual reference in the TPipe template parameters used for the three cross-core FIFO pipes:
| Source | QK pipe | P pipe | PV pipe |
|---|---|---|---|
| Manual fa_performance_kernel.cpp:790,795,799 | TPipe<..., SlotNum=8, LocalSlotNum=2, IsNoSplit=false, EN_UNIT_FLAG=true> | TPipe<..., SlotNum=8>; effective LocalSlotNum=2 via C++ default | TPipe<..., SlotNum=8, LocalSlotNum=2, IsNoSplit=false, EN_UNIT_FLAG=true> |
| DSL after ptoas lowering | TPipe<..., SlotNum=8, LocalSlotNum=8, IsNoSplit=false> | TPipe<..., SlotNum=8, LocalSlotNum=8, IsNoSplit=false> | TPipe<..., SlotNum=8, LocalSlotNum=8, IsNoSplit=false> |
Per include/pto/npu/a2a3/TPush.hpp:28, the C++ default for LocalSlotNum is 2. Therefore the manual P pipe, although it does not spell out the fifth template argument, also has effective LocalSlotNum=2.
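The effect of omitting the fifth template argument can be sketched with a stand-in template. `MockPipe` and its members are hypothetical and are not the real `TPipe`; only the parameter order and the default of 2 are taken from the text above, and the numeric arguments are illustrative.

```cpp
// Hypothetical stand-in for the pipe template. Only the parameter order
// (FlagID, Direction, SlotSize, SlotNum, LocalSlotNum) follows the issue text.
enum class Direction { DIR_C2V, DIR_V2C };

template <int FlagID, Direction Dir, int SlotSize, int SlotNum,
          int LocalSlotNum = 2>  // default of 2, as cited from TPush.hpp:28
struct MockPipe {
    static constexpr int kSlotNum = SlotNum;
    static constexpr int kLocalSlotNum = LocalSlotNum;
};

// Manual P-pipe style: fifth argument omitted, so the default applies.
using ManualStyle = MockPipe<2, Direction::DIR_C2V, 65536, 8>;
// DSL-emitted style: fifth argument written out explicitly as 8.
using DslStyle = MockPipe<2, Direction::DIR_C2V, 65536, 8, 8>;

static_assert(ManualStyle::kLocalSlotNum == 2, "default applies when omitted");
static_assert(DslStyle::kLocalSlotNum == 8, "explicit argument overrides default");
```

Both aliases share the same `SlotNum`, so the only difference this sketch exposes is the local slot count.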
Note: QK/PV also differ in EN_UNIT_FLAG (true in the manual reference, default false in DSL-generated code). The experiment below specifically isolates LocalSlotNum by only rewriting 8 -> 2; however, this issue should not claim that LocalSlotNum is the only template-level difference.
DSL gets LocalSlotNum=8 because three places in ptoas conspire to drop the manual/default LocalSlotNum=2 behavior:
Defect A — verifier rejects local_slot_num on globaltensor pipe init
lib/PTO/IR/PTO.cpp (HEAD eeeb1f4, lines 10680–10682):
if (op.getLocalSlotNumAttr())
  return op.emitOpError(
      "globaltensor pipe init does not use 'local_slot_num'");
The DSL has no legal way to override LocalSlotNum on the address-based / gm_slot_tensor form added in PTOAS PR #606. PR #569 (legacy local_slot_num support) only covers gm_slot_buffer.
Defect B — lowering hard-codes empty localSlotNumAttr for the globaltensor branch
lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp (HEAD eeeb1f4, lines 123–134):
if (initOp.getGmSlotTensor()) {
  ...
  auto pipe = rewriter.create<InitializeL2G2LPipeOp>(
      loc, pipeTy, dirAttr, slotSizeAttr, slotNumAttr,
      IntegerAttr{}, // ← localSlotNumAttr
      IntegerAttr{}, // ← flagBaseAttr
      noSplitAttr, initOp.getGmSlotTensor(), Value{}, Value{});
  ...
}
Even if Defect A were lifted, this branch would still drop the user attribute. The non-globaltensor branch (lines 152–156) at least passes getLocalSlotNumAttr() through.
Defect C — EmitC fallback is getSlotNum() (= 8), not the C++ template default 2
lib/PTO/Transforms/PTOToEmitC.cpp (HEAD eeeb1f4, lines 628–630):
int32_t localSlotNum = initOp.getLocalSlotNumAttr()
                           ? initOp.getLocalSlotNumAttr().getInt()
                           : initOp.getSlotNum(); // ← =8 in this kernel
When the attr is absent, EmitC writes LocalSlotNum=SlotNum explicitly. This does not match the C++ API default (LocalSlotNum=2) that the manual reference relies on.
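The gap between the two fallback policies can be modelled in a few lines. This is a sketch only: the function names are hypothetical and none of this is code from PTOToEmitC.cpp. It contrasts "absent attribute means SlotNum" with "absent attribute means the C++ template default".

```cpp
#include <optional>

// Hypothetical model of the two policies; not code from the ptoas sources.
constexpr int kTemplateDefaultLocalSlotNum = 2;  // per TPush.hpp:28, as cited in this issue

// Behaviour described in Defect C: an absent attribute falls back to SlotNum.
constexpr int emitcLocalSlotNum(std::optional<int> attr, int slotNum) {
    return attr ? *attr : slotNum;
}

// What the C++ API yields when the fifth template argument is left out.
constexpr int templateLocalSlotNum(std::optional<int> attr) {
    return attr ? *attr : kTemplateDefaultLocalSlotNum;
}

static_assert(emitcLocalSlotNum(std::nullopt, 8) == 8, "EmitC fallback follows SlotNum");
static_assert(templateLocalSlotNum(std::nullopt) == 2, "template default is 2");
```

When the attribute is present both policies agree, which is why the mismatch only appears on paths where the attribute is missing or dropped.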
What the source code proves, and the likely timeout mechanism
The source-level mismatch is clear:
- A2/A3 TPipe defaults LocalSlotNum to 2.
- The manual FA reference uses effective LocalSlotNum=2 on QK/P/PV.
- The gm_slot_tensor frontend form cannot legally carry local_slot_num.
- The globaltensor lowering branch drops/omits localSlotNumAttr.
- EmitC falls back to getSlotNum() when the attr is absent, so SlotNum=8 becomes LocalSlotNum=8.
What can be directly seen in include/pto/npu/a2a3/TPush.hpp is that LocalSlotNum is used through RingFIFO<SlotSize, SlotNum, LocalSlotNum> and affects the local consumer-buffer address rotation:
fifo.C2V_CONSUMER_BUF +
    (tileIndex % RingFiFo::LOCAL_SLOT_NUM) * ConsM * ConsN * sizeof(T);
and similarly for V2C_CONSUMER_BUF.
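To make the rotation concrete, the quoted address expression can be evaluated in isolation. The helper below is hypothetical and the tile dimensions in the usage note are illustrative; it mirrors only the arithmetic of the expression and says nothing about how the buffers are actually allocated.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the quoted expression
//   (tileIndex % LOCAL_SLOT_NUM) * ConsM * ConsN * sizeof(T)
// It returns the byte offset from the consumer buffer base address.
template <typename T>
constexpr std::size_t consumerBufOffset(std::size_t tileIndex,
                                        std::size_t localSlotNum,
                                        std::size_t consM,
                                        std::size_t consN) {
    return (tileIndex % localSlotNum) * consM * consN * sizeof(T);
}
```

For a 4 x 4 tile of 4-byte elements, `consumerBufOffset<float>(i, 2, 4, 4)` alternates between 0 and 64 as `i` increases, whereas with `localSlotNum = 8` the offset steps through eight distinct tile-sized regions before wrapping back to 0.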
Therefore the most conservative source-backed statement is:
- Manual FA rotates consumer local buffers with period 2.
- DSL-generated FA rotates consumer local buffers with period 8.
- Both use the same GM ring depth (SlotNum=8).
- For long sequences, GM slots are reused from tile index 8 onward, so the generated kernel exercises ring reuse with a different local-buffer rotation policy than the manual reference.
This is a plausible cause of the observed AICore timeout: with LocalSlotNum=8, the consumer-side local buffer lifetime and the FIFO free/ready synchronization no longer match the manual kernel's intended two-slot ping-pong schedule. When the 8-slot GM ring is reused, stale data, premature reuse, or unmatched producer/consumer progress can lead to a wait that never observes the expected signal.
The original explanation involving “8 × 3 = 24 event identities exceeding an 8-event pool” is a possible hypothesis, but it is not directly proven by TPush.hpp: the visible TPipe implementation uses fixed FlagID / FlagID+1-style FFTS messages, while LocalSlotNum is directly visible in local buffer address rotation. To prove the event-ID explanation, we would need to compare the IR/C++ emitted by --enable-insert-sync and show that event lifetimes or assigned event IDs become conflicting only in the LocalSlotNum=8 version.
Observed behavior is still consistent with the LocalSlotNum mismatch:
- NUM_TILES <= 8: the GM ring has not been reused beyond its 8 slots, so the mismatch is less likely to surface.
- NUM_TILES >= 16: the 8-slot GM ring has been reused multiple times, and the generated LocalSlotNum=8 local-buffer rotation diverges substantially from the manual two-slot ping-pong pattern.
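The threshold between the passing and failing lengths can be checked with simple arithmetic. This sketch assumes one FIFO tile per S1 tile, so that NUM_TILES = S1 / TILE_S1; the function names are hypothetical.

```cpp
// Assumption: one FIFO tile per S1 tile, i.e. NUM_TILES = S1 / TILE_S1.
constexpr int kTileS1 = 256;  // TILE_S1 from the kernel configuration
constexpr int kSlotNum = 8;   // GM ring depth used by all three pipes

constexpr int numTiles(int s1) { return s1 / kTileS1; }

// Number of pushes that land on a GM slot already used by an earlier tile.
constexpr int reusedPushes(int tiles) {
    return tiles > kSlotNum ? tiles - kSlotNum : 0;
}

static_assert(numTiles(2048) == 8 && reusedPushes(8) == 0, "no ring reuse at S1=2048");
static_assert(numTiles(4096) == 16 && reusedPushes(16) == 8, "ring reuse starts by S1=4096");
```

Under that assumption the two passing lengths never reuse a GM slot, while S1=4096 and S1=8192 reuse slots 8 and 24 times respectively, which lines up with the table of observed results.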
Reproduction
Using PR #117 commit 35b35de4 (kernels/python/flash_atten-v2/) on A3 with ptoas --pto-arch=a3 --enable-insert-sync and a bisheng-built kernel:
cd kernels/python/flash_atten-v2
bash run_fa.sh --tiles 4 --lengths 1024 # PASSED, max_err 4.43e-05
bash run_fa.sh --tiles 8 --lengths 2048 # PASSED, max_err 2.72e-05
bash run_fa.sh --tiles 32 --lengths 8192 # builds, runtime aicore timeout
Inspecting the emitted build_artifacts/fa_32.cpp:
auto v40 = TPipe<0, Direction::DIR_C2V, 131072, 8, 8, false>(v39, v18, v18);
auto v43 = TPipe<2, Direction::DIR_C2V, 65536, 8, 8, false>(v42, v18, v18);
auto v46 = TPipe<4, Direction::DIR_V2C, 65536, 8, 8, false>(v45, v18, v18);
All three pipes are emitted with LocalSlotNum=8. Manually rewriting only the generated LocalSlotNum from 8 to 2 and rebuilding allows S1=8192 to run to completion in this setup. This strongly implicates the LocalSlotNum mismatch. QK/PV still differ from the manual reference in EN_UNIT_FLAG, however, so a full semantic parity fix should treat local_slot_num as the primary bug and track EN_UNIT_FLAG separately if needed.
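The manual 8 -> 2 rewrite can be scripted for repeat experiments. The helper below is a hypothetical diagnostic aid, not part of ptoas. It assumes the emitted constructor has exactly the shape shown in fa_32.cpp above, with LocalSlotNum as the fifth template argument, and it is a workaround for bisecting rather than a fix.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: rewrites the fifth template argument of every
// TPipe<FlagID, Direction::X, SlotSize, SlotNum, LocalSlotNum, ...> in src.
std::string rewriteLocalSlotNum(const std::string& src, int newLocalSlotNum) {
    assert(newLocalSlotNum > 0 && "LocalSlotNum must be positive");
    // Group 1: everything up to the fifth argument; group 2: its old value;
    // group 3: the delimiter that follows it.
    static const std::regex pipeRe(
        R"((TPipe<\s*\d+\s*,\s*Direction::\w+\s*,\s*\d+\s*,\s*\d+\s*,\s*)(\d+)(\s*[,>]))");
    std::string out;
    std::size_t last = 0;
    for (auto it = std::sregex_iterator(src.begin(), src.end(), pipeRe);
         it != std::sregex_iterator(); ++it) {
        const std::smatch& m = *it;
        const auto pos = static_cast<std::size_t>(m.position(0));
        out.append(src, last, pos - last);        // text before the match
        out += m[1].str();                        // unchanged leading arguments
        out += std::to_string(newLocalSlotNum);   // replaced fifth argument
        out += m[3].str();                        // trailing delimiter
        last = pos + static_cast<std::size_t>(m.length(0));
    }
    out.append(src, last, std::string::npos);     // remainder after last match
    return out;
}
```

Calling `rewriteLocalSlotNum(line, 2)` on each of the three emitted lines leaves SlotNum and IsNoSplit untouched and changes only the fifth argument, which keeps the experiment isolated to LocalSlotNum.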
Related issues / PRs
- PTOAS PR #606 ("fix global tensor half-slot split pipes", merged 2026-04-29): introduced the gm_slot_tensor form; the verifier rejection in Defect A landed in this PR.
- PTOAS PR #569 ("feat: support local_slot_num on legacy pipe init", merged 2026-04-25): added the attribute on gm_slot_buffer form only.
- pto-isa #629 ("FA lit regression test with S1_TILE=512 crashes at runtime"): same family, OPEN.
- pto-isa #621 ("Expose FIFO consumer sync period (cons_sync_period)"): kFaCvFifoConsSyncPeriod=4 is the other manual knob currently missing on globaltensor pipe init; would be natural to expose alongside (1) above.
- pto-isa #622 ("QK_PRELOAD=4 deadlock"): closed without code fix; the same LocalSlotNum chain likely contributed.
Additional context
- ptoas binary in use: /usr/local/bin/ptoas-bin/bin/ptoas (mtime 2026-04-30 15:01, includes PTOAS PR #606's pto.talloc_to_aiv/pto.talloc_to_aic).
- mlir_combined Python bindings rebuilt locally from hw-native-sys/PTOAS@c3a2395 to expose the talloc op classes.
- v2 kernel reproduces the manual's row_slice loop (Vec_S0=32 × kTileFactor=2) so VEC UB stays under 192 KiB at S1_TILE=256; that part is independent of this issue.