[Bug] flash_atten-v2 (PR #117) emits TPipe<...,SlotNum=8,LocalSlotNum=8,...> for gm_slot_tensor pipe init, diverging from manual FA effective LocalSlotNum=2 and causing long-sequence timeout #118

@chenshengxin2026

Component

PTO Dialect / ODS (include/PTO/IR) and lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp.

Description

The PTO-DSL FlashAttention v2 example in PR #117 (kernels/python/flash_atten-v2/) is structurally aligned with the manual reference kernels/manual/common/flash_atten/fa_performance_kernel.cpp: both use TILE_S1 = 256, CUBE_S1 = 128, kTileFactor = 2, and an address-based slot model on all three pipes (pto.aic_initialize_pipe(gm_slot_tensor=...) / talloc_to_aiv / tpush_to_aiv / tpop_from_aic / tfree_from_aic).

Single-call correctness (atol = rtol = 1e-3 against fp32 reference, fresh process per length, on A3):

| S1 | NUM_TILES | status | max_err |
|---|---|---|---|
| 1024 | 4 | PASSED | 4.43e-05 |
| 2048 | 8 | PASSED | 2.72e-05 |
| 4096 | 16 | aicore timeout | |
| 8192 | 32 | aicore timeout | |

The manual C++ reference at the same case_float_H_128_S0_128_S1_8192 shape runs to completion. The DSL-generated kernel differs from the manual reference in the TPipe template parameters used for the three cross-core FIFO pipes:

| Source | QK pipe | P pipe | PV pipe |
|---|---|---|---|
| Manual fa_performance_kernel.cpp:790,795,799 | TPipe<..., SlotNum=8, LocalSlotNum=2, IsNoSplit=false, EN_UNIT_FLAG=true> | TPipe<..., SlotNum=8>; effective LocalSlotNum=2 via C++ default | TPipe<..., SlotNum=8, LocalSlotNum=2, IsNoSplit=false, EN_UNIT_FLAG=true> |
| DSL after ptoas lowering | TPipe<..., SlotNum=8, LocalSlotNum=8, IsNoSplit=false> | TPipe<..., SlotNum=8, LocalSlotNum=8, IsNoSplit=false> | TPipe<..., SlotNum=8, LocalSlotNum=8, IsNoSplit=false> |

Per include/pto/npu/a2a3/TPush.hpp:28, the C++ default for LocalSlotNum is 2. Therefore the manual P pipe, although it does not spell out the fifth template argument, also has effective LocalSlotNum=2.
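To make the default-argument point concrete, here is a minimal sketch. ToyPipe is a hypothetical stand-in, not the real TPipe from TPush.hpp: only the parameter order (as seen in the emitted code) and the default of 2 are taken from this issue, and EN_UNIT_FLAG is left out for brevity.

```cpp
#include <cstdint>

// Hypothetical stand-in for TPipe. The real template lives in
// include/pto/npu/a2a3/TPush.hpp and carries more parameters and logic.
enum class Direction { DIR_C2V, DIR_V2C };

template <int FlagID, Direction Dir, uint32_t SlotSize, uint32_t SlotNum,
          uint32_t LocalSlotNum = 2, bool IsNoSplit = false>
struct ToyPipe {
  static constexpr uint32_t kSlotNum = SlotNum;
  static constexpr uint32_t kLocalSlotNum = LocalSlotNum;
};

// Manual P pipe style: the fifth argument is omitted, so the default applies.
using ManualP = ToyPipe<2, Direction::DIR_C2V, 65536, 8>;

// DSL-generated style: the fifth argument is spelled out and equals SlotNum.
using DslP = ToyPipe<2, Direction::DIR_C2V, 65536, 8, 8, false>;

static_assert(ManualP::kLocalSlotNum == 2, "omitted argument picks up the default");
static_assert(DslP::kLocalSlotNum == 8, "explicit argument overrides the default");
```

Both aliases share SlotNum=8; only the local slot count differs, which is the mismatch this issue is about.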

Note: QK/PV also differ in EN_UNIT_FLAG (true in the manual reference, default false in DSL-generated code). The experiment below specifically isolates LocalSlotNum by only rewriting 8 -> 2; however, this issue should not claim that LocalSlotNum is the only template-level difference.

DSL gets LocalSlotNum=8 because three places in ptoas conspire to drop the manual/default LocalSlotNum=2 behavior:

Defect A — verifier rejects local_slot_num on globaltensor pipe init

lib/PTO/IR/PTO.cpp (HEAD eeeb1f4, lines 10680–10682):

if (op.getLocalSlotNumAttr())
  return op.emitOpError(
      "globaltensor pipe init does not use 'local_slot_num'");

The DSL has no legal way to override LocalSlotNum on the address-based / gm_slot_tensor form added in PTOAS PR #606. PR #569 (legacy local_slot_num support) only covers gm_slot_buffer.

Defect B — lowering hard-codes empty localSlotNumAttr for the globaltensor branch

lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp (HEAD eeeb1f4, lines 123–134):

if (initOp.getGmSlotTensor()) {
  ...
  auto pipe = rewriter.create<InitializeL2G2LPipeOp>(
      loc, pipeTy, dirAttr, slotSizeAttr, slotNumAttr,
      IntegerAttr{},     // ← localSlotNumAttr
      IntegerAttr{},     // ← flagBaseAttr
      noSplitAttr, initOp.getGmSlotTensor(), Value{}, Value{});
  ...
}

Even if Defect A were lifted, this branch would still drop the user attribute. The non-globaltensor branch (lines 152–156) at least passes getLocalSlotNumAttr() through.

Defect C — EmitC fallback is getSlotNum() (= 8), not the C++ template default 2

lib/PTO/Transforms/PTOToEmitC.cpp (HEAD eeeb1f4, lines 628–630):

int32_t localSlotNum = initOp.getLocalSlotNumAttr()
                           ? initOp.getLocalSlotNumAttr().getInt()
                           : initOp.getSlotNum();   // ← =8 in this kernel

When the attr is absent, EmitC writes LocalSlotNum=SlotNum explicitly. This does not match the C++ API default (LocalSlotNum=2) that the manual reference relies on.
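The difference between the two fallback policies can be sketched in isolation (C++17). Both function names are hypothetical and nothing here is ptoas code; the second function is only one possible alternative, assuming the intent is to mirror the C++ template default.

```cpp
#include <cstdint>
#include <optional>

// Behaviour described above: an absent attribute falls back to SlotNum.
constexpr int32_t currentFallback(std::optional<int32_t> localSlotNumAttr,
                                  int32_t slotNum) {
  return localSlotNumAttr ? *localSlotNumAttr : slotNum;
}

// Assumed alternative: an absent attribute falls back to the C++ default of 2.
constexpr int32_t cxxDefaultFallback(std::optional<int32_t> localSlotNumAttr,
                                     int32_t /*slotNum*/) {
  return localSlotNumAttr.value_or(2);
}

// With SlotNum=8 and no attribute, the two policies disagree.
static_assert(currentFallback(std::nullopt, 8) == 8, "SlotNum leaks into LocalSlotNum");
static_assert(cxxDefaultFallback(std::nullopt, 8) == 2, "matches the template default");
```

When the attribute is present, both policies return it unchanged, so the divergence only shows up on the globaltensor path where the attribute can never be set.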

What the source code proves, and the likely timeout mechanism

The source-level mismatch is clear:

  1. A2/A3 TPipe defaults LocalSlotNum to 2.
  2. The manual FA reference uses effective LocalSlotNum=2 on QK/P/PV.
  3. The gm_slot_tensor frontend form cannot legally carry local_slot_num.
  4. The globaltensor lowering branch drops/omits localSlotNumAttr.
  5. EmitC falls back to getSlotNum() when the attr is absent, so SlotNum=8 becomes LocalSlotNum=8.

What can be directly seen in include/pto/npu/a2a3/TPush.hpp is that LocalSlotNum is used through RingFIFO<SlotSize, SlotNum, LocalSlotNum> and affects the local consumer-buffer address rotation:

fifo.C2V_CONSUMER_BUF +
    (tileIndex % RingFiFo::LOCAL_SLOT_NUM) * ConsM * ConsN * sizeof(T);

and similarly for V2C_CONSUMER_BUF.

Therefore the most conservative source-backed statement is:

  • Manual FA rotates consumer local buffers with period 2.
  • DSL-generated FA rotates consumer local buffers with period 8.
  • Both use the same GM ring depth (SlotNum=8).
  • For long sequences, GM slots are reused from tile index 8 onward, so the generated kernel exercises ring reuse with a different local-buffer rotation policy than the manual reference.
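The two rotation periods can be compared with a small sketch. The local-slot formula follows the tileIndex % LOCAL_SLOT_NUM expression quoted from TPush.hpp; the GM-slot formula (tileIndex % SlotNum) and all names below are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

struct SlotPair {
  uint32_t gmSlot;     // assumed: tileIndex % SlotNum
  uint32_t localSlot;  // from TPush.hpp: tileIndex % LOCAL_SLOT_NUM
};

constexpr SlotPair slotsFor(uint32_t tileIndex, uint32_t slotNum,
                            uint32_t localSlotNum) {
  return SlotPair{tileIndex % slotNum, tileIndex % localSlotNum};
}

// Byte offset into the consumer buffer, where tileBytes stands in for
// ConsM * ConsN * sizeof(T).
constexpr std::size_t consumerOffsetBytes(uint32_t tileIndex,
                                          uint32_t localSlotNum,
                                          std::size_t tileBytes) {
  return static_cast<std::size_t>(tileIndex % localSlotNum) * tileBytes;
}
```

For tile index 3, the period-2 rotation lands on local slot 1 while the period-8 rotation lands on local slot 3. At tile index 8 the GM slot and the period-8 local slot both wrap back to 0.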

This is a plausible cause of the observed AICore timeout: with LocalSlotNum=8, the consumer-side local buffer lifetime and the FIFO free/ready synchronization no longer match the manual kernel's intended two-slot ping-pong schedule. When the 8-slot GM ring is reused, stale data, premature reuse, or unmatched producer/consumer progress can lead to a wait that never observes the expected signal.

The original explanation involving “8 × 3 = 24 event identities exceeding an 8-event pool” is a possible hypothesis, but it is not directly proven by TPush.hpp: the visible TPipe implementation uses fixed FlagID / FlagID+1-style FFTS messages, while LocalSlotNum is directly visible in local buffer address rotation. To prove the event-ID explanation, we would need to compare the IR/C++ emitted by --enable-insert-sync and show that event lifetimes or assigned event IDs become conflicting only in the LocalSlotNum=8 version.

Observed behavior is still consistent with the LocalSlotNum mismatch:

  • NUM_TILES <= 8: the GM ring has not been reused beyond its 8 slots, so the mismatch is less likely to surface.
  • NUM_TILES >= 16: every slot of the 8-slot GM ring has been reused at least once, and the generated LocalSlotNum=8 local-buffer rotation diverges substantially from the manual two-slot ping-pong pattern.
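The split between passing and failing lengths lines up with how often the ring wraps. A minimal sketch, assuming each tile occupies exactly one GM slot per pipe (ringWraps is a hypothetical helper, not part of any PTO header):

```cpp
#include <cstdint>

// Number of times the GM ring has wrapped by the time the last tile is
// pushed, assuming one slot per tile.
constexpr uint32_t ringWraps(uint32_t numTiles, uint32_t slotNum) {
  return numTiles == 0 ? 0 : (numTiles - 1) / slotNum;
}

static_assert(ringWraps(4, 8) == 0, "S1=1024: no reuse");
static_assert(ringWraps(8, 8) == 0, "S1=2048: no reuse");
static_assert(ringWraps(16, 8) == 1, "S1=4096: one full wrap");
static_assert(ringWraps(32, 8) == 3, "S1=8192: three full wraps");
```

Under that assumption the two PASSED cases never reuse a GM slot, and both timeout cases do.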

Reproduction

Using PR #117 commit 35b35de4 (kernels/python/flash_atten-v2/) on A3, with ptoas --pto-arch=a3 --enable-insert-sync and a bisheng-built kernel:

cd kernels/python/flash_atten-v2
bash run_fa.sh --tiles 4 --lengths 1024  # PASSED, max_err 4.43e-05
bash run_fa.sh --tiles 8 --lengths 2048  # PASSED, max_err 2.72e-05
bash run_fa.sh --tiles 32 --lengths 8192 # builds, runtime aicore timeout

Inspecting the emitted build_artifacts/fa_32.cpp:

auto v40 = TPipe<0, Direction::DIR_C2V, 131072, 8, 8, false>(v39, v18, v18);
auto v43 = TPipe<2, Direction::DIR_C2V, 65536, 8, 8, false>(v42, v18, v18);
auto v46 = TPipe<4, Direction::DIR_V2C, 65536, 8, 8, false>(v45, v18, v18);

All three pipes are emitted with LocalSlotNum=8. Manually rewriting only the generated LocalSlotNum from 8 to 2 and rebuilding allows S1=8192 to run to completion in this setup. This strongly implicates the LocalSlotNum mismatch. QK/PV still differ from the manual reference in EN_UNIT_FLAG, so a full semantic parity fix should treat local_slot_num as the primary bug and track EN_UNIT_FLAG separately if needed.
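The manual rewrite can be scripted. The sketch below is not the tool used for the experiment above; rewriteLocalSlotNum is a hypothetical helper that assumes the emitted declarations always have the shape TPipe<int, Direction::X, int, int, int, ...> shown in fa_32.cpp, and it replaces the fifth template argument whatever its current value is.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Replace the fifth TPipe template argument (LocalSlotNum) with newValue.
std::string rewriteLocalSlotNum(const std::string &src, unsigned newValue) {
  // Group 1: everything up to the fifth argument. Group 2: the argument itself.
  static const std::regex kPipe(
      R"((TPipe<\s*\d+\s*,\s*Direction::\w+\s*,\s*\d+\s*,\s*\d+\s*,\s*)(\d+))");
  std::string out;
  auto last = src.cbegin();
  for (std::sregex_iterator it(src.cbegin(), src.cend(), kPipe), end;
       it != end; ++it) {
    const std::smatch &m = *it;
    out.append(last, m[0].first);         // text before this TPipe
    out.append(m[1].first, m[1].second);  // unchanged prefix of the TPipe
    out += std::to_string(newValue);      // new LocalSlotNum
    last = m[0].second;
  }
  out.append(last, src.cend());           // remaining text
  return out;
}

// Quick self-check on one of the emitted lines quoted above.
inline void checkExample() {
  const std::string in =
      "auto v40 = TPipe<0, Direction::DIR_C2V, 131072, 8, 8, false>(v39, v18, v18);";
  const std::string want =
      "auto v40 = TPipe<0, Direction::DIR_C2V, 131072, 8, 2, false>(v39, v18, v18);";
  assert(rewriteLocalSlotNum(in, 2) == want);
}
```

Reading fa_32.cpp into a string, passing it through rewriteLocalSlotNum(text, 2), and writing it back would reproduce the 8 -> 2 edit on all three pipes without touching SlotNum.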

Related issues / PRs

  • PTOAS PR #606 ("fix global tensor half-slot split pipes", merged 2026-04-29) — introduced the gm_slot_tensor form; the verifier rejection in Defect A landed in this PR.
  • PTOAS PR #569 ("feat: support local_slot_num on legacy pipe init", merged 2026-04-25) — added the attribute on gm_slot_buffer form only.
  • pto-isa #629 ("FA lit regression test with S1_TILE=512 crashes at runtime") — same family, OPEN.
  • pto-isa #621 ("Expose FIFO consumer sync period (cons_sync_period)") — kFaCvFifoConsSyncPeriod=4 is the other manual knob currently missing on globaltensor pipe init; would be natural to expose alongside (1) above.
  • pto-isa #622 ("QK_PRELOAD=4 deadlock") — closed without code fix; the same LocalSlotNum chain likely contributed.

Additional context

  • ptoas binary in use: /usr/local/bin/ptoas-bin/bin/ptoas (mtime 2026-04-30 15:01, includes PTOAS PR #606's pto.talloc_to_aiv/pto.talloc_to_aic).
  • mlir_combined Python bindings rebuilt locally from hw-native-sys/PTOAS@c3a2395 to expose the talloc op classes.
  • v2 kernel reproduces the manual's row_slice loop (Vec_S0=32 × kTileFactor=2) so VEC UB stays under 192 KiB at S1_TILE=256; that part is independent of this issue.
