[Bug] flash_atten-v2 (PR #117) emits TPipe<...,SlotNum=8,LocalSlotNum=8,...> for gm_slot_tensor pipe init, diverging from manual FA effective LocalSlotNum=2 and causing long-sequence timeout

### Component

PTO Dialect / ODS (`include/PTO/IR`, `lib/PTO/IR/PTO.cpp`), `lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp`, and `lib/PTO/Transforms/PTOToEmitC.cpp`.

### Description

The PTO-DSL FlashAttention v2 example in [PR #117 `kernels/python/flash_atten-v2/`](https://github.com/hw-native-sys/pto-isa/pull/117) is structurally aligned with the manual reference [`kernels/manual/common/flash_atten/fa_performance_kernel.cpp`](https://github.com/hw-native-sys/pto-isa/blob/main/kernels/manual/common/flash_atten/fa_performance_kernel.cpp): `TILE_S1 = 256`, `CUBE_S1 = 128`, `kTileFactor = 2`, address-based slot model on all three pipes (`pto.aic_initialize_pipe(gm_slot_tensor=...)` / `talloc_to_aiv` / `tpush_to_aiv` / `tpop_from_aic` / `tfree_from_aic`).

Single-call correctness (`atol = rtol = 1e-3` against fp32 reference, fresh process per length, on A3):

| S1   | NUM_TILES | status         | max_err  |
|-----:|----------:|----------------|----------|
| 1024 |         4 | **PASSED**     | 4.43e-05 |
| 2048 |         8 | **PASSED**     | 2.72e-05 |
| 4096 |        16 | aicore timeout |          |
| 8192 |        32 | aicore timeout |          |

The manual C++ reference at the same `case_float_H_128_S0_128_S1_8192` shape runs to completion. The DSL-generated kernel differs from the manual reference in the `TPipe` template parameters used for the three cross-core FIFO pipes:

| Source | QK pipe | P pipe | PV pipe |
|---|---|---|---|
| Manual `fa_performance_kernel.cpp:790,795,799` | `TPipe<..., SlotNum=8, LocalSlotNum=2, IsNoSplit=false, EN_UNIT_FLAG=true>` | `TPipe<..., SlotNum=8>`; effective `LocalSlotNum=2` via C++ default | `TPipe<..., SlotNum=8, LocalSlotNum=2, IsNoSplit=false, EN_UNIT_FLAG=true>` |
| DSL after ptoas lowering | `TPipe<..., SlotNum=8, LocalSlotNum=8, IsNoSplit=false>` | `TPipe<..., SlotNum=8, LocalSlotNum=8, IsNoSplit=false>` | `TPipe<..., SlotNum=8, LocalSlotNum=8, IsNoSplit=false>` |

Per [`include/pto/npu/a2a3/TPush.hpp:28`](https://github.com/hw-native-sys/pto-isa/blob/main/include/pto/npu/a2a3/TPush.hpp#L28), the C++ default for `LocalSlotNum` is **2**. Therefore the manual P pipe, although it does not spell out the fifth template argument, also has effective `LocalSlotNum=2`.
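To make the default-argument point concrete, here is a minimal sketch. `TPipeSketch` and `Dir` are hypothetical stand-ins, not the real `TPipe` from `TPush.hpp`; only the parameter order and the `LocalSlotNum = 2` default are taken from the description above.

```cpp
#include <cstdint>

// Hypothetical stand-in for the real TPipe. Only the parameter order and the
// LocalSlotNum default of 2 mirror what this issue describes.
enum class Dir { C2V, V2C };

template <int FlagID, Dir D, uint32_t SlotSize, uint32_t SlotNum,
          uint32_t LocalSlotNum = 2, bool IsNoSplit = false>
struct TPipeSketch {
  static constexpr uint32_t kSlotNum = SlotNum;
  static constexpr uint32_t kLocalSlotNum = LocalSlotNum;
};

// Manual P pipe style: the fifth argument is omitted, so the default applies.
using ManualP = TPipeSketch<2, Dir::C2V, 65536, 8>;
// DSL-generated style: the fifth argument is spelled out and equals SlotNum.
using DslP = TPipeSketch<2, Dir::C2V, 65536, 8, 8, false>;

static_assert(ManualP::kLocalSlotNum == 2, "default applies when omitted");
static_assert(DslP::kLocalSlotNum == 8, "explicit value overrides the default");
```

Both aliases share `SlotNum=8`; they differ only in the local slot count, which is the divergence this issue is about.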

Note: QK/PV also differ in `EN_UNIT_FLAG` (`true` in the manual reference, default `false` in DSL-generated code). The experiment below specifically isolates `LocalSlotNum` by only rewriting `8 -> 2`; however, this issue should not claim that `LocalSlotNum` is the only template-level difference.

The DSL ends up with `LocalSlotNum=8` because three places in ptoas together prevent the manual/default `LocalSlotNum=2` behavior from being expressed:

#### Defect A — verifier rejects `local_slot_num` on globaltensor pipe init

[`lib/PTO/IR/PTO.cpp` (HEAD `eeeb1f4`, lines 10680–10682)](https://github.com/hw-native-sys/PTOAS/blob/eeeb1f4/lib/PTO/IR/PTO.cpp#L10680-L10682):

```cpp
if (op.getLocalSlotNumAttr())
  return op.emitOpError(
      "globaltensor pipe init does not use 'local_slot_num'");
```

The DSL has no legal way to override `LocalSlotNum` on the address-based / `gm_slot_tensor` form added in PTOAS PR #606. PR #569 (legacy `local_slot_num` support) only covers `gm_slot_buffer`.

#### Defect B — lowering hard-codes empty `localSlotNumAttr` for the globaltensor branch

[`lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp` (HEAD `eeeb1f4`, lines 123–134)](https://github.com/hw-native-sys/PTOAS/blob/eeeb1f4/lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp#L123-L134):

```cpp
if (initOp.getGmSlotTensor()) {
  ...
  auto pipe = rewriter.create<InitializeL2G2LPipeOp>(
      loc, pipeTy, dirAttr, slotSizeAttr, slotNumAttr,
      IntegerAttr{},     // ← localSlotNumAttr
      IntegerAttr{},     // ← flagBaseAttr
      noSplitAttr, initOp.getGmSlotTensor(), Value{}, Value{});
  ...
}
```

Even if Defect A were lifted, this branch would still drop the user attribute. The non-globaltensor branch (lines 152–156) at least passes `getLocalSlotNumAttr()` through.

#### Defect C — EmitC fallback is `getSlotNum()` (= 8), not the C++ template default 2

[`lib/PTO/Transforms/PTOToEmitC.cpp` (HEAD `eeeb1f4`, lines 628–630)](https://github.com/hw-native-sys/PTOAS/blob/eeeb1f4/lib/PTO/Transforms/PTOToEmitC.cpp#L628-L630):

```cpp
int32_t localSlotNum = initOp.getLocalSlotNumAttr()
                           ? initOp.getLocalSlotNumAttr().getInt()
                           : initOp.getSlotNum();   // ← =8 in this kernel
```

When the attr is absent, EmitC writes `LocalSlotNum=SlotNum` explicitly. This does not match the C++ API default (`LocalSlotNum=2`) that the manual reference relies on.
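The difference between the two fallback policies can be sketched in isolation. This is a hypothetical model (C++17, `std::optional` standing in for a possibly-absent attribute), not the PTOAS API, and the second function is only one possible direction rather than a decided fix.

```cpp
#include <cstdint>
#include <optional>

// Assumption: the A2/A3 TPipe template default, as cited from TPush.hpp above.
constexpr int32_t kTPipeDefaultLocalSlotNum = 2;

// Behaviour described for the current EmitC path: an absent attribute
// falls back to SlotNum.
constexpr int32_t resolveCurrent(std::optional<int32_t> attr, int32_t slotNum) {
  return attr.value_or(slotNum);
}

// One possible alternative: an absent attribute falls back to the C++
// template default instead.
constexpr int32_t resolveMatchingDefault(std::optional<int32_t> attr) {
  return attr.value_or(kTPipeDefaultLocalSlotNum);
}

static_assert(resolveCurrent(std::nullopt, 8) == 8, "absent attr -> SlotNum");
static_assert(resolveMatchingDefault(std::nullopt) == 2, "absent attr -> 2");
```

With either policy an explicit attribute wins, so the choice only matters for pipe inits that cannot or do not carry `local_slot_num`.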

### What the source code proves, and the likely timeout mechanism

The source-level mismatch is clear:

1. A2/A3 `TPipe` defaults `LocalSlotNum` to `2`.
2. The manual FA reference uses effective `LocalSlotNum=2` on QK/P/PV.
3. The `gm_slot_tensor` frontend form cannot legally carry `local_slot_num`.
4. The globaltensor lowering branch drops/omits `localSlotNumAttr`.
5. EmitC falls back to `getSlotNum()` when the attr is absent, so `SlotNum=8` becomes `LocalSlotNum=8`.

What can be directly seen in `include/pto/npu/a2a3/TPush.hpp` is that `LocalSlotNum` is used through `RingFIFO<SlotSize, SlotNum, LocalSlotNum>` and affects the local consumer-buffer address rotation:

```cpp
fifo.C2V_CONSUMER_BUF +
    (tileIndex % RingFiFo::LOCAL_SLOT_NUM) * ConsM * ConsN * sizeof(T);
```

and similarly for `V2C_CONSUMER_BUF`.

Therefore the most conservative source-backed statement is:

- Manual FA rotates consumer local buffers with period 2.
- DSL-generated FA rotates consumer local buffers with period 8.
- Both use the same GM ring depth (`SlotNum=8`).
- For long sequences, GM slots are reused starting at tile index 8 (the ninth tile), so the generated kernel exercises ring reuse with a different local-buffer rotation policy than the manual reference.
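The index arithmetic behind these statements can be sketched as follows. This models only the modulo rotation from the quoted expression, assuming one push per tile index on a given pipe; it does not model FFTS flags or any synchronization, and the function names are hypothetical.

```cpp
#include <cstdint>

// GM ring slot used by a tile (assumes one push per tile index).
constexpr uint32_t gmSlot(uint32_t tileIndex, uint32_t slotNum) {
  return tileIndex % slotNum;
}

// Local consumer-buffer slot used by a tile.
constexpr uint32_t localSlot(uint32_t tileIndex, uint32_t localSlotNum) {
  return tileIndex % localSlotNum;
}

// Byte offset from the consumer buffer base, following the quoted
// (tileIndex % LOCAL_SLOT_NUM) * ConsM * ConsN * sizeof(T) expression.
constexpr uint64_t localOffsetBytes(uint32_t tileIndex, uint32_t localSlotNum,
                                    uint64_t consM, uint64_t consN,
                                    uint64_t elemSize) {
  return static_cast<uint64_t>(tileIndex % localSlotNum) * consM * consN *
         elemSize;
}

// Tile 10 maps to the same GM slot in both kernels but to different local slots.
static_assert(gmSlot(10, 8) == 2, "same GM slot with SlotNum=8");
static_assert(localSlot(10, 2) == 0, "manual: period-2 rotation");
static_assert(localSlot(10, 8) == 2, "DSL: period-8 rotation");
```

Under this model the local slot already differs at tile index 2, before the GM ring wraps; the arithmetic alone does not say why the failure only appears at longer sequences.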

This is a plausible cause of the observed AICore timeout: with `LocalSlotNum=8`, the consumer-side local buffer lifetime and the FIFO free/ready synchronization no longer match the manual kernel's intended two-slot ping-pong schedule. When the 8-slot GM ring is reused, stale data, premature reuse, or unmatched producer/consumer progress can lead to a wait that never observes the expected signal.

The original explanation involving “8 × 3 = 24 event identities exceeding an 8-event pool” is a possible hypothesis, but it is not directly proven by `TPush.hpp`: the visible `TPipe` implementation uses fixed `FlagID` / `FlagID+1`-style FFTS messages, while `LocalSlotNum` is directly visible in local buffer address rotation. To prove the event-ID explanation, we would need to compare the IR/C++ emitted by `--enable-insert-sync` and show that event lifetimes or assigned event IDs become conflicting only in the `LocalSlotNum=8` version.

Observed behavior is still consistent with the `LocalSlotNum` mismatch:

- `NUM_TILES <= 8`: the 8-slot GM ring never wraps, so no GM slot is reused and the mismatch is less likely to surface.
- `NUM_TILES >= 16`: the 8-slot GM ring wraps at least once, and the generated `LocalSlotNum=8` local-buffer rotation diverges substantially from the manual two-slot ping-pong pattern.

### Reproduction

Using PR #117 commit `35b35de4` (`kernels/python/flash_atten-v2/`) on A3, with `ptoas --pto-arch=a3 --enable-insert-sync` and a bisheng-built kernel:

```bash
cd kernels/python/flash_atten-v2
bash run_fa.sh --tiles 4 --lengths 1024  # PASSED, max_err 4.43e-05
bash run_fa.sh --tiles 8 --lengths 2048  # PASSED, max_err 2.72e-05
bash run_fa.sh --tiles 32 --lengths 8192 # builds, runtime aicore timeout
```

Inspecting the emitted `build_artifacts/fa_32.cpp`:

```cpp
auto v40 = TPipe<0, Direction::DIR_C2V, 131072, 8, 8, false>(v39, v18, v18);
auto v43 = TPipe<2, Direction::DIR_C2V, 65536, 8, 8, false>(v42, v18, v18);
auto v46 = TPipe<4, Direction::DIR_V2C, 65536, 8, 8, false>(v45, v18, v18);
```

All three pipes are emitted with `LocalSlotNum=8`. Manually rewriting only the generated `LocalSlotNum` from `8` to `2` and rebuilding allows S1=8192 to run to completion in this setup. This strongly implicates the `LocalSlotNum` mismatch. QK/PV still differ from the manual reference in `EN_UNIT_FLAG`, however, so a full semantic parity fix should treat `local_slot_num` as the primary bug and track `EN_UNIT_FLAG` separately if needed.
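For checking emitted artifacts without reading them by eye, a small helper along these lines could extract the fifth template argument from a generated line. It is a hypothetical sketch, not part of ptoas or `run_fa.sh`, and it assumes flat argument lists with no nested `<`, `>`, or `,`, as in the three lines shown above.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Splits the template argument list of the first `TPipe<...>` on a line.
// Spaces are dropped; nested template arguments are not supported.
inline std::vector<std::string> tpipeArgs(const std::string& line) {
  std::vector<std::string> args;
  const std::string key = "TPipe<";
  const std::size_t begin = line.find(key);
  if (begin == std::string::npos) return args;
  const std::size_t start = begin + key.size();
  const std::size_t end = line.find('>', start);
  if (end == std::string::npos) return args;
  std::string cur;
  for (std::size_t i = start; i < end; ++i) {
    const char c = line[i];
    if (c == ',') {
      args.push_back(cur);
      cur.clear();
    } else if (c != ' ') {
      cur.push_back(c);
    }
  }
  args.push_back(cur);
  return args;
}

// Returns the fifth template argument (LocalSlotNum in the argument order used
// in this issue), or nullopt when it is omitted or not an integer literal.
inline std::optional<int> emittedLocalSlotNum(const std::string& line) {
  const std::vector<std::string> args = tpipeArgs(line);
  if (args.size() < 5) return std::nullopt;
  try {
    return std::stoi(args[4]);
  } catch (const std::exception&) {
    return std::nullopt;
  }
}
```

A `nullopt` result on a line that does contain `TPipe<` means the argument was not spelled out, in which case the C++ template default would apply.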

### Related issues / PRs

- PTOAS PR #606 ("fix global tensor half-slot split pipes", merged 2026-04-29) — introduced the `gm_slot_tensor` form; the verifier rejection in Defect A landed in this PR.
- PTOAS PR #569 ("feat: support `local_slot_num` on legacy pipe init", merged 2026-04-25) — added the attribute on `gm_slot_buffer` form only.
- pto-isa #629 ("FA lit regression test with S1_TILE=512 crashes at runtime") — same family, OPEN.
- pto-isa #621 ("Expose FIFO consumer sync period (`cons_sync_period`)"): `kFaCvFifoConsSyncPeriod=4` is the *other* manual knob currently missing on globaltensor pipe init; it would be natural to expose it alongside `local_slot_num`.
- pto-isa #622 ("`QK_PRELOAD=4` deadlock") — closed without code fix; the same `LocalSlotNum` chain likely contributed.

### Additional context

- ptoas binary in use: `/usr/local/bin/ptoas-bin/bin/ptoas` (mtime 2026-04-30 15:01, includes PTOAS PR #606's `pto.talloc_to_aiv`/`pto.talloc_to_aic`).
- mlir_combined Python bindings rebuilt locally from `hw-native-sys/PTOAS@c3a2395` to expose the talloc op classes.
- v2 kernel reproduces the manual's row_slice loop (`Vec_S0=32` × `kTileFactor=2`) so VEC UB stays under 192 KiB at `S1_TILE=256`; that part is independent of this issue.

