1 change: 1 addition & 0 deletions examples/jit_cpp/cross_core_sync_demo/.gitignore
@@ -0,0 +1 @@
*.so
232 changes: 232 additions & 0 deletions examples/jit_cpp/cross_core_sync_demo/PTO_API_BUGS.md
@@ -0,0 +1,232 @@
# PTO API Known Bugs

This document records confirmed bugs and their workarounds in the PTO-ISA
library (`pto-isa-master`), found while implementing the kernels in this
`cross_core_sync_demo` directory.

---

## Bug 1 — `TPipe` (TileData TPUSH/TPOP): `tileIndex` shared between Vec sub-blocks breaks multi-round kernels

### Status

**Confirmed present in**:
- `/sources/pto-isa/include` (Ascend CANN 8.5.0 bundled headers)
- `pto-isa-master` HEAD as of 2026-05-12 (commit `933ad5d8`)

The pto-isa maintainers acknowledged the issue by changing their own reference
test (`tests/npu/a2a3/src/st/testcase/tpushpop_cv/tpushpop_cv_kernel.cpp`) from
`FIFO_DEPTH=2` to `FIFO_DEPTH=1` in commit `aef3a004` (PR !895, "optimize
reverse dependency with synchronization period", merged 2026-05-07).

### Affected API

`TPUSH` / `TPOP` — TileData overloads (not the GlobalData / gm_pipe overloads):

```cpp
// Producer side (Cube in C2V, Vec in V2C)
TPUSH<PipeType, TileProd, TileSplitAxis::TILE_UP_DOWN>(pipe, tile);

// Consumer side (Vec in C2V, Cube in V2C)
TPOP<PipeType, TileCons, TileSplitAxis::TILE_UP_DOWN>(pipe, tile);
```

The bug is specific to `TileSplitAxis::TILE_UP_DOWN` (or any split that causes
2 Vec sub-blocks to call TPUSH or TPOP independently). `TILE_NO_SPLIT` is
believed to be unaffected.

### Root Cause

`TPipe<FlagID, Dir, SlotSize, FIFO_DEPTH>` stores a single `tileIndex` counter
per `Producer` and per `Consumer` struct (`pipe.prod.tileIndex` /
`pipe.cons.tileIndex`). With `TILE_UP_DOWN`, a single core has **two** Vec
sub-blocks (`vid = 0` and `vid = 1`); each sub-block independently calls TPUSH
or TPOP in its own code path.

Because `tileIndex++` fires once per TPUSH/TPOP call:

| Direction | Who calls TPUSH | Who calls TPOP | Effect |
|-----------|-----------------|----------------|--------|
| C2V | 1 Cube core — once per round | 2 Vec sub-blocks — once each per round | `cons.tileIndex` advances by **2** per round; `prod.tileIndex` advances by 1 → desync after round 1 |
| V2C | 2 Vec sub-blocks — once each per round | 1 Cube core — once per round | `prod.tileIndex` advances by **2** per round; `cons.tileIndex` advances by 1 → desync after round 1 |

After N logical rounds with `FIFO_DEPTH=2`, `SyncPeriod=2`:
- The side with 2 sub-blocks has `tileIndex = 2N`; the other side has `tileIndex = N`
- The slot selected by `tileIndex % SLOT_NUM` drifts: the 2-sub-block side
starts reading/writing the wrong FIFO slot from round 2 onwards
- The `shouldWaitFree` / `shouldNotifyFree` conditions also fire at wrong
intervals, causing the FFTS signal counts to diverge
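
The drift can be illustrated with a toy model in plain Python. This is only a sketch of the counting argument above, not the library's code: it assumes each side evaluates `tileIndex % slot_num` at the start of a round, and `slot_at_round_start` and `first_desync_round` are hypothetical names introduced for illustration.

```python
from typing import Optional


def slot_at_round_start(round_idx: int, calls_per_round: int, slot_num: int) -> int:
    """Slot a side would select at the start of a round (toy model)."""
    # Counter value after `round_idx` complete rounds, one increment per call.
    tile_index = round_idx * calls_per_round
    return tile_index % slot_num


def first_desync_round(num_rounds: int, slot_num: int = 2) -> Optional[int]:
    """Return the first 0-based round where the two sides select different slots."""
    for r in range(num_rounds):
        single = slot_at_round_start(r, 1, slot_num)  # 1 call per round (Cube core)
        split = slot_at_round_start(r, 2, slot_num)   # 2 calls per round (Vec sub-blocks)
        if single != split:
            return r
    return None  # no divergence within num_rounds


if __name__ == "__main__":
    for depth in (2, 1):
        print(f"slot_num={depth}: first desync round = {first_desync_round(8, depth)}")
```

In this model a single round never diverges, the second round diverges when the depth is 2, and a depth of 1 never diverges because every index maps to slot 0.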

### Observed Failures

**C2V (`matmul_add_c2v`, `stream_c2v`):**
- `num_rounds = 1`: correct
- `num_rounds = 2`: wrong numerical results (`max_diff ≈ 70` for fp32 output)
- `num_rounds ≥ 4`: hardware exception — `L0C read/write conflict (FIXP reads
l0c, same address as cube write)`

**V2C (`add_matmul_v2c`, `stream_v2c`):**
- `num_rounds = 1`: correct
- `num_rounds ≥ 2`: wrong numerical results and/or hardware exception

Errors are **deterministic** (reproducible on every run with the same seed).

### Minimal Reproduction

```cpp
// C2V direction — fails at num_rounds=2 with FIFO_DEPTH=2
constexpr uint32_t FIFO_DEPTH = 2;
using C2VPipe = TPipe<0, Direction::DIR_C2V, C2V_SLOT_SIZE, FIFO_DEPTH>;
// ...
for (int32_t r = 0; r < num_rounds; ++r) {
    TPUSH<C2VPipe, TileL0C, TileSplitAxis::TILE_UP_DOWN>(pipe, c_l0);  // Cube
    TPOP<C2VPipe, VecTile, TileSplitAxis::TILE_UP_DOWN>(pipe, c_ub);   // Vec ×2 sub-blocks
}
```

See `matmul_add/pushpop/matmul_add_c2v.cpp` and `add_matmul_v2c.cpp` for the
full implementations that reproduce the failure.

### Expected Behavior

A kernel with `num_rounds > 1` using `TILE_UP_DOWN` should:
1. Maintain correct FIFO slot selection across all rounds
2. Maintain balanced FFTS signal counts (no accumulation)
3. Produce correct numerical output for any `num_rounds ≥ 1`

### Workarounds

#### Workaround A — `FIFO_DEPTH=1` (pto-isa maintainers' approach)

Change the pipe depth to 1. With `SlotNum=1`, `SyncPeriod` is also 1 (per the
`TPipe::SyncPeriod` formula), and the new `shouldWaitFree` code (PR !895)
always returns `true` for `SlotNum == 1`. This forces strict producer↔consumer
alternation with no double-buffering, which avoids the tileIndex desync at the
cost of pipeline overlap:

```cpp
constexpr uint32_t FIFO_DEPTH = 1; // was 2
using C2VPipe = TPipe<0, Direction::DIR_C2V, C2V_SLOT_SIZE, FIFO_DEPTH>;
```

**Important**: the Python-side `fifo_mem` allocation must also reflect
`FIFO_DEPTH=1`:
```python
C2V_FIFO_ELEMS_PER_CORE = 1 * TILE_SIZE * TILE_SIZE # not 2× anymore
```

**Note**: this workaround also requires a fresh `fifo_mem` for every kernel call
in Python benchmarks. Reusing the same `fifo_mem` tensor across repeated calls
accumulates TPipe head/tail state (stored inside `fifo_mem`) and causes wrong
results or hangs. Pre-allocate one `fifo_mem` per call:
```python
fifos = [torch.zeros(BLOCK_DIM * FIFO_ELEMS_PER_CORE, ...) for _ in range(n_calls)]
for i in range(n_calls):
    kernel(A, B, C, D, fifos[i])
```
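
A framework-free sketch of the same idea, using plain Python lists in place of tensors. `fifo_elems_per_core` and `make_fifo_pool` are illustrative names, not part of the demo. The sketch only shows that the buffer size scales with `FIFO_DEPTH` and that each call receives its own zero-initialised buffer.

```python
from typing import List


def fifo_elems_per_core(fifo_depth: int, tile_size: int) -> int:
    """Elements one core needs for a FIFO of `fifo_depth` square tiles."""
    return fifo_depth * tile_size * tile_size


def make_fifo_pool(n_calls: int, block_dim: int, fifo_depth: int,
                   tile_size: int) -> List[List[float]]:
    """Build one independent, zero-filled FIFO buffer per kernel call."""
    n = block_dim * fifo_elems_per_core(fifo_depth, tile_size)
    # The comprehension creates a distinct list per call. Writing
    # `[buf] * n_calls` instead would alias a single buffer n_calls times,
    # which reintroduces the stale-state problem described above.
    return [[0.0] * n for _ in range(n_calls)]


if __name__ == "__main__":
    pool = make_fifo_pool(n_calls=3, block_dim=2, fifo_depth=1, tile_size=4)
    print(len(pool), len(pool[0]), pool[0] is pool[1])
```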

#### Workaround B — `gm_pipe` variant (GlobalData TPUSH/TPOP + explicit TALLOC/TFREE)

Use the GlobalData overloads of TPUSH/TPOP together with TALLOC/TFREE and
explicit TSTORE/TLOAD. The `gm_pipe` implementation in this demo manages FIFO
slot indices manually via `r % FIFO_DEPTH`, completely bypassing the shared
`tileIndex` counter. This variant supports arbitrary `num_rounds` with
`FIFO_DEPTH=2`.

See `matmul_add/gm_pipe/` and `stream_c2v_v2c/gm_pipe/`.

**Important**: `gm_pipe` requires the newer `pto-isa-master` headers (not the
CANN 8.5.0 bundled headers), because `TALLOC`, `TPOP(GlobalData)`, and `TFREE`
are absent from `/sources/pto-isa/include`.

#### Workaround C — raw FFTS flags (`raw_flag` variant)

Avoid TPipe entirely. Use `ffts_cross_core_sync` / `wait_flag_dev` directly
with explicit workspace memory. Supports arbitrary `num_rounds` with no tileIndex
issue. See `matmul_add/raw_flag/` and `stream_c2v_v2c/raw_flag/`.

### Summary Table

| Variant | API | Multi-round | Notes |
|---------|-----|-------------|-------|
| `pushpop` (FIFO_DEPTH=2) | TileData TPUSH/TPOP | ❌ broken ≥2 rounds | This bug |
| `pushpop` (FIFO_DEPTH=1) | TileData TPUSH/TPOP | ✅ correct | No double-buffer overlap; the `matmul_add` README reports that V2C still deadlocks with this setting |
| `gm_pipe` | GlobalData TPUSH/TPOP + TALLOC/TFREE | ✅ correct | Newer headers required |
| `raw_flag` | Direct FFTS + manual workspace | ✅ correct | Most portable |

---

## Bug 2 — FFTS flag collision between kernels with the same `FlagID`

### Status

**Design limitation** (not a library bug per se, but a footgun).

### Description

`TPipe<FlagID, ...>` uses FFTS hardware flags `FlagID` (push/data-ready signal)
and `FlagID+1` (free/slot-available signal) internally. When two different
kernels or pipe types use the same `FlagID`, their FFTS signals contaminate
each other if the kernels are called sequentially in the same process on the
same NPU core.

**Example**: `C2VPipe = TPipe<0, DIR_C2V>` and `V2CPipe = TPipe<0, DIR_V2C>`
both occupy FFTS flags 0 and 1. A benchmark that calls the C2V kernel many
times accumulates residual FFTS signals on flags 0/1. The subsequent V2C
kernel's first TPOP fires on a stale signal and reads wrong data.

### Fix

Assign non-overlapping `FlagID` values to pipes that are called from the same
process:

```cpp
using C2VPipe = TPipe<0, Direction::DIR_C2V, ...>; // uses flags 0, 1
using V2CPipe = TPipe<2, Direction::DIR_V2C, ...>; // uses flags 2, 3 — no collision
```

This fix is applied in:
- `stream_c2v_v2c/pushpop/stream_v2c.cpp`
- `stream_c2v_v2c/gm_pipe/stream_v2c.cpp`
- `matmul_add/gm_pipe/add_matmul_v2c.cpp` (uses raw FFTS flags 2/3 instead of 0/1)
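
A small host-side check can catch overlapping assignments before they reach the NPU. The sketch below relies only on the rule stated above (a pipe with a given `FlagID` occupies flags `FlagID` and `FlagID+1`); `flags_used` and `find_flag_collisions` are hypothetical helpers, not part of PTO-ISA.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def flags_used(flag_id: int) -> Set[int]:
    """FFTS flags occupied by a pipe: data-ready (FlagID) and free (FlagID+1)."""
    return {flag_id, flag_id + 1}


def find_flag_collisions(pipes: Dict[str, int]) -> List[Tuple[str, str, List[int]]]:
    """`pipes` maps a pipe name to its FlagID; returns (name_a, name_b, shared_flags)."""
    collisions = []
    for (name_a, id_a), (name_b, id_b) in combinations(pipes.items(), 2):
        shared = flags_used(id_a) & flags_used(id_b)
        if shared:
            collisions.append((name_a, name_b, sorted(shared)))
    return collisions


if __name__ == "__main__":
    print(find_flag_collisions({"C2VPipe": 0, "V2CPipe": 0}))  # both on flags 0/1
    print(find_flag_collisions({"C2VPipe": 0, "V2CPipe": 2}))  # disjoint
```

Note that adjacent IDs such as 0 and 1 also collide in this model, because the first pipe's free flag is the second pipe's data-ready flag.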

---

## Bug 3 — `TSTORE(c_global, c_l0)` (FIX pipe) conflicts with next-call `TMATMUL` (M pipe) in benchmark loops

### Status

**Synchronization omission** in the kernel itself, exposed by benchmark loops.

### Description

`TSTORE(dst_gm, c_l0)` on the FIX pipe initiates a DMA that reads from `c_l0`
(L0C) and writes to global memory. The DMA may still be in-flight when the
kernel "completes" (all pipe instructions issued). If back-to-back kernel calls
are queued in the same NPU stream (as in a benchmark loop), the **next** call's
`TMATMUL` can start writing to `c_l0` (M pipe) before the **previous** call's
FIX DMA finishes reading it → `L0C read/write conflict` hardware exception.

This does NOT manifest in correctness tests (few calls) but reliably crashes
under benchmark load (`REPEATS=30` calls in a tight loop).

### Fix

Add `pipe_barrier(PIPE_ALL)` immediately after `TSTORE(c_global, c_l0)` in the
Cube loop so that the FIX pipe is drained before the next `TMATMUL` and before
kernel exit:

```cpp
for (int32_t r = 0; r < num_rounds; ++r) {
    // ...
    TSTORE(c_global, c_l0);
    pipe_barrier(PIPE_ALL);  // ← drain FIX before kernel exit / next TMATMUL
}
```

Alternatively, use the targeted pair `SetFlag<PIPE_FIX, PIPE_M>(1); WaitFlag<PIPE_FIX, PIPE_M>(1);`
after each TSTORE. This also requires an additional `SetFlag<PIPE_M, PIPE_MTE1>(1);
WaitFlag<PIPE_M, PIPE_MTE1>(1);` guard on L0A reuse; see
`matmul_add/gm_pipe/add_matmul_v2c.cpp` for the full treatment.

This fix is applied in `matmul_add/raw_flag/add_matmul_v2c.cpp` and the
`gm_pipe` variants.
21 changes: 21 additions & 0 deletions examples/jit_cpp/cross_core_sync_demo/README.md
@@ -0,0 +1,21 @@
Demonstrate different API abstractions for Cube-Vector data exchange and synchronization

There are currently 4 API sets that can express cross-core data passing:
1. `ffts_cross_core_sync` & `wait_flag_dev`
2. `TSYNC`
3. `TPUSH` & `TPOP`
4. `TPUSH` & `TPOP` & `TFREE` & `TALLOC`

Purpose of this demo directory: use *clear, minimal code* to demonstrate the *syntax and performance* differences between those API styles.

- [stream_c2v_v2c](./stream_c2v_v2c)
- [matmul_add](./matmul_add)
- [linear_attn](./linear_attn)

## Known PTO API Issues

See **[PTO_API_BUGS.md](./PTO_API_BUGS.md)** for confirmed bugs and workarounds:

- **Bug 1**: `TPipe` TileData TPUSH/TPOP with `TILE_UP_DOWN` and 2 Vec sub-blocks — `tileIndex` shared counter causes slot desync for `num_rounds ≥ 2` (`FIFO_DEPTH=2`). Confirmed present in latest `pto-isa-master` (as of 2026-05-12). Workaround: use `FIFO_DEPTH=1`, `gm_pipe`, or `raw_flag`.
- **Bug 2**: FFTS flag collision when two pipes share the same `FlagID` (e.g., `TPipe<0, DIR_C2V>` and `TPipe<0, DIR_V2C>` both use flags 0/1).
- **Bug 3**: `TSTORE(c_global, c_l0)` FIX-pipe DMA in-flight at kernel exit conflicts with next-call `TMATMUL` in benchmark loops → `L0C read/write conflict`.
103 changes: 103 additions & 0 deletions examples/jit_cpp/cross_core_sync_demo/matmul_add/README.md
@@ -0,0 +1,103 @@
# matmul_add — Three Cube↔Vector synchronization API styles + naive baseline

Persistent kernels computing `C = A @ B + D` (C2V) and `C = (A + B) @ D` (V2C),
implemented in three pipelined API styles and one non-pipelined naive baseline.

## Variants

| Subdirectory | Sync API | Pipeline | Note |
|---|---|---|---|
| `raw_flag/` | `ffts_cross_core_sync` + `wait_flag_dev` (direct) | round-level overlap | Reference, full multi-round correctness |
| `pushpop/` | `TPipe` TileData — sync + data-move in one call | round-level overlap | `num_rounds=1` scope; multi-round has shared tileIndex issue |
| `gm_pipe/` | `TPipe` GlobalData — `TPUSH`/`TPOP` signal only | round-level overlap | Full multi-round; requires pto-isa-master headers |
| `naive_separate/` | `ffts_cross_core_sync` (one signal per stage) | **none** — stages are sequential | Slower baseline; shows pipeline benefit |

## Kernel Algorithms

| Kernel | Operation | C2V or V2C |
|--------|-----------|------------|
| `matmul_add_c2v` | `C = A @ B + D` | Cube GEMM → workspace → Vec add |
| `add_matmul_v2c` | `C = (A + B) @ D` | Vec add → workspace → Cube GEMM |

## Files

| Subdirectory | Kernels | Python | Note |
|---|---|---|---|
| `raw_flag/` | `matmul_add_c2v.cpp`, `add_matmul_v2c.cpp` | `jit_util_*.py`, `run_*.py` | Reference |
| `pushpop/` | same | `jit_util.py`, `run.py` | Single-round scope |
| `gm_pipe/` | same | `jit_util.py`, `run.py` | pto-isa-master headers |
| `naive_separate/` | `naive_separate.cpp` (both kernels) | `jit_util.py`, `run.py` | No pipeline |

## Reproduce

```bash
BASE=examples/jit_cpp/cross_core_sync_demo/matmul_add

# raw_flag: correctness (30/30 seeds × rounds) + bandwidth
python $BASE/raw_flag/run_matmul_add_c2v.py
python $BASE/raw_flag/run_add_matmul_v2c.py

# pushpop: correctness (num_rounds=1 scope) + bandwidth at batch=3072
python $BASE/pushpop/run.py

# gm_pipe: correctness + bandwidth (both kernels in one script)
python $BASE/gm_pipe/run.py

# naive_separate: correctness + bandwidth vs torch baseline
python $BASE/naive_separate/run.py
```

Each script prints correctness results followed by a bandwidth table.
Set `NPU_DEVICE=npu:N` to select a specific NPU.

## API Syntax Comparison (C2V direction: `C = A @ B + D`)

```
Sync API │ Data API
──────────────────────────────────────────────────────────────────────────
raw_flag Cube: ffts_cross_core_sync(FIX, FLAG_C2V) │ TSTORE(ws_half, c_l0)
Vec: wait_flag_dev(FLAG_C2V) │ TLOAD(c_ub, ws)
ffts_cross_core_sync(MTE3, FLAG_V2C) │

pushpop Cube: TPUSH<C2VPipe, TileL0C, UP_DOWN>(pipe, c_l0) ← sync + data in one call
Vec: TPOP<C2VPipe, VecTile<float>, UP_DOWN>(pipe, c_ub_float)

gm_pipe Cube: TALLOC<C2VPipe, SlotHalf, UP_DOWN>(pipe, slot) ← TPipe allocates slot
TSTORE(slot, c_l0) ← explicit fp32→fp16 (hardware FIX)
TPUSH<C2VPipe, SlotHalf, UP_DOWN>(pipe, slot) ← TPipe signals consumer
Vec: TPOP<C2VPipe, PopHalf, UP_DOWN>(pipe, pop) ← TPipe waits + slot ptr
TLOAD(c_ub, pop) ← explicit load
TFREE<C2VPipe, PopHalf, UP_DOWN>(pipe, pop) ← TPipe notifies free

naive Cube: (all GEMMs) → pipe_barrier → ffts_cross_core_sync(FIX, FLAG_C2V)
Vec: wait_flag_dev(FLAG_C2V) → (all adds)
↑ one signal after ALL rounds, no round overlap
```

## Measured Bandwidth (910B2, TILE_SIZE=128, 24 Cube cores)

Peak effective external bandwidth (read A+B+D, write C; workspace not counted):

| Variant | matmul_add_c2v peak | add_matmul_v2c peak | Notes |
|---------|--------------------|--------------------|-------|
| `raw_flag` | **1357 GB/s** | **1543 GB/s** | Reference pipelined, 64 rounds |
| `pushpop` | **1954 GB/s** (32 rounds, f32 slot) | 45 GB/s (batch=3072) | C2V: FIFO_DEPTH=1 workaround enables multi-round (f32 slot is 2× larger than f16 → 2× bw); V2C: 2-sub-block producer deadlocks with FIFO_DEPTH=1, remains rounds=1 only |
| `gm_pipe` | **1837 GB/s** | **1496 GB/s** | 64 rounds; requires pto-isa-master headers |
| `naive_separate` | 1174 GB/s | 1211 GB/s | No pipeline — **15–30% lower** |
| `torch.mm + torch.add` | ~2000 GB/s\* | ~2100 GB/s\* | Two separate launches |

\* torch bandwidth appears high because torch's GEMM is a highly tuned library kernel
that may cache intermediate results on-chip; the naive kernel instead round-trips
through full-batch HBM workspace, making it slower than torch for large batches.
The pipelined variants are faster than naive because they overlap Cube and Vec
round-by-round, reducing the effective latency of cross-core data movement.
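
The gap between the naive baseline and the pipelined variants can be recomputed from the table. The sketch below is plain arithmetic on the peak figures listed above; `percent_lower` and `effective_bandwidth_gbps` are illustrative helpers, and the bandwidth formula (external bytes divided by elapsed time) is an assumption about how the figures were derived. `pushpop` is left out because its C2V figure uses a different slot type.

```python
# Peak GB/s from the table above: (matmul_add_c2v, add_matmul_v2c).
PEAK_GBPS = {
    "raw_flag": (1357, 1543),
    "gm_pipe": (1837, 1496),
    "naive_separate": (1174, 1211),
}


def percent_lower(baseline: float, value: float) -> float:
    """How far `value` falls below `baseline`, as a percentage of `baseline`."""
    return 100.0 * (baseline - value) / baseline


def effective_bandwidth_gbps(external_bytes: float, seconds: float) -> float:
    """Assumed definition: external bytes (read A+B+D, write C) per second, in GB/s."""
    return external_bytes / seconds / 1e9


if __name__ == "__main__":
    naive = PEAK_GBPS["naive_separate"]
    for variant in ("raw_flag", "gm_pipe"):
        for idx, kernel in enumerate(("matmul_add_c2v", "add_matmul_v2c")):
            gap = percent_lower(PEAK_GBPS[variant][idx], naive[idx])
            print(f"naive vs {variant} ({kernel}): {gap:.1f}% lower")
```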

## Known Limitations

- **pushpop multi-round (C2V)**: Applying the `FIFO_DEPTH=1` workaround (forces `SyncPeriod=1`, strict alternation) makes `matmul_add_c2v` work for arbitrary `num_rounds`. Note the C2V slot is `float32` (64 KB), so bandwidth figures are 2× those of the half-slot variants.

- **pushpop multi-round (V2C)**: `add_matmul_v2c` cannot be fixed with `FIFO_DEPTH=1`: with only 1 free signal seeded in the constructor, sub-block 0 consumes it and sub-block 1 deadlocks at `allocate()`. V2C therefore remains scoped to `num_rounds=1`. Use `gm_pipe` for multi-round V2C.

- **gm_pipe header requirement**: `TALLOC`/`TPOP(GlobalData)`/`TFREE` are in `pto-isa-master` headers, not the default `/sources/pto-isa`. The `gm_pipe/jit_util.py` uses `-I/workdir/pto-isa-master/include`.

- **naive_separate workspace**: Uses `workspace[batch, TILE_SIZE]` (full-batch allocation) vs `workspace[num_cores * TILE_SIZE, TILE_SIZE]` for pipelined variants. The larger workspace means more HBM traffic per kernel call at large batch sizes.
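
The workspace comparison in the last item can be made concrete with a short calculation. The sketch below uses only the two shapes quoted above, together with the 24 cores and `TILE_SIZE=128` from the bandwidth section; the function names are illustrative, and the batch sizes in the loop are example values.

```python
def naive_workspace_elems(batch: int, tile_size: int) -> int:
    """Elements in workspace[batch, TILE_SIZE] (naive_separate)."""
    return batch * tile_size


def pipelined_workspace_elems(num_cores: int, tile_size: int) -> int:
    """Elements in workspace[num_cores * TILE_SIZE, TILE_SIZE] (pipelined variants)."""
    return num_cores * tile_size * tile_size


if __name__ == "__main__":
    num_cores, tile_size = 24, 128
    fixed = pipelined_workspace_elems(num_cores, tile_size)
    for batch in (3072, 6144, 12288):  # example batch sizes
        ratio = naive_workspace_elems(batch, tile_size) / fixed
        print(f"batch={batch}: naive workspace is {ratio:.1f}x the pipelined one")
```

Under these shapes the two allocations are equal when `batch == num_cores * TILE_SIZE`, and the naive workspace grows linearly with batch beyond that point while the pipelined one stays fixed.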