[Bug] TROWSUM hangs on a2a3 hardware for INT32 tiles (FP32 path works with same shape/layout)

### Platform

a2a3 (Ascend 910 hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Description

`TROWSUM` declared with `bfloat16_t`/`float` tile element types runs correctly on a2a3 hardware, but the **identical kernel structure** with `int32_t` tiles hangs indefinitely. The hang persists with the latest pto-isa main (`687af1a6`).

The pto-isa header `include/pto/npu/a2a3/TRowSum.hpp` (lines 94–133) declares INT32 support: the dtype dispatch falls into a dedicated `vadd`-based integer path. This suggests the integer code path reaches the device but fails to drain a pipeline event, rather than hitting a missing template instantiation.

The same hang reproduces with both INT16 and INT32; we have not had a chance to test the simpler `int8_t` / unsigned variants.

### Steps to Reproduce

A direct side-by-side comparison: both variants use the same compaction trick (`TLOAD` an `R × W_PAD` row-by-pad tile → `TROWSUM` along the pad axis → `TSTORE` the resulting column) and differ only in the element type.

```cpp
// FP32 — works:
//   TLOAD wide_tile (R x W_PAD, RowMajor, float) from [L, R, W_PAD] FP32 GM
//   TROWSUM(sum_tile, wide_tile, tmp_tile)   // sum_tile is R x 1 ColMajor float
//   TSTORE sum_tile to [L, R] FP32 GM (Layout::DN)
//
// INT32 — hangs:  swap `float` → `int32_t` everywhere, all other code identical.

using WWideShape  = pto::Shape<1, 1, 1, R, W_PAD>;
using WWideStride = pto::Stride<R * W_PAD, R * W_PAD, R * W_PAD, W_PAD, 1>;
using WWideG      = pto::GlobalTensor<float /* or int32_t */, WWideShape, WWideStride>;
using WWideTile   = pto::Tile<pto::TileType::Vec, float /* or int32_t */, R, W_PAD,
                              pto::BLayout::RowMajor, R, W_PAD>;
using WSumShape   = pto::Shape<1, 1, 1, R, 1>;
using WSumStride  = pto::Stride<1, 1, 1, 1, 1>;
using WSumG       = pto::GlobalTensor<float /* or int32_t */, WSumShape, WSumStride,
                                       pto::Layout::DN>;
using WSumTile    = pto::Tile<pto::TileType::Vec, float /* or int32_t */, R, 1,
                              pto::BLayout::ColMajor, R, 1>;

WWideTile wide_tile;
WSumTile  sum_tile;
WWideTile tmp_tile;     // same shape as src, per pto-isa convention
TASSIGN(wide_tile, 0x10000);
TASSIGN(sum_tile,  0x20000);
TASSIGN(tmp_tile,  0x21000);

// The pipeline scaffolding is identical for both dtypes:
TLOAD(wide_tile, win_g);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
pipe_barrier(PIPE_V);
TROWSUM(sum_tile, wide_tile, tmp_tile);     // ← INT32 hangs here
pipe_barrier(PIPE_V);
set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
TSTORE(out_g, sum_tile);
```

Constants in our reproducer: `R = 32`, `W_PAD = IDX_PAD = 8`. Total wide tile size = `32 × 8 × sizeof(T) = 1 KB` for both FP32 and INT32.
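
To rule out overlapping UB allocations as a cause, the tile sizes and addresses above can be checked at compile time on the host. This is a minimal sketch: `UbRegion`, `overlaps`, and `layout_is_disjoint` are hypothetical helpers (not part of pto-isa), and it assumes each tile occupies exactly `rows × cols × sizeof(T)` bytes starting at its `TASSIGN` address, with no extra alignment padding.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a pto-isa type): a UB allocation [base, base + bytes).
struct UbRegion {
    std::uint32_t base;
    std::size_t   bytes;
};

// True if the two half-open byte ranges share at least one byte.
constexpr bool overlaps(UbRegion a, UbRegion b) {
    return a.base < b.base + b.bytes && b.base < a.base + a.bytes;
}

constexpr std::size_t R     = 32;
constexpr std::size_t W_PAD = 8;

// Checks the three TASSIGN addresses from the reproducer for element type T.
// Assumption: tile footprint == rows * cols * sizeof(T).
template <typename T>
constexpr bool layout_is_disjoint() {
    constexpr UbRegion wide{0x10000, R * W_PAD * sizeof(T)};
    constexpr UbRegion sum {0x20000, R * 1 * sizeof(T)};
    constexpr UbRegion tmp {0x21000, R * W_PAD * sizeof(T)};
    return !overlaps(wide, sum) && !overlaps(wide, tmp) && !overlaps(sum, tmp);
}

static_assert(R * W_PAD * sizeof(float) == 1024, "FP32 wide tile is 1 KB");
static_assert(R * W_PAD * sizeof(std::int32_t) == 1024, "INT32 wide tile is 1 KB");
static_assert(layout_is_disjoint<float>(), "FP32 tiles must not overlap");
static_assert(layout_is_disjoint<std::int32_t>(), "INT32 tiles must not overlap");
```

Under these assumptions both dtypes produce the same byte layout, so an address collision would affect the FP32 path equally.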

The full kernel that exhibits the hang is checked in at:

`examples/workers/l3/ep_dispatch_distributed/kernels/aiv/ep_dispatch_kernel.cpp` in the simpler repo (function `kernel_entry`, the Phase 4 stage-out section). Today it uses FP32 TROWSUM for the `recv_w` channel and a scalar GM copy fallback for the `recv_idx` channel; switching the idx fallback to INT32 TROWSUM with otherwise identical code is what triggers the hang.

### Expected Behavior

INT32 TROWSUM compacts the wide row tile and `TSTORE`s the per-row sum to the destination GlobalTensor, identical in shape semantics to the FP32 path. Total runtime should be on the order of microseconds for a 32×8 reduction.

### Actual Behavior

Kernel hangs indefinitely. Task scheduler logs show bootstrap completing on both ranks, the dispatch task finishing, then the kernel sitting in the stage-out task until the watchdog kills it:

```
Resource phase: 1 case(s), pool=[12, 14], max_parallel=2
[scheduler] START standalone test_ep_dispatch_distributed (rt=tensormap_and_ringbuffer, dev=2) pid=... devices=[12, 14]
[taskqueue] task timed out (60s), automatically killed
```

Replacing only the INT32 TROWSUM with a scalar copy loop (or replacing the dtype with `float`) is sufficient to make the kernel return immediately and the test pass.

### Git Commit ID

`687af1a6bdd9ddd6a47a56cea773896d9d494e0f` (latest main as of report time)

### CANN Version

CANN 8.5.0

### Driver Version

25.3.rc1 (ascendhal 7.35.23)

### Host Platform

Linux aarch64 (5.10.0 kernel)

### Additional Context

- Same shape (`R × W_PAD = 32 × 8`), same layout (RowMajor src, ColMajor `R × 1` dst with `Layout::DN` GlobalTensor), same UB tile addresses, same pipe barriers — only the element type differs between the working FP32 path and the hanging INT32 path.
- TRowSum.hpp lines 94–133 contain a dedicated INT32 implementation that uses `vector_dup` + `vadd` + `pipe_barrier(PIPE_V)` + `pipe_barrier(PIPE_ALL)` followed by a final scalar reduction across 8 lanes. The internal `pipe_barrier(PIPE_ALL)` mid-implementation is unusual relative to the FP32 vcadd path and may be implicated.
- Workaround in our codebase: scalar GM copy of column 0 (`out[i] = wide[i * PAD]`). For our usage volume (≤ a few hundred INT32 stores in the final stage-out) the perf cost is negligible, but a working tile-level INT32 TROWSUM would be valuable for higher-volume reductions in production EP combine paths.
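
For anyone validating a fix, a host-side reference for the expected output may be useful. The sketch below is illustrative only and does not call pto-isa: `row_sum_reference`, `column0_copy`, and `make_padded_input` are hypothetical names, and the input generator assumes the payload sits in column 0 with all pad columns zero, which is the condition under which the column-0 workaround and a row sum agree.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host reference for the expected TROWSUM result on a row-major
// rows x cols buffer: out[r] = sum over c of wide[r * cols + c].
template <typename T>
std::vector<T> row_sum_reference(const std::vector<T>& wide,
                                 std::size_t rows, std::size_t cols) {
    std::vector<T> out(rows, T{0});
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[r] += wide[r * cols + c];
    return out;
}

// The scalar workaround from this report: out[i] = wide[i * PAD].
template <typename T>
std::vector<T> column0_copy(const std::vector<T>& wide,
                            std::size_t rows, std::size_t cols) {
    std::vector<T> out(rows);
    for (std::size_t i = 0; i < rows; ++i)
        out[i] = wide[i * cols];
    return out;
}

// Assumption: payload in column 0, zero padding in every other column.
std::vector<std::int32_t> make_padded_input(std::size_t rows, std::size_t cols) {
    std::vector<std::int32_t> wide(rows * cols, 0);
    for (std::size_t r = 0; r < rows; ++r)
        wide[r * cols] = static_cast<std::int32_t>(100 + r);
    return wide;
}
```

With `R = 32` and `W_PAD = 8`, comparing `row_sum_reference(wide, 32, 8)` against the device output for both `float` and `int32_t` would confirm that a fixed INT32 path matches the FP32 path's semantics.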
