[Bug] TROWSUM hangs on a2a3 hardware for INT32 tiles (FP32 path works with same shape/layout) #119

@zhangqi-chen

Platform

a2a3 (Ascend 910 hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

TROWSUM declared with bfloat16_t/float tile element types runs correctly on a2a3 hardware, but the identical kernel structure with int32_t tiles hangs indefinitely. The hang persists with the latest pto-isa main (687af1a6).

The pto-isa header include/pto/npu/a2a3/TRowSum.hpp (lines 94–133) declares INT32 support: the dtype dispatch falls into a dedicated vadd-based integer path. This suggests the integer code path reaches the device but fails to drain a pipeline event correctly, rather than the cause being a missing template instantiation.

The same hang reproduces with both INT16 and INT32; we have not had a chance to test the simpler int8_t / unsigned variants.

Steps to Reproduce

The reproducer is a direct, side-by-side comparison. Both variants use the same compaction trick (TLOAD a wide R × W_PAD grid → TROWSUM along the pad axis → TSTORE the resulting column); only the element type changes.

// FP32 — works:
//   TLOAD wide_tile (R x W_PAD, RowMajor, float) from [L, R, W_PAD] FP32 GM
//   TROWSUM(sum_tile, wide_tile, tmp_tile)   // sum_tile is R x 1 ColMajor float
//   TSTORE sum_tile to [L, R] FP32 GM (Layout::DN)
//
// INT32 — hangs:  swap `float` → `int32_t` everywhere, all other code identical.

using WWideShape  = pto::Shape<1, 1, 1, R, W_PAD>;
using WWideStride = pto::Stride<R * W_PAD, R * W_PAD, R * W_PAD, W_PAD, 1>;
using WWideG      = pto::GlobalTensor<float /* or int32_t */, WWideShape, WWideStride>;
using WWideTile   = pto::Tile<pto::TileType::Vec, float /* or int32_t */, R, W_PAD,
                              pto::BLayout::RowMajor, R, W_PAD>;
using WSumShape   = pto::Shape<1, 1, 1, R, 1>;
using WSumStride  = pto::Stride<1, 1, 1, 1, 1>;
using WSumG       = pto::GlobalTensor<float /* or int32_t */, WSumShape, WSumStride,
                                       pto::Layout::DN>;
using WSumTile    = pto::Tile<pto::TileType::Vec, float /* or int32_t */, R, 1,
                              pto::BLayout::ColMajor, R, 1>;

WWideTile wide_tile;
WSumTile  sum_tile;
WWideTile tmp_tile;     // same shape as src, per pto-isa convention
TASSIGN(wide_tile, 0x10000);
TASSIGN(sum_tile,  0x20000);
TASSIGN(tmp_tile,  0x21000);

// The pipeline scaffolding is identical for both dtypes.
// win_g is a WWideG and out_g is a WSumG; their construction is omitted here.
TLOAD(wide_tile, win_g);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
pipe_barrier(PIPE_V);
TROWSUM(sum_tile, wide_tile, tmp_tile);     // ← INT32 hangs here
pipe_barrier(PIPE_V);
set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
TSTORE(out_g, sum_tile);

Constants in our reproducer: R = 32, W_PAD = IDX_PAD = 8. Total wide tile size = 32 × 8 × sizeof(T) = 1 KB for both FP32 and INT32.
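To double-check the size arithmetic above and rule out overlapping UB assignments as a cause, here is a small host-side sketch. The `UbRange` struct and the helper functions are hypothetical names introduced for illustration, not part of pto-isa. It also assumes each tile occupies exactly rows × cols × sizeof(T) bytes with no alignment padding, which may not match the real tile footprint on device.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t R = 32;     // rows, from the reproducer
constexpr std::size_t W_PAD = 8;  // padded row width, from the reproducer

// Hypothetical helper (not a pto-isa type): a byte range in the unified buffer.
struct UbRange {
    std::size_t base;   // address passed to TASSIGN
    std::size_t bytes;  // assumed dense footprint of the tile
};

// True when the half-open ranges [base, base + bytes) intersect.
constexpr bool overlaps(UbRange a, UbRange b) {
    return a.base < b.base + b.bytes && b.base < a.base + a.bytes;
}

template <typename T>
constexpr std::size_t wide_bytes() { return R * W_PAD * sizeof(T); }

template <typename T>
constexpr std::size_t sum_bytes() { return R * sizeof(T); }

// Checks the three TASSIGN addresses used in the reproducer for element type T.
template <typename T>
constexpr bool layout_ok() {
    return !overlaps(UbRange{0x10000, wide_bytes<T>()}, UbRange{0x20000, sum_bytes<T>()}) &&
           !overlaps(UbRange{0x20000, sum_bytes<T>()}, UbRange{0x21000, wide_bytes<T>()}) &&
           !overlaps(UbRange{0x10000, wide_bytes<T>()}, UbRange{0x21000, wide_bytes<T>()});
}

static_assert(wide_bytes<float>() == 1024, "wide FP32 tile is 1 KB");
static_assert(wide_bytes<std::int32_t>() == wide_bytes<float>(), "same footprint for INT32");
static_assert(layout_ok<float>() && layout_ok<std::int32_t>(), "tiles do not overlap");
```

Under these assumptions the footprints and addresses are identical for both element types, which is consistent with the hang being specific to the integer code path rather than to the buffer layout.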

The full kernel that exhibits the hang is checked in at:

examples/workers/l3/ep_dispatch_distributed/kernels/aiv/ep_dispatch_kernel.cpp in the simpler repo (function kernel_entry, the Phase 4 stage-out section). Today it uses FP32 TROWSUM for the recv_w channel and a scalar GM copy fallback for the recv_idx channel; switching the idx fallback to INT32 TROWSUM with otherwise identical code is what triggers the hang.

Expected Behavior

INT32 TROWSUM compacts the wide row tile and TSTOREs the per-row sum to the destination GlobalTensor, identical in shape semantics to the FP32 path. Total runtime should be on the order of microseconds for a 32×8 reduction.

Actual Behavior

Kernel hangs indefinitely. Task scheduler logs show bootstrap completing on both ranks, the dispatch task finishing, then the kernel sitting in the stage-out task until the watchdog kills it:

Resource phase: 1 case(s), pool=[12, 14], max_parallel=2
[scheduler] START standalone test_ep_dispatch_distributed (rt=tensormap_and_ringbuffer, dev=2) pid=... devices=[12, 14]
[taskqueue] task timed out (60s), automatically killed

Replacing only the INT32 TROWSUM with a scalar copy loop (or replacing the dtype with float) is sufficient to make the kernel return immediately and the test pass.

Git Commit ID

687af1a6bdd9ddd6a47a56cea773896d9d494e0f (latest main as of report time)

CANN Version

CANN 8.5.0

Driver Version

25.3.rc1 (ascendhal 7.35.23)

Host Platform

Linux aarch64 (5.10.0 kernel)

Additional Context

  • Same shape (R × W_PAD = 32 × 8), same layout (RowMajor src, ColMajor R × 1 dst with Layout::DN GlobalTensor), same UB tile addresses, same pipe barriers — only the element type differs between the working FP32 path and the hanging INT32 path.
  • TRowSum.hpp lines 94–133 contain a dedicated INT32 implementation that uses vector_dup + vadd + pipe_barrier(PIPE_V) + pipe_barrier(PIPE_ALL) followed by a final scalar reduction across 8 lanes. The internal pipe_barrier(PIPE_ALL) mid-implementation is unusual relative to the FP32 vcadd path and may be implicated.
  • Workaround in our codebase: scalar GM copy of column 0 (out[i] = wide[i * PAD]). For our usage volume (≤ a few hundred INT32 stores in the final stage-out) the perf cost is negligible, but a working tile-level INT32 TROWSUM would be valuable for higher-volume reductions in production EP combine paths.
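For anyone validating a fix, the following host-side reference model may be useful. It is only a sketch of the expected results, not the device implementation, and all function names are hypothetical. It assumes that the pad columns (columns 1 to W_PAD − 1) hold zeros, which is what makes the row sum equivalent to the column-0 copy used in the workaround.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of a row sum over a row-major rows x pad grid.
template <typename T>
std::vector<T> row_sum_reference(const std::vector<T>& wide,
                                 std::size_t rows, std::size_t pad) {
    std::vector<T> out(rows, T{});
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < pad; ++j) {
            out[i] += wide[i * pad + j];
        }
    }
    return out;
}

// Host-side model of the scalar workaround: out[i] = wide[i * pad].
template <typename T>
std::vector<T> column0_copy(const std::vector<T>& wide,
                            std::size_t rows, std::size_t pad) {
    std::vector<T> out(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        out[i] = wide[i * pad];
    }
    return out;
}

// Test grid: payload in column 0, zeros in the pad columns (an assumption).
std::vector<std::int32_t> make_padded_grid(std::size_t rows, std::size_t pad) {
    std::vector<std::int32_t> wide(rows * pad, 0);
    for (std::size_t i = 0; i < rows; ++i) {
        wide[i * pad] = static_cast<std::int32_t>(i) * 7 - 100;
    }
    return wide;
}
```

With `auto g = make_padded_grid(32, 8);`, `row_sum_reference(g, 32, 8)` and `column0_copy(g, 32, 8)` return the same vector, so a working INT32 TROWSUM should produce output identical to the current workaround as long as the pad columns are zero.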
