Platform
a2a3 (Ascend 910 hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
TROWSUM declared with bfloat16_t/float tile element types runs correctly on a2a3 hardware, but the identical kernel structure with int32_t tiles hangs indefinitely. The hang persists with the latest pto-isa main (687af1a6).
The pto-isa header include/pto/npu/a2a3/TRowSum.hpp (lines 94–133) declares INT32 support — the dtype dispatch falls into a dedicated vadd-based integer path. So this looks like the integer code path is reaching the device but not draining a pipeline event correctly, rather than being a missing template instantiation.
The same hang reproduces with both INT16 and INT32; we have not had a chance to test the simpler int8_t / unsigned variants.
Steps to Reproduce
A direct, side-by-side comparison: same compaction trick (TLOAD a wide row × pad grid → TROWSUM along the pad axis → TSTORE column), only changing the element type.
// FP32 — works:
// TLOAD wide_tile (R x W_PAD, RowMajor, float) from [L, R, W_PAD] FP32 GM
// TROWSUM(sum_tile, wide_tile, tmp_tile) // sum_tile is R x 1 ColMajor float
// TSTORE sum_tile to [L, R] FP32 GM (Layout::DN)
//
// INT32 — hangs: swap `float` → `int32_t` everywhere, all other code identical.
using WWideShape = pto::Shape<1, 1, 1, R, W_PAD>;
using WWideStride = pto::Stride<R * W_PAD, R * W_PAD, R * W_PAD, W_PAD, 1>;
using WWideG = pto::GlobalTensor<float /* or int32_t */, WWideShape, WWideStride>;
using WWideTile = pto::Tile<pto::TileType::Vec, float /* or int32_t */, R, W_PAD,
                            pto::BLayout::RowMajor, R, W_PAD>;
using WSumShape = pto::Shape<1, 1, 1, R, 1>;
using WSumStride = pto::Stride<1, 1, 1, 1, 1>;
using WSumG = pto::GlobalTensor<float /* or int32_t */, WSumShape, WSumStride,
                                pto::Layout::DN>;
using WSumTile = pto::Tile<pto::TileType::Vec, float /* or int32_t */, R, 1,
                           pto::BLayout::ColMajor, R, 1>;
WWideTile wide_tile;
WSumTile sum_tile;
WWideTile tmp_tile; // same shape as src, per pto-isa convention
TASSIGN(wide_tile, 0x10000);
TASSIGN(sum_tile, 0x20000);
TASSIGN(tmp_tile, 0x21000);
// The pipeline scaffolding is identical for both dtypes:
TLOAD(wide_tile, win_g);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
pipe_barrier(PIPE_V);
TROWSUM(sum_tile, wide_tile, tmp_tile); // ← INT32 hangs here
pipe_barrier(PIPE_V);
set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
TSTORE(out_g, sum_tile);
Constants in our reproducer: R = 32, W_PAD = IDX_PAD = 8. Total wide tile size = 32 × 8 × sizeof(T) = 1 KB for both FP32 and INT32.
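To rule out a trivial UB overlap as the cause, the size and address arithmetic above can be checked at compile time on the host. This is a minimal sketch: `UbRange`, `overlaps`, and `layout_is_disjoint` are hypothetical helper names, not part of pto-isa, and the sizes are the logical `rows × cols × sizeof(T)` footprint only, ignoring any alignment padding the Tile type may add.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reproducer constants from the report.
constexpr std::size_t R = 32;
constexpr std::size_t W_PAD = 8;

// Byte range [base, base + size) occupied by one tile in UB (hypothetical helper).
struct UbRange {
    std::uint32_t base;
    std::size_t size;
};

// Two half-open ranges overlap when each one starts before the other ends.
constexpr bool overlaps(UbRange a, UbRange b) {
    return a.base < b.base + b.size && b.base < a.base + a.size;
}

// Checks the three TASSIGN addresses used in the reproducer for element type T.
template <typename T>
constexpr bool layout_is_disjoint() {
    return !overlaps(UbRange{0x10000, R * W_PAD * sizeof(T)},   // wide_tile
                     UbRange{0x20000, R * 1 * sizeof(T)}) &&    // sum_tile
           !overlaps(UbRange{0x10000, R * W_PAD * sizeof(T)},   // wide_tile
                     UbRange{0x21000, R * W_PAD * sizeof(T)}) && // tmp_tile
           !overlaps(UbRange{0x20000, R * 1 * sizeof(T)},       // sum_tile
                     UbRange{0x21000, R * W_PAD * sizeof(T)});  // tmp_tile
}

// 32 x 8 x 4 bytes = 1 KB for both element types.
static_assert(R * W_PAD * sizeof(float) == 1024, "FP32 wide tile is 1 KB");
static_assert(R * W_PAD * sizeof(std::int32_t) == 1024, "INT32 wide tile is 1 KB");
static_assert(layout_is_disjoint<float>(), "FP32 tiles must not overlap");
static_assert(layout_is_disjoint<std::int32_t>(), "INT32 tiles must not overlap");
```

Because `float` and `int32_t` have the same size here, the two variants have identical footprints under this model, which is consistent with the hang not being a buffer-layout problem.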
The full kernel that exhibits the hang is checked in at:
examples/workers/l3/ep_dispatch_distributed/kernels/aiv/ep_dispatch_kernel.cpp in the simpler repo (function kernel_entry, the Phase 4 stage-out section). Today it uses FP32 TROWSUM for the recv_w channel and a scalar GM copy fallback for the recv_idx channel; switching the idx fallback to INT32 TROWSUM with otherwise identical code is what triggers the hang.
Expected Behavior
INT32 TROWSUM compacts the wide row tile and TSTOREs the per-row sum to the destination GlobalTensor, identical in shape semantics to the FP32 path. Total runtime should be on the order of microseconds for a 32×8 reduction.
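For validating device output once the hang is resolved, the expected result can be stated as a host-side golden model. The sketch below is plain C++ and does not use any pto-isa API; `reference_row_sum` is a hypothetical name, and the flat row-major indexing is an assumption that matches the RowMajor source tile described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side golden model of the expected result:
// out[r] = sum over c of wide[r * w_pad + c], for a row-major rows x w_pad grid.
template <typename T>
std::vector<T> reference_row_sum(const std::vector<T>& wide,
                                 std::size_t rows, std::size_t w_pad) {
    assert(wide.size() == rows * w_pad);
    std::vector<T> out(rows, T{});
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < w_pad; ++c) {
            out[r] += wide[r * w_pad + c];
        }
    }
    return out;
}
```

The same template covers both the working `float` case and the hanging `int32_t` case, so a single comparison harness can be reused when only the element type changes.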
Actual Behavior
Kernel hangs indefinitely. Task scheduler logs show bootstrap completing on both ranks, the dispatch task finishing, then the kernel sitting in the stage-out task until the watchdog kills it:
Resource phase: 1 case(s), pool=[12, 14], max_parallel=2
[scheduler] START standalone test_ep_dispatch_distributed (rt=tensormap_and_ringbuffer, dev=2) pid=... devices=[12, 14]
[taskqueue] task timed out (60s), automatically killed
Replacing only the INT32 TROWSUM with a scalar copy loop (or replacing the dtype with float) is sufficient to make the kernel return immediately and the test pass.
Git Commit ID
687af1a6bdd9ddd6a47a56cea773896d9d494e0f (latest main as of report time)
CANN Version
CANN 8.5.0
Driver Version
25.3.rc1 (ascendhal 7.35.23)
Host Platform
Linux aarch64 (5.10.0 kernel)
Additional Context
- Same shape (R × W_PAD = 32 × 8), same layout (RowMajor src, ColMajor R × 1 dst with Layout::DN GlobalTensor), same UB tile addresses, same pipe barriers; only the element type differs between the working FP32 path and the hanging INT32 path.
- TRowSum.hpp lines 94–133 contain a dedicated INT32 implementation that uses vector_dup + vadd + pipe_barrier(PIPE_V) + pipe_barrier(PIPE_ALL) followed by a final scalar reduction across 8 lanes. The internal pipe_barrier(PIPE_ALL) mid-implementation is unusual relative to the FP32 vcadd path and may be implicated.
- Workaround in our codebase: scalar GM copy of column 0 (out[i] = wide[i * PAD]). For our usage volume (≤ a few hundred INT32 stores in the final stage-out) the perf cost is negligible, but a working tile-level INT32 TROWSUM would be valuable for higher-volume reductions in production EP combine paths.
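The scalar workaround in the last bullet can be modelled on the host as follows. This is a sketch only: `copy_column0` and `pad_columns_are_zero` are hypothetical names, plain `std::vector` stands in for GM, and it assumes that each row carries its payload in column 0 with all pad columns zero, which is the condition under which copying column 0 gives the same result as a row sum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar fallback from the report: out[i] = wide[i * pad].
std::vector<std::int32_t> copy_column0(const std::vector<std::int32_t>& wide,
                                       std::size_t rows, std::size_t pad) {
    assert(wide.size() == rows * pad);
    std::vector<std::int32_t> out(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        out[i] = wide[i * pad];
    }
    return out;
}

// Assumption check: columns 1..pad-1 of every row are zero. Only then does
// the column-0 copy equal the per-row sum that TROWSUM would produce.
bool pad_columns_are_zero(const std::vector<std::int32_t>& wide,
                          std::size_t rows, std::size_t pad) {
    assert(wide.size() == rows * pad);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t c = 1; c < pad; ++c) {
            if (wide[i * pad + c] != 0) {
                return false;
            }
        }
    }
    return true;
}
```

If the pad columns can hold non-zero data, the fallback is a column extract rather than a reduction, so it would not be a drop-in replacement for TROWSUM in that case.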