Summary
Commit 687af1a "Optimize reverse dependencies with sync periods" introduces a signal-count imbalance in TPipe<> on both a2a3 (include/pto/npu/a2a3/TPush.hpp) and a5 (include/pto/npu/a5/TPush.hpp). For any pipeline with SlotNum > SyncPeriod (e.g. the common SlotNum=8, SyncPeriod=4 case), each kernel invocation leaks SlotNum / SyncPeriod FFTS free signals into the cross-core flag register. On NPU hardware (A2/A3) this manifests as CANN runtime error 507018 (ACL_ERROR_RT_AICPU_EXCEPTION) after one or more invocations.
The bug is still present on main at d779cd01 — no commits between 687af1a6..d779cd01 modify the relevant headers.
Reproduction
Downstream PyPTO test (NPU A2/A3, real hardware):
pytest tests/st/runtime/test_qwen3_decode_scope3_mixed.py \
--platform a2a3 --device <N> --save-kernels \
--pto-isa-commit=d779cd0
Result on d779cd01:
RuntimeError: run_prepared failed with code 507018
[ERROR] aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
Reverting to the prior commit fcc6f420 makes the test pass. Applying the patch in §Proposed Fix on top of d779cd01 also makes the test pass.
The failing kernels (scope3_incore_{1,4,5}) use:
TPipe<0, Direction::DIR_C2V, /*SlotSize=*/4096, /*SlotNum=*/8,
/*LocalSlotNum=*/8, /*IsNoSplit=*/false>
// => SyncPeriod = SlotNum / 2 = 4
Root Cause (numerical)
Consider a producer/consumer loop of T tiles with SlotNum = 8, SyncPeriod (SP) = 4. The counts below assume T ≥ SlotNum and T a multiple of SP, so every division is exact:
| Stage | Code site | Signals (free pulses to FlagID+1) |
| --- | --- | --- |
| Constructor (consumer) | TPipe::TPipe runs cons.free() × SyncPeriod | +4 |
| In-loop consumer | TPOP_IMPL calls cons.free() iff shouldNotifyFree(i) = ((i+1) % SP) == 0 | +T / 4 |
| In-loop producer | TPUSH_IMPL calls prod.allocate() iff shouldWaitFree(i) = (i ≥ SlotNum) && (i % SP) == 0 | −(T − 8) / 4 |
| Destructor (producer) | TPipe::~TPipe runs prod.allocate() × SyncPeriod | −4 |
Net residual on FlagID+1 per kernel invocation:
(SyncPeriod + T/SP) - ((T - SlotNum)/SP + SyncPeriod)
= SlotNum / SyncPeriod
= 8 / 4
= 2 (leaked free signals)
The asymmetry comes from shouldWaitFree skipping the first SlotNum tiles (the "startup protection") while shouldNotifyFree does not skip the matching tail; the destructor doesn't compensate either.
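The per-stage counts in the table can be reproduced with a small host-side model. This is only a sketch: `PipeModel`, `StageCounts`, and `countStages` are hypothetical names introduced here, the predicates are copied from the issue text rather than from the device headers, and the model needs C++14 or later for the loop inside a constexpr function.

```cpp
#include <cstdint>

// Host-side model of the two predicates quoted in this issue (not device code).
template <uint32_t SlotNum, uint32_t SyncPeriod>
struct PipeModel {
    static constexpr bool shouldWaitFree(uint32_t tileIndex) {
        return (SlotNum == 1) || (tileIndex >= SlotNum && tileIndex % SyncPeriod == 0);
    }
    static constexpr bool shouldNotifyFree(uint32_t tileIndex) {
        return (SlotNum == 1) || ((tileIndex + 1) % SyncPeriod == 0);
    }
};

// One field per row of the table above (hypothetical bookkeeping struct).
struct StageCounts {
    int64_t ctorFrees;  // cons.free() calls in the constructor
    int64_t loopFrees;  // cons.free() calls in the consumer loop
    int64_t loopWaits;  // prod.allocate() calls in the producer loop
    int64_t dtorWaits;  // prod.allocate() calls in the current destructor
    constexpr int64_t residual() const {
        return ctorFrees + loopFrees - loopWaits - dtorWaits;
    }
};

// Walk all tile indices and count how often each predicate fires.
template <uint32_t SlotNum, uint32_t SyncPeriod>
constexpr StageCounts countStages(uint32_t tiles) {
    StageCounts c{SyncPeriod, 0, 0, SyncPeriod};
    for (uint32_t i = 0; i < tiles; ++i) {
        if (PipeModel<SlotNum, SyncPeriod>::shouldNotifyFree(i)) ++c.loopFrees;
        if (PipeModel<SlotNum, SyncPeriod>::shouldWaitFree(i)) ++c.loopWaits;
    }
    return c;
}

// SlotNum = 8, SyncPeriod = 4, T = 16: rows are +4, +4, -2, -4.
static_assert(countStages<8, 4>(16).loopFrees == 4, "T / SP frees");
static_assert(countStages<8, 4>(16).loopWaits == 2, "(T - SlotNum) / SP waits");
static_assert(countStages<8, 4>(16).residual() == 2, "SlotNum / SP pulses left over");
```

The residual stays at 2 for any larger T that is a multiple of 4, since each additional period adds one free and one wait.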
Relevant code (a2a3, include/pto/npu/a2a3/TPush.hpp):
// L53-72
PTO_INTERNAL static bool shouldWaitFree(uint32_t tileIndex) {
    if constexpr (SlotNum == 1) return true;
    else {
        if (tileIndex < SlotNum) return false; // <-- skips first SlotNum
        return (tileIndex % SyncPeriod) == 0;
    }
}
PTO_INTERNAL static bool shouldNotifyFree(uint32_t tileIndex) {
    if constexpr (SlotNum == 1) return true;
    else return ((tileIndex + 1) % SyncPeriod) == 0; // <-- no matching skip
}
// L444-450
PTO_INTERNAL ~TPipe() {
    for (uint32_t i = 0; i < SyncPeriod; ++i) {
        prod.allocate(); // <-- drains SyncPeriod, but needs SyncPeriod + SlotNum/SyncPeriod
    }
}
The a5 backend has the identical pattern at include/pto/npu/a5/TPush.hpp:611-621.
Proposed Fix
Increase the destructor drain count to balance the constructor + in-loop pulse counts:
PTO_INTERNAL ~TPipe()
{
    constexpr uint32_t kSkippedBatches = (SlotNum > 1) ? (SlotNum / SyncPeriod) : 0;
    constexpr uint32_t kDestructorWaits = SyncPeriod + kSkippedBatches;
    for (uint32_t i = 0; i < kDestructorWaits; ++i) {
        prod.allocate();
    }
}
Verified on real NPU hardware: with this patch applied on top of d779cd01, the failing PyPTO test passes (1 passed in 9.77s).
The same change should be mirrored in include/pto/npu/a5/TPush.hpp:~TPipe() (same algebra, same template pattern using the split-axis allocate<> overload).
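As a sanity check on the algebra, the closed-form residual from the Root Cause section can be evaluated with the current and the patched drain counts. This is a sketch under the same assumptions as that derivation (tiles ≥ slotNum, both multiples of syncPeriod); `residual` and `patchedDestructorWaits` are hypothetical helper names, not functions from the headers.

```cpp
#include <cstdint>

// Closed-form residual on FlagID+1 from the derivation above.
// Valid only when tiles >= slotNum and both are multiples of syncPeriod.
constexpr int64_t residual(int64_t slotNum, int64_t syncPeriod,
                           int64_t tiles, int64_t destructorWaits) {
    return (syncPeriod + tiles / syncPeriod)
         - ((tiles - slotNum) / syncPeriod + destructorWaits);
}

// Drain count used by the proposed destructor (kDestructorWaits).
constexpr int64_t patchedDestructorWaits(int64_t slotNum, int64_t syncPeriod) {
    return syncPeriod + (slotNum > 1 ? slotNum / syncPeriod : 0);
}

// Current destructor drains SyncPeriod = 4 and leaves 2 pulses behind.
static_assert(residual(8, 4, 16, 4) == 2, "current destructor leaks");
// Patched destructor drains 4 + 8/4 = 6 and balances.
static_assert(patchedDestructorWaits(8, 4) == 6, "patched drain count");
static_assert(residual(8, 4, 16, patchedDestructorWaits(8, 4)) == 0, "balanced");
static_assert(residual(8, 4, 64, patchedDestructorWaits(8, 4)) == 0, "balanced");
```

The closed form says nothing about tile counts below SlotNum or counts that are not multiples of SyncPeriod; those would need to be checked separately against the real headers.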
Why the CV regression test didn't catch this
The new test in tests/npu/a2a3/src/st/testcase/tpushpop_cv/tpushpop_cv_kernel.cpp exercises only FIFO_DEPTH = 1 (i.e. SlotNum == 1), which short-circuits both shouldWaitFree and shouldNotifyFree to true and trivially preserves balance. The leak only appears for SlotNum ≥ 4.
Adding a SlotNum=8, SyncPeriod=4 variant to that test (and asserting absence of residual FFTS counts after a few invocations) would catch this class of bug.
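The coverage gap can be illustrated with a host-side model that accumulates the residual over several invocations. It is a sketch only: it does not read FFTS counters on a device, `accumulatedResidual` and the two predicate helpers are hypothetical names, and SyncPeriod = 1 for SlotNum = 1 is an assumption (the issue does not state that value). It needs C++14 or later.

```cpp
#include <cstdint>

// Predicates as quoted in the issue, with SlotNum/SyncPeriod as plain arguments.
constexpr bool waitFree(uint32_t slotNum, uint32_t sp, uint32_t i) {
    return slotNum == 1 ? true : (i >= slotNum && i % sp == 0);
}
constexpr bool notifyFree(uint32_t slotNum, uint32_t sp, uint32_t i) {
    return slotNum == 1 ? true : ((i + 1) % sp == 0);
}

// Residual free pulses after `invocations` kernel runs of `tiles` tiles each,
// using the current destructor (which drains exactly `sp` pulses).
constexpr int64_t accumulatedResidual(uint32_t slotNum, uint32_t sp,
                                      uint32_t tiles, uint32_t invocations) {
    int64_t total = 0;
    for (uint32_t run = 0; run < invocations; ++run) {
        total += sp;  // constructor: cons.free() x SyncPeriod
        for (uint32_t i = 0; i < tiles; ++i) {
            if (notifyFree(slotNum, sp, i)) ++total;  // consumer frees
            if (waitFree(slotNum, sp, i)) --total;    // producer waits
        }
        total -= sp;  // destructor: prod.allocate() x SyncPeriod
    }
    return total;
}

// FIFO_DEPTH = 1: every tile frees and waits once, so nothing accumulates.
static_assert(accumulatedResidual(1, 1, 16, 3) == 0, "SlotNum == 1 stays balanced");
// SlotNum = 8, SyncPeriod = 4: 2 pulses per invocation, 6 after three runs.
static_assert(accumulatedResidual(8, 4, 16, 3) == 6, "leak grows with invocations");
```

In this model the SlotNum = 1 case balances for any tile count, which is consistent with the existing test passing, while the 8/4 variant shows a nonzero count after the first invocation.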
Environment
- CANN: 9.0.0
- Platform: A2/A3 NPU, single device
- pto-isa: d779cd01 (also reproduces on 687af1a6)
- Downstream: PyPTO tests/st/runtime/test_qwen3_decode_scope3_mixed.py