
TPipe destructor leaks FFTS free signals on a2a3/a5 (regression in 687af1a6) #127

@YunjiQin


Summary

Commit 687af1a "Optimize reverse dependencies with sync periods" introduces a signal-count imbalance in TPipe<> on both a2a3 (include/pto/npu/a2a3/TPush.hpp) and a5 (include/pto/npu/a5/TPush.hpp). For any pipeline with SlotNum > SyncPeriod (e.g. the common SlotNum=8, SyncPeriod=4 case), each kernel invocation leaks SlotNum / SyncPeriod FFTS free signals into the cross-core flag register. On NPU hardware (A2/A3) this manifests as CANN runtime error 507018 (ACL_ERROR_RT_AICPU_EXCEPTION) after one or more invocations.

The bug is still present on main at d779cd01 — no commits between 687af1a6..d779cd01 modify the relevant headers.

Reproduction

Downstream PyPTO test (NPU A2/A3, real hardware):

pytest tests/st/runtime/test_qwen3_decode_scope3_mixed.py \
       --platform a2a3 --device <N> --save-kernels \
       --pto-isa-commit=d779cd0

Result on d779cd01:

RuntimeError: run_prepared failed with code 507018
[ERROR] aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018

Reverting to the prior commit fcc6f420 makes the test pass. Applying the patch in §Proposed Fix on top of d779cd01 also makes the test pass.

The failing kernels (scope3_incore_{1,4,5}) use:

TPipe<0, Direction::DIR_C2V, /*SlotSize=*/4096, /*SlotNum=*/8,
      /*LocalSlotNum=*/8, /*IsNoSplit=*/false>
// => SyncPeriod = SlotNum / 2 = 4

Root Cause (numerical)

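Consider a producer/consumer loop of T tiles with SlotNum = 8, SyncPeriod = 4 (abbreviated SP), where T ≥ SlotNum and T is a multiple of SyncPeriod so that the per-stage counts are exact: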

| Stage | Code site | Signals (free pulses to FlagID+1) |
|---|---|---|
| Constructor (consumer) | TPipe::TPipe runs cons.free() × SyncPeriod | +4 |
| In-loop consumer | TPOP_IMPL calls cons.free() iff shouldNotifyFree(i) = ((i+1) % SP) == 0 | +T / 4 |
| In-loop producer | TPUSH_IMPL calls prod.allocate() iff shouldWaitFree(i) = (i ≥ SlotNum) && (i % SP) == 0 | −(T − 8) / 4 |
| Destructor (producer) | TPipe::~TPipe runs prod.allocate() × SyncPeriod | −4 |

Net residual on FlagID+1 per kernel invocation:

(SyncPeriod + T/SP) - ((T - SlotNum)/SP + SyncPeriod)
= SlotNum / SyncPeriod
= 8 / 4
= 2  (leaked free signals)

The asymmetry comes from shouldWaitFree skipping the first SlotNum tiles (the "startup protection") while shouldNotifyFree does not skip the matching tail; the destructor doesn't compensate either.

Relevant code (a2a3, include/pto/npu/a2a3/TPush.hpp):

// L53-72
PTO_INTERNAL static bool shouldWaitFree(uint32_t tileIndex) {
    if constexpr (SlotNum == 1) return true;
    else {
        if (tileIndex < SlotNum) return false;   // <-- skips first SlotNum
        return (tileIndex % SyncPeriod) == 0;
    }
}
PTO_INTERNAL static bool shouldNotifyFree(uint32_t tileIndex) {
    if constexpr (SlotNum == 1) return true;
    else return ((tileIndex + 1) % SyncPeriod) == 0;   // <-- no matching skip
}

// L444-450
PTO_INTERNAL ~TPipe() {
    for (uint32_t i = 0; i < SyncPeriod; ++i) {
        prod.allocate();    // <-- drains SyncPeriod, but needs SyncPeriod + SlotNum/SyncPeriod
    }
}

The a5 backend has the identical pattern at include/pto/npu/a5/TPush.hpp:611-621.

Proposed Fix

Increase the destructor drain count to balance the constructor + in-loop pulse counts:

PTO_INTERNAL ~TPipe()
{
    constexpr uint32_t kSkippedBatches = (SlotNum > 1) ? (SlotNum / SyncPeriod) : 0;
    constexpr uint32_t kDestructorWaits = SyncPeriod + kSkippedBatches;
    for (uint32_t i = 0; i < kDestructorWaits; ++i) {
        prod.allocate();
    }
}

Verified on real NPU hardware: with this patch applied on top of d779cd01, the failing PyPTO test passes (1 passed in 9.77s).

The same change should be mirrored in include/pto/npu/a5/TPush.hpp:~TPipe() (same algebra, same template pattern using the split-axis allocate<> overload).

Why the CV regression test didn't catch this

The new test in tests/npu/a2a3/src/st/testcase/tpushpop_cv/tpushpop_cv_kernel.cpp exercises only FIFO_DEPTH = 1 (i.e. SlotNum == 1), which short-circuits both shouldWaitFree and shouldNotifyFree to true and trivially preserves balance. The leak only appears for SlotNum ≥ 4.

Adding a SlotNum=8, SyncPeriod=4 variant to that test (and asserting absence of residual FFTS counts after a few invocations) would catch this class of bug.

Environment

  • CANN: 9.0.0
  • Platform: A2/A3 NPU, single device
  • pto-isa: d779cd01 (also reproduces on 687af1a6)
  • Downstream: PyPTO tests/st/runtime/test_qwen3_decode_scope3_mixed.py
