
[Bug] Stall diagnostics print kernel_id=-1 for almost all stuck tasks #731

@zhangqi-chen


Platform

a2a3 (Ascend 910B/C hardware) — also affects a5 (same code path)

Runtime Variant

tensormap_and_ringbuffer

Description

Scheduler::log_stall_diagnostics() reads each stuck task's kernel id from slot_state.task->kernel_id[0] and emits one line per task. For tasks where slot 0 is an epoch / scope sync token (no real kernel attached), kernel_id[0] == -1. In a real stall dump, this means nearly every stuck entry shows kernel_id=-1, even though the underlying graph has 17 named kernels — only the one task whose slot 0 happens to be a real kernel ends up with a useful id.

Concretely, in the orchestration I was debugging (17 incore kernels, ~1099 submitted tasks, ~440 stuck after timeout), the diagnostic dump produced one entry with kernel_id=14 (sh_w2_matmul, AIC), and every other STUCK-WAIT / STUCK-READY entry showed kernel_id=-1. With only task_id as a handle, the user has to manually walk the rt_submit_aiv_task(N, ...) / rt_submit_aic_task(N, ...) call sites in the orchestration .cpp and reconstruct the per-ring task numbering to find which kernel a stalled task corresponds to. That defeats the purpose of the diagnostic.

Steps to Reproduce

  1. Build any tensormap_and_ringbuffer orchestration that uses scope tokens / epochs (i.e. has tasks whose slot 0 is a sync token rather than a kernel). The deepseek-v4 moe_expert build_output (pypto-lib/models/deepseek/v4/deepseek_v4_decode_moe_expert.py) reproduces it with 17 incore kernels.
  2. Force a stall (e.g. by mis-wiring a ringbuffer notify on the producer side so the consumer never gets fired).
  3. Wait for the `Thread N: PTO2 timeout after 800001 idle iterations` message.
  4. Inspect build_output/simpler_log/debug/device-*/device-*.log for the log_stall_diagnostics block (scheduler_cold_path.cpp:119 STUCK-READY / :127 STUCK-WAIT).

Expected Behavior

Each STUCK-WAIT / STUCK-READY line names a real kernel for the task, or, when the slot really is an epoch token with no kernel attached, names the producing kernel for that token (or at minimum labels the entry as a sync/epoch token rather than printing kernel_id=-1). Either way, the dump should let a user identify which kernel is blocked without cross-referencing the orchestration .cpp by hand.

Actual Behavior

In device-1931057_20260509170924881.log (under build_output/simpler_log/debug/device-15/), the stall dump per scheduler thread contained ~440 stuck task lines. Excerpted:

```
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=1 task_id=4294967298 kernel_id=-1 refcount=17 fanin=18 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:119] STUCK-READY ring=2 task_id=8589934630 kernel_id=-1 refcount=5 fanin=5 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=2 task_id=8589934631 kernel_id=-1 refcount=3 fanin=4 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=2 task_id=8589934657 kernel_id=14 refcount=3 fanin=4 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=2 task_id=8589934658 kernel_id=-1 refcount=3 fanin=5 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:119] STUCK-READY ring=3 task_id=12884901889 kernel_id=-1 refcount=2 fanin=2 state=0"
... (repeats with kernel_id=-1 for 100s of entries) ...
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:132] scan result: stuck_ready=11 stuck_waiting=436 in_flight=0"
```

Of those ~440 entries, exactly one (task_id=8589934657) carried a real kernel id (14). The other 439 are -1, even though every one of them was launched from a kernel via rt_submit_aiv_task(...) / rt_submit_aic_task(...).

Root Cause (source pointer)

runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp line 107 (and the identical line in the a5 variant under the same path):

```cpp
int32_t kid = slot_state.task->kernel_id[0]; // only slot 0
...
LOG_INFO_V9(
" STUCK-WAIT ring=%d task_id=%" PRId64 " kernel_id=%d ...",
..., kid, ...
);
```

Tasks whose slot 0 is a sync/epoch token always print -1 here. The same file later reads kernel_id[diag_slot] for running cores (line 150) using core_exec_states_[cid].running_subslot, so the mechanism for picking the right slot already exists for the running-core dump — it just isn't applied to the slot-state dump.

Proposed Fix Sketch

  1. Iterate kernel_id[0..PTO2_TASK_MAX_SUBSLOTS-1] and print the first entry that is not -1 (or the whole array), rather than hardcoding [0].
  2. When the entire array is -1, label the slot as a sync/epoch token instead of printing kernel_id=-1 — and, if cheaply available, name the upstream kernel that produced the token (the most useful thing for a user staring at a stall).

Either change is a localized edit to scheduler_cold_path.cpp (two log sites).

Git Commit ID

61cad74dcefd70deeff318e8b538c404dcc407c1 — verified the buggy line is still present on origin/main HEAD.

Host Platform

Linux (aarch64)

Additional Context

  • Log evidence: build_output/simpler_log/debug/device-15/device-1931057_20260509170924881.log (lines 1055-1132, three full stall reports — one per scheduler thread, all show the same pattern).
  • Stall context: AIVs busy, AICs all idle, in_flight=0 — i.e. the diagnostic is exactly the tool a user would lean on, and it's currently uninformative.
  • Reproducer build_output: pypto-lib/build_output/_jit_moe_expert_test_20260509_164046/ from models/deepseek/v4/deepseek_v4_decode_moe_expert.py.
