Platform
a2a3 (Ascend 910B/C hardware) — also affects a5 (same code path)
Runtime Variant
tensormap_and_ringbuffer
Description
Scheduler::log_stall_diagnostics() reads each stuck task's kernel id from slot_state.task->kernel_id[0] and emits one line per task. For tasks where slot 0 is an epoch / scope sync token (no real kernel attached), kernel_id[0] == -1. In a real stall dump, this means nearly every stuck entry shows kernel_id=-1, even though the underlying graph has 17 named kernels — only the one task whose slot 0 happens to be a real kernel ends up with a useful id.
Concretely, in the orchestration I was debugging (17 incore kernels, ~1099 submitted tasks, ~440 stuck after timeout), the diagnostic dump produced one entry with kernel_id=14 (sh_w2_matmul, AIC) and every other STUCK-WAIT / STUCK-READY entry showed kernel_id=-1. With only task_id as a handle, the user has to manually walk the orchestration cpp's rt_submit_aiv_task(N, ...) / rt_submit_aic_task(N, ...) call sites and reconstruct the per-ring task numbering to find which kernel a stalled task corresponds to. That defeats the purpose of the diagnostic.
Steps to Reproduce
- Build any
tensormap_and_ringbuffer orchestration that uses scope tokens / epochs (i.e. has tasks whose slot 0 is a sync token rather than a kernel). The deepseek-v4 moe_expert build_output (pypto-lib/models/deepseek/v4/deepseek_v4_decode_moe_expert.py) reproduces it with 17 incore kernels.
- Force a stall (e.g. by mis-wiring a ringbuffer notify on the producer side so the consumer never gets fired).
- Wait for
Thread N: PTO2 timeout after 800001 idle iterations.
- Inspect
build_output/simpler_log/debug/device-*/device-*.log for the log_stall_diagnostics block (scheduler_cold_path.cpp:119 STUCK-READY / :127 STUCK-WAIT).
Expected Behavior
Each STUCK-WAIT / STUCK-READY line names a real kernel for the task, or — when the slot really is an epoch token with no kernel — names the producing kernel for that token (or labels the entry as `` / `` rather than kernel_id=-1). Either way, the dump should let a user identify which kernel is blocked without cross-referencing the orchestration cpp by hand.
Actual Behavior
In device-1931057_20260509170924881.log (under build_output/simpler_log/debug/device-15/), the stall dump per scheduler thread contained ~440 stuck task lines. Excerpted:
```
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=1 task_id=4294967298 kernel_id=-1 refcount=17 fanin=18 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:119] STUCK-READY ring=2 task_id=8589934630 kernel_id=-1 refcount=5 fanin=5 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=2 task_id=8589934631 kernel_id=-1 refcount=3 fanin=4 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=2 task_id=8589934657 kernel_id=14 refcount=3 fanin=4 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=2 task_id=8589934658 kernel_id=-1 refcount=3 fanin=5 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:119] STUCK-READY ring=3 task_id=12884901889 kernel_id=-1 refcount=2 fanin=2 state=0"
... (repeats with kernel_id=-1 for 100s of entries) ...
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:132] scan result: stuck_ready=11 stuck_waiting=436 in_flight=0"
```
Of those ~440 entries, exactly one (task_id=8589934657) carried a real kernel id (14). The other 439 are -1, even though every one of them was launched from a kernel via rt_submit_aiv_task(...) / rt_submit_aic_task(...).
Root Cause (source pointer)
runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp line 107 (and the identical line in the a5 variant under the same path):
```cpp
int32_t kid = slot_state.task->kernel_id[0]; // only slot 0
...
LOG_INFO_V9(
" STUCK-WAIT ring=%d task_id=%" PRId64 " kernel_id=%d ...",
..., kid, ...
);
```
Tasks whose slot 0 is a sync/epoch token always print -1 here. The same file later reads kernel_id[diag_slot] for running cores (line 150) using core_exec_states_[cid].running_subslot, so the mechanism for picking the right slot already exists for the running-core dump — it just isn't applied to the slot-state dump.
Proposed Fix Sketch
- Iterate
kernel_id[0..PTO2_TASK_MAX_SUBSLOTS-1] and print the first non--1 entry (or the whole array), rather than hardcoding [0].
- When the entire array is
-1, label the slot as a sync/epoch token instead of printing kernel_id=-1 — and, if cheaply available, name the upstream kernel that produced the token (the most useful thing for a user staring at a stall).
Either change is a localized edit to scheduler_cold_path.cpp (two log sites).
Git Commit ID
61cad74dcefd70deeff318e8b538c404dcc407c1 — verified the buggy line is still present on origin/main HEAD.
Host Platform
Linux (aarch64)
Additional Context
- Log evidence:
build_output/simpler_log/debug/device-15/device-1931057_20260509170924881.log (lines 1055-1132, three full stall reports — one per scheduler thread, all show the same pattern).
- Stall context: AIVs busy, AICs all idle,
in_flight=0 — i.e. the diagnostic is exactly the tool a user would lean on, and it's currently uninformative.
- Reproducer build_output:
pypto-lib/build_output/_jit_moe_expert_test_20260509_164046/ from models/deepseek/v4/deepseek_v4_decode_moe_expert.py.
Platform
a2a3 (Ascend 910B/C hardware) — also affects a5 (same code path)
Runtime Variant
tensormap_and_ringbuffer
Description
Scheduler::log_stall_diagnostics()reads each stuck task's kernel id fromslot_state.task->kernel_id[0]and emits one line per task. For tasks where slot 0 is an epoch / scope sync token (no real kernel attached),kernel_id[0] == -1. In a real stall dump, this means nearly every stuck entry showskernel_id=-1, even though the underlying graph has 17 named kernels — only the one task whose slot 0 happens to be a real kernel ends up with a useful id.Concretely, in the orchestration I was debugging (17 incore kernels, ~1099 submitted tasks, ~440 stuck after timeout), the diagnostic dump produced one entry with
kernel_id=14(sh_w2_matmul, AIC) and every other STUCK-WAIT / STUCK-READY entry showedkernel_id=-1. With onlytask_idas a handle, the user has to manually walk the orchestration cpp'srt_submit_aiv_task(N, ...)/rt_submit_aic_task(N, ...)call sites and reconstruct the per-ring task numbering to find which kernel a stalled task corresponds to. That defeats the purpose of the diagnostic.Steps to Reproduce
tensormap_and_ringbufferorchestration that uses scope tokens / epochs (i.e. has tasks whose slot 0 is a sync token rather than a kernel). The deepseek-v4 moe_expert build_output (pypto-lib/models/deepseek/v4/deepseek_v4_decode_moe_expert.py) reproduces it with 17 incore kernels.Thread N: PTO2 timeout after 800001 idle iterations.build_output/simpler_log/debug/device-*/device-*.logfor thelog_stall_diagnosticsblock (scheduler_cold_path.cpp:119STUCK-READY /:127STUCK-WAIT).Expected Behavior
Each STUCK-WAIT / STUCK-READY line names a real kernel for the task, or — when the slot really is an epoch token with no kernel — names the producing kernel for that token (or labels the entry as `` / `` rather than
kernel_id=-1). Either way, the dump should let a user identify which kernel is blocked without cross-referencing the orchestration cpp by hand.Actual Behavior
In
device-1931057_20260509170924881.log(underbuild_output/simpler_log/debug/device-15/), the stall dump per scheduler thread contained ~440 stuck task lines. Excerpted:```
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=1 task_id=4294967298 kernel_id=-1 refcount=17 fanin=18 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:119] STUCK-READY ring=2 task_id=8589934630 kernel_id=-1 refcount=5 fanin=5 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=2 task_id=8589934631 kernel_id=-1 refcount=3 fanin=4 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=2 task_id=8589934657 kernel_id=14 refcount=3 fanin=4 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:127] STUCK-WAIT ring=2 task_id=8589934658 kernel_id=-1 refcount=3 fanin=5 state=0"
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:119] STUCK-READY ring=3 task_id=12884901889 kernel_id=-1 refcount=2 fanin=2 state=0"
... (repeats with kernel_id=-1 for 100s of entries) ...
log_stall_diagnostics [V9] "[scheduler_cold_path.cpp:132] scan result: stuck_ready=11 stuck_waiting=436 in_flight=0"
```
Of those ~440 entries, exactly one (
task_id=8589934657) carried a real kernel id (14). The other 439 are-1, even though every one of them was launched from a kernel viart_submit_aiv_task(...)/rt_submit_aic_task(...).Root Cause (source pointer)
runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppline 107 (and the identical line in thea5variant under the same path):```cpp
int32_t kid = slot_state.task->kernel_id[0]; // only slot 0
...
LOG_INFO_V9(
" STUCK-WAIT ring=%d task_id=%" PRId64 " kernel_id=%d ...",
..., kid, ...
);
```
Tasks whose slot 0 is a sync/epoch token always print
-1here. The same file later readskernel_id[diag_slot]for running cores (line 150) usingcore_exec_states_[cid].running_subslot, so the mechanism for picking the right slot already exists for the running-core dump — it just isn't applied to the slot-state dump.Proposed Fix Sketch
kernel_id[0..PTO2_TASK_MAX_SUBSLOTS-1]and print the first non--1entry (or the whole array), rather than hardcoding[0].-1, label the slot as a sync/epoch token instead of printingkernel_id=-1— and, if cheaply available, name the upstream kernel that produced the token (the most useful thing for a user staring at a stall).Either change is a localized edit to
scheduler_cold_path.cpp(two log sites).Git Commit ID
61cad74dcefd70deeff318e8b538c404dcc407c1— verified the buggy line is still present onorigin/mainHEAD.Host Platform
Linux (aarch64)
Additional Context
build_output/simpler_log/debug/device-15/device-1931057_20260509170924881.log(lines 1055-1132, three full stall reports — one per scheduler thread, all show the same pattern).in_flight=0— i.e. the diagnostic is exactly the tool a user would lean on, and it's currently uninformative.pypto-lib/build_output/_jit_moe_expert_test_20260509_164046/frommodels/deepseek/v4/deepseek_v4_decode_moe_expert.py.