[Bug] Stall diagnostics print kernel_id=-1 for almost all stuck tasks

### Platform

a2a3 (Ascend 910B/C hardware) — also affects a5 (same code path)

### Runtime Variant

tensormap_and_ringbuffer

### Description

`Scheduler::log_stall_diagnostics()` reads each stuck task's kernel id from `slot_state.task->kernel_id[0]` and emits one line per task. For tasks where slot 0 is an epoch / scope sync token (no real kernel attached), `kernel_id[0] == -1`. In a real stall dump, this means nearly every stuck entry shows `kernel_id=-1`, even though the underlying graph has 17 named kernels — only the one task whose slot 0 happens to be a real kernel ends up with a useful id.

Concretely, in the orchestration I was debugging (17 incore kernels, ~1099 submitted tasks, ~440 stuck after timeout), the diagnostic dump produced **one** entry with `kernel_id=14` (`sh_w2_matmul`, AIC) and **every other** STUCK-WAIT / STUCK-READY entry showed `kernel_id=-1`. With only `task_id` as a handle, the user has to manually walk the orchestration cpp's `rt_submit_aiv_task(N, ...)` / `rt_submit_aic_task(N, ...)` call sites and reconstruct the per-ring task numbering to find which kernel a stalled task corresponds to. That defeats the purpose of the diagnostic.

### Steps to Reproduce

1. Build any `tensormap_and_ringbuffer` orchestration that uses scope tokens / epochs (i.e. has tasks whose slot 0 is a sync token rather than a kernel). The deepseek-v4 moe_expert build_output (`pypto-lib/models/deepseek/v4/deepseek_v4_decode_moe_expert.py`) reproduces it with 17 incore kernels.
2. Force a stall (e.g. by mis-wiring a ringbuffer notify on the producer side so the consumer never gets fired).
3. Wait for `Thread N: PTO2 timeout after 800001 idle iterations`.
4. Inspect `build_output/simpler_log/debug/device-*/device-*.log` for the `log_stall_diagnostics` block (`scheduler_cold_path.cpp:119` STUCK-READY / `:127` STUCK-WAIT).

### Expected Behavior

Each STUCK-WAIT / STUCK-READY line names a real kernel for the task, or — when the slot really is an epoch token with no kernel — names the producing kernel for that token (or labels the entry as \`<epoch>\` / \`<scope-sync>\` rather than `kernel_id=-1`). Either way, the dump should let a user identify which kernel is blocked without cross-referencing the orchestration cpp by hand.

### Actual Behavior

In `device-1931057_20260509170924881.log` (under `build_output/simpler_log/debug/device-15/`), the stall dump per scheduler thread contained ~440 stuck task lines. Excerpted:

\`\`\`
log_stall_diagnostics [V9] \"[scheduler_cold_path.cpp:127]   STUCK-WAIT   ring=1 task_id=4294967298 kernel_id=-1 refcount=17 fanin=18 state=0\"
log_stall_diagnostics [V9] \"[scheduler_cold_path.cpp:119]   STUCK-READY  ring=2 task_id=8589934630 kernel_id=-1 refcount=5 fanin=5 state=0\"
log_stall_diagnostics [V9] \"[scheduler_cold_path.cpp:127]   STUCK-WAIT   ring=2 task_id=8589934631 kernel_id=-1 refcount=3 fanin=4 state=0\"
log_stall_diagnostics [V9] \"[scheduler_cold_path.cpp:127]   STUCK-WAIT   ring=2 task_id=8589934657 kernel_id=14 refcount=3 fanin=4 state=0\"
log_stall_diagnostics [V9] \"[scheduler_cold_path.cpp:127]   STUCK-WAIT   ring=2 task_id=8589934658 kernel_id=-1 refcount=3 fanin=5 state=0\"
log_stall_diagnostics [V9] \"[scheduler_cold_path.cpp:119]   STUCK-READY  ring=3 task_id=12884901889 kernel_id=-1 refcount=2 fanin=2 state=0\"
... (repeats with kernel_id=-1 for 100s of entries) ...
log_stall_diagnostics [V9] \"[scheduler_cold_path.cpp:132]   scan result: stuck_ready=11 stuck_waiting=436 in_flight=0\"
\`\`\`

Of those ~440 entries, exactly one (`task_id=8589934657`) carried a real kernel id (14). The other 439 are `-1`, even though every one of them was launched from a kernel via `rt_submit_aiv_task(...)` / `rt_submit_aic_task(...)`.

### Root Cause (source pointer)

`runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp` line 107 (and the identical line in the `a5` variant under the same path):

\`\`\`cpp
int32_t kid = slot_state.task->kernel_id[0];   // only slot 0
...
LOG_INFO_V9(
    \"  STUCK-WAIT   ring=%d task_id=%\" PRId64 \" kernel_id=%d ...\",
    ..., kid, ...
);
\`\`\`

Tasks whose slot 0 is a sync/epoch token always print `-1` here. The same file later reads `kernel_id[diag_slot]` for running cores (line 150) using `core_exec_states_[cid].running_subslot`, so the mechanism for picking the right slot already exists for the running-core dump — it just isn't applied to the slot-state dump.

### Proposed Fix Sketch

1. Iterate `kernel_id[0..PTO2_TASK_MAX_SUBSLOTS-1]` and print the first non-`-1` entry (or the whole array), rather than hardcoding `[0]`.
2. When the entire array is `-1`, label the slot as a sync/epoch token instead of printing `kernel_id=-1` — and, if cheaply available, name the upstream kernel that produced the token (the most useful thing for a user staring at a stall).

Either change is a localized edit to `scheduler_cold_path.cpp` (two log sites).

### Git Commit ID

`61cad74dcefd70deeff318e8b538c404dcc407c1` — verified the buggy line is still present on `origin/main` HEAD.

### Host Platform

Linux (aarch64)

### Additional Context

- Log evidence: `build_output/simpler_log/debug/device-15/device-1931057_20260509170924881.log` (lines 1055-1132, three full stall reports — one per scheduler thread, all show the same pattern).
- Stall context: AIVs busy, AICs all idle, `in_flight=0` — i.e. the diagnostic is exactly the tool a user would lean on, and it's currently uninformative.
- Reproducer build_output: `pypto-lib/build_output/_jit_moe_expert_test_20260509_164046/` from `models/deepseek/v4/deepseek_v4_decode_moe_expert.py`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Stall diagnostics print kernel_id=-1 for almost all stuck tasks #731

Platform

Runtime Variant

Description

Steps to Reproduce

Expected Behavior

Actual Behavior

Root Cause (source pointer)

Proposed Fix Sketch

Git Commit ID

Host Platform

Additional Context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Bug] Stall diagnostics print kernel_id=-1 for almost all stuck tasks #731

Description

Platform

Runtime Variant

Description

Steps to Reproduce

Expected Behavior

Actual Behavior

Root Cause (source pointer)

Proposed Fix Sketch

Git Commit ID

Host Platform

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions