Multi-cid dispatch broken: two distinct ChipCallables on one chip child either fire wrong kernel or stream-timeout #759

## Summary

After #710 introduced the `register + run(cid)` ABI, dispatching **two
distinct ChipCallables (different orch SO binaries) to the same chip child
on L3** fails in two flavors:

- **Wrong-kernel-fired**: both cids end up running the same kernel (one of
  the two), so one callable's output is wrong.
- **Stream sync timeout**: AICPU stream hangs (`aclrtSynchronizeStreamWithTimeout`
  returns `ACL_ERROR_RT_STREAM_SYNC_TIMEOUT` after 2000ms).

Single-ChipCallable L3 runs work end-to-end. Two cids sharing **the same**
orch SO also work (already covered by
`tests/st/<plat>/<runtime>/prepared_callable/test_prepared_callable.py`).

The case that is **not** covered upstream — and that breaks — is two cids
backed by two **different** orch SO binaries on one chip child.

## Reproducer (in pypto)

The failure surfaces downstream in PyPTO PR
[hw-native-sys/pypto#1344](https://github.com/hw-native-sys/pypto/pull/1344)
(submodule bumped to `5a76f4f8`). Reproducing tests:

- `tests/st/distributed/test_l3_distributed.py::TestL3Dependency::test_execute_inline`
  — 2 devices, 2 inline `pl.at()` blocks generating 2 ChipCallables.
- `tests/st/distributed/test_l3_parallel_reduce.py::TestL3ParallelReduce::test_execute`
  — 1 device, 2 distinct ChipCallables (`chip_orch_add` + `chip_orch_sub`),
  1 SubWorker reducing both outputs.

Sibling case that **passes**:
`tests/st/distributed/test_l3_distributed.py::TestL3Dependency::test_execute`
— 1 device, single ChipCallable, 1 SubWorker.

## Observed behavior

### `test_l3_parallel_reduce::test_execute`

```text
expected f = (a+b) + (a-b) = 2a = 4.0
got      f = -2.0           = 2·(a-b)
```

Both `sum_ab` and `diff_ab` come back holding `a-b = -1`. The pattern is
consistent with **both `submit_next_level` dispatches running the second
callable's kernel**, or with the AICPU resolving both cids to the same
`orch_so_table_` slot.
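The arithmetic can be checked by hand. A minimal sketch, assuming inputs `a = 2` and `b = 3` (inferred from `2a = 4.0` and `a-b = -1`, not printed by the test) and using hypothetical stand-ins for the two kernels:

```python
# Assumed inputs, inferred from the expected/observed values above.
a, b = 2.0, 3.0

def kernel_add(x, y):  # hypothetical stand-in for chip_orch_add
    return x + y

def kernel_sub(x, y):  # hypothetical stand-in for chip_orch_sub
    return x - y

# Correct dispatch: each cid runs its own kernel.
expected = kernel_add(a, b) + kernel_sub(a, b)  # (a+b) + (a-b) = 2a

# Faulty dispatch: both cids run the subtract kernel.
observed = kernel_sub(a, b) + kernel_sub(a, b)  # 2*(a-b)

print(expected, observed)  # 4.0 -2.0
```

Under these assumed inputs, only the "both dispatches ran the subtract kernel" case reproduces `-2.0`; both running the add kernel would give `10.0` instead.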

### `test_l3_distributed::test_execute_inline`

```text
[ERROR] Stream sync timeout: stream=AICPU timeout_ms=2000 device_id=0 block_dim=3
        runtime/src/a2a3/platform/onboard/host/device_runner.cpp:737
[ERROR] PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
RuntimeError: WorkerThread::dispatch_process: child failed (code=1):
              chip_process dev=0: RuntimeError: run_prepared failed with code 507046
```

## Expected behavior

Registering two ChipCallables with distinct orch SO binaries on one L3
Worker, then dispatching both to the same chip child via
`orch.submit_next_level(cid, args, cfg)`, should run **each callable's own
kernel** against its own args, with no cross-callable interference.

## Suspected area

PR #710 added `orch_so_table_[MAX_REGISTERED_CALLABLE_IDS]` on the AICPU
and `orch_so_dedup_` (keyed by ELF Build-ID) on the host DeviceRunner.
The upstream coverage in `test_prepared_callable.py` only exercises **same
orch SO under two cids**, so the multi-distinct-SO case has no test
locking the dispatch table down. Likely candidates:

- AICPU `orch_so_table_[callable_id]` indexing / dlopen routing when two
  cids resolve to distinct Build-IDs but still share some piece of state.
- Host `orch_so_dedup_` Build-ID hashing / refcounting when both entries
  land under one chip child.
- `prepare_callable` interleaving in `_chip_process_loop` /
  `_chip_process_loop_with_bootstrap` when the parent prewarms two cids
  back-to-back via `_CTRL_PREPARE`.
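To illustrate how a shared slot would produce the wrong-kernel symptom, here is a toy model. It is not the real AICPU or DeviceRunner code: `ToyChipChild`, `share_slot`, `prepare`, and `run` are hypothetical names, and the bug it simulates is only one of the candidates listed above.

```python
class ToyChipChild:
    """Toy model only; not the real AICPU or DeviceRunner code."""

    def __init__(self, share_slot=False):
        self.share_slot = share_slot  # True simulates the suspected bug
        self.table = {}               # slot -> kernel, like a per-cid table

    def _slot(self, cid):
        # Buggy variant maps every cid onto slot 0.
        return 0 if self.share_slot else cid

    def prepare(self, cid, kernel):
        # With a shared slot, the last prepare overwrites the first.
        self.table[self._slot(cid)] = kernel

    def run(self, cid, a, b):
        return self.table[self._slot(cid)](a, b)


def add(a, b):
    return a + b


def sub(a, b):
    return a - b


def dispatch_both(chip):
    chip.prepare(0, add)
    chip.prepare(1, sub)
    return chip.run(0, 2.0, 3.0), chip.run(1, 2.0, 3.0)


print(dispatch_both(ToyChipChild()))                 # (5.0, -1.0)
print(dispatch_both(ToyChipChild(share_slot=True)))  # (-1.0, -1.0)
```

In the shared-slot variant both results equal `a-b`, matching the `sum_ab == diff_ab == -1` observation. The model says nothing about the stream-timeout flavor.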

## Test gap to add

A new prepared_callable scenario that:

1. Builds **two distinct** ChipCallables (e.g. `kernel_add` + `kernel_sub`).
2. Prepares both under different cids on one chip.
3. Runs `cid_A` then `cid_B` and verifies **each** writes the correct
   independent output (not the other's).

This is the case the downstream pypto L3 distributed tests actually hit
in production usage; the new test should pin the contract.
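The three steps above could take roughly the following shape. This is a sketch under stated assumptions: `FakeChip` and its `prepare` / `run` methods are hypothetical stand-ins so the example runs on its own. A real test would swap in the fixtures already used by `test_prepared_callable.py`, whose API is not shown here.

```python
# Hypothetical stand-in for a chip child; replace with the real fixture.
class FakeChip:
    def __init__(self):
        self._callables = {}

    def prepare(self, cid, fn):
        self._callables[cid] = fn

    def run(self, cid, a, b, out):
        # Write results in place, as a kernel would into its output buffer.
        out[:] = [self._callables[cid](x, y) for x, y in zip(a, b)]


def kernel_add(x, y):
    return x + y


def kernel_sub(x, y):
    return x - y


def check_two_distinct_callables(chip):
    a, b = [2.0] * 4, [3.0] * 4
    cid_a, cid_b = 0, 1

    # Step 2: prepare both callables under different cids on one chip.
    chip.prepare(cid_a, kernel_add)
    chip.prepare(cid_b, kernel_sub)

    # Step 3: run cid_A then cid_B, each into its own output buffer.
    out_a, out_b = [0.0] * 4, [0.0] * 4
    chip.run(cid_a, a, b, out_a)
    chip.run(cid_b, a, b, out_b)

    assert out_a == [5.0] * 4, "cid_A must run the add kernel"
    assert out_b == [-1.0] * 4, "cid_B must run the sub kernel"
    return out_a, out_b


check_two_distinct_callables(FakeChip())
```

Inputs with `b != 0` make `a+b` and `a-b` differ, so a wrong-kernel dispatch shows up as a failed assertion rather than a silent pass.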

## Context

- Submodule bump in pypto PR #1344 pins `simpler` to `5a76f4f8`.
- Failing run:
  https://github.com/hw-native-sys/pypto/actions/runs/25718625877/job/75514158855
- Upstream HEAD as of today is not known to fix this: the newer commits
  (67a405ea, 0ff1b24e, cf15368b, 76543e11, d81866a9, 6f022e72) land mostly
  diagnostics and refactors, and have not been verified to address this
  particular case.

