Summary
After #710 introduced the register + run(cid) ABI, dispatching two
distinct ChipCallables (different orch SO binaries) to the same chip child
on L3 fails in two flavors:
- Wrong-kernel-fired: both cids end up running one of the two kernels.
- Stream sync timeout: the AICPU stream hangs
  (aclrtSynchronizeStreamWithTimeout returns
  ACL_ERROR_RT_STREAM_SYNC_TIMEOUT after 2000 ms).
Single-ChipCallable L3 runs work end-to-end. Two cids sharing the same
orch SO also work (already covered by
tests/st/<plat>/<runtime>/prepared_callable/test_prepared_callable.py).
The case that is not covered upstream — and that breaks — is two cids
backed by two different orch SO binaries on one chip child.
Reproducer (in pypto)
Downstream surface lives in PyPTO PR
hw-native-sys/pypto#1344
(submodule bumped to 5a76f4f8). Reproducing tests:
- tests/st/distributed/test_l3_distributed.py::TestL3Dependency::test_execute_inline
  — 2 devices, 2 inline pl.at() blocks generating 2 ChipCallables.
- tests/st/distributed/test_l3_parallel_reduce.py::TestL3ParallelReduce::test_execute
  — 1 device, 2 distinct ChipCallables (chip_orch_add + chip_orch_sub),
  1 SubWorker reducing both outputs.
Sibling case that passes:
tests/st/distributed/test_l3_distributed.py::TestL3Dependency::test_execute
— 1 device, single ChipCallable, 1 SubWorker.
Observed behavior
test_l3_parallel_reduce::test_execute
expected f = (a+b) + (a-b) = 2a = 4.0
got f = -2.0 = 2·(a-b)
Both sum_ab and diff_ab come back holding a-b = -1. The pattern is
consistent with both submit_next_level dispatches running the second
callable's kernel (or symmetrically the AICPU resolving both cids to
the same orch_so_table_ slot).
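The arithmetic of that failure signature can be checked in isolation. A minimal sketch with a = 2.0, b = 3.0 — values inferred from the expected/got numbers above, not quoted from the test itself:

```python
# Inputs inferred from the report: expected 2a = 4.0 implies a = 2.0,
# got 2*(a-b) = -2.0 implies b = 3.0.
a, b = 2.0, 3.0

sum_ab = a + b    # what chip_orch_add should leave in its output
diff_ab = a - b   # what chip_orch_sub leaves in its output

# Correct dispatch: f = (a+b) + (a-b) = 2a
f_correct = sum_ab + diff_ab

# Observed failure: both cids ran the subtract kernel, so both
# buffers hold a-b and the reduction yields 2*(a-b).
f_buggy = diff_ab + diff_ab

print(f_correct, f_buggy)  # 4.0 -2.0
```

The observed -2.0 is exactly the value produced when both inputs to the reduction are the subtract kernel's output, which is what points at dispatch-table aliasing rather than a numeric bug in either kernel.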
test_l3_distributed::test_execute_inline
[ERROR] Stream sync timeout: stream=AICPU timeout_ms=2000 device_id=0 block_dim=3
runtime/src/a2a3/platform/onboard/host/device_runner.cpp:737
[ERROR] PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
RuntimeError: WorkerThread::dispatch_process: child failed (code=1):
chip_process dev=0: RuntimeError: run_prepared failed with code 507046
Expected behavior
Registering two ChipCallables with distinct orch SO binaries on one L3
Worker, then dispatching both to the same chip child via
orch.submit_next_level(cid, args, cfg), should run each callable's own
kernel against its own args, with no cross-callable interference.
Suspected area
PR #710 added orch_so_table_[MAX_REGISTERED_CALLABLE_IDS] on the AICPU
and orch_so_dedup_ (keyed by ELF Build-ID) on the host DeviceRunner.
The upstream coverage in test_prepared_callable.py only exercises same
orch SO under two cids, so the multi-distinct-SO case has no test
locking the dispatch table down. Likely candidates:
- AICPU: orch_so_table_[callable_id] indexing / dlopen routing when two
  cids resolve to distinct Build-IDs but share something in state.
- Host: orch_so_dedup_ Build-ID hashing / refcounting when both entries
  land under one chip child.
- Host: prepare_callable interleaving in _chip_process_loop /
  _chip_process_loop_with_bootstrap when the parent prewarms two cids
  back-to-back via _CTRL_PREPARE.
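To make the suspected aliasing concrete, here is a toy Python model of a per-cid dispatch table. The names mirror the report for readability; the real structures are C++ on the AICPU and host, and this is only a sketch of the contract, not the implementation:

```python
MAX_REGISTERED_CALLABLE_IDS = 8  # placeholder; real limit unknown

class OrchSoTable:
    """Toy stand-in for the AICPU-side orch_so_table_ contract."""

    def __init__(self):
        self.slots = [None] * MAX_REGISTERED_CALLABLE_IDS

    def register(self, cid, build_id, entry):
        # Contract: each cid owns its own (Build-ID, entrypoint) slot.
        # The suspected bug would be equivalent to dedup or dlopen routing
        # collapsing two distinct Build-IDs onto a single entrypoint.
        self.slots[cid] = (build_id, entry)

    def run(self, cid, *args):
        _, entry = self.slots[cid]
        return entry(*args)

table = OrchSoTable()
table.register(0, "build-id-add", lambda a, b: a + b)
table.register(1, "build-id-sub", lambda a, b: a - b)
```

With correct routing, table.run(0, 2.0, 3.0) returns 5.0 and table.run(1, 2.0, 3.0) returns -1.0; the wrong-kernel-fired flavor corresponds to both slots resolving to the subtract entrypoint.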
Test gap to add
A new prepared_callable scenario that:
- Builds two distinct ChipCallables (e.g. kernel_add + kernel_sub).
- Prepares both under different cids on one chip.
- Runs cid_A then cid_B and verifies each writes the correct
  independent output (not the other's).
This case is the one the downstream pypto L3 distributed tests actually hit
in production usage; it should pin the contract.
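The shape of that scenario, sketched against a stand-in prepare/run pair rather than the real upstream harness (prepare_callable / run_prepared here are illustrative stubs, not the actual test API):

```python
# Stand-in chip child: prepared callables keyed by cid.
_prepared = {}

def prepare_callable(cid, kernel):
    _prepared[cid] = kernel

def run_prepared(cid, *args):
    return _prepared[cid](*args)

def kernel_add(a, b):
    return a + b

def kernel_sub(a, b):
    return a - b

# Prepare both cids on one "chip" back-to-back, then run both.
prepare_callable(0, kernel_add)
prepare_callable(1, kernel_sub)

out_add = run_prepared(0, 2.0, 3.0)
out_sub = run_prepared(1, 2.0, 3.0)

# The contract under test: each cid produces its own output,
# not the other's.
assert out_add == 5.0
assert out_sub == -1.0
```

In the real test, the two asserts are exactly where the wrong-kernel-fired flavor would trip: both outputs would hold the subtract result.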
Context
Submodule commit 5a76f4f8. Failing CI run:
https://github.com/hw-native-sys/pypto/actions/runs/25718625877/job/75514158855
Commits 76543e1, d81866a, 6f022e7 land mostly diagnostics / refactor;
they haven't been verified to address this particular case.