Multi-cid dispatch broken: two distinct ChipCallables on one chip child either fire wrong kernel or stream-timeout #759

@lyfne123

Summary

After #710 introduced the register + run(cid) ABI, dispatching two
distinct ChipCallables (different orch SO binaries) to the same chip child
on L3 fails in two flavors:

  • Wrong-kernel-fired: both cids end up running the same kernel (one of the two).
  • Stream sync timeout: AICPU stream hangs (aclrtSynchronizeStreamWithTimeout
    returns ACL_ERROR_RT_STREAM_SYNC_TIMEOUT after 2000ms).

Single-ChipCallable L3 runs work end-to-end. Two cids sharing the same
orch SO also work (already covered by
tests/st/<plat>/<runtime>/prepared_callable/test_prepared_callable.py).

The case that is not covered upstream — and that breaks — is two cids
backed by two different orch SO binaries on one chip child.

Reproducer (in pypto)

Downstream surface lives in PyPTO PR
hw-native-sys/pypto#1344
(submodule bumped to 5a76f4f8). Reproducing tests:

  • tests/st/distributed/test_l3_distributed.py::TestL3Dependency::test_execute_inline
    — 2 devices, 2 inline pl.at() blocks generating 2 ChipCallables.
  • tests/st/distributed/test_l3_parallel_reduce.py::TestL3ParallelReduce::test_execute
    — 1 device, 2 distinct ChipCallables (chip_orch_add + chip_orch_sub),
    1 SubWorker reducing both outputs.

Sibling case that passes:
tests/st/distributed/test_l3_distributed.py::TestL3Dependency::test_execute
— 1 device, single ChipCallable, 1 SubWorker.

Observed behavior

test_l3_parallel_reduce::test_execute

expected f = (a+b) + (a-b) = 2a = 4.0
got      f = -2.0           = 2·(a-b)

Both sum_ab and diff_ab come back holding a-b = -1. The pattern is
consistent with both submit_next_level dispatches running the second
callable's kernel (or symmetrically the AICPU resolving both cids to
the same orch_so_table_ slot).
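
The arithmetic can be checked directly. The sketch below assumes a = 2 and b = 3 (values inferred from 2a = 4.0 and a-b = -1, not quoted from the test source) and enumerates which kernel each of the two dispatches would have to fire to produce each possible f. The `kernels` dict is a plain-Python stand-in, not the real chip kernels.

```python
from itertools import product

# Assumed inputs, inferred from the report: 2a = 4.0 and a - b = -1.
a, b = 2.0, 3.0

# Plain-Python stand-ins for the two chip kernels.
kernels = {
    "add": lambda x, y: x + y,
    "sub": lambda x, y: x - y,
}

# f = sum_ab + diff_ab. Try every combination of the kernel actually fired
# for the (intended add, intended sub) dispatches.
outcomes = {
    (first, second): kernels[first](a, b) + kernels[second](a, b)
    for first, second in product(kernels, repeat=2)
}

for fired, f in outcomes.items():
    print(fired, f)
```

Under these assumptions only the (sub, sub) combination yields -2.0, which matches both outputs holding a-b. A swapped (sub, add) dispatch would still sum to 4.0, so a regression test probably needs to check sum_ab and diff_ab individually rather than only f.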

test_l3_distributed::test_execute_inline

[ERROR] Stream sync timeout: stream=AICPU timeout_ms=2000 device_id=0 block_dim=3
        runtime/src/a2a3/platform/onboard/host/device_runner.cpp:737
[ERROR] PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
RuntimeError: WorkerThread::dispatch_process: child failed (code=1):
              chip_process dev=0: RuntimeError: run_prepared failed with code 507046

Expected behavior

Registering two ChipCallables with distinct orch SO binaries on one L3
Worker, then dispatching both to the same chip child via
orch.submit_next_level(cid, args, cfg), should run each callable's own
kernel against its own args, with no cross-callable interference.

Suspected area

PR #710 added orch_so_table_[MAX_REGISTERED_CALLABLE_IDS] on the AICPU
and orch_so_dedup_ (keyed by ELF Build-ID) on the host DeviceRunner.
The upstream coverage in test_prepared_callable.py only exercises the same
orch SO under two cids, so the multi-distinct-SO case has no test
locking the dispatch table down. Likely candidates:

  • AICPU orch_so_table_[callable_id] indexing / dlopen routing when two
    cids resolve to distinct Build-IDs but still share some piece of state.
  • Host orch_so_dedup_ Build-ID hashing / refcounting when both entries
    land under one chip child.
  • prepare_callable interleaving in _chip_process_loop /
    _chip_process_loop_with_bootstrap when the parent prewarms two cids
    back-to-back via _CTRL_PREPARE.
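
As a reference point for the candidates above, the intended contract can be stated as a small model: cids with the same Build-ID may share one loaded SO, while cids with distinct Build-IDs must resolve to distinct entries. This is a toy Python model only, not the runtime's code; `ToyOrchTable`, its methods, and the build-id strings are hypothetical names chosen for illustration.

```python
class ToyOrchTable:
    """Toy stand-in for a cid -> loaded-SO table with Build-ID dedup."""

    def __init__(self):
        self._loaded = {}  # build_id -> loaded entry (dedup cache)
        self._by_cid = {}  # cid -> loaded entry

    def register(self, cid, build_id, entry):
        # Reuse an existing entry only when the Build-ID matches exactly;
        # otherwise store the new entry under its own Build-ID.
        handle = self._loaded.setdefault(build_id, entry)
        self._by_cid[cid] = handle

    def run(self, cid, *args):
        # Dispatch strictly by cid, never by "last registered".
        return self._by_cid[cid](*args)


table = ToyOrchTable()
table.register(0, "build-id-add", lambda x, y: x + y)
table.register(1, "build-id-sub", lambda x, y: x - y)
table.register(2, "build-id-add", lambda x, y: x + y)  # same SO, new cid
```

In this model cids 0 and 2 share one entry and cid 1 gets its own. The failure described in this issue would correspond to cids 0 and 1 resolving to the same entry.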

Test gap to add

A new prepared_callable scenario that:

  1. Builds two distinct ChipCallables (e.g. kernel_add + kernel_sub).
  2. Prepares both under different cids on one chip.
  3. Runs cid_A then cid_B and verifies each writes the correct
    independent output (not the other's).
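
The steps above could take roughly the following shape. This is a sketch under stated assumptions: `FakeChip`, `prepare`, and `run` are hypothetical stand-ins so the example runs on its own, and a real test would use the actual prepared_callable fixtures instead. The reverse-order run (cid_B then cid_A) is an extra check beyond the listed steps.

```python
class FakeChip:
    """Hypothetical stand-in for one chip child; not the real worker API."""

    def __init__(self):
        self._prepared = {}  # cid -> kernel

    def prepare(self, cid, kernel):
        self._prepared[cid] = kernel

    def run(self, cid, a, b, out):
        # Write the result into a caller-owned output buffer.
        out[0] = self._prepared[cid](a, b)


def kernel_add(a, b):
    return a + b


def kernel_sub(a, b):
    return a - b


def check_two_distinct_callables(chip):
    cid_a, cid_b = 0, 1
    # Steps 1 and 2: two distinct callables under different cids, one chip.
    chip.prepare(cid_a, kernel_add)
    chip.prepare(cid_b, kernel_sub)

    # Step 3, in both orders, with a separate output buffer per cid.
    for order in ((cid_a, cid_b), (cid_b, cid_a)):
        outs = {cid_a: [None], cid_b: [None]}
        for cid in order:
            chip.run(cid, 2.0, 3.0, outs[cid])
        # Check each output on its own, not just a combined value.
        assert outs[cid_a][0] == 5.0
        assert outs[cid_b][0] == -1.0
    return True


result = check_two_distinct_callables(FakeChip())
```

Asserting on each output buffer separately matters here, since a combined sum can hide a swapped dispatch.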

This is the case the downstream pypto L3 distributed tests actually hit
in production usage; the new test should pin the contract.
