The six per-component docs (orchestrator.md, scheduler.md,
worker-manager.md, task-flow.md, chip-level-arch.md,
distributed_level_runtime.md) describe the target design of the
hierarchical runtime. This page tracks what has already landed vs. what is
still in flight, so readers can tell which bits of the design are running
today and which are planned.
If you only read one file to understand "what will this look like when
it's done", read the per-component doc. If you want to know "what do I
get if I pip install main today", read this page.
- Component split: `Orchestrator` (DAG builder) / `Scheduler` (DAG executor) / `WorkerManager` + `WorkerThread` (execution layer), living in `src/common/distributed/`.
- Level model: L0–L6 as described in distributed_level_runtime.md §1. L2 (single-chip) and L3 (composite over `ChipWorker` + `SubWorker`) are implemented; L4+ recursion is not (see below).
- Unified `TaskArgs`: vector-backed builder with per-tensor `TensorArgType` tags (`INPUT` / `OUTPUT` / `INOUT` / `OUTPUT_EXISTING` / `NO_DEP`). Replaces the separate `TaggedTaskArgs` / `DynamicTaskArgs`.
- Tag-driven `submit_*` on `Orchestrator`: `submit_next_level` / `submit_next_level_group` / `submit_sub` / `submit_sub_group`. No `inputs=` / `outputs=` kwargs; tags inside the `TaskArgs` drive `tensormap.lookup` / `insert` automatically.
- `SubmitResult = {slot_id}`: downstream consumers reference output tensors by their own data pointers.
- `Worker` has no `submit` / `scope` / `drain`; those concepts belong to `Orchestrator` (accessed via `worker.get_orchestrator()`). `Orchestrator._scope_begin` / `_scope_end` / `_drain` are invoked by the Python `Worker.run` facade only.
- `orch.alloc(shape, dtype)`: runtime-owned intermediate buffer carved out of the Worker's HeapRing (a single `mmap(MAP_SHARED | MAP_ANONYMOUS)` region taken in the `DistWorker` ctor, before fork, inherited by child workers at the same virtual address). Lifetime follows a synthetic task slot; the slab is reclaimed implicitly by the allocator once all downstream consumers have completed and `last_alive` sweeps over it (see orchestrator.md §8b).
- `OUTPUT` auto-allocation: `OUTPUT`-tagged tensors submitted with `data == 0` are auto-allocated from the same HeapRing as part of the allocator call that claims the slot (1024-byte aligned). `OUTPUT` tensors with a pre-set `data` pointer are passed through untouched: a pure overwrite with no WaW dep on the prior owner. Matching L2 semantics, only `INPUT` and `INOUT` do a tensormap lookup in `infer_deps`; user code that writes into an `orch.alloc()` buffer must tag it `INOUT` so the alloc-slot stays live as a WaW producer (see orchestrator.md §8b "Tag semantics for write-after-write"). `OUTPUT_EXISTING` is never auto-allocated.
- `heap_ring_size` knob: `Worker(level=3, heap_ring_size=...)` selects the HeapRing size (default 1 GiB). The underlying `DistWorker(level, heap_ring_size)` ctor also installs fork hygiene (`setenv` of `OMP/BLIS/OPENBLAS/MKL_NUM_THREADS=1`, plus `KMP_DUPLICATE_LIB_OK=TRUE` on macOS, and a `pthread_atfork` landing pad).
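
The tag rules above can be modelled in a few lines of Python. This is a toy sketch, not the runtime's `infer_deps`: the `Tag` enum, the dict-based tensormap keyed by data pointer, and the choice of which tags insert into the map (`OUTPUT`, `INOUT`, `OUTPUT_EXISTING`) are assumptions made for illustration. It shows why a write into an `orch.alloc()` buffer needs `INOUT` to pick up the alloc slot as a producer, while `OUTPUT` does not.

```python
from enum import Enum, auto


class Tag(Enum):
    INPUT = auto()
    OUTPUT = auto()
    INOUT = auto()
    OUTPUT_EXISTING = auto()
    NO_DEP = auto()


# Per the text above: only INPUT and INOUT look up the prior producer.
LOOKUP_TAGS = {Tag.INPUT, Tag.INOUT}
# Assumption for this sketch: every writing tag records the task as the new producer.
INSERT_TAGS = {Tag.OUTPUT, Tag.INOUT, Tag.OUTPUT_EXISTING}


def infer_deps(slot_id, args, tensormap):
    """Return the producer slots this task depends on, then record its writes.

    args is a list of (data_ptr, Tag); tensormap maps data_ptr -> producer slot.
    """
    deps = set()
    for data_ptr, tag in args:
        if tag in LOOKUP_TAGS and data_ptr in tensormap:
            deps.add(tensormap[data_ptr])
    for data_ptr, tag in args:
        if tag in INSERT_TAGS:
            tensormap[data_ptr] = slot_id
    return deps


# Slot 0 stands in for the synthetic alloc slot that owns the buffer at 0x1000.
tensormap = {0x1000: 0}
deps_inout = infer_deps(1, [(0x1000, Tag.INOUT)], tensormap)    # sees the alloc slot
deps_output = infer_deps(2, [(0x1000, Tag.OUTPUT)], tensormap)  # pure overwrite, no dep
deps_input = infer_deps(3, [(0x1000, Tag.INPUT)], tensormap)    # reads the latest writer
deps_nodep = infer_deps(4, [(0x1000, Tag.NO_DEP)], tensormap)   # neither lookup nor insert
```

In this model the `OUTPUT` writer never records a dependency on the alloc slot, so nothing would keep that slot alive on its behalf; the `INOUT` writer does.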
- `Scheduler` dispatches via a single ready queue into `WorkerManager` pools (next-level + sub). Slot stores `chip_storage_list` (one `ChipStorageTaskArgs` per group worker) that dispatch passes through a `WorkerPayload` handed to `IWorker::run`.
- `DistChipProcess` / `DistSubWorker` are separate classes today; unified `WorkerThread` with `THREAD | PROCESS` modes is not yet implemented.
- Slot-ring and heap-ring share one `DistRing` (merged, matches L2-consistency audit Strict-2). One mutex guards both; FIFO reclamation via `last_alive` advances both resources at once. There is no partial-failure rollback path between slot and heap acquisition.
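
A minimal sketch of the merged-ring idea, assuming a much simpler structure than the real `DistRing`: one lock, one slot counter, one heap cursor, and a `last_alive` pointer that frees slot and heap space together in FIFO order. `ToyRing`, its fields, and the skip-to-start handling at wrap-around are hypothetical; only the 1024-byte alignment and the single-mutex, FIFO-reclaim behaviour come from the text above. Both capacity checks run before any state changes, which is one way to avoid needing a rollback path.

```python
import threading

ALIGN = 1024  # alignment quoted above for auto-allocated buffers


def align_up(n, a=ALIGN):
    return (n + a - 1) // a * a


class ToyRing:
    def __init__(self, num_slots, heap_size):
        self.num_slots = num_slots
        self.heap_size = heap_size
        self.lock = threading.Lock()  # one mutex guards slots and heap
        self.next_slot = 0            # monotonically increasing slot counter
        self.last_alive = 0           # oldest slot not yet reclaimed
        self.heap_head = 0            # monotonically increasing byte cursors
        self.heap_tail = 0
        self.heap_end = {}            # slot -> heap_head after its allocation
        self.done = set()             # released slots waiting for FIFO reclaim

    def alloc(self, nbytes):
        """Claim a slot and an aligned heap block together, or return None."""
        size = align_up(nbytes)
        with self.lock:
            if self.next_slot - self.last_alive >= self.num_slots:
                return None
            start = self.heap_head
            room_to_end = self.heap_size - start % self.heap_size
            if size > room_to_end:
                start += room_to_end  # skip the tail so the block stays contiguous
            if start + size - self.heap_tail > self.heap_size:
                return None
            slot = self.next_slot
            self.next_slot += 1
            self.heap_head = start + size
            self.heap_end[slot] = self.heap_head
            return slot, start % self.heap_size

    def release(self, slot):
        """Mark a slot finished; reclaim only from the oldest slot forward."""
        with self.lock:
            self.done.add(slot)
            while self.last_alive in self.done:
                self.done.remove(self.last_alive)
                self.heap_tail = self.heap_end.pop(self.last_alive)
                self.last_alive += 1


ring = ToyRing(num_slots=2, heap_size=4096)
first = ring.alloc(100)    # rounded up to 1024 bytes
second = ring.alloc(2000)  # rounded up to 2048 bytes
full = ring.alloc(1)       # no free slot left
ring.release(1)            # out of order: nothing reclaimed yet
alive_after_late = ring.last_alive
ring.release(0)            # now slots 0 and 1 are both swept
alive_after_both = ring.last_alive
third = ring.alloc(2048)   # does not fit in the last 1024 bytes, wraps to offset 0
```

Releasing slot 1 before slot 0 frees nothing, which is the cost of FIFO reclamation: a long-lived early slot pins everything allocated after it.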

Not yet landed:

- `IWorker::run(callable, TaskArgsView, config)`: no `WorkerPayload` wrapper; mailbox encodes a length-prefixed blob of `callable + config + args` at dispatch.
- Slot drops `chip_storage_list` and stores the `TaskArgs` itself. Child assembles `ChipStorageTaskArgs` from the view at the L2 ABI edge only.
- Strict-1 (per-scope rings, 4 depth) lands here.
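
To make "length-prefixed blob" concrete, here is a hedged sketch of one possible encoding. The wire layout is entirely an assumption: a little-endian `uint32` total length, then a `uint64` callable id, two `uint32` section lengths, and the raw `config` and `args` bytes. The real mailbox format is defined by the runtime, not by this example.

```python
import struct

HEADER = "<QII"  # assumed layout: callable id, config length, args length


def encode_blob(callable_id: int, config: bytes, args: bytes) -> bytes:
    """Pack callable + config + args behind a 4-byte length prefix."""
    body = struct.pack(HEADER, callable_id, len(config), len(args)) + config + args
    return struct.pack("<I", len(body)) + body


def decode_blob(buf: bytes):
    """Inverse of encode_blob; ignores any bytes after the prefixed length."""
    (body_len,) = struct.unpack_from("<I", buf, 0)
    body = buf[4:4 + body_len]
    if len(body) != body_len:
        raise ValueError("truncated blob")
    callable_id, cfg_len, args_len = struct.unpack_from(HEADER, body, 0)
    off = struct.calcsize(HEADER)
    config = body[off:off + cfg_len]
    args = body[off + cfg_len:off + cfg_len + args_len]
    return callable_id, config, args


blob = encode_blob(42, b"cfg", b"tensor-args")
decoded = decode_blob(blob + b"\x00" * 8)  # trailing mailbox padding is skipped
```

The length prefix lets the receiver read exactly one message out of a fixed-size mailbox without knowing the argument count in advance.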
- Fold `DistChipProcess` / `DistSubWorker` into `WorkerThread` with `Mode = THREAD | PROCESS`.
- Strict-4: 3 ready queues (AIC / AIV / MIX) instead of a single queue.
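
As a rough illustration of what splitting the ready queue could look like, the sketch below keeps one FIFO per kind. `Kind`, `ReadyQueues`, and the pop-by-kind policy are hypothetical; the design docs, not this example, define how the scheduler will actually pick work.

```python
from collections import deque
from enum import Enum


class Kind(Enum):
    AIC = "AIC"
    AIV = "AIV"
    MIX = "MIX"


class ReadyQueues:
    """One FIFO per kind instead of a single shared ready queue."""

    def __init__(self):
        self.queues = {kind: deque() for kind in Kind}

    def push(self, slot_id, kind):
        self.queues[kind].append(slot_id)

    def pop(self, kind):
        """Return the oldest ready slot of this kind, or None if there is none."""
        queue = self.queues[kind]
        return queue.popleft() if queue else None


ready = ReadyQueues()
ready.push(0, Kind.AIC)
ready.push(1, Kind.AIV)
ready.push(2, Kind.AIC)
popped_aiv = ready.pop(Kind.AIV)  # not stuck behind the older AIC slot
popped_aic = ready.pop(Kind.AIC)
popped_mix = ready.pop(Kind.MIX)
```

With a single queue, the AIV slot would sit behind slot 0 until something consumed it; per-kind queues remove that head-of-line wait in this toy model.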
- Python `Worker.run` drops the `if level==2` branch.
- Callable registry moves fully into C++ (`unordered_map<uint64_t, nb::object>` owned by `Worker`) so `ChipCallable` and Python `sub` callables share one lookup path. This unblocks L4+ recursion.
- C++ `Task { OrchFn orch; TaskArgs task_args; CallConfig config; }` so a higher-level `Worker` can register a lower-level `Worker` as a next-level child and dispatch via `IWorker::run`.
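
The recursion this enables can be pictured with a small Python model. Everything here is a stand-in: `LeafWorker`, `CompositeWorker`, `add_next_level`, and the integer callable ids are hypothetical names, and the orchestration functions run synchronously instead of building a DAG. The point is only that a worker with a `run(callable, args, config)` method and an id-keyed registry can be a child of another worker with the same shape.

```python
class LeafWorker:
    """Stands in for a single-chip worker: runs a registered callable directly."""

    def __init__(self):
        self.registry = {}

    def register(self, callable_id, fn):
        self.registry[callable_id] = fn

    def run(self, callable_id, args, config):
        return self.registry[callable_id](args, config)


class CompositeWorker:
    """Stands in for a higher-level worker whose callables dispatch to children."""

    def __init__(self, level):
        self.level = level
        self.children = []
        self.registry = {}

    def add_next_level(self, child):
        self.children.append(child)

    def register(self, callable_id, orch_fn):
        self.registry[callable_id] = orch_fn

    def run(self, callable_id, args, config):
        # Same signature as the leaf, so a composite can itself be a child.
        return self.registry[callable_id](self.children, args, config)


SQUARE, FANOUT, FORWARD = 1, 2, 3

chips = [LeafWorker(), LeafWorker()]
for chip in chips:
    chip.register(SQUARE, lambda args, config: [x * x for x in args])


def fanout(children, args, config):
    half = len(args) // 2
    left = children[0].run(SQUARE, args[:half], config)
    right = children[1].run(SQUARE, args[half:], config)
    return left + right


l3 = CompositeWorker(level=3)
for chip in chips:
    l3.add_next_level(chip)
l3.register(FANOUT, fanout)

l4 = CompositeWorker(level=4)
l4.add_next_level(l3)  # a composite registered as a next-level child
l4.register(FORWARD, lambda children, args, config: children[0].run(FANOUT, args, config))

result = l4.run(FORWARD, [1, 2, 3, 4], {})
```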
- Final rename sweep: `DistOrchestrator` → `Orchestrator`, files `dist_*.{h,cpp}` → `*.{h,cpp}`.

Notes on the code as it stands today:

- `DistOrchestrator::release_ref` threshold is `>= total + 1` (not `>= total`). This matches `DistScheduler::try_consume`; the `+1` accounts for the slot's own self-release contribution. Alloc slots (synthetic, never dispatched) pre-bump `fanout_released` to `1` in `alloc()` so this threshold math works for them too.
- `on_consumed` uses a CAS on state to remain idempotent across the two call paths (`release_ref` and `try_consume`).
- scene_test has two helper functions: `_build_chip_task_args` returns `ChipStorageTaskArgs` (POD, for the current L2 path: `ChipWorker.run(callable, POD, config)`) and `_build_l3_task_args` returns a tagged `TaskArgs` (for `orch.submit_next_level`). PR-C will collapse these into one helper when `ChipWorker::run` takes a `TaskArgsView`.
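
The threshold arithmetic in the first two notes can be checked with a toy model. This is a sketch under assumptions, not the C++ code: `ToySlot`, `fanout_total`, and the string states are hypothetical, the compare-and-swap is emulated with a lock, `try_consume` is modelled as a pure threshold check, and the slot's self-release is modelled as one extra `release_ref()` call.

```python
import threading

COMPLETED, CONSUMED = "COMPLETED", "CONSUMED"


class ToySlot:
    def __init__(self, fanout_total, is_alloc=False):
        self.fanout_total = fanout_total
        # Alloc slots are never dispatched, so their self-release is pre-counted.
        self.fanout_released = 1 if is_alloc else 0
        self.state = COMPLETED
        self.reclaims = 0
        self._lock = threading.Lock()

    def _cas_state(self, expected, new):
        """Lock-based stand-in for an atomic compare-and-swap on the state."""
        with self._lock:
            if self.state != expected:
                return False
            self.state = new
            return True

    def on_consumed(self):
        # Only the caller that wins the CAS performs the reclaim.
        if self._cas_state(COMPLETED, CONSUMED):
            self.reclaims += 1

    def release_ref(self):
        with self._lock:
            self.fanout_released += 1
            released = self.fanout_released
        if released >= self.fanout_total + 1:
            self.on_consumed()

    def try_consume(self):
        with self._lock:
            released = self.fanout_released
        if released >= self.fanout_total + 1:
            self.on_consumed()


task = ToySlot(fanout_total=2)
task.release_ref()                # consumer 1
task.release_ref()                # consumer 2: 2 < 3, still alive
state_after_consumers = task.state
task.release_ref()                # the slot's own self-release: 3 >= 3
task.try_consume()                # second call path hits an already-consumed slot

buf = ToySlot(fanout_total=2, is_alloc=True)
buf.release_ref()
buf.release_ref()                 # 1 (pre-bump) + 2 consumers reaches the threshold
```

With a `>= total` threshold, the task slot above would have been reclaimed after the second consumer, before its own release arrived; the pre-bump keeps alloc slots on the same arithmetic.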