Add multi-communication-domain support #752

Open
uv-xiao wants to merge 9 commits into hw-native-sys:main from uv-xiao:multi-comm-domain

uv-xiao commented May 12, 2026

Summary

  • Add the new multi-communication-domain public surface:
    CommDomain, CommDomainPlan, ChipDomainBootstrapConfig, and
    domain-aware ChipContext.domains[name].
  • Derive per-chip bootstrap configs from one L3-level CommDomainPlan, with
    dense domain ranks defined by CommDomain.worker_indices order.
  • Replace the old public single-domain bootstrap surface with the new
    domain-plan path; single-domain usage is now expressed as one
    CommDomain("default", ...).
  • Keep explicit ChipBootstrapConfig support for per-chip data such as host
    staging, keyed by (domain_name, buffer_name).
  • Extend the runtime/bootstrap path to publish multiple named domain contexts,
    carve symmetric per-domain buffers, and pass domain-local CommContext*
    pointers to PTO-ISA kernels.
  • Extend both onboard/HCCL and sim communication backends for the base-window
    plus derived-domain-context model.
  • Migrate existing communication examples to the new surface and add two
    focused multi-domain examples:
    workers/l3/domain_rank_map and workers/l3/dual_domain_overlap.
  • Add design, implementation, validation, and handoff documentation under
    docs/.
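
The rule that dense domain ranks follow CommDomain.worker_indices order can be illustrated with a minimal sketch. It does not use the PR's classes: DomainSketch and derive_domain_ranks are hypothetical names introduced here, and the real CommDomainPlan derivation may carry more data (buffers, window sizes) than this shows.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DomainSketch:
    """Hypothetical stand-in for CommDomain: a name plus ordered worker indices."""
    name: str
    worker_indices: Tuple[int, ...]


def derive_domain_ranks(
    domains: List[DomainSketch], num_workers: int
) -> Dict[int, Dict[str, Tuple[int, int]]]:
    """Return {worker_index: {domain_name: (domain_rank, domain_size)}}.

    The domain rank of a worker is its position in worker_indices, so ranks
    are dense (0..size-1) even when the worker indices are sparse or reordered.
    """
    per_worker: Dict[int, Dict[str, Tuple[int, int]]] = {
        w: {} for w in range(num_workers)
    }
    seen_names = set()
    for domain in domains:
        if domain.name in seen_names:
            raise ValueError(f"duplicate domain name: {domain.name}")
        seen_names.add(domain.name)
        if len(set(domain.worker_indices)) != len(domain.worker_indices):
            raise ValueError(f"duplicate worker in domain {domain.name}")
        for rank, worker in enumerate(domain.worker_indices):
            if not 0 <= worker < num_workers:
                raise ValueError(f"worker {worker} out of range")
            per_worker[worker][domain.name] = (rank, len(domain.worker_indices))
    return per_worker


# Assumed example plan: one domain over all workers, one reordered pair.
plan = [DomainSketch("default", (0, 1, 2)), DomainSketch("pair", (2, 0))]
ranks = derive_domain_ranks(plan, num_workers=3)
```

In this sketch worker 2 is rank 0 of "pair" because it is listed first, and worker 1 only sees the "default" domain.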

Backend Notes

  • On onboard A2/A3, visible communication domains are slices of one hidden base
    communication window. Kernels communicate with domain-local ranks through
    the derived domain CommContext.
  • On sim, the base window is a POSIX shared-memory segment. The sim backend
    derives host-resident domain CommContext objects by remapping
    windowsIn[]/windowsOut[] through the domain rank list and window offset.
  • HCCL collective kernels are intentionally out of scope. This PR uses
    HCCL/HCOMM resources for windows and PTO-ISA RMA-style communication in
    kernels.
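
The sim-side remapping of windowsIn[]/windowsOut[] through the domain rank list and window offset can be sketched as follows. This is an illustration under stated assumptions, not the backend's code: remap_windows is a hypothetical helper, windows are modeled as plain integer addresses, and the offset is assumed to be a byte offset applied uniformly to every member's base window.

```python
from typing import List, Sequence


def remap_windows(
    base_windows: Sequence[int],
    domain_workers: Sequence[int],
    window_offset: int,
) -> List[int]:
    """Build a domain-local window table from the base window table.

    base_windows[w] is the (assumed) base address of worker w's window.
    The result is indexed by domain-local rank: entry r points into the
    window of the worker that holds rank r, shifted by window_offset.
    """
    if window_offset < 0:
        raise ValueError("window_offset must be non-negative")
    return [base_windows[w] + window_offset for w in domain_workers]


# Assumed base table for four workers and a two-member domain [3, 1].
base_windows_in = [0x1000, 0x2000, 0x3000, 0x4000]
pair_windows_in = remap_windows(
    base_windows_in, domain_workers=[3, 1], window_offset=0x100
)
```

A kernel that addresses domain rank 0 through this table reaches worker 3's slice, which is how domain-local ranks can stay independent of the base rank numbering.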

Validation

  • Focused unit and sim tests pass: 30 tests covering domain-plan derivation,
    bootstrap channels, sim bootstrap, worker-level sim orchestration, and error
    cleanup.
  • Hardware bootstrap unit test passes on A2/A3.
  • Public examples pass on hardware when excluding the known
    sdma_async_completion_demo failure.
  • workers/l3/domain_rank_map passes on hardware with CANN 8.5 on devices
    12,13,14.
  • python -m py_compile passes for the updated SDMA demo.
  • markdownlint-cli2 and git diff --check pass for the touched docs.

Known Limitations

  • a2a3/sdma_async_completion_demo still fails and is tracked as a remaining
    SDMA/HCCL resource setup issue, not as a multi-domain bootstrap failure.
  • CANN 8.5 lacks aclnnShmemSdmaStarsQuery when SDMA workspace support is
    enabled.
  • CANN 9.0 beta.2 removes the missing-symbol issue, but
    HcclAllocComResourceByTiling still returns 15 on tested card pairs, and
    this also affects the non-SDMA domain_rank_map example.
  • The same SDMA demo failure path also reproduces on origin/main, so it is
    not yet proven to be introduced by this PR.

- Add a single doc covering current single-domain wiring from L3 to PTO-ISA kernels
- Specify the multi-domain plan, per-chip derivation, and sub-communicator bootstrap path
- Keep the change doc-only for this PR
@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

uv-xiao marked this pull request as draft May 12, 2026 04:09
uv-xiao changed the title from "Add multi-communication-domain design doc" to "Add multi-communication-domain support" May 12, 2026
uv-xiao added 5 commits May 13, 2026 08:50
- Add CommDomainPlan-derived per-chip bootstrap configs and domain-aware contexts
- Derive kernel-visible communication contexts from one hidden base window
- Extend sim and hardware bootstrap paths plus focused unit coverage
- Refit communication examples to CommDomainPlan and ctx.domains access
- Keep explicit bootstrap configs for host-staging examples
- Add tiny rank-map and two-domain overlap L3 examples
- Record implemented and intentionally omitted features
- Document bootstrap flow, host-staging split, and sim scope
- Add validation status for migrated and new examples
- Run one PTO-ISA allreduce per domain in domain_rank_map
- Check transferred data for both overlapping domains
- Update docs to describe hardware communication coverage
uv-xiao marked this pull request as ready for review May 13, 2026 10:24
puddingfjz and others added 3 commits May 13, 2026 18:37
- Remove unsafe HCCL barrier before MC2 resource allocation

- Link host runtime against the resolved hcomm library path

- Add SDMA workspace preflight and write PR 752 handoff notes
- Allow migrated L3 communication demos to run on a2a3sim

- Keep hardware text-section extraction for onboard runs only

- Preserve ffn allreduce kernel logic while fixing sim include order