[RCCL] [AICOMRCCL-598] Add Device API unit tests (rccl-UnitTestsFixtures) #7770
Draft
speriaswamy-amd wants to merge 2 commits into
…ixtures
Adds DeviceApi.LsaRemoteRead, DeviceApi.CuMemDisabled, and
DeviceApi.WinDisabled in rccl-UnitTestsFixtures, covering:
- LSA cross-rank peer read: each rank's kernel builds an
ncclLsaBarrierSession<ncclCoopCta>, syncs across ranks, then reads
its peer's symmetric-window buffer via ncclGetLsaPointer, and
- ncclDevCommCreate / symmetric-window gating under
NCCL_CUMEM_ENABLE / NCCL_WIN_ENABLE.
Targets the current device API (NCCL 2.30.4 on develop):
- ncclDevCommRequirements_t is initialized via
NCCL_DEV_COMM_REQUIREMENTS_INITIALIZER (2.30 validates its
size/magic/version header; zero-init is rejected).
- The negative tests accept the unsupported-config rejection at
whichever point the runtime raises it: NCCL 2.30 rejects at
ncclCommWindowRegister, older releases at ncclDevCommCreate.
- Negative configs pin NCCL_IB_DISABLE=1 (single-node 2-GPU tests do
not need IB; otherwise the rejection path can surface an
environment-dependent ncclSystemError from ibv_create_qp that masks
the clean ncclInvalidUsage gating signal).
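The "reject at whichever point" rule for the negative tests can be sketched as a small predicate over the two setup stages. This is a minimal illustration, not the code in this PR: StubResult is a local stand-in for the relevant ncclResult_t values, and SetupOutcome and rejectedAsInvalidUsage are hypothetical names introduced here.

```cpp
#include <optional>

// Local stand-in for the few ncclResult_t values this sketch needs.
// A real test would use the enum from the RCCL header instead.
enum class StubResult { Success = 0, SystemError = 2, InvalidUsage = 5 };

// Hypothetical record of a negative test's two setup stages.
// devCommCreate is empty when window registration already failed,
// because the second stage is then never attempted.
struct SetupOutcome {
  StubResult windowRegister;
  std::optional<StubResult> devCommCreate;
};

// Returns true when the unsupported config was cleanly rejected with
// InvalidUsage at either stage. Any other error (for example SystemError)
// is not accepted as a gating signal.
bool rejectedAsInvalidUsage(const SetupOutcome& o) {
  if (o.windowRegister == StubResult::InvalidUsage) return true;   // newer-runtime path
  if (o.windowRegister != StubResult::Success) return false;       // unrelated failure
  return o.devCommCreate.has_value() &&
         *o.devCommCreate == StubResult::InvalidUsage;             // older-runtime path
}
```

Treating a non-InvalidUsage error at the first stage as a failure, rather than falling through to the second stage, is what keeps an environment-dependent system error from being mistaken for correct gating.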
Device-API helper headers are included directly
(nccl_device/impl/{core,lsa_barrier}__funcs.h); these are HIP-clean
since PR #6259 added the hip_compat.h cuda::memory_order polyfill.
NCCL bootstrap is pinned to loopback (single-process multi-GPU via
ncclCommInitAll). DeviceApiResources use RAII teardown; the prior
AICOMRCCL-835 teardown segfault was fixed by the symMemoryDropRef
drain in the NCCL 2.28.9 sync.
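One way to pin environment variables such as NCCL_IB_DISABLE or NCCL_SOCKET_IFNAME for the duration of a single test, in the same RAII spirit as the teardown described above, is a scoped guard. The sketch below is an assumption about how this could be done, not the helper used in this PR: ScopedEnv is a hypothetical name, and it relies on POSIX setenv/unsetenv.

```cpp
#include <cstdlib>
#include <optional>
#include <stdlib.h>  // POSIX setenv / unsetenv
#include <string>
#include <utility>

// Hypothetical RAII guard: sets an environment variable on construction and
// restores the previous state (old value, or unset) on destruction.
class ScopedEnv {
 public:
  ScopedEnv(std::string name, const std::string& value) : name_(std::move(name)) {
    // Remember the previous value, if any, before overwriting it.
    if (const char* prev = std::getenv(name_.c_str())) previous_ = prev;
    ::setenv(name_.c_str(), value.c_str(), /*overwrite=*/1);
  }

  ~ScopedEnv() {
    if (previous_) {
      ::setenv(name_.c_str(), previous_->c_str(), /*overwrite=*/1);
    } else {
      ::unsetenv(name_.c_str());
    }
  }

  // Non-copyable: two guards restoring the same variable would conflict.
  ScopedEnv(const ScopedEnv&) = delete;
  ScopedEnv& operator=(const ScopedEnv&) = delete;

 private:
  std::string name_;
  std::optional<std::string> previous_;
};
```

In a test body, such guards would presumably need to be constructed before communicator initialization so the runtime sees the pinned values, for example `ScopedEnv ib("NCCL_IB_DISABLE", "1"); ScopedEnv ifname("NCCL_SOCKET_IFNAME", "lo");`. Guards declared this way are destroyed in reverse order when the scope ends.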
…tTestsFixtures
Add a DeviceApi entry under the unit_tests_fixtures block so the test_runner invokes the binary that contains these tests.
Summary
Adds unit tests for the RCCL Device API (AICOMRCCL-598), built into rccl-UnitTestsFixtures:
- DeviceApi.LsaRemoteRead (positive, 2-rank): registers symmetric windows, then each rank's kernel builds an ncclLsaBarrierSession<ncclCoopCta>, syncs across ranks, and reads its peer's window buffer via ncclGetLsaPointer, validating cross-rank LSA access end-to-end.
- DeviceApi.CuMemDisabled (negative): the device API is correctly gated off (ncclInvalidUsage) when NCCL_CUMEM_ENABLE=0.
- DeviceApi.WinDisabled (negative): same gating under NCCL_WIN_ENABLE=0.
Based on current develop (NCCL 2.30.4) and the merged process-isolated test-runner refactor (#6523).
Design notes for reviewers
- Device-API helper headers are included directly (nccl_device/impl/{core,lsa_barrier}__funcs.h); these are HIP-clean since "[RCCL] Enable nccl_device LSA barrier on HIP" (#6259) added the hip_compat.h cuda::memory_order polyfill.
- ncclDevCommRequirements_t is initialized via NCCL_DEV_COMM_REQUIREMENTS_INITIALIZER (NCCL 2.30 validates its size/magic/version header; zero-init is rejected).
- The negative tests accept the unsupported-config rejection at whichever point the runtime raises it: NCCL 2.30 rejects at ncclCommWindowRegister, older releases at ncclDevCommCreate.
- Negative configs pin NCCL_IB_DISABLE=1: these are single-node 2-GPU tests that don't need IB/RDMA; otherwise the rejection path can surface an environment-dependent ncclSystemError (ibv_create_qp) that masks the clean ncclInvalidUsage gating signal.
- NCCL bootstrap is pinned to loopback (NCCL_SOCKET_IFNAME=lo): the tests are single-process multi-GPU (ncclCommInitAll) and all bootstrap traffic is intra-host, so this keeps them self-contained on any host network configuration.
- withNumGpus(N) declares each test's GPU footprint for the runner's parallel scheduler (positive = 2, negative = 1).
- DeviceApiResources use RAII teardown; the prior AICOMRCCL-835 teardown segfault was fixed by the symMemoryDropRef drain in the NCCL 2.28.9 sync.
Relationship to #7171 (GinMPIDeviceTests.BarrierSession_*): that PR validates the higher-level ncclBarrierSession via MPI multi-process tests in rccl-UnitTestsMPI. These tests are complementary: they are single-process multi-GPU and exercise the lower-level ncclLsaBarrierSession + ncclGetLsaPointer symmetric-window path in rccl-UnitTestsFixtures.
Test plan / result
Validated in the rocm:7.13-nightly container on a 2×gfx950 node.
Draft PR opened to exercise CI.
Note for maintainers (pre-existing, not from this PR)
A full rccl-UnitTestsFixtures run on develop currently aborts before reaching these tests, on develop's own fixtures, independent of this change:
- u32fpDecode.u32fpDecodeSuccess: host-math assertion failure.
- PackRoundtripTest.LdStGlobal16 (device/TestOp128.cpp): GPU coredump (execvp failed).
This change only adds DeviceApiTests.cpp (+ its build/test-runner wiring); the DeviceApi tests pass in isolation.
JIRA
AICOMRCCL-598