[RCCL] [AICOMRCCL-598] Add Device API unit tests (rccl-UnitTestsFixtures)#7770

Draft
speriaswamy-amd wants to merge 2 commits into develop from users/speriasw/AICOMRCCL-598-rccl-device-api-tests-v4

@speriaswamy-amd

Contributor

Summary

Adds unit tests for the RCCL Device API (AICOMRCCL-598), built into rccl-UnitTestsFixtures:

  • DeviceApi.LsaRemoteRead — positive, 2-rank: registers symmetric windows, then each rank's kernel builds an ncclLsaBarrierSession<ncclCoopCta>, syncs across ranks, and reads its peer's window buffer via ncclGetLsaPointer, validating cross-rank LSA access end-to-end.
  • DeviceApi.CuMemDisabled — negative: device API is correctly gated off (ncclInvalidUsage) when NCCL_CUMEM_ENABLE=0.
  • DeviceApi.WinDisabled — negative: same gating under NCCL_WIN_ENABLE=0.

Based on current develop (NCCL 2.30.4) and the merged process-isolated test-runner refactor (#6523).

Design notes for reviewers

  • Device-API headers are included directly (nccl_device/impl/{core,lsa_barrier}__funcs.h); these are HIP-clean since #6259 ([RCCL] Enable nccl_device LSA barrier on HIP) added the hip_compat.h cuda::memory_order polyfill.
  • ncclDevCommRequirements_t is initialized via NCCL_DEV_COMM_REQUIREMENTS_INITIALIZER (NCCL 2.30 validates its size/magic/version header; zero-init is rejected).
  • Negative tests accept the unsupported-config rejection at whichever point the runtime raises it: NCCL 2.30 rejects at ncclCommWindowRegister, older releases at ncclDevCommCreate.
  • Negative configs pin NCCL_IB_DISABLE=1 — these are single-node 2-GPU tests that don't need IB/RDMA; otherwise the rejection path can surface an environment-dependent ncclSystemError (ibv_create_qp) that masks the clean ncclInvalidUsage gating signal.
  • NCCL bootstrap is pinned to loopback (NCCL_SOCKET_IFNAME=lo). The tests are single-process multi-GPU (ncclCommInitAll), so all bootstrap traffic is intra-host; pinning the interface keeps the tests self-contained on any host network configuration.
  • withNumGpus(N) declares each test's GPU footprint for the runner's parallel scheduler (positive = 2, negative = 1).
  • Resources use RAII teardown; the earlier AICOMRCCL-835 teardown segfault was fixed upstream by the symMemoryDropRef drain in the NCCL 2.28.9 sync.
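
The note above about accepting the rejection at whichever stage raises it can be sketched as a small predicate. This is an illustration only: `Result` is a hypothetical stand-in for ncclResult_t so the snippet compiles without NCCL, and `rejectedWithInvalidUsage` is not a helper from this PR.

```cpp
#include <initializer_list>

// Hypothetical stand-in for ncclResult_t; the real enum lives in nccl.h.
enum class Result { Success, SystemError, InvalidUsage };

// Takes the results of the setup stages in call order (for example window
// registration, then device-comm creation). Returns true only when the
// first non-success result is InvalidUsage, i.e. the gated configuration
// was rejected cleanly at some stage. An environment-dependent error such
// as SystemError, or no rejection at all, yields false.
inline bool rejectedWithInvalidUsage(std::initializer_list<Result> stages) {
    for (Result r : stages) {
        if (r == Result::Success) continue;   // stage passed, keep looking
        return r == Result::InvalidUsage;     // first failure decides
    }
    return false;  // every stage succeeded: gating did not take effect
}
```

Under this sketch, a runtime that rejects at the first stage and one that rejects at the second both satisfy the negative test, while a masked ncclSystemError-style failure does not.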

Relationship to #7171 (GinMPIDeviceTests.BarrierSession_*): that PR validates the higher-level ncclBarrierSession via MPI multi-process tests in rccl-UnitTestsMPI. These tests are complementary — single-process multi-GPU, exercising the lower-level ncclLsaBarrierSession + ncclGetLsaPointer symmetric-window path in rccl-UnitTestsFixtures.

Test plan / result

Validated in the rocm:7.13-nightly container on a 2×gfx950 node:

rccl-UnitTestsFixtures --gtest_filter=DeviceApi.*
[ PASSED ] DeviceApi.LsaRemoteRead  (8.06 s)
[ PASSED ] DeviceApi.CuMemDisabled  (2.31 s)
[ PASSED ] DeviceApi.WinDisabled    (2.40 s)

Draft PR opened to exercise CI.

Note for maintainers (pre-existing, not from this PR)

A full rccl-UnitTestsFixtures run on develop currently aborts before reaching these tests. The failures are in develop's own fixtures and are independent of this change:

  • u32fpDecode.u32fpDecodeSuccess — host-math assertion failure.
  • PackRoundtripTest.LdStGlobal16 (device/TestOp128.cpp) — GPU coredump (execvp failed).

This change only adds DeviceApiTests.cpp (+ its build/test-runner wiring); the DeviceApi tests pass in isolation.

JIRA

AICOMRCCL-598

…ixtures

Adds DeviceApi.LsaRemoteRead, DeviceApi.CuMemDisabled, and
DeviceApi.WinDisabled in rccl-UnitTestsFixtures, covering:
- LSA cross-rank peer read: each rank's kernel builds an
  ncclLsaBarrierSession<ncclCoopCta>, syncs across ranks, then reads
  its peer's symmetric-window buffer via ncclGetLsaPointer, and
- ncclDevCommCreate / symmetric-window gating under
  NCCL_CUMEM_ENABLE / NCCL_WIN_ENABLE.

Targets the current device API (NCCL 2.30.4 on develop):
- ncclDevCommRequirements_t is initialized via
  NCCL_DEV_COMM_REQUIREMENTS_INITIALIZER (2.30 validates its
  size/magic/version header; zero-init is rejected).
- The negative tests accept the unsupported-config rejection at
  whichever point the runtime raises it: NCCL 2.30 rejects at
  ncclCommWindowRegister, older releases at ncclDevCommCreate.
- Negative configs pin NCCL_IB_DISABLE=1 (single-node 2-GPU tests do
  not need IB; otherwise the rejection path can surface an
  environment-dependent ncclSystemError from ibv_create_qp that masks
  the clean ncclInvalidUsage gating signal).
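
To illustrate why a zero-initialized requirements struct would be rejected once a size/magic/version header is validated, here is a minimal sketch. The struct layout, the constants, and the names `Requirements`, `makeRequirements`, and `headerIsValid` are all hypothetical; they do not reproduce the real ncclDevCommRequirements_t or NCCL_DEV_COMM_REQUIREMENTS_INITIALIZER.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical header; NOT the real NCCL layout.
struct ReqHeader {
    std::size_t   size;     // sizeof the full struct, as seen by the caller
    std::uint32_t magic;    // fixed tag identifying the struct type
    std::uint32_t version;  // layout version the caller was built against
};

struct Requirements {
    ReqHeader hdr;
    int barrierCount;       // hypothetical payload field
};

constexpr std::uint32_t kMagic   = 0x52455153u;  // arbitrary illustrative value
constexpr std::uint32_t kVersion = 1;

// Plays the role of an initializer macro: fills the header, zeroes the payload.
constexpr Requirements makeRequirements() {
    return Requirements{ReqHeader{sizeof(Requirements), kMagic, kVersion}, 0};
}

// The check a library could run before trusting the payload fields.
constexpr bool headerIsValid(const Requirements& r) {
    return r.hdr.size == sizeof(Requirements) &&
           r.hdr.magic == kMagic &&
           r.hdr.version == kVersion;
}
```

With this layout, `Requirements{}` has size, magic, and version all equal to zero and fails the check, whereas a value produced by the initializer passes even after its payload fields are edited.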

Device-API helper headers are included directly
(nccl_device/impl/{core,lsa_barrier}__funcs.h); these are HIP-clean
since PR #6259 added the hip_compat.h cuda::memory_order polyfill.

NCCL bootstrap is pinned to loopback (single-process multi-GPU via
ncclCommInitAll). DeviceApiResources use RAII teardown; the prior
AICOMRCCL-835 teardown segfault was fixed by the symMemoryDropRef
drain in the NCCL 2.28.9 sync.
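
The commit message does not show how the environment is pinned, so the following is only a sketch of one RAII-style way to do it. `ScopedEnv` is a hypothetical helper, not a class from this PR, and it relies on the POSIX setenv/unsetenv calls. Such a guard would presumably need to be constructed before the communicator is initialized for the setting to be picked up.

```cpp
#include <stdlib.h>  // POSIX: getenv, setenv, unsetenv
#include <string>

// Hypothetical helper: sets an environment variable for the lifetime of
// the object and restores the previous state (old value or unset) when
// the object is destroyed.
class ScopedEnv {
public:
    ScopedEnv(const char* name, const char* value) : name_(name) {
        if (const char* old = ::getenv(name)) {
            hadOld_ = true;
            old_ = old;  // copy before setenv may invalidate the pointer
        }
        ::setenv(name, value, /*overwrite=*/1);
    }

    ~ScopedEnv() {
        if (hadOld_) {
            ::setenv(name_.c_str(), old_.c_str(), /*overwrite=*/1);
        } else {
            ::unsetenv(name_.c_str());
        }
    }

    // Non-copyable: two guards restoring the same variable would conflict.
    ScopedEnv(const ScopedEnv&) = delete;
    ScopedEnv& operator=(const ScopedEnv&) = delete;

private:
    std::string name_;
    std::string old_;
    bool hadOld_ = false;
};
```

A test body could then declare `ScopedEnv ifname("NCCL_SOCKET_IFNAME", "lo");` at the top of its scope, and the variable returns to its prior state when the scope exits, including on early return.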
…tTestsFixtures

Add a DeviceApi entry under the unit_tests_fixtures block so the
test_runner invokes the binary that contains these tests.