From 5613972f8a20788d5d6daa95ca640b8f5a37602a Mon Sep 17 00:00:00 2001 From: Corey Derochie Date: Wed, 24 Jun 2026 09:00:16 -0600 Subject: [PATCH 1/4] Added missing release sections and missing features. --- projects/rccl/CHANGELOG.md | 193 ++++++++++++++++++++++++++++++++----- 1 file changed, 167 insertions(+), 26 deletions(-) diff --git a/projects/rccl/CHANGELOG.md b/projects/rccl/CHANGELOG.md index a5214eecf94..fc2d835186f 100644 --- a/projects/rccl/CHANGELOG.md +++ b/projects/rccl/CHANGELOG.md @@ -2,18 +2,138 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https://rccl.readthedocs.io) -## Unreleased - RCCL 2.30.4 for ROCm 7.12 +## Unreleased - RCCL 2.30.4 for ROCm 7.14 + +### Added +* Added proxytrace profiler plugin and core proxy-diagnostics hooks (`RCCL_PROXYTRACE`). +* Added `ncclBarrierSession` LSA validation for barrier sessions. +* Enable WarpSpeed auto mode for grow communicators. +* Added symmetric-memory ReduceScatter kernel (`RailA2A_LsaLD`) on gfx942/gfx950. +* Added bias (accumulation) AllReduce on gfx1250 (MI450). +* Added optimized scale-up ReduceScatter, AllGather, and AllToAll kernels. +* Refactored AllGather algorithm selection; hierarchical AllGather now enabled by default for multi-node. +* Added rocprofiler coverage for `ncclCommGrow` and `ncclCommGetUniqueId`. +* P2P batching auto-enabled for gfx950 in combination with non-AINIC NICs. +* Display HIP/ROCm runtime versions in `NCCL_DEBUG` output. +* Parallelize communicator destruction across child processes to reduce teardown latency. +* Symmetric memory kernel tuning. +* Detect ROCm version via core symlink for multi-architecture installs. +* Skip DDA IPC initialization for directMode and MNNVL topologies. +* Load versioned `libamd_smi` SONAME instead of an unversioned symlink. +* Added Pythonic API bindings under `bindings/nccl4py/` (RCCL fork of NVIDIA `nccl4py` v0.2.0). 
Provides Python access to RCCL collectives via Cython bindings, an on-disk `cuda.core` HIP shim for ROCm hosts without `cuda-bindings` / `cuda-core`, and RCCL-only collective wrappers (`ncclAllReduceWithBias`, `ncclAllToAllv`). +* Added RCCL examples to the repository. ### Changed * Compatibility with NCCL 2.30.4. +* Compatibility with NCCL 2.29.7. +* Compatibility with NCCL 2.28.9. +* Swapped legacy `net_ib` with the `net_ib` implementation from NCCL 2.29. +* Added `RCCL host API` pull-in from NCCL 2.30. +* Skip per-warp channel LDS copy when `warpComm` is disabled. +* Harden proxy RPC setup against malformed peer input. +* Fixed `net_ib_cast`: gate CTS offload path on per-connection state. +* Fixed acquire-tail polling for gfx950 P2P host staging. +* The bootstrap AllGather now uses the bidirectional ring (N/2 steps) by default on the socket OOB path. `NCCL_BOOTSTRAP_BIDIR_ALLGATHER` now defaults to `1`; set it to `0` to fall back to the unidirectional ring. The net OOB path (`NCCL_OOB_NET_ENABLE`) and its bidirectional variant (`NCCL_BOOTSTRAP_BIDIR_NET`) remain off by default. +* `NCCL_PXN_C2C` is kept default-off (`0`); upstream NCCL defaults it to `1` since 2.28. The C2C PXN routing path is NVIDIA-specific and is not currently applicable on AMD hardware. -### Known issues -* Elastic-buffer support for GIN (multi-segment symmetric memory windows backed by a mix of device and CPU/`HOST_NUMA` memory, exposed through `NCCL_ELASTIC_BUFFER_REGISTER` and `NCCL_SYM_REUSE_SYSMEM_HANDLES`) was newly synced from upstream and compiles on ROCm, but is unverified on AMD hardware. +### Removed +* Removed NPKit profiling support (build option ``ENABLE_NPKIT``, headers, device and proxy instrumentation, install script flag ``--npkit-enable``, and related documentation and tooling). Use the profiler plugin API for profiling instead. 
+* Removed kernel COLLTRACE support, including the `COLLTRACE` build option, device-side collective trace buffers, debug kernel variants, and related install/CI wiring. The host latency profiler is unchanged. +* Removed legacy `ENABLE_PROFILING` device profiling support and the `PROFILE` build option. Use the profiler plugin API instead. + +### Resolved Issues +* Fixed `ncclCommGrow` channel-count divergence causing incorrect collective routing. +* Fixed `ncclCommGrow` hang when growing to an 8-rank single-node communicator. +* Fixed symmetric LDS under-reservation in legacy (non-device-linker) builds. +* Fixed LL128 protocol correctness for gfx1250 (MI450). +* Fixed XGMI topology mapping for multi-system (NPS) nodes. +* Fixed gfx950 collective hang caused by a tuner race condition. +* Fixed `net_ib`: avoid flagging a non-fatal Isend CTS no-match as a fatal error. +* Fixed LDS overflow in device-linker builds. +* Fixed symmetric memory correctness issues. +* Fixed `ncclCommFree` to free symmetric window objects automatically (NCCL 2.29.7 defect). +* Fixed DDA IPC initialization skip on architectures that do not run DDA. +* Fixed static build (`BUILD_SHARED_LIBS=OFF`) failing with `install(EXPORT "rccl-targets" ...)` error when `fmt` is fetched via `FetchContent`. The `fmt-header-only` target is now scoped to the build interface and excluded from RCCL's exported usage requirements. +* Fixed proxy channel staging buffers ignoring the new GDR mode selection on HIP < 7.12 builds. The legacy `#else` branch in `sendProxyConnect` / `recvProxyConnect` now honors `resources->useDmaBuf`, so peermem-equipped hosts on older HIP no longer fall through to `hsa_amd_portable_export_dmabuf` when peermem was selected in `*ProxySetup`. Workaround for affected RCCL builds: `NCCL_DMABUF_ENABLE=0`. + +## RCCL 2.28.3 for ROCm 7.13 + +### Added +* Added CAST network transport (`ncclNetCast` / `net_ib_cast`) for AMD AINIC hardware. 
+* Added built-in CSV tuner for runtime algorithm/protocol/channel selection without rebuilds. +* Added multi-node hierarchical AllGather algorithm. +* Initial support for symmetric memory kernels on gfx942 and gfx950. +* Added `RCCL_IB_SPLIT_DATA_THRESHOLD` to split payload across multiple QPs/NICs in `ncclIbMultiSend`. +* Round-robin single-QP payload and fifo-head-based QP selection in `ncclIbMultiSend`. +* Added User Buffer and Graph Registration (`NCCL_NVLS_ENABLE` / CUMEM) gated on Linux kernel version. +* Added runtime QP tracking with atomic counters in `net_ib` and `net_ib_cast`. +* Enable Copy Engine (CE) collectives support in RCCL. +* Added gfx1250 (MI450) GPU target support in RCCL and RCCL-Tests. +* Strix-Halo (gfx1151) tuning support. +* Add `amd-smi` wrapper functions for projected scale-up support, fabric capability dumping, and MNNVL fabric checks. +* Added `RCCL_IB_P2P_DISABLE_CTS` to disable CTS offload for P2P connections on AINIC. Defaults to 1 (disabled). When `RCCL_CTS_OFFLOAD_ENABLED=1` is explicitly set, it overrides this flag and forces CTS on all connections including P2P. +* Merged `RCCL_CTS_INLINE_DATA` into `RCCL_CTS_OFFLOAD_ENABLED`. CTS offload and CTS inline data are now controlled by a single tri-state variable: `-1` (default, auto-enable on AINIC), `0` (force disable), `1` (force enable for all connections). + +### Changed +* Removed MSCCL and MSCCL++ collective integration; legacy `mscclLoadAlgo`, `mscclRunAlgo`, and `mscclUnloadAlgo` APIs remain as no-ops for link compatibility. +* Removed roc-obj tools and perl build dependency. +* Disable P2P batching by default on MI350. +* Disable AMD-SMI (`amdsmi_init`) by default due to a concurrency issue in `amdsmi_init`; enable explicitly for ROCm 7.0 and above when the issue is addressed. +* `RCCL_ENABLE_CONTEXT_TRACKING` replaced by `NCCL_LAUNCH_ORDER_IMPLICIT` for controlling launch-order tracking. +* Moved tuning log messages from `NCCL_INIT` to the `NCCL_TUNING` debug subsystem. 
+* Gate multi-node Direct AllGather on PXN enablement. +* Use 256 threads per block on gfx950 (increased from 512). +* Set algorithm to Ring for Navi4x (gfx1100/gfx1101) AllReduce. +* Proxy busy-spin loop replaced with architecture-specific pause instruction on GDA-eligible topologies. +* RCCL adds a NCCL CMake alias shim layer for CMake-based build compatibility. +* CTS offload is now controlled per-connection rather than globally, allowing P2P connections to fall back to standard RDMA writes while non-P2P traffic continues to use CTS. + +### Resolved Issues +* Fixed `netOverride` being skipped when rail-optimized trees are enabled (restores desired NIC mapping for targeted 4-NIC systems). +* Fixed RCCL Inspector plugin teardown segfault/hang and collective-count correctness. +* Fixed `ncclGroupSimulateEnd` planner state leak and resource cleanup. +* Fixed validation errors with `all_reduce_bias` kernel on gfx950. +* WarpSpeed now errors out with a warning when the requested channel count exceeds the maximum supported. +* Fixed `--generate-sym-kernels` option when used with the default `--device-linker`. +* Fixed `CUCHECK` and `CUCHECKGOTO` macros to clear the HIP error state before returning. +* Fixed `amd-smi`/`rocm-smi` enum mismatch. +* Fixed CTS-offload corner cases in `net_ib_rocm` and `net_ib_cast` (including mutual dependency enforcement with NIC fusion). +* Fixed IPC registration incorrect `#ifdef` guard that disabled registration. +* Fixed symmetric kernels validation errors on gfx942 and gfx950. + +## RCCL 2.28.3 for ROCm 7.12 -## Unreleased - RCCL 2.30.3 for ROCm 7.12 +### Added +* Added gfx1151 (Strix-Halo) GPU target support. +* Added AMD AINIC support within the RCCL default internal network plugin. +* Added `RCCL_P2P_SHIFT_SIZE` environment variable for advanced tuning of P2P channel and part mapping. +* Added Direct Reduce Scatter implementation for improved multi-node performance. 
+* Added WarpSpeed support for single-node AllGather and ReduceScatter. +* Added virtual device enablement support (minimal changes for virtual GPU topology). +* Added Navi4 (gfx1100) LL protocol enablement and tuning. +* Added amd-smi wrapper for firmware version queries (switched from rocm-smi to amd-smi). ### Changed -* Compatibility with NCCL 2.30.3. +* Changed GPU Direct RDMA mode selection logic to prefer peermem over DMAbuf by default. `NCCL_DMABUF_ENABLE` now defaults to 1 (previously 0). When both peermem and DMAbuf are available, RCCL will use peermem. If peermem is unavailable, RCCL will automatically fall back to DMAbuf (if available and enabled). Setting `RCCL_FORCE_ENABLE_DMABUF=1` forces DMAbuf usage exclusively, skipping peermem even if available, and disables GPU Direct RDMA if DMAbuf is unavailable. +* Remove P2P batching node-count cap; P2P batching now applies for all node counts (previously capped at 32 nodes). +* Halved default CU usage for gfx950 single-node all-reduce for better resource efficiency. +* Set default maximum channels to 48 for gfx950 multi-node collectives. +* Set default maximum channels to 48 for MI350 multi-node collectives. +* WarpSpeed auto-mode handling improved; WarpSpeed enabled for MI350 single-node. +* `NCCL_LAUNCH_ORDER_IMPLICIT` replaces `RCCL_ENABLE_CONTEXT_TRACKING` for controlling implicit launch ordering. +* Disable Direct Reduce Scatter automatically when PXN is disabled. +* Tuning: constant values used for CorrectionFactor tables for improved consistency. +* DMABUF disabled configurations now correctly respected in `rocm_net_ib`. + +### Resolved Issues +* Fixed shutdown ordering race condition and use-after-free crash in proxy cleanup. +* Fixed DMABUF support check failure (SWDEV-579889 / ROCM-2855). +* Fixed `qpIndex` selection in `ncclIbIrecv` for AINIC mode. +* Fixed per-device UD map indexing for NIC fusion configurations. +* Fixed potential segfaults from `malloc` failure paths.
+* Fixed bfloat16 reduce kernel bug for ROCm >= 6.0. +* Fixed memory leak in `ncclCommInitRankFunc`. +* Fixed memory leaks (ROCM-1721, ROCM-1722). ### Known issues * The upstream one-sided RMA subsystem (`src/rma`) was newly synced and uses RCCL's direct-HIP batch memory-operation path (`hipStreamBatchMemOp`, in place of the upstream CUDA `ncclCuStreamBatchMemOp` driver wrapper which is not built on ROCm). It is unverified at scale on ROCm. @@ -21,49 +141,70 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https: * The Copy-Engine profiler path (`ncclProfiler_v6`) is not enabled; RCCL remains on `ncclProfiler_v5`. The profiler plugin needs to be verified on ROCm. * GIN GDAKI host support now uses the shared InfiniBand context (`ibv_context`/`ibv_pd`) rather than opening its own device. The GDAKI path is DOCA/Mellanox-specific and is unverified on AMD NICs. * The RCCL InfiniBand GIN proxy backend was ported to the reworked NCCL 2.30.3 `ncclGin_v13_t` interface (opaque per-communicator context with mandatory `createContext`/`destroyContext`), but does not implement GIN GET or FLUSH (`iget`/`iflush` are left unset); the GIN host proxy reports an unsupported-op error if a device kernel requests one. +* Elastic-buffer support for GIN (multi-segment symmetric memory windows backed by a mix of device and CPU/`HOST_NUMA` memory, exposed through `NCCL_ELASTIC_BUFFER_REGISTER` and `NCCL_SYM_REUSE_SYSMEM_HANDLES`) was newly synced from upstream and compiles on ROCm, but is unverified on AMD hardware. -## Unreleased - RCCL 2.28.3 for ROCm 7.11 +## RCCL 2.28.3 for ROCm 7.11 ### Known issues +* AllToAllv and AllToAll for a single GPU is hanging. * AllGather regression for small message sizes (less than 1 MB) due to the Direct algorithm. * ROCTx feature needs to be verified. * Profiler plugin needs to be verified. ### Added -* Added `RCCL_IB_P2P_DISABLE_CTS` to disable CTS offload for P2P connections on AINIC. Defaults to 1 (disabled). 
When `RCCL_CTS_OFFLOAD_ENABLED=1` is explicitly set, it overrides this flag and forces CTS on all connections including P2P. -* Merged `RCCL_CTS_INLINE_DATA` into `RCCL_CTS_OFFLOAD_ENABLED`. CTS offload and CTS inline data are now controlled by a single tri-state variable: `-1` (default, auto-enable on AINIC), `0` (force disable), `1` (force enable for all connections). -* Added Pythonic API bindings under `bindings/nccl4py/` (RCCL fork of NVIDIA `nccl4py` v0.2.0). Provides Python access to RCCL collectives via Cython bindings, an on-disk `cuda.core` HIP shim for ROCm hosts without `cuda-bindings` / `cuda-core`, and RCCL-only collective wrappers (`ncclAllReduceWithBias`, `ncclAllToAllv`). +* Added `ncclAllReduceWithBias` API for fused all-reduce with elementwise accumulation-bias operations. +* Added collective latency profiler tool (`--latency-profiler` in `install.sh`) for per-collective timing analysis. +* Added dynamic pipelining for reduction collectives via the Simple protocol to improve single-node performance. +* Added `unroll=2` device-code variant for gfx950 multi-node collectives. +* Enable LL128 protocol for gfx942 with 4-NIC configurations using a unified tuning table. +* Added reduce/broadcast algorithm and protocol selection tuning table for multi-node gfx940. +* Pass `NET_OPTIONAL_RECV_COMPLETION` hint to the network plugin to enable potential network-side optimizations. +* Expose symbols for RCCL algorithm, protocol, and channels selection functions (`rcclOverrideAlgorithm`, `rcclOverrideProtocol`). +* Added rail-optimized tree topology support for MI3XX nodes with 4 NICs. +* Added single-node AllGather and ReduceScatter performance optimizations. +* Enable GDRCopy option for gfx950. +* Enable single-node one-slice optimization for gfx950 and MI300A. +* Added environment variable to cap the number of QPs created for send/recv collectives. +* Added support for additional paths when loading the RCCL DMABUF kernel configuration file. 
+* Added `ncclCommDump` API for communicator state inspection. +* Added rocSHMEM GDA alltoall integration (GDA-accelerated alltoall via rocSHMEM). ### Changed +* PIX and PXB are now treated as equivalent GDR distances for more consistent topology detection. +* Optimized AllToAll for 64 or more GPUs on gfx942. +* Optimize `threadfence` for the LL64 protocol on the sender side. +* Disable `__threadfence` on the sender side of the Simple protocol when it is not needed for correctness. +* Use rocm-smi API instead of CLI invocation for firmware version querying. +* Adjusted gfx950 thread-block size to improve LL64 and Simple protocol performance for AllReduce, AllGather, and ReduceScatter. +* `__threadfence` bypass on the multinode gfx950 sender side is now the default. +* Updated multi-node LL/LL128 tuning for gfx950 to improve large-message bandwidth. +* Disabled graph mode memory registration and user buffer registration as unsupported features on current hardware. +* Updated Direct AllGather threshold for single-node and multi-node cases. +* Experimental support for traffic shaping using warp specialization (also known as WarpSpeed) is now available for the Ring algorithm. +* Enabling WarpSpeed in auto mode using RCCL_WARP_SPEED_AUTO optimizes performance and reduces the CU count by 50% on a single node for AllReduce, AllGather from 64MB, and ReduceScatter from 256MB. +* The following configuration knobs control WarpSpeed behavior for debugging purposes: `RCCL_WARP_SPEED_ENABLE`, `RCCL_UNROLL_FACTOR`, `RCCL_WARP_SPEED_CU_COUNT`, and `RCCL_THREADS_PER_BLOCK`. Note that the effective unroll factor is calculated as 2 raised to the value of `RCCL_UNROLL_FACTOR`. * Compatibility with NCCL 2.28.3. -* Changed GPU Direct RDMA mode selection logic to prefer peermem over DMAbuf by default. `NCCL_DMABUF_ENABLE` now defaults to 1 (previously 0). When both peermem and DMAbuf are available, RCCL will use peermem. 
If peermem is unavailable, RCCL will automatically fall back to DMAbuf (if available and enabled). Setting `RCCL_FORCE_ENABLE_DMABUF=1` forces DMAbuf usage exclusively, skipping peermem even if available, and disables GPU Direct RDMA if DMAbuf is unavailable. -* CTS offload is now controlled per-connection rather than globally, allowing P2P connections to fall back to standard RDMA writes while non-P2P traffic continues to use CTS. -* The bootstrap AllGather now uses the bidirectional ring (N/2 steps) by default on the socket OOB path. `NCCL_BOOTSTRAP_BIDIR_ALLGATHER` now defaults to `1`; set it to `0` to fall back to the unidirectional ring. The net OOB path (`NCCL_OOB_NET_ENABLE`) and its bidirectional variant (`NCCL_BOOTSTRAP_BIDIR_NET`) remain off by default. -* `NCCL_PXN_C2C` is kept default-off (`0`); upstream NCCL defaults it to `1` since 2.28. The C2C PXN routing path is NVIDIA-specific and is not currently applicable on AMD hardware. - -### Removed -* Removed MSCCL and MSCCL++ custom collective integration; legacy ``mscclLoadAlgo``, ``mscclRunAlgo``, and ``mscclUnloadAlgo`` APIs remain as no-ops for link compatibility. -* Removed NPKit profiling support (build option ``ENABLE_NPKIT``, headers, device and proxy instrumentation, install script flag ``--npkit-enable``, and related documentation and tooling). Use the profiler plugin API for profiling instead. -* Removed kernel COLLTRACE support, including the `COLLTRACE` build option, device-side collective trace buffers, debug kernel variants, and related install/CI wiring. The host latency profiler is unchanged. -* Removed legacy `ENABLE_PROFILING` device profiling support and the `PROFILE` build option. Use the profiler plugin API instead. ### Resolved Issues -* Fixed static build (`BUILD_SHARED_LIBS=OFF`) failing with `install(EXPORT "rccl-targets" ...)` error when `fmt` is fetched via `FetchContent`. 
The `fmt-header-only` target is now scoped to the build interface and excluded from RCCL's exported usage requirements. -* Fixed proxy channel staging buffers ignoring the new GDR mode selection on HIP < 7.12 builds. The legacy `#else` branch in `sendProxyConnect` / `recvProxyConnect` now honors `resources->useDmaBuf`, so peermem-equipped hosts on older HIP no longer fall through to `hsa_amd_portable_export_dmabuf` when peermem was selected in `*ProxySetup`. Workaround for affected RCCL builds: `NCCL_DMABUF_ENABLE=0`. +* Fixed missing memory fence in the LL protocol for gfx950, which caused collective hangs. +* Fixed segmentation fault in the external profiler plugin on communicator teardown. +* Fixed LL128 protocol selection to respect the user's explicit protocol override setting. +* Fixed `rcclNetP2pPolicy` returning incorrect policy for multi-NIC configurations. +* Fixed missing proxy-counter updates in the proxy loop leading to stalled counters. +* Fixed P2P batching hang when using batch operations. +* Fixed P2P self-copy for batched operations to prevent hangs when communicator size exceeds 32 nodes. +* Fixed WarpSpeed auto mode selection bug. -## Unreleased - RCCL 2.27.7 for ROCm 7.2.0 +## RCCL 2.27.7 for ROCm 7.2.0 ### Changed * RCCL error messages have been made more verbose in several cases. RCCL now prints out fatal error messages by default. Fatal error messages can be suppressed by setting `NCCL_DEBUG=NONE`. * Disabled `reduceCopyPacks` pipelining for `gfx950`. -* Experimental support for traffic shaping using warp specialization (also known as WarpSpeed) is now available for the Ring algorithm. -* Enabling WarpSpeed in auto mode using RCCL_WARP_SPEED_AUTO optimizes performance and reduces the CU count by 50% on a single node for AllReduce, AllGather from 64MB, and ReduceScatter from 256MB. 
-* The following configuration knobs control WarpSpeed behavior for debugging purposes: `RCCL_WARP_SPEED_ENABLE`, `RCCL_UNROLL_FACTOR`, `RCCL_WARP_SPEED_CU_COUNT`, and `RCCL_THREADS_PER_BLOCK`. Note that the effective unroll factor is calculated as 2 raised to the value of `RCCL_UNROLL_FACTOR`. ### Known issues * AllToAllv/AlltoAll for single GPU is hanging. -## Unreleased - RCCL 2.27.7 for ROCm 7.1.1 +## RCCL 2.27.7 for ROCm 7.1.1 ### Changed * Enabling P2P batching with `RCCL_P2P_BATCH_ENABLE=1` is only applicable up to 32 nodes. From 95bf486a36513ce1f626be9954ceb89c62f2603c Mon Sep 17 00:00:00 2001 From: Corey Derochie Date: Wed, 24 Jun 2026 09:50:34 -0600 Subject: [PATCH 2/4] Fixed suggestions --- projects/rccl/CHANGELOG.md | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/projects/rccl/CHANGELOG.md b/projects/rccl/CHANGELOG.md index fc2d835186f..cc03357c408 100644 --- a/projects/rccl/CHANGELOG.md +++ b/projects/rccl/CHANGELOG.md @@ -2,37 +2,33 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https://rccl.readthedocs.io) -## Unreleased - RCCL 2.30.4 for ROCm 7.14 +## RCCL 2.30.4 for ROCm 7.14.0 ### Added +* Compatibility with NCCL 2.30.4. +* Compatibility with NCCL 2.29.7. +* Compatibility with NCCL 2.28.9. * Added proxytrace profiler plugin and core proxy-diagnostics hooks (`RCCL_PROXYTRACE`). * Added `ncclBarrierSession` LSA validation for barrier sessions. -* Enable WarpSpeed auto mode for grow communicators. * Added symmetric-memory ReduceScatter kernel (`RailA2A_LsaLD`) on gfx942/gfx950. * Added bias (accumulation) AllReduce on gfx1250 (MI450). * Added optimized scale-up ReduceScatter, AllGather, and AllToAll kernels. -* Refactored AllGather algorithm selection; hierarchical AllGather now enabled by default for multi-node. * Added rocprofiler coverage for `ncclCommGrow` and `ncclCommGetUniqueId`. * P2P batching auto-enabled for gfx950 in combination with non-AINIC NICs. 
* Display HIP/ROCm runtime versions in `NCCL_DEBUG` output. -* Parallelize communicator destruction across child processes to reduce teardown latency. -* Symmetric memory kernel tuning. * Detect ROCm version via core symlink for multi-architecture installs. * Skip DDA IPC initialization for directMode and MNNVL topologies. * Load versioned `libamd_smi` SONAME instead of an unversioned symlink. * Added Pythonic API bindings under `bindings/nccl4py/` (RCCL fork of NVIDIA `nccl4py` v0.2.0). Provides Python access to RCCL collectives via Cython bindings, an on-disk `cuda.core` HIP shim for ROCm hosts without `cuda-bindings` / `cuda-core`, and RCCL-only collective wrappers (`ncclAllReduceWithBias`, `ncclAllToAllv`). * Added RCCL examples to the repository. +* Added `RCCL host API` pull-in from NCCL 2.30. ### Changed -* Compatibility with NCCL 2.30.4. -* Compatibility with NCCL 2.29.7. -* Compatibility with NCCL 2.28.9. +* Enable WarpSpeed auto mode for grow communicators. +* Refactored AllGather algorithm selection; hierarchical AllGather now enabled by default for multi-node. * Swapped legacy `net_ib` with the `net_ib` implementation from NCCL 2.29. -* Added `RCCL host API` pull-in from NCCL 2.30. * Skip per-warp channel LDS copy when `warpComm` is disabled. * Harden proxy RPC setup against malformed peer input. -* Fixed `net_ib_cast`: gate CTS offload path on per-connection state. -* Fixed acquire-tail polling for gfx950 P2P host staging. * The bootstrap AllGather now uses the bidirectional ring (N/2 steps) by default on the socket OOB path. `NCCL_BOOTSTRAP_BIDIR_ALLGATHER` now defaults to `1`; set it to `0` to fall back to the unidirectional ring. The net OOB path (`NCCL_OOB_NET_ENABLE`) and its bidirectional variant (`NCCL_BOOTSTRAP_BIDIR_NET`) remain off by default. * `NCCL_PXN_C2C` is kept default-off (`0`); upstream NCCL defaults it to `1` since 2.28. The C2C PXN routing path is NVIDIA-specific and is not currently applicable on AMD hardware. 
@@ -41,6 +37,10 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https: * Removed kernel COLLTRACE support, including the `COLLTRACE` build option, device-side collective trace buffers, debug kernel variants, and related install/CI wiring. The host latency profiler is unchanged. * Removed legacy `ENABLE_PROFILING` device profiling support and the `PROFILE` build option. Use the profiler plugin API instead. +### Optimized +* Tuned symmetric memory kernels. +* Parallelized communicator destruction across child processes to reduce teardown latency. + ### Resolved Issues * Fixed `ncclCommGrow` channel-count divergence causing incorrect collective routing. * Fixed `ncclCommGrow` hang when growing to an 8-rank single-node communicator. @@ -48,7 +48,9 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https: * Fixed LL128 protocol correctness for gfx1250 (MI450). * Fixed XGMI topology mapping for multi-system (NPS) nodes. * Fixed gfx950 collective hang caused by a tuner race condition. +* Fixed `net_ib_cast`: gate CTS offload path on per-connection state. * Fixed `net_ib`: avoid flagging a non-fatal Isend CTS no-match as a fatal error. +* Fixed acquire-tail polling for gfx950 P2P host staging. * Fixed LDS overflow in device-linker builds. * Fixed symmetric memory correctness issues. * Fixed `ncclCommFree` to free symmetric window objects automatically (NCCL 2.29.7 defect). @@ -152,6 +154,7 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https: * Profiler plugin needs to be verified. ### Added +* Compatibility with NCCL 2.28.3. * Added `ncclAllReduceWithBias` API for fused all-reduce with elementwise accumulation-bias operations. * Added collective latency profiler tool (`--latency-profiler` in `install.sh`) for per-collective timing analysis. * Added dynamic pipelining for reduction collectives via the Simple protocol to improve single-node performance. 
@@ -183,7 +186,6 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https: * Experimental support for traffic shaping using warp specialization (also known as WarpSpeed) is now available for the Ring algorithm. * Enabling WarpSpeed in auto mode using RCCL_WARP_SPEED_AUTO optimizes performance and reduces the CU count by 50% on a single node for AllReduce, AllGather from 64MB, and ReduceScatter from 256MB. * The following configuration knobs control WarpSpeed behavior for debugging purposes: `RCCL_WARP_SPEED_ENABLE`, `RCCL_UNROLL_FACTOR`, `RCCL_WARP_SPEED_CU_COUNT`, and `RCCL_THREADS_PER_BLOCK`. Note that the effective unroll factor is calculated as 2 raised to the value of `RCCL_UNROLL_FACTOR`. -* Compatibility with NCCL 2.28.3. ### Resolved Issues * Fixed missing memory fence in the LL protocol for gfx950, which caused collective hangs. From c1d09bf0fb9de5b3f132975c00c4ca6ba9f8cfd2 Mon Sep 17 00:00:00 2001 From: Corey Derochie Date: Wed, 24 Jun 2026 09:57:35 -0600 Subject: [PATCH 3/4] Corrected rocprof ambiguity. --- projects/rccl/CHANGELOG.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/projects/rccl/CHANGELOG.md b/projects/rccl/CHANGELOG.md index cc03357c408..06c67d8e32c 100644 --- a/projects/rccl/CHANGELOG.md +++ b/projects/rccl/CHANGELOG.md @@ -13,7 +13,7 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https: * Added symmetric-memory ReduceScatter kernel (`RailA2A_LsaLD`) on gfx942/gfx950. * Added bias (accumulation) AllReduce on gfx1250 (MI450). * Added optimized scale-up ReduceScatter, AllGather, and AllToAll kernels. -* Added rocprofiler coverage for `ncclCommGrow` and `ncclCommGetUniqueId`. +* Added ROCProfiler-SDK coverage for `ncclCommGrow` and `ncclCommGetUniqueId`. * P2P batching auto-enabled for gfx950 in combination with non-AINIC NICs. * Display HIP/ROCm runtime versions in `NCCL_DEBUG` output. * Detect ROCm version via core symlink for multi-architecture installs. 
From 4d54daf41d36c99d046d4300f5cbceb6aa0b2b0a Mon Sep 17 00:00:00 2001 From: Pratik Basyal Date: Wed, 24 Jun 2026 16:27:39 -0400 Subject: [PATCH 4/4] Update CHANGELOG.md --- projects/rccl/CHANGELOG.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/projects/rccl/CHANGELOG.md b/projects/rccl/CHANGELOG.md index 06c67d8e32c..2b6fba2848f 100644 --- a/projects/rccl/CHANGELOG.md +++ b/projects/rccl/CHANGELOG.md @@ -24,11 +24,11 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https: * Added `RCCL host API` pull-in from NCCL 2.30. ### Changed -* Enable WarpSpeed auto mode for grow communicators. +* Enabled WarpSpeed auto mode for grow communicators. * Refactored AllGather algorithm selection; hierarchical AllGather now enabled by default for multi-node. * Swapped legacy `net_ib` with the `net_ib` implementation from NCCL 2.29. * Skip per-warp channel LDS copy when `warpComm` is disabled. -* Harden proxy RPC setup against malformed peer input. +* Hardened proxy RPC setup against malformed peer input. * The bootstrap AllGather now uses the bidirectional ring (N/2 steps) by default on the socket OOB path. `NCCL_BOOTSTRAP_BIDIR_ALLGATHER` now defaults to `1`; set it to `0` to fall back to the unidirectional ring. The net OOB path (`NCCL_OOB_NET_ENABLE`) and its bidirectional variant (`NCCL_BOOTSTRAP_BIDIR_NET`) remain off by default. * `NCCL_PXN_C2C` is kept default-off (`0`); upstream NCCL defaults it to `1` since 2.28. The C2C PXN routing path is NVIDIA-specific and is not currently applicable on AMD hardware. @@ -41,7 +41,7 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https: * Tuned symmetric memory kernels. * Parallelized communicator destruction across child processes to reduce teardown latency. -### Resolved Issues +### Resolved issues * Fixed `ncclCommGrow` channel-count divergence causing incorrect collective routing. 
* Fixed `ncclCommGrow` hang when growing to an 8-rank single-node communicator. * Fixed symmetric LDS under-reservation in legacy (non-device-linker) builds.