
Add ROCm Support in OpenMPI, + removed deprecated ./configure options #10072

Merged
smuzaffar merged 1 commit into cms-sw:IB/CMSSW_16_0_X/master from ghyls:devel-ompi-rocm
Oct 21, 2025

Conversation

@ghyls
Contributor

@ghyls ghyls commented Sep 12, 2025

This is required to allow MPI to access memory in AMD GPUs, and in particular to perform RDMA to/from AMD GPUs.

Also removed the following configure options, which are deprecated in OpenMPI 5:

  • --without-psm (PSM has been removed and is no longer supported [1] )
  • --without-mxm (mxm has been removed, and replaced by UCX support [1] )
  • --with-verbs=$RDMA_CORE_ROOT (verbs support is provided through UCX now [2] )
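As a rough illustration of the change described above, the accelerator and transport flags could be assembled as in the sketch below. This is not the actual openmpi.spec: the helper name and the `ROCM_ROOT` and `UCX_ROOT` variables are assumptions made for the example, and only the flag names are taken from this PR.

```shell
# Hypothetical helper: print the configure flags discussed in this PR, one per line.
# ROCM_ROOT and UCX_ROOT are assumed to point at the respective installations.
ompi_configure_flags() {
  # ROCm support is optional: only request it when a ROCm root is given.
  if [ -n "${ROCM_ROOT:-}" ]; then
    printf '%s\n' "--with-rocm=${ROCM_ROOT}"
  fi
  # verbs access goes through UCX, so there is no separate --with-verbs flag.
  printf '%s\n' "--with-ucx=${UCX_ROOT:-/hypothetical/ucx}"
  # Intentionally absent: --without-psm, --without-mxm, --with-verbs=...
}
```

Leaving `ROCM_ROOT` empty, for example on a platform without ROCm, simply drops the `--with-rocm` flag.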

@cmsbuild
Contributor

A new Pull Request was created by @ghyls for branch IB/CMSSW_16_0_X/master.

@akritkbehera, @cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Sep 12, 2025

cms-bot internal usage

@fwyzard
Contributor

fwyzard commented Sep 12, 2025

enable gpu

@fwyzard
Contributor

fwyzard commented Sep 12, 2025

please test

@makortel
Contributor

How big is the ROCm distribution (that we include) nowadays? Can you tell how much of that gets used by the libraries that OpenMPI comes to depend on?

I'm mildly concerned (to the extent that I'd like to at least understand it) about the impact on the size of the set of libraries used by production jobs. I see that the following components depend on OpenMPI:

  • boost
    • I guess the use of MPI is limited to boost_mpi tool, which is not used in CMSSW, but I didn't try to understand if any externals would make use of MPI through boost
  • hdf5
    • used by rivet, yoda, and highfive; used further by herwig7 and professor2
      • used by `GeneratorInterface/{LHEInterface,RivetInterface,Herwig7Interface}`
  • pytorch
    • not used in production yet
  • sherpa

I now wonder why hdf5 and pytorch depend on OpenMPI and if that could be turned off, but we can move that discussion into a separate issue.
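One way to check whether a given shared library links against MPI is to inspect its dynamic section. The sketch below is only an illustration: the helper name is hypothetical, it assumes the usual `readelf -d` output format, and it assumes that the MPI libraries are named `libmpi.*` or `libmpi_*`.

```shell
# Hypothetical helper: read `readelf -d` output on stdin and succeed if any
# NEEDED entry names an MPI library (libmpi.so.*, libmpi_*.so.*).
needs_mpi() {
  grep -q 'NEEDED.*\[libmpi[._]'
}
```

It could be used as `readelf -d path/to/library.so | needs_mpi && echo "links against MPI"`, where the library path is a placeholder.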

@smuzaffar
Contributor

@makortel, currently the ROCm distribution size is around 3 GB.
In https://github.com/cms-sw/cmsdist/pull/10056/files#diff-6c5a60ebe90df4a0319bb73aa4561e90267da6781d8efc1315b3cbc0c929078f @iarspider found that we need to distribute more ROCm libraries for PyTorch with ROCm. I have not yet checked whether we really need those extra ROCm libraries.

@makortel
Contributor

Thanks @smuzaffar. 3 GB doesn't sound "too bad" (I was remembering O(10 GB)), although it is not small either.

@smuzaffar
Contributor

@makortel, yes, it was between 20 and 30 GB (with the full distribution), and then @fwyzard trimmed it down to 3 GB by distributing only the minimum.

@fwyzard
Contributor

fwyzard commented Sep 12, 2025

Thanks @smuzaffar. 3 GB doesn't sound "too bad" (I was remembering O(10 GB)), although it is not small either.

We've been adding and removing pieces of ROCm to try and keep the size under control - but if we include the ML libraries needed by PyTorch, it will likely blow up again.

One option that I plan to start experimenting with in the coming weeks is to build ROCm from sources (via https://github.com/ROCm/TheRock), to limit the support to only the GPU architectures that we have a use case for: MI250X (Lumi), MI300X (NGT), Radeon Pro W7800/W7900 (NGT), and maybe MI300A if CMS has an allocation on El Capitan.

No idea if and how much it will help, though.
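To make the list of architectures concrete, the GPU models named above can be mapped to their LLVM AMDGPU target names. This is a sketch only: the helper names are hypothetical, and how a ROCm source build would actually consume such a list is not shown in this thread.

```shell
# Map a GPU model (as named in the comment above) to its LLVM AMDGPU target.
gfx_target() {
  case "$1" in
    MI250X)        echo gfx90a ;;
    MI300X|MI300A) echo gfx942 ;;
    W7800|W7900)   echo gfx1100 ;;
    *)             echo "unknown GPU model: $1" >&2; return 1 ;;
  esac
}

# Build a deduplicated, semicolon-separated target list from GPU model names.
gfx_target_list() {
  for gpu in "$@"; do
    gfx_target "$gpu"
  done | sort -u | paste -s -d ';' -
}
```

Several of the listed models share a target, so the deduplicated list is shorter than the list of GPUs.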

@makortel
Contributor

MI300A if CMS has an allocation on El Capitan.

My feeling is that would be unlikely.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48087/summary.html
COMMIT: 03af1c4
CMSSW: CMSSW_16_0_X_2025-09-12-1100/el8_amd64_gcc12
Additional Tests: GPU,AMD_MI300X,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10072/48087/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48087/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48087/git-merge-result

Comparison Summary

Summary:

  • You potentially added 11 lines to the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4113751
  • DQMHistoTests: Total failures: 26
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4113705
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

NVIDIA_H100 Comparison Summary

Summary:

NVIDIA_L40S Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53486
  • DQMHistoTests: Total failures: 7089
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 46397
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

Summary:

@mandrenguyen

+1

openmpi.spec Outdated
@@ -33,13 +34,11 @@ AUTOMAKE_JOBS=%{compiling_processes} ./autogen.pl
--disable-mpi-java \
--with-zlib=$ZLIB_ROOT \
%{!?without_cuda:--with-cuda=$CUDA_ROOT} \
Contributor

@ghyls, it looks like our OpenMPI is not built with CUDA [a]. I guess that, since we build on a host without CUDA installed on the system, configure could not find libcuda.so. I suggest changing

%{!?without_cuda:--with-cuda=$CUDA_ROOT} \

to

%{!?without_cuda:--with-cuda=$CUDA_ROOT --with-cuda-libdir=$CUDA_ROOT/lib64/stubs}

This way, configure can pick up libcuda.so from the stubs directory during the build.
@fwyzard , do you have any better suggestion?

[a]

checking for MCA component accelerator:cuda compile mode... dso
checking if --with-cuda is set... found (/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.9.1-ed601c3aacdd4f0b0abc31ff95aeff6e/include/cuda.h)
checking for cuda pkg-config name... /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.9.1-ed601c3aacdd4f0b0abc31ff95aeff6e/lib/pkgconfig/cuda.pc
checking if cuda pkg-config module exists... no
checking for cuda header at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.9.1-ed601c3aacdd4f0b0abc31ff95aeff6e/include... found
checking for cuda library (cuda) in /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.9.1-ed601c3aacdd4f0b0abc31ff95aeff6e... not found
checking whether CU_MEM_LOCATION_TYPE_HOST_NUMA is declared... yes
checking if have cuda support... no
checking if MCA component accelerator:cuda can compile... no

Contributor

If using the stubs works, that could be a good workaround.

Another possibility could be --with-cuda-libdir=$CUDA_ROOT/drivers, to pick up the compatibility drivers.

Hopefully neither will hard-code the paths in the binary 🤷🏻

@ghyls do you have access to a machine without CUDA, to check whether building with either the stub library or the compatibility library works?
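The two candidate locations discussed here could be probed in order, as in the sketch below. The helper name is hypothetical, and it assumes that a usable directory contains a file named `libcuda.so`; whether the compatibility drivers directory does so is not confirmed in this thread.

```shell
# Hypothetical helper: print a --with-cuda-libdir flag for the first candidate
# directory under the given CUDA root that contains libcuda.so.
cuda_libdir_flag() {
  root="$1"
  for dir in "$root/lib64/stubs" "$root/drivers"; do
    if [ -e "$dir/libcuda.so" ]; then
      printf '%s%s\n' '--with-cuda-libdir=' "$dir"
      return 0
    fi
  done
  # Neither location provides libcuda.so.
  return 1
}
```

For example, `cuda_libdir_flag "$CUDA_ROOT"` would print the flag for the stubs directory when it exists, and fall back to the drivers directory otherwise.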

@smuzaffar
Contributor

@ghyls, can you please also add the following to the %post section, so that these files do not contain build paths [a]:

%{relocateConfig}/share/pmix/pmixcc-wrapper-data.txt
%{relocateConfig}/include/pmix/src/include/pmix_config.h

[a]

include/pmix/src/include/pmix_config.h:#define PMIX_CONFIGURE_CLI " \'--disable-option-checking\' \'--prefix=/build/muz/d/w/tmp/BUILDROOT/35bea6272e2cbe86c6340f0b67a6dd4f/opt/cmssw/el8_amd64_gcc12/external/openmpi/5.0.8-35bea6272e2cbe86c6340f0b67a6dd4f\' \'--without-tests-examples\'  ......

share/pmix/pmixcc-wrapper-data.txt:preprocessor_flags=-I${includedir} -I${includedir}/pmix  -I/build/muz/d/w/el8_amd64_gcc12/external/hwloc/2.12.2-412235d09f67c300aa559077883316ba/include
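Files that still embed a build-time path, like the two above, can be located with a recursive search over the install prefix. The sketch below is generic and not part of cmsdist; the helper name and both arguments are placeholders chosen for the example.

```shell
# Hypothetical helper: list files under an install prefix that still mention
# a build-time path. Both arguments are supplied by the caller.
files_with_build_path() {
  prefix="$1"
  build_path="$2"
  # -r: recurse, -l: print file names only, -F: treat the path as a fixed string.
  # grep exits non-zero when nothing matches; treat that as "no files found".
  grep -rlF -- "$build_path" "$prefix" || true
}
```

For instance, `files_with_build_path "$PREFIX" /build/muz/d/w` would list the affected files under a given prefix, with `PREFIX` standing in for the installation directory.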

@cmsbuild
Contributor

REMINDER @mandrenguyen, @sextonkennedy, @ftenchini: This PR was tested with cms-sw/cms-bot#2567, please check if they should be merged together

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48689/summary.html
COMMIT: 7d49d3d
CMSSW: CMSSW_16_0_X_2025-10-14-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/10072/48689/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

Lustre: no (not found)
PVFS2/OrangeFS: no

+ --with-rocm=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/rocm/6.4.3-8bc52e5de186aa7fa61c7d17f290f0df --with-hwloc=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/hwloc/2.12.2-0e4be55b06015a70e96883ca65eb3e61 --with-ofi=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/libfabric/2.1.0-8de1033f0b20ec964002c4f73fa267b0 --without-portals4 --without-psm2 --with-ucx=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/ucx/1.19.0-68aa36405ac1a03bc7eb47fd8708d9a7 --with-cma --without-knem --with-xpmem=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/xpmem/v2.6.3-20220308-9b40da6112cf24c0bcdb5df4d025e6d1 --with-pic --disable-io-romio --with-gnu-ld --with-pmix=internal
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.GdilVg: line 61: --with-rocm=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/rocm/6.4.3-8bc52e5de186aa7fa61c7d17f290f0df: No such file or directory
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.GdilVg (%prep)

RPM build warnings:
Macro expanded in comment on line 406: %{pkginstroot}

Macro expanded in comment on line 407: %{pkginstroot}


openmpi.spec Outdated
--disable-mpi-java \
--with-zlib=$ZLIB_ROOT \
%{!?without_cuda:--with-cuda=$CUDA_ROOT} \
%{!?without_cuda:--with-cuda=$CUDA_ROOT --with-cuda-libdir=$CUDA_ROOT/lib64/stubs}
Contributor

@ghyls , please add \ at the end here
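The build failure above follows from how the shell treats line endings: without a trailing backslash the command ends at the newline, and the next line is executed as a command of its own. A minimal sketch with placeholder paths:

```shell
# Helper that just reports how many arguments it received.
count_args() { echo "$#"; }

# With a trailing backslash, the flag on the next line belongs to the same command.
with_backslash=$(count_args --with-cuda=/hypothetical/cuda \
  --with-rocm=/hypothetical/rocm)

# Without it, the shell ends the command at the newline and then tries to
# execute the next line as a separate command, which fails.
without_backslash=$(sh -c 'count_args() { echo "$#"; }
count_args --with-cuda=/hypothetical/cuda
--with-rocm=/hypothetical/rocm' 2>/dev/null || true)

echo "with backslash: $with_backslash argument(s)"
echo "without backslash: $without_backslash argument(s)"
```

In the second case the `--with-rocm=...` line never reaches the command at all, which matches the "No such file or directory" error reported for the `--with-rocm` line in the failed build.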

@cmsbuild
Contributor

Pull request #10072 was updated.

@smuzaffar
Contributor

please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48690/summary.html
COMMIT: dbe497f
CMSSW: CMSSW_16_0_X_2025-10-14-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/10072/48690/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 4 lines to the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3940073
  • DQMHistoTests: Total failures: 33
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3940020
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 218 log files, 188 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

NVIDIA_H100 Comparison Summary

Summary:

NVIDIA_L40S Comparison Summary

Summary:

NVIDIA_T4 Comparison Summary

Summary:

@smuzaffar
Contributor

please test for el8_aarch64_gcc13

@smuzaffar
Contributor

With the latest changes, OpenMPI now has both CUDA and ROCm enabled:

checking if have cuda support... yes (-I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/cuda/12.9.1-cff83d5f72da96ebfea8cafd87a05296/include)
checking if MCA component accelerator:cuda can compile... yes
.....
checking for hip/hip_runtime.h... yes
checking for hipFree... yes
checking if rocm requires libnl v1 or v3... none
checking if MCA component accelerator:rocm can compile... yes
...

Accelerators
-----------------------
CUDA support: yes
ROCm support: yes
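The same check can be made against an installed build by looking at the "MPI extensions" line printed by `ompi_info`. The helper below is a sketch: its name is hypothetical, and the exact layout of that line (a comma-separated list after a colon) is an assumption.

```shell
# Hypothetical helper: read `ompi_info` output on stdin and succeed if the
# named extension appears in the "MPI extensions" line.
has_mpi_extension() {
  awk -v want="$1" -F': *' '
    /MPI extensions/ {
      n = split($2, ext, /, */)
      for (i = 1; i <= n; i++) if (ext[i] == want) found = 1
    }
    END { exit !found }
  '
}
```

It could be used as `ompi_info | has_mpi_extension rocm && echo "ROCm extension present"`, and likewise for `cuda`.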

@cmsbuild
Contributor

-1

Failed Tests: RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48729/summary.html
COMMIT: dbe497f
CMSSW: CMSSW_16_0_X_2025-10-19-2300/el8_aarch64_gcc13
Additional Tests: GPU,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10072/48729/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48729/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48729/git-merge-result

RelVals

  • 17034.0 A fatal system signal has occurred: segmentation violation

@makortel
Contributor

  • 17034.0 A fatal system signal has occurred: segmentation violation

The stack trace looks worrisome

[arm-cmsbuild002:1810326:0:1810326] Caught signal 7 (Bus error: invalid address alignment)

Thread 1 (Thread 0x400018bf63f0 (LWP 1810326) "cmsRun"):
#0  0x0000400019b11a64 in poll () from /lib64/libc.so.6
#1  0x0000400022178f24 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10072/48729/CMSSW_16_0_X_2025-10-19-2300/lib/el8_aarch64_gcc13/pluginFWCoreServicesPlugins.so
#2  0x0000400022179154 in sig_dostack_then_abort () from /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10072/48729/CMSSW_16_0_X_2025-10-19-2300/lib/el8_aarch64_gcc13/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00004000197118c4 in aarch64_fallback_frame_state (context=0x4000ebf50c20, fs=0x4000ebf50fe0) at ./md-unwind-support.h:74
#5  uw_frame_state_for (context=context@entry=0x4000ebf50c20, fs=fs@entry=0x4000ebf50fe0) at ../../../libgcc/unwind-dw2.c:1013
#6  0x0000400019713188 in _Unwind_Backtrace (trace=0x400019b26fa8 <backtrace_helper>, trace_argument=0x4000ebf513e0) at ../../../libgcc/unwind.inc:303
#7  0x0000400019b27174 in backtrace () from /lib64/libc.so.6
#8  0x00004000eaba1d98 in ucs_debug_backtrace_create (strip=2, bckt=0x4000ebf51470) at debug/debug.c:600
#9  ucs_debug_backtrace_create (bckt=0x4000ebf51470, strip=2) at debug/debug.c:589
#10 0x00004000eaba2028 in ucs_debug_print_backtrace (stream=0x400019bb83f8 <_IO_2_1_stderr_>, strip=strip@entry=2) at debug/debug.c:659
#11 0x00004000eaba4070 in ucs_handle_error (message=0x4000eabc43a0 "invalid address alignment") at debug/debug.c:1092
#12 0x00004000eaba41c8 in ucs_debug_handle_error_signal (signo=signo@entry=7, cause=0x4000eabc43a0 "invalid address alignment", fmt=fmt@entry=0x4000eabc1570 "") at debug/debug.c:1044
#13 0x00004000eaba4528 in ucs_error_signal_handler (signo=7, info=0x4000ebf517a0, context=<optimized out>) at debug/debug.c:1060
#14 <signal handler called>
#15 0x000000007dd9d76d in ?? ()
#16 0x0000400029c2f158 in ?? ()
#17 0x0000fffff5368c70 in ?? ()

Current Modules:
Module: NoBPTXMonitor:hltNoBPTXL2Mu40Monitoring (crashed)

(also connects to cms-sw/cmssw#48940)

@fwyzard
Contributor

fwyzard commented Oct 21, 2025

Is it related to these changes?
There is no ROCm on ARM 🤷🏻‍♂️

@fwyzard
Contributor

fwyzard commented Oct 21, 2025

For what it's worth, when I ran the workflow locally on lxplus-arm, it passed:

[2025-10-21 07:08:11] fwyzard@lxplus9102:/tmp/fwyzard/CMSSW_16_0_X_2025-10-19-2300/run$ runTheMatrix.py -l 17034.0
...
17034.0_TTbar_14TeV+2025PU Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Tue Oct 21 07:28:12 2025-date Tue Oct 21 07:09:01 2025; exit: 0 0 0 0
1 1 1 1 tests passed, 0 0 0 0 failed

[2025-10-21 07:28:15] fwyzard@lxplus9102:/tmp/fwyzard/CMSSW_16_0_X_2025-10-19-2300/run$ uname -a
Linux lxplus9102.cern.ch 5.14.0-570.46.1.el9_6.aarch64 #1 SMP PREEMPT_DYNAMIC Tue Sep 16 08:29:52 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

@fwyzard
Contributor

fwyzard commented Oct 21, 2025

please test for el8_aarch64_gcc13

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48747/summary.html
COMMIT: dbe497f
CMSSW: CMSSW_16_0_X_2025-10-20-2300/el8_aarch64_gcc13
Additional Tests: GPU,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10072/48747/install.sh to create a dev area with all the needed externals and cmssw changes.

@makortel
Contributor

Is it related to these changes?

Hard to say, because there is very little useful information about the problem itself, but I'd guess probably not. That said, there appear to be more changes in the build options than just the optional ROCm flags (but I don't know their impact).

@makortel
Contributor

Actually, on further thought, the stack trace in #10072 (comment) is probably "along expectations", given cms-sw/cmssw#48940.

@smuzaffar
Contributor

+externals

@smuzaffar smuzaffar merged commit f9eceec into cms-sw:IB/CMSSW_16_0_X/master Oct 21, 2025
28 checks passed