Add ROCm Support in OpenMPI, + removed deprecated ./configure options #10072

smuzaffar merged 1 commit into cms-sw:IB/CMSSW_16_0_X/master

Conversation
|
A new Pull Request was created by @ghyls for branch IB/CMSSW_16_0_X/master. @akritkbehera, @cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks. |
|
cms-bot internal usage |
|
enable gpu |
|
please test |
|
How big is the ROCm distribution (that we include) nowadays? Can you tell how much of it gets used by the libraries that OpenMPI comes to depend on? I'm mildly concerned (to the extent that I'd like to at least understand it) about the impact on (the size of) the set of libraries used by production jobs. I see the following components depend on OpenMPI:
I now wonder why hdf5 and pytorch depend on OpenMPI and if that could be turned off, but we can move that discussion into a separate issue. |
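One way to get a rough handle on both questions from an installed area is sketched below ($ROCM_ROOT and $OPENMPI_ROOT are placeholders for the corresponding externals' install paths, not variables defined anywhere in the spec):

# total installed size of the ROCm external we ship
du -sh $ROCM_ROOT
# ROCm libraries that the OpenMPI shared library actually links against
ldd $OPENMPI_ROOT/lib/libmpi.so | grep -iE 'hip|hsa|rocm'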
|
@makortel , currently the ROCm distribution we include is about 3 GB. |
|
Thanks @smuzaffar. 3 GB doesn't sound "too bad" (I was remembering O(10 GB)), although it is not small either. |
We've been adding and removing pieces of ROCm to try and keep the size under control - but if we include the ML libraries needed by PyTorch, it will likely blow up again. One option that I plan to start experimenting with in the coming weeks is to build ROCm from sources (via https://github.com/ROCm/TheRock) to limit the support to only the GPU architectures that we have a use case for: MI250X (Lumi), MI300X (NGT), Radeon Pro W7800/W7900 (NGT), maybe MI300A if CMS has an allocation on El Capitan. No idea if and how much it will help, though. |
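As an illustration of what limiting the build to those architectures could look like, most ROCm math libraries take a CMake list of GPU targets; a rough sketch only (the variable name varies between AMDGPU_TARGETS and GPU_TARGETS depending on the library, and the gfx mapping below is an assumption to double-check: MI250X = gfx90a, MI300X/MI300A = gfx942, W7800/W7900 = gfx1100):

# configure a ROCm library build for only the GPU architectures listed above
cmake -DAMDGPU_TARGETS="gfx90a;gfx942;gfx1100" -DCMAKE_BUILD_TYPE=Release ..

Whether TheRock exposes the same knob is something to check once the experiments start.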
My feeling is that would be unlikely. |
|
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48087/summary.html
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic.
Comparison Summary
AMD_MI300X Comparison Summary
NVIDIA_H100 Comparison Summary
NVIDIA_L40S Comparison Summary
NVIDIA_T4 Comparison Summary
|
|
+1 |
openmpi.spec (outdated)
@@ -33,13 +34,11 @@ AUTOMAKE_JOBS=%{compiling_processes} ./autogen.pl
    --disable-mpi-java \
    --with-zlib=$ZLIB_ROOT \
    %{!?without_cuda:--with-cuda=$CUDA_ROOT} \
@ghyls , it looks like our openmpi is not built with cuda [a]. I guess, as we build on a host without cuda installed on the system, it could not find libcuda.so. I suggest changing
%{!?without_cuda:--with-cuda=$CUDA_ROOT} \
to
%{!?without_cuda:--with-cuda=$CUDA_ROOT --with-cuda-libdir=$CUDA_ROOT/lib64/stubs}
so that during build/configure it can pick up libcuda.so from the stubs.
@fwyzard , do you have any better suggestion?
[a]
checking for MCA component accelerator:cuda compile mode... dso
checking if --with-cuda is set... found (/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.9.1-ed601c3aacdd4f0b0abc31ff95aeff6e/include/cuda.h)
checking for cuda pkg-config name... /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.9.1-ed601c3aacdd4f0b0abc31ff95aeff6e/lib/pkgconfig/cuda.pc
checking if cuda pkg-config module exists... no
checking for cuda header at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.9.1-ed601c3aacdd4f0b0abc31ff95aeff6e/include... found
checking for cuda library (cuda) in /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.9.1-ed601c3aacdd4f0b0abc31ff95aeff6e... not found
checking whether CU_MEM_LOCATION_TYPE_HOST_NUMA is declared... yes
checking if have cuda support... no
checking if MCA component accelerator:cuda can compile... no
If using the stubs works, that could be a good workaround.
Another possibility could be --with-cuda-libdir=$CUDA_ROOT/drivers, to pick up the compatibility drivers.
Hopefully neither will hard-code the paths in the binary 🤷🏻
@ghyls do you have access to a machine without CUDA, to check if building with either the stub library or the compatibility library works?
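Whichever option is used, a quick sanity check on the resulting library would show whether the stubs or drivers path ends up baked into the binary; a small sketch, with the installed libmpi.so path used as a placeholder:

# look for an embedded RPATH/RUNPATH pointing at the stubs or drivers directory
readelf -d $OPENMPI_ROOT/lib/libmpi.so | grep -E 'RPATH|RUNPATH'
# libcuda.so should not become a hard runtime dependency; ideally it does not
# appear here at all, or at worst shows up as "not found" on a driver-less host
ldd $OPENMPI_ROOT/lib/libmpi.so | grep libcuda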
|
@ghyls , can you please also add in the [a] |
|
REMINDER @mandrenguyen, @sextonkennedy, @ftenchini: This PR was tested with cms-sw/cms-bot#2567, please check if they should be merged together |
|
-1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48689/summary.html
External Build
I found a compilation error when building:
Lustre: no (not found)
PVFS2/OrangeFS: no
+ --with-rocm=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/rocm/6.4.3-8bc52e5de186aa7fa61c7d17f290f0df --with-hwloc=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/hwloc/2.12.2-0e4be55b06015a70e96883ca65eb3e61 --with-ofi=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/libfabric/2.1.0-8de1033f0b20ec964002c4f73fa267b0 --without-portals4 --without-psm2 --with-ucx=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/ucx/1.19.0-68aa36405ac1a03bc7eb47fd8708d9a7 --with-cma --without-knem --with-xpmem=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/xpmem/v2.6.3-20220308-9b40da6112cf24c0bcdb5df4d025e6d1 --with-pic --disable-io-romio --with-gnu-ld --with-pmix=internal
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.GdilVg: line 61: --with-rocm=/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/rocm/6.4.3-8bc52e5de186aa7fa61c7d17f290f0df: No such file or directory
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.GdilVg (%prep)
RPM build warnings:
Macro expanded in comment on line 406: %{pkginstroot}
Macro expanded in comment on line 407: %{pkginstroot}
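A likely reading of the failure above: the shell xtrace line starting with "+ --with-rocm=..." means those options were executed as a command of their own, which happens when the preceding line of the configure invocation loses its trailing backslash (and indeed the updated --with-cuda line in the diff below has none). A minimal sketch of the intended continuation, where the %{!?without_rocm:...} guard is only an assumption mirroring the existing CUDA guard:

./configure \
    %{!?without_cuda:--with-cuda=$CUDA_ROOT --with-cuda-libdir=$CUDA_ROOT/lib64/stubs} \
    %{!?without_rocm:--with-rocm=$ROCM_ROOT} \
    --with-hwloc=$HWLOC_ROOT

Every line except the last needs the trailing backslash, otherwise everything after the break is run as a separate command.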
|
openmpi.spec (outdated)
    --disable-mpi-java \
    --with-zlib=$ZLIB_ROOT \
-   %{!?without_cuda:--with-cuda=$CUDA_ROOT} \
+   %{!?without_cuda:--with-cuda=$CUDA_ROOT --with-cuda-libdir=$CUDA_ROOT/lib64/stubs}
Force-pushed from 7d49d3d to dbe497f
|
Pull request #10072 was updated. |
|
please test |
|
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48690/summary.html
Comparison Summary
AMD_W7900 Comparison Summary
NVIDIA_H100 Comparison Summary
NVIDIA_L40S Comparison Summary
NVIDIA_T4 Comparison Summary
|
|
please test for el8_aarch64_gcc13 |
|
With the latest changes, openmpi now has both CUDA and ROCm accelerator support enabled. |
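A quick way to confirm what actually got compiled in, assuming the ompi_info binary from the new build is on the PATH:

# list the MCA accelerator components built into this OpenMPI installation
ompi_info | grep -i "MCA accelerator"
# both the cuda and rocm components should be listed if support is enabled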
|
-1
Failed Tests: RelVals
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic.
|
The stack trace looks worrisome (also connects to cms-sw/cmssw#48940) |
|
Is it related to these changes? |
|
For what it's worth, running the workflow locally on |
|
please test for el8_aarch64_gcc13 |
|
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a49e38/48747/summary.html |
Hard to say, because there is very little useful information about the problem itself, but I'd guess probably not. Although there appear to be more changes in the build options than just the optional ROCm flags (but I don't know their impact). |
|
Actually, on further thought, the stack trace in #10072 (comment) is probably in line with expectations, given cms-sw/cmssw#48940 |
|
+externals
This is required to allow MPI to access memory in AMD GPUs, and in particular to perform RDMA to/from AMD GPUs.
Also removed the following configure options, which are deprecated in OpenMPI 5:
- --without-psm (PSM has been removed and is no longer supported [1])
- --without-mxm (MXM has been removed and replaced by UCX support [1])
- --with-verbs=$RDMA_CORE_ROOT (verbs support is now provided through UCX [2])
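Put together, a rough diff-style sketch of the spec change described above (the exact placement of the lines and the %{!?without_rocm:...} guard are assumptions; $ROCM_ROOT corresponds to the --with-rocm path visible in the build log):

-    --without-psm \
-    --without-mxm \
-    --with-verbs=$RDMA_CORE_ROOT \
+    %{!?without_rocm:--with-rocm=$ROCM_ROOT} \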