[alpaka] Support CUDA or ROCm/HIP #342
*Force-pushed from 6e7af4a to a29359b.*
*Force-pushed from d2c175e to f7829b6.*
Now supports building either CUDA or ROCm/HIP:

```
$ make -j`nproc` alpaka ROCM_BASE= CUDA_BASE=/usr/local/cuda-11.5
...
$ source env.sh
$ ./alpaka --cuda --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000
Found 1 device:
  - NVIDIA GeForce GTX 1080 Ti
Processing 2000 events, of which 20 concurrently, with 20 threads.
Processed 2000 events in 2.096939e+00 seconds, throughput 953.771 events/s, CPU usage per thread: 64.1%

$ make clean
rm -fR /data/user/fwyzard/pixeltrack-standalone/lib /data/user/fwyzard/pixeltrack-standalone/obj /data/user/fwyzard/pixeltrack-standalone/test alpaka alpakatest cuda cudacompat cudadev cudatest cudauvm fwtest hip hiptest kokkos kokkostest serial sycltest
$ rm env.sh

$ make -j`nproc` alpaka ROCM_BASE=/opt/rocm-5.0.2 CUDA_BASE=
...
$ source env.sh
$ ./alpaka --hip --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000
Found 1 device:
  - Radeon Pro WX 9100
Processing 2000 events, of which 20 concurrently, with 20 threads.
Processed 2000 events in 4.311149e+00 seconds, throughput 463.913 events/s, CPU usage per thread: 73.1%
```
*Force-pushed from f7829b6 to 5fd1098.*
@makortel this PR has grown to be quite large... let me know if you would rather have it split into smaller ones.
By the way, I've tested that …

The compilation …
OK, then I won't worry about it.
Looking at the commits, I think splitting this PR into three could be worth it.
*Force-pushed from 5fd1098 to c17df5a.*
OK, I've split it into …

This PR needs to be merged after #347.
```cpp
// cms-patatrack/pixeltrack-standalone#210
alpaka::mem_fence(acc, alpaka::memory_scope::Grid{});
```
We should benchmark the changes on an NVIDIA GPU to see if this has any negative impact on the performance there.
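For reference, `alpaka::mem_fence` with `alpaka::memory_scope::Grid{}` orders memory accesses across all blocks in the grid. A minimal sketch of the publish/consume pattern such a fence enables (the kernel and variable names are hypothetical, not taken from this PR):

```cpp
#include <alpaka/alpaka.hpp>

// Hypothetical illustration: the grid-scope fence ensures that a thread in
// another block that later observes the flag also observes the payload write.
struct PublishKernel {
  template <typename TAcc>
  ALPAKA_FN_ACC void operator()(TAcc const& acc, int* payload, int* flag) const {
    *payload = 42;                                         // write the data
    alpaka::mem_fence(acc, alpaka::memory_scope::Grid{});  // order payload before flag
    *flag = 1;                                             // publish
  }
};
```

In a real kernel the consumer side would need matching ordering (e.g. an atomic or another fence) when reading the flag.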
```make
$$($(1)_ROCM_LIB): $$($(1)_ROCM_OBJ) $$(foreach dep,$(EXTERNAL_DEPENDS_H),$$($$(dep)_DEPS)) $$(foreach lib,$$($(1)_DEPENDS),$$($$(lib)_LIB)) $$(foreach lib,$$($(1)_DEPENDS),$$($$(lib)_ROCM_LIB))
	@[ -d $$(@D) ] || mkdir -p $$(@D)
	$(CXX) $$($(1)_ROCM_OBJ) $(LDFLAGS) -shared $(SO_LDFLAGS) $(LIB_LDFLAGS) $$(foreach lib,$$($(1)_DEPENDS),$$($$(lib)_LDFLAGS)) $$(foreach lib,$$($(1)_DEPENDS),$$($$(lib)_ROCM_LDFLAGS)) $$(foreach dep,$(EXTERNAL_DEPENDS),$$($$(dep)_LDFLAGS)) -o $$@
```
Oh nice, linking object files with ROCm device code works automatically with the host compiler.
I guess that's the case as long as we don't use separate compilation (`-fno-gpu-rdc`).
If we switch on `-fgpu-rdc`, we probably need some special support.
On the other hand, with Alpaka (almost) all device code will end up in header files, templated on the accelerator type.
So maybe we can get rid of separable compilation for CUDA as well?
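For illustration, this is the header-only style being referred to: because the kernel's `operator()` is a template on the accelerator type, each translation unit instantiates the device code it needs, so no cross-object device linking is required. The kernel below is a hypothetical sketch, not code from this repository:

```cpp
#include <cstdint>
#include <alpaka/alpaka.hpp>

// Hypothetical header-only kernel, templated on the accelerator type;
// the same source compiles for the CUDA, HIP, or CPU backends.
struct ScaleKernel {
  template <typename TAcc>
  ALPAKA_FN_ACC void operator()(TAcc const& acc, float* data, uint32_t n, float factor) const {
    // one-dimensional global thread index
    auto const idx = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
    if (idx < n)
      data[idx] *= factor;
  }
};
```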
- Allow building the "alpakatest" application with support for either of CUDA or ROCm/HIP.
- Allow building the "alpaka" application with support for either of CUDA or ROCm/HIP.
*Force-pushed from c24ea5f to 23428cc.*
Rebased, and squashed the …




Implement the changes for building the `alpakatest` and `alpaka` applications with support for either of CUDA or HIP/ROCm.

**Host code changes**

Since I hope to be able to enable both CUDA and HIP/ROCm at some point in the future, I've decided to split the relevant Alpaka types already now. Alpaka itself does a mixed effort on this: some types are defined separately for each backend, while others exist only as common `UniformCudaHipRt` types, with `using` aliases for the common type. I've added these last ones in `src/alpaka/alpaka/alpakaExtra.hpp`, with the intention of moving them into Alpaka itself sooner or later.

- Replace the use of the `UniformCudaHipRt` types with the explicit `CudaRt` types.
- Add similar code paths and definitions for the `HipRt` types: I've duplicated all CUDA-specific code that was using either the `alpaka_cuda_async` namespace or the `ALPAKA_ACC_GPU_CUDA_ENABLED` macro with HIP/ROCm equivalent code, using the `alpaka_rocm_async` namespace and the `ALPAKA_ACC_GPU_HIP_ENABLED` macro (see the sketch below).
- Update the command line options in `main.cc` and the list of plugins.
- Update the code under the `.../alpaka/...` folders. Mostly, I've changed … to …

There are also some unrelated changes due to `clang` complaining about implicitly-deleted default constructors, the inappropriate use of `std::move`, and some missing casts.
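To make the duplication described above concrete, here is a hedged sketch of the pattern; the namespaces and macros are the ones named in the list, but the bodies and the `initialise` entry point are hypothetical:

```cpp
// Illustrative sketch: a CUDA-specific code path and its ROCm/HIP twin,
// each compiled in only when the corresponding backend is enabled.
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
namespace alpaka_cuda_async {
  void initialise();  // hypothetical entry point, built against the CUDA backend
}
#endif

#ifdef ALPAKA_ACC_GPU_HIP_ENABLED
namespace alpaka_rocm_async {
  void initialise();  // the duplicated equivalent, built against the HIP/ROCm backend
}
#endif
```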
**Device code changes**

The two places with a lot of changes are `src/alpaka/AlpakaCore/prefixScan.h` and `src/alpaka/AlpakaCore/radixSort.h`: HIP does not support the masked warp instructions like `__shfl_up_sync`; it still has the pre-CUDA 9 versions like `__shfl_up`, so I've `#ifdef`ed them... eventually the whole code should be rewritten using the primitives provided by Alpaka, and benchmarked to make sure that does not introduce any regressions.

I've also added (unconditionally) the memory fence from #210; this too should be benchmarked to check the impact on the CUDA implementation.
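As a hedged illustration of the warp-intrinsic `#ifdef` described above (a sketch, not the PR's actual code; the `shuffleUp` helper name is hypothetical):

```cpp
// Select the available shuffle intrinsic per backend: HIP only provides the
// legacy unmasked __shfl_up, while CUDA 9+ provides the masked __shfl_up_sync.
template <typename T>
__device__ inline T shuffleUp(T value, unsigned int delta) {
#if defined(ALPAKA_ACC_GPU_HIP_ENABLED)
  return __shfl_up(value, delta);                    // pre-CUDA-9 style intrinsic
#else
  return __shfl_up_sync(0xffffffffu, value, delta);  // all 32 lanes participate
#endif
}
```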