
Conversation

@fwyzard (Contributor) commented Mar 15, 2022

Implement the changes for building the alpakatest and alpaka applications with support for either CUDA or HIP/ROCm.

Host code changes

  1. Add Cuda and Hip types

Since I hope to be able to enable both CUDA and HIP/ROCm at some point in the future, I've decided to split the relevant Alpaka types already now.
Alpaka itself handles this inconsistently:

  • the "accelerator" types are already available as two distinct types
  • some other types are just aliases of the common type
  • some other types are not available at all

I've added these last ones in src/alpaka/alpaka/alpakaExtra.hpp, with the intention of moving them into Alpaka itself sooner or later.

  2. Replace the use of the UniformCudaHipRt types with the explicit CudaRt types

  3. Add similar code paths and definitions for the HipRt types


I've duplicated all the CUDA-specific code that used either the alpaka_cuda_async namespace or the ALPAKA_ACC_GPU_CUDA_ENABLED macro, adding HIP/ROCm equivalents that use the alpaka_rocm_async namespace and the ALPAKA_ACC_GPU_HIP_ENABLED macro.

  4. Update the command line options in main.cc and the list of plugins

  5. Update the code under the .../alpaka/... folders

Mostly, I've changed

#ifdef ALPAKA_ACC_GPU_CUDA_ASYNC_BACKEND

to

#if defined(ALPAKA_ACC_GPU_CUDA_ASYNC_BACKEND) || defined(ALPAKA_ACC_GPU_HIP_ASYNC_BACKEND)

  6. Unrelated changes

There are also some unrelated changes due to clang complaining about implicitly-deleted default constructors, the inappropriate use of std::move, and some missing casts.

Device code changes

The two places with a lot of changes are src/alpaka/AlpakaCore/prefixScan.h and src/alpaka/AlpakaCore/radixSort.h:

HIP does not support the masked warp instructions like __shfl_up_sync; it still has the pre-CUDA 9 versions like __shfl_up, so I've #ifdef'ed the two variants. Eventually the whole code should be rewritten using the primitives provided by Alpaka, and benchmarked to make sure that this does not introduce any regressions.

I've also added (unconditionally) the memory fence from #210; this too should be benchmarked to check the impact on the CUDA implementation.

@fwyzard fwyzard marked this pull request as draft March 15, 2022 23:49
@fwyzard fwyzard added the alpaka label Mar 16, 2022
@fwyzard (Contributor, Author) commented Mar 16, 2022

$ ./hip --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000 --validation
Found 1 devices
Processing 2000 events, of which 20 concurrently, with 20 threads.
CountValidator: all 2000 events passed validation
 Average relative track difference 0.000922907 (all within tolerance)
 Average absolute vertex difference 0.0005 (all within tolerance)
Processed 2000 events in 6.039834e+00 seconds, throughput 331.135 events/s, CPU usage per thread: 31.2%

@fwyzard fwyzard force-pushed the alpaka_HIP_support branch from 6e7af4a to a29359b Compare March 16, 2022 09:27
@fwyzard fwyzard force-pushed the alpaka_HIP_support branch 2 times, most recently from d2c175e to f7829b6 Compare March 16, 2022 17:55
@fwyzard fwyzard changed the title [alpaka] Code changes to support ROCm *instead* of CUDA [alpaka] Support either CUDA or ROCm/HIP Mar 16, 2022
@fwyzard fwyzard marked this pull request as ready for review March 16, 2022 17:59
@fwyzard (Contributor, Author) commented Mar 16, 2022

Now supports building either CUDA or ROCm/HIP:

$ make -j`nproc` alpaka ROCM_BASE= CUDA_BASE=/usr/local/cuda-11.5
...

$ source env.sh
$ ./alpaka --cuda --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000
Found 1 device:
  - NVIDIA GeForce GTX 1080 Ti
Processing 2000 events, of which 20 concurrently, with 20 threads.
Processed 2000 events in 2.096939e+00 seconds, throughput 953.771 events/s, CPU usage per thread: 64.1%

$ make clean
rm -fR /data/user/fwyzard/pixeltrack-standalone/lib /data/user/fwyzard/pixeltrack-standalone/obj /data/user/fwyzard/pixeltrack-standalone/test alpaka alpakatest cuda cudacompat cudadev cudatest cudauvm fwtest hip hiptest kokkos kokkostest serial sycltest
$ rm env.sh 
$ make -j`nproc` alpaka ROCM_BASE=/opt/rocm-5.0.2 CUDA_BASE=
...

$ source env.sh
$ ./alpaka --hip --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000
Found 1 device:
  - Radeon Pro WX 9100
Processing 2000 events, of which 20 concurrently, with 20 threads.
Processed 2000 events in 4.311149e+00 seconds, throughput 463.913 events/s, CPU usage per thread: 73.1%

@fwyzard fwyzard force-pushed the alpaka_HIP_support branch from f7829b6 to 5fd1098 Compare March 18, 2022 12:46
@fwyzard fwyzard changed the title [alpaka] Support either CUDA or ROCm/HIP Update ROCm support Mar 18, 2022
@fwyzard fwyzard requested a review from makortel March 18, 2022 12:55
@fwyzard (Contributor, Author) commented Mar 18, 2022

@makortel this PR has grown to be quite large... let me know if you would rather have it split into smaller ones.

@fwyzard (Contributor, Author) commented Mar 18, 2022

By the way, I've tested that kokkostest builds and runs, but I could not get kokkos to build, as it would get stuck while compiling or linking some of the tests.

@makortel (Collaborator) commented

By the way, I've tested that kokkostest builds and runs, but I could not get kokkos to build, as it would get stuck while compiling or linking some of the tests.

The compilation of the kokkos program taking outrageously long for HIP is a known problem; see #178 (comment) and the following discussion (it has been reported to Kokkos; the link is at the bottom of the issue).

@fwyzard (Contributor, Author) commented Mar 18, 2022

OK, then I won't worry about it.

@makortel (Collaborator) commented

this PR has grown to be quite large... let me know if you would rather have it split into smaller ones.

Looking at the commits, I think splitting this PR into three could be worth it:

  • Update `Makefile`, `hip`, and `hiptest` (first three commits)
  • Update the Alpaka external and `alpakatest` (next two commits)
  • Add CUDA xor ROCm/HIP support for `alpaka` and `alpakatest` (last two commits)

@fwyzard fwyzard force-pushed the alpaka_HIP_support branch from 5fd1098 to c17df5a Compare March 18, 2022 22:14
@fwyzard fwyzard changed the title Update ROCm support [alpaka] Support CUDA or ROCm/HIP Mar 18, 2022
@fwyzard fwyzard self-assigned this Mar 18, 2022
Comment on lines +177 to +178
// cms-patatrack/pixeltrack-standalone#210
alpaka::mem_fence(acc, alpaka::memory_scope::Grid{});
@fwyzard (Contributor, Author) commented:

We should benchmark the changes on an NVIDIA GPU to see if this has any negative impact on the performance there.


$$($(1)_ROCM_LIB): $$($(1)_ROCM_OBJ) $$(foreach dep,$(EXTERNAL_DEPENDS_H),$$($$(dep)_DEPS)) $$(foreach lib,$$($(1)_DEPENDS),$$($$(lib)_LIB)) $$(foreach lib,$$($(1)_DEPENDS),$$($$(lib)_ROCM_LIB))
@[ -d $$(@D) ] || mkdir -p $$(@D)
$(CXX) $$($(1)_ROCM_OBJ) $(LDFLAGS) -shared $(SO_LDFLAGS) $(LIB_LDFLAGS) $$(foreach lib,$$($(1)_DEPENDS),$$($$(lib)_LDFLAGS)) $$(foreach lib,$$($(1)_DEPENDS),$$($$(lib)_ROCM_LDFLAGS)) $$(foreach dep,$(EXTERNAL_DEPENDS),$$($$(dep)_LDFLAGS)) -o $$@
@makortel (Collaborator) commented:

Oh nice, linking object files with ROCm device code works automatically with the host compiler.

@fwyzard (Contributor, Author) commented:

I guess that's the case as long as we don't use separate compilation (i.e. with `-fno-gpu-rdc`, the default).
If we switch to `-fgpu-rdc` we will probably need some special support.

@fwyzard (Contributor, Author) commented:

On the other hand, with Alpaka (almost) all device code ends up in header files, templated on the accelerator type.
So maybe we can get rid of separable compilation for CUDA as well?

fwyzard added 2 commits March 22, 2022 08:29
Allow building the "alpakatest" application with support for either CUDA or ROCm/HIP.
Allow building the "alpaka" application with support for either CUDA or ROCm/HIP.
@fwyzard fwyzard force-pushed the alpaka_HIP_support branch from c24ea5f to 23428cc Compare March 22, 2022 07:33
@fwyzard (Contributor, Author) commented Mar 22, 2022

Rebased, and squashed the clang-format changes.

@makortel (Collaborator) commented

Here is a comparison on V100 (one set of 1-minute jobs):
[alpaka_cuda_throughput plot]

I'm running another test with longer jobs and repetitions.

@makortel (Collaborator) commented

Here is a comparison on V100 with four 2-minute jobs:
[alpaka_cuda_throughput plot]

The results agree within the statistical uncertainty (a few %).

@fwyzard (Contributor, Author) commented Mar 24, 2022

No differences on a T4, either:
[two throughput comparison plots]

@fwyzard fwyzard merged commit f48b21c into cms-patatrack:master Mar 24, 2022
@fwyzard fwyzard deleted the alpaka_HIP_support branch March 24, 2022 09:52