Conversation


@waredjeb waredjeb commented Nov 2, 2021

Caching Allocator and Async Allocator Enabled

[patatrack02 pixeltrack-standalone]$ CUDA_VISIBLE_DEVICES=0 numactl -N 1 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --validation --maxEvents 10000; echo; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 1 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --maxEvents 10000; done
Processing 10000 events, of which 16 concurrently, with 8 threads.
CountValidator: all 10000 events passed validation
 Average relative track difference 0.00088565 (all within tolerance)
 Average absolute vertex difference 0.0006 (all within tolerance)
Processed 10000 events in 1.472030e+01 seconds, throughput 679.334 events/s.

Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.990729e+00 seconds, throughput 2003.72 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.992631e+00 seconds, throughput 2002.95 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.986976e+00 seconds, throughput 2005.22 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.992776e+00 seconds, throughput 2002.89 events/s.

Disabling the Caching Allocator and the Async Allocator

make alpaka -j 10 CUDA_BASE=/usr/local/cuda-11.2 USER_CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR -DALPAKA_DISABLE_ASYNC_ALLOCATOR"

[patatrack02 pixeltrack-standalone]$ CUDA_VISIBLE_DEVICES=0 numactl -N 1 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --validation --maxEvents 10000; echo; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 1 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --maxEvents 10000; done
Processing 10000 events, of which 16 concurrently, with 8 threads.
CountValidator: all 10000 events passed validation
 Average relative track difference 0.000888953 (all within tolerance)
 Average absolute vertex difference 0.0005 (all within tolerance)
Processed 10000 events in 4.884283e+01 seconds, throughput 204.738 events/s.

Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.017372e+01 seconds, throughput 248.919 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.029434e+01 seconds, throughput 248.174 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.101676e+01 seconds, throughput 243.803 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 3.935691e+01 seconds, throughput 254.085 events/s.

@waredjeb waredjeb changed the title Caching allocators for host and device [alpaka] Caching allocators for host and device Nov 2, 2021
@fwyzard fwyzard added the alpaka label Nov 2, 2021
@fwyzard fwyzard left a comment
First round of comments...

* and sets a maximum of 6,291,455 cached bytes per device
*
*/
struct CachingDeviceAllocator {

This class should be templated on the Device type:

Suggested change
struct CachingDeviceAllocator {
template <typename TDevice>
struct CachingDeviceAllocator {
public:
using Device = TDevice;

BlockDescriptor(unsigned int block_bin,
size_t block_bytes,
size_t bytes_requested,
const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device)

::ALPAKA_ACCELERATOR_NAMESPACE::Device should be replaced by the Device template type:

Suggested change
const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device)
Device const& device)

bin{block_bin} {}

// Constructor (suitable for searching maps for a specific block, given a device buffer)
BlockDescriptor(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> buffer)

::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> should be replaced by a buffer type that depends on the Device template parameter:

Suggested change
BlockDescriptor(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> buffer)
using DeviceBuffer = alpaka::Buf<Device, std::byte, Dim1D, Idx>;
...
BlockDescriptor(DeviceBuffer buffer)

using CachedBlocks = std::unordered_multiset<BlockDescriptor, BlockHashByBytes, BlockEqualByBytes>;

/// Set type for live blocks (hashed by ptr)
using BusyBlocks = std::unordered_multiset<BlockDescriptor, BlockHashByPtr, BlockEqualByPtr>;
@makortel makortel Nov 2, 2021

Was the performance between std::unordered_multiset and the original std::multiset measured for this case?

device) ///< [in] The device to be associated with this allocation
{
std::unique_lock<std::mutex> mutex_locker(mutex, std::defer_lock);
int device_idx = getIdxOfDev(device);

There is already a function with similar functionality in https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/alpaka/AlpakaCore/getDevIndex.h

Suggested change
int device_idx = getIdxOfDev(device);
int device_idx = getDevIndex(device);

(I won't repeat this comment)

if (!found) {
search_key.buf = alpaka::allocBuf<std::byte, alpaka_common::Idx>(
device, static_cast<alpaka_common::Extent>(search_key.bytes));
#if CUDA_VERSION >= 11020

The codebase already requires CUDA >= 11.2; is this check really needed?

search_key.buf = alpaka::allocBuf<std::byte, alpaka_common::Idx>(
device, static_cast<alpaka_common::Extent>(search_key.bytes));
#if CUDA_VERSION >= 11020
alpaka::prepareForAsyncCopy(search_key.buf);

What does prepareForAsyncCopy() do? I'd guess it calls cudaHostRegister() or similar under the hood for host memory.

Looking at the code I see this is a no-op for CUDA/HIP buffers
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp#L408-L418
(so in principle this call would not be needed for the device allocator)

For host buffers it indeed ends up calling cudaHostRegister()
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L363-L378
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L269-L303

But given the

#if(defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && BOOST_LANG_CUDA) || (defined(ALPAKA_ACC_GPU_HIP_ENABLED) && BOOST_LANG_HIP)

the cudaHostRegister() is called only if the source file including this Alpaka header is compiled with ALPAKA_ACC_GPU_CUDA_ENABLED and with nvcc (which we do when that macro is enabled). I'm afraid that mixing it with code using BufCpu.hpp compiled without ALPAKA_ACC_GPU_CUDA_ENABLED would technically violate the ODR.

I also see the definition of alpaka::detail::BufCpuImpl depends on these macros, in particular whether the m_bPinned member exists or not
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L103-L105

*/
auto DeviceAllocate(size_t bytes, ///< [in] Minimum no. of bytes for the allocation
const ::ALPAKA_ACCELERATOR_NAMESPACE::Device&
device) ///< [in] The device to be associated with this allocation

What is the plan towards an asynchronous (in a sense, garbage-collecting) API?


Do all scopes creating and destroying unique_ptrs have alpaka::wait() before the end of scope?

Comment on lines +18 to +21
inline int getIdxOfDev(const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device) {
static const auto devices{alpaka::getDevs<::ALPAKA_ACCELERATOR_NAMESPACE::Platform>()};
return (std::find(devices.begin(), devices.end(), device) - devices.begin());
}

template <typename TData>
class DeviceDeleter {
public:
DeviceDeleter(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<TData> buffer) : buf{std::move(buffer)} {}

This class would need to be templated over Device.

Comment on lines +30 to +39
template <typename TData>
using unique_ptr =
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
std::unique_ptr<TData,
impl::DeviceDeleter<
std::conditional_t<allocator::policy == allocator::Policy::Caching, std::byte, TData>>>;
#else
host::unique_ptr<TData>;
#endif
} // namespace device

Technically this violates ODR. Templating the unique_ptr also over Device and providing a specialization for DevCudaRt if ALPAKA_ACC_GPU_CUDA_ENABLED is defined would work. In longer term, I think, we should think about hiding the exact deleter type from the type of the unique_ptr (or just go directly to shared_ptr semantics).

Comment on lines +58 to +60
if constexpr (allocator::policy == allocator::Policy::Asynchronous) {
alpaka::prepareForAsyncCopy(buf);
}

The behavior of the Asynchronous policy would be different from the CUDA version, right?


fwyzard commented Feb 1, 2022

reimplemented in #301

@fwyzard fwyzard closed this Feb 1, 2022
