[alpaka] Caching allocators for host and device #260
Conversation
fwyzard
left a comment
First round of comments...
```cpp
 * and sets a maximum of 6,291,455 cached bytes per device
 *
 */
struct CachingDeviceAllocator {
```
This class should be templated on the Device type:
```diff
-struct CachingDeviceAllocator {
+template <typename TDevice>
+struct CachingDeviceAllocator {
+public:
+  using Device = TDevice;
```
```cpp
BlockDescriptor(unsigned int block_bin,
                size_t block_bytes,
                size_t bytes_requested,
                const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device)
```
`::ALPAKA_ACCELERATOR_NAMESPACE::Device` should be replaced by the `Device` template type:
```diff
-const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device)
+Device const& device)
```
```cpp
      bin{block_bin} {}

// Constructor (suitable for searching maps for a specific block, given a device buffer)
BlockDescriptor(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> buffer)
```
`::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte>` should be replaced by a buffer type that depends on the `Device` template parameter:
```diff
-BlockDescriptor(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> buffer)
+using DeviceBuffer = alpaka::Buf<Device, std::byte, Dim1D, Idx>;
+...
+BlockDescriptor(DeviceBuffer buffer)
```
```cpp
using CachedBlocks = std::unordered_multiset<BlockDescriptor, BlockHashByBytes, BlockEqualByBytes>;

/// Set type for live blocks (hashed by ptr)
using BusyBlocks = std::unordered_multiset<BlockDescriptor, BlockHashByPtr, BlockEqualByPtr>;
```
Was the performance of `std::unordered_multiset` compared against the original `std::multiset` for this case?
```cpp
                    device)  ///< [in] The device to be associated with this allocation
{
  std::unique_lock<std::mutex> mutex_locker(mutex, std::defer_lock);
  int device_idx = getIdxOfDev(device);
```
There is already a function with similar functionality in https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/alpaka/AlpakaCore/getDevIndex.h
```diff
-int device_idx = getIdxOfDev(device);
+int device_idx = getDevIndex(device);
```
(I won't repeat this comment)
```cpp
if (!found) {
  search_key.buf = alpaka::allocBuf<std::byte, alpaka_common::Idx>(
      device, static_cast<alpaka_common::Extent>(search_key.bytes));
#if CUDA_VERSION >= 11020
```
The codebase already requires CUDA >= 11.2, is this check really needed?
```cpp
search_key.buf = alpaka::allocBuf<std::byte, alpaka_common::Idx>(
    device, static_cast<alpaka_common::Extent>(search_key.bytes));
#if CUDA_VERSION >= 11020
alpaka::prepareForAsyncCopy(search_key.buf);
```
What does `prepareForAsyncCopy()` do? I'd guess it calls `cudaHostRegister()` or similar under the hood for host memory.
Looking at the code I see this is a no-op for CUDA/HIP buffers
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp#L408-L418
(so in principle this call would not be needed for the device allocator)
For host buffers it indeed ends up calling cudaHostRegister()
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L363-L378
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L269-L303
But given the guard
`#if (defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && BOOST_LANG_CUDA) || (defined(ALPAKA_ACC_GPU_HIP_ENABLED) && BOOST_LANG_HIP)`,
`cudaHostRegister()` is called only if the source file including this Alpaka header is compiled with `ALPAKA_ACC_GPU_CUDA_ENABLED` and with nvcc (which we do when that macro is enabled). I'm afraid that mixing this with code using BufCpu.hpp compiled without `ALPAKA_ACC_GPU_CUDA_ENABLED` would technically violate the ODR.
I also see the definition of alpaka::detail::BufCpuImpl depends on these macros, in particular whether the m_bPinned member exists or not
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L103-L105
```cpp
 */
auto DeviceAllocate(size_t bytes,  ///< [in] Minimum no. of bytes for the allocation
                    const ::ALPAKA_ACCELERATOR_NAMESPACE::Device&
                        device)  ///< [in] The device to be associated with this allocation
```
What is the plan towards an asynchronous (in a sense, garbage-collecting) API?
Do all scopes creating and destroying `unique_ptr`s have `alpaka::wait()` before the end of scope?
```cpp
inline int getIdxOfDev(const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device) {
  static const auto devices{alpaka::getDevs<::ALPAKA_ACCELERATOR_NAMESPACE::Platform>()};
  return (std::find(devices.begin(), devices.end(), device) - devices.begin());
}
```
(just repeating) Given https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/alpaka/AlpakaCore/getDevIndex.h this is not needed.
```cpp
template <typename TData>
class DeviceDeleter {
public:
  DeviceDeleter(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<TData> buffer) : buf{std::move(buffer)} {}
```
This class would need to be templated over Device.
```cpp
template <typename TData>
using unique_ptr =
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
    std::unique_ptr<TData,
                    impl::DeviceDeleter<
                        std::conditional_t<allocator::policy == allocator::Policy::Caching, std::byte, TData>>>;
#else
    host::unique_ptr<TData>;
#endif
}  // namespace device
```
Technically this violates the ODR. Templating the `unique_ptr` also over `Device`, and providing a specialization for `DevCudaRt` when `ALPAKA_ACC_GPU_CUDA_ENABLED` is defined, would work. In the longer term, I think, we should consider hiding the exact deleter type from the type of the `unique_ptr` (or just go directly to `shared_ptr` semantics).
```cpp
if constexpr (allocator::policy == allocator::Policy::Asynchronous) {
  alpaka::prepareForAsyncCopy(buf);
}
```
The behavior of the Asynchronous policy would be different from CUDA's, right?
reimplemented in #301
Caching Allocator and Async Allocator Enabled
Disabling the Caching Allocator and the Async Allocator:
```shell
make alpaka -j 10 CUDA_BASE=/usr/local/cuda-11.2 USER_CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR -DALPAKA_DISABLE_ASYNC_ALLOCATOR"
```