Conversation


@waredjeb waredjeb commented Nov 2, 2021

Caching Allocator and Async Allocator Enabled

[patatrack02 pixeltrack-standalone]$ CUDA_VISIBLE_DEVICES=0 numactl -N 1 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --validation --maxEvents 10000; echo; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 1 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --maxEvents 10000; done
Processing 10000 events, of which 16 concurrently, with 8 threads.
CountValidator: all 10000 events passed validation
 Average relative track difference 0.00088565 (all within tolerance)
 Average absolute vertex difference 0.0006 (all within tolerance)
Processed 10000 events in 1.472030e+01 seconds, throughput 679.334 events/s.

Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.990729e+00 seconds, throughput 2003.72 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.992631e+00 seconds, throughput 2002.95 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.986976e+00 seconds, throughput 2005.22 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.992776e+00 seconds, throughput 2002.89 events/s.

Disabling the Caching Allocator and the Async Allocator

make alpaka -j 10 CUDA_BASE=/usr/local/cuda-11.2 USER_CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR -DALPAKA_DISABLE_ASYNC_ALLOCATOR"

[patatrack02 pixeltrack-standalone]$ CUDA_VISIBLE_DEVICES=0 numactl -N 1 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --validation --maxEvents 10000; echo; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 1 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --maxEvents 10000; done
Processing 10000 events, of which 16 concurrently, with 8 threads.
CountValidator: all 10000 events passed validation
 Average relative track difference 0.000888953 (all within tolerance)
 Average absolute vertex difference 0.0005 (all within tolerance)
Processed 10000 events in 4.884283e+01 seconds, throughput 204.738 events/s.

Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.017372e+01 seconds, throughput 248.919 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.029434e+01 seconds, throughput 248.174 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.101676e+01 seconds, throughput 243.803 events/s.
Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 3.935691e+01 seconds, throughput 254.085 events/s.

@waredjeb waredjeb changed the title Caching allocators for host and device [alpaka] Caching allocators for host and device Nov 2, 2021
@fwyzard fwyzard added the alpaka label Nov 2, 2021
@fwyzard fwyzard left a comment
First round of comments...

* and sets a maximum of 6,291,455 cached bytes per device
*
*/
struct CachingDeviceAllocator {

This class should be templated on the Device type:

Suggested change
struct CachingDeviceAllocator {
template <typename TDevice>
struct CachingDeviceAllocator {
public:
using Device = TDevice;

BlockDescriptor(unsigned int block_bin,
size_t block_bytes,
size_t bytes_requested,
const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device)

::ALPAKA_ACCELERATOR_NAMESPACE::Device should be replaced by the Device template type:

Suggested change
const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device)
Device const& device)

bin{block_bin} {}

// Constructor (suitable for searching maps for a specific block, given a device buffer)
BlockDescriptor(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> buffer)

::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> should be replaced by a buffer type that depends on the Device template parameter:

Suggested change
BlockDescriptor(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> buffer)
using DeviceBuffer = alpaka::Buf<Device, std::byte, Dim1D, Idx>;
...
BlockDescriptor(DeviceBuffer buffer)

using CachedBlocks = std::unordered_multiset<BlockDescriptor, BlockHashByBytes, BlockEqualByBytes>;

/// Set type for live blocks (hashed by ptr)
using BusyBlocks = std::unordered_multiset<BlockDescriptor, BlockHashByPtr, BlockEqualByPtr>;
@makortel makortel Nov 2, 2021

Was the performance between std::unordered_multiset and the original std::multiset measured for this case?

device) ///< [in] The device to be associated with this allocation
{
std::unique_lock<std::mutex> mutex_locker(mutex, std::defer_lock);
int device_idx = getIdxOfDev(device);

There is already a function with similar functionality in https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/alpaka/AlpakaCore/getDevIndex.h

Suggested change
int device_idx = getIdxOfDev(device);
int device_idx = getDevIndex(device);

(I won't repeat this comment)

if (!found) {
search_key.buf = alpaka::allocBuf<std::byte, alpaka_common::Idx>(
device, static_cast<alpaka_common::Extent>(search_key.bytes));
#if CUDA_VERSION >= 11020

The codebase already requires CUDA >= 11.2; is this check really needed?

search_key.buf = alpaka::allocBuf<std::byte, alpaka_common::Idx>(
device, static_cast<alpaka_common::Extent>(search_key.bytes));
#if CUDA_VERSION >= 11020
alpaka::prepareForAsyncCopy(search_key.buf);

What does prepareForAsyncCopy() do? I'd guess it calls cudaHostRegister() or similar under the hood for host memory.

Looking at the code I see this is a no-op for CUDA/HIP buffers
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp#L408-L418
(so in principle this call would not be needed for the device allocator)

For host buffers it indeed ends up calling cudaHostRegister()
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L363-L378
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L269-L303

But given the

#if(defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && BOOST_LANG_CUDA) || (defined(ALPAKA_ACC_GPU_HIP_ENABLED) && BOOST_LANG_HIP)

the cudaHostRegister() is called only if the source file including this Alpaka header is compiled with ALPAKA_ACC_GPU_CUDA_ENABLED and with nvcc (which we do when that macro is enabled). I'm afraid that mixing it with code using BufCpu.hpp compiled without ALPAKA_ACC_GPU_CUDA_ENABLED would technically violate the ODR.

I also see the definition of alpaka::detail::BufCpuImpl depends on these macros, in particular whether the m_bPinned member exists or not
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L103-L105

*/
auto DeviceAllocate(size_t bytes, ///< [in] Minimum no. of bytes for the allocation
const ::ALPAKA_ACCELERATOR_NAMESPACE::Device&
device) ///< [in] The device to be associated with this allocation

What is the plan towards an asynchronous (in a sense, garbage-collecting) API?


Do all scopes creating and destroying unique_ptrs have alpaka::wait() before the end of scope?

Comment on lines +18 to +21
inline int getIdxOfDev(const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device) {
static const auto devices{alpaka::getDevs<::ALPAKA_ACCELERATOR_NAMESPACE::Platform>()};
return (std::find(devices.begin(), devices.end(), device) - devices.begin());
}

template <typename TData>
class DeviceDeleter {
public:
DeviceDeleter(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<TData> buffer) : buf{std::move(buffer)} {}

This class would need to be templated over Device.

Comment on lines +30 to +39
template <typename TData>
using unique_ptr =
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
std::unique_ptr<TData,
impl::DeviceDeleter<
std::conditional_t<allocator::policy == allocator::Policy::Caching, std::byte, TData>>>;
#else
host::unique_ptr<TData>;
#endif
} // namespace device

Technically this violates ODR. Templating the unique_ptr also over Device and providing a specialization for DevCudaRt if ALPAKA_ACC_GPU_CUDA_ENABLED is defined would work. In longer term, I think, we should think about hiding the exact deleter type from the type of the unique_ptr (or just go directly to shared_ptr semantics).

Comment on lines +58 to +60
if constexpr (allocator::policy == allocator::Policy::Asynchronous) {
alpaka::prepareForAsyncCopy(buf);
}

The behavior of the Asynchronous policy would be different from the CUDA version, right?


fwyzard commented Feb 1, 2022

reimplemented in #301

@fwyzard fwyzard closed this Feb 1, 2022
