[alpaka] Caching allocators for host and device #248

Closed
abhinavramesh8 wants to merge 1 commit into cms-patatrack:master from abhinavramesh8:consolidated_alpaka_caching_allocator

Conversation


abhinavramesh8 commented Oct 19, 2021

Caching memory allocators for host and device have been implemented for Alpaka, similar to the CUDA version. Host and device unique pointers are provided for managing memory allocations. These pointers use the caching allocators by default, but this behaviour can be disabled at compile time. The existing codebase has been updated to make use of these unique pointers.
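The mechanism described above can be sketched in plain C++. All names here are illustrative, not the PR's actual API: a small caching allocator keeps freed blocks in a free list, and a unique-pointer alias uses a deleter that returns memory to the cache instead of freeing it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <memory>

// Hypothetical sketch of a caching host allocator: freed blocks are kept
// in a free list, keyed by size, and reused on later allocations.
class CachingAllocator {
public:
  void* allocate(std::size_t bytes) {
    auto it = free_blocks_.lower_bound(bytes);
    if (it != free_blocks_.end()) {  // reuse a cached block of sufficient size
      void* p = it->second;
      sizes_[p] = it->first;
      free_blocks_.erase(it);
      return p;
    }
    void* p = std::malloc(bytes);    // fall back to a real allocation
    sizes_[p] = bytes;
    return p;
  }
  void deallocate(void* p) {         // return the block to the cache
    free_blocks_.emplace(sizes_.at(p), p);
  }
  std::size_t cached() const { return free_blocks_.size(); }
  ~CachingAllocator() {
    for (auto& [size, p] : free_blocks_) std::free(p);
  }

private:
  std::multimap<std::size_t, void*> free_blocks_;
  std::map<void*, std::size_t> sizes_;
};

CachingAllocator& hostAllocator() {
  static CachingAllocator alloc;
  return alloc;
}

// Deleter that hands memory back to the cache instead of freeing it.
struct CachingDeleter {
  void operator()(void* p) const { hostAllocator().deallocate(p); }
};

// Unique-pointer alias in the spirit of the PR's host unique pointers.
template <typename T>
using host_unique_ptr = std::unique_ptr<T, CachingDeleter>;

// Returns uninitialized storage for n trivially-destructible elements.
template <typename T>
host_unique_ptr<T> make_host_unique(std::size_t n) {
  return host_unique_ptr<T>(static_cast<T*>(hostAllocator().allocate(n * sizeof(T))));
}
```

A compile-time switch (as the PR description mentions) could then select between `CachingDeleter` and a plain freeing deleter in the alias.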

Comment on lines +1 to +2
#ifndef AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigiErrorsCUDA_h
#define AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigiErrorsCUDA_h

Suggested change
#ifndef AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigiErrorsCUDA_h
#define AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigiErrorsCUDA_h
#ifndef AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigiErrorsAlpaka_h
#define AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigiErrorsAlpaka_h

Comment on lines +1 to +2
#ifndef AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigisCUDA_h
#define AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigisCUDA_h

Suggested change
#ifndef AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigisCUDA_h
#define AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigisCUDA_h
#ifndef AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigisAlpaka_h
#define AlpakaDataFormats_SiPixelDigi_interface_SiPixelDigisAlpaka_h

/*
 * Descriptor for device memory allocations
 */
struct BlockDescriptor {
ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> buf; // Device buffer

These classes should either be templated, or placed in a ALPAKA_ACCELERATOR_NAMESPACE namespace, similar to #236. Anything else violates ODR.

In the long term templating would be preferred over the namespace trick, I believe.
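The two options named in the comment can be illustrated with a minimal sketch (types and namespace names here are stand-ins, not the actual Alpaka ones):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Option 1: wrap the class in a per-backend namespace, so each backend
// compiles a distinct type and the one-definition rule is not violated.
// In the real code this would be ALPAKA_ACCELERATOR_NAMESPACE.
namespace alpaka_serial_sync {
  struct BlockDescriptor {
    std::unique_ptr<std::byte[]> buf;  // stand-in for AlpakaDeviceBuf<std::byte>
    std::size_t bytes = 0;
  };
}

// Option 2 (preferred long-term per the comment): template the class on the
// device type, so a single definition serves every backend.
template <typename TDev>
struct BlockDescriptorT {
  TDev device;                       // the device the buffer lives on
  std::unique_ptr<std::byte[]> buf;  // stand-in for a device buffer type
  std::size_t bytes = 0;
};
```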

fwyzard added the alpaka label Oct 19, 2021
@makortel
Collaborator

I have some general thoughts

  • The approach of this PR could be useful for early performance studies (after the ODR is addressed in some way, regardless of what else I write below)
  • In light of a deployment in CMSSW, I think the Event and EventSetup data formats should be independent of Alpaka (at compile time). This would imply a smart-pointer type similar to std::shared_ptr that erases (or hides) the exact type of the deleter.
  • I would imagine caching allocations for non-pinned host memory to only add overhead
  • Which makes me wonder whether, instead of using Alpaka's API in the caching allocator, it would be better to use CUDA directly. Or use Alpaka's API for caching device (and pinned host) memory allocations, but bypass the caching allocator for host backends?
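The second bullet's idea, a data format that does not depend on the backend at compile time, maps naturally onto a type-erased deleter. The following is a minimal sketch (all names hypothetical): `std::shared_ptr` erases its deleter's type at construction, so the product class never names the backend's allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical sketch: a data product that owns backend memory through a
// std::shared_ptr whose deleter type is erased at construction time.
// The data format itself never mentions Alpaka or the caching allocator.
struct DigiData {
  std::shared_ptr<unsigned int[]> adc;  // deleter type is erased here
  std::size_t size = 0;
};

// Backend-specific factory: only this translation unit needs to know how the
// memory was allocated (plain new[] here; a caching allocator in reality).
// The `freed` flag exists only to make the behaviour observable in a test.
inline DigiData makeDigiData(std::size_t n, bool* freed) {
  auto deleter = [freed](unsigned int* p) {
    delete[] p;                 // return memory to the backend allocator
    if (freed) *freed = true;   // record that the deleter ran
  };
  return DigiData{std::shared_ptr<unsigned int[]>(new unsigned int[n], deleter), n};
}
```

A hand-rolled `unique_ptr`-like type with an erased deleter would avoid `shared_ptr`'s reference-counting overhead, but `shared_ptr` shows the principle most compactly.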

@makortel
Collaborator

makortel commented Nov 2, 2021

In light of #260 can this PR be closed then?

@fwyzard
Contributor

fwyzard commented Nov 4, 2021

I would imagine caching allocations for non-pinned host memory to only add overhead

While we likely won't gain in performance from the caching allocator for CPU memory, we do need the stream-ordered behaviour for the "device" memory operations when using a non-blocking queue for a CPU backend (e.g. TBB).
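The stream-ordered behaviour mentioned here can be sketched without any CUDA or Alpaka types (all names illustrative, with `std::shared_future` standing in for a queue event): a freed block is not returned to the pool immediately, but only once an event recorded on the queue at free time has completed, so in-flight work on a non-blocking queue can still safely use it.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <vector>

// Hypothetical sketch of stream-ordered deallocation.
struct CachedBlock {
  void* ptr;
  std::shared_future<void> ready;  // completes when pending queue work is done
};

class StreamOrderedPool {
public:
  // Free p in stream order: the block becomes reusable only after `ready`.
  void freeAsync(void* p, std::shared_future<void> ready) {
    cache_.push_back({p, std::move(ready)});
  }

  // Hand out a cached block only if its event has already completed.
  void* tryReuse() {
    for (auto it = cache_.begin(); it != cache_.end(); ++it) {
      if (it->ready.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
        void* p = it->ptr;
        cache_.erase(it);
        return p;
      }
    }
    return nullptr;  // nothing is safe to reuse yet
  }

private:
  std::vector<CachedBlock> cache_;
};
```

With a blocking (synchronous) queue every event is already complete at free time, so this degenerates to an ordinary cache; the deferred reuse only matters for non-blocking queues such as the TBB backend mentioned above.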

@makortel
Collaborator

makortel commented Nov 4, 2021

I would imagine caching allocations for non-pinned host memory to only add overhead

While we likely won't gain in performance from the caching allocator for CPU memory, we do need the stream-ordered behaviour for the "device" memory operations when using a non-blocking queue for a CPU backend (e.g. TBB).

Right, but would a non-blocking queue for a CPU backend be useful for the production use case? I'd naively expect a non-blocking queue to mostly add overhead, even in cases where intra-algorithm parallelization would otherwise be useful.

That said, for testing purposes I agree it can be useful, and then a caching allocator (or another mechanism to keep the temporary memory alive) would be needed.

@fwyzard
Contributor

fwyzard commented Nov 4, 2021

Right, but would a non-blocking queue for a CPU backend be useful for the production use case? I'd naively expect a non-blocking queue to mostly add overhead, even in cases where intra-algorithm parallelization would otherwise be useful.

No, indeed - from a performance point of view we do not want to use a non-blocking queue in production... especially with our acquire()/produce() mechanism.

But so far the non-blocking TBB backend has been very useful for finding synchronisation bugs :-)


@fwyzard
Contributor

fwyzard commented Nov 4, 2021

Moved to #260 .

@fwyzard fwyzard closed this Nov 4, 2021
