
Conversation

@makortel
Collaborator

This PR prototypes the Alpaka EDModule API, taking inspiration from #224 and #256. One major idea tested here was to see how far the system could be implemented with only forward-declared Alpaka device, queue, and event types, in order to minimize the set of source files that need to be compiled with the device compiler (I first crafted this prototype before the ALPAKA_HOST_ONLY macro existed).

The first commit extends the build rules by adding a new category of source files that need to be compiled for each Alpaka backend but can be compiled with the host compiler. This functionality might also be beneficial on a wider scope than this PR alone (so I could open a separate PR with only that part). Here I took the approach of using a new file extension, .acc ("a" for e.g. "accelerated"), for the files that need to be compiled with the device compiler, while the .cc files can be compiled with the host compiler. I'm not advocating for this particular choice as I'm not very fond of it, but I needed something to get on with the prototype.

I don't think we should apply this PR as is, but rather identify the constructs that would be useful, pick those, and improve the rest.

One idea here was to hide the cms::alpakatools::Product<T> from users (having to explicitly interact with the ScopedContext to get the T is annoying). In addition, for the CPU serial backend (synchronous, operates in regular host memory) the Product<T> wrapper is not used (because it is not really needed), so downstream code could use the data products from the Serial backend directly. For developers the setup would look like the following (see the sketch after the list):

  • The data products in the memory space of the current backend are consumed with edm::EDGetTokenT<T> and produced with edm::EDPutTokenT<T> (i.e. they look like normal products)
  • The data products from the "non-portable memory space" are consumed with edm::EDGetTokenT<edm::Host<T>> and produced with edm::EDPutTokenT<edm::Host<T>>
    • The edm::Host<T> is just a "tag", not an actual product type.
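
A minimal sketch of what a developer-facing producer could look like under this scheme. This is not code from this PR; the product types, module name, and the exact constructor/produce() signatures are illustrative assumptions, only the token/tag pattern reflects the description above.

```cpp
// Hypothetical sketch only: InputDigis, InputConditions, OutputHits, makeHits() and the
// exact base-class interface are assumptions, not the actual PR code.
namespace ALPAKA_ACCELERATOR_NAMESPACE {

  class TestHitProducer : public EDProducer {
  public:
    TestHitProducer()
        : conditionsToken_{consumes<edm::Host<InputConditions>>()},  // host-memory ("non-portable") input
          digisToken_{consumes<InputDigis>()},                       // input in this backend's memory space
          hitsToken_{produces<OutputHits>()} {}                      // output in this backend's memory space

    void produce(Event& event, Context& ctx) override {
      auto const& conditions = event.get(conditionsToken_);  // plain T, no Product<T> wrapper visible
      auto const& digis = event.get(digisToken_);            // wrapping/unwrapping handled internally
      OutputHits hits = makeHits(digis, conditions, ctx.queue());  // kernels enqueued on ctx.queue()
      event.put(hitsToken_, std::move(hits));
    }

  private:
    edm::EDGetTokenT<edm::Host<InputConditions>> conditionsToken_;
    edm::EDGetTokenT<InputDigis> digisToken_;
    edm::EDPutTokenT<OutputHits> hitsToken_;
  };

}  // namespace ALPAKA_ACCELERATOR_NAMESPACE
```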

Internally this setup works such that for the CPU Serial backend the edm::Host<...> part is simply ignored, while for the other backends

  • The edm::EDGetTokenT<T> is mapped to edm::EDGetTokenT<edm::Product<T>>
  • The edm::EDGetTokenT<edm::Host<T>> is mapped to edm::EDGetTokenT<T>

For this setup to work, an ALPAKA_ACCELERATOR_NAMESPACE::Event class is defined to be used in the EDModules instead of edm::Event. It wraps the edm::Event and implements the aforementioned mapping logic (on both the getting and putting side) with a set of helper classes that are specialized for the backends. The ALPAKA_ACCELERATOR_NAMESPACE::EDProducer(ExternalWork) class implements the (reverse) mapping logic for the consumes() and produces() side. A rough sketch of the mapping idea follows.
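
Roughly, one can think of the mapping as a backend-specialized trait from the developer-visible type to the type actually stored in the edm::Event. The sketch below is an illustration of that idea with assumed names, not the PR's actual helper classes:

```cpp
// Illustration only: what the developer-visible type maps to in the edm::Event.
namespace edm {
  template <typename T> class Product;   // wrapper holding T plus synchronization metadata
  template <typename T> struct Host {};  // tag type only, never an actual product
}

namespace ALPAKA_ACCELERATOR_NAMESPACE::detail {
  // Asynchronous backends: a plain T is stored wrapped, edm::Host<T> means the bare host product.
  template <typename T>
  struct StoredProduct {
    using type = edm::Product<T>;
  };
  template <typename T>
  struct StoredProduct<edm::Host<T>> {
    using type = T;
  };
  // For the CPU Serial backend both cases would collapse to plain T
  // (e.g. via backend-specific specializations of these helpers).
}
```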

The cms::alpakatools::Product<TQueue, T> is transformed into an edm::Product<T> that can hold arbitrary metadata via type erasure (currently std::any, for demonstration purposes). For Alpaka EDModules an ALPAKA_ACCELERATOR_NAMESPACE::ProductMetadata class is defined for this metadata purpose. This class also took over some of the functionality of ScopedContext that seems to fit better there in this abstraction model (the Kokkos version actually has a similar structure here).
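
A condensed sketch of the type-erased wrapper idea (assumed member names; the real class in the PR carries more synchronization machinery):

```cpp
#include <any>
#include <utility>

namespace edm {
  // Illustration only: the wrapper itself no longer depends on the Alpaka queue type;
  // backend-specific state (e.g. ALPAKA_ACCELERATOR_NAMESPACE::ProductMetadata) hides
  // behind std::any.
  template <typename T>
  class Product {
  public:
    Product(T data, std::any metadata) : data_{std::move(data)}, metadata_{std::move(metadata)} {}

    // A backend-aware consumer casts the metadata back to the concrete type to synchronize.
    template <typename M>
    M const& metadata() const { return std::any_cast<M const&>(metadata_); }

    T const& data() const { return data_; }  // assumes the caller has synchronized appropriately

  private:
    T data_;
    std::any metadata_;
  };
}
```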

The ScopedContext class structure is completely reorganized and is now hidden from the developers. There is now an ALPAKA_ACCELERATOR_NAMESPACE::impl::FwkContextBase base class for the functionality common to ED modules and ES modules (although the latter is not exercised in this prototype, so this is what I believe to be the common functionality). The ALPAKA_ACCELERATOR_NAMESPACE::EDContext class derives from FwkContextBase and adds ED-specific functionality (a skeleton of the hierarchy is sketched below). I guess FwkContextBase and EDContext could also be implemented as templates instead of being placed in ALPAKA_ACCELERATOR_NAMESPACE (they are hidden from developers anyway).
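
A bare skeleton of how the internal hierarchy could be pictured; member names and responsibilities here are guesses for illustration, not the PR's code:

```cpp
#include <memory>

namespace ALPAKA_ACCELERATOR_NAMESPACE {
  class Queue;  // forward declaration, in line with the forward-declaration approach of this PR

  namespace impl {
    // Common part for ED and ES modules: owns (or re-uses) the Queue for this module call.
    class FwkContextBase {
    public:
      Queue& queue();  // lazily created, or re-used from an input product when possible

    protected:
      bool queueGivenToUser_ = false;  // set once the developer-facing Context exposes the queue
      std::shared_ptr<Queue> queue_;
    };
  }

  // ED-specific part: knows about the edm::Event, the token mapping, and attaches the
  // ProductMetadata to the products being put.
  class EDContext : public impl::FwkContextBase {
    // ...
  };
}
```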

A third context class, ALPAKA_ACCELERATOR_NAMESPACE::Context, is defined to be given to the developers (via the EDModule::produce() argument). It gives access to the Queue object. Internally it also signals to the FwkContextBase when the developer has asked for the Queue, so that if the EDModule accesses its input products for the first time after that point, it won't try to re-use the Queue from the input product (because the initially assigned Queue is already being used). This Context class can later be extended, e.g. along the lines of #256.
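
The signaling could be as simple as the following; the names (in particular markQueueAsUsed()) are hypothetical, only the idea is from the description above:

```cpp
namespace ALPAKA_ACCELERATOR_NAMESPACE {
  // Illustration only: the developer-facing Context forwards the queue request to the
  // internal context, which from that point on stops trying to re-use a Queue from
  // input products (work may already be enqueued on the assigned Queue).
  class Context {
  public:
    explicit Context(EDContext& internal) : internal_{internal} {}

    Queue& queue() {
      internal_.markQueueAsUsed();  // hypothetical call disabling queue re-use from inputs
      return internal_.queue();
    }

  private:
    EDContext& internal_;
  };
}
```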

One additional piece that would reduce the number of places where edm::Host<T> appears in user code, but is not prototyped here, would be automating the (mainly device-to-host) transfers. As long as the type T can be arbitrary, the framework needs to be told how to transfer that type between two memory spaces (e.g. something along the lines of a plugin factory for functions), but at least these transfers would not have to be expressed in the configuration anymore.
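
For a flavor of what such a "plugin factory for functions" could look like, consider the following made-up registry; all names and signatures are assumptions, nothing like this is implemented in the PR:

```cpp
#include <any>
#include <functional>
#include <typeindex>
#include <unordered_map>
#include <utility>

namespace ALPAKA_ACCELERATOR_NAMESPACE {
  class Queue;  // forward declaration, as elsewhere in the PR

  // Hypothetical registry: for each device-side product type, a function that copies it
  // into the corresponding host-side product on the given queue. The framework would call
  // this when a module consumes edm::Host<T> of a product that exists only in device memory.
  class ToHostCopierRegistry {
  public:
    using Copier = std::function<std::any(std::any const& deviceProduct, Queue& queue)>;

    void add(std::type_index deviceType, Copier copier) {
      copiers_.emplace(deviceType, std::move(copier));
    }

    std::any copyToHost(std::type_index deviceType, std::any const& deviceProduct, Queue& queue) const {
      return copiers_.at(deviceType)(deviceProduct, queue);
    }

  private:
    std::unordered_map<std::type_index, Copier> copiers_;
  };
}
```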

@makortel
Collaborator Author

@fwyzard This is the prototype I mentioned earlier (and apparently I failed to open it in draft mode...).

@makortel makortel force-pushed the alpakatestFramework_v3 branch from 6b4fa18 to 2275971 on March 2, 2022 21:40
@makortel
Collaborator Author

makortel commented Mar 2, 2022

Rebased on top of master to fix conflicts in src/alpakatest/Makefile.

@makortel makortel added the alpaka label Mar 4, 2022
@fwyzard
Contributor

fwyzard commented Sep 4, 2022

Could this be extended to better handle multiple backends with the same memory space?

Currently we define a backend with

  • the memory space of the "accelerator" (host vs cuda vs rocm)
  • how the accelerator runs a kernel (e.g. CPU serial vs TBB)
  • how the host enqueues the work (blocking/sync vs non-blocking/async)

In principle we should have different execution options for the same memory space: cpu sync vs tbb sync, cuda sync vs cuda async, etc.

Do you think the approach researched here could be used to have a single data product (both in terms of the data format type, and of the underlying memory buffer/SoA) shared among different execution cases?

One concrete example would be having the CPU serial implementation for every module, and the TBB (serial) only for some modules where the extra parallelism makes sense.

@makortel
Collaborator Author

makortel commented Sep 8, 2022

> Could this be extended to better handle multiple backends with the same memory space?
> ...
> Do you think the approach researched here could be used to have a single data product (both in terms of the data format type, and of the underlying memory buffer/SoA) shared among different execution cases?

I think this approach would allow such an extension. There would certainly be many details to work out (like how to make the framework sufficiently aware of memory and execution spaces, including supporting multiple devices of the same type, but in a generic way). But I'd expect the user-facing interfaces to stay mostly the same.

I also have CUDA managed memory / SYCL shared memory in mind (for platforms that have truly unified memory), in which case it would be nice if the downstream, alpaka-independent consumers could directly use the data product wrapped in edm::Product (as it is called here) after a proper synchronization. With the edm::Product<T> class template being part of the framework, we could peek in there (like with edm::View).

Of course, for any of this "using data products of one memory space in many backends" to work at all, the data product the EDProducer appears to produce would have to be exactly the same type in all the backends among which the "sharing" is done (but IIUC you also wrote that).

For the Serial/TBB backends, using the same product types should, in principle, be trivial (and therefore the setup should be straightforward if the TBB backend uses a synchronous queue).

@fwyzard
Contributor

fwyzard commented Sep 8, 2022

OK, so we are thinking about:

  • unified memory / shared memory: different "accelerators" (cpu vs gpu), in the same memory space (unified addressing space accessible from all devices), with different queue types (sync vs async);
  • serial execution vs internal parallelism with TBB: different "accelerators" (cpu serial vs cpu parallel), in the same memory space (host memory), with the same queue type (sync).

At least for debugging, it might be useful to also support:

  • sync GPU queues, async CPU queues: a given "accelerator", with its given memory space, but with both queue types (sync and async).

I'm starting to see why alpaka keeps the three concepts almost orthogonal...

@makortel
Collaborator Author

Made effectively obsolete by cms-sw/cmssw#39428

@makortel makortel closed this Sep 16, 2022