
Conversation

@makortel
Collaborator

This PR builds on top of #256 (that builds on top of #224). The actual developments of this PR are in the last 4 commits.

This PR prototypes a major change for the EventSetup data model. Previously (currently) the ESProducer produces a wrapper product that holds the necessary (POD) data on the host memory, and the transfers to devices are done on EDModules when they ask for the actual product on the device memory, queued into the CUDA stream of the EDModule. This approach has (at least) the following downsides

  • Requires CUDA runtime API calls in the EventSetup data formats, spreading the CUDA dependence wider in CMSSW
  • Effectively prevents kernel calls to produce some of the data (because we haven't succeeded in having device code in library shared objects)
  • Makes use of CachingHostAllocator complicated as the latter expects each allocation be tied to a CUDA stream, and there is no CUDA stream available at the ESProducer::produce() time. Even if there were (or the allocation API would be improved in some way), the necessary lifetime should be controlled by the produced data product.
  • The transferred-from host memory is kept alive for the entire duration of the IOV, while it could be sufficient to release them after the data have been transferred to all devices
  • cms::cuda::ESProduct<T> (that is used as a member data of the aforementioned "wrapper procuct") has complicated synchronization logic to handle concurrent requests on the actual device data

In the proposed model

  • The transfers (and possible kernel calls) are launched from the ESProducer.
  • The actual product is wrapped into cms::cudaNew::ESProduct<T> (to be renamed cms::cuda::ESProduct<T> in CMSSW after everything has been migrated to it) that acts similarly to cms::cuda::Product<T> for Event data
    • Holds a copy of T for each device, along with CUDA event telling if the asynchronous operations to produce T have completed
    • Holds a type-erased object (currently std::any, to be improved) to hold the temporary (pinned) host data used as a source of the transfers at least until the transfers to all devices have completed.
    • In case the transfer (and/or kernels) have not finished yet when an EDModule asks data, a cudaStreamWaitEvent() is called to insert a wait on the EDModule's CUDA stream on the ESProduct's CUDA event
  • All CUDA-related operations need to be done within functors passed to runForHost(<host-functor>).forEachDevice(<device-functor>)
    • The <host-functor> is run once, and it must return an arbitrary object (e.g. tuple for multiple objects) holding all temporary host data. This object will be given to the <device-functor> as the first argument. The HostAllocatorContext gives access to cached pinned host memory allocations, but no other CUDA API calls are allowed (really anything that assumes a current device)
    • The <device-functor> is called once for each device. The current device is set before calling the function. All CUDA operations are allowed.
    • In case no temporary host data are needed, a runForEachDevice(<device-functor>) can be used instead.
  • Still requires CUDA dependence on the ES data formats because of the unique_ptrs. This dependence could be reduced (removed?) by using a smart pointer that erases the deleter type (similar to std::shared_ptr), which, I think, we should seriously consider (in light of Alpaka).

@fwyzard
Copy link
Contributor

fwyzard commented Oct 27, 2021

@makortel thanks for sharing these proposals.
From my side, I would prefer to complete the Alpaka migration to the current version of the "framework", and then evaluate these proposals taking into account also how they will interplay with the Alpaka related changes.

@fwyzard fwyzard added the enhancement New feature or request label Oct 27, 2021
@makortel
Copy link
Collaborator Author

From my side, I would prefer to complete the Alpaka migration to the current version of the "framework", and then evaluate these proposals taking into account also how they will interplay with the Alpaka related changes.

I agree with that. At this point it wouldn't make much sense to evolve the CUDA side in away that would not work well for Alpaka. And for Alpaka I would go first with the "current device framework" rather than trying to change two complicated things at ta time. So at this point these are mostly food for thought.

I'm only worried a bit about the possibly-throwing destructors of current ScopedContext classes to hinder debugging of errors of GPU workflows, but maybe those haven't been too big of an issue so far?

template <typename U>
void setHostData(U&& data) {
// TODO: wrapping to shared_ptr to have a copyable type for std::any...
hostData_ = std::make_shared<U>(std::forward<U>(data));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be explicit, if we go with this route, I'd create a custom class for holding move-only types in a type-agnostic way to avoid the extra layer of shared_ptr.

Comment on lines +82 to +83
mutable std::any
hostData_; // to be kept alive until all asynchronous activity has finished, guarded by AND of complete_
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to be refornatted.

Comment on lines +33 to +50
eventSetup.put(cms::cuda::runForHost([&](cms::cuda::HostAllocatorContext& ctx) {
auto gainForHLTonHost = cms::cuda::make_host_unique_uninitialized<SiPixelGainForHLTonGPU>(ctx);
*gainForHLTonHost = gain;
return gainForHLTonHost;
}).forEachDevice([&](auto const& gainForHLTonHost, cms::cuda::ESContext& ctx) {
auto gainForHLTonGPU = cms::cuda::make_device_unique_uninitialized<SiPixelGainForHLTonGPU>(ctx);
auto gainDataOnGPU = cms::cuda::make_device_unique<char[]>(gainData.size(), ctx);
cudaCheck(cudaMemcpyAsync(gainDataOnGPU.get(), gainData.data(), gainData.size(), cudaMemcpyDefault, ctx.stream()));
cudaCheck(cudaMemcpyAsync(
gainForHLTonGPU.get(), gainForHLTonHost.get(), sizeof(SiPixelGainForHLTonGPU), cudaMemcpyDefault, ctx.stream()));
auto ptr = gainDataOnGPU.get();
cudaCheck(cudaMemcpyAsync(&(gainForHLTonGPU->v_pedestals_),
&ptr,
sizeof(SiPixelGainForHLTonGPU_DecodingStructure*),
cudaMemcpyDefault,
ctx.stream()));
return SiPixelGainCalibrationForHLTGPU(std::move(gainForHLTonGPU), std::move(gainDataOnGPU));
}));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting (on clang 11) is not the best. Using | in between runForHost() and forEachDevice() would look a bit better, but would have other implications (including clarity).

@makortel
Copy link
Collaborator Author

Made effectively obsolete by cms-sw/cmssw#39428

@makortel makortel closed this Sep 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants