
Conversation

@makortel
Collaborator

This PR builds on top of #256 (that builds on top of #224). The actual developments of this PR are in the last 4 commits.

This PR prototypes a major change for the EventSetup data model. Previously (currently) the ESProducer produces a wrapper product that holds the necessary (POD) data on the host memory, and the transfers to devices are done on EDModules when they ask for the actual product on the device memory, queued into the CUDA stream of the EDModule. This approach has (at least) the following downsides

  • Requires CUDA runtime API calls in the EventSetup data formats, spreading the CUDA dependence wider in CMSSW
  • Effectively prevents kernel calls to produce some of the data (because we haven't succeeded in having device code in library shared objects)
  • Makes use of CachingHostAllocator complicated as the latter expects each allocation be tied to a CUDA stream, and there is no CUDA stream available at the ESProducer::produce() time. Even if there were (or the allocation API would be improved in some way), the necessary lifetime should be controlled by the produced data product.
  • The transferred-from host memory is kept alive for the entire duration of the IOV, while it could be sufficient to release them after the data have been transferred to all devices
  • cms::cuda::ESProduct<T> (that is used as a member data of the aforementioned "wrapper procuct") has complicated synchronization logic to handle concurrent requests on the actual device data

In the proposed model

  • The transfers (and possible kernel calls) are launched from the ESProducer.
  • The actual product is wrapped into cms::cudaNew::ESProduct<T> (to be renamed cms::cuda::ESProduct<T> in CMSSW after everything has been migrated to it) that acts similarly to cms::cuda::Product<T> for Event data
    • Holds a copy of T for each device, along with CUDA event telling if the asynchronous operations to produce T have completed
    • Holds a type-erased object (currently std::any, to be improved) to hold the temporary (pinned) host data used as a source of the transfers at least until the transfers to all devices have completed.
    • In case the transfer (and/or kernels) have not finished yet when an EDModule asks data, a cudaStreamWaitEvent() is called to insert a wait on the EDModule's CUDA stream on the ESProduct's CUDA event
  • All CUDA-related operations need to be done within functors passed to runForHost(<host-functor>).forEachDevice(<device-functor>)
    • The <host-functor> is run once, and it must return an arbitrary object (e.g. tuple for multiple objects) holding all temporary host data. This object will be given to the <device-functor> as the first argument. The HostAllocatorContext gives access to cached pinned host memory allocations, but no other CUDA API calls are allowed (really anything that assumes a current device)
    • The <device-functor> is called once for each device. The current device is set before calling the function. All CUDA operations are allowed.
    • In case no temporary host data are needed, a runForEachDevice(<device-functor>) can be used instead.
  • Still requires CUDA dependence on the ES data formats because of the unique_ptrs. This dependence could be reduced (removed?) by using a smart pointer that erases the deleter type (similar to std::shared_ptr), which, I think, we should seriously consider (in light of Alpaka).

@fwyzard
Copy link
Contributor

fwyzard commented Oct 27, 2021

@makortel thanks for sharing these proposals.
From my side, I would prefer to complete the Alpaka migration to the current version of the "framework", and then evaluate these proposals taking into account also how they will interplay with the Alpaka related changes.

@fwyzard fwyzard added the enhancement New feature or request label Oct 27, 2021
@makortel
Copy link
Collaborator Author

From my side, I would prefer to complete the Alpaka migration to the current version of the "framework", and then evaluate these proposals taking into account also how they will interplay with the Alpaka related changes.

I agree with that. At this point it wouldn't make much sense to evolve the CUDA side in away that would not work well for Alpaka. And for Alpaka I would go first with the "current device framework" rather than trying to change two complicated things at ta time. So at this point these are mostly food for thought.

I'm only worried a bit about the possibly-throwing destructors of current ScopedContext classes to hinder debugging of errors of GPU workflows, but maybe those haven't been too big of an issue so far?

template <typename U>
void setHostData(U&& data) {
// TODO: wrapping to shared_ptr to have a copyable type for std::any...
hostData_ = std::make_shared<U>(std::forward<U>(data));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be explicit, if we go with this route, I'd create a custom class for holding move-only types in a type-agnostic way to avoid the extra layer of shared_ptr.

Comment on lines +82 to +83
mutable std::any
hostData_; // to be kept alive until all asynchronous activity has finished, guarded by AND of complete_
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to be refornatted.

Comment on lines +33 to +50
eventSetup.put(cms::cuda::runForHost([&](cms::cuda::HostAllocatorContext& ctx) {
auto gainForHLTonHost = cms::cuda::make_host_unique_uninitialized<SiPixelGainForHLTonGPU>(ctx);
*gainForHLTonHost = gain;
return gainForHLTonHost;
}).forEachDevice([&](auto const& gainForHLTonHost, cms::cuda::ESContext& ctx) {
auto gainForHLTonGPU = cms::cuda::make_device_unique_uninitialized<SiPixelGainForHLTonGPU>(ctx);
auto gainDataOnGPU = cms::cuda::make_device_unique<char[]>(gainData.size(), ctx);
cudaCheck(cudaMemcpyAsync(gainDataOnGPU.get(), gainData.data(), gainData.size(), cudaMemcpyDefault, ctx.stream()));
cudaCheck(cudaMemcpyAsync(
gainForHLTonGPU.get(), gainForHLTonHost.get(), sizeof(SiPixelGainForHLTonGPU), cudaMemcpyDefault, ctx.stream()));
auto ptr = gainDataOnGPU.get();
cudaCheck(cudaMemcpyAsync(&(gainForHLTonGPU->v_pedestals_),
&ptr,
sizeof(SiPixelGainForHLTonGPU_DecodingStructure*),
cudaMemcpyDefault,
ctx.stream()));
return SiPixelGainCalibrationForHLTGPU(std::move(gainForHLTonGPU), std::move(gainDataOnGPU));
}));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting (on clang 11) is not the best. Using | in between runForHost() and forEachDevice() would look a bit better, but would have other implications (including clarity).

@makortel
Copy link
Collaborator Author

Made effectively obsolete by cms-sw/cmssw#39428

@makortel makortel closed this Sep 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants