[cudadev][RFC] Prototype an improved model for EventSetup use #257
Add runAcquire(), runProduce() functions
… in better-defined way
…uda::Context objects instead of cudaStream_t
@makortel thanks for sharing these proposals.

I agree with that. At this point it wouldn't make much sense to evolve the CUDA side in a way that would not work well for Alpaka. And for Alpaka I would go first with the "current device framework" rather than trying to change two complicated things at a time. So at this point these are mostly food for thought. I'm only worried a bit about the possibly-throwing destructors of current
    template <typename U>
    void setHostData(U&& data) {
      // TODO: wrapping to shared_ptr to have a copyable type for std::any...
      hostData_ = std::make_shared<U>(std::forward<U>(data));
Just to be explicit, if we go with this route, I'd create a custom class for holding move-only types in a type-agnostic way to avoid the extra layer of shared_ptr.
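To illustrate the idea, here is a minimal sketch (hypothetical, not the actual CMSSW code) of such a custom class: it stores a move-only object behind a single type-erased indirection, avoiding the extra `shared_ptr` layer that `std::any` forces for non-copyable types.

```cpp
#include <memory>
#include <type_traits>
#include <utility>

// Sketch of a type-agnostic holder for move-only objects. Unlike std::any,
// the stored type does not need to be copyable; ownership is unique and
// move-only, with one heap allocation and no shared_ptr control block.
class MoveOnlyAny {
public:
  MoveOnlyAny() = default;

  // Templated constructor, disabled for MoveOnlyAny itself so it does not
  // hijack the move constructor.
  template <typename U,
            typename = std::enable_if_t<!std::is_same_v<std::decay_t<U>, MoveOnlyAny>>>
  explicit MoveOnlyAny(U&& data)
      : holder_(std::make_unique<Holder<std::decay_t<U>>>(std::forward<U>(data))) {}

  MoveOnlyAny(MoveOnlyAny&&) = default;
  MoveOnlyAny& operator=(MoveOnlyAny&&) = default;

  bool has_value() const { return static_cast<bool>(holder_); }
  void reset() { holder_.reset(); }

  // Typed access; throws std::bad_cast if T does not match the stored type.
  template <typename T>
  T& get() {
    return dynamic_cast<Holder<T>&>(*holder_).value;
  }

private:
  struct HolderBase {
    virtual ~HolderBase() = default;
  };
  template <typename T>
  struct Holder final : HolderBase {
    template <typename V>
    explicit Holder(V&& v) : value(std::forward<V>(v)) {}
    T value;
  };
  std::unique_ptr<HolderBase> holder_;
};
```

For the use case above, the pinned host buffer (a move-only `unique_ptr` with a custom deleter) could then be stored directly, e.g. `hostData_ = MoveOnlyAny(std::move(buffer));`, with the destructor releasing it once all transfers have completed.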
    mutable std::any hostData_;  // to be kept alive until all asynchronous activity has finished, guarded by AND of complete_
Needs to be reformatted.
    eventSetup.put(cms::cuda::runForHost([&](cms::cuda::HostAllocatorContext& ctx) {
      auto gainForHLTonHost = cms::cuda::make_host_unique_uninitialized<SiPixelGainForHLTonGPU>(ctx);
      *gainForHLTonHost = gain;
      return gainForHLTonHost;
    }).forEachDevice([&](auto const& gainForHLTonHost, cms::cuda::ESContext& ctx) {
      auto gainForHLTonGPU = cms::cuda::make_device_unique_uninitialized<SiPixelGainForHLTonGPU>(ctx);
      auto gainDataOnGPU = cms::cuda::make_device_unique<char[]>(gainData.size(), ctx);
      cudaCheck(cudaMemcpyAsync(gainDataOnGPU.get(), gainData.data(), gainData.size(), cudaMemcpyDefault, ctx.stream()));
      cudaCheck(cudaMemcpyAsync(
          gainForHLTonGPU.get(), gainForHLTonHost.get(), sizeof(SiPixelGainForHLTonGPU), cudaMemcpyDefault, ctx.stream()));
      auto ptr = gainDataOnGPU.get();
      cudaCheck(cudaMemcpyAsync(&(gainForHLTonGPU->v_pedestals_),
                                &ptr,
                                sizeof(SiPixelGainForHLTonGPU_DecodingStructure*),
                                cudaMemcpyDefault,
                                ctx.stream()));
      return SiPixelGainCalibrationForHLTGPU(std::move(gainForHLTonGPU), std::move(gainDataOnGPU));
    }));
Formatting (on clang 11) is not the best. Using | in between runForHost() and forEachDevice() would look a bit better, but would have other implications (including clarity).
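For illustration, a toy sketch (all names hypothetical, entirely outside the real framework) of what such pipe-style composition could look like: `runForHost(f) | forEachDevice(g)` instead of the member-function chain `runForHost(f).forEachDevice(g)`, which clang-format tends to break more readably at the `|`.

```cpp
#include <utility>

// Each call wraps its functor in a small stage object; an overloaded
// operator| composes the two stages.
template <typename F>
struct HostStage {
  F f;
};

template <typename G>
struct DeviceStage {
  G g;
};

template <typename F>
HostStage<F> runForHost(F f) {
  return {std::move(f)};
}

template <typename G>
DeviceStage<G> forEachDevice(G g) {
  return {std::move(g)};
}

// Composing the stages runs the host functor once and feeds its result to
// the device functor (here just once; the real framework would iterate over
// all devices and pass a per-device context as well).
template <typename F, typename G>
auto operator|(HostStage<F> h, DeviceStage<G> d) {
  return d.g(h.f());
}
```

The "other implications" mentioned above would include that `|` hides the ordering contract (host stage strictly before device stages) less explicitly than a named member function does.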
Made effectively obsolete by cms-sw/cmssw#39428
This PR builds on top of #256 (that builds on top of #224). The actual developments of this PR are in the last 4 commits.
This PR prototypes a major change for the EventSetup data model. Previously (currently) the ESProducer produces a wrapper product that holds the necessary (POD) data in host memory, and the transfers to devices are done in EDModules when they ask for the actual product in device memory, queued into the CUDA stream of the EDModule. This approach has (at least) the following downsides:
- Use of the `CachingHostAllocator` is complicated, as the latter expects each allocation to be tied to a CUDA stream, and there is no CUDA stream available at `ESProducer::produce()` time. Even if there were (or the allocation API were improved in some way), the necessary lifetime should be controlled by the produced data product.
- `cms::cuda::ESProduct<T>` (that is used as member data of the aforementioned "wrapper product") has complicated synchronization logic to handle concurrent requests on the actual device data.

In the proposed model:
- A new `cms::cudaNew::ESProduct<T>` (to be renamed `cms::cuda::ESProduct<T>` in CMSSW after everything has been migrated to it) acts similarly to `cms::cuda::Product<T>` for Event data
  - It holds a `T` for each device, along with a CUDA event telling if the asynchronous operations to produce `T` have completed
  - It holds a type-erased object (currently `std::any`, to be improved) to keep alive the temporary (pinned) host data used as a source of the transfers, at least until the transfers to all devices have completed.
- When an EDModule asks for the product, `cudaStreamWaitEvent()` is called to insert a wait on the EDModule's CUDA stream on the ESProduct's CUDA event
- The ESProducer uses `runForHost(<host-functor>).forEachDevice(<device-functor>)`
  - The `<host-functor>` is run once, and it must return an arbitrary object (e.g. a `tuple` for multiple objects) holding all temporary host data. This object will be given to the `<device-functor>` as the first argument. The `HostAllocatorContext` gives access to cached pinned host memory allocations, but no other CUDA API calls are allowed (really anything that assumes a current device)
  - The `<device-functor>` is called once for each device. The current device is set before calling the function. All CUDA operations are allowed.
  - If no temporary host data is needed, `runForEachDevice(<device-functor>)` can be used instead.
- The model currently depends on `unique_ptr`s. This dependence could be reduced (removed?) by using a smart pointer that erases the deleter type (similar to `std::shared_ptr`), which, I think, we should seriously consider (in light of Alpaka).
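The deleter-erasing smart pointer mentioned in the last point could, as a hypothetical sketch (names illustrative, not part of the actual proposal), look like the following: ownership stays unique and move-only as with `std::unique_ptr`, but the deleter type is erased as with `std::shared_ptr`, so host- and device-allocated buffers with different cached-allocator deleters can share one vocabulary type.

```cpp
#include <functional>
#include <memory>
#include <utility>

// Sketch of a unique-ownership smart pointer whose deleter type is erased
// behind std::function, so different allocation backends produce the same
// pointer type.
template <typename T>
class erased_unique_ptr {
public:
  erased_unique_ptr() = default;

  // Take ownership of a raw pointer with an arbitrary deleter.
  template <typename D>
  erased_unique_ptr(T* ptr, D deleter) : ptr_(ptr, std::move(deleter)) {}

  // Adopt any unique_ptr, erasing its deleter type.
  template <typename D>
  erased_unique_ptr(std::unique_ptr<T, D> p)
      : ptr_(p.release(), [d = std::move(p.get_deleter())](T* t) mutable { d(t); }) {}

  T* get() const { return ptr_.get(); }
  T& operator*() const { return *ptr_; }
  T* operator->() const { return ptr_.get(); }
  explicit operator bool() const { return static_cast<bool>(ptr_); }

private:
  std::unique_ptr<T, std::function<void(T*)>> ptr_;
};
```

One caveat of this approach is the heap allocation that `std::function` may need for large deleters; a small-buffer-optimized function type or a fixed deleter signature could avoid that.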