-
Notifications
You must be signed in to change notification settings - Fork 4.6k
Improve the memory usage in the alpaka pixel reconstruction #44458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
cmsbuild
merged 5 commits into
cms-sw:master
from
fwyzard:improve_alpaka_Pixel_memory_usage_141x
Mar 21, 2024
Merged
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
418f6e2
Fix the TrackingRecHitsDevice constructor
fwyzard 4bf2fb9
Remove unused variable from SiPixelRawToCluster::acquire()
fwyzard 5a2feb1
Improve memory usage in SiPixelRawToCluster::acquire()
fwyzard a7047fd
Use cached memory buffers in WordFedAppender
fwyzard 70494ae
Update OneToManyAssocRandomAccess to use the preferred warp size
fwyzard File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid even this is not safe (unless the underlying GPU runtime+driver synchronize on the
memcpy()). TheTrackingRecHitDeviceobject (constructed incmssw/RecoLocalTracker/SiPixelRecHits/plugins/alpaka/PixelRecHitKernels.dev.cc
Line 67 in 28f76ff
)
is moved around (the return from
makeHitsAsync()may get elided, but in the end framework moves the object into theedm::Wrapper<T>), so the address ofoffsetBPIX2_is not stable.But is
offsetBPIX2really needed on the device? The only place I found wascmssw/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGeneratorKernels.dev.cc
Lines 285 to 298 in 28f76ff
and everywhere else the
offsetBPIX2is used on the host. In this case it would be easy to pass it by value from host (as the calling function has it already and uses it for other purposes). Or are there future developments that would benefit from havingoffsetBPIX2on the device as part ofTrackingRecHitsDevice?If
offsetBPIX2is really needed on the device, I think it should be copied through a host buffer.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mhm.
The CUDA version does not have a synchronisation, either:
cmssw/CUDADataFormats/TrackingRecHit/interface/TrackingRecHitSoADevice.h
Lines 27 to 46 in a7047fd
Which is not to mean that I disagree with your assessment, just that the async copy seems to work anyway ?
(side note: for copying a small number of values, I'm wondering if using a kernel is not faster than calling a
cudaMemcpyvariant ?)As to whether it's needed or not, I agree it needs to be looked into.
For the time being, I'm more interested in recovering the performance of the CUDA implementation, but after that we can also take the opportunity to clean up the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this PR is not worse than what is done in CUDA, so I'm not against merging it now (with the caveat that it is at least theoretically unsafe and should be eventually addressed).
It could be the copy has been so quick that it has completed before the
memcpy()has finished (really along "nothing else has overwritten the memory before CUDA runtime+driver read from the memory location"). I'd expect the problem manifest when "the GPU is at least nearly full", and even then it is unclear whether it would crash the job or lead to incorrect value to be read (and whatever that would imply downstream).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the CUDA documentation is not very clear what "async memcpy" means:
So it's possible that the CUDA runtime may first perform a synchronous copy of
offsetBPIX2_to a staging area in pinned memory, and then an asynchronous copy from there to GPU memory.Or not 🤷🏻♂️ .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the exact behavior is confusing, and I'd expect to depend on the runtime and driver versions and/or actual hardware. I'd personally assume the weakest guarantees and program around that.
(and with Alpaka we should worry about the behavior of other platforms too)