-
Notifications
You must be signed in to change notification settings - Fork 4.6k
LST followups: better work divisions, concrete kernel dimension, some cleanup and fixes #47084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LST followups: better work divisions, concrete kernel dimension, some cleanup and fixes #47084
Conversation
|
cms-bot internal usage |
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47084/43263 |
|
A new Pull Request was created by @ariostas for master. It involves the following packages:
@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
|
Tagging @fwyzard since most (if not all) of the comments addressed were his |
|
test parameters:
|
|
@cmsbuild please test |
|
-1 Failed Tests: UnitTests RelVals-GPU Unit TestsI found 1 errors in the following unit tests: ---> test test-das-selected-lumis had ERRORS RelVals-GPU
Comparison SummarySummary:
|
there are a bunch of errors like the same workflow step3 in the baseline ran OK. So, the crash seems related to this PR. |
|
assign heterogeneous |
|
I can have a look in the coming days, but if this is urgent for any reason go ahead and sign it, and I will still have a look after the fact. |
|
hold
Actually, you know what ? |
|
Pull request has been put on hold by @fwyzard |
| Vec3D const blocksPerGrid_crossCleanpT3{1, 4, 20}; | ||
| WorkDiv3D const crossCleanpT3_workDiv = | ||
| createWorkDiv(blocksPerGrid_crossCleanpT3, threadsPerBlock_crossCleanpT3, elementsPerThread); | ||
| auto const crossCleanpT3_workDiv = cms::alpakatools::make_workdiv<Acc2D>({20, 4}, {64, 16}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the X and Y values in the ranges are inverted with respect to before - is it intended ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there were a few places where I flipped the order so that the loops are nested in the recommended order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
| Vec3D const blocksPerGrid_crossCleanpLS{1, 4, 20}; | ||
| WorkDiv3D const crossCleanpLS_workDiv = | ||
| createWorkDiv(blocksPerGrid_crossCleanpLS, threadsPerBlock_crossCleanpLS, elementsPerThread); | ||
| auto const crossCleanpLS_workDiv = cms::alpakatools::make_workdiv<Acc2D>({20, 4}, {32, 16}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here (OK, so it's probably intended).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
| if (slope == | ||
| kVerticalModuleSlope) // Designated for tilted module when the slope is infinity (module lying along y-axis) | ||
| if (slope == kVerticalModuleSlope || | ||
| edm::isNotFinite(slope)) // Designated for tilted module when the slope is infinity (module lying along y-axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@makortel do you know if edm::isFinite/edm::isNotFinite is guaranteed to in device code ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically they could presently work: the functions are constexpr and presently use union for type punning (that is strictly speaking undefined behavior). I'd like to replace the union with std::bit_cast that is constexpr, but on the other hand e.g. https://stackoverflow.com/a/78232359 kind of suggests to use cuda::std::bit_cast on CUDA 12.8.
I see in GCC 12 the std::bit_cast implementation is just a call to __builtin_bit_cast, and that e.g. in https://github.com/cms-sw/cmssw/blob/master/HeterogeneousCore/AlpakaInterface/interface/atomicMaxF.h we use edm::bit_cast (that just forwards to std::bit_cast or __builtin_bit_cast) only for CPU implementation (I don't remember the exact reason for that though, whether the edm::bit_cast didn't work on device code, or the intrinsics were "easier" on CUDA+HIP).
So perhaps for long term it would be better to define Alpaka-specific functions (ideally Alpaka could provide a portable bit_cast).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It currently does work, but I agree that would be better to be more careful about it (in a separate PR?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a related note, should I also reimplement std::distance to be safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(in a separate PR?).
To me a separate PR would be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For pointers and random access iterators you can just use b - a instead of std::distance(a, b).
Anyway, from looking at the code a while back, I think std::distance(a, b) should be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I don't remember the exact reason for that though, whether the
edm::bit_castdidn't work on device code, or the intrinsics were "easier" on CUDA+HIP).
I'm not sure either.
I've tried writing a simple kernel, and __builtin_bit_cast(float, i) and __int_as_float(i) compile to the same exact PTX code (basically a no-op).
|
unhold |
I see that we have it in Patatrack/CA code for half a year cmssw/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGeneratorKernelsImpl.h Lines 523 to 527 in dd230c0
#45542 (14_1_X) is it reasonable to rely on this case to let it go in this PR as well? |
|
Could |
we used |
|
OK, thanks, then I don't see reason to request further changes here. I'll see if we can fix |
|
+heterogeneous |
|
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @mandrenguyen, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
|
+1 |
This PR addresses some of the LST followups that we have listed in #46746.
Here is the list of fixes/changes:
cms::alpakatools::makeworkdiv(instead of our customcreateWorkDiv) and we now usecms::alpakatools::uniform_elementsfor kernel loops.kVerticalModuleSlope(previously namedlst_INF). We're doing this in two steps instead of one since the data files also need to be updated. We ensure a smooth transition by first supporting both options and later removing the legacy one.c.c. @slava77 @VourMa