-
Notifications
You must be signed in to change notification settings - Fork 4.6k
Fixes for Digi Morphing: Limiting Histogram Size and Decoupling for TrackerTraits
#49021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes for Digi Morphing: Limiting Histogram Size and Decoupling for TrackerTraits
#49021
Conversation
|
cms-bot internal usage |
|
enable gpu |
|
solves #48885 |
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49021/46221
|
|
A new Pull Request was created by @AdrianoDee for master. It involves the following packages:
@Dr15Jones, @bsunanda, @civanch, @cmsbuild, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
|
test parameters:
|
|
please test |
|
type bug-fix |
|
assign heterogeneous |
| #include "HeterogeneousCore/AlpakaInterface/interface/warpsize.h" | ||
|
|
||
| //#define GPU_DEBUG | ||
| // #define GPU_DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you undo the extra whitespace change ?
| ALPAKA_ASSERT_ACC((alpaka::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[0u] <= maxElements)); | ||
|
|
||
| constexpr unsigned int maxIter = maxIterGPU * maxElements; | ||
| const unsigned int maxIter = TrackerTraits::maxIterClustering * maxElements; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The declaration of the arrays nn[maxIter][maxNeighbours] and nnn[maxIter] should not be allowed if maxIter is not constexpr or anyway known at compile time ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the host code we have tolerated variable-length arrays as a non-standard extension (for reasons that can be debated elsewhere). I don't know to what extent VLAs work in nvcc or hipcc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CUDA does not seem to like it, compiling this
__global__
void kernel(bool more) {
const int size = more ? 42 : 21;
float data[size];
if (threadIdx.x < size) {
data[size] = 0;
}
}
int main(void) {
kernel<<<1,1>>>(true);
return 0;
}fails with
$ /usr/local/cuda-12.9/bin/nvcc -c test.cu -o test.o -arch sm_75
test.cu(4): error: expression must have a constant value
float data[size];
^
test.cu(4): note #2689-D: the value of variable "size" (declared at line 3) cannot be used as a constant
float data[size];
^
1 error detected in the compilation of "test.cu".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the compilation of the tests seems to be progressing fine ?
And CUDA does support alloca() 🤔, in fact this compiles:
__global__
void kernel(bool more) {
const int size = more ? 42 : 21;
//float data[size];
float* data = static_cast<float *>(alloca(size * sizeof(float)));
if (threadIdx.x < size) {
data[size] = 0;
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative I see is just sizing it with the maximum possible (so TrackerTraits::maxElementsPerBlockMorph), basically wasting some of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And to be honest, I wasn't expecting this to compile either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking a bit better into it I think I understand why it works:
- on CPU the value of
maxIterdepends on whether morphing is enabled or not, but it works because as Matti pointed out we allow variable sized arrays; - on GPU the value of
maxIteris actually independent whether morphing is enabled or not, so the compiler can determine it at compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One smart compiler!
|
|
||
| static constexpr uint32_t maxPixInModule = 6000; | ||
| static constexpr uint32_t maxPixInModuleForMorphing = maxPixInModule; | ||
| static constexpr uint32_t maxIterClustering = 16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we derive this from maxPixInModule or maxPixInModuleForMorphing ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could. But it depends on how we want to handle the number of blocks and threads for FindClus. As is, we fix maxPixInModule, maxIterClustering, blocks, and extrapolate maxElementsPerBlock so that maxElementsPerBlock = maxPixInModule/(maxIterClustering * blocks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I follow the code correctly, now we have
maxPixInModule,maxPixInModuleForMorphingandmaxIterClusteringfixed heremaxElementsPerBlock = maxPixInModule / maxIterClustering, round up to the next multiple of 64maxElementsPerBlockMorph = maxPixInModuleForMorphing / maxIterClustering, round up to the next multiple of 64maxElements- on CPU it is either
maxElementsPerBlockormaxElementsPerBlockMorph - on GPU it is always
1.
- on CPU it is either
maxIter = maxIterClustering × maxElements- on CPU it is
maxPixInModuleormaxPixInModuleForMorphing
rounded up to the next multiple of(maxIterClustering × 64) - on GPU it is
maxIterClustering
- on CPU it is
Which results in
| Phase2 | Phase1 | HIonPhase1 | |
|---|---|---|---|
maxPixInModule |
6000 | 6000 | 10000 |
maxPixInModuleForMorphing |
6000 | 8400 | 11000 |
maxIterClustering |
16 | 24 | 32 |
maxElementsPerBlock |
384 | 256 | 320 |
maxElementsPerBlockMorph |
384 | 384 | 384 |
maxElements (CPU, enableDigiMorphing = false) |
384 | 256 | 320 |
maxElements (CPU, enableDigiMorphing = true) |
384 | 384 | 384 |
maxElements (GPU) |
1 | 1 | 1 |
maxIter (CPU, enableDigiMorphing = false) |
6144 | 4096 | 5120 |
maxIter (CPU, enableDigiMorphing = true) |
6144 | 9216 | 12288 |
maxIter (GPU) |
16 | 24 | 32 |
Then maxIter is used to allocate the arrays of nearest neighbours.
My suggestion would be to
- fix
maxPixInModuleandmaxPixInModuleForMorphinglike in this PR - determine
maxElementsPerBlockbased on what works and gives good performance on the T4 and/or L4 GPUs, and keep it fixed (hopefully using the same value with and without morphing) - derive
maxIterClusteringfrommaxPixInModuleForMorphingormaxPixInModule, depending if morphing is enabled or not.
What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was implementing this (I agree it's a better set of fixed variables), but: does this imply that maxIter is not fixed at compile time on GPU and the issue above would manifest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes... 🤦🏻♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we stay with the current schema for the moment? Just to get it in for the next 15_1_X release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I don't have a better suggestions, so let's keep it as it is for the moment 🤷🏻♂️
| static constexpr uint16_t last_barrel_detIndex = 864; | ||
|
|
||
| static constexpr uint32_t maxPixInModule = 6000; | ||
| static constexpr uint32_t maxPixInModuleForMorphing = maxPixInModule; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the name maxPixInModuleForMorphing, it would make sense for this constant to indicate how many pixels at most one can expect to be recovered by the morphing step, rather than the total of original plus recovered pixels ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, makes sense.
|
+1 Size: This PR adds an extra 48KB to repository Comparison SummaryThere are some workflows for which there are errors in the baseline: Summary:
AMD_MI300X Comparison SummarySummary:
NVIDIA_H100 Comparison SummarySummary:
NVIDIA_L40S Comparison SummaryThere are some workflows for which there are errors in the baseline: Summary:
NVIDIA_T4 Comparison SummaryThere are some workflows for which there are errors in the baseline: Summary:
|
|
Pull request #49021 was updated. @Dr15Jones, @bsunanda, @civanch, @cmsbuild, @fwyzard, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth can you please check and sign again. |
|
please test |
|
+1 Size: This PR adds an extra 40KB to repository Comparison SummarySummary:
AMD_MI300X Comparison SummarySummary:
AMD_W7900 Comparison SummarySummary:
NVIDIA_H100 Comparison SummarySummary:
NVIDIA_L40S Comparison SummaryThere are some workflows for which there are errors in the baseline: Summary:
NVIDIA_T4 Comparison SummaryThere are some workflows for which there are errors in the baseline: Summary:
|
|
urgent |
|
+1 |
|
+heterogeneous Thanks Adriano for the fix and addressing the various comments. |
|
@cms-sw/geometry-l2 ping |
|
+1 |
|
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2) |
|
+1 |
PR description:
This PR proposes a couple of fixes for the digi morphing to work properly for differen conditions (when acitve or not):
maxPixInModuleForMorphingconstant depending on theTrackerTraitsto be used to define the number of threads for theFindCluskernel. This also sizes the histogram holding the pixels in a module:maxIterGPUper topology (given the different number of pixels affects the number of iterations we can use to cover the full module);maxFakesInModuleconfiguration parameter to take into account themaxPixInModuleForMorphingmax to avoid the histogram overflowing.PR validation:
160.03502run.Running successfully the test from @henriettepetersen :
Backport is needed to
15_1_Xfor HI data taking.