Conversation

@AdrianoDee (Contributor) commented Sep 29, 2025

PR description:

This PR proposes a couple of fixes so that the digi morphing works properly under different conditions (when active or not):

  • defining a maxPixInModuleForMorphing constant, depending on the TrackerTraits, that is used to define the number of threads for the FindClus kernel. This also sizes the histogram holding the pixels in a module:
using Hist = cms::alpakatools::HistoContainer<uint16_t,
                                              TrackerTraits::clusterBinning,
                                              TrackerTraits::maxPixInModuleForMorphing,
                                              TrackerTraits::clusterBits,
                                              uint16_t>;
  • having a different maxIterGPU per topology (since the different number of pixels affects the number of iterations we can use to cover the full module);
  • limiting the maxFakesInModule configuration parameter to respect the maxPixInModuleForMorphing maximum, so that the histogram cannot overflow (see the sketch after this list).
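
A minimal sketch (not the actual CMSSW code) of how these constants interact, using per-topology numbers quoted later in this thread; the clamping formula itself is an assumption, meant only to illustrate that the configured maxFakesInModule must leave the histogram (sized with maxPixInModuleForMorphing) enough room for the original pixels:

#include <algorithm>
#include <cstdint>
#include <cstdio>

struct HIonPhase1Like {  // illustrative values, not a real TrackerTraits
  static constexpr uint32_t maxPixInModule = 10000;
  static constexpr uint32_t maxPixInModuleForMorphing = 11000;
};

template <typename TrackerTraits>
uint32_t clampMaxFakesInModule(uint32_t configuredMaxFakes) {
  // headroom left in the histogram once the original pixels are accounted for
  constexpr uint32_t headroom =
      TrackerTraits::maxPixInModuleForMorphing - TrackerTraits::maxPixInModule;
  return std::min(configuredMaxFakes, headroom);
}

int main() {
  // a configured value of 5000 would be clamped to 1000 for this topology
  std::printf("maxFakesInModule -> %u\n", clampMaxFakesInModule<HIonPhase1Like>(5000));
}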

PR validation:

Ran relval workflow 160.03502.

Successfully ran the following test from @henriettepetersen:

hltConfigFromDB --configName /online/collisions/2025/2e34/v1.2/HLT/V2 > hlt.py
cp /gpu_data/store/data/Run2025C/EphemeralHLTPhysics/FED/run393240_cff.py .
cat >> hlt.py << @EOF

process.load('run393240_cff')

from Configuration.AlCa.GlobalTag import GlobalTag as customiseGlobalTag
process.GlobalTag = customiseGlobalTag(process.GlobalTag, globaltag = '150X_dataRun3_HLT_v1')

from HLTrigger.Configuration.customizeHLTforCMSSW import customizeHLTforCMSSW
process = customizeHLTforCMSSW(process)

process.PrescaleService.lvl1DefaultLabel = '2p0E34'
process.PrescaleService.forceDefault = True

process.options.wantSummary = False
process.MessageLogger.cerr.enableStatistics = cms.untracked.bool(False)

process.FastTimerService.writeJSONSummary = True

process.ThroughputService = cms.Service('ThroughputService',
    enableDQM = cms.untracked.bool(False),
    printEventSummary = cms.untracked.bool(True),
    eventResolution = cms.untracked.uint32(100),
    eventRange = cms.untracked.uint32(10300),
)
process.MessageLogger.cerr.ThroughputService = cms.untracked.PSet(
    limit = cms.untracked.int32(10000000),
    reportEvery = cms.untracked.int32(1)
)

import os
os.makedirs('%s/run%d' % (process.EvFDaqDirector.baseDir.value(), process.EvFDaqDirector.runNumber.value()), exist_ok=True)

process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
process.options.numberOfConcurrentLuminosityBlocks = 2
process.maxEvents.input = 10300

process.hltSiPixelClustersSoA.DoDigiMorphing = cms.bool( True )
process.hltSiPixelClustersSoASerialSync.DoDigiMorphing = cms.bool( True )

@EOF

# run the configuration
cmsRun hlt.py

A backport to 15_1_X is needed for HI data taking.

@cmsbuild (Contributor) commented Sep 29, 2025

cms-bot internal usage

@AdrianoDee (Contributor Author)

enable gpu

@AdrianoDee (Contributor Author)

solves #48885

@cmsbuild (Contributor)

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49021/46221

@cmsbuild (Contributor)

A new Pull Request was created by @AdrianoDee for master.

It involves the following packages:

  • Geometry/CommonTopologies (geometry)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)

@Dr15Jones, @bsunanda, @civanch, @cmsbuild, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @bsunanda, @dkotlins, @elusian, @fabiocos, @felicepantaleo, @ferencek, @gpetruc, @martinamalberti, @mmasciov, @mmusich, @mroguljic, @mtosi, @rovere, @threus, @tsusa, @tvami this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@AdrianoDee (Contributor Author)

test parameters:

  • relvals_gpu = 160.03502
  • relvals_opts_gpu = -w gpu

@AdrianoDee (Contributor Author)

please test

@AdrianoDee (Contributor Author)

type bug-fix

@fwyzard (Contributor) commented Sep 29, 2025

assign heterogeneous

@cmsbuild (Contributor)

New categories assigned: heterogeneous

@fwyzard, @makortel you have been requested to review this pull request/issue and eventually sign. Thanks

#include "HeterogeneousCore/AlpakaInterface/interface/warpsize.h"

//#define GPU_DEBUG
// #define GPU_DEBUG
Contributor

Could you undo the extra whitespace change ?

ALPAKA_ASSERT_ACC((alpaka::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[0u] <= maxElements));

constexpr unsigned int maxIter = maxIterGPU * maxElements;
const unsigned int maxIter = TrackerTraits::maxIterClustering * maxElements;
Contributor

Shouldn't the declaration of the arrays nn[maxIter][maxNeighbours] and nnn[maxIter] be disallowed if maxIter is not constexpr, or anyway not known at compile time?

Contributor

In the host code we have tolerated variable-length arrays as a non-standard extension (for reasons that can be debated elsewhere). I don't know to what extent VLAs work in nvcc or hipcc.

Contributor

CUDA does not seem to like it, compiling this

__global__
void kernel(bool more) {
  const int size = more ? 42 : 21;
  float data[size];

  if (threadIdx.x < size) {
    data[threadIdx.x] = 0;
  }
}

int main(void) {
  kernel<<<1,1>>>(true);
  return 0;
}

fails with

$ /usr/local/cuda-12.9/bin/nvcc -c test.cu -o test.o -arch sm_75
test.cu(4): error: expression must have a constant value
    float data[size];
               ^
test.cu(4): note #2689-D: the value of variable "size" (declared at line 3) cannot be used as a constant
    float data[size];
               ^

1 error detected in the compilation of "test.cu".

Contributor

Although the compilation of the tests seems to be progressing fine ?
And CUDA does support alloca() 🤔, in fact this compiles:

__global__
void kernel(bool more) {
  const int size = more ? 42 : 21;
  //float data[size];
  float* data = static_cast<float *>(alloca(size * sizeof(float)));

  if (threadIdx.x < size) {
    data[threadIdx.x] = 0;
  }
}

Contributor Author

The alternative I see is just sizing it with the maximum possible (so TrackerTraits::maxElementsPerBlockMorph), basically wasting some of it.
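
A plain C++ sketch of one possible reading of this alternative (hypothetical, not code from this PR): the arrays get a compile-time upper bound derived from maxElementsPerBlockMorph and only the first maxIter entries are used, trading some memory for a size that is always known at compile time.

#include <cstdint>
#include <cstdio>

struct Phase2Like {  // illustrative values only
  static constexpr uint32_t maxIterClustering = 16;
  static constexpr uint32_t maxElementsPerBlockMorph = 384;
};

template <typename TrackerTraits>
void clusterSketch(uint32_t maxElements) {
  // fixed upper bound, independent of the run-time morphing flag
  constexpr uint32_t maxIterBound =
      TrackerTraits::maxIterClustering * TrackerTraits::maxElementsPerBlockMorph;
  const uint32_t maxIter = TrackerTraits::maxIterClustering * maxElements;
  uint16_t nnn[maxIterBound];  // fixed size; only the first maxIter entries are used
  for (uint32_t i = 0; i < maxIter; ++i)
    nnn[i] = 0;
  std::printf("using %u of %u slots\n", maxIter, maxIterBound);
}

int main() { clusterSketch<Phase2Like>(1); }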

Contributor Author

And to be honest, I wasn't expecting this to compile either.

Contributor

After looking a bit better into it I think I understand why it works:

  • on CPU the value of maxIter depends on whether morphing is enabled or not, but it works because, as Matti pointed out, we allow variable-sized arrays;
  • on GPU the value of maxIter is actually independent of whether morphing is enabled or not, so the compiler can determine it at compile time (illustrated in the sketch below).
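
A standalone sketch of the two cases (hypothetical, not the actual kernel code), with Phase2-like numbers used purely for illustration: on the GPU path maxElements is the literal 1, so maxIter reduces to TrackerTraits::maxIterClustering; on the CPU path it depends on the run-time morphing flag and the array becomes a variable-length array that the host compilers accept as an extension.

#include <cstdint>
#include <cstdio>

struct Phase2Like {  // illustrative values only
  static constexpr uint32_t maxIterClustering = 16;
  static constexpr uint32_t maxElementsPerBlock = 384;
  static constexpr uint32_t maxElementsPerBlockMorph = 384;
};

template <typename TrackerTraits, bool onGPU>
void findClusSketch(bool enableDigiMorphing) {
  const uint32_t maxElements =
      onGPU ? 1u
            : (enableDigiMorphing ? TrackerTraits::maxElementsPerBlockMorph
                                  : TrackerTraits::maxElementsPerBlock);
  const uint32_t maxIter = TrackerTraits::maxIterClustering * maxElements;
  uint16_t nnn[maxIter];  // compile-time size when onGPU, a VLA otherwise
  nnn[0] = 0;
  std::printf("maxIter = %u (nnn[0] = %u)\n", maxIter, unsigned(nnn[0]));
}

int main() {
  findClusSketch<Phase2Like, true>(false);   // GPU-like: maxIter = 16
  findClusSketch<Phase2Like, false>(true);   // CPU-like, morphing on: maxIter = 6144
}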

Contributor Author

One smart compiler!


static constexpr uint32_t maxPixInModule = 6000;
static constexpr uint32_t maxPixInModuleForMorphing = maxPixInModule;
static constexpr uint32_t maxIterClustering = 16;
Contributor

Can we derive this from maxPixInModule or maxPixInModuleForMorphing ?

@AdrianoDee (Contributor Author) Sep 30, 2025

Yes, we could. But it depends on how we want to handle the number of blocks and threads for FindClus. As is, we fix maxPixInModule, maxIterClustering, blocks, and extrapolate maxElementsPerBlock so that maxElementsPerBlock = maxPixInModule/(maxIterClustering * blocks).

Contributor

If I follow the code correctly, now we have

  • maxPixInModule, maxPixInModuleForMorphing and maxIterClustering fixed here
  • maxElementsPerBlock = maxPixInModule / maxIterClustering, rounded up to the next multiple of 64
  • maxElementsPerBlockMorph = maxPixInModuleForMorphing / maxIterClustering, rounded up to the next multiple of 64
  • maxElements
    • on CPU it is either maxElementsPerBlock or maxElementsPerBlockMorph
    • on GPU it is always 1.
  • maxIter = maxIterClustering × maxElements
    • on CPU it is maxPixInModule or maxPixInModuleForMorphing, rounded up to the next multiple of (maxIterClustering × 64)
    • on GPU it is maxIterClustering

Which results in

                                               Phase2    Phase1    HIonPhase1
maxPixInModule                                   6000      6000         10000
maxPixInModuleForMorphing                        6000      8400         11000
maxIterClustering                                  16        24            32

maxElementsPerBlock                               384       256           320
maxElementsPerBlockMorph                          384       384           384

maxElements (CPU, enableDigiMorphing = false)     384       256           320
maxElements (CPU, enableDigiMorphing = true)      384       384           384
maxElements (GPU)                                   1         1             1

maxIter (CPU, enableDigiMorphing = false)        6144      4096          5120
maxIter (CPU, enableDigiMorphing = true)         6144      9216         12288
maxIter (GPU)                                      16        24            32

Then maxIter is used to allocate the arrays of nearest neighbours.
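
For reference, a minimal sketch (not the CMSSW code) that reproduces the rounding rule described above, cross-checked against the HIonPhase1 column of the table:

#include <cstdint>
#include <cstdio>

constexpr uint32_t roundUpToMultipleOf64(uint32_t v) { return ((v + 63) / 64) * 64; }

constexpr uint32_t elementsPerBlock(uint32_t pixBudget, uint32_t maxIterClustering) {
  return roundUpToMultipleOf64(pixBudget / maxIterClustering);
}

static_assert(elementsPerBlock(10000, 32) == 320, "maxElementsPerBlock (HIonPhase1)");
static_assert(elementsPerBlock(11000, 32) == 384, "maxElementsPerBlockMorph (HIonPhase1)");

int main() {
  std::printf("%u %u\n", elementsPerBlock(10000, 32), elementsPerBlock(11000, 32));
}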

My suggestion would be to

  • fix maxPixInModule and maxPixInModuleForMorphing like in this PR
  • determine maxElementsPerBlock based on what works and gives good performance on the T4 and/or L4 GPUs, and keep it fixed (hopefully using the same value with and without morphing)
  • derive maxIterClustering from maxPixInModuleForMorphing or maxPixInModule, depending on whether morphing is enabled or not.

What do you think ?
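
For concreteness, a hypothetical sketch of this proposal (not code from this PR), which also makes the caveat discussed just below visible: since enableDigiMorphing is a run-time flag, the derived iteration count is no longer a compile-time constant, while the GPU kernel needs one to size the nn/nnn arrays.

#include <cstdint>
#include <cstdio>

constexpr uint32_t ceilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

struct HIonPhase1Like {  // illustrative values only
  static constexpr uint32_t maxPixInModule = 10000;
  static constexpr uint32_t maxPixInModuleForMorphing = 11000;
  static constexpr uint32_t maxElementsPerBlock = 320;  // assumed fixed, per the proposal
};

template <typename TrackerTraits>
uint32_t iterClustering(bool enableDigiMorphing) {
  const uint32_t pixBudget = enableDigiMorphing ? TrackerTraits::maxPixInModuleForMorphing
                                                : TrackerTraits::maxPixInModule;
  return ceilDiv(pixBudget, TrackerTraits::maxElementsPerBlock);  // a run-time value
}

int main() {
  std::printf("iterations: %u (morphing off), %u (morphing on)\n",
              iterClustering<HIonPhase1Like>(false), iterClustering<HIonPhase1Like>(true));
}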

@AdrianoDee (Contributor Author) Oct 2, 2025

I was implementing this (I agree it's a better set of fixed variables), but: does this imply that maxIter is not fixed at compile time on GPU and the issue above would manifest?

Contributor

Yes... 🤦🏻‍♂️

Contributor Author

Shall we stay with the current schema for the moment? Just to get it in for the next 15_1_X release.

Contributor

OK, I don't have a better suggestion, so let's keep it as it is for the moment 🤷🏻‍♂️

static constexpr uint16_t last_barrel_detIndex = 864;

static constexpr uint32_t maxPixInModule = 6000;
static constexpr uint32_t maxPixInModuleForMorphing = maxPixInModule;
Contributor

Given the name maxPixInModuleForMorphing, it would make sense for this constant to indicate how many pixels at most one can expect to be recovered by the morphing step, rather than the total of original plus recovered pixels ?
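
For illustration, a hypothetical sketch of that reading (not the code in this PR): if maxPixInModuleForMorphing counted only the pixels added by morphing, the histogram capacity would become the sum of the two constants. The value 1000 below is made up so that the total matches the 11000 quoted earlier for HIonPhase1.

#include <cstdint>

struct HIonPhase1Like {  // illustrative values only
  static constexpr uint32_t maxPixInModule = 10000;
  static constexpr uint32_t maxPixInModuleForMorphing = 1000;  // recovered pixels only
};

template <typename TrackerTraits>
constexpr uint32_t histogramCapacity() {
  // total pixel budget used to size the histogram
  return TrackerTraits::maxPixInModule + TrackerTraits::maxPixInModuleForMorphing;
}

static_assert(histogramCapacity<HIonPhase1Like>() == 11000, "total pixel budget");

int main() {}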

Contributor Author

Right, makes sense.

@cmsbuild (Contributor)

+1

Size: This PR adds an extra 48KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-05a049/48337/summary.html
COMMIT: 9265f41
CMSSW: CMSSW_16_0_X_2025-09-28-2300/el8_amd64_gcc12
Additional Tests: GPU,AMD_MI300X,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49021/48337/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
2024.0050001 step 1
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

  • You potentially removed 3 lines from the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3861349
  • DQMHistoTests: Total failures: 23
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3861306
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 214 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

NVIDIA_H100 Comparison Summary

Summary:

NVIDIA_L40S Comparison Summary

There are some workflows for which there are errors in the baseline:
160.03502 step 4
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

  • You potentially removed 22699 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 237 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146188
  • DQMHistoTests: Total failures: 27427
  • DQMHistoTests: Total nulls: 10
  • DQMHistoTests: Total successes: 118751
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 46 log files, 50 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

There are some workflows for which there are errors in the baseline:
160.03502 step 4
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

  • You potentially removed 18689 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 196 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146188
  • DQMHistoTests: Total failures: 33692
  • DQMHistoTests: Total nulls: 13
  • DQMHistoTests: Total successes: 112483
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 46 log files, 50 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

@cmsbuild (Contributor) commented Oct 2, 2025

Pull request #49021 was updated. @Dr15Jones, @bsunanda, @civanch, @cmsbuild, @fwyzard, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth can you please check and sign again.

@AdrianoDee (Contributor Author)

please test

@cmsbuild (Contributor) commented Oct 3, 2025

+1

Size: This PR adds an extra 40KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-05a049/48429/summary.html
COMMIT: a487e5e
CMSSW: CMSSW_16_0_X_2025-10-02-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49021/48429/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

AMD_MI300X Comparison Summary

Summary:

AMD_W7900 Comparison Summary

Summary:

NVIDIA_H100 Comparison Summary

Summary:

NVIDIA_L40S Comparison Summary

There are some workflows for which there are errors in the baseline:
160.03502 step 4
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

  • You potentially removed 220646 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 234 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146284
  • DQMHistoTests: Total failures: 24204
  • DQMHistoTests: Total nulls: 5
  • DQMHistoTests: Total successes: 122075
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 46 log files, 50 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

There are some workflows for which there are errors in the baseline:
160.03502 step 4
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

  • You potentially removed 17453 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 254 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146284
  • DQMHistoTests: Total failures: 24974
  • DQMHistoTests: Total nulls: 5
  • DQMHistoTests: Total successes: 121305
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 46 log files, 50 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

@mandrenguyen (Contributor)

urgent
@cms-sw/heterogeneous-l2 @cms-sw/geometry-l2 @cms-sw/reconstruction-l2 please have a look today, if possible.
The 15_1_0 build is being held up by the backport of this PR. Thank you!

@cmsbuild added the urgent label Oct 3, 2025
@jfernan2 (Contributor) commented Oct 3, 2025

+1

@fwyzard (Contributor) commented Oct 3, 2025

+heterogeneous

Thanks Adriano for the fix and addressing the various comments.

@mandrenguyen (Contributor)

@cms-sw/geometry-l2 ping

@civanch (Contributor) commented Oct 3, 2025

+1

@cmsbuild (Contributor) commented Oct 3, 2025

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen (Contributor)

+1

@cmsbuild merged commit dd7156a into cms-sw:master Oct 3, 2025
26 checks passed
@AdrianoDee deleted the digimoprh_sharedmemory_160X branch October 13, 2025 09:29