@mmusich
Contributor

@mmusich mmusich commented Jan 14, 2026

PR description:

Following the discussion at the NGT meeting of Jan 13, 2026, this PR implements two changes:

  • hardcode the input file name for the phase-2 HLT timing tests (in order to guarantee perfect deterministic reproducibility across IBs)
  • explicitly use the alpaka process modifier in all the NGT scouting related workflows to offload part of the HGCal reconstruction to GPUs
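
As a rough illustration of the two changes listed above, the sketch below assembles a cmsDriver.py-style argument list that pins the input file and requests the alpaka process modifier. The helper `build_driver_args`, the `Phase2` fragment name, the step string, and the input file path are hypothetical placeholders; only the `-s`, `--filein` and `--procModifiers` options come from cmsDriver.py itself, and the actual workflow definitions touched by this PR may look different.

```python
from typing import List

# Hypothetical placeholder: NOT the file actually used by the timing test.
PINNED_INPUT = "file:/path/to/pinned_phase2_input.root"


def build_driver_args(step: str,
                      use_alpaka: bool = True,
                      input_file: str = PINNED_INPUT) -> List[str]:
    """Assemble a cmsDriver.py-style argument list (illustrative only)."""
    # A fixed --filein means every IB processes exactly the same events.
    args = ["cmsDriver.py", "Phase2", "-s", step, "--filein", input_file]
    if use_alpaka:
        # Request the alpaka process modifier explicitly rather than
        # relying on whatever the workflow enables by default.
        args += ["--procModifiers", "alpaka"]
    return args


if __name__ == "__main__":
    # The step name is an assumption made for this example.
    print(" ".join(build_driver_args("HLT:NGTScouting")))
```

Keeping the input file in a single constant makes it easy to see, and to review, when the reference sample changes.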

PR validation:

To be tested by the bot.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Not a backport, no backport needed.

@mmusich
Contributor Author

mmusich commented Jan 14, 2026

type ngt

@cmsbuild
Contributor

cmsbuild commented Jan 14, 2026

cms-bot internal usage

@cmsbuild
Contributor

A new Pull Request was created by @mmusich for master.

It involves the following packages:

  • Configuration/PyReleaseValidation (pdmv)
  • HLTrigger/Configuration (hlt)

@AdrianoDee, @DickyChant, @Martin-Grunewald, @antoniovagnerini, @cmsbuild, @miquork, @mmusich can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @SohamBhattacharya, @VourMa, @fabiocos, @makortel, @missirol, @rovere, @slomeo this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented Jan 14, 2026

test parameters:

  • enable = hlt_p2_integration, hlt_p2_timing
  • workflows = ph2_hlt

@mmusich
Contributor Author

mmusich commented Jan 14, 2026

@cmsbuild, please test

@cmsbuild
Contributor

-1

Failed Tests: RelVals
Size: This PR adds an extra 40KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-480d8a/50620/summary.html
COMMIT: 43410e9
CMSSW: CMSSW_16_1_X_2026-01-13-2300/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49821/50620/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Failed RelVals

----- Begin Fatal Exception 14-Jan-2026 14:46:42 CET-----------------------
An exception of category 'OutOfBound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 4 stream: 0
   [1] Running path 'HLTriggerFinalPath'
   [2] Prefetching for module TriggerSummaryProducerAOD/'hltTriggerSummaryAOD'
   [3] Prefetching for module L1HPSPFTauProducer/'l1tHPSPFTauProducer'
   [4] Prefetching for module L1TPFCandMultiMerger/'l1tLayer1'
   [5] Prefetching for module L1TCorrelatorLayer1Producer/'l1tLayer1HGCal'
   [6] Calling method for module HGCalBackendLayer2Producer/'l1tHGCalBackEndLayer2Producer'
Exception Message:
TC X1 = 0.0713466 out of the seeding histogram bounds 0.076 - 0.58
----- End Fatal Exception -------------------------------------------------
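
For readers unfamiliar with this kind of failure, here is a minimal sketch of a range check that would produce a similar message. It is not the HGCal L1 backend code: the `OutOfBound` class, the `check_in_bounds` function, and the inclusive treatment of the bounds are assumptions made for illustration. Only the two bound values and the offending coordinate are taken from the log above.

```python
class OutOfBound(ValueError):
    """Illustrative stand-in for the framework exception category."""


def check_in_bounds(x: float, lo: float = 0.076, hi: float = 0.58) -> float:
    """Return x if it lies within [lo, hi], otherwise raise OutOfBound.

    Inclusive bounds are an assumption of this sketch.
    """
    if not (lo <= x <= hi):
        raise OutOfBound(
            f"TC X1 = {x} out of the seeding histogram bounds {lo} - {hi}"
        )
    return x


if __name__ == "__main__":
    try:
        check_in_bounds(0.0713466)  # value reported in the log above
    except OutOfBound as err:
        print(err)
```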

@mmusich
Contributor Author

mmusich commented Jan 15, 2026

The updated timing result, after adding the alpaka process modifier, is odd:

image

@mmusich
Contributor Author

mmusich commented Jan 15, 2026

@cmsbuild, please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-480d8a/50652/summary.html
COMMIT: 43410e9
CMSSW: CMSSW_16_1_X_2026-01-14-2300/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49821/50652/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Comparison Summary

Summary:

  • You potentially added 5 lines to the logs
  • Reco comparison results: 10 differences found in the comparisons
  • DQMHistoTests: Total files compared: 71
  • DQMHistoTests: Total histograms compared: 4580671
  • DQMHistoTests: Total failures: 1010
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4579641
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 70 files compared)
  • Checked 285 log files, 240 edm output root files, 71 DQM output files
  • TriggerResults: found differences in 3 / 69 workflows

@mmusich
Contributor Author

mmusich commented Jan 16, 2026

please test

  • to finally have the reference timing.

@cmsbuild
Contributor

+1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-480d8a/50684/summary.html
COMMIT: 43410e9
CMSSW: CMSSW_16_1_X_2026-01-15-2300/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49821/50684/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Comparison Summary

Summary:

  • You potentially added 6 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 71
  • DQMHistoTests: Total histograms compared: 4580671
  • DQMHistoTests: Total failures: 1002
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4579649
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 70 files compared)
  • Checked 285 log files, 240 edm output root files, 71 DQM output files
  • TriggerResults: found differences in 3 / 69 workflows

- add the alpaka process modifier to the NGT scouting timing test to offload part of the HGCal reconstruction to GPU

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49821/47534

@cmsbuild
Contributor

Pull request #49821 was updated. @AdrianoDee, @DickyChant, @Martin-Grunewald, @antoniovagnerini, @cmsbuild, @miquork, @mmusich can you please check and sign again.

@mmusich
Contributor Author

mmusich commented Jan 16, 2026

please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-480d8a/50695/summary.html
COMMIT: 9a7179d
CMSSW: CMSSW_16_1_X_2026-01-16-1100/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49821/50695/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Comparison Summary

Summary:

  • You potentially added 3 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 71
  • DQMHistoTests: Total histograms compared: 4580671
  • DQMHistoTests: Total failures: 1047
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4579604
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 70 files compared)
  • Checked 285 log files, 240 edm output root files, 71 DQM output files
  • TriggerResults: found differences in 3 / 69 workflows
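
The DQMHistoTests counters reported for the three successful test runs in this thread can be cross-checked with a few lines of Python. The numbers are copied from the bot reports and keyed by the Jenkins run number from each summary URL; the dictionary layout and helper names are illustrative and not part of cms-bot.

```python
TOTAL_HISTOGRAMS = 4580671  # "Total histograms compared" in every report

# Counters copied from the three comparison summaries.
RUNS = {
    50652: {"failures": 1010, "nulls": 0, "successes": 4579641,
            "skipped": 20, "missing": 0},
    50684: {"failures": 1002, "nulls": 0, "successes": 4579649,
            "skipped": 20, "missing": 0},
    50695: {"failures": 1047, "nulls": 0, "successes": 4579604,
            "skipped": 20, "missing": 0},
}


def is_consistent(counts: dict) -> bool:
    """Check that the per-category counters add up to the total."""
    return sum(counts.values()) == TOTAL_HISTOGRAMS


def failure_fraction(counts: dict) -> float:
    """Fraction of compared histograms that were flagged as failures."""
    return counts["failures"] / TOTAL_HISTOGRAMS


if __name__ == "__main__":
    for run, counts in RUNS.items():
        print(run, is_consistent(counts), f"{failure_fraction(counts):.2e}")
```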

@VourMa
Contributor

VourMa commented Jan 19, 2026

hardcode the input file name for the phase-2 HLT timing tests (in order to guarantee perfect deterministic reproducibility across IBs)

For my understanding, is this part of the PR dropped by choice and the description is outdated, or is the intended change lost in some force-push? I only see alpaka-related changes up to now.

@mmusich
Contributor Author

mmusich commented Jan 19, 2026

For my understanding, is this part of the PR dropped by choice and the description is outdated, or is the intended change lost in some force-push?

you should not take this PR at face value yet (that's why it's not signed).

@mmusich
Contributor Author

mmusich commented Jan 19, 2026

The outcome of the tests at #49821 (comment) confirms the oddity of the timing tests when the alpaka modifier is used:

| Vanilla CMSSW_16_1_X_2026-01-16-1100 | CMSSW_16_1_X_2026-01-16-1100 + this PR |
| --- | --- |
| Screenshot from 2026-01-19 21-08-27 | Screenshot from 2026-01-19 21-08-34 |

Observations:

  • the overall timing is increased;
  • the HGCal slice of the pie is sharply reduced;
  • the amount of "other" explodes.

plot(6)

@mmusich
Contributor Author

mmusich commented Jan 20, 2026

Looking at the console log from here (attached to the thread to keep it available):

PR cms-sw_cmssw#49821 el8_amd64_gcc13.txt

one can see the measurement is biased by a failure:

22:56:14 Config file NGTScouting_L1P2GT_HLT.py created
22:56:14 + '[' -e NGTScouting_L1P2GT_HLT.py ']'
22:56:14 + '[' '!' -d patatrack-scripts ']'
22:56:14 + patatrack-scripts/benchmark -j 16 -t 16 -s 16 -e 1000 --no-run-io-benchmark --event-skip 100 --event-resolution 10 -k Phase2Timing_resources.json -- NGTScouting_L1P2GT_HLT.py
22:56:17 2 CPUs:
22:56:17   0: AMD EPYC 7763 64-Core Processor (64 cores, 128 threads)
22:56:17   1: AMD EPYC 7763 64-Core Processor (64 cores, 128 threads)
22:56:17 
22:56:17 2 visible NVIDIA CUDA GPUs:
22:56:17   0: Tesla T4
22:56:17   1: Tesla T4
22:56:17 
22:56:17 No visible AMD ROCm GPUs
22:56:17 
22:56:18 Benchmarking NGTScouting_L1P2GT_HLT.py
22:56:18 Warming up
22:57:31 The underlying cmsRun job failed with return code 66
22:57:31 
22:57:31 The last lines of the error log are:
22:57:31    [1] Running path 'DST_PFScouting'
22:57:31    [2] Calling method for module SiPixelPhase2DigiToCluster@alpaka/'hltPhase2SiPixelClustersSoA'
22:57:31    [3] Calling Async::run()
22:57:31 Exception Message:
22:57:31 Framework is shutting down, further run() calls are not allowed
22:57:31 ----- End Fatal Exception -------------------------------------------------
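
A failure like this is easy to miss when only the final timing chart is inspected. The following is a minimal sketch of a check that scans a saved console log for the failure line shown above. The function name and the idea of running it as a post-processing step are assumptions; nothing here is part of patatrack-scripts or cms-bot.

```python
import re
from typing import Optional

# Matches the message printed by the benchmark wrapper in the log above.
FAILURE_RE = re.compile(
    r"The underlying cmsRun job failed with return code (\d+)"
)


def find_cmsrun_failure(log_text: str) -> Optional[int]:
    """Return the cmsRun return code if the log reports a failure, else None."""
    match = FAILURE_RE.search(log_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = (
        "22:56:18 Warming up\n"
        "22:57:31 The underlying cmsRun job failed with return code 66\n"
    )
    code = find_cmsrun_failure(sample)
    if code is not None:
        print(f"cmsRun failed with return code {code}; "
              "the timing measurement should not be trusted")
```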
