CPU vs. GPU for LST in HLT and updates to the offline #49832

VourMa · 2026-01-14T16:55:29Z

The goal of this PR is to introduce two HLT workflows to monitor the agreement between LST on CPU and LST on GPU:

- Workflow 0.7541 monitors the LST output tracks when LST is used for track building (the most direct comparison of LST), i.e. for alpakaValidationLST,singleIterPatatrack,trackingLST.
- Workflow 0.7573 monitors the built tracks in the upcoming new tracking baseline, where LST is used as an extended seeding algorithm (a comparison of the LST output in a "production" configuration), i.e. for singleIterPatatrack,phase2CAExtension,trackingLST,seedingLST,trackingMkFitCommon,hltTrackingMkFitInitialStep.

The additional CPU reconstruction (SerialSync) and the comparison plots are implemented with a new procModifier, alpakaValidationLST. This procModifier takes effect only when it is used in one of the procModifier combinations mentioned above; in any other combination it produces neither the additional products nor the comparison plots. It is also included in the alpakaValidation modifier chain.
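The "only in combination" behaviour can be illustrated with a small, framework-free sketch. This is not the CMSSW implementation: the dictionary and the function name `validation_is_active` are hypothetical, and treating any superset of a listed combination as sufficient is an assumption made for illustration.

```python
# Companion procModifiers listed in the PR description for the two HLT workflows.
# Hypothetical table: keys are the workflow suffixes, values the required companions.
REQUIRED_COMBINATIONS = {
    "0.7541": {"singleIterPatatrack", "trackingLST"},
    "0.7573": {
        "singleIterPatatrack", "phase2CAExtension", "trackingLST",
        "seedingLST", "trackingMkFitCommon", "hltTrackingMkFitInitialStep",
    },
}


def validation_is_active(enabled_modifiers):
    """Return True if alpakaValidationLST would take effect in this sketch.

    Assumption: the modifier is effective when it is enabled together with
    all members of at least one listed combination.
    """
    enabled = set(enabled_modifiers)
    if "alpakaValidationLST" not in enabled:
        return False
    return any(combo <= enabled for combo in REQUIRED_COMBINATIONS.values())


print(validation_is_active(["alpakaValidationLST", "singleIterPatatrack", "trackingLST"]))
print(validation_is_active(["alpakaValidationLST"]))
```

In this sketch the first call reports an active validation, while enabling alpakaValidationLST on its own does not.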

The analyzer that produces the comparison plots gains a new option to skip the luminosity and PU plots.

With the introduction of the alpakaValidationLST modifier, the offline workflow testing LST on CPU vs. LST on GPU can be made explicit. The code is changed so that the heterogeneous workflow 0.712 (previously 0.704) runs the offline reconstruction without any additional CPU reconstruction, while a new workflow, 0.713, runs the comparison. Workflow 0.703 has also been renamed to 0.711. The workflow numbering changes are made so that the offline LST workflows follow the numbering conventions for Alpaka workflows.
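For anyone maintaining lists of workflow numbers outside this repository, the renumbering can be written down as a small lookup. This is a minimal sketch, assuming workflow IDs are handled as strings of the form `<prefix>.<suffix>`; the names `RENAMED_SUFFIXES` and `translate` are hypothetical and not part of CMSSW or cms-bot.

```python
# Renumbering stated in the PR description (suffix after the decimal point).
RENAMED_SUFFIXES = {
    "703": "711",
    "704": "712",
}


def translate(workflow_id):
    """Map an old workflow ID string to its new number; leave others unchanged."""
    base, _, suffix = workflow_id.partition(".")
    if not suffix:
        return workflow_id  # no suffix, nothing to rename
    return f"{base}.{RENAMED_SUFFIXES.get(suffix, suffix)}"


for old in ["34634.704", "34634.703", "34434.7541"]:
    print(old, "->", translate(old))
```

Working on strings rather than floats avoids rounding surprises and keeps suffixes such as 7041 from being matched by the 704 rule.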

Some screenshots of the content of the DQM file:

cmsbuild · 2026-01-14T16:56:01Z

cms-bot internal usage

cmsbuild · 2026-01-14T16:57:26Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49832/47487

There are other open Pull requests which might conflict with changes you have proposed:
- File Configuration/PyReleaseValidation/README.md modified in PR(s): Remove alpaka procModfier from workflows in which it is no longer useful #49755
- File Configuration/PyReleaseValidation/python/relval_Run4.py modified in PR(s): Remove alpaka procModfier from workflows in which it is no longer useful #49755
- File Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py modified in PR(s): Remove alpaka procModfier from workflows in which it is no longer useful #49755, Deprecate autoCondPhase2 keys for inexistent geometries #49790, update phase-2 HLT timing script and use alpaka modifier in NGT workflows #49821

cmsbuild · 2026-01-14T16:57:53Z

A new Pull Request was created by @VourMa for master.

It involves the following packages:

Configuration/EventContent (operations)
Configuration/ProcessModifiers (operations)
Configuration/PyReleaseValidation (pdmv)
DQM/TrackingMonitorClient (dqm)
DQM/TrackingMonitorSource (dqm)
HLTrigger/Configuration (hlt)
RecoTracker/IterativeTracking (reconstruction)
Validation/RecoTrack (dqm)

@AdrianoDee, @DickyChant, @Martin-Grunewald, @Moanwar, @antoniovagnerini, @cmsbuild, @ctarricone, @davidlange6, @fabiocos, @ftenchini, @gabrielmscampos, @jfernan2, @mandrenguyen, @miquork, @mmusich, @nothingface0, @rseidita, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @SohamBhattacharya, @VinInn, @VourMa, @arossi83, @dgulhan, @elusian, @fabiocos, @felicepantaleo, @fioriNTU, @gpetruc, @idebruyn, @jandrea, @makortel, @missirol, @mmasciov, @mmusich, @mtosi, @richa2710, @rovere, @slomeo, @sroychow, @threus, @wmtford this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

mmusich · 2026-01-14T18:59:03Z

assign heterogeneous

cmsbuild · 2026-01-14T18:59:28Z

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich · 2026-01-14T19:00:21Z

Configuration/PyReleaseValidation/python/relval_Run4.py

 numWFIB.extend([prefixDet+34.7521])# HLTTiming75e33, ticl_v5, ticlv5TrackLinkingGNN
 numWFIB.extend([prefixDet+34.753]) # HLTTiming75e33, alpaka,singleIterPatatrack
 numWFIB.extend([prefixDet+34.754]) # HLTTiming75e33, alpaka,singleIterPatatrack,trackingLST
+numWFIB.extend([prefixDet+34.7541]) # HLTTiming75e33, alpakaValidationLST,singleIterPatatrack,trackingLST


Shouldn't this go rather in the gpu matrix? How do I test this from the bot with a GPU backend available?

Oops, my bad. Should be fixed in the last push.

cmsbuild · 2026-01-14T21:50:28Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49832/47496

There are other open Pull requests which might conflict with changes you have proposed:
- File Configuration/PyReleaseValidation/README.md modified in PR(s): Remove alpaka procModfier from workflows in which it is no longer useful #49755
- File Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py modified in PR(s): Remove alpaka procModfier from workflows in which it is no longer useful #49755, Deprecate autoCondPhase2 keys for inexistent geometries #49790, update phase-2 HLT timing script and use alpaka modifier in NGT workflows #49821

cmsbuild · 2026-01-14T21:50:51Z

Pull request #49832 was updated. @AdrianoDee, @DickyChant, @Martin-Grunewald, @Moanwar, @antoniovagnerini, @cmsbuild, @ctarricone, @davidlange6, @fabiocos, @ftenchini, @fwyzard, @gabrielmscampos, @jfernan2, @makortel, @mandrenguyen, @miquork, @mmusich, @nothingface0, @rseidita, @srimanob can you please check and sign again.

mmusich · 2026-01-15T08:27:23Z

enable gpu

mmusich · 2026-01-15T08:32:41Z

test parameters:

enable = hlt_p2_integration, hlt_p2_timing
workflows = ph2_hlt
enable_tests = gpu
workflows_gpu = 34434.7041, 34434.7541, 34434.7573
relvals_opt = -w upgrade,standard
relvals_opt_gpu = -w upgrade,standard

mmusich · 2026-01-15T08:32:48Z

@cmsbuild, please test

mmusich · 2026-01-19T12:19:00Z

+1

The GPU matrix didn't run, despite #49832 (comment) + #49832 (comment). I am not sure what the right way of configuring it is.

mmusich · 2026-01-20T12:56:09Z

test parameters:

enable = gpu, hlt_p2_integration, hlt_p2_timing
workflows = ph2_hlt
workflows_gpu = 34434.712, 34434.713, 34434.7541, 34434.7573
relvals_opt = -w upgrade,standard
relvals_opt_gpu = -w upgrade,standard
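To see at a glance how this set of test parameters differs from the one posted on 2026-01-15, the two comments can be parsed and compared. The parser below is a hypothetical helper written for illustration, not cms-bot's own parsing code, and it only reports which keys differ; it does not establish which change affects whether the GPU matrix runs.

```python
def parse_test_parameters(text):
    """Parse 'key = value' lines into a dict of raw string values."""
    params = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


first_attempt = parse_test_parameters("""
enable = hlt_p2_integration, hlt_p2_timing
workflows = ph2_hlt
enable_tests = gpu
workflows_gpu = 34434.7041, 34434.7541, 34434.7573
relvals_opt = -w upgrade,standard
relvals_opt_gpu = -w upgrade,standard
""")

second_attempt = parse_test_parameters("""
enable = gpu, hlt_p2_integration, hlt_p2_timing
workflows = ph2_hlt
workflows_gpu = 34434.712, 34434.713, 34434.7541, 34434.7573
relvals_opt = -w upgrade,standard
relvals_opt_gpu = -w upgrade,standard
""")

# Keys whose value changed, or that appear in only one of the two comments.
changed_keys = sorted(
    key
    for key in first_attempt.keys() | second_attempt.keys()
    if first_attempt.get(key) != second_attempt.get(key)
)
print(changed_keys)
```

The comparison shows that `enable`, `enable_tests`, and `workflows_gpu` are the keys that differ between the two comments.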

mmusich · 2026-01-20T12:56:17Z

@cmsbuild, please test

VourMa · 2026-01-21T16:07:26Z

Are the tests stuck on this PR?

smuzaffar · 2026-01-21T16:12:58Z

Yes @VourMa, tests were stuck. I have just forced a rebuild of the pending tests.

VourMa · 2026-01-21T16:15:01Z

> Yes @VourMa, tests were stuck. I have just forced a rebuild of the pending tests.

Thanks a lot, @smuzaffar!

cmsbuild · 2026-01-21T18:08:03Z

-1

Failed Tests: RelVals-NVIDIA_H100
Size: This PR adds an extra 44KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-371ebe/50753/summary.html
COMMIT: b807f98
CMSSW: CMSSW_16_1_X_2026-01-20-1100/el8_amd64_gcc13
Additional Tests: GPU,HLT_P2_INTEGRATION,HLT_P2_TIMING,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49832/50753/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Failed RelVals-NVIDIA_H100

ValueError: Undefined workflows: 34634.704

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 4 differences found in the comparisons
DQMHistoTests: Total files compared: 73
DQMHistoTests: Total histograms compared: 4814076
DQMHistoTests: Total failures: 0
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4814056
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 72 files compared)
Checked 293 log files, 250 edm output root files, 73 DQM output files
TriggerResults: no differences found

VourMa · 2026-01-21T19:53:55Z

I do not see where I might have missed a 0.704 workflow. If anyone has any suggestions, please let me know...

mmusich · 2026-01-21T20:04:59Z

> I do not see where I might have missed a 0.704 workflow. If anyone has any suggestions, please let me know...

I think here

VourMa · 2026-01-21T20:10:05Z

> > I do not see where I might have missed a 0.704 workflow. If anyone has any suggestions, please let me know...
>
> I think here

Oh, OK, thanks!
For my understanding, is this supposed to be hard-coded and not controlled by some subset of workflows from this repository?
In any case, I can make a PR to the bot repo as well, if that's the recommended way.

mmusich · 2026-01-21T20:11:41Z

> For my understanding, is this supposed to be hard-coded and not controlled by some subset of workflows from this repository?

🤷‍♂️

VourMa · 2026-01-22T13:18:11Z

> In any case, I can make a PR to the bot repo as well, if that's the recommended way.

The relevant PR has been made: cms-sw/cms-bot#2663

mmusich · 2026-01-22T16:36:59Z

test parameters:

enable = gpu, hlt_p2_integration, hlt_p2_timing
pull_request = Update GPU wfs for 1601 after renumbering in cmssw cms-bot#2663
workflows = ph2_hlt
workflows_gpu = 34434.712, 34434.713, 34434.7541, 34434.7573
relvals_opt = -w upgrade,standard
relvals_opt_gpu = -w upgrade,standard

mmusich · 2026-01-22T16:37:06Z

@cmsbuild, please test

cmsbuild · 2026-01-22T19:57:43Z

-1

Failed Tests: UnitTests RelVals-NVIDIA_L40S
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-371ebe/50817/summary.html
COMMIT: b807f98
CMSSW: CMSSW_16_1_X_2026-01-22-1100/el8_amd64_gcc13
Additional Tests: GPU,HLT_P2_INTEGRATION,HLT_P2_TIMING,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49832/50817/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test RecoTrackerLSTCore-standalone-compilation had ERRORS

Failed RelVals-NVIDIA_L40S

- 34634.713: 34634.713_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnlyAlpakaValidationLST/step2_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnlyAlpakaValidationLST.log
- 34634.712: 34634.712_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnly/step2_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnly.log
- 34634.403: 34634.403_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation/step2_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation.log


Comparison Summary

Summary:

You potentially removed 3 lines from the logs
Reco comparison results: 4 differences found in the comparisons
DQMHistoTests: Total files compared: 73
DQMHistoTests: Total histograms compared: 4814076
DQMHistoTests: Total failures: 3
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4814053
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 72 files compared)
Checked 293 log files, 250 edm output root files, 73 DQM output files
TriggerResults: no differences found

VourMa · 2026-01-22T22:31:25Z

The failed RelVals are due to the recent, usual error:

----- Begin Fatal Exception 22-Jan-2026 19:34:42 CET-----------------------
An exception of category 'OutOfBound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'HLTriggerFinalPath'
   [2] Prefetching for module TriggerSummaryProducerAOD/'hltTriggerSummaryAOD'
   [3] Prefetching for module L1HPSPFTauProducer/'l1tHPSPFTauProducer'
   [4] Prefetching for module L1TPFCandMultiMerger/'l1tLayer1'
   [5] Prefetching for module L1TCorrelatorLayer1Producer/'l1tLayer1HGCal'
   [6] Calling method for module HGCalBackendLayer2Producer/'l1tHGCalBackEndLayer2Producer'
Exception Message:
TC X1 = 0.0683642 out of the seeding histogram bounds 0.076 - 0.58
----- End Fatal Exception -------------------------------------------------

while the failed unit test is unrelated and fixed in #49895.

cmsbuild added this to the CMSSW_16_1_X milestone Jan 14, 2026

cmsbuild added reconstruction-pending dqm-pending hlt-pending operations-pending pending-signatures tests-pending orp-pending pdmv-pending code-checks-pending tracking labels Jan 14, 2026

cmsbuild added code-checks-approved and removed code-checks-pending labels Jan 14, 2026

cmsbuild added the heterogeneous-pending label Jan 14, 2026

mmusich reviewed Jan 14, 2026


VourMa force-pushed the CMSSW_16_0_0_pre3_serialSync branch from 6fd4c49 to 2cf3f6b Compare January 14, 2026 21:48

cmsbuild added code-checks-pending and removed code-checks-approved labels Jan 14, 2026

cmsbuild added code-checks-approved and removed code-checks-pending labels Jan 14, 2026

cmsbuild added operations-pending tests-started and removed operations-approved tests-approved labels Jan 20, 2026

cmsbuild mentioned this pull request Jan 21, 2026

Add additional forwarding files for DataFormats #49889

Merged

cmsbuild added tests-rejected and removed tests-started labels Jan 21, 2026

VourMa mentioned this pull request Jan 22, 2026

Update GPU wfs for 1601 after renumbering in cmssw cms-sw/cms-bot#2663

Open

cmsbuild added requires-external tests-started and removed tests-rejected labels Jan 22, 2026

cmsbuild added tests-rejected and removed tests-started labels Jan 22, 2026

cmsbuild mentioned this pull request Jan 23, 2026

Modernization of TrackToTrackComparisonHists #49913

Open
