@VourMa VourMa commented Jan 14, 2026

The goal of this PR is to introduce two HLT workflows to monitor the agreement between LST on CPU and LST on GPU:

  1. Workflow 0.7541 monitors the LST output tracks when LST is used for track building (most direct comparison of LST), i.e. for alpakaValidationLST,singleIterPatatrack,trackingLST.
  2. Workflow 0.7573 monitors the built tracks in the upcoming new tracking baseline, where LST is used as an extended seeding algorithm (comparison of LST output in a "production" configuration), i.e. for singleIterPatatrack,phase2CAExtension,trackingLST,seedingLST,trackingMkFitCommon,hltTrackingMkFitInitialStep.

The additional CPU reconstruction (SerialSync) and the comparison plots are implemented with a new procModifier, alpakaValidationLST. This procModifier takes effect only in the procModifier combinations listed above; in any other combination it produces neither the additional products nor the comparison plots. It is also included in the alpakaValidation modifier chain.

The analyzer that produces the comparison plots gains a new parameter that allows skipping the luminosity and PU plots.

With the introduction of the alpakaValidationLST modifier, the offline workflow testing LST on CPU vs. LST on GPU can be made explicit. The code is changed so that the heterogeneous workflow 0.712 (previously 0.704) runs the offline reconstruction without any additional CPU reconstruction, while a new workflow, 0.713, runs the comparison. Workflow 0.703 has also been renamed to 0.711. The workflow numbering changes are made so that the offline LST workflows follow the numbering conventions for Alpaka workflows.
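The renumbering described above can be summarised as a small lookup. This is only an illustrative sketch, not code from the PR: the helper `renumber` and the table `RENAMED_SUFFIXES` are hypothetical names, workflow IDs are handled as strings to avoid floating-point surprises, and the `34634` prefix is used purely as an example.

```python
# Suffix renames taken from the description above (old -> new).
# Workflow 0.713 is new in this PR, so it has no old counterpart.
RENAMED_SUFFIXES = {"703": "711", "704": "712"}


def renumber(workflow_id: str) -> str:
    """Return the post-PR ID for a workflow ID given as a string.

    Hypothetical helper: IDs whose suffix is not in the rename
    table, or that have no suffix at all, are returned unchanged.
    """
    prefix, dot, suffix = workflow_id.partition(".")
    if not dot:
        return workflow_id
    return f"{prefix}.{RENAMED_SUFFIXES.get(suffix, suffix)}"


if __name__ == "__main__":
    for old in ("34634.703", "34634.704", "34634.7541"):
        print(old, "->", renumber(old))
```

Any list of workflow IDs kept outside this repository would need the same translation applied by hand.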

Some screenshots of the content of the DQM file:
[Three screenshots: "Screenshot from 2026-01-07 19-38-56" and two untitled images]

cmsbuild commented Jan 14, 2026

cms-bot internal usage

@cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49832/47487

@cmsbuild

A new Pull Request was created by @VourMa for master.

It involves the following packages:

  • Configuration/EventContent (operations)
  • Configuration/ProcessModifiers (operations)
  • Configuration/PyReleaseValidation (pdmv)
  • DQM/TrackingMonitorClient (dqm)
  • DQM/TrackingMonitorSource (dqm)
  • HLTrigger/Configuration (hlt)
  • RecoTracker/IterativeTracking (reconstruction)
  • Validation/RecoTrack (dqm)

@AdrianoDee, @DickyChant, @Martin-Grunewald, @Moanwar, @antoniovagnerini, @cmsbuild, @ctarricone, @davidlange6, @fabiocos, @ftenchini, @gabrielmscampos, @jfernan2, @mandrenguyen, @miquork, @mmusich, @nothingface0, @rseidita, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @SohamBhattacharya, @VinInn, @VourMa, @arossi83, @dgulhan, @elusian, @fabiocos, @felicepantaleo, @fioriNTU, @gpetruc, @idebruyn, @jandrea, @makortel, @missirol, @mmasciov, @mmusich, @mtosi, @richa2710, @rovere, @slomeo, @sroychow, @threus, @wmtford this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

mmusich commented Jan 14, 2026

assign heterogeneous

@cmsbuild

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

numWFIB.extend([prefixDet+34.7521])# HLTTiming75e33, ticl_v5, ticlv5TrackLinkingGNN
numWFIB.extend([prefixDet+34.753]) # HLTTiming75e33, alpaka,singleIterPatatrack
numWFIB.extend([prefixDet+34.754]) # HLTTiming75e33, alpaka,singleIterPatatrack,trackingLST
numWFIB.extend([prefixDet+34.7541]) # HLTTiming75e33, alpakaValidationLST,singleIterPatatrack,trackingLST
Contributor

Shouldn't this go rather in the gpu matrix? How do I test this from the bot with a GPU backend available?

Contributor Author

Oops, my bad. Should be fixed in the last push.

@cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49832/47496

mmusich commented Jan 15, 2026

enable gpu

mmusich commented Jan 15, 2026

test parameters:

  • enable = hlt_p2_integration, hlt_p2_timing
  • workflows = ph2_hlt
  • enable_tests = gpu
  • workflows_gpu = 34434.7041, 34434.7541, 34434.7573
  • relvals_opt = -w upgrade,standard
  • relvals_opt_gpu = -w upgrade,standard

mmusich commented Jan 15, 2026

@cmsbuild, please test

mmusich commented Jan 19, 2026

+1

The GPU matrix didn't run, despite #49832 (comment) + #49832 (comment). Not sure what the right way of configuring it is.

mmusich commented Jan 20, 2026

test parameters:

  • enable = gpu, hlt_p2_integration, hlt_p2_timing
  • workflows = ph2_hlt
  • workflows_gpu = 34434.712, 34434.713, 34434.7541, 34434.7573
  • relvals_opt = -w upgrade,standard
  • relvals_opt_gpu = -w upgrade,standard

mmusich commented Jan 20, 2026

@cmsbuild, please test

VourMa commented Jan 21, 2026

Are the tests stuck on this PR?

smuzaffar commented Jan 21, 2026

Yes @VourMa, the tests were stuck. I have just forced a rebuild of the pending tests.

VourMa commented Jan 21, 2026

> Yes @VourMa, the tests were stuck. I have just forced a rebuild of the pending tests.

Thanks a lot, @smuzaffar!

@cmsbuild

-1

Failed Tests: RelVals-NVIDIA_H100
Size: This PR adds an extra 44KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-371ebe/50753/summary.html
COMMIT: b807f98
CMSSW: CMSSW_16_1_X_2026-01-20-1100/el8_amd64_gcc13
Additional Tests: GPU,HLT_P2_INTEGRATION,HLT_P2_TIMING,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49832/50753/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Failed RelVals-NVIDIA_H100

ValueError: Undefined workflows: 34634.704

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 73
  • DQMHistoTests: Total histograms compared: 4814076
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4814056
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 72 files compared)
  • Checked 293 log files, 250 edm output root files, 73 DQM output files
  • TriggerResults: no differences found

VourMa commented Jan 21, 2026

I do not see where I might have missed a 0.704 workflow. If anyone has any suggestions, please let me know...

mmusich commented Jan 21, 2026

> I do not see where I might have missed a 0.704 workflow. If anyone has any suggestions, please let me know...

I think here

VourMa commented Jan 21, 2026

> > I do not see where I might have missed a 0.704 workflow. If anyone has any suggestions, please let me know...
>
> I think here

Oh, OK, thanks!
For my understanding, is this supposed to be hard-coded and not controlled by some subset of workflows from this repository?
In any case, I can make a PR to the bot repo as well, if that's the recommended way.
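One way to catch a stale entry such as 0.704 in a hard-coded list is to cross-check that list against the workflows that are actually defined. The sketch below is a generic illustration, not code from cms-bot or runTheMatrix: the function `undefined_workflows` and both example lists are hypothetical.

```python
def undefined_workflows(requested, defined):
    """Return the requested workflow IDs missing from `defined`.

    Hypothetical helper: IDs are compared as strings and the order
    of `requested` is preserved in the result.
    """
    known = set(defined)
    return [wf for wf in requested if wf not in known]


if __name__ == "__main__":
    # Example inputs only; the real lists live in the respective repositories.
    defined = ["34634.711", "34634.712", "34634.713"]
    hard_coded = ["34634.704", "34634.712"]
    stale = undefined_workflows(hard_coded, defined)
    if stale:
        print("Undefined workflows:", ", ".join(stale))
```

Running such a check whenever workflows are renumbered would flag a leftover ID before the tests are launched.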

mmusich commented Jan 21, 2026

> For my understanding, is this supposed to be hard-coded and not controlled by some subset of workflows from this repository?

🤷‍♂️

VourMa commented Jan 22, 2026

> In any case, I can make a PR to the bot repo as well, if that's the recommended way.

The relevant PR has been made: cms-sw/cms-bot#2663

mmusich commented Jan 22, 2026

test parameters:

mmusich commented Jan 22, 2026

@cmsbuild, please test

@cmsbuild

-1

Failed Tests: UnitTests RelVals-NVIDIA_L40S
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-371ebe/50817/summary.html
COMMIT: b807f98
CMSSW: CMSSW_16_1_X_2026-01-22-1100/el8_amd64_gcc13
Additional Tests: GPU,HLT_P2_INTEGRATION,HLT_P2_TIMING,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49832/50817/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test RecoTrackerLSTCore-standalone-compilation had ERRORS

Failed RelVals-NVIDIA_L40S

  • 34634.713: 34634.713_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnlyAlpakaValidationLST/step2_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnlyAlpakaValidationLST.log
  • 34634.712: 34634.712_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnly/step2_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnly.log
  • 34634.403: 34634.403_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation/step2_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation.log
Expand to see more relval errors ...

Comparison Summary

Summary:

  • You potentially removed 3 lines from the logs
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 73
  • DQMHistoTests: Total histograms compared: 4814076
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4814053
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 72 files compared)
  • Checked 293 log files, 250 edm output root files, 73 DQM output files
  • TriggerResults: no differences found

VourMa commented Jan 22, 2026

The failed RelVals are due to the recent, usual error:

----- Begin Fatal Exception 22-Jan-2026 19:34:42 CET-----------------------
An exception of category 'OutOfBound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'HLTriggerFinalPath'
   [2] Prefetching for module TriggerSummaryProducerAOD/'hltTriggerSummaryAOD'
   [3] Prefetching for module L1HPSPFTauProducer/'l1tHPSPFTauProducer'
   [4] Prefetching for module L1TPFCandMultiMerger/'l1tLayer1'
   [5] Prefetching for module L1TCorrelatorLayer1Producer/'l1tLayer1HGCal'
   [6] Calling method for module HGCalBackendLayer2Producer/'l1tHGCalBackEndLayer2Producer'
Exception Message:
TC X1 = 0.0683642 out of the seeding histogram bounds 0.076 - 0.58
----- End Fatal Exception -------------------------------------------------

while the failed unit test is unrelated and fixed in #49895.
