@makortel
Contributor

@makortel makortel commented Dec 22, 2025

PR description:

This PR removes the CUDA-dependent modules from DQM/SiPixelHeterogeneous. Their inclusion in runTheMatrix workflow configurations surfaced as failures after the CUDADataFormats dictionary removal, see #49656 (comment). Since all direct CUDA components are slated for removal (#45844), this PR proposes removing them. These components seem to have been superseded by more generic ones in #45206.

Resolves cms-sw/framework-team#1742

PR validation:

Workflows 11634.5 and 34434.5 succeeded.

These modules seem to have been superseded by more generic ones:
- SiPixelCompareVertexSoA -> SiPixelCompareVertices
- SiPixel*CompareRecHitsSoA -> SiPixel*CompareRecHits
- SiPixel*CompareTrackSoA -> SiPixel*CompareTracks
- SiPixel*MonitorRecHitsSoA -> SiPixel*MonitorRecHitsSoAAlpaka
- SiPixel*MonitorTrackSoA -> SiPixel*MonitorTrackSoAAlpaka
- SiPixelMonitorVertexSoA -> SiPixelMonitorVertexSoAAlpaka
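
For anyone updating a configuration that still names one of the removed module types, the mapping above can be applied mechanically. The following is only a minimal sketch: `suggest_replacement` is a hypothetical helper, not part of CMSSW, and it assumes that the `*` in the list stands for the varying middle part of the name (for example `Phase1`).

```python
import re

# Old -> new module type patterns, copied from the list above.
# "*" stands for the part of the name that varies (assumed, e.g. "Phase1").
REPLACEMENTS = [
    ("SiPixelCompareVertexSoA", "SiPixelCompareVertices"),
    ("SiPixel*CompareRecHitsSoA", "SiPixel*CompareRecHits"),
    ("SiPixel*CompareTrackSoA", "SiPixel*CompareTracks"),
    ("SiPixel*MonitorRecHitsSoA", "SiPixel*MonitorRecHitsSoAAlpaka"),
    ("SiPixel*MonitorTrackSoA", "SiPixel*MonitorTrackSoAAlpaka"),
    ("SiPixelMonitorVertexSoA", "SiPixelMonitorVertexSoAAlpaka"),
]


def suggest_replacement(old_name):
    """Hypothetical helper: return the new type name, or None if not listed."""
    for old_pattern, new_pattern in REPLACEMENTS:
        # Escape the pattern, then turn the escaped "*" into a capture group.
        regex = "^" + re.escape(old_pattern).replace(r"\*", "(.*)") + "$"
        match = re.match(regex, old_name)
        if match:
            # Patterns without "*" have no group; use an empty wildcard then.
            wildcard = match.group(1) if match.groups() else ""
            return new_pattern.replace("*", wildcard)
    return None


print(suggest_replacement("SiPixelPhase1CompareRecHitsSoA"))
print(suggest_replacement("SiPixelMonitorVertexSoA"))
```

Names that are already in the new form, or that are not in the list, yield `None`, so the helper can be run over a whole configuration without rewriting modules that were never affected.
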
@cmsbuild
Contributor

cmsbuild commented Dec 22, 2025

cms-bot internal usage

@makortel
Contributor Author

FYI @cms-sw/heterogeneous-l2

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49697/47255

@cmsbuild
Contributor

A new Pull Request was created by @makortel for master.

It involves the following packages:

  • DQM/SiPixelHeterogeneous (dqm)

@cmsbuild, @ctarricone, @gabrielmscampos, @nothingface0, @rseidita can you please review it and eventually sign? Thanks.
@fioriNTU, @idebruyn, @jandrea, @mmusich, @threus this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor Author

test parameters:

  • workflows = 136.8855,136.8885,11634.5,34434.5
  • enable = gpu

@makortel
Contributor Author

@cmsbuild, please test


  # Run-3 sequence
- monitorpixelSoASource = cms.Sequence(siPixelPhase1MonitorRecHitsSoA * siPixelPhase1MonitorTrackSoA * siPixelMonitorVertexSoA)
+ monitorpixelSoASource = cms.Sequence()
Contributor Author

It is not clear to me if these empty sequences serve any purpose anymore, other than being placeholders whose contents are replaced below via toReplaceWith() by various modifiers.
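
To make the placeholder pattern concrete, here is a toy model of it in plain Python. This is a sketch under stated assumptions, not the FWCore implementation: the `Sequence` and `Modifier` classes below are simplified stand-ins written for this example, and `exampleModifier`, `exampleMonitorA` and `exampleMonitorB` are hypothetical names. The only behavior taken from the discussion is that a sequence can be declared empty and have its contents swapped in later by a modifier's `toReplaceWith()`.

```python
class Sequence:
    """Toy stand-in for cms.Sequence: an ordered list of module labels."""

    def __init__(self, *modules):
        self.modules = list(modules)


class Modifier:
    """Toy stand-in for a configuration modifier that may or may not be active."""

    def __init__(self, chosen=False):
        self.chosen = chosen

    def toReplaceWith(self, to_obj, from_obj):
        # Replace the contents in place, but only if this modifier is active.
        if self.chosen:
            to_obj.modules = list(from_obj.modules)


# Empty placeholder, as in the changed line of the diff.
monitorpixelSoASource = Sequence()

# Hypothetical modifier and module labels, for illustration only.
exampleModifier = Modifier(chosen=True)
exampleModifier.toReplaceWith(
    monitorpixelSoASource,
    Sequence("exampleMonitorA", "exampleMonitorB"),
)

print(monitorpixelSoASource.modules)
```

In this model the empty sequence matters only because other code holds a reference to the same object. If nothing refers to the placeholder when no modifier is active, it contributes nothing to the job, which is the case the comment above is asking about.
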

Comment on lines 128 to 130
monitorpixelSoACompareSource = cms.Sequence(siPixelPhase1MonitorRawDataACPU *
siPixelPhase1MonitorRawDataAGPU *
siPixelPhase1MonitorRecHitsSoACPU *
siPixelPhase1MonitorRecHitsSoAGPU *
siPixelPhase1CompareRecHitsSoA *
siPixelPhase1MonitorTrackSoAGPU *
siPixelPhase1MonitorTrackSoACPU *
siPixelPhase1CompareTrackSoA *
siPixelMonitorVertexSoACPU *
siPixelMonitorVertexSoAGPU *
siPixelCompareVertexSoA *
siPixelPhase1RawDataErrorComparator)
Contributor Author

It is not clear to me if the remaining 3 modules in this Sequence would be useful, or if it would be better to remove them as well.

@cmsbuild
Contributor

-1

Failed Tests: RelVals-NVIDIA_T4
Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-046ac5/50392/summary.html
COMMIT: d99a3f5
CMSSW: CMSSW_16_1_X_2025-12-22-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49697/50392/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed RelVals-NVIDIA_T4

ValueError: Undefined workflows: 29834.751, 29834.404, 29834.402, 29834.704, 29834.403

Comparison Summary

There are some workflows for which there are errors in the baseline:
11634.5 step 3
136.8855 step 3
136.8885 step 3
34434.5 step 3
The results for the comparisons for these workflows could be incomplete
This most likely means that the IB has errors in the relvals. The error does NOT come from this pull request.

Summary:

  • You potentially added 99 lines to the logs
  • Reco comparison results: 12 differences found in the comparisons
  • Reco comparison had 4 failed jobs
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4280553
  • DQMHistoTests: Total failures: 15
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4280518
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 239 log files, 204 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor Author

ValueError: Undefined workflows: 29834.751, 29834.404, 29834.402, 29834.704, 29834.403

I wonder what this means in practice

@makortel
Contributor Author

ValueError: Undefined workflows: 29834.751, 29834.404, 29834.402, 29834.704, 29834.403

I wonder what this means in practice

I opened an issue #49700

@makortel
Contributor Author

@cmsbuild, please test

cms-sw/cms-bot#2636 was merged

@makortel
Contributor Author

CPU comparison differences are related to #47071

-1

In the RelVal-INPUT tests, the input file for workflow 2025.0020001_RunEGamma02025D_10k is unreachable (this already happens in the IBs).

On NVIDIA H100 the runTheMatrix tests fail with

----- Begin Fatal Exception 29-Dec-2025 17:06:58 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'MC_Ele5_Open_Unseeded'
   [2] Calling method for module HGCalSoARecHitsLayerClustersProducer@alpaka/'hltHgcalSoARecHitsLayerClustersProducer'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/alpaka/2.0.0-8493f1d11d0378dc14d6ea6ecfc69ac5/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(283) 'TApi::mallocAsync(&memPtr, static_cast<std::size_t>(width) * sizeof(TElem), queue.getNativeHandle())' returned error  : 'cudaErrorNotSupported': 'operation not supported'!
----- End Fatal Exception -------------------------------------------------

that could be an infrastructure problem.

@makortel
Contributor Author

ignore tests-rejected with external-failure

@smuzaffar
Contributor

please test

@cmsbuild
Contributor

cmsbuild commented Jan 6, 2026

-1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-046ac5/50433/summary.html
COMMIT: d99a3f5
CMSSW: CMSSW_16_1_X_2026-01-05-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49697/50433/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
11634.5 step 3
136.8855 step 3
136.8885 step 3
34434.5 step 3
The results for the comparisons for these workflows could be incomplete
This most likely means that the IB has errors in the relvals. The error does NOT come from this pull request.

Summary:

  • You potentially added 100 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 4 differences found in the comparisons
  • Reco comparison had 4 failed jobs
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4280553
  • DQMHistoTests: Total failures: 72
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4280461
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 239 log files, 204 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

  • You potentially removed 2 lines from the logs
  • Reco comparison results: 247 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 27569
  • DQMHistoTests: Total nulls: 8
  • DQMHistoTests: Total successes: 121794
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 236 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 36928
  • DQMHistoTests: Total nulls: 9
  • DQMHistoTests: Total successes: 112434
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

Summary:

  • You potentially removed 10 lines from the logs
  • Reco comparison results: 222 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 35529
  • DQMHistoTests: Total nulls: 8
  • DQMHistoTests: Total successes: 113834
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found
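
The DQMHistoTests counts in the four summaries above are easier to compare as fractions. The snippet below is a small sketch that copies the counts from this comment and checks them for internal consistency. It assumes that failures, nulls, successes and skipped add up to the total number of histograms compared, which holds for the numbers reported here. The key `default` is a label chosen for the first, untitled summary.

```python
# DQMHistoTests counts copied from the comparison summaries above.
# "default" is a label chosen here for the first, untitled summary.
SUMMARIES = {
    "default": {"total": 4280553, "failures": 72, "nulls": 0,
                "successes": 4280461, "skipped": 20},
    "AMD_W7900": {"total": 149371, "failures": 27569, "nulls": 8,
                  "successes": 121794, "skipped": 0},
    "NVIDIA_L40S": {"total": 149371, "failures": 36928, "nulls": 9,
                    "successes": 112434, "skipped": 0},
    "NVIDIA_T4": {"total": 149371, "failures": 35529, "nulls": 8,
                  "successes": 113834, "skipped": 0},
}


def is_consistent(counts):
    """Check the assumed breakdown: the four categories sum to the total."""
    parts = ("failures", "nulls", "successes", "skipped")
    return sum(counts[p] for p in parts) == counts["total"]


def failure_fraction(counts):
    """Fraction of compared histograms that were reported as failures."""
    return counts["failures"] / counts["total"]


for name, counts in SUMMARIES.items():
    print(f"{name}: consistent={is_consistent(counts)}, "
          f"failures={failure_fraction(counts):.2%}")
```

The fractions only restate what the bot reported. They do not by themselves say whether the differences originate from this PR or from the baseline.
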

@makortel
Contributor Author

makortel commented Jan 6, 2026

-1

Several of the 34634.X workflows continue to fail on NVIDIA H100 as in #49697 (comment). Maybe it's not an infrastructure problem (since other workflows succeed), but something else that would be better investigated separately from this PR.

@makortel
Contributor Author

makortel commented Jan 6, 2026

-1

Several of the 34634.X workflows continue to fail on NVIDIA H100 as in #49697 (comment). Maybe it's not an infrastructure problem (since other workflows succeed), but something else that would be better investigated separately from this PR.

Those workflows succeed in IBs though

@makortel
Contributor Author

makortel commented Jan 6, 2026

@cms-sw/dqm-l2 Could you please review and sign? This PR resolves the IB failures in the 4 workflows listed in #49697 (comment).

@gabrielmscampos
Member

+dqm

  • Regarding your comment on the monitorpixelSoACompareSource modules, I'll find a SiPixel contact to look into it. Since I don't think it is urgent and this PR fixes the IB failures listed, I'm signing it right away.

@cmsbuild
Contributor

cmsbuild commented Jan 6, 2026

This pull request is fully signed and it will be integrated in one of the next master IBs (test failures were overridden). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @mandrenguyen, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard
Contributor

fwyzard commented Jan 6, 2026

@cms-sw/tracking-pog-l2 can you review the DQM sequences highlighted in @makortel's comments?

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit d82850b into cms-sw:master Jan 7, 2026
24 of 25 checks passed
@makortel makortel deleted the cudaDQMRemove branch January 7, 2026 21:36
Development

Successfully merging this pull request may close these issues.

Remove obsolete CUDA-using modules from DQM/SiPixelHeterogeneous

6 participants