@makortel

PR description:

Backport of #48824. Original description

While testing the ROOT PR that adds an option to disable header parsing during the TClass::GetClass() call (root-project/root#18402), it was noticed that the type alias gets listed in the rootmap file only if the alias is requested before the real type (see root-project/root#19705 for more details).

In the meantime, having the type alias in the rootmap file is necessary to avoid header parsing when executing the read rules that use the type alias names, e.g.

rule.fSource = type + "::Layout layout_;";

and therefore it seemed worthwhile to change the dictionaries with PortableHost{Collection,Object}.
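As a rough illustration of the read-rule line quoted above, the sketch below builds the same kind of source string for an alias name and for an aliased-to type. `ReadRuleSketch`, `makeLayoutRule`, and the two type names in the usage note are hypothetical stand-ins, not ROOT or CMSSW API. The only point is that the rule text embeds whatever type name is passed in, so that name has to be resolvable when the rule is executed.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a read-rule description; only the fSource
// field from the quoted snippet is modelled here.
struct ReadRuleSketch {
  std::string fSource;
};

// Builds the source declaration the same way as the quoted line: the
// (possibly aliased) type name is embedded verbatim in the rule text.
ReadRuleSketch makeLayoutRule(const std::string& type) {
  ReadRuleSketch rule;
  rule.fSource = type + "::Layout layout_;";
  return rule;
}
```

Calling `makeLayoutRule("ExampleHostCollection")` (an assumed alias name) and `makeLayoutRule("PortableHostCollection<ExampleLayout>")` (an assumed aliased-to type) yields two different rule strings for what would be the same class, which is why it matters that the name actually used is the one listed in the rootmap file.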

In the present state this PR demonstrates what the impact would be for the PortableTestObjects. If deemed viable, the next steps would be to update DataFormats/Portable README and scripts, and then update all the other classes_def.xml files that declare these portable data products.

PR validation:

None beyond #48824

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Backport of #48824 (using the same branch).

…heir aliased types in classes_def.xml

In ROOT presently the type alias gets listed in the rootmap file only
if the alias is requested before the real type. Having the type alias
in the rootmap file is necessary to avoid header parsing for the
execution of the read rules that use the type alias names.
…aliased-to types

This should avoid ROOT header parsing
They are not in DataFormats, so they are allowed to be transient only.
@cmsbuild

A new Pull Request was created by @makortel for CMSSW_16_0_X.

It involves the following packages:

  • DataFormats/BeamSpot (reconstruction)
  • DataFormats/EcalRecHit (reconstruction)
  • DataFormats/HGCalReco (reconstruction)
  • DataFormats/HcalDigi (simulation)
  • DataFormats/HcalRecHit (reconstruction)
  • DataFormats/ParticleFlowReco (reconstruction)
  • DataFormats/Portable (heterogeneous)
  • DataFormats/PortableTestObjects (heterogeneous)
  • DataFormats/SiPixelClusterSoA (heterogeneous, reconstruction)
  • DataFormats/SiPixelDigiSoA (heterogeneous, reconstruction)
  • RecoTracker/LSTCore (reconstruction)

@Moanwar, @civanch, @cmsbuild, @fwyzard, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @ReyerBand, @VinInn, @VourMa, @abdoulline, @argiro, @bsunanda, @dgulhan, @dkotlins, @elusian, @felicepantaleo, @ferencek, @gpetruc, @hatakeyamak, @lgray, @mariadalfonso, @missirol, @mmasciov, @mmusich, @mroguljic, @mtosi, @rchatter, @rovere, @thomreis, @tsusa, @wang0jin this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

@cmsbuild

cmsbuild commented Jan 12, 2026

cms-bot internal usage

@makortel

enable gpu

@makortel

@cmsbuild, please test

@cmsbuild

-1

Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1bc021/50529/summary.html
COMMIT: d001897
CMSSW: CMSSW_16_0_X_2026-01-12-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49774/50529/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 1 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 4 differences found in the comparisons
  • Reco comparison had 4 failed jobs
  • DQMHistoTests: Total files compared: 55
  • DQMHistoTests: Total histograms compared: 4513958
  • DQMHistoTests: Total failures: 67
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4513871
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 54 files compared)
  • Checked 235 log files, 208 edm output root files, 55 DQM output files
  • TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

  • You potentially added 3 lines to the logs
  • Reco comparison results: 234 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 31134
  • DQMHistoTests: Total nulls: 9
  • DQMHistoTests: Total successes: 118228
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

  • You potentially added 9 lines to the logs
  • Reco comparison results: 241 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 32407
  • DQMHistoTests: Total nulls: 12
  • DQMHistoTests: Total successes: 116952
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

  • You potentially removed 7 lines from the logs
  • Reco comparison results: 239 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 27798
  • DQMHistoTests: Total nulls: 11
  • DQMHistoTests: Total successes: 121562
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

Summary:

  • You potentially removed 14 lines from the logs
  • Reco comparison results: 249 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 30097
  • DQMHistoTests: Total nulls: 13
  • DQMHistoTests: Total successes: 119261
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found
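The DQMHistoTests counters in each summary above appear to satisfy total = failures + nulls + successes + skipped. The following is a minimal sketch of that bookkeeping check plus the failure fraction, using numbers copied from the default and AMD_MI300X summaries. The struct, function, and constant names are hypothetical and this is not cms-bot code.

```cpp
#include <cassert>
#include <cstdint>

// Counters as reported in one "DQMHistoTests" summary block.
struct DqmTally {
  std::int64_t total;
  std::int64_t failures;
  std::int64_t nulls;
  std::int64_t successes;
  std::int64_t skipped;
};

// True if the individual counters add up to the reported total.
bool isConsistent(const DqmTally& t) {
  return t.failures + t.nulls + t.successes + t.skipped == t.total;
}

// Fraction of compared histograms reported as failures.
double failureFraction(const DqmTally& t) {
  return static_cast<double>(t.failures) / static_cast<double>(t.total);
}

// Values copied from the summaries above (default and AMD_MI300X).
constexpr DqmTally kDefault{4513958, 67, 0, 4513871, 20};
constexpr DqmTally kMi300x{149371, 31134, 9, 118228, 0};
```

With these inputs the failure fraction is tiny for the default comparison and roughly a fifth for the AMD_MI300X comparison. The same check can be repeated for the other GPU summaries by filling in their counters.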

@makortel

On NVIDIA H100 some of the workflows failed with

----- Begin Fatal Exception 12-Jan-2026 19:18:50 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'MC_Ele5_Open_Unseeded'
   [2] Calling method for module HGCalSoARecHitsLayerClustersProducer@alpaka/'hltHgcalSoARecHitsLayerClustersProducer'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/alpaka/2.0.0-8493f1d11d0378dc14d6ea6ecfc69ac5/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(283) 'TApi::mallocAsync(&memPtr, static_cast<std::size_t>(width) * sizeof(TElem), queue.getNativeHandle())' returned error  : 'cudaErrorNotSupported': 'operation not supported'!
----- End Fatal Exception -------------------------------------------------

The symptoms look like #47270 with

AlpakaServiceCudaAsync succesfully initialised.
Found 1 device:
  - NVIDIA H100L-2-24C MIG 2g.24gb

@fwyzard

fwyzard commented Jan 13, 2026

+heterogeneous

@fwyzard

fwyzard commented Jan 13, 2026

The failing tests were running on grid1165624 with the NVIDIA driver 580.x:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+

while the passing ones ran on ngt-nvidia-h100-01 with the NVIDIA driver 570.x:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+

@smuzaffar do we have different runners for the H100 tests ?
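To compare runners by driver, the major driver version can be pulled out of the `nvidia-smi` header lines shown above. This is a small sketch assuming the header keeps the `Driver Version: X.Y.Z` layout; `driverMajorVersion` is a hypothetical helper, and the comparison only records the observed difference between runners, not a confirmed cause of the failures.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the major driver version from an nvidia-smi header line such as
// "| NVIDIA-SMI 580.95.05   Driver Version: 580.95.05   CUDA Version: 13.0 |",
// or -1 if the line does not contain a "Driver Version:" field.
int driverMajorVersion(const std::string& line) {
  static const std::regex pattern(R"(Driver Version:\s*(\d+)\.)");
  std::smatch match;
  if (std::regex_search(line, match, pattern)) {
    return std::stoi(match[1].str());
  }
  return -1;
}
```

Applied to the two headers above, this separates the 580.x runner from the 570.x one.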

@fwyzard

fwyzard commented Jan 13, 2026

ignore tests-rejected with external-failure

@jfernan2

+1

@smuzaffar

> @smuzaffar do we have different runners for the H100 tests ?

Yes @fwyzard, we also get/use HTCondor GPU resources (gridXXX-type hosts). Although HTCondor has the following GPUs, we mostly get the H100 ones. If this is an issue, I can stop Jenkins from using these resources.

[a]

     28 GPUs_DeviceName = "NVIDIA A100-PCIE-40GB"
      7 GPUs_DeviceName = "NVIDIA H100L-1-12C MIG 1g.12gb"
     18 GPUs_DeviceName = "NVIDIA H100L-2-24C MIG 2g.24gb"
     27 GPUs_DeviceName = "NVIDIA H100 NVL"
     13 GPUs_DeviceName = "NVIDIA H200"
      1 GPUs_DeviceName = "Tesla P100-SXM2-16GB"
      5 GPUs_DeviceName = "Tesla T4"
      4 GPUs_DeviceName = "Tesla V100-PCIE-32GB"
      3 GPUs_DeviceName = "Tesla V100S-PCIE-32GB"

@civanch

civanch commented Jan 13, 2026

+1

@cmsbuild

This pull request is fully signed and it will be integrated in one of the next CMSSW_16_0_X IBs (test failures were overridden) and once validation in the development release cycle CMSSW_16_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @ftenchini, @sextonkennedy, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen

+1

@cmsbuild cmsbuild merged commit e1cbacc into cms-sw:CMSSW_16_0_X Jan 14, 2026
45 of 46 checks passed
@makortel makortel deleted the portableCollectionDictionary branch January 14, 2026 15:08