@fwyzard
Contributor

@fwyzard fwyzard commented May 12, 2025

Various updates to the ROCm package for CMSSW:

  • remove the debuginfo files, which are not used;
  • include additional libraries and tools in the ROCm package.

The extra libraries are needed to build UCX with ROCm support.

Enable unified memory on Instinct MI100, MI210/250, and MI300 GPUs:

  • compile the device code without an explicit xnack setting, which allows running with xnack either enabled or disabled;
  • attempt to enable xnack support by setting HSA_XNACK=1.
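
As an illustration of the two points above, here is a minimal Python sketch of how the compile flags and the runtime environment could be assembled. It does not reproduce the actual changes in this PR. The helper names (`GPU_TARGETS`, `offload_arch_flags`, `runtime_env`) are hypothetical, and the mapping from GPU model to `gfx` target is an assumption based on the usual LLVM target names rather than something stated in this conversation. The `:xnack+` / `:xnack-` suffixes follow the clang target ID syntax; omitting the suffix leaves the xnack setting unspecified.

```python
import os

# Assumed mapping from the GPU families named above to LLVM gfx targets
# (not taken from the PR itself).
GPU_TARGETS = {
    "MI100": "gfx908",
    "MI210/250": "gfx90a",
    "MI300": "gfx942",
}


def offload_arch_flags(targets, xnack=None):
    """Build --offload-arch flags.

    xnack=None leaves the xnack setting unspecified, so the device code
    is not tied to either mode; True/False pin it with :xnack+ / :xnack-.
    """
    suffix = {None: "", True: ":xnack+", False: ":xnack-"}[xnack]
    return [f"--offload-arch={target}{suffix}" for target in targets]


def runtime_env(base=None):
    """Return a copy of the environment that requests xnack at run time.

    setdefault() keeps any value the user already chose, which matches
    the "attempt to enable" wording: it is a request, not a guarantee.
    """
    env = dict(os.environ if base is None else base)
    env.setdefault("HSA_XNACK", "1")
    return env


flags = offload_arch_flags(GPU_TARGETS.values())
env = runtime_env({})
```

Whether xnack is actually active still depends on the GPU, driver and kernel configuration, so setting the variable only expresses a preference.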

@cmsbuild
Contributor

cmsbuild commented May 12, 2025

A new Pull Request was created by @fwyzard for branch IB/CMSSW_15_0_X/master.

@cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented May 12, 2025

cms-bot internal usage

@fwyzard fwyzard force-pushed the IB/CMSSW_15_0_X/master_ROCm_updates branch from 66ce743 to f0949d7 Compare May 12, 2025 16:08
@fwyzard
Contributor Author

fwyzard commented May 12, 2025

enable gpu

@fwyzard
Contributor Author

fwyzard commented May 12, 2025

please test

@cmsbuild
Contributor

Pull request #9853 was updated.

@cmsbuild
Contributor

Pull request #9853 was updated.

@fwyzard fwyzard force-pushed the IB/CMSSW_15_0_X/master_ROCm_updates branch from cfed3b8 to 186a23c Compare May 13, 2025 22:17
@cmsbuild
Contributor

Pull request #9853 was updated.

@fwyzard
Contributor Author

fwyzard commented May 13, 2025

please test

@cmsbuild
Contributor

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0b6a6c/46105/summary.html
COMMIT: 186a23c
CMSSW: CMSSW_15_0_X_2025-05-13-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9853/46105/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0b6a6c/46105/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0b6a6c/46105/git-merge-result

Build

I found compilation error when building:

    raise RuntimeError("failed to load library '"+library+"'")
RuntimeError: failed to load library 'lib/el8_amd64_gcc12//libAnalysisDataFormatsTrackInfo.so'
@@@@ ----> OK EDM Class Version CUDADataFormatsBeamSpot
>> Checking EDM Class Transients in CUDADataFormatsBeamSpot
Suggestion: You can run 'scram build updateclassversion' to generate src/AnalysisDataFormats/TrackInfo/src/classes_def.xml.generated with updated ClassVersion
gmake: *** [tmp/el8_amd64_gcc12/edm_checks/libAnalysisDataFormatsTrackInfo.so] Error 1
@@@@ ----> OK EDM Class Version CUDADataFormatsCommon
>> Checking EDM Class Transients in CUDADataFormatsCommon
@@@@ ----> OK EDM Class Version AnalysisDataFormatsTopObjects
>> Checking EDM Class Version for src/CUDADataFormats/Track/src/classes_def.xml in CUDADataFormatsTrack
>> Checking EDM Class Transients in AnalysisDataFormatsTopObjects


@fwyzard
Contributor Author

fwyzard commented May 14, 2025

The error

Error in <TUnixSystem::Load>: version mismatch, /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_0_X_2025-05-13-1100/lib/el8_amd64_gcc12/libAnalysisDataFormatsTrackInfo.so = 63211, ROOT = 63213
Traceback (most recent call last):
  File "/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_0_X_2025-05-13-1100/src/FWCore/Reflection/scripts/edmCheckClassVersion", line 159, in <module>
    sys.exit(main(args))
  File "/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_0_X_2025-05-13-1100/src/FWCore/Reflection/scripts/edmCheckClassVersion", line 124, in main
    ClassesDefUtils.initROOT(args.library)
  File "/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_0_X_2025-05-13-1100/src/FWCore/Reflection/python/ClassesDefXmlUtils.py", line 101, in initROOT
    raise RuntimeError("failed to load library '"+library+"'")
RuntimeError: failed to load library 'lib/el8_amd64_gcc12//libAnalysisDataFormatsTrackInfo.so'
@@@@ ----> OK EDM Class Version CUDADataFormatsBeamSpot
>> Checking EDM Class Transients in CUDADataFormatsBeamSpot
Suggestion: You can run 'scram build updateclassversion' to generate src/AnalysisDataFormats/TrackInfo/src/classes_def.xml.generated with updated ClassVersion

seems unrelated to this PR, and is possibly due to the other changes that were tested with it.
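
For readers unfamiliar with the numbers in the `version mismatch` message, the following Python sketch decodes them. It assumes the integers use ROOT's usual `10000*major + 100*minor + patch` encoding; the function names are hypothetical and this is not part of `edmCheckClassVersion`.

```python
import re


def decode_root_version(code):
    """Split an integer such as 63211 into (major, minor, patch)."""
    major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return (major, minor, patch)


# Matches the "version mismatch" line printed when loading the library.
MISMATCH = re.compile(
    r"version mismatch, (?P<lib>\S+) = (?P<built>\d+), ROOT = (?P<running>\d+)"
)


def explain_mismatch(line):
    """Return (library, built_with, running) or None if the line does not match."""
    match = MISMATCH.search(line)
    if match is None:
        return None
    built = decode_root_version(int(match["built"]))
    running = decode_root_version(int(match["running"]))
    return match["lib"], built, running
```

Under that assumption, the log line above says the library in the IB area was built against ROOT 6.32.11 while the test ran with ROOT 6.32.13, which is consistent with the mismatch coming from another change tested together with this PR.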

fwyzard added 3 commits May 14, 2025 20:59
Include more libraries and tools in the ROCm package:
  - ROCr headers
  - hipCUB
  - RCCL

Merge hipRAND/rocRAND back into the base ROCm tool.

Remove debuginfo files.
@fwyzard fwyzard force-pushed the IB/CMSSW_15_0_X/master_ROCm_updates branch from 186a23c to e8e0a7f Compare May 14, 2025 18:59
@fwyzard
Contributor Author

fwyzard commented May 14, 2025

please test

Let's try again without spurious PRs.

@cmsbuild
Contributor

Pull request #9853 was updated.

@cmsbuild
Contributor

-1

Failed Tests: rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0b6a6c/46139/summary.html
COMMIT: e8e0a7f
CMSSW: CMSSW_15_0_X_2025-05-14-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9853/46139/install.sh to create a dev area with all the needed externals and cmssw changes.

ROCm Unit Tests

I found 3 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestKernelROCmAsync had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS

Comparison Summary

Summary:

  • You potentially added 21 lines to the logs
  • Reco comparison results: 15 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4005094
  • DQMHistoTests: Total failures: 71
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4005003
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 218 log files, 189 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 877
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52194
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

ROCM Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 878
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52193
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
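
The three comparison summaries above can be cross-checked with a short script. The sketch below is a hypothetical helper, not part of cms-bot; it assumes that failures, nulls, successes, skipped and missing objects together account for every compared histogram, which holds for the numbers reported here.

```python
import re


def parse_dqm_summary(text):
    """Collect 'DQMHistoTests: Total <name>: <n>' lines into a dict."""
    pattern = re.compile(r"DQMHistoTests: Total (.+?): (\d+)")
    return {name: int(value) for name, value in pattern.findall(text)}


def is_consistent(counts):
    """Check that the per-category counts add up to the histogram total."""
    parts = ("failures", "nulls", "successes", "skipped", "Missing objects")
    return sum(counts[part] for part in parts) == counts["histograms compared"]


def failure_fraction(counts):
    """Fraction of compared histograms that were reported as failures."""
    return counts["failures"] / counts["histograms compared"]
```

Applied to the summaries above, the failure fraction is roughly 0.002% for the main comparison (71 of 4005094) and roughly 1.65% for both the CUDA and ROCm comparisons (877 and 878 of 53071).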

@fwyzard
Contributor Author

fwyzard commented May 15, 2025

ignore tests-rejected with ib-failure

@fwyzard
Contributor Author

fwyzard commented May 16, 2025

backport #9843

@smuzaffar
Contributor

+externals

@cmsbuild
Contributor

cmsbuild commented Jun 3, 2025

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_15_0_X/master IBs (test failures were overridden). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @sextonkennedy, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@smuzaffar
Contributor

@cms-sw/orp-l2, this is also ready to go into the next 15.0.X IB/release

@mandrenguyen

+1

@cmsbuild cmsbuild merged commit c72fdf1 into cms-sw:IB/CMSSW_15_0_X/master Jun 4, 2025
14 of 15 checks passed
@fwyzard fwyzard deleted the IB/CMSSW_15_0_X/master_ROCm_updates branch September 14, 2025 22:20