@fwyzard
Contributor

@fwyzard fwyzard commented May 4, 2025

ROCm

  • add ROCr headers (used by OFI), remove debuginfo files
  • make the list of ROCm GPU targets available to other externals

RDMA Core

  • update RDMA Core Userspace Libraries to version 57.0
  • make libibverbs plugins relocatable

OFI

  • add Libfabric OpenFabrics version 2.1.0

UCX

  • update UCX to version 1.18.1

Open MPI

  • update Open MPI to 4.1.8, plus the HEAD of the 4.1.x branch as of May 5, 2025

MPICH

  • add MPICH version 4.3.0.
  • MPICH is not selected by default, as it would conflict with Open MPI.

generic MPI tool

  • add a generic MPI tool that CMSSW packages can use instead of explicitly using Open MPI.
  • CMSSW packages should declare <use name="mpi"/> to use the generic tool; switching the tool from Open MPI to MPICH then automatically updates every package that uses it.
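
As a minimal sketch of what this looks like on the package side, assuming a standard SCRAM BuildFile.xml: the package is hypothetical, and the previous tool name "openmpi" is an assumption for illustration, not something taken from this PR.

```xml
<!-- BuildFile.xml of a hypothetical CMSSW package that needs MPI -->

<!-- before (assumed): depend on one specific MPI implementation -->
<!-- <use name="openmpi"/> -->

<!-- after: depend on the generic tool; the implementation behind it
     (Open MPI by default, possibly MPICH) is chosen in one place -->
<use name="mpi"/>
```

With this in place the package sources should only rely on the standard MPI API, so that the underlying implementation can be changed without touching the package.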

@fwyzard
Contributor Author

fwyzard commented May 4, 2025

backport #9871

Actually, backport of #9818, #9845, and #9871.

@cmsbuild
Contributor

cmsbuild commented May 4, 2025

A new Pull Request was created by @fwyzard for branch IB/CMSSW_15_0_X/master.

@cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented May 4, 2025

cms-bot internal usage

@fwyzard
Contributor Author

fwyzard commented May 4, 2025

please test

@cmsbuild
Contributor

cmsbuild commented May 5, 2025

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-279c8b/45831/summary.html
COMMIT: b65a4c0
CMSSW: CMSSW_15_0_X_2025-05-04-0000/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9824/45831/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test test_MilleZmm had ERRORS

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4012378
  • DQMHistoTests: Total failures: 65
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4012293
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 218 log files, 189 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@fwyzard fwyzard force-pushed the IB/CMSSW_15_0_X/master_openmpi_updates branch from b65a4c0 to d0573fa Compare May 5, 2025 16:47
@cmsbuild
Contributor

cmsbuild commented May 5, 2025

Pull request #9824 was updated.

@fwyzard
Contributor Author

fwyzard commented May 5, 2025

please test

@cmsbuild
Contributor

cmsbuild commented May 5, 2025

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-279c8b/45853/summary.html
COMMIT: d0573fa
CMSSW: CMSSW_15_0_X_2025-05-05-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9824/45853/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 25 lines to the logs
  • Reco comparison results: 1 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4012378
  • DQMHistoTests: Total failures: 48
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4012310
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 218 log files, 189 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@cmsbuild
Contributor

cmsbuild commented Jun 5, 2025

-1

Failed Tests: rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-279c8b/46544/summary.html
COMMIT: 6d8d6c5
CMSSW: CMSSW_15_0_X_2025-06-04-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9824/46544/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-279c8b/46544/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-279c8b/46544/git-merge-result

ROCm Unit Tests

I found 2 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS

Comparison Summary

Summary:

  • You potentially added 17 lines to the logs
  • Reco comparison results: 399 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 4003557
  • DQMHistoTests: Total failures: 4561
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3998976
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.961 KiB( 48 files compared)
  • DQMHistoSizes: changed ( 1000.0 ): 0.480 KiB AlCaReco/TkAlKShortTracks
  • DQMHistoSizes: changed ( 1000.0 ): 0.480 KiB AlCaReco/TkAlLambdaTracks
  • Checked 216 log files, 188 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 376
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52695
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

ROCM Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 882
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52189
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files

@fwyzard
Contributor Author

fwyzard commented Jun 5, 2025

ignore tests-rejected with ib-failure

@mandrenguyen

+1

@fwyzard
Contributor Author

fwyzard commented Jun 20, 2025

please test

@fwyzard
Contributor Author

fwyzard commented Jun 20, 2025

Let's refresh the tests, since they last ran two weeks ago.

@cmsbuild
Contributor

-1

Failed Tests: rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-279c8b/46845/summary.html
COMMIT: 6d8d6c5
CMSSW: CMSSW_15_0_X_2025-06-19-2300/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9824/46845/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-279c8b/46845/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-279c8b/46845/git-merge-result

ROCm Unit Tests

I found 3 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestRadixSortROCmAsync had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS

Comparison Summary

Summary:

  • You potentially added 19 lines to the logs
  • Reco comparison results: 359 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4007448
  • DQMHistoTests: Total failures: 5979
  • DQMHistoTests: Total nulls: 209
  • DQMHistoTests: Total successes: 4001240
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.08200000000000006 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 1000.0,... ): -0.002 KiB EcalEndcap/EETriggerTowerTask
  • Checked 218 log files, 189 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53086
  • DQMHistoTests: Total failures: 905
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52181
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.002 KiB( 6 files compared)
  • DQMHistoSizes: changed ( 12834.412 ): -0.002 KiB EcalEndcap/EETriggerTowerTask
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

ROCM Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53086
  • DQMHistoTests: Total failures: 33
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 53053
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.002 KiB( 6 files compared)
  • DQMHistoSizes: changed ( 12834.412 ): -0.002 KiB EcalEndcap/EETriggerTowerTask
  • Checked 24 log files, 30 edm output root files, 7 DQM output files

@fwyzard
Contributor Author

fwyzard commented Jun 21, 2025 via email

@mandrenguyen

+1

@smuzaffar
Contributor

+externals

@cmsbuild
Contributor

cmsbuild commented Jul 7, 2025

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_15_0_X/master IBs (test failures were overridden). This pull request will be automatically merged.

@cmsbuild cmsbuild merged commit f44446e into cms-sw:IB/CMSSW_15_0_X/master Jul 7, 2025
13 of 14 checks passed
@fwyzard fwyzard deleted the IB/CMSSW_15_0_X/master_openmpi_updates branch September 14, 2025 22:21