fwyzard commented May 9, 2025

Include additional libraries and tools in the ROCm package, needed to build UCX and MPI with ROCm support.

Enable unified memory on Instinct MI100, MI210/250, and MI300 GPUs:

  • compile the device code without an explicit xnack setting, which supports running with xnack enabled or disabled;
  • attempt to enable xnack support by setting HSA_XNACK=1.
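
As a rough illustration of the two settings above (not the actual change made in this PR), the sketch below builds hipcc-style `--offload-arch` flags with or without an explicit xnack suffix, and prepares a runtime environment with HSA_XNACK=1. The helper names are hypothetical, and the model-to-architecture table (gfx908, gfx90a, gfx942) is an assumption added for illustration.

```python
import os

# Assumed mapping from the Instinct models listed above to their
# GPU architecture names; it is not taken from the PR itself.
INSTINCT_ARCHS = {
    "MI100": "gfx908",
    "MI210/250": "gfx90a",
    "MI300": "gfx942",
}


def offload_arch_flags(archs, xnack=None):
    """Build one --offload-arch flag per architecture (hypothetical helper).

    xnack=None  -> no suffix: the code object can run with xnack on or off
    xnack=True  -> ':xnack+': requires xnack to be enabled at run time
    xnack=False -> ':xnack-': requires xnack to be disabled at run time
    """
    suffix = {None: "", True: ":xnack+", False: ":xnack-"}[xnack]
    return [f"--offload-arch={arch}{suffix}" for arch in archs]


def runtime_env(base=None):
    """Return a copy of the environment that requests xnack support."""
    env = dict(os.environ if base is None else base)
    env["HSA_XNACK"] = "1"
    return env


if __name__ == "__main__":
    # No explicit xnack setting, as described in the first bullet.
    print(" ".join(offload_arch_flags(INSTINCT_ARCHS.values())))
    # Environment for launching a job, as described in the second bullet.
    print("HSA_XNACK =", runtime_env()["HSA_XNACK"])
```

Setting HSA_XNACK=1 is only a request: whether xnack is actually enabled still depends on the GPU and driver configuration, which matches the "attempt" wording above.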

fwyzard commented May 9, 2025

enable gpu

fwyzard commented May 9, 2025

please test

cmsbuild commented May 9, 2025

A new Pull Request was created by @fwyzard for branch IB/CMSSW_15_1_X/master.

@iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

cmsbuild commented May 9, 2025

cms-bot internal usage

@cmsbuild

-1

Failed Tests: RelVals-ROCM rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-874667/46009/summary.html
COMMIT: 38abd56
CMSSW: CMSSW_15_1_X_2025-05-09-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9843/46009/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-ROCM

  • 12834.403: 12834.403_TTbar_14TeV+2024_Patatrack_PixelOnlyAlpaka_Validation/step3_TTbar_14TeV+2024_Patatrack_PixelOnlyAlpaka_Validation.log

ROCm Unit Tests

I found 2 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS

Comparison Summary

There are some workflows for which there are errors in the baseline:
1000.0 step 2
1001.0 step 2
101.0 step 1
10224.0 step 3
11634.0 step 3
12434.0 step 3
12834.0 step 3
12846.0 step 3
13034.0 step 3
1306.0 step 3
13234.0 step 2
1330.0 step 3
135.4 step 1
136.731 step 3
136.793 step 3
136.874 step 3
139.001 step 3
140.045 step 3
140.56 step 2
14034.0 step 2
141.042 step 3
14234.0 step 2
145.014 step 3
145.104 step 3
145.202 step 3
145.301 step 3
145.408 step 3
145.5 step 3
145.604 step 3
145.713 step 3
16834.0 step 3
17034.0 step 3
24834.911 step 3
25.0 step 3
2500.201 step 2
250202.181 step 4
25202.0 step 3
29634.0 step 3
29634.75 step 2
29634.911 step 3
29696.0 step 3
29700.0 step 3
29834.999 step 4
312.0 step 3
4.22 step 3
4.53 step 3
5.1 step 1
8.0 step 4
9.0 step 3
The comparison results for these workflows could be incomplete.
This most likely means that the IB has errors in the RelVals. The error does NOT come from this pull request.

Summary:

  • You potentially added 17593 lines to the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 138 log files, 69 edm output root files, 1 DQM output files

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

fwyzard commented May 10, 2025

ignore tests-rejected with ib-failure

fwyzard commented May 10, 2025

The ROCm failures are a known issue.

@cmsbuild

Pull request #9843 was updated.

fwyzard commented May 10, 2025

Rebased after merging #9818.

fwyzard commented May 10, 2025

please test

fwyzard commented May 10, 2025

hold

We may need to change the xnack settings.

@cmsbuild

Pull request has been put on hold by @fwyzard
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@cmsbuild cmsbuild added the hold label May 10, 2025
@cmsbuild

-1

Failed Tests: RelVals-ROCM rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-874667/46018/summary.html
COMMIT: 384ba7f
CMSSW: CMSSW_15_1_X_2025-05-09-2300/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9843/46018/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-874667/46018/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-874667/46018/git-merge-result

RelVals-ROCM

  • 12834.403: 12834.403_TTbar_14TeV+2024_Patatrack_PixelOnlyAlpaka_Validation/step3_TTbar_14TeV+2024_Patatrack_PixelOnlyAlpaka_Validation.log

ROCm Unit Tests

I found 3 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS
---> test alpakaTestRadixSortROCmAsync had ERRORS

Comparison Summary

Summary:

  • You potentially added 13 lines to the logs
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4038131
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4038108
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

@fwyzard fwyzard force-pushed the IB/CMSSW_15_1_X/master_ROCm_updates branch from 8587e4b to 215e928 Compare May 13, 2025 22:34
@cmsbuild

Pull request #9843 was updated.

fwyzard commented May 13, 2025

OK, for the time being I've added here only the minimal set of libraries needed for UCX and MPI.
I'll push the larger set of libraries used by PyTorch to a separate PR, and we can discuss whether we want them and how to handle their large size.

fwyzard commented May 13, 2025

please test

@cmsbuild

-1

Failed Tests: rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-874667/46107/summary.html
COMMIT: 215e928
CMSSW: CMSSW_15_1_X_2025-05-13-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9843/46107/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-874667/46107/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-874667/46107/git-merge-result

ROCm Unit Tests

I found 3 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestKernelROCmAsync had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS

Comparison Summary

Summary:

  • You potentially added 341 lines to the logs
  • Reco comparison results: 7428 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4038163
  • DQMHistoTests: Total failures: 165060
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3873082
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 229.737 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 16834.0,... ): 112.246 KiB GEM/Digis
  • DQMHistoSizes: changed ( 16834.0,... ): 1.978 KiB GEM/RecHits
  • DQMHistoSizes: changed ( 17034.0 ): 1.289 KiB SiStrip/MechanicalView
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: found differences in 2 / 48 workflows

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

ROCM Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

fwyzard commented May 14, 2025

ignore tests-rejected with ib-failure

fwyzard commented May 14, 2025

@smuzaffar could you merge these changes?

They are needed to rebase the MPI PR, and to make progress in debugging the issues on LUMI.

Thanks!

@smuzaffar

+externals

Thanks @fwyzard for the cleanup. ROCm now takes just 800 MB more (2.8 GB, compared to 2 GB previously).

@smuzaffar smuzaffar merged commit 2870eb8 into cms-sw:IB/CMSSW_15_1_X/master May 14, 2025
14 of 15 checks passed
@cmsbuild

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_15_1_X/master IBs (test failures were overridden). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @sextonkennedy, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)
