@fwyzard
Contributor

@fwyzard fwyzard commented May 9, 2025

Include more libraries and tools in the ROCm package:

  • ROCr headers
  • hipCUB
  • hipBLAS/rocBLAS
  • hipFFT/rocFFT
  • hipSPARSE/rocSPARSE
  • hipSOLVER/rocSOLVER
  • RCCL
  • MIOpen

Merge hipRAND/rocRAND back into the base ROCm tool.

Drop support for Radeon Pro W6800.

fwyzard added 2 commits May 9, 2025 20:00
@fwyzard
Contributor Author

fwyzard commented May 9, 2025

enable gpu

@cmsbuild
Contributor

cmsbuild commented May 9, 2025

A new Pull Request was created by @fwyzard for branch IB/CMSSW_15_1_X/master.

@cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented May 9, 2025

cms-bot internal usage

@fwyzard
Contributor Author

fwyzard commented May 9, 2025

test parameters:

  • full_cmssw = true
  • enable = gpu

@fwyzard
Contributor Author

fwyzard commented May 9, 2025

please test

@cmsbuild
Contributor

-1

Failed Tests: rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-747b89/46012/summary.html
COMMIT: 876534a
CMSSW: CMSSW_15_1_X_2025-05-09-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9844/46012/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-747b89/46012/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-747b89/46012/git-merge-result

ROCm Unit Tests

I found 2 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS

Comparison Summary

There are some workflows for which there are errors in the baseline:
1000.0 step 2
1001.0 step 2
101.0 step 1
10224.0 step 3
11634.0 step 3
12434.0 step 3
12834.0 step 3
12846.0 step 3
13034.0 step 3
1306.0 step 3
13234.0 step 2
1330.0 step 3
135.4 step 1
136.731 step 3
136.793 step 3
136.874 step 3
139.001 step 3
140.045 step 3
140.56 step 2
14034.0 step 2
141.042 step 3
14234.0 step 2
145.014 step 3
145.104 step 3
145.202 step 3
145.301 step 3
145.408 step 3
145.5 step 3
145.604 step 3
145.713 step 3
16834.0 step 3
17034.0 step 3
24834.911 step 3
25.0 step 3
2500.201 step 2
250202.181 step 4
25202.0 step 3
29634.0 step 3
29634.75 step 2
29634.911 step 3
29696.0 step 3
29700.0 step 3
29834.999 step 4
312.0 step 3
4.22 step 3
4.53 step 3
5.1 step 1
8.0 step 4
9.0 step 3
The results for the comparisons for these workflows could be incomplete
This most likely means that the IB has errors in the relvals. The error does NOT come from this pull request.

Summary:

  • You potentially added 17495 lines to the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 138 log files, 69 edm output root files, 1 DQM output files

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

ROCM Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

@fwyzard
Contributor Author

fwyzard commented May 10, 2025

Reverting to ROCm 6.2.4 does not seem to help with the semi-random issues with ROCm:

$ cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM.py
%MSG-i ThreadStreamSetup:  (NoModuleName) 10-May-2025 10:36:24 EEST pre-events
setting # threads 8
setting # streams 8
%MSG
%MSG-i AlpakaService:  (NoModuleName) 10-May-2025 10:36:25 EEST pre-events
AlpakaServiceSerialSync succesfully initialised.
Found 1 device:
  - AMD EPYC 7A53 64-Core Processor
%MSG
%MSG-i ROCmService:  (NoModuleName) 10-May-2025 10:36:25 EEST pre-events
ROCm runtime version 6.2.41134, driver version 6.2.41134, AMD driver version 6.3.6
ROCm device 0: AMD Instinct MI250X (gfx90a:sramecc+:xnack-)
%MSG
%MSG-i AlpakaService:  (NoModuleName) 10-May-2025 10:36:26 EEST pre-events
AlpakaServiceROCmAsync succesfully initialised.
Found 1 device:
  - AMD Instinct MI250X
%MSG
10-May-2025 10:36:34 EEST  Initiating request to open file file:step2.root
10-May-2025 10:36:37 EEST  Successfully opened file file:step2.root
Begin processing the 1st record. Run 1, Event 10, LumiSection 1 on stream 1 at 10-May-2025 10:36:50.944 EEST
Begin processing the 2nd record. Run 1, Event 4, LumiSection 1 on stream 3 at 10-May-2025 10:36:51.071 EEST
Begin processing the 3rd record. Run 1, Event 6, LumiSection 1 on stream 7 at 10-May-2025 10:36:51.148 EEST
Begin processing the 4th record. Run 1, Event 7, LumiSection 1 on stream 4 at 10-May-2025 10:36:51.173 EEST
Begin processing the 5th record. Run 1, Event 5, LumiSection 1 on stream 6 at 10-May-2025 10:36:51.175 EEST
Begin processing the 6th record. Run 1, Event 8, LumiSection 1 on stream 2 at 10-May-2025 10:36:51.175 EEST
Begin processing the 7th record. Run 1, Event 3, LumiSection 1 on stream 0 at 10-May-2025 10:36:51.175 EEST
Begin processing the 8th record. Run 1, Event 9, LumiSection 1 on stream 5 at 10-May-2025 10:36:51.175 EEST
----- Begin Fatal Exception 10-May-2025 10:36:52 EEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 4 stream: 3
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module SiPixelCompareVertices/'siPixelCompareVertices'
   [3] Prefetching for module PixelVertexProducerAlpakaPhase1@alpaka/'pixelVerticesAlpaka'
   [4] Prefetching for module PixelVertexProducerAlpakaPhase1@alpaka/'pixelVerticesAlpaka'
   [5] Calling method for module CAHitNtupletAlpakaPhase1@alpaka/'pixelTracksAlpaka'
Exception Message:
A std::exception was thrown.
Requested allocation size 1558315927700 bytes is too large for the caching detail with maximum bin 1073741824 bytes. You might want to increase the maximum bin size
----- End Fatal Exception -------------------------------------------------
10-May-2025 10:36:53 EEST  Closed file file:step2.root
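
To put the two numbers in the exception message in perspective, here is a minimal sketch of the arithmetic. It is not the allocator's actual code: `fits_in_cache` is a hypothetical stand-in for whatever size check produced the message, and only the two byte counts are taken from the log.

```python
# Byte counts copied verbatim from the exception message in the log.
requested = 1_558_315_927_700  # "Requested allocation size"
max_bin = 1_073_741_824        # "maximum bin"

GIB = 2**30
TIB = 2**40


def fits_in_cache(size: int, largest_bin: int) -> bool:
    """Hypothetical stand-in for the allocator's check: a request is
    served from the cache only if it fits in the largest bin."""
    return size <= largest_bin


if __name__ == "__main__":
    print(f"max bin   = {max_bin / GIB:.0f} GiB")
    print(f"requested = {requested / TIB:.2f} TiB")
    print(f"ratio     = {requested / max_bin:.0f}x the max bin")
    print("fits in cache:", fits_in_cache(requested, max_bin))
```

The maximum bin is exactly 1 GiB, while the request is about 1.42 TiB, roughly 1451 times larger. That gap suggests the requested size itself may be wrong rather than the bin limit being slightly too small, although the log alone does not confirm this.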

@fwyzard fwyzard closed this May 10, 2025
@fwyzard fwyzard deleted the IB/CMSSW_15_1_X/master_ROCm_624 branch May 12, 2025 16:20