@smuzaffar
Contributor

Improved the cuda-runtime package: include all public libraries and headers.

@smuzaffar
Contributor Author

test parameters:

  • full_cmssw = true
  • enable = gpu

@cmsbuild
Contributor

cmsbuild commented May 5, 2025

A new Pull Request was created by @smuzaffar for branch IB/CMSSW_15_1_X/cudart.

@cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release managers for this.
cms-bot commands are listed here


@smuzaffar
Contributor Author

please test for CMSSW_15_1_CUDART_X

@cmsbuild
Contributor

cmsbuild commented May 5, 2025

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45845/summary.html
COMMIT: f75447e
CMSSW: CMSSW_15_1_CUDART_X_2025-05-04-2300/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9828/45845/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

Trying to install the rpm package external+hwloc+2.12.0-af9daca5de88b067b0fe494e77bfe58b just built.
Checking package dependencies: external+hwloc+2.12.0-af9daca5de88b067b0fe494e77bfe58b
Done checking package dependencies: external+hwloc+2.12.0-af9daca5de88b067b0fe494e77bfe58b
Checking local path dependency for rpm package external+hwloc+2.12.0-af9daca5de88b067b0fe494e77bfe58b just build.
RPM installation stderr hwloc:
error: Failed dependencies:
	libnvidia-ml.so.1()(64bit) is needed by external+hwloc+2.12.0-af9daca5de88b067b0fe494e77bfe58b-1-1.x86_64

Failed to install RPM for hwloc
Build logs cleanup cudnn
Build successful cudnn.
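The hwloc install failure is RPM's automatic dependency check at work: the hwloc build linked against NVML, so RPM's ELF dependency generator recorded a `libnvidia-ml.so.1()(64bit)` requirement, and the package cannot be installed unless something already installed (or installed alongside it) provides that exact capability. A minimal Python sketch of that comparison, assuming plain name matching (real RPM also honours version ranges and flags); the requirement and provide lists below are illustrative, not read from the actual packages:

```python
import re

# RPM's ELF dependency generator writes one capability per DT_NEEDED soname,
# e.g. "libnvidia-ml.so.1()(64bit)" for a 64-bit shared library.
SONAME_CAP = re.compile(r"^(?P<soname>[^()\s]+)\(\)(?P<bits>\(64bit\))?$")


def parse_soname_capability(cap: str) -> tuple[str, bool]:
    """Split a soname capability into (soname, is_64bit)."""
    match = SONAME_CAP.match(cap.strip())
    if match is None:
        raise ValueError(f"not a soname capability: {cap!r}")
    return match.group("soname"), match.group("bits") is not None


def unmet_requires(requires: list[str], provides: set[str]) -> list[str]:
    """Return the required capabilities that nothing provides (name match only)."""
    return [cap for cap in requires if cap not in provides]


if __name__ == "__main__":
    # Illustrative values only: the package needs NVML, but no installed
    # package advertises it, so the install would be rejected.
    hwloc_requires = ["libc.so.6()(64bit)", "libnvidia-ml.so.1()(64bit)"]
    installed_provides = {"libc.so.6()(64bit)"}
    for cap in unmet_requires(hwloc_requires, installed_provides):
        soname, is64 = parse_soname_capability(cap)
        print(f"{cap} is needed (soname={soname}, 64bit={is64})")
```

Two common ways out are to have some package declare the capability with an explicit `Provides:` tag (the `Privides: libnvidia-ml.so.1()(64bit)` line reported in the next run looks like an attempt at exactly this) or to filter the requirement out of the automatically generated list.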


@smuzaffar
Contributor Author

please test for CMSSW_15_1_CUDART_X

@cmsbuild
Contributor

cmsbuild commented May 5, 2025

Pull request #9828 was updated.

@cmsbuild
Contributor

cmsbuild commented May 5, 2025

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45850/summary.html
COMMIT: 8c2b514
CMSSW: CMSSW_15_1_CUDART_X_2025-05-04-2300/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9828/45850/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

FATAL: malformed spec found while quering it. Command: 
source /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/rpm-env.sh ;  rpm -q --specfile /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/tmpspec-cuda-runtime --info --define "cmsdist_directory /data/cmsbld/jenkins/workspace/ib-run-pr-tests/cmsdist" --define "compilerv 1231" --define "cmscompilerv 12" --define "cmsos el8_amd64" --define "almalinux_ver 8" --define "almalinux 8" --define "centos_ver 8" --define "centos 8" --define "rhel 8" --define "dist .el8" --define "el8 1" --define "package_vectorization %{nil}" --define "cmsswdata_version_link 1" --define "archfirst yes" --define "cmsBuild_bootstrap 1"  --define 'buildroot /foo'
Resulted in:

warning: line 30: It's not recommended to have unversioned Obsoletes: Obsoletes: external+cuda-runtime+1.0
error: line 439: Unknown tag: Privides: libnvidia-ml.so.1()(64bit)
error: query of specfile /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/tmpspec-cuda-runtime failed, can't parse
Traceback (most recent call last):
  File "./pkgtools/cmsBuild", line 5085, in <module>
    build(opts, args[1:], PKGFactory)
  File "./pkgtools/cmsBuild", line 4259, in build
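rpm stopped parsing at line 439 because `Privides` is not a tag it knows; the intended tag is `Provides`. Misspelled preamble tags like this can be flagged before `cmsBuild` ever queries the spec. A hypothetical pre-check in Python follows; `KNOWN_TAGS` is a deliberately partial list, tags are compared case-insensitively to avoid false positives, and treating every `Word:` at the start of a line as a tag is a simplification, since real specs can contain such lines inside scriptlets:

```python
import difflib
import re

# Partial, illustrative list of RPM preamble tags; extend as needed.
KNOWN_TAGS = {
    "Name", "Version", "Release", "Summary", "License", "URL", "Group",
    "Source", "Patch", "Requires", "BuildRequires", "Provides",
    "Obsoletes", "Conflicts", "BuildArch", "AutoReq", "AutoProv",
}
KNOWN_LOWER = {tag.lower() for tag in KNOWN_TAGS}

# "Source0:", "Patch12:" etc. are checked against their unnumbered tag.
TAG_LINE = re.compile(r"^(?P<tag>[A-Za-z]+?)\d*:\s")


def find_unknown_tags(spec_text: str) -> list[tuple[int, str, list[str]]]:
    """Return (line_number, tag, suggestions) for every unrecognized tag line."""
    problems = []
    for lineno, line in enumerate(spec_text.splitlines(), start=1):
        match = TAG_LINE.match(line)
        if match is None:
            continue
        tag = match.group("tag")
        if tag.lower() not in KNOWN_LOWER:
            # Suggest the closest known tag, e.g. "Privides" -> "Provides".
            hints = difflib.get_close_matches(tag, KNOWN_TAGS, n=1)
            problems.append((lineno, tag, hints))
    return problems
```

For example, running `find_unknown_tags` on a spec containing the line `Privides: libnvidia-ml.so.1()(64bit)` reports that line together with the suggestion `Provides`.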


@smuzaffar
Contributor Author

please test for CMSSW_15_1_CUDART_X

@cmsbuild
Contributor

cmsbuild commented May 5, 2025

Pull request #9828 was updated.

@cmsbuild
Contributor

cmsbuild commented May 5, 2025

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45851/summary.html
COMMIT: b617162
CMSSW: CMSSW_15_1_CUDART_X_2025-05-04-2300/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9828/45851/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45851/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45851/git-merge-result

Build

I found a compilation error when building:

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.8.1-c6bbcf8cc4b37a17804bfdc25b1286a3/bin/nvcc -dlink -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_CUDART_X_2025-05-04-2300/static/el8_amd64_gcc12 -lHeterogeneousTestCUDAKernel_nv -lHeterogeneousTestCUDADevice_nv -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_CUDART_X_2025-05-04-2300/biglib/el8_amd64_gcc12 -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_CUDART_X_2025-05-04-2300/lib/el8_amd64_gcc12 -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_CUDART_X_2025-05-04-2300/external/el8_amd64_gcc12/lib -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.8.1-c6bbcf8cc4b37a17804bfdc25b1286a3/lib64 -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.8.1-c6bbcf8cc4b37a17804bfdc25b1286a3/lib64/stubs -lcudadevrt --diag-suppress 20014 -std=c++20 -O3 --generate-line-info --source-in-ptx --display-error-number --expt-relaxed-constexpr --extended-lambda -gencode arch=compute_60,code=[sm_60,compute_60] -gencode arch=compute_70,code=[sm_70,compute_70] -gencode arch=compute_75,code=[sm_75,compute_75] -gencode arch=compute_80,code=[sm_80,compute_80] -gencode arch=compute_89,code=[sm_89,compute_89] -gencode arch=compute_90,code=[sm_90,compute_90] -Wno-deprecated-gpu-targets -diag-suppress=3012 -diag-suppress=3189 -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Xcudafe --gnu_version=120300 --cudart shared --compiler-options '-O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 
-Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DEIGEN_DONT_PARALLELIZE -DEIGEN_MAX_ALIGN_BYTES=64 -Wno-error=unused-variable -DBOOST_DISABLE_ASSERTS  -std=c++20 -fPIC ' tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/CUDATestKernelAdditionAlgo.cu.o -o tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/HeterogeneousTestCUDAKernelPlugins_cudadlink.o
>> Building edm plugin tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/libHeterogeneousTestCUDAKernelPlugins.so
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/c++ -O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++20 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DEIGEN_DONT_PARALLELIZE -DEIGEN_MAX_ALIGN_BYTES=64 -Wno-error=unused-variable -DBOOST_DISABLE_ASSERTS -flto=auto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -shared -Wl,-E    -Wl,-z,defs     tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/CUDATestKernelAdditionAlgo.cu.o tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/CUDATestKernelAdditionModule.cc.o tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/HeterogeneousTestCUDAKernelPlugins_cudadlink.o -o 
tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/libHeterogeneousTestCUDAKernelPlugins.so -Wl,-E -Wl,--hash-style=gnu -Wl,--as-needed -Wl,-z,noexecstack -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_CUDART_X_2025-05-04-2300/biglib/el8_amd64_gcc12 -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_CUDART_X_2025-05-04-2300/lib/el8_amd64_gcc12 -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_CUDART_X_2025-05-04-2300/external/el8_amd64_gcc12/lib -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.8.1-c6bbcf8cc4b37a17804bfdc25b1286a3/lib64 -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cuda/12.8.1-c6bbcf8cc4b37a17804bfdc25b1286a3/lib64/stubs -L/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_CUDART_X_2025-05-04-2300/static/el8_amd64_gcc12 -lFWCoreFramework -lHeterogeneousCoreCUDAServices -lFWCoreCommon -lFWCoreServiceRegistry -lDataFormatsCommon -lFWCoreParameterSet -lHeterogeneousCoreCUDAUtilities -lFWCoreAbstractServices -lFWCoreMessageLogger -lDataFormatsProvenance -lFWCoreConcurrency -lFWCorePluginManager -lFWCoreReflection -Wl,--push-state -Wl,--no-as-needed -lHeterogeneousTestCUDAKernel -Wl,--pop-state -lFWCoreUtilities -lFWCoreVersion -Wl,--push-state -Wl,--no-as-needed -lHeterogeneousTestCUDADevice -Wl,--pop-state -lTree -lNet -lThread -lMathCore -lRIO -lboost_program_options -lCore -lboost_thread -lboost_date_time -lpcre -lbz2 -lcudart -lcudadevrt -lnvToolsExt -lnvidia-ml -luuid -ltbb -llzma -lz -lcuda -lfmt -lcms-md5 -lcrypt -ldl -lrt -lstdc++fs -ltinyxml2
Leaving library rule at src/HeterogeneousTest/CUDAKernel/plugins
@@@@ Running EDM write config for HeterogeneousTestCUDAKernelPlugins
error: edmWriteConfigs caught an exception while loading a plugin library.
The executable will return success (0) so scram will continue,
but no cfi files will be written.
An exception of category 'PluginLibraryLoadError' occurred.
Exception Message:
unable to load /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_CUDART_X_2025-05-04-2300/tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/libHeterogeneousTestCUDAKernelPlugins.so because libcudart.so.12: cannot open shared object file: No such file or directory
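The `-L.../cuda/12.8.1-.../lib64` flags on the link line only tell the linker where to look at build time. The plugin (directly or through one of its dependencies) needs the soname `libcudart.so.12`, which the dynamic loader has to find again at load time via RPATH/RUNPATH, `LD_LIBRARY_PATH`, the `ld.so` cache or the default system directories; in this run none of those led to it. To check a runtime environment outside `edmWriteConfigs`, one can ask the loader for the soname directly. A minimal Python sketch using `ctypes`, assuming a Linux host (the helper name `try_dlopen` is hypothetical, and loading a library this way maps it into the calling process):

```python
import ctypes
import os
from typing import Optional


def try_dlopen(soname: str) -> Optional[str]:
    """Try to load a shared library through the dynamic loader.

    Returns None on success, or the loader's error message on failure.
    """
    try:
        ctypes.CDLL(soname)
    except OSError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
    # Sonames taken from the logs in this conversation.
    for name in ("libcudart.so.12", "libnvidia-ml.so.1"):
        error = try_dlopen(name)
        print(f"{name}: {'OK' if error is None else error}")
```

On a host where the library cannot be resolved, the printed message has the same `cannot open shared object file` form as the `edmWriteConfigs` error.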


@smuzaffar
Contributor Author

please test for CMSSW_15_1_CUDART_X

@cmsbuild
Contributor

cmsbuild commented May 7, 2025

Pull request #9828 was updated.

@cmsbuild
Contributor

cmsbuild commented May 7, 2025

-1

Failed Tests: UnitTests rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45892/summary.html
COMMIT: d31ae0d
CMSSW: CMSSW_15_1_CUDART_X_2025-05-05-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9828/45892/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45892/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45892/git-merge-result

Unit Tests

I found 2 errors in the following unit tests:

---> test TestDQMServicesDemo had ERRORS
---> test TestDQMGUIUpload had ERRORS

ROCm Unit Tests

I found 3 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS
---> test alpakaTestRadixSortROCmAsync had ERRORS

Comparison Summary

Summary:

@cmsbuild
Contributor

cmsbuild commented May 7, 2025

Pull request #9828 was updated.

@smuzaffar
Contributor Author

please test for CMSSW_15_1_CUDART_X

@cmsbuild
Contributor

cmsbuild commented May 7, 2025

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45936/summary.html
COMMIT: 064413c
CMSSW: CMSSW_15_1_CUDART_X_2025-05-05-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9828/45936/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

+ '[' -e /cvmfs/projects.cern.ch/cms-restricted/x86_64/rhel8/external/cuda/12.8.1/nvvm/lib/libnvvm.so ']'
+ '[' -e /cvmfs/projects.cern.ch/cms-restricted/x86_64/rhel8/external/cuda/12.8.1/nvvm/lib/libnvvm.a ']'
+ echo 'ERROR: Unable to find nvvm/lib/libnvvm'
ERROR: Unable to find nvvm/lib/libnvvm
+ exit 1
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.SXsCeR (%install)


RPM build errors:
Macro expanded in comment on line 467: %{pkginstroot}/lib64
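The `set -x` trace shows the `%install` step probing the CUDA installation on CVMFS for either a shared or a static `libnvvm` under `nvvm/lib` and aborting when neither exists. On x86_64 Linux CUDA toolkits the NVVM library typically lives in `nvvm/lib64` instead, so a probe limited to `nvvm/lib` can miss it. The sketch below re-expresses the probe in Python as a hypothetical stand-alone check; the function name, the extra `lib64` search and the default CUDA root are assumptions, not what the spec does:

```python
from pathlib import Path
from typing import Optional, Sequence


def find_libnvvm(cuda_root: str, subdirs: Sequence[str] = ("lib", "lib64")) -> Optional[Path]:
    """Return the first libnvvm found under <cuda_root>/nvvm/<subdir>, or None.

    Within each subdirectory the shared library is preferred over the static
    one, matching the order of the checks in the build log.
    """
    nvvm_dir = Path(cuda_root) / "nvvm"
    for subdir in subdirs:
        for name in ("libnvvm.so", "libnvvm.a"):
            candidate = nvvm_dir / subdir / name
            if candidate.exists():
                return candidate
    return None


if __name__ == "__main__":
    import sys

    # The default CUDA root is an assumption; pass the real one as argv[1].
    root = sys.argv[1] if len(sys.argv) > 1 else "/usr/local/cuda"
    found = find_libnvvm(root)
    print(found if found is not None else f"ERROR: Unable to find libnvvm under {root}/nvvm")
```

Passing `subdirs=("lib",)` reproduces the behaviour seen in the log exactly.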



@smuzaffar
Contributor Author

please test for CMSSW_15_1_CUDART_X

@cmsbuild
Contributor

cmsbuild commented May 9, 2025

Pull request #9828 was updated.

@smuzaffar
Contributor Author

please test for CMSSW_15_1_CUDART_X

@cmsbuild
Contributor

cmsbuild commented May 9, 2025

Pull request #9828 was updated.

@cmsbuild
Contributor

cmsbuild commented May 9, 2025

-1

Failed Tests: rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45996/summary.html
COMMIT: 1e6b3ec
CMSSW: CMSSW_15_1_CUDART_X_2025-05-07-2300/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9828/45996/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45996/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2e2d6e/45996/git-merge-result

ROCm Unit Tests

I found 2 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS

Comparison Summary

Summary:

  • You potentially added 17 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 69 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4015624
  • DQMHistoTests: Total failures: 9065
  • DQMHistoTests: Total nulls: 50
  • DQMHistoTests: Total successes: 4006489
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 166.443 KiB (49 files compared)
  • DQMHistoSizes: changed ( 139.001,... ): 0.001 KiB HLT/Filters
  • DQMHistoSizes: changed ( 145.014,... ): 6.954 KiB HLT/Filters
  • DQMHistoSizes: changed ( 16834.0,... ): 48.450 KiB HLT/HLTEgammaValidation
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: found differences in 10 / 48 workflows

@smuzaffar smuzaffar merged commit 2284a63 into IB/CMSSW_15_1_X/cudart May 12, 2025
11 of 12 checks passed
@smuzaffar smuzaffar deleted the cuda-runtime branch May 12, 2025 18:40


3 participants