@fwyzard
Contributor

@fwyzard fwyzard commented May 14, 2025

Fix the configuration of CUDA and cuDNN in PyTorch and related tools.

@fwyzard fwyzard changed the title from "PyTorch: fix use of CUDA and of nvtx3" to "PyTorch: fix use of CUDA and of nvtx3 [15.0.x]" May 14, 2025
@cmsbuild
Contributor

cmsbuild commented May 14, 2025

A new Pull Request was created by @fwyzard for branch IB/CMSSW_15_0_X/master.

@cmsbuild, @iarspider, @smuzaffar can you please review it and, if appropriate, sign it? Thanks.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release managers for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented May 14, 2025

cms-bot internal usage

@fwyzard
Contributor Author

fwyzard commented May 14, 2025

backport #9860

@fwyzard
Contributor Author

fwyzard commented May 14, 2025

please test

@cmsbuild
Contributor

Pull request has been put on hold by @fwyzard
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@cmsbuild cmsbuild added the hold label May 14, 2025
@fwyzard fwyzard force-pushed the IB/CMSSW_15_0_X/master_fix_pytorch branch from 7014228 to 5950ff0 May 14, 2025 07:47
@fwyzard fwyzard changed the title from "PyTorch: fix use of CUDA and of nvtx3 [15.0.x]" to "PyTorch: fix configuration of CUDA and cuDNN [15.0.x]" May 14, 2025
@cmsbuild
Contributor

Pull request #9859 was updated.

@fwyzard
Contributor Author

fwyzard commented May 14, 2025

please test

@cmsbuild
Contributor

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a16fd0/46112/summary.html
COMMIT: 5950ff0
CMSSW: CMSSW_15_0_X_2025-05-13-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9859/46112/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found a compilation error when building:

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/../lib/gcc/x86_64-redhat-linux-gnu/12.3.1/../../../../x86_64-redhat-linux-gnu/bin/ld.bfd: /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_0_X_2025-05-13-2300/external/el8_amd64_gcc12/lib/libtorch_cuda.so: undefined reference to `cudnnConvolutionBiasActivationForward'
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/../lib/gcc/x86_64-redhat-linux-gnu/12.3.1/../../../../x86_64-redhat-linux-gnu/bin/ld.bfd: /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_0_X_2025-05-13-2300/external/el8_amd64_gcc12/lib/libtorch_cuda.so: undefined reference to `cudnnSpatialTfGridGeneratorBackward'
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/../lib/gcc/x86_64-redhat-linux-gnu/12.3.1/../../../../x86_64-redhat-linux-gnu/bin/ld.bfd: /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_0_X_2025-05-13-2300/external/el8_amd64_gcc12/lib/libtorch_cuda.so: undefined reference to `cudnnBackendSetAttribute'
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/../lib/gcc/x86_64-redhat-linux-gnu/12.3.1/../../../../x86_64-redhat-linux-gnu/bin/ld.bfd: /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_0_X_2025-05-13-2300/external/el8_amd64_gcc12/lib/libtorch_cuda.so: undefined reference to `cudnnSetConvolutionMathType'
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/../lib/gcc/x86_64-redhat-linux-gnu/12.3.1/../../../../x86_64-redhat-linux-gnu/bin/ld.bfd: /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_0_X_2025-05-13-2300/external/el8_amd64_gcc12/lib/libtorch_cuda.so: undefined reference to `cudnnDestroySpatialTransformerDescriptor'
collect2: error: ld returned 1 exit status
>> Deleted: tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testTorch/testTorch
gmake: *** [tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testTorch/testTorch] Error 1
>> Compiling  src/PhysicsTools/PyTorch/test/testRunner.cc
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/c++ -c -DCMS_MICRO_ARCH='x86-64-v3' -DGNU_GCC -D_GNU_SOURCE -DBOOST_SPIRIT_THREADSAFE -DPHOENIX_THREADSAFE -DBOOST_MATH_DISABLE_STD_FPCLASSIFY -DBOOST_UUID_RANDOM_PROVIDER_FORCE_POSIX -DCMSSW_GIT_HASH='CMSSW_15_0_X_2025-05-13-2300' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_15_0_X_2025-05-13-2300' -Isrc -Ipoison -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02889/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_X_2025-05-13-2300/src -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.4.0-cac6549d087ed8f720bd68fc07f673de/include -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.4.0-cac6549d087ed8f720bd68fc07f673de/include/torch/csrc/api/include -isystem/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/boost/1.80.0-f9d55d46c162407ba3033b506563d0e2/include -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/cppunit/1.15.x-fb84a4bbf5a436317d208e3ef0864e91/include -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/protobuf/3.21.9-3b4dc3892e5c8da1829fdf0bfe570820/include -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/zlib/1.2.13-d217cdbdd8d586e845e05946de2796be/include -O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++20 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess 
-Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -Wno-error=unused-variable -DBOOST_DISABLE_ASSERTS -flto=auto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -fPIC -MMD -MF tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testTorchSimpleDnn/testRunner.cc.d src/PhysicsTools/PyTorch/test/testRunner.cc -o tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testTorchSimpleDnn/testRunner.cc.o
>> Compiling  src/PhysicsTools/PyTorch/test/testTorchSimpleDnn.cc
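
All five undefined references in the log above are to `cudnn*` symbols needed by `libtorch_cuda.so`, which suggests (without proving it) that the cuDNN library was not on the link line for the test. A minimal sketch for pulling such symbols out of a build log, assuming the log is available as plain text; the `/hypothetical/...` paths are shortened placeholders, and only the symbol names are taken from the log:

```python
import os
import re
from collections import defaultdict

# Shortened, hypothetical paths; the symbol names are copied from the log above.
LOG = """\
ld.bfd: /hypothetical/lib/libtorch_cuda.so: undefined reference to `cudnnConvolutionBiasActivationForward'
ld.bfd: /hypothetical/lib/libtorch_cuda.so: undefined reference to `cudnnSpatialTfGridGeneratorBackward'
ld.bfd: /hypothetical/lib/libtorch_cuda.so: undefined reference to `cudnnBackendSetAttribute'
ld.bfd: /hypothetical/lib/libtorch_cuda.so: undefined reference to `cudnnSetConvolutionMathType'
ld.bfd: /hypothetical/lib/libtorch_cuda.so: undefined reference to `cudnnDestroySpatialTransformerDescriptor'
collect2: error: ld returned 1 exit status
"""

# Matches "<object>: undefined reference to `<symbol>'" as printed by GNU ld.
UNDEFINED_REF = re.compile(r"(?P<obj>\S+): undefined reference to `(?P<sym>[^']+)'")


def undefined_symbols(log_text):
    """Group undefined symbols by the file name of the object that needs them."""
    by_object = defaultdict(set)
    for line in log_text.splitlines():
        match = UNDEFINED_REF.search(line)
        if match:
            by_object[os.path.basename(match["obj"])].add(match["sym"])
    return dict(by_object)


result = undefined_symbols(LOG)
for obj, symbols in sorted(result.items()):
    print(obj)
    for sym in sorted(symbols):
        print("   ", sym)
```

Grouping by object file makes it easy to see whether the missing symbols share a prefix and therefore a likely missing library.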


@fwyzard fwyzard force-pushed the IB/CMSSW_15_0_X/master_fix_pytorch branch from 5950ff0 to dfdacc2 May 14, 2025 12:21
@cmsbuild
Contributor

Pull request #9859 was updated.

@fwyzard
Contributor Author

fwyzard commented May 14, 2025

please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a16fd0/46135/summary.html
COMMIT: 38f822e
CMSSW: CMSSW_15_0_X_2025-05-14-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9859/46135/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 16 lines to the logs
  • Reco comparison results: 7 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4005094
  • DQMHistoTests: Total failures: 66
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4005008
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (49 files compared)
  • Checked 218 log files, 189 edm output root files, 50 DQM output files
  • TriggerResults: no differences found
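
As a quick sanity check on the DQMHistoTests figures above, the per-category counts can be summed and compared with the reported total. This is a sketch that assumes the five categories are mutually exclusive and together cover every compared histogram; the summary itself does not state this:

```python
# Counts copied from the DQMHistoTests lines in the summary above.
counts = {
    "successes": 4005008,
    "failures": 66,
    "nulls": 0,
    "skipped": 20,
    "missing": 0,
}
total_compared = 4005094

# Assumption: the categories partition the compared histograms.
categories_sum = sum(counts.values())
consistent = categories_sum == total_compared
failure_fraction = counts["failures"] / total_compared

print("sum of categories:", categories_sum)
print("matches reported total:", consistent)
print(f"failure fraction: {failure_fraction:.2e}")
```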

@fwyzard
Contributor Author

fwyzard commented May 16, 2025

type bugfix

@smuzaffar
Contributor

please test

let's refresh the tests

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a16fd0/46255/summary.html
COMMIT: 38f822e
CMSSW: CMSSW_15_0_X_2025-05-20-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9859/46255/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

@smuzaffar
Contributor

+externals

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_15_0_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @sextonkennedy, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen

+1

@cmsbuild cmsbuild merged commit 8c9af35 into cms-sw:IB/CMSSW_15_0_X/master Jun 3, 2025
9 checks passed
@fwyzard fwyzard deleted the IB/CMSSW_15_0_X/master_fix_pytorch branch September 14, 2025 22:20
4 participants