Improve performance of heterogeneous CLUE. #49078
Remove unnecessary memset operations in all cases where the kernels overwrite the memory anyway, without assuming any previous value. In all other cases, keep the memset operations to preserve the correctness of the algorithms.
|
cms-bot internal usage |
|
enable gpu |
|
test parameters:
|
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49078/46314 |
|
A new Pull Request was created by @rovere for master. It involves the following packages:
@Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @srimanob, @subirsarkar can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
|
@cmsbuild, please test |
|
type performance-improvements |
|
-1
Failed Tests: RelVals-AMD_MI300X
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
You can see more details here:
RelVals-AMD_MI300X
Comparison Summary
AMD_W7900 Comparison Summary
NVIDIA_H100 Comparison Summary
NVIDIA_L40S Comparison Summary
NVIDIA_T4 Comparison Summary
|
|
@cmsbuild, please test |
|
+1
Size: This PR adds an extra 16KB to repository
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
You can see more details here:
Comparison Summary
AMD_MI300X Comparison Summary
AMD_W7900 Comparison Summary
NVIDIA_H100 Comparison Summary
There are some workflows for which there are errors in the baseline:
NVIDIA_L40S Comparison Summary
NVIDIA_T4 Comparison Summary
|
|
For the record, the performance of this PR in combination with the latest commit of cms-sw/cmsdist#10114 has been checked, resulting in a huge timing performance gain when running the HLT Phase2 timing menu.
[1]
#!/bin/bash -ex
ALL_FILES_TTBAR='file:/shared/data/012dcc7c-fc39-45ad-b603-7cb987156456.root,file:/shared/data/02e8911a-095a-40c5-9200-a1b5efbfad45.root,file:/shared/data/08fb354c-6ed3-481e-b230-17822759dcdf.root,file:/shared/data/093f401c-5bc6-4101-9722-8f487b36d4d6.root,file:/shared/data/6058b392-1f46-4247-bf54-c753640131f8.root,file:/shared/data/613b476d-0041-4514-a0e6-911bc96c6516.root,file:/shared/data/6328ef1a-9228-442b-a427-45612bb7ce54.root,file:/shared/data/ac6c7f8a-f32d-4983-89c8-8029533e379c.root,file:/shared/data/aef94c88-cf94-4863-8f05-d7684b38a409.root,file:/shared/data/afbcecf7-2f9b-4bd3-9bb4-76cc8204cad1.root'
cmsDriver.py step2 -s L1P2GT,HLT:75e33_timing \
--conditions auto:phase2_realistic_T33 \
--datatier DQMIO,NANOAODSIM \
-n 1000 \
--eventcontent DQMIO,NANOAODSIM \
--geometry ExtendedRun4D110 \
--era Phase2C17I13M9 \
--procModifier alpaka \
--filein $ALL_FILES_TTBAR \
--nThreads 24 \
--process HLTX \
--inputCommands='keep *, drop *_hlt*_*_HLT, drop triggerTriggerFilterObjectWithRefs_l1t*_*_HLT' \
--no_exec \
--python_filename 75e33_timing_config_ONCPU.py
cat <<@EOF >> 75e33_timing_config_ONCPU.py
process.options.accelerators = ['cpu']
@EOF |
|
assign heterogeneous |
|
+1 |
|
+Upgrade |
|
+heterogeneous |
|
any idea why the |
no, we've been wondering that ourselves without a clear explanation. |
|
I guess one more reason to try and break down the |
|
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2) |
|
+1 |
|
type ngt |

PR description:
Remove unnecessary memset operations in all cases where the kernels overwrite the memory anyway, without assuming any previous value. In all other cases, keep the memset operations to preserve the correctness of the algorithms.
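To illustrate the distinction drawn in the description, here is a minimal host-side C++ sketch. It is not taken from the CLUE or CMSSW code: `scaleKernel`, `histogramKernel`, `runScale` and `runHistogram` are hypothetical names, and plain loops stand in for device kernels. The first "kernel" writes every output element without reading it, so zeroing the buffer beforehand is wasted work. The second one increments counters, so it depends on the buffer starting at zero and the memset has to stay.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical "kernel" 1: writes every output element unconditionally and
// never reads the previous content of `out`. A memset before it is redundant.
void scaleKernel(const float* in, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = 2.0f * in[i];
}

// Hypothetical "kernel" 2: increments counters, i.e. it reads the previous
// content of `counts`. Here the buffer must be zeroed first.
void histogramKernel(const int* bin, std::size_t n, int* counts) {
  for (std::size_t i = 0; i < n; ++i)
    counts[bin[i]] += 1;
}

// Runs kernel 1 on a buffer pre-filled with a stale value (a stand-in for
// uninitialised device memory) and deliberately skips the memset.
std::vector<float> runScale(const std::vector<float>& in) {
  std::vector<float> out(in.size(), -999.0f);
  scaleKernel(in.data(), out.data(), in.size());
  return out;
}

// Runs kernel 2 on a stale buffer, with or without the memset.
std::vector<int> runHistogram(const std::vector<int>& bin, std::size_t nBins, bool doMemset) {
  std::vector<int> counts(nBins, 12345);
  if (doMemset)
    std::memset(counts.data(), 0, nBins * sizeof(int));
  histogramKernel(bin.data(), bin.size(), counts.data());
  return counts;
}
```

In this toy setup, `runScale` returns the correct result even though the stale buffer is never cleared, while `runHistogram` is correct only when `doMemset` is true. Pre-filling buffers with a stale value in this way is also one possible way to check that a removed memset was really unnecessary.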
PR validation:
When run on a set of events, the output, in terms of Tracksters, is identical. The performance has improved both when running on GPU and when using the alpaka Serial CPU backend (thanks @mmusich for the results of the tests!).
To properly test this PR, we also need an updated version of the CLUE external library:
cms-sw/cmsdist#10114