Speeding up FastTimerServiceClient #36313
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36313/27050
Code check has found code style and quality issues which could be resolved by applying following patch(s)
|
|
@cmsbuild please test |
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36313/27051
|
|
A new Pull Request was created by @Sam-Harper (Sam Harper) for master. It involves the following packages:
@emanueleusai, @ahmad3213, @Martin-Grunewald, @missirol, @jfernan2, @pmandrik, @pbo0, @rvenditti can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
|
@Sam-Harper thanks for investigating the slowdown and for suggesting a workaround. I'm confused as well as to why the histograms need to be recreated or reset every lumisection, and clearly such a slowdown was not there when the client was initially written, one or two DQM migrations ago. Rather than checking whether each label needs to be updated, could you try to check whether the plots need to be "reconfigured" at all? |
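One way to picture the "reconfigure only when needed" idea is a small tracker that remembers which plots have already been set up. This is only a sketch under stated assumptions: `ConfiguredOnce`, `needsConfiguring` and the plot path strings are hypothetical names for illustration and are not part of FastTimerServiceClient or the DQM API.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical helper (not CMSSW code): remembers which plots were already configured.
class ConfiguredOnce {
public:
  // Returns true the first time a given plot path is seen, false on every later call.
  bool needsConfiguring(const std::string& path) {
    // insert() reports whether the element was newly added.
    return seen_.insert(path).second;
  }

private:
  std::unordered_set<std::string> seen_;
};
```

A client could call `needsConfiguring(path)` at the end of each lumisection and skip the label setup whenever it returns false. Whether that is safe depends on whether the plots can change between lumisections, which is the open question in this thread.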
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36313/27054
|
|
Pull request #36313 was updated. @Martin-Grunewald, @emanueleusai, @ahmad3213, @cmsbuild, @missirol, @jfernan2, @pmandrik, @pbo0, @rvenditti can you please check and sign again. |
Indeed. It was trivial to fix the super slow part (which I agree was clearly not there when it was written; it feels like something recent broke it, but I don't know what), but yes, I would very much like to just make them once. Before going down that route, I wanted confirmation from you on whether there was some fundamental reason why it was done this way. Also, I've noticed this slows down sometimes and not others (i.e. it was ~okay on voms007 but was pretty bad on cms-hlt-gpu). It's very odd and I don't understand it yet. Will have a poke at it |
|
Thanks for this improvement @Sam-Harper. I am not sure if this change will work with MEs which have evolving labels at Online as LSs are coming in. I am not even sure if the Jenkins tests will be able to spot differences in labels from this PR. I will have to investigate further. |
|
Well, if the label isn't identical it'll be changed. Otherwise it won't be changed. Again, this really fixes an issue which should be fixed in ROOT. However, the DQM change is more of a drive-by fix; I still have to figure out the root cause in the fast timer client. This PR is a bit of a work in progress. |
|
Okay, coming back to this (as I had forgotten). It's pretty critical to have in the release with re-emulating L1 on the ZB (lots and lots of lumisections, hence why I remembered). @jfernan2: do you want this fix for the rest of the DQM? If not, it gets dropped from the PR. It's harmless: it checks if the bin is already set to that label before setting it, so the end result is the same... For our problem, I have dug deeper and understood the problem.
|
|
@Sam-Harper I don't have a strong preference, so if the fix works for you I believe it is fine |
|
please test |
|
Thanks for the speedy response! Hmm, thinking about it, if you have no preference I might just submit a new PR with the fast timer change I made today. I want this to get into CMSSW_12_3_0 as a bug fix (we need it!) and a one-line change might make people more comfortable. It's only a problem when folks are resetting the bin labels for the same histogram over and over, which is arguably a bug condition. |
|
I've decided to fix it this way: #37402 and therefore will close this PR. Although if you decide you want the protections in, let me know. |
|
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1c6a8b/23500/summary.html
|
PR description:
FastTimerServiceClient seemed to be taking a lot of time, so I had a quick look.
It was quickly established that all the time was spent in TAxis::SetBinLabel https://root.cern.ch/doc/v616/TAxis_8cxx_source.html#l00809, specifically THashList::Rehash.
I then had a quick look at FastTimerServiceClient and noticed it is remaking the histograms every lumisection:
https://github.com/cms-sw/cmssw/blob/master/HLTrigger/Timer/plugins/FastTimerServiceClient.cc#L95
I'm not entirely clear why it does this and I didn't spend any time trying to understand it. Perhaps experts could comment; I have no idea how DQM lumi plots work. It would be ideal if we could move that call to just end job, if possible and if experts agree.
However, what it is doing is resetting the bin labels on each histogram each time. This is a very expensive operation, so I put in a simple check to see if the bin label is already set to that value before changing it. Honestly, this should be inside TAxis::SetBinLabel ...
On my CERN PC it goes from taking 14 min to 4 min for a file which has only a single lumisection. The gains should be bigger for multi-lumisection files.
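The check described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `MockAxis` is a hypothetical stand-in for a histogram axis (it is not ROOT's `TAxis`), and `setBinLabelIfChanged` is an assumed helper name. The mock counts writes so the effect of the guard is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a histogram axis; it is NOT ROOT's TAxis.
// Bins are numbered from 1, and every label write is counted.
struct MockAxis {
  std::vector<std::string> labels;  // labels[i] belongs to bin i + 1
  std::size_t writes = 0;           // number of (notionally expensive) label writes

  explicit MockAxis(std::size_t nbins) : labels(nbins) {}

  const std::string& binLabel(std::size_t bin) const {
    assert(bin >= 1 && bin <= labels.size());
    return labels[bin - 1];
  }

  void setBinLabel(std::size_t bin, const std::string& label) {
    assert(bin >= 1 && bin <= labels.size());
    labels[bin - 1] = label;
    ++writes;
  }
};

// Write the label only when it differs from the one already stored.
// Returns true if a write actually happened.
inline bool setBinLabelIfChanged(MockAxis& axis, std::size_t bin, const std::string& label) {
  if (axis.binLabel(bin) == label) {
    return false;  // label already in place, skip the expensive call
  }
  axis.setBinLabel(bin, label);
  return true;
}
```

Calling `setBinLabelIfChanged` repeatedly with the same label leaves `writes` at 1, while a different label still goes through, so the final labels are the same as without the guard.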
@fwyzard
PR validation:
labels are still being set on my test workflow