Skip to content

Conversation

@Sam-Harper
Copy link
Contributor

@Sam-Harper Sam-Harper commented Dec 1, 2021

PR description:

FastTimerServiceClient seemed to be taking a lot of time so I had a quick look.

It was quickly established that all the time was spent in TAxis::SetBinLabel https://root.cern.ch/doc/v616/TAxis_8cxx_source.html#l00809, specifically THashList::Rehash

I then had a quick look at FastTimerServiceClient and noticed its remaking the histograms every lumisection
https://github.com/cms-sw/cmssw/blob/master/HLTrigger/Timer/plugins/FastTimerServiceClient.cc#L95

I'm not entirely clear why it does it and I didnt spend any time to understand it. Perhaps experts could comment, I have no idea how DQM lumi plots work. It would be ideal if we could move that call to just end job if possible and experts agree.

However what it is doing is resetting the bin labels on each histogram each time. This is a very expensive operation so I put a simple check to see if the bin label is already set to that value before changing it. Honestly this should be inside TAxis::SetBinLabel ...

On my cern PC it goes from taking 14mins to 4mins for a file which only has a single lumisection. It'll be bigger gains for multi lumisection files.

@fwyzard

PR validation:

labels are still being set on my test workflow

@Sam-Harper Sam-Harper force-pushed the SetBinLabelFixes_1220pre2 branch from 54e6015 to b175820 Compare December 1, 2021 09:56
@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 1, 2021

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36313/27050

  • This PR adds an extra 24KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@Sam-Harper
Copy link
Contributor Author

@cmsbuild please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 1, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36313/27051

  • This PR adds an extra 24KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 1, 2021

A new Pull Request was created by @Sam-Harper (Sam Harper) for master.

It involves the following packages:

  • DQMServices/Core (dqm)
  • HLTrigger/Timer (hlt)

@emanueleusai, @ahmad3213, @Martin-Grunewald, @missirol, @jfernan2, @pmandrik, @pbo0, @rvenditti can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @missirol, @silviodonato, @barvic this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Copy link
Contributor

fwyzard commented Dec 1, 2021

@Sam-Harper thanks for investigating the slowdown and for suggesting a workaround.

I'm confused as well as to why the histograms need to be recreated or re-set every lumisection, and clearly such slowdown was not there when the client was initially written, one or to DQM migrations ago.

Rather than checking if each label needs to be updated or not, could you try to check if the plots need to be "reconfigured" at all ?
The point is that if the configuration used to generate the timing data is the same, it should be possible to reuse the histrogams as they are, without checking every individual label.
And if the configuration is different, the number of entries in the axes may also be different, so just checking the lables is probably not enough.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 1, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36313/27054

  • This PR adds an extra 24KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 1, 2021

Pull request #36313 was updated. @Martin-Grunewald, @emanueleusai, @ahmad3213, @cmsbuild, @missirol, @jfernan2, @pmandrik, @pbo0, @rvenditti can you please check and sign again.

@Sam-Harper
Copy link
Contributor Author

@Sam-Harper thanks for investigating the slowdown and for suggesting a workaround.

I'm confused as well as to why the histograms need to be recreated or re-set every lumisection, and clearly such slowdown was not there when the client was initially written, one or to DQM migrations ago.

Rather than checking if each label needs to be updated or not, could you try to check if the plots need to be "reconfigured" at all ? The point is that if the configuration used to generate the timing data is the same, it should be possible to reuse the histrogams as they are, without checking every individual label. And if the configuration is different, the number of entries in the axes may also be different, so just checking the lables is probably not enough.

Indeed. It was trivial to fix the super slow part (which I agree was clearly not there when it was written, it feels something recent broke it but I dont know what) but yes I would very much like to just make them once. Before I went down that way, I thought I would get confirmation from you first if there was some fundamental reason why it was done this way.

Also I've noticed this slows down sometimes and not others (ie it was ~okay on voms007 but was pretty bad on cms-hlt-gpu). Its very odd and I dont understand it yet. Will have a poke at it

@jfernan2
Copy link
Contributor

jfernan2 commented Dec 1, 2021

Thanks for this improvement @Sam-Harper
As far as I can track back, the SetBinLabel in the MonitorElement class goes at least as old in time as 9 years ago when code was moved from cvs to github taken the snapshot of CMSSW_6_2_0_pre8.

https://github.com/cms-sw/cmssw/blame/d61dd35a4019cccc753eff56d5fd02aa75ce0025/DQMServices/Core/src/MonitorElement.cc#L948

I am not sure if this change will work with MEs which have evolving labels at Online as LSs are coming in. I am not even sure if the Jenkin tests will be able to spot differences in labels from this PR. I will have to investigate further.

@Sam-Harper
Copy link
Contributor Author

Well if the label isnt identical it'll be changed. Otherwise it'll not be changed. Again really fixes an issue which should be fixed in root.

However the DQM is more of a driveby fix, I have to figure out the root cause in fasttimer client. THis PR is a bit of a work in progress.

@smuzaffar smuzaffar modified the milestones: CMSSW_12_2_X, CMSSW_12_3_X Dec 6, 2021
@smuzaffar smuzaffar modified the milestones: CMSSW_12_3_X, CMSSW_12_4_X Mar 11, 2022
@Sam-Harper
Copy link
Contributor Author

Okay coming back to this (as I had forgotten). And its pretty critical to have in the release with re-emulating L1 on the ZB (lots and lots of lumi section (hence why I remembered)

@jfernan2: do you want this fix for the rest of the DQM. If not, it gets dropped from the PR. Its harmless, it checks if the bin is already set to that label before setting. So the end result is the same...

For our problem, I have dug deeper and understood the problem.

  1. there is no need to do per lumi section updates of the histograms, they are identical except for their entry counts which is an artifact of being refilled all the time, bin contents are directly set so those are correct). I will update the PR to disable filling the histograms per lumi section and it only happens at end job. This is as we suspected.

  2. the reason this is happening is that IGetter.get requires the full path not just the name. So the code always thinks the ME doesnt exist when it does. However IBooker recognises that it exists (it just requires the name, it figures out the path) and thus reuses the existing histogram. I think the IGetter behavour changed with the DQMStore update two years ago but I didnt spend too much time doing archaeology . If so it would explain why we're seeing it now (we havnt run on many lumisections during the shutdown till now)

@jfernan2
Copy link
Contributor

@Sam-Harper I don't have a strong preference, so if the fix works for you I believe it is fine
In the meantime I will trigger the tests to spot any other possible problem in the change, since the bot did not recognize you command

@jfernan2
Copy link
Contributor

please test

@Sam-Harper
Copy link
Contributor Author

Thanks for the speedy response!

Hmm thinking about it then if you have no preference. I might just submit a new PR with the fast timer change I made today. I want this to get in CMSSW_12_3_0 as a bug fix (we need it!) and it might make people more comfortable with a one line change.

Its only a problem when folks are resetting the bin labels for the same histogram over and over which is arguably a bug condition.

@Sam-Harper
Copy link
Contributor Author

I've decided to fix it this way: #37402 and therefore will close this PR. Although if you decide you want the protections in, let me know.

@Sam-Harper Sam-Harper closed this Mar 29, 2022
@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1c6a8b/23500/summary.html
COMMIT: 4dfae67
CMSSW: CMSSW_12_4_X_2022-03-29-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36313/23500/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3585896
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3585872
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 200 log files, 45 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants