@smorovic
Contributor

PR description:

Fix a crash caused by a missing luminosityBlockAuxiliary object, seen in rare instances at the start of a run in production DAQ/HLT.

A race can happen because the local file lock is released before lsToStart is determined.
Since this check involves local stat calls on marker files, a competing process can, in the meantime, create
the EoL marker file being checked (causing lsToStart to be increased
above the LS of the newly opened file).
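The interleaving described above can be reproduced deterministically with a small C++ model. This is a sketch under stated assumptions, not the FedRawDataInputSource code: `RunDir`, `firstUnendedLs`, `competitorEndsLs`, `lsToStartAfterUnlock` and `lsToStartUnderLock` are hypothetical names, the EoL marker files are modeled as a `std::set` of LS numbers, and lsToStart is assumed to be the first LS without an EoL marker. The competing process is invoked synchronously inside the unlocked window rather than running as a separate process.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <set>

// Hypothetical model of the shared run directory; not the CMSSW implementation.
struct RunDir {
  std::mutex fileLock;            // stands in for the local file lock
  std::set<unsigned> eolMarkers;  // LS numbers that already have an EoL marker file
};

// Assumption: lsToStart is the first lumisection without an EoL marker.
unsigned firstUnendedLs(const std::set<unsigned>& eol) {
  unsigned ls = 1;
  while (eol.count(ls) != 0) ++ls;
  return ls;
}

// A competing process that ends lumisection `ls`; it writes its marker under the lock.
std::function<void(RunDir&)> competitorEndsLs(unsigned ls) {
  return [ls](RunDir& d) {
    std::lock_guard<std::mutex> guard(d.fileLock);
    d.eolMarkers.insert(ls);
  };
}

// Ordering described in the PR: the lock is released after the file is taken,
// and only then are the marker files checked. The competitor is called inside
// that window to reproduce the problematic interleaving deterministically.
unsigned lsToStartAfterUnlock(RunDir& d, const std::function<void(RunDir&)>& competitor) {
  {
    std::lock_guard<std::mutex> guard(d.fileLock);
    // ... the raw file for the next lumisection would be taken here ...
  }
  competitor(d);                        // another process acts while the lock is free
  return firstUnendedLs(d.eolMarkers);  // the check now sees the new EoL marker
}

// For comparison only: deriving lsToStart before releasing the lock leaves no window.
unsigned lsToStartUnderLock(RunDir& d, const std::function<void(RunDir&)>& competitor) {
  unsigned lsToStart = 0;
  {
    std::lock_guard<std::mutex> guard(d.fileLock);
    lsToStart = firstUnendedLs(d.eolMarkers);
  }
  competitor(d);
  return lsToStart;
}
```

With EoL markers for LS 1 to 4 and a newly opened file for LS 5, a competitor that ends LS 5 inside the window makes `lsToStartAfterUnlock` return 6, so lsToStart ends up larger than the file's LS, while `lsToStartUnderLock` returns 5. The locked variant only serves to locate the window; it is not presented as the change made in this PR.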

If lsToStart is larger than ls, the source skips opening a lumisection
and goes on to process events, which triggers the lumi-block-related assertion.

The bugfix ensures that the lumisection is opened before the file is processed.
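The difference between the failing and the fixed ordering can be sketched as a toy state model. The names `SourceModel`, `handleFileBuggy` and `handleFileFixed` are hypothetical, the lumi block assertion is modeled as a thrown `std::logic_error`, and the rule that lumisections are opened from `max(next LS, lsToStart)` up to the file's LS is an assumption used to illustrate "skipping" a lumisection, not the actual CMSSW logic.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical model of the input source's lumisection state; not CMSSW code.
struct SourceModel {
  std::optional<unsigned> openLs;   // lumisection currently open (none at start of run)
  std::vector<unsigned> openedLumis;  // log of opened lumisections, for inspection

  void openLumi(unsigned ls) {
    openLs = ls;
    openedLumis.push_back(ls);
  }

  // Stand-in for the lumi block assertion: events need their lumisection open.
  void processEvents(unsigned ls) const {
    if (!openLs || *openLs != ls) throw std::logic_error("no open lumisection for events");
  }

  // Assumed rule: open lumisections from max(next LS, lsToStart) up to the file's LS.
  // When lsToStart > ls this loop opens nothing.
  void openPendingLumis(unsigned ls, unsigned lsToStart) {
    unsigned first = std::max(openLs ? *openLs + 1 : 1u, lsToStart);
    for (unsigned l = first; l <= ls; ++l) openLumi(l);
  }

  // Failing ordering: events are processed even if no lumisection was opened.
  void handleFileBuggy(unsigned ls, unsigned lsToStart) {
    openPendingLumis(ls, lsToStart);
    processEvents(ls);
  }

  // Fixed ordering: the file's own lumisection is opened before its events.
  void handleFileFixed(unsigned ls, unsigned lsToStart) {
    openPendingLumis(ls, lsToStart);
    if (!openLs || *openLs != ls) openLumi(ls);
    processEvents(ls);
  }
};
```

At the start of a run, `handleFileBuggy(5, 6)` opens no lumisection and throws, whereas `handleFileFixed(5, 6)` opens LS 5 before processing its events. When lsToStart equals the file's LS, both variants behave the same.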

PR validation:

Tested in the DAQ filter farm test system (OpenStack VMs). A special test mode was constructed to emulate the conditions of the crash and to develop the fix.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35682/25973

  • This PR adds an extra 24KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @smorovic (Srecko Morovic) for master.

It involves the following packages:

  • EventFilter/Utilities (daq, reconstruction)

@jpata, @cmsbuild, @emeschi, @smorovic, @slava77 can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @missirol this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@smorovic
Contributor Author

@cmsbuild please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86092f/19645/summary.html
COMMIT: 8fe4166
CMSSW: CMSSW_12_1_X_2021-10-14-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35682/19645/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86092f/19645/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86092f/19645/git-merge-result

Comparison Summary

Workflow 1001.0 has different files in step1_dasquery.log than those found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons.

Summary:

  • No significant changes to the logs found
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 40
  • DQMHistoTests: Total histograms compared: 2768870
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2768842
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -10588.944 KiB( 38 files compared)
  • DQMHistoSizes: changed ( 10024.0,... ): -3151.653 KiB CTPPS/TimingFastSilicon
  • DQMHistoSizes: changed ( 10024.0,... ): 2360.820 KiB CTPPS/DiamondSampic
  • DQMHistoSizes: changed ( 11634.0,... ): 2441.273 KiB CTPPS/DiamondSampic
  • Checked 160 log files, 37 edm output root files, 40 DQM output files
  • TriggerResults: no differences found

@smorovic
Contributor Author

+daq

@slava77
Contributor

slava77 commented Oct 15, 2021

+reconstruction

for #35682 8fe4166

  • code changes are only in FedRawDataInputSource.cc, which is not really a part of reco, but is still in a shared package EventFilter/Utilities
  • jenkins tests pass; the differences should be related to other PRs included in the tests

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1

@cmsbuild cmsbuild merged commit aefd5fd into cms-sw:master Oct 15, 2021
@smorovic smorovic deleted the 121X-fix-ls-assert branch August 24, 2023 08:22