[DAQ] fix input source raw file deletion deadlock (15_1_X) #47641
|
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47641/44178 ERROR: Build errors found during clang-tidy run. |
|
type bug-fix |
4059115 to
880d8cb
Compare
|
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47641/44179
Code check has found code style and quality issues which could be resolved by applying following patch(s)
|
880d8cb to
ec1d351
Compare
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47641/44181
|
|
Pull request #47641 was updated. |
|
@cmsbuild please test |
|
A new Pull Request was created by @smorovic for master. It involves the following packages:
@emeschi, @smorovic can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
|
Hi @smorovic |
|
The way files are deleted was changed in 15_1 and 15_0 this winter. Deletion was delegated to a helper thread to avoid spending any time in the main loop. We want to delete a file only after every event from that file has been processed (because of Error stream collection if there is a crash or exception), so the decision is made by tracking what each event stream is processing and whether it has moved on to a new file. Before this change we also had a secondary mechanism that tracked closure of a lumisection, at which point all events in it have been processed and any remaining files can be deleted. This was done via the DAQDirector service globalEndLumi callback, since that callback is not available in the input source. I thought of this mechanism as redundant, so it was removed, but removing it apparently triggered this problem. Now, with this fix, we also allow deletion to proceed if the event streams that processed this file are in the postEvent state, in which they remain until they get a new event from the source, so we no longer depend on slow assignment of new events to event streams. In fact, I think the DaqDirector endLumi method above can probably still cause the same deadlock, just with lower probability (maybe it can happen with a lot of small files in one lumisection, for example). Generally speaking, we don't have an easy way to track files, and we use these workarounds. It's not the same problem as offline ROOT files and tracking their transitions in CMSSW, because we have multiple files within a LS, while with ROOT files it is the opposite. This commit made the changes described: This was the PR: #47068 |
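The deletion condition described in this comment can be sketched roughly as follows. This is only an illustration under assumptions: `StreamState`, `StreamStatus` and `canDeleteFile` are hypothetical names invented here and are not the actual FastMonitoringService or input source types in CMSSW.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-stream state, for illustration only.
enum class StreamState { Processing, PostEvent };

struct StreamStatus {
  std::uint32_t fileId;  // raw file the stream most recently took an event from
  StreamState state;     // PostEvent: finished its event, waiting for a new one
};

// A file is safe to delete when no stream is still processing an event from it.
// A stream that last used this file but sits in PostEvent does not block deletion,
// so the decision does not wait for the stream to be assigned a new event.
bool canDeleteFile(std::uint32_t fileId, const std::vector<StreamStatus>& streams) {
  for (const auto& s : streams) {
    if (s.fileId == fileId && s.state != StreamState::PostEvent)
      return false;
  }
  return true;
}
```

With only the "stream has moved on to a new file" rule, a stream idling after its last event from a file would block deletion of that file until it received another event, which is the low-rate situation described above.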
|
+1 Size: This PR adds an extra 56KB to repository Comparison Summary:
|
|
+daq |
|
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @sextonkennedy, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2) |
|
@cms-sw/orp-l2 , I am merging this and triggering an IB to get it testing as soon as possible |
PR description:
As seen in HLT in runs with a low input rate, the source gets stuck fetching files because the streams do not get the next event and are still marked as consuming the old file. This fix checks, via the FastMonitoringService status, that no event stream is still processing the file.
PR validation:
Tested live in an emulator run in CDAQ.