Skip to content

Conversation

@smorovic
Copy link
Contributor

PR description:

As seen in HLT at low input rate runs, source gets stuck in fetching files because streams do not get next event and are still in status of consuming the old file. This fix checks FastMonitoringService status that no event stream is processing this file.

PR validation:

Tested live in emulator run in CDAQ.

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 20, 2025

cms-bot internal usage

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47641/44178

ERROR: Build errors found during clang-tidy run.

src/EventFilter/Utilities/src/FedRawDataInputSource.cc:730:65: error: expected ')' [clang-diagnostic-error]
  730 |             if (fileLSOpen && (!fms_ || !fms_->streamIsIdle(i)) {
      |                                                                 ^
src/EventFilter/Utilities/src/FedRawDataInputSource.cc:730:16: note: to match this '('
--
src/EventFilter/Utilities/src/FedRawDataInputSource.cc:734:11: error: expected statement [clang-diagnostic-error]
  734 |           }
      |           ^
Suppressed 2082 warnings (2078 in non-user code, 4 NOLINT).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:129: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@smorovic
Copy link
Contributor Author

type bug-fix

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47641/44179

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47641/44181

@cmsbuild
Copy link
Contributor

Pull request #47641 was updated.

@smorovic
Copy link
Contributor Author

@cmsbuild please test

@smorovic smorovic marked this pull request as ready for review March 20, 2025 16:45
@cmsbuild
Copy link
Contributor

A new Pull Request was created by @smorovic for master.

It involves the following packages:

  • EventFilter/Utilities (daq)

@emeschi, @smorovic can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @missirol this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@srimanob
Copy link
Contributor

Hi @smorovic
Could you elaborate more why it happens in this 15_0? Something changes in DAQ code that leads to this issue. I can't understand from the PR description. Thanks.

@smorovic
Copy link
Contributor Author

smorovic commented Mar 20, 2025

The way files are deleted was changed in 15_1 and 15_0 this winter. It was delegated to a helper thread to avoid spending any time in the main loop.

The decision of deleting file, which we want to do only after each event from that file is processed (due to collection of Error stream if there is a crash or exception) is done by tracking what each event stream is processing and whether it has moved on to a new file. Before this change we also had a secondary mechanism that tracked closure of lumisection and therefore all events in it are processed and any remaining files can be deleted. This was done via the DAQDirector service globalEndLumi callback, since that callback is not available in the input source. I throught of this mechanism as redundant so it was removed, but apparently triggered this problem.
Also, the patch introduced the limit of files that are in deletion queue so that we don't get processes that has too many open files left, so in combination with above it created this problem.

Now, with this fix, we also allow to proceed with deletion if event-streams that processed this file are in postEvent state, in which they are until they get new event from the source, so we don't depend on slow assignment of new events to event-stream. In fact, I think DaqDirector endLumi method above can still probably cause the same deadlock, just less probable in that way, but maybe it can happen with a lot of small files in one lumisection, for example).

Generally speaking, we don't have an easy way to track files and we use these workarounds. It's not the same problem as offline root files and tracking their transition in CMSSW, because we have multiple files within a LS, while with root files it is the opposite.

This commit made changes described:

commit 7a99b4babb32578ad1a519182e2ec533f4715b64
Author: Srecko <[email protected]>
Date:   Thu Jul 4 15:25:48 2024 +0200

    file deleter thread instead of DAQ director
    improved synchronization between threads (less sleeping)

This was the PR: #47068

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 56KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3a6dc0/45115/summary.html
COMMIT: ec1d351
CMSSW: CMSSW_15_1_X_2025-03-20-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47641/45115/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

@smorovic
Copy link
Contributor Author

+daq

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @sextonkennedy, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@smuzaffar
Copy link
Contributor

@cms-sw/orp-l2 , I am merging this and triggering an IB to get it testing as soon as possible

@smuzaffar smuzaffar merged commit 0a37ce5 into cms-sw:master Mar 20, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants