@wddgit
Contributor

@wddgit wddgit commented Feb 26, 2025

PR description:

Use the new LookupInitializationComplete signal to give Services access to the PathsAndConsumesOfModules object. Previously the PreBeginJob signal was used for this purpose, but since PR #46929 was merged, the object is only partially filled at preBeginJob; it is fully filled when the new signal is emitted. PathsAndConsumesOfModules now includes information related to EventSetup modules, which it did not contain before, and that is the part that is not filled at preBeginJob.
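To illustrate the ordering problem described above, here is a minimal toy model, not CMSSW code. All names in it (ToyPathsAndConsumes, ToySignal, ToyRegistry, ToyService, runToyStartup) are hypothetical stand-ins, and the module names are made up. It only assumes what the description states: preBeginJob no longer carries the object, and the EventSetup part is filled by the time the later signal is emitted.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for PathsAndConsumesOfModules (not the CMSSW class).
struct ToyPathsAndConsumes {
  std::vector<std::string> edModules;  // assumed filled before preBeginJob
  std::vector<std::string> esModules;  // assumed filled only after preBeginJob
};

// Minimal signal: a list of callbacks invoked in registration order.
template <typename... Args>
class ToySignal {
public:
  void connect(std::function<void(Args...)> slot) { slots_.push_back(std::move(slot)); }
  void emit(Args... args) const {
    for (auto const& slot : slots_) {
      slot(args...);
    }
  }

private:
  std::vector<std::function<void(Args...)>> slots_;
};

// Toy registry mirroring the interface change: preBeginJob has no argument,
// the later signal delivers the fully filled object.
struct ToyRegistry {
  ToySignal<> preBeginJob;
  ToySignal<ToyPathsAndConsumes const&> lookupInitializationComplete;
};

// A service that needs the complete object subscribes to the later signal.
struct ToyService {
  explicit ToyService(ToyRegistry& registry) {
    registry.lookupInitializationComplete.connect([this](ToyPathsAndConsumes const& pc) {
      edModulesSeen = pc.edModules.size();
      esModulesSeen = pc.esModules.size();
    });
  }
  std::size_t edModulesSeen = 0;
  std::size_t esModulesSeen = 0;
};

// Toy startup sequence: the EventSetup part is filled between the two signals.
inline void runToyStartup(ToyRegistry& registry) {
  ToyPathsAndConsumes pc;
  pc.edModules = {"producerA", "analyzerB"};
  registry.preBeginJob.emit();  // the object would be only partially filled here
  pc.esModules = {"esProducerC"};
  registry.lookupInitializationComplete.emit(pc);  // fully filled
}
```

In this sketch a service that had read the object at preBeginJob would have seen an empty esModules list, which is the reason for moving to the later signal.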

This shouldn't affect the output of anything (except DependencyGraph service used with SubProcesses, see below).

This PR removes the argument from the preBeginJob signal interface. For several of the services modified in this PR, that argument is not used and the PR just deletes it from the argument list. These changes are trivial and the fact that those services compile should be a sufficient test.

This PR modifies other affected services to use the new signal.

  • A few services are used in Framework unit tests and those tests pass with the changes in this PR.
  • DependencyGraph runs and produces a graph successfully. If one modifies it to run without SubProcesses, the output .gv file is identical before and after this PR. (Note that with SubProcesses the version of DependencyGraph before this PR has a bug that is fixed with the changes in this PR. The old version assumes the PostBeginRun signal is a "local" signal in ActivityRegistry, but it is "global", so none of the SubProcess info is included in the .gv file that is output by the old version.)
  • FastTimerService is moved to the new signal and it looks like nothing should change in the output. There are many references to FastTimerService in CMSSW, but I don't see any unit test devoted to testing it. Is there an existing test of FastTimerService somewhere that I could run? Or is there an expert willing to test this? It is probably just a matter of running and checking the output is reasonable. Nothing should change.

There is one Service where the changes are more complex: the NVProfilerService. It uses the PathsAndConsumesOfModules object to size some vectors at PreBeginJob. Not only is this case more complicated, but the current implementation also appears to be incorrect, with a bug that could cause out-of-bounds vector access. The vectors were sized from the size of the ProcessCallGraph, which can be too small if modules are deleted, and the framework does delete unused modules. The ID from the ModuleDescription is incremented for every module constructed, including ones that are later deleted, and it is not modified when a module is deleted. The index used to access the vectors is this ID, but in the old version the size is a count of undeleted modules derived from PathsAndConsumesOfModules, which has the deleted modules removed. The new version fixes this by deriving the size from the number of modules constructed, including the ones that are later deleted.
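The sizing problem can be shown with a small standalone sketch. This is not the NVProfilerService code: ToyModule and the three helper functions are hypothetical, and the sketch assumes IDs start at 0 and increase by one per constructed module, which may not match the framework exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a module description; only the ID matters here.
struct ToyModule {
  unsigned int id;  // assigned at construction, never renumbered after deletions
};

// Old-style sizing: one slot per surviving (undeleted) module.
inline std::size_t sizeFromSurvivors(std::vector<ToyModule> const& survivors) {
  return survivors.size();
}

// New-style sizing: one slot per module ever constructed, deleted or not.
inline std::size_t sizeFromConstructedCount(unsigned int nConstructed) {
  return nConstructed;
}

// True if every surviving module's ID is a valid index into a vector of `size`.
inline bool allIdsInBounds(std::vector<ToyModule> const& survivors, std::size_t size) {
  for (auto const& module : survivors) {
    if (module.id >= size) {
      return false;
    }
  }
  return true;
}
```

For example, if five modules are constructed with IDs 0 to 4 and the ones with IDs 1 and 3 are deleted, the survivor count is 3 while the largest surviving ID is 4, so a vector sized by the survivor count would be indexed out of bounds. Sizing by the constructed count (5) keeps every ID in range.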

Note that there are no tests in the CMSSW repository of NVProfilerService. In fact there are no references to it at all in CMSSW. Is there a test of NVProfilerService somewhere that I could run? Or is there an expert willing to test this?

Should I develop tests for these two services? If necessary, this could be accomplished in a separate PR...

Resolves cms-sw/framework-team#1194

PR validation:

Relied on existing unit tests which all pass.

@cmsbuild
Contributor

cmsbuild commented Feb 26, 2025

cms-bot internal usage

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47467/43899

@cmsbuild
Contributor

A new Pull Request was created by @wddgit for master.

It involves the following packages:

  • EventFilter/Utilities (daq)
  • FWCore/Framework (core)
  • FWCore/Integration (core)
  • FWCore/PrescaleService (core)
  • FWCore/ServiceRegistry (core)
  • FWCore/Services (core)
  • FWCore/TestProcessor (core)
  • HLTrigger/Timer (hlt)
  • HeterogeneousCore/CUDAServices (heterogeneous)
  • HeterogeneousCore/SonicTriton (heterogeneous)
  • IOMC/RandomEngine (core)
  • PerfTools/AllocMonitor (core)

@Dr15Jones, @Martin-Grunewald, @cmsbuild, @emeschi, @fwyzard, @makortel, @mmusich, @smorovic, @smuzaffar can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @fabiocos, @felicepantaleo, @fwyzard, @kpedro88, @makortel, @missirol, @mmusich, @riga, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor

@cmsbuild, please test

@wddgit
Contributor Author

wddgit commented Feb 26, 2025

I just noticed that FastTimerService will also see a behavior change caused by this PR if used with SubProcesses. The activity that used to be in PostBeginJob would not have been executed for SubProcess signals and now would be. Not sure if this matters, as we are not currently using SubProcess and are discussing removing the SubProcess feature entirely. Possibly this also fixes a problem.

The behavior of other services would not be affected by this PR, but they might be having similar problems if used with SubProcess. NVProfilerService uses the PostBeginJob signal and might or might not be affected (even if it has nothing to do with this PR).

@makortel
Contributor

Should I develop tests for these two services?

No.

@mmusich
Contributor

mmusich commented Feb 26, 2025

@cmsbuild, please abort

@mmusich
Contributor

mmusich commented Feb 26, 2025

test parameters:

  • enable = hlt_p2_timing

@mmusich
Contributor

mmusich commented Feb 26, 2025

@cmsbuild, please test

@wddgit
Contributor Author

wddgit commented Feb 26, 2025

@mmusich Thanks.

@fwyzard
Contributor

fwyzard commented Feb 26, 2025

hold

@fwyzard
Contributor

fwyzard commented Feb 26, 2025

until I have time to assess the impact on the FastTimerService

@cmsbuild
Contributor

Pull request has been put on hold by @fwyzard
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@wddgit
Contributor Author

wddgit commented Apr 10, 2025

please test

@fwyzard
Contributor

fwyzard commented Apr 10, 2025

@wddgit thanks for the update. The only question I still have is here.

@wddgit
Contributor Author

wddgit commented Apr 10, 2025

I think I resolved the comments. Let me know if there are more. Two changes since you looked at it:

  • Removed callgraph_
  • Added call to nvtxDomainMark

@cmsbuild
Contributor

+1

Size: This PR adds an extra 24KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b40efd/45503/summary.html
COMMIT: 634b7fd
CMSSW: CMSSW_15_1_X_2025-04-10-1100/el8_amd64_gcc12
Additional Tests: HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47467/45503/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 4 lines to the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3916361
  • DQMHistoTests: Total failures: 14
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3916327
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor

+core

@fwyzard
Contributor

fwyzard commented Apr 11, 2025

+heterogeneous

@wddgit
Contributor Author

wddgit commented Apr 11, 2025

This might conflict with #47659. It will be trivial to fix, but I will rebase on top of that one as soon as it is merged. (Possibly git will be smart enough to handle it automatically.) Probably not a good idea to merge them at the same time...

@mmusich
Contributor

mmusich commented Apr 11, 2025

+hlt

@makortel
Contributor

Since this PR has more signatures than #47659, maybe it would be better to merge this one first if @cms-sw/daq-l2 signs soon?

@wddgit
Contributor Author

wddgit commented Apr 11, 2025

Either order should work fine. Just not both at the same time. I agree that doing this one first makes sense if we get all the signatures.

@smorovic
Contributor

+daq

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit 6eacf8c into cms-sw:master Apr 12, 2025
13 checks passed

Development

Successfully merging this pull request may close these issues.

Remove access to partially filled PathsAndConsumes object

7 participants