@fwyzard
Contributor

@fwyzard fwyzard commented Jan 12, 2021

The PR description needs to be updated to reflect the recent developments.

PR description:

Let multiple CMSSW processes on the same or different machines coordinate event processing and transfer data products over MPI.

The implementation is based on four CMSSW modules.
Two are responsible for setting up the communication channels and coordinating the event processing:

  • a "remote controller" called MPIController
  • a "remote source" called MPISource

and two are responsible for the transfer of data products:

  • a "sender" called MPISender
  • a "receiver" called MPIReceiver

The MPIController is an EDProducer running in a regular CMSSW process. After setting up the communication with an MPISource, it forwards all EDM run, lumi and event transitions to the MPISource and instructs it to replicate them in the second process.

The MPISource is a Source controlling the execution of a second CMSSW process. After setting up the communication with an MPIController, it listens for EDM run, lumi and event transitions, and replicates them in its process.
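
The controller/source handshake described above can be pictured with a small, MPI-free sketch. Everything in it is hypothetical: `Transition`, `Message`, `Channel` and `replay` are illustrative names, and the in-memory queue only stands in for the MPI communication channel. It is not the implementation in this PR, just a way to see the controller side emitting transitions and the source side replaying them in order.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Hypothetical transition kinds; the real modules define their own message types.
enum class Transition { BeginRun, BeginLumi, Event, EndLumi, EndRun, Shutdown };

// One message per framework transition, carrying the identifiers to replicate.
struct Message {
  Transition kind;
  std::uint32_t run;
  std::uint32_t lumi;
  std::uint64_t event;
};

// In-memory stand-in for the MPI channel: messages arrive in the order sent.
class Channel {
public:
  void send(Message const& m) { queue_.push_back(m); }
  bool empty() const { return queue_.empty(); }
  Message receive() {
    assert(!queue_.empty());
    Message m = queue_.front();
    queue_.pop_front();
    return m;
  }

private:
  std::deque<Message> queue_;
};

// "Source" side: replay each transition until Shutdown or until the channel drains.
// Returns a log of the transitions that were replicated.
std::vector<std::string> replay(Channel& channel) {
  std::vector<std::string> log;
  while (!channel.empty()) {
    Message const m = channel.receive();
    std::string const id =
        std::to_string(m.run) + ":" + std::to_string(m.lumi) + ":" + std::to_string(m.event);
    switch (m.kind) {
      case Transition::BeginRun:  log.push_back("beginRun " + id); break;
      case Transition::BeginLumi: log.push_back("beginLumi " + id); break;
      case Transition::Event:     log.push_back("event " + id); break;
      case Transition::EndLumi:   log.push_back("endLumi " + id); break;
      case Transition::EndRun:    log.push_back("endRun " + id); break;
      case Transition::Shutdown:  return log;  // stop replaying; later messages stay queued
    }
  }
  return log;
}
```

In this sketch the "controller" is whatever code calls `channel.send(...)` for each transition it sees; the "source" calls `replay(channel)` and drives its own process from the returned sequence.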

Both MPIController and MPISource produce an MPIToken, a special data product that encapsulates the information about the MPI communication channel.

The MPISender is an EDProducer that can read one or more collections from the Event, serialise them using their ROOT dictionaries, and send them over the MPI communication channel.

The MPIReceiver is an EDProducer that can receive a set number of collections over the MPI communication channel, deserialise them using their ROOT dictionaries, and put them in the Event with a configurable instance label.

In principle any non-transient collection with a ROOT dictionary can be transmitted. Any transient information is lost during the transfer, and needs to be recreated by the receiving side.
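
To illustrate the point about transient information, here is a minimal sketch that uses a hand-written byte copy as a stand-in for ROOT dictionary-based streaming. The `Track` type, its `ptSquaredCache` member and the three helper functions are hypothetical and exist only for this example; the real modules rely on ROOT dictionaries rather than manual `memcpy`.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical data product: two persistent members and one transient cache.
struct Track {
  float pt = 0.f;
  float eta = 0.f;
  float ptSquaredCache = -1.f;  // transient: never written to the buffer
};

// Sender side: copy only the persistent members into a byte buffer.
std::vector<unsigned char> serialise(Track const& track) {
  std::vector<unsigned char> buffer(2 * sizeof(float));
  std::memcpy(buffer.data(), &track.pt, sizeof(float));
  std::memcpy(buffer.data() + sizeof(float), &track.eta, sizeof(float));
  return buffer;
}

// Receiver side: rebuild the object; the transient member keeps its default value.
Track deserialise(std::vector<unsigned char> const& buffer) {
  assert(buffer.size() == 2 * sizeof(float));
  Track track;
  std::memcpy(&track.pt, buffer.data(), sizeof(float));
  std::memcpy(&track.eta, buffer.data() + sizeof(float), sizeof(float));
  return track;
}

// Receiver side: recompute whatever the transfer dropped.
void recreateTransients(Track& track) { track.ptSquaredCache = track.pt * track.pt; }
```

After `deserialise(serialise(track))` the cache holds its default value, so the receiving process has to call something like `recreateTransients` before relying on it.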

Each MPISender and MPIReceiver is configured with an instance value that is used to match one MPISender in one process to one MPIReceiver in another process. Using different instance values allows the use of multiple MPISenders/MPIReceivers in a process.
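
One way to picture the instance matching is to treat the instance value as a routing key, much like a message tag. The sketch below is an assumption made for illustration only: `TaggedChannel` is a hypothetical in-memory class, and this description does not say how the real modules map instance values onto MPI messages.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <utility>

// Hypothetical in-memory channel that keeps one ordered queue per instance value.
class TaggedChannel {
public:
  // "Sender" side: queue a payload under the sender's instance value.
  void send(int instance, std::string payload) { pending_[instance].push(std::move(payload)); }

  bool hasMessage(int instance) const {
    auto it = pending_.find(instance);
    return it != pending_.end() && !it->second.empty();
  }

  // "Receiver" side: only payloads sent with the same instance value are visible.
  std::string receive(int instance) {
    assert(hasMessage(instance));
    std::queue<std::string>& queue = pending_[instance];
    std::string payload = std::move(queue.front());
    queue.pop();
    return payload;
  }

private:
  std::map<int, std::queue<std::string>> pending_;
};
```

With two sender/receiver pairs using instances 1 and 2, each receiver sees only the payloads of its matching sender, regardless of the order in which the senders ran.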

Both MPISender and MPIReceiver obtain the MPI communication channel by reading an MPIToken from the event. They also produce a copy of the MPIToken, so other modules can consume it to declare a dependency on the previous modules.

A few unit tests are included in the test/ directory.

PR validation:

These developments have been extensively validated using the Run 3 HLT menu as a test case, and the results have been presented: https://indico.cern.ch/event/1557810/contributions/6560443/.

The new unit tests pass.

Backport

To be backported to 16.0.x for testing online in parallel to the 2026 data taking.

@cmsbuild
Contributor

cmsbuild commented Jan 12, 2021

A new Pull Request was created by @fwyzard (Andrea Bocci) for CMSSW_11_2_X.

It involves the following packages:

HeterogeneousCore/MPICore
HeterogeneousCore/MPIServices

@makortel, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks.
@makortel, @rovere this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@fwyzard fwyzard changed the title from "Mpi updates" to "Implement a simple CMSSW client/server over MPI" Jan 12, 2021
@fwyzard
Contributor Author

fwyzard commented Jan 12, 2021

@makortel @felicepantaleo @rovere FYI.

@cmsbuild
Contributor

Pull request #32632 was updated. @makortel, @cmsbuild, @fwyzard can you please check and sign again.

@fwyzard fwyzard changed the base branch from CMSSW_11_2_X to master January 13, 2021 17:47
@cmsbuild cmsbuild modified the milestones: CMSSW_11_2_X, CMSSW_11_3_X Jan 13, 2021
@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32632/20729

ERROR: Build errors found during clang-tidy run.

HeterogeneousCore/MPICore/plugins/messages.cc:19:3: error: constexpr variable 'types' must be initialized by a constant expression [clang-diagnostic-error]
  DECLARE_MPI_TYPE(EDM_MPI_Empty,    // MPI_Datatype
  ^
HeterogeneousCore/MPICore/plugins/macros.h:102:28: note: expanded from macro 'DECLARE_MPI_TYPE'
--
HeterogeneousCore/MPICore/plugins/messages.cc:25:3: error: constexpr variable 'types' must be initialized by a constant expression [clang-diagnostic-error]
  DECLARE_MPI_TYPE(EDM_MPI_RunAuxiliary,    // MPI_Datatype
  ^
HeterogeneousCore/MPICore/plugins/macros.h:102:28: note: expanded from macro 'DECLARE_MPI_TYPE'
--
HeterogeneousCore/MPICore/plugins/messages.cc:35:3: error: constexpr variable 'types' must be initialized by a constant expression [clang-diagnostic-error]
  DECLARE_MPI_TYPE(EDM_MPI_LuminosityBlockAuxiliary,    // MPI_Datatype
  ^
HeterogeneousCore/MPICore/plugins/macros.h:102:28: note: expanded from macro 'DECLARE_MPI_TYPE'
--
HeterogeneousCore/MPICore/plugins/messages.cc:46:3: error: constexpr variable 'types' must be initialized by a constant expression [clang-diagnostic-error]
  DECLARE_MPI_TYPE(EDM_MPI_EventAuxiliary,    // MPI_Datatype
  ^
HeterogeneousCore/MPICore/plugins/macros.h:102:28: note: expanded from macro 'DECLARE_MPI_TYPE'
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:128: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@fwyzard
Contributor Author

fwyzard commented Jan 14, 2021

please test

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32632/20747

ERROR: Build errors found during clang-tidy run.

HeterogeneousCore/MPICore/plugins/macros.h:58:45: note: cast from 'void *' is not allowed in a constant expression
--
HeterogeneousCore/MPICore/plugins/macros.h:61:26: error: constexpr variable 'mpi_type<long double>' must be initialized by a constant expression [clang-diagnostic-error]
  constexpr MPI_Datatype mpi_type<long double> = MPI_LONG_DOUBLE;
                         ^
HeterogeneousCore/MPICore/plugins/macros.h:61:50: note: cast from 'void *' is not allowed in a constant expression
--
HeterogeneousCore/MPICore/plugins/macros.h:64:26: error: constexpr variable 'mpi_type<std::byte>' must be initialized by a constant expression [clang-diagnostic-error]
  constexpr MPI_Datatype mpi_type<std::byte> = MPI_BYTE;
                         ^
HeterogeneousCore/MPICore/plugins/macros.h:64:48: note: cast from 'void *' is not allowed in a constant expression
--
HeterogeneousCore/MPICore/plugins/messages.cc:19:3: error: constexpr variable 'types' must be initialized by a constant expression [clang-diagnostic-error]
  DECLARE_MPI_TYPE(EDM_MPI_Empty,    // MPI_Datatype
  ^
HeterogeneousCore/MPICore/plugins/macros.h:102:28: note: expanded from macro 'DECLARE_MPI_TYPE'
--
HeterogeneousCore/MPICore/plugins/messages.cc:25:3: error: constexpr variable 'types' must be initialized by a constant expression [clang-diagnostic-error]
  DECLARE_MPI_TYPE(EDM_MPI_RunAuxiliary,    // MPI_Datatype
  ^
HeterogeneousCore/MPICore/plugins/macros.h:102:28: note: expanded from macro 'DECLARE_MPI_TYPE'
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:128: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2
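
The "cast from 'void *' is not allowed in a constant expression" notes are consistent with an MPI implementation in which predefined datatypes such as `MPI_BYTE` are macros that cast the address of a global object through `void *`, as Open MPI does. Such an expression cannot initialise a `constexpr` variable under clang. The sketch below reproduces the pattern without `mpi.h`, using hypothetical `Fake_*` stand-ins, and shows one possible workaround: resolving the handle through a function template at run time. How the PR actually resolved the error is not shown in this log.

```cpp
#include <cassert>

// Stand-ins for an MPI implementation whose handles are pointers to global objects.
// All Fake_* / fake_* names are hypothetical and only mimic the macro pattern.
struct fake_datatype_t {};
using Fake_Datatype = fake_datatype_t*;
fake_datatype_t fake_mpi_byte;
fake_datatype_t fake_mpi_long_double;
#define FAKE_MPI_BYTE ((Fake_Datatype)((void*)&(fake_mpi_byte)))
#define FAKE_MPI_LONG_DOUBLE ((Fake_Datatype)((void*)&(fake_mpi_long_double)))

// The rejected pattern would look like this (kept as a comment on purpose):
//   template <typename T> constexpr Fake_Datatype mpi_type_v = ...;
//   template <> constexpr Fake_Datatype mpi_type_v<char> = FAKE_MPI_BYTE;

// Workaround sketch: a function template evaluated at run time, so the
// initialiser no longer has to be a constant expression.
template <typename T>
Fake_Datatype mpi_type();

template <>
inline Fake_Datatype mpi_type<char>() {
  return FAKE_MPI_BYTE;
}

template <>
inline Fake_Datatype mpi_type<long double>() {
  return FAKE_MPI_LONG_DOUBLE;
}

// Small usage example: every mapped type yields a valid, distinct handle.
inline bool handles_are_distinct() {
  Fake_Datatype byte_type = mpi_type<char>();
  Fake_Datatype long_double_type = mpi_type<long double>();
  assert(byte_type != nullptr && long_double_type != nullptr);
  return byte_type != long_double_type;
}
```

The trade-off is that the mapping can no longer be used where a compile-time constant is required, for example inside another `constexpr` table such as the `types` arrays named in the errors above.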

@fwyzard
Contributor Author

fwyzard commented Jan 14, 2021

please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 48KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-674e41/50840/summary.html
COMMIT: 3db895e
CMSSW: CMSSW_16_1_X_2026-01-23-1100/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/32632/50840/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 52
  • DQMHistoTests: Total histograms compared: 4025536
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4025513
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 51 files compared)
  • Checked 222 log files, 193 edm output root files, 52 DQM output files
  • TriggerResults: no differences found

Let multiple CMSSW processes on the same or different machines coordinate event
processing and transfer data products over MPI.

The implementation is based on four CMSSW modules. Two are responsible for
setting up the communication channels and coordinating the event processing:
  - a "remote controller" called MPIController;
  - a "remote source" called MPISource;
and two are responsible for the transfer of data products:
  - a "sender" called MPISender;
  - a "receiver" called MPIReceiver.

Data products can be serialised and transferred using the trivial serialisation
from HeterogeneousCore/TrivialSerialisation - if available - or the ROOT-based
serialisation.

Various tests are used to validate the implementation: matching the local and
remote event id, transferring SoA products with trivial serialisation, and
transferring legacy products with ROOT serialisation.

Co-authored-by: Andrea Bocci <[email protected]>
Co-authored-by: Anna Polova <[email protected]>
Co-authored-by: Mario Gonzalez <[email protected]>
@fwyzard
Contributor Author

fwyzard commented Jan 23, 2026

please test

@fwyzard
Contributor Author

fwyzard commented Jan 23, 2026

+heterogeneous

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32632/47677

@cmsbuild
Contributor

Pull request #32632 was updated. can you please check and sign again.

@cmsbuild
Contributor

+1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-674e41/50869/summary.html
COMMIT: bb50fbd
CMSSW: CMSSW_16_1_X_2026-01-23-1100/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/32632/50869/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 2 lines from the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 52
  • DQMHistoTests: Total histograms compared: 4025536
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4025510
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 51 files compared)
  • Checked 222 log files, 193 edm output root files, 52 DQM output files
  • TriggerResults: no differences found

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit e9b86c6 into cms-sw:master Jan 24, 2026
10 checks passed

8 participants