Extend CMSSW to a distributed application over MPI #32632
Conversation
A new Pull Request was created by @fwyzard (Andrea Bocci) for CMSSW_11_2_X. It involves the following packages: HeterogeneousCore/MPICore. @makortel, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32632/20729 ERROR: Build errors found during clang-tidy run.
please test
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32632/20747 ERROR: Build errors found during clang-tidy run.
please test |
+1 Size: This PR adds an extra 48KB to the repository. Comparison Summary:
Let multiple CMSSW processes on the same or different machines coordinate event processing and transfer data products over MPI.

The implementation is based on four CMSSW modules. Two are responsible for setting up the communication channels and coordinating the event processing:
- a "remote controller" called `MPIController`;
- a "remote source" called `MPISource`;

and two are responsible for the transfer of data products:
- a "sender" called `MPISender`;
- a "receiver" called `MPIReceiver`.

Data products can be serialised and transferred using the trivial serialisation from `HeterogeneousCore/TrivialSerialisation` - if available - or the ROOT-based serialisation.

Various tests are used to validate the implementation: matching the local and remote event id, transferring SoA products with trivial serialisation, and transferring legacy products with ROOT serialisation.

Co-authored-by: Andrea Bocci <[email protected]>
Co-authored-by: Anna Polova <[email protected]>
Co-authored-by: Mario Gonzalez <[email protected]>
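The modules described above could be wired into the "controller" process roughly along these lines. This is a hedged sketch, not the actual HeterogeneousCore/MPICore interface: the parameter names (`token`, `products`, `instance`), the module labels, and the input collection `hltPixelTracks` are illustrative assumptions.

```python
# Hypothetical CMSSW configuration sketch for the "controller" process.
# Module and parameter names are assumptions for illustration only.
import FWCore.ParameterSet.Config as cms

process = cms.Process("CONTROLLER")

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring("file:input.root"))

# Sets up the MPI communication channel and produces an MPIToken,
# while forwarding run/lumi/event transitions to the remote MPISource.
process.mpiController = cms.EDProducer("MPIController")

# Reads the MPIToken, serialises the listed collections and sends them;
# the 'instance' value must match the MPIReceiver in the other process.
process.mpiSender = cms.EDProducer("MPISender",
    token = cms.InputTag("mpiController"),
    products = cms.vstring("hltPixelTracks"),
    instance = cms.uint32(1))

process.path = cms.Path(process.mpiController + process.mpiSender)
```

Since this is a configuration fragment, it is only meant to convey the wiring: the controller runs first and produces the token, and the sender consumes that token to obtain the communication channel.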
please test |
+heterogeneous
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32632/47677
Pull request #32632 was updated. Can you please check and sign again?
+1 Size: This PR adds an extra 32KB to the repository. Comparison Summary:
+1
The PR description needs to be updated to reflect the recent developments.
PR description:
Let multiple CMSSW processes on the same or different machines coordinate event processing and transfer data products over MPI.
The implementation is based on four CMSSW modules.
Two are responsible for setting up the communication channels and coordinating the event processing:
- a "remote controller" called `MPIController`;
- a "remote source" called `MPISource`;

and two are responsible for the transfer of data products:
- a "sender" called `MPISender`;
- a "receiver" called `MPIReceiver`.

The `MPIController` is an `EDProducer` running in a regular CMSSW process. After setting up the communication with an `MPISource`, it transmits to it all EDM run, lumi and event transitions, and instructs the `MPISource` to replicate them in the second process.

The `MPISource` is a `Source` controlling the execution of a second CMSSW process. After setting up the communication with an `MPIController`, it listens for EDM run, lumi and event transitions, and replicates them in its process.

Both `MPIController` and `MPISource` produce an `MPIToken`, a special data product that encapsulates the information about the MPI communication channel.

The `MPISender` is an `EDProducer` that can read one or more collections from the Event, serialise them using their ROOT dictionaries, and send them over the MPI communication channel.

The `MPIReceiver` is an `EDProducer` that can receive a set number of collections over the MPI communication channel, deserialise them using their ROOT dictionaries, and put them in the Event with a configurable instance label.

In principle any non-transient collection with a ROOT dictionary can be transmitted. Any transient information is lost during the transfer, and needs to be recreated by the receiving side.

Each `MPISender` and `MPIReceiver` is configured with an instance value that is used to match one `MPISender` in one process to one `MPIReceiver` in another process. Using different instance values allows the use of multiple `MPISender`s/`MPIReceiver`s in a process.

Both `MPISender` and `MPIReceiver` obtain the MPI communication channel by reading an `MPIToken` from the event. They also produce a copy of the `MPIToken`, so other modules can consume it to declare a dependency on the previous modules.

A few unit tests are included in the `test/` directory.

PR validation:
These developments have been extensively validated using the Run 3 HLT menu as a test case, and the results have been presented: https://indico.cern.ch/event/1557810/contributions/6560443/.
The new unit tests pass.
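For completeness, the second ("remote") process could be sketched as the mirror image of the controller configuration. Again this is only an assumption-laden illustration: the `token` and `instance` parameter names, module labels, and the idea that the source's token is read via the label `source` are guesses, not the actual module interface.

```python
# Hypothetical CMSSW configuration sketch for the remote process.
# Module and parameter names are assumptions for illustration only.
import FWCore.ParameterSet.Config as cms

process = cms.Process("REMOTE")

# MPISource replicates the run/lumi/event transitions sent by the
# MPIController in the other process, and produces an MPIToken.
process.source = cms.Source("MPISource")

# Receives and deserialises the collections sent by the matching
# MPISender; the 'instance' value pairs it with that sender.
process.mpiReceiver = cms.EDProducer("MPIReceiver",
    token = cms.InputTag("source"),
    instance = cms.uint32(1))

process.path = cms.Path(process.mpiReceiver)
```

The two configurations would presumably be launched together as one MPI job, e.g. with an MPMD-style invocation such as `mpirun -np 1 cmsRun controller_cfg.py : -np 1 cmsRun remote_cfg.py` (the launch details are an assumption, not taken from this PR).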
Backport
To be backported to 16.0.x for testing online in parallel to the 2026 data taking.