-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distribution of detector blocks across MPI processes #334
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work, @anand-avinash. Thanks a lot! We can merge this once the documentation is updated and tests are added. Thank you for duly annotating all the types.
Great work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for having fixed the comments!
I believe this PR is ready to be merged, but there are a few minor issues that need discussion. The implementation of attribute-based detector blocks works well in the full simulation pipeline when Nonetheless, I tried to resolve some of the error which I am documenting here. The first error I encountered was:
which I solved simply by adding the following: litebird_sim/litebird_sim/observations.py Lines 209 to 212 in a941fd0
The next error I encountered was
It appears because litebird_sim/litebird_sim/observations.py Lines 677 to 682 in a941fd0
The next error I encountered is
It can be solved by replacing line 222 of
Next one I encountered while calling the destriper:
This one seems little non-trivial to solve. At the end we may face errors elsewhere as well that I have not tested. And it brings me to the question: should we worry about fixing this issue or should we leave it to the user to supply a well-balanced set of detectors? |
The simplest solution would probably be to check in |
As we discussed in the telecon, I have now added a new subcommunicator that excludes the processes from For the processes that do not belong to this new sub-communicator, the sub-communicator object points to a NULL communicator ( To resolve the issues I reported previously, I have replaced the global communicator with the subcommunicator at various places. However, there might be some other places where this replacement is necessary, for example @ziotom78 @paganol please let me know if we can proceed with this solution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it! Nice work!
litebird_sim/mpi.py
Outdated
comm_grid = _GridCommClass() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being a global variable that is not expected to be reassigned, I would name it MPI_COMM_GRID
, in analogy with MPI_COMM_WORLD
, MPI_ENABLED
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring a variable name is trivial if you use an advanced IDE like PyCharm. If you don't, let me know, and I'll rename it. (It’s just a matter of one mouse click.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being a global variable that is not expected to be reassigned, I would name it
MPI_COMM_GRID
, in analogy withMPI_COMM_WORLD
,MPI_ENABLED
, etc.
Do you mean removing the _GridCommClass
altogether and instead simply use MPI_COMM_GRID
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being a global variable that is not expected to be reassigned, I would name it
MPI_COMM_GRID
, in analogy withMPI_COMM_WORLD
,MPI_ENABLED
, etc.Do you mean removing the
_GridCommClass
altogether and instead simply useMPI_COMM_GRID
?
No, it's just a matter of renaming comm_grid
, as in this way it will match the naming convention of the other global variables in the mpi
module.
I have renamed |
I discovered today that a sub-communicator same as my implementation was already in place as a local variable: litebird_sim/litebird_sim/observations.py Lines 687 to 688 in 241a401
There are also the sub-communicators for detector and time blocks: litebird_sim/litebird_sim/observations.py Line 697 in 241a401
litebird_sim/litebird_sim/observations.py Line 708 in 241a401
It might be useful to make the sub-communicators for detector and time blocks the members of |
My gosh, thank you so much for having spotted this, @anand-avinash ! It was added four years ago but never documented.
Absolutely, I agree! |
fd1c451
to
b5fb9ab
Compare
I have added the sub-communicators for the detector and time blocks, and updated the documentation. It is now ready to be merged. P.S.: A new version of |
This pull request is related to the discussion #325.
Here I have added a new method
_make_detector_blocks()
to theObservation
class. It takes a list of the detectors and divides them into different groups as specified bydet_blocks_attributes
parameter passed to theObservation
class. The group of detectors are saved in the variabledetector_blocks
ofObservation
class. For instance, withdet_blocks_attributes = ["wafer", "pixel"]
, the detector groups will be made in such a way that each group will have detectors with same wafer and pixel attributes.Once the detectors are divided into the groups, it sets the number of detector blocks. The number of time blocks are computed accordingly (
n_blocks_time = comm.size // n_blocks_det
). Given the structure of the code base, this scheme was pretty easy to implement than I thought, and it doesn't breaks any other functionality.On the other side, this implementation can lead to load imbalance in different MPI processes. Consider a simulation with 2 MPI processes and 4 detectors. If the detectors are ended up in two groups, with 3 and 1 elements respectively, one MPI process will have to work with 3 times more data than the other MPI process.
Another caveat is related to the
det_idx
member ofObservation
class. Previously, it corresponded to the global list of detectors passed to theObservation
class. Now with the introduction of detector blocks, it won't be true anymore. For a simple solution, I have introduced a member variabledetectors_global
, that is the global list of detectors sorted according the detector groups. Now thedet_idx
will correspond to the indices of the detectors indetectors_global
. But then we end up with three variablesdetectors
(the list of detectors used to initialize theObservation
class),Observation.detector_blocks
, andObservation.detectors_global
storing the same data in different manners.After adding documentation, this pull request should become ready to be merged. But please let me know if you have other solutions.