Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distribution of detector blocks across MPI processes #334

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

anand-avinash
Copy link
Contributor

@anand-avinash anand-avinash commented Oct 9, 2024

This pull request is related to the discussion #325.

Here I have added a new method _make_detector_blocks() to the Observation class. It takes a list of the detectors and divides them into different groups as specified by det_blocks_attributes parameter passed to the Observation class. The group of detectors are saved in the variable detector_blocks of Observation class. For instance, with det_blocks_attributes = ["wafer", "pixel"], the detector groups will be made in such a way that each group will have detectors with same wafer and pixel attributes.

Once the detectors are divided into the groups, it sets the number of detector blocks. The number of time blocks are computed accordingly (n_blocks_time = comm.size // n_blocks_det). Given the structure of the code base, this scheme was pretty easy to implement than I thought, and it doesn't breaks any other functionality.

On the other side, this implementation can lead to load imbalance in different MPI processes. Consider a simulation with 2 MPI processes and 4 detectors. If the detectors are ended up in two groups, with 3 and 1 elements respectively, one MPI process will have to work with 3 times more data than the other MPI process.

Another caveat is related to the det_idx member of Observation class. Previously, it corresponded to the global list of detectors passed to the Observation class. Now with the introduction of detector blocks, it won't be true anymore. For a simple solution, I have introduced a member variable detectors_global, that is the global list of detectors sorted according the detector groups. Now the det_idx will correspond to the indices of the detectors in detectors_global. But then we end up with three variables detectors (the list of detectors used to initialize the Observation class), Observation.detector_blocks, and Observation.detectors_global storing the same data in different manners.

After adding documentation, this pull request should become ready to be merged. But please let me know if you have other solutions.

Copy link
Member

@ziotom78 ziotom78 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work, @anand-avinash. Thanks a lot! We can merge this once the documentation is updated and tests are added. Thank you for duly annotating all the types.

Great work!

litebird_sim/detectors.py Outdated Show resolved Hide resolved
litebird_sim/distribute.py Outdated Show resolved Hide resolved
litebird_sim/observations.py Outdated Show resolved Hide resolved
litebird_sim/observations.py Outdated Show resolved Hide resolved
litebird_sim/observations.py Outdated Show resolved Hide resolved
litebird_sim/observations.py Outdated Show resolved Hide resolved
litebird_sim/simulations.py Outdated Show resolved Hide resolved
@ziotom78 ziotom78 self-requested a review October 16, 2024 10:08
Copy link
Member

@ziotom78 ziotom78 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for having fixed the comments!

@anand-avinash
Copy link
Contributor Author

I believe this PR is ready to be merged, but there are a few minor issues that need discussion. The implementation of attribute-based detector blocks works well in the full simulation pipeline when comm.size equals n_blocks_det * n_blocks_time. However, we encounter errors when comm.size is larger, as this leaves some MPI processes unused. The Observation classes on the unused processes are still being queried at several places, leading to the errors. When we provide det_blocks_attributes, the n_blocks_time is computed during runtime using the value of comm.size (n_blocks_time = comm.size // n_blocks_det). So we cannot avoid this situation unless we do the simulations with a well-balanced set of detectors.

Nonetheless, I tried to resolve some of the error which I am documenting here.

The first error I encountered was:

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 127, in <module>
    binned_maps = lbs.make_binned_map(nside, sim.observations)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 325, in make_binned_map
    detector_mask_list = _build_mask_detector_split(detector_split, obs_list)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/common.py", line 515, in _build_mask_detector_split
    detector_mask.append(np.ones(cur_obs.n_detectors, dtype=bool))
                                 ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/observations.py", line 212, in n_detectors
    return len(self.det_idx)
           ^^^^^^^^^^^^^^^^^
TypeError: object of type 'NoneType' has no len()

which I solved simply by adding the following:

if self.det_idx is None:
return 0
else:
return len(self.det_idx)


The next error I encountered was

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 127, in <module>
    binned_maps = lbs.make_binned_map(nside, sim.observations)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 329, in make_binned_map
    nobs_matrix = _build_nobs_matrix(
                  ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 209, in _build_nobs_matrix
    cur_weights = get_map_making_weights(cur_obs, check=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/common.py", line 107, in get_map_making_weights
    weights = observations.sampling_rate_hz * observations.net_ukrts**2
                                              ~~~~~~~~~~~~~~~~~~~~~~^^~
TypeError: unsupported operand type(s) for ** or pow(): 'NoneType' and 'int'

It appears because setattr_det_global() sets the detector attributes to None on the processes that are unused. I was able to fix it by setting the numerical attributes to 0 instead of None:

if not is_in_grid: # The process does not own any detector (and TOD)
null_det = DetectorInfo()
attribute = getattr(null_det, name, None)
value = 0 if isinstance(attribute, numbers.Number) else None
setattr(self, name, value)
return


The next error I encountered is

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 127, in <module>
    binned_maps = lbs.make_binned_map(nside, sim.observations)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 329, in make_binned_map
    nobs_matrix = _build_nobs_matrix(
                  ^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/binner.py", line 224, in _build_nobs_matrix
    pixidx_all, polang_all = _compute_pixel_indices(
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/common.py", line 222, in _compute_pixel_indices
    del curr_pointings_det
        ^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'curr_pointings_det' where it is not associated with a value

It can be solved by replacing line 222 of litebird_sim/common.py by following:

        try:
            del curr_pointings_det
        except UnboundLocalError:
            pass

Next one I encountered while calling the destriper:

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 1619, in make_destriped_map
    len(params.samples_per_baseline)
TypeError: object of type 'int' has no len()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/BrahMap/test_examples/cross_talk/detector_blocks_full_example.py", line 131, in <module>
    destriped_maps = lbs.make_destriped_map(nside, sim.observations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 1630, in make_destriped_map
    baseline_lengths_list = [
                            ^
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 1631, in <listcomp>
    split_items_evenly(
  File "/mnt/Data/Projects/uniroma2/coding/dev_cosmomap2/lbsim_v0.13.0/litebird_sim/mapmaking/destriper.py", line 101, in split_items_evenly
    assert sub_n < n, "sub_n={0} is not smaller than n={1}".format(sub_n, n)
           ^^^^^^^^^
AssertionError: sub_n=100 is not smaller than n=0

This one seems little non-trivial to solve.

At the end we may face errors elsewhere as well that I have not tested. And it brings me to the question: should we worry about fixing this issue or should we leave it to the user to supply a well-balanced set of detectors?

@ziotom78
Copy link
Member

At the end we may face errors elsewhere as well that I have not tested. And it brings me to the question: should we worry about fixing this issue or should we leave it to the user to supply a well-balanced set of detectors?

The simplest solution would probably be to check in Simulation.__init__ and crash if the number of MPI processes is too large. I’m not sure that this is the best option, we should ask the opinion of people of the Simulation Production Team.

@anand-avinash
Copy link
Contributor Author

As we discussed in the telecon, I have now added a new subcommunicator that excludes the processes from MPI.COMM_WORLD where there are no detectors. It includes all the processes for which rank < n_blocks_det*n_blocks_time. The new subcommunicator is encapsulated in a class instance (litebird_sim.mpi.comm_grid) defined in mpi.py, with explicitly defined setter function so that it cannot be overwritten accidently. The new subcommunicator can be accessed with litebird_sim.mpi.comm_grid.COMM_OBS_GRID. By default, this subcommnicator is equal to MPI.COMM_WORLD unless the observations are defined. Depending on the number of detector blocks and time blocks, the value of the subcommunicator is updated.

For the processes that do not belong to this new sub-communicator, the sub-communicator object points to a NULL communicator (MPI.COMM_NULL). When using litebird_sim.mpi.COMM_OBS_GRID, one must check if it is not same as the NULL communicator. The general NULL communicator can be accessed with litebird_sim.mpi.comm_grid.COMM_NULL.

To resolve the issues I reported previously, I have replaced the global communicator with the subcommunicator at various places. However, there might be some other places where this replacement is necessary, for example io.py, hwp_sys.py, etc. We can update them if someone encounters an error in future.

@ziotom78 @paganol please let me know if we can proceed with this solution.

Copy link
Member

@ziotom78 ziotom78 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! Nice work!

Comment on lines 64 to 65
comm_grid = _GridCommClass()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being a global variable that is not expected to be reassigned, I would name it MPI_COMM_GRID, in analogy with MPI_COMM_WORLD, MPI_ENABLED, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactoring a variable name is trivial if you use an advanced IDE like PyCharm. If you don't, let me know, and I'll rename it. (It’s just a matter of one mouse click.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being a global variable that is not expected to be reassigned, I would name it MPI_COMM_GRID, in analogy with MPI_COMM_WORLD, MPI_ENABLED, etc.

Do you mean removing the _GridCommClass altogether and instead simply use MPI_COMM_GRID?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being a global variable that is not expected to be reassigned, I would name it MPI_COMM_GRID, in analogy with MPI_COMM_WORLD, MPI_ENABLED, etc.

Do you mean removing the _GridCommClass altogether and instead simply use MPI_COMM_GRID?

No, it's just a matter of renaming comm_grid, as in this way it will match the naming convention of the other global variables in the mpi module.

@anand-avinash
Copy link
Contributor Author

I have renamed comm_grid to MPI_COMM_GRID and added documentation for the same. It can be merged now. I don't have any idea for adding a test for the sub-communicator though.

@anand-avinash
Copy link
Contributor Author

anand-avinash commented Nov 27, 2024

I discovered today that a sub-communicator same as my implementation was already in place as a local variable:

is_in_grid = self.comm.rank < self._n_blocks_det * self._n_blocks_time
comm_grid = self.comm.Split(int(is_in_grid))

There are also the sub-communicators for detector and time blocks:

comm_col = comm_grid.Split(my_col)

comm_row = comm_grid.Split(comm_grid.rank // self._n_blocks_time)

It might be useful to make the sub-communicators for detector and time blocks the members of Observation class. The detector block sub-communicator can be used to pick up all the ranks where the TOD chunks of a given detector reside. Similarly the time block sub-communicator can be used to pick all the ranks where TOD chunks of a given time interval is located.

@ziotom78
Copy link
Member

My gosh, thank you so much for having spotted this, @anand-avinash ! It was added four years ago but never documented.

It might be useful to make the sub-communicators for detector and time blocks the members of Observation class. The detector block sub-communicator can be used to pick up all the ranks where the TOD chunks of a given detector reside. Similarly the time block sub-communicator can be used to pick all the ranks where TOD chunks of a given time interval is located.

Absolutely, I agree!

@anand-avinash
Copy link
Contributor Author

I have added the sub-communicators for the detector and time blocks, and updated the documentation. It is now ready to be merged.

P.S.: A new version of poetry was released two days ago. With this update, the command poetry export --without-hashes > requirements.txt no longer works. To address this, I changed the poetry version in .readthedocs.yml and .github/workflows/tests.yml from latest to 1.8.5 in the last commit. This issue should be properly addressed in a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants