
RFC: Provide equivalence of MPICH_ASYNC_PROGRESS #13088

Open

wants to merge 1 commit into main from mqho/async_progress

Conversation

hominhquan

This PR is a follow-up of #13074.

@janjust
Contributor

janjust commented Feb 11, 2025

@hominhquan thanks for the PR, do you have any performance data and/or testing data?

Member

@bosilca bosilca left a comment


A step in the right direction. I still have an issue with the fact that the thread is not bound and cannot be bound, but we can address that later.

@hominhquan hominhquan force-pushed the mqho/async_progress branch 2 times, most recently from b1f4dad to f3e8fc3 on February 12, 2025 at 14:03
@hominhquan
Author

hominhquan commented Feb 12, 2025

@hominhquan thanks for the PR, do you have any performance data and/or testing data?

@janjust Yes, I shared in #10374 that we observed a gain of up to 1.4x in OSU_Ireduce. Note that most OSU benchmarks only run on two MPI processes, so memory contention and CPU occupation are not stressed to their limits.

@devreal
Contributor

devreal commented Feb 12, 2025

@hominhquan What is the impact on collective operations, both in shared and distributed memory? I imagine there to be more contention...

hppritcha previously approved these changes Feb 12, 2025
Member

@hppritcha hppritcha left a comment


There are many other things to consider in terms of performance, but in terms of changes to just support the async thread capability this looks okay. Low risk since it's buy-in.

opal_progress_set_event_flag(OPAL_EVLOOP_ONCE | OPAL_EVLOOP_NONBLOCK);
#endif
/* shutdown async progress thread before tearing down further services */
if (opal_async_progress_thread_spawned) {
Member


This is okay for now but does leave a hole to plug for the sessions model; since this is a buy-in option for the application user, it should be okay for now.

Member


Do you mind elaborating on this @hppritcha? I fail to see the issue with the sessions model.

@hominhquan
Author

hominhquan commented Feb 12, 2025

@hominhquan What is the impact on collective operations, both in shared and distributed memory? I imagine there to be more contention...

I only compared with OSU_Ireduce. I'll run the whole OSU collective suite tonight for a broader view, but only on a single-node shared-memory config, since I don't have access to any cluster system.

bosilca previously approved these changes Feb 12, 2025
- The SW-based async progress thread was planned a long time ago in
  683efcb, but has never been enabled/implemented since.
- This commit enables the spawn of an async progress thread to execute
  the _opal_progress() routine when OPAL_ENABLE_PROGRESS_THREADS is set at
  both compile time and runtime (env OPAL_ASYNC_PROGRESS).
- Fix a minor typo in the opal_progress.h doxygen comment

Signed-off-by: Minh Quan Ho <[email protected]>
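
For context, the mechanism the commit describes can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: only opal_progress() and the opal_async_progress_thread_spawned flag appear in the change itself; the loop, stop flag, and helper names below are hypothetical.

```c
/* Illustrative sketch only (not the PR's implementation): a helper thread
 * that repeatedly drives the OPAL progress engine until shutdown is requested.
 * opal_progress() and opal_async_progress_thread_spawned come from the change
 * under review; every other name here is made up for illustration. */
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

extern void opal_progress(void);   /* OPAL progress engine entry point (prototype paraphrased) */

bool opal_async_progress_thread_spawned = false;        /* flag tested at finalize in the diff above */
static volatile bool async_progress_shutdown = false;   /* hypothetical stop flag */
static pthread_t async_progress_tid;                    /* hypothetical thread handle */

static void *async_progress_loop(void *arg)
{
    (void) arg;
    while (!async_progress_shutdown) {
        opal_progress();   /* advance pending communication on behalf of the application */
        sched_yield();     /* avoid monopolizing a core; binding/yield policy is still open */
    }
    return NULL;
}

static int async_progress_start(void)
{
    if (0 != pthread_create(&async_progress_tid, NULL, async_progress_loop, NULL)) {
        return -1;
    }
    opal_async_progress_thread_spawned = true;
    return 0;
}

static void async_progress_stop(void)
{
    if (opal_async_progress_thread_spawned) {
        async_progress_shutdown = true;          /* ask the loop to exit ...              */
        pthread_join(async_progress_tid, NULL);  /* ... and wait before tearing down OPAL */
        opal_async_progress_thread_spawned = false;
    }
}
```

The shutdown path mirrors the finalize hunk quoted in the review above: the thread is only joined when it was actually spawned, which is what makes the feature opt-in.
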
@hominhquan
Author


@devreal below are our results on a single-node Grace CPU, in which we see the sweet spot around 128-256 KB per rank. There was degradation on three operations: osu_iallgather, osu_iallgatherv, and osu_ialltoall.
(Attached figure: osu_speedup_async_progress)

@devreal
Contributor

devreal commented Feb 13, 2025

Is that a 3x slowdown for small messages? That would be very concerning. Any idea what would cause that?

@hominhquan
Author

Is that a 3x slowdown for small messages? That would be very concerning. Any idea what would cause that?

It would come from the synchronization overhead between the progress thread (now executing opal_progress()) and the main thread (waiting for the work to complete at each MPI_Test() or MPI_Wait()). This incompressible cost becomes significant for small messages (Amdahl's law).

@devreal
Contributor

devreal commented Feb 13, 2025

It's not Amdahl's law if the execution gets slower when you add more resources :) In a perfect world, the thread calling MPI_Iallreduce would trigger the progress thread, which then immediately goes ahead and executes the required communications before returning control to the main thread. This looks like many tens of microseconds of synchronization overhead, which is more than I would expect.

@hominhquan
Author

As @bosilca said in #13074 (back in 2010-2014), it was hard to find a solution that is optimal for all message sizes and all use cases, and I confirm this conclusion. Spawning a thread introduces many side effects whose impact we can hardly measure. The idea behind this patch is to (re)open the door to further improvement and fine-tuning (core binding? time-based yield/progression? work stealing?).

@bosilca
Member

bosilca commented Feb 13, 2025

These results look suspicious. OSU doesn't do overlap; it basically posts the non-blocking operation and waits for it.

For small messages I could understand 10 to 20 percent performance degradation, but not 3x. And for large messages on an iallgather, a 2x increase in performance? Where is that extra bandwidth coming from?

What exactly is the speedup you report on this graph?

@hominhquan
Author

hominhquan commented Feb 13, 2025

Recent versions of OSU added, for example, osu_ireduce, which does overlap and measures the overall time of MPI_Ireduce + dummy_compute(latency_seq) + MPI_Wait, where dummy_compute() simulates a computation period of latency_seq = elapsed(MPI_Ireduce + MPI_Wait). The goal is to see whether the MPI library manages to overlap the non-blocking collective 100% with dummy_compute(...). They did the same for osu_iallreduce.c, osu_ialltoall.c, osu_ibcast.c, etc.
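
To make that pattern concrete, here is a minimal sketch of the Ireduce + dummy_compute + MPI_Wait loop (my own illustration, not the OSU source; dummy_compute and timed_ireduce are made-up names):

```c
/* Sketch of the overlap-measurement pattern described above (illustration, not OSU code). */
#include <mpi.h>

/* Busy-wait for roughly 'seconds' to simulate computation overlapping the collective. */
static void dummy_compute(double seconds)
{
    const double start = MPI_Wtime();
    while (MPI_Wtime() - start < seconds) {
        /* simulated application work */
    }
}

/* Overall time of MPI_Ireduce + dummy_compute + MPI_Wait, where latency_seq is
 * the previously measured elapsed(MPI_Ireduce + MPI_Wait) without overlap. */
static double timed_ireduce(const double *sendbuf, double *recvbuf, int count,
                            double latency_seq, MPI_Comm comm)
{
    MPI_Request req;
    const double t0 = MPI_Wtime();

    MPI_Ireduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, comm, &req);
    dummy_compute(latency_seq);          /* ideally fully overlapped by the progress thread */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    return MPI_Wtime() - t0;             /* the overall_time compared with OPAL_ASYNC_PROGRESS=[0|1] */
}
```

With perfect asynchronous progression the overall time stays close to latency_seq; with no progression outside MPI calls it approaches roughly 2 x latency_seq.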

The results I showed above are the overall_time of those non-blocking operations with OPAL_ASYNC_PROGRESS=[0|1], on the latest OSU benchmarks (>= 7.x): https://mvapich.cse.ohio-state.edu/benchmarks/

The speedup comes from the fact that some tasks (e.g. the Ireduce computation, or pure data movement in bcast/scatter/gather) are done truly in parallel with dummy_compute() by the new progress thread, rather than at the end of the communication (MPI_Wait()).

I tested with MPICH's MPICH_ASYNC_PROGRESS=1 and also observed some performance gain at the time (numbers to be refreshed by new runs).

Again, I know the limitation of OSU only using two MPI processes, where resource contention is not stressed very far. Nevertheless, they mimic a realistic, well-written overlapped non-blocking scheme.

@hjelmn
Member

hjelmn commented Feb 13, 2025

I have been out of the loop for a while, but I thought the idea was to do more targeted progress rather than the hammer that is just looping on opal_progress (which is predictably bad; we knew this years ago). The concept I remember was signaled sends that could wake up the remote process, which would then make progress. Did that go nowhere?

The way I remember it, the reason this speeds up large messages is that we can progress the RNDV without entering MPI. This is what signaled sends in the BTL were supposed to address. It would send the RTS, which would trigger an RDMA get, RTR, whatever, then go back to sleep.

@hominhquan hominhquan dismissed stale reviews from bosilca and hppritcha via b3a29bf February 15, 2025 13:08
@hominhquan
Author

Ping: please tell me whether this PR needs more discussion, or whether it has merit to be merged.
