RFC: Provide equivalence of MPICH_ASYNC_PROGRESS #13088
@hominhquan thanks for the PR. Do you have any performance or testing data?
A step in the right direction. I still have an issue with the fact that the thread is not bound and cannot be bound, but we can address that later.
@janjust Yes, I shared in #10374 that we observed a gain of up to x1.4 in |
@hominhquan What is the impact on collective operations, both in shared and distributed memory? I imagine there to be more contention...
There are many other things to consider in terms of performance, but in terms of the changes needed just to support the async thread capability this looks okay. Low risk since it's buy-in.
opal_progress_set_event_flag(OPAL_EVLOOP_ONCE | OPAL_EVLOOP_NONBLOCK);
#endif
/* shutdown async progress thread before tearing down further services */
if (opal_async_progress_thread_spawned) {
This is okay for now. It does leave a hole to plug for the sessions model, but since this is a buy-in option for the application user it should be okay for now.
Do you mind elaborating on this, @hppritcha? I fail to see the issue with the sessions model.
I only compared with |
- The SW-based async progress thread was planned a long time ago in 683efcb, but has never been enabled or implemented since.
- This commit enables the spawn of an async progress thread to execute the _opal_progress() routine when OPAL_ENABLE_PROGRESS_THREADS is set at both compile time and runtime (env OPAL_ASYNC_PROGRESS).
- Fix minor typo in opal_progress.h doxygen comment

Signed-off-by: Minh Quan Ho <[email protected]>
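For readers who want a concrete picture of the mechanism the commit message describes, here is a minimal sketch, not the code from this PR. It assumes a POSIX threads environment. Every name in it (`progress_thread_init`, `progress_thread_finalize`, `do_progress`, `progress_calls`) is hypothetical, and the way the `OPAL_ASYNC_PROGRESS` value is parsed (any non-zero integer enables the thread) is an assumption made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the Open MPI implementation. */
static atomic_bool progress_thread_stop = false;
static atomic_ulong progress_calls = 0;
static bool progress_thread_spawned = false;
static pthread_t progress_thread;

/* Stand-in for the library's progress routine (_opal_progress() in the PR). */
static void do_progress(void)
{
    atomic_fetch_add(&progress_calls, 1);
}

/* Body of the progress thread: poll until asked to stop. */
static void *progress_loop(void *arg)
{
    (void) arg;
    while (!atomic_load(&progress_thread_stop)) {
        do_progress();
        sched_yield(); /* give application threads a chance to run */
    }
    return NULL;
}

/* Runtime opt-in: spawn the thread only when the env variable is set to a
 * non-zero integer (assumed parsing rule). Returns 0 on success, -1 on error. */
static int progress_thread_init(void)
{
    const char *env = getenv("OPAL_ASYNC_PROGRESS");
    if (NULL == env || 0 == atoi(env)) {
        return 0; /* feature not requested: the caller drives progress itself */
    }
    atomic_store(&progress_thread_stop, false);
    if (0 != pthread_create(&progress_thread, NULL, progress_loop, NULL)) {
        return -1;
    }
    progress_thread_spawned = true;
    return 0;
}

/* Stop and join the thread before tearing down anything it may touch. */
static void progress_thread_finalize(void)
{
    if (progress_thread_spawned) {
        atomic_store(&progress_thread_stop, true);
        int rc = pthread_join(progress_thread, NULL);
        assert(0 == rc);
        progress_thread_spawned = false;
    }
}
```

The sketch joins the thread in the finalize step for the same reason the reviewed diff shuts the thread down "before tearing down further services": the loop must not run against state that has already been released.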
@devreal below are our results on a single-node Grace CPU, in which we see the sweet spot around 128-256 KB per rank. There was degradation on three operations: |
Is that a 3x slowdown for small messages? That would be very concerning. Any idea what would cause that?
It would come from the synchronization overhead between the progress thread (now executing |
It's not Amdahl's law if the execution gets slower when you add more resources :) In a perfect world, the thread calling |
As @bosilca said in #13074, back in 2010-2014 it was hard to get an optimal solution for all message sizes and all use cases, and I confirm this conclusion. Spawning a thread introduces many side effects whose impact we can hardly measure. The idea behind this patch is to (re)open the door to further improvement and fine-tuning (core binding? time-based yield/progression? work stealing?).
These results look suspicious. OSU doesn't do overlap; it basically posts the non-blocking operation and waits for it. For small messages I could understand a 10 to 20 percent performance degradation, but not 3x. And for large messages on an iallgather, a 2x increase in performance? Where is that extra bandwidth coming from? What exactly is the speedup you report on this graph?
Recent versions of OSU added, for example The results I showed above are on the The speedup is expected from the fact that some tasks (e.g. I tested with MPICH's Again, I know the limitation of OSU of only using two MPI processes, where resource contention is not stressed much. They nevertheless mimic a real, well-written overlapped non-blocking scheme.
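Since the question in this exchange is how much communication is actually hidden behind computation, a small worked formula may help. The sketch below is one common way to express overlap as a percentage. It is an assumption for illustration, not a formula quoted in this thread or taken from any benchmark's source, and the function name and parameters are hypothetical.

```c
#include <assert.h>

/* Percentage of communication time hidden behind computation.
 *   t_pure_comm: time of the non-blocking operation when it is posted and
 *                waited on immediately, with no compute in between
 *   t_compute:   time of the compute phase on its own
 *   t_total:     time of post + compute + wait
 * The result is clamped to the range [0, 100]. */
static double overlap_percent(double t_pure_comm, double t_compute, double t_total)
{
    assert(t_pure_comm > 0.0);
    double exposed = t_total - t_compute; /* communication time still visible */
    double pct = 100.0 * (1.0 - exposed / t_pure_comm);
    if (pct < 0.0) {
        pct = 0.0;
    }
    if (pct > 100.0) {
        pct = 100.0;
    }
    return pct;
}
```

With this definition, a run where the total time equals compute plus pure communication has no overlap, and a run where the total time equals the compute time alone has full overlap.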
I have been out of the loop for a while, but I thought the idea was to do more targeted progress rather than the hammer that is just looping on opal_progress (which is predictably bad; we knew this years ago). The concept I remember was signaled sends that could then wake up the remote process, which would then make progress. Did that go nowhere? The way I remember it, the reason this speeds up large messages is that we can progress the RNDV without entering MPI. This is what signaled sends in the BTL were supposed to address. It would send the RTS, which would trigger an RDMA get, RTR, or whatever, then go back to sleep.
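The contrast drawn here, a thread that sleeps until it is signaled versus one that loops on the progress function, can be sketched with a condition variable. This is only an illustration of the wake-on-signal pattern under POSIX threads. It is not the signaled-send design for the BTL, and all names (`waker_t`, `sleeper_loop`, `waker_signal`, `waker_stop`) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical state shared between the signaling side and the sleeper. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    int  pending;  /* events signaled but not yet handled */
    int  handled;  /* events handled so far */
    bool stop;     /* set to request shutdown */
} waker_t;

/* Sleeper thread: blocks until an event arrives, handles it, sleeps again. */
static void *sleeper_loop(void *arg)
{
    waker_t *w = arg;
    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (0 == w->pending && !w->stop) {
            pthread_cond_wait(&w->wake, &w->lock); /* no CPU used while idle */
        }
        if (0 == w->pending) {
            break; /* stop requested and nothing left to handle */
        }
        assert(w->pending > 0);
        w->pending--;
        /* A real implementation would advance the protocol here. */
        w->handled++;
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Called when an event (for example, an incoming signaled message) arrives. */
static void waker_signal(waker_t *w)
{
    pthread_mutex_lock(&w->lock);
    w->pending++;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* Ask the sleeper to exit once all pending events are drained. */
static void waker_stop(waker_t *w)
{
    pthread_mutex_lock(&w->lock);
    w->stop = true;
    pthread_cond_broadcast(&w->wake);
    pthread_mutex_unlock(&w->lock);
}
```

The trade-off is the usual one: the sleeper costs nothing while idle but pays a wake-up latency per event, whereas a polling loop reacts quickly but competes with application threads for the core.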
Ping: please tell me whether this PR needs more discussion or has enough merit to be merged.
This PR is a follow-up to #13074.