ompi/instance: fix cleanup function registration order #13073

Closed
hominhquan wants to merge 1 commit from mqho/instance_finalize

Conversation

hominhquan

  • Append the PML cleanup to the finalize list of the instance domain ('ompi_instance_common_domain') before RTE/OPAL init.

  • The reason is that RTE init (ompi_rte_init()) calls opal_init(), which in turn sets the internal tracking domain to OPAL's own ('opal_init_domain'), so this PML cleanup function would otherwise be mis-registered as belonging to 'opal_init_domain' instead of the intended 'ompi_instance_common_domain'.

  • The consequence of such mis-registration is that, at MPI_Finalize(), this PML cleanup (*_del_procs()) is executed by RTE finalize and, depending on the registration order, may pull the rug out from under other still-running components (*_progress()).

  • This may be the root cause of issue Intermittent crashes inside MPI_Finalize #10117.
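
To make the registration issue concrete, below is a minimal, self-contained C sketch (illustrative only, not Open MPI code: the toy_domain type, register_cleanup() and toy_rte_init() are invented stand-ins). It mimics how a global "current tracking domain" silently captures any cleanup callback registered after the domain has been switched.

/* toy_domain.c - illustrative sketch only, not Open MPI code */
#include <stdio.h>

typedef void (*cleanup_fn_t)(void);

/* a "domain" here is just a small list of cleanup callbacks */
typedef struct {
    const char *name;
    cleanup_fn_t fns[8];
    int count;
} toy_domain_t;

static toy_domain_t instance_domain = { "ompi_instance_common_domain", { 0 }, 0 };
static toy_domain_t opal_domain     = { "opal_init_domain", { 0 }, 0 };

/* global "current tracking domain", as described above */
static toy_domain_t *current_domain = &instance_domain;

static void register_cleanup(cleanup_fn_t fn)
{
    /* the callback lands in whatever domain is current *right now* */
    current_domain->fns[current_domain->count++] = fn;
}

static void pml_cleanup(void) { puts("PML cleanup (*_del_procs)"); }

static void toy_rte_init(void)
{
    /* opal_init() switches the tracking domain to OPAL's own domain */
    current_domain = &opal_domain;
}

int main(void)
{
    /* problematic order: registering after RTE/OPAL init puts the cleanup
     * in opal_init_domain, so it later runs during RTE/OPAL finalize */
    toy_rte_init();
    register_cleanup(pml_cleanup);
    printf("cleanup registered into: %s\n", current_domain->name);

    /* the order this PR proposes: register the cleanup *before* RTE/OPAL
     * init, while ompi_instance_common_domain is still current */
    return 0;
}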


Hello! The Git Commit Checker CI bot found a few problems with this PR:

f33c3a2: ompi/instance: fix cleanup function registration o...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@hominhquan hominhquan force-pushed the mqho/instance_finalize branch 2 times, most recently from 9631015 to 6d59b77 on January 31, 2025 at 17:18
@hppritcha
Member

This patch isn't going to work. The problem is that by the time ompi_mpi_instance_cleanup_pml is invoked, the PML framework has already been closed. For some reason this happens to work for the OB1 PML. Not sure exactly why that is yet.

Could you by chance provide a traceback from one of the failed CI tests when using Open MPI main or 5.0.x? This area of Open MPI was completely rewritten between the 4.1.x and 5.0.x releases, so tracebacks from failures using the 4.1.x releases wouldn't be that helpful for addressing this issue in newer releases.

@hominhquan
Author

This patch isn't going to work. The problem is that by the time ompi_mpi_instance_cleanup_pml is invoked, the PML framework has already been closed. For some reason this happens to work for the OB1 PML. Not sure exactly why that is yet.

Could you by chance provide a traceback from one of the failed CI tests when using Open MPI main or 5.0.x? This area of Open MPI was completely rewritten between the 4.1.x and 5.0.x releases, so tracebacks from failures using the 4.1.x releases wouldn't be that helpful for addressing this issue in newer releases.

Thanks @hppritcha for your comment. I admit that I don't know the complete init/finalize protocol between the different modules. The only fact I noticed is that ompi_mpi_instance_cleanup_pml() was (somehow wrongly?) registered as belonging to opal_init_domain, whereas I guess it should be in ompi_instance_common_domain? The result is that ompi_mpi_instance_cleanup_pml() is called by RTE finalize, and not by instance finalize (ompi_mpi_instance_finalize()).

@hppritcha Can you confirm what the expected registration order of ompi_mpi_instance_cleanup_pml() should be relative to the other finalize functions in RTE?

@hominhquan
Author

Or perhaps opal_finalize_cleanup_domain(&ompi_instance_common_domain) should be moved before all of the mca_base_framework_close(ompi_framework_dependencies[j]) calls, and always after ompi_rte_finalize()?

@hppritcha
Member

This code has been used for who knows how many millions of mpirun job runs since the 5.0.0 release. Finalizing MPI as the last instance is closed (an instance maps to an MPI session) needs to be carefully orchestrated. The PML cleanup is attempting to make sure all procs have been removed from the PML (as well as any opened via the OSC framework) prior to closing the frameworks. You may have found a bug, but this is not the way to solve it. Could you please provide a traceback of one of your CI failures using either one of the 5.0.x releases or main?

Also, as a way to double-check your changes, you should configure your build with --enable-mca-dso. This configure option makes a framework close call also cause the component to be dlclose'd. Doing this will make it obvious that this patch won't work. Even the OB1 PML fails with a segfault:

#9  <signal handler called>
#10 0x00007fffea8cd851 in ?? ()
#11 0x00007ffff79f7341 in ompi_mpi_instance_cleanup_pml () at instance/instance.c:189
#12 0x00007ffff70c293e in opal_finalize_cleanup_domain (domain=0x7ffff7d9ad80 <ompi_instance_common_domain>) at runtime/opal_finalize_core.c:129
#13 0x00007ffff79f9ce4 in ompi_mpi_instance_finalize_common () at instance/instance.c:967
#14 0x00007ffff79f9e1a in ompi_mpi_instance_finalize (instance=0x7ffff7d9aa60 <ompi_mpi_instance_default>) at instance/instance.c:984
#15 0x00007ffff79ee2f3 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:294
#16 0x00007ffff7a364fd in PMPI_Finalize () at finalize.c:52
#17 0x0000000000400a56 in main (argc=1, argv=0x7fffffffd8d8) at ring_c.c:75

@hominhquan
Author

I observed the race when testing the enabling of the async OPAL progress thread in #13074. The observations are as follows:

  • Each MPI process spawns a new thread that loops on opal_progress()
  • At the end of osu_ireduce, the main thread hits MPI_Finalize and calls RTE finalize, which calls ompi_mpi_instance_cleanup_pml() before the other opal_*_finalize() functions
  • In the case of single-node shared-memory communication, ompi_mpi_instance_cleanup_pml() calls sm_del_procs(), which destroys all endpoints (mca_btl_base_endpoint_t) while the progress thread is still running all the registered *_progress() callbacks; the progress thread crashes if one of those callbacks accesses the destroyed endpoints (e.g. mca_btl_sm_component_progress()). A minimal stand-alone illustration of this ordering hazard follows the backtrace below.
  • Below is an example backtrace of the crash
(gdb) break PMPI_Finalize
(gdb) continue

Thread 1 "osu_ireduce" hit Breakpoint 2, PMPI_Finalize () at ../../../../../../../../../../ompi/mpi/c/finalize.c:46
46          if (MPI_PARAM_CHECK) {
Missing separate debuginfos, use: dnf debuginfo-install zlib-1.2.11-40.el9.aarch64
(gdb) i th
  Id   Target Id                                         Frame
* 1    Thread 0xffffaee57440 (LWP 2744553) "osu_ireduce" PMPI_Finalize () at ../../../../../../../../../../ompi/mpi/c/finalize.c:46
  2    Thread 0xffffae7c90c0 (LWP 2744946) "osu_ireduce" _opal_progress () at ../../../../../../../../opal/runtime/opal_progress.c:308
  3    Thread 0xffffadc4f0c0 (LWP 2745057) "osu_ireduce" 0x0000ffffaeb3cd7c in epoll_pwait () from /lib64/libc.so.6
  4    Thread 0xffffacef00c0 (LWP 2745066) "osu_ireduce" 0x0000ffffaeb2e890 in read () from /lib64/libc.so.6

(gdb) t 2
[Switching to thread 2 (Thread 0xffffae7c90c0 (LWP 2744946))]
#0  _opal_progress () at ../../../../../../../../opal/runtime/opal_progress.c:308
308             events += (callbacks[i])();
(gdb) bt
#0  _opal_progress () at ../../../../../../../../opal/runtime/opal_progress.c:308
#1  0x0000ffffaeee8c5c in opal_progress_async_thread_engine (obj=<optimized out>) at ../../../../../../../../opal/runtime/opal_progress.c:350
#2  0x0000ffffaead26b8 in start_thread () from /lib64/libc.so.6
#3  0x0000ffffaeb3cbdc in thread_start () from /lib64/libc.so.6
(gdb) c
Continuing.
[Thread 0xffffacef00c0 (LWP 2745066) exited]

Thread 2 "osu_ireduce" received signal SIGSEGV, Segmentation fault.
mca_btl_sm_check_fboxes () at ../../../../../../../../../../../opal/mca/btl/sm/btl_sm_fbox.h:241
241                 const mca_btl_sm_fbox_hdr_t hdr = mca_btl_sm_fbox_read_header(
(gdb) bt
#0  mca_btl_sm_check_fboxes () at ../../../../../../../../../../../opal/mca/btl/sm/btl_sm_fbox.h:241
#1  mca_btl_sm_component_progress () at ../../../../../../../../../../../opal/mca/btl/sm/btl_sm_component.c:559
#2  0x0000ffffaeee8e80 in _opal_progress () at ../../../../../../../../opal/runtime/opal_progress.c:308
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
  • I think this crash rarely occurs, since MPI_Finalize is most of the time called by the main thread and the application should already have waited for all communications to complete, so normally no *_progress() callback is executing at the same time as finalize; but there may be cases I have missed among all the available MCA components.
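
As a rough, self-contained illustration of why this teardown ordering matters (plain pthreads, invented names, no Open MPI code), the sketch below stops and joins the progress thread before freeing the endpoints; doing it in the opposite order reproduces the use-after-free pattern behind the SIGSEGV above.

/* toy_progress_race.c - illustrative only; build with: gcc -pthread toy_progress_race.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_EP 4

typedef struct { int polled; } toy_endpoint_t;

static toy_endpoint_t *endpoints[NUM_EP];
static atomic_bool progress_running = true;

/* stands in for the registered *_progress() callbacks polling endpoints */
static void *progress_engine(void *arg)
{
    (void) arg;
    while (atomic_load(&progress_running)) {
        for (int i = 0; i < NUM_EP; ++i) {
            if (endpoints[i] != NULL) {
                endpoints[i]->polled++;   /* would be a use-after-free if the
                                           * endpoint had been destroyed first */
            }
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    for (int i = 0; i < NUM_EP; ++i) {
        endpoints[i] = calloc(1, sizeof(toy_endpoint_t));
    }
    pthread_create(&tid, NULL, progress_engine, NULL);

    /* safe order (the conclusion reached later in this thread): stop and
     * join the progress thread *before* destroying the endpoints; freeing
     * the endpoints first would leave the loop above reading freed memory */
    atomic_store(&progress_running, false);
    pthread_join(tid, NULL);

    for (int i = 0; i < NUM_EP; ++i) {
        free(endpoints[i]);
        endpoints[i] = NULL;
    }
    puts("clean shutdown");
    return 0;
}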

@hominhquan hominhquan force-pushed the mqho/instance_finalize branch 2 times, most recently from f94d66e to 5070d7f on February 5, 2025 at 13:48
- Append the PML cleanup to the finalize list of the instance domain
  ('ompi_instance_common_domain') before RTE/OPAL init.

- The reason is that RTE init (ompi_rte_init()) calls opal_init(), which in turn
  sets the internal tracking domain to OPAL's own ('opal_init_domain'), so this
  PML cleanup function would otherwise be mis-registered as belonging to
  'opal_init_domain' instead of the intended 'ompi_instance_common_domain'.

- The consequence of such mis-registration is that, at MPI_Finalize(), this
  PML cleanup (*_del_procs()) is executed by RTE finalize and, depending on
  the registration order, may pull the rug out from under other still-running
  components (*_progress())

- This may be the root cause of issue open-mpi#10117

Signed-off-by: Minh Quan Ho <[email protected]>
@hjelmn
Member

hjelmn commented Feb 6, 2025

That suggests that either the async thread should be shut down before the PML, or there is some state we need to clean up (maybe removing a callback from opal_progress). Changing the order in which the PML is cleaned up is probably not the right approach.
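
For reference, a hedged sketch of the "removing a callback from opal_progress" idea, assuming the opal_progress_register()/opal_progress_unregister() pair declared in opal/runtime/opal_progress.h; the my_component_* names are hypothetical:

/* hedged sketch - assumes the opal_progress_register()/opal_progress_unregister()
 * pair from opal/runtime/opal_progress.h; the my_component_* names are hypothetical */
#include "opal/runtime/opal_progress.h"

static int my_component_progress(void)
{
    /* poll this component's resources; return the number of events handled */
    return 0;
}

static int my_component_start(void)
{
    /* start being polled by opal_progress() (and by any async progress thread) */
    return opal_progress_register(my_component_progress);
}

static int my_component_stop(void)
{
    /* the state cleanup hinted at above: stop being polled *before* the
     * resources touched by the callback are torn down */
    return opal_progress_unregister(my_component_progress);
}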

@hppritcha
Member

The crash you are seeing is also in a part of the sm code that uses memory mapped in from another process - the fast box mechanism - so if other processes are busy tearing down the sm btl and the backing shared memory, you will see segfaults at this location in the traceback when a process is lagging a bit and still trying to progress the sm btl.

Not that it solves the problem, but it may help in understanding what's going on if you set the following MCA parameter and see whether you still get segfaults.

export OMPI_MCA_btl_sm_fbox_max=0

I agree with @hjelmn: you need to find a place to shut down all the progress threads early in the instance MPI_Finalize/MPI_Session_finalize process.

@hominhquan
Author

Setting OMPI_MCA_btl_sm_fbox_max=0 in the environment no longer triggers the segfault in the progress thread.

@hjelmn @hppritcha If I sum up your recommendations:

  • Do not change the order of the PML cleanup (so let it be executed by RTE finalize)
  • Find a place to move the progress thread shutdown (currently opal_progress_finalize()) earlier in the instance/session finalize, probably before RTE finalize?

@hppritcha
Member

So the complication here is that the sm fast box checking (see mca_btl_sm_component_progress) is still being done within the ob1 progress loop even after - for the MPI world model (MCW) - all messages must have been sent and received. See the comments at lines 275-292 in ompi/runtime/ompi_mpi_finalize.c.

One thing you could do is add an opal_progress_shutdown_async_threads global method that would be invoked prior to entering the code that invokes PMIx_Fence_nb. It should hardly be necessary to have async progress threads running when all processes in MPI_COMM_WORLD are starting to enter the finalize procedure. This would not help for the sessions model, however.
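
A rough sketch of what such a global method might look like (the function name comes from the suggestion above; the flag and thread variables are assumptions, since the async progress thread itself comes from the #13074 work rather than mainline):

/* hypothetical sketch only - not existing Open MPI code */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* assumed state owned by the async progress thread feature */
static atomic_bool opal_async_progress_enabled;
static pthread_t   opal_async_progress_tid;

void opal_progress_shutdown_async_threads(void)
{
    if (!atomic_load(&opal_async_progress_enabled)) {
        return;                                         /* nothing to do */
    }
    atomic_store(&opal_async_progress_enabled, false);  /* ask the engine loop to exit */
    pthread_join(opal_async_progress_tid, NULL);        /* wait until it stops polling */
}

/* sketched call site in ompi_mpi_finalize(): invoke this *before* the
 * PMIx_Fence_nb() step mentioned above, so that no *_progress() callback
 * can run while peers start tearing down their PML/BTL resources */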

Another workaround would be to disable use of fast boxes in the sm btl when the opal progress thread is enabled. This approach would work for both the MCW and sessions models.

Another, more complex approach would be to develop some type of peer-to-peer shutdown protocol for the fast box path. This shutdown protocol could perhaps be engaged in sm_del_procs. Touching the sm btl involves lots of subtleties and would probably be challenging to get accepted into the code base, as it would impact one of the most important messaging paths when using the OB1 PML or RDMA OSC.

What I suspect in your test is that some processes complete the async collective operation later than others and have only just reached the PMIx_Fence_nb while others have gone past the fence and started tearing down their ompi-related resources. The opal progress thread is meanwhile spinning through the OB1 progress loop on those slower processes, and boom: they die in the check on fast boxes from processes that have already cleaned up their ob1/btl resources.

BTW, I bet your async code at 65048a7 may break things for other PMLs which you probably aren't testing, like UCX and CM (OFI). Please open a PR for that commit marked as an RFC and let's see what else may or may not pass CI.

@hppritcha
Member

There may possibly be another option for this sm btl problem. It looks like if, in the sm_del_procs method, you also scan the fbox_in_endpoints array for the ep being finalized and set its entry to NULL, and then check in mca_btl_sm_check_fboxes whether that entry is NULL before trying to check for messages, that may work.

This might end up being a whack-a-mole problem though. It might be simpler just to shut down the progress thread near the beginning of MPI_Finalize.
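
Roughly, the NULL-out / NULL-check pattern suggested above might look like the toy sketch below (the array and function names are borrowed from the discussion, not the actual sm btl data structures), with the caveat, confirmed in the next comment, that concurrent access between del_procs and the progress loop still needs synchronization:

/* toy sketch of the suggested NULL-out / NULL-check pattern - not sm btl code */
#include <stddef.h>

#define MAX_PEERS 16

typedef struct { int peer_rank; } toy_endpoint_t;

/* stands in for the fbox_in_endpoints array mentioned above */
static toy_endpoint_t *fbox_in_endpoints[MAX_PEERS];

static void toy_del_procs(toy_endpoint_t *ep)
{
    for (int i = 0; i < MAX_PEERS; ++i) {
        if (fbox_in_endpoints[i] == ep) {
            fbox_in_endpoints[i] = NULL;   /* stop scanning this peer's fast boxes */
        }
    }
    /* ... endpoint destruction would follow here ... */
}

static void toy_check_fboxes(void)
{
    for (int i = 0; i < MAX_PEERS; ++i) {
        if (NULL == fbox_in_endpoints[i]) {
            continue;                      /* peer already removed, skip it */
        }
        /* ... read this endpoint's fast-box headers here ... */
    }
}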

@hominhquan
Author

@hjelmn @hppritcha I've tried your two proposals:

  • In sm_del_procs, scan for the ep in the fbox_in_endpoints array and set it to NULL, then check for NULL in mca_btl_sm_check_fboxes(). As predicted by @hppritcha, I still got problems due to concurrent access to the fbox_in_endpoints[] array between the main thread (mca_btl_sm_endpoint_destructor) and the progress thread (mca_btl_sm_check_fboxes()), despite my attempt to add a lock in both of them.
  • I then switched to calling opal_progress_shutdown_async_progress_thread() early in ompi_mpi_finalize() (before PMIx_Fence_nb()). The progress thread is now shut down before the services are torn down, and the MPI processes terminate smoothly.

I'll close this PR and open a new one (tagged RFC) for the async progress thread feature. Thanks for your comments.

@hominhquan
Author

Closed and replaced by #13088
