ompi/instance: fix cleanup function registration order #13073
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: f33c3a2: ompi/instance: fix cleanup function registration o...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed from 9631015 to 6d59b77.
This patch isn't going to work. The problem is that by the time ompi_mpi_instance_cleanup_pml is invoked, the PML framework has already been closed. For some reason this happens to work for the OB1 PML; not sure exactly why that is yet. Could you provide a traceback, by chance, from one of the failed CI tests when using Open MPI main or 5.0.x? This area of Open MPI was completely rewritten between the 4.1.x and 5.0.x releases, so tracebacks from failures using the 4.1.x releases wouldn't be that helpful for addressing this issue in newer releases.
Thanks @hppritcha for your comment. I admit that I don't know the complete init/finalize protocol between the different modules. The only fact I noticed is that the … @hppritcha Can you confirm what the expected registration order of … should be?
or perhaps the …
This code has been used for who knows how many millions of mpirun job runs since the 5.0.0 release. Finalizing MPI as the last instance is closed (an instance maps to an MPI session) needs to be carefully orchestrated. The PML cleanup is attempting to make sure all procs have been removed from the PML (as well as any opened via the OSC framework) prior to closing the frameworks. You may have found a bug, but this is not the way to solve it. Could you please provide a traceback of one of your CI failures using either one of the 5.0.x releases or main? Also, as a way to double-check your changes, you should configure your builds with …
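For illustration, a compilable toy sketch (hypothetical names, not Open MPI code) of the ordering that cleanup is meant to guarantee: every proc is detached via del_procs while the PML framework is still open, and only then is the framework closed.

```c
/* Toy sketch (hypothetical names, not Open MPI code) of the required
 * ordering: del_procs for all procs first, framework close second. */
#include <stdio.h>

#define NPROCS 4

static int attached[NPROCS] = { 1, 1, 1, 1 };   /* toy "procs known to the PML" */

static void pml_del_procs(void)                 /* stand-in for *_del_procs()   */
{
    for (int i = 0; i < NPROCS; ++i)
        attached[i] = 0;
}

static void pml_framework_close(void)           /* stand-in for framework close */
{
    for (int i = 0; i < NPROCS; ++i)
        if (attached[i])
            printf("BUG: proc %d still attached at framework close\n", i);
    puts("PML framework closed");
}

int main(void)
{
    pml_del_procs();        /* must run before the framework goes away */
    pml_framework_close();  /* safe: no proc references remain         */
    return 0;
}
```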
I observed the race when testing the enabling of the async OPAL progress thread in #13074. The observation is as follows: …
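For concreteness, a self-contained sketch (pthreads, hypothetical names, not Open MPI code) of the kind of race being described: an async progress thread keeps polling component state while the main thread tears that state down. The later comments converge on stopping that thread before any teardown begins.

```c
/* Sketch of the race pattern: a progress thread still polling while the
 * main thread frees the state it polls. Hypothetical names throughout. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

static atomic_bool run_progress = true;
static int *component_state;                  /* stands in for BTL/PML state */

static void component_progress(void)          /* toy *_progress() callback   */
{
    int *s = component_state;
    if (s != NULL)
        (void)s[0];                           /* races with the free() below */
}

static void *progress_loop(void *arg)
{
    (void)arg;
    while (atomic_load(&run_progress))
        component_progress();
    return NULL;
}

int main(void)
{
    pthread_t tid;

    component_state = calloc(16, sizeof(int));
    pthread_create(&tid, NULL, progress_loop, NULL);

    /* The problem: teardown runs while the progress thread is still
     * spinning, so component_progress() can touch freed memory.      */
    free(component_state);
    component_state = NULL;

    /* Stopping and joining the thread only happens afterwards, too late. */
    atomic_store(&run_progress, false);
    pthread_join(tid, NULL);
    return 0;
}
```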
Force-pushed from f94d66e to 5070d7f.
- Append PML cleanup into the finalize of the instance domain ('ompi_instance_common_domain') before RTE/OPAL init.
- The reason is that RTE init (ompi_rte_init()) will call opal_init(), which in turn will set the internal tracking domain to OPAL's own ('opal_init_domain'), so this PML cleanup function would be mis-registered as belonging to 'opal_init_domain' instead of the current 'ompi_instance_common_domain'.
- The consequence of such mis-registration is that at MPI_Finalize() this PML cleanup (*_del_procs()) will be executed by RTE and, depending on registration order, this may cut the grass from under the feet of other still-running components (*_progress()).
- This may be the root cause of issue open-mpi#10117 (Intermittent crashes inside MPI_Finalize).

Signed-off-by: Minh Quan Ho <[email protected]>
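To make the mis-registration mechanism concrete, here is a minimal, self-contained model (hypothetical API, not OPAL's actual finalize code) of a registry that tracks a "current domain": whichever domain is current at registration time owns the callback, which is why registering the PML cleanup only after RTE/OPAL init would silently hand it to 'opal_init_domain'.

```c
/* Minimal model of domain-tracked cleanup registration. The functions
 * set_current_domain()/register_cleanup() are hypothetical stand-ins
 * for the real OPAL machinery; only the ordering idea is the point. */
#include <stdio.h>

typedef void (*cleanup_fn_t)(void);

struct domain {
    const char  *name;
    cleanup_fn_t cbs[8];
    int          ncbs;
};

static struct domain *current_domain;            /* registry's tracked domain */

static void set_current_domain(struct domain *d) { current_domain = d; }
static void register_cleanup(cleanup_fn_t fn)    { current_domain->cbs[current_domain->ncbs++] = fn; }

static void cleanup_pml(void) { puts("PML del_procs cleanup"); }

static struct domain instance_domain = { .name = "ompi_instance_common_domain" };
static struct domain opal_domain     = { .name = "opal_init_domain" };

static void fake_opal_init(void)
{
    /* opal_init() makes its own domain current; anything registered
     * after this point belongs to opal_init_domain.                 */
    set_current_domain(&opal_domain);
}

int main(void)
{
    set_current_domain(&instance_domain);

    /* The fix, conceptually: register the PML cleanup BEFORE RTE/OPAL
     * init so it lands in the instance domain ...                    */
    register_cleanup(cleanup_pml);
    fake_opal_init();

    /* ... had it been registered here instead, opal_init_domain would
     * own it and it would run at RTE teardown, out of order with other
     * components' progress callbacks.                                 */
    printf("%s owns %d cleanup(s)\n", instance_domain.name, instance_domain.ncbs);
    return 0;
}
```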
That suggests that either the async thread should be shut down before the PML, or there is some state we need to clean up (maybe removing a callback from opal_progress). Changing the order in which the PML is cleaned up is probably not the right approach.
The crash you are seeing is also in a part of the sm code that uses memory mapped in from another process (the fast box thing), so if other processes are busy tearing down the sm btl and the backing shared memory, you will see segfaults at this location in the traceback when some process is lagging a bit and still trying to progress the sm btl. Not that it solves the problem, but it may help in understanding what's going on if you set the following MCA parameter and see whether you still get segfaults: …
I agree with @hjelmn: you need to find a place to shut down all the progress threads early in the instance MPI_Finalize/MPI_Session_finalize process.
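A minimal sketch of that recommendation, assuming a pthreads-based progress thread and hypothetical names (not Open MPI's actual code): stop and join the thread as the very first step of finalize, before any PML/BTL teardown.

```c
/* Sketch of the recommended ordering: the async progress thread is
 * stopped and joined first, so nothing is still polling components
 * while they are torn down. Hypothetical names throughout. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool progress_thread_run = true;
static pthread_t   progress_thread_id;

static void *progress_loop(void *arg)
{
    (void)arg;
    while (atomic_load(&progress_thread_run)) {
        /* opal_progress()-style polling would go here */
    }
    return NULL;
}

static void progress_thread_start(void)
{
    pthread_create(&progress_thread_id, NULL, progress_loop, NULL);
}

static void progress_thread_stop(void)
{
    atomic_store(&progress_thread_run, false);
    pthread_join(progress_thread_id, NULL);
}

static void instance_finalize(void)
{
    progress_thread_stop();  /* 1. no more async polling from here on       */
    /* 2. ... PML del_procs, BTL/fast-box teardown, framework close ...     */
}

int main(void)
{
    progress_thread_start();
    /* ... MPI work ... */
    instance_finalize();
    return 0;
}
```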
Setting env … @hjelmn @hppritcha If I sum up your recommendations: …
So the complication here is that the sm fast box checking (see …) … One thing you could do is add an …

Another workaround would be to disable the use of fast boxes in the sm btl when the opal progress thread is enabled. This approach would work for both the MCW and sessions models.

Another, more complex approach would be to develop some type of peer-to-peer shutdown protocol for the fast box path. This shutdown protocol could be engaged in the …

What I suspect in your test is that some processes have completed the async collective operation later than others and have only just reached the PMIx_Fence_nb, while others have gone past the fence and started tearing down their ompi-related resources. The opal progress thread is meanwhile spinning through the OB1 progress loop on those slower processes and, boom, they die in the check on fast boxes from processes that have already cleaned up their ob1/btl resources.

BTW, I bet your async code at 65048a7 may break things for other PMLs which you probably aren't testing, like UCX and CM (OFI). Please open a PR for that commit marked as an RFC and let's see what else may or may not pass CI.
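As one possible shape for such a peer-to-peer shutdown protocol, here is a hedged sketch with hypothetical names and layout (not the real sm btl): each peer publishes a "shutting down" flag in its shared segment, and the fast-box poll skips peers that have raised it instead of touching memory they may be about to unmap.

```c
/* Hedged sketch (hypothetical names/layout, not the real sm btl) of a
 * peer-to-peer shutdown handshake for the fast-box path. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS 64

struct peer_segment {                 /* lives in shared memory (toy layout)  */
    atomic_bool   shutting_down;      /* raised by the owner before unmapping */
    unsigned char fbox[4096];         /* toy fast-box region                  */
};

static struct peer_segment *peers[MAX_PEERS];   /* mapped-in peer segments */

static void poll_fastboxes(void)
{
    for (size_t i = 0; i < MAX_PEERS; ++i) {
        struct peer_segment *p = peers[i];
        if (p == NULL)
            continue;
        /* Skip peers that announced shutdown: their fast-box memory may
         * be unmapped at any moment. */
        if (atomic_load_explicit(&p->shutting_down, memory_order_acquire))
            continue;
        (void)p->fbox[0];             /* placeholder for real fbox processing */
    }
}

/* Owner side: announce shutdown before tearing the segment down, so
 * lagging peers stop polling it. A real protocol would also need to
 * wait or fence before actually unmapping. */
static void announce_shutdown(struct peer_segment *mine)
{
    atomic_store_explicit(&mine->shutting_down, true, memory_order_release);
}

int main(void)
{
    static struct peer_segment mine;  /* toy: not actually shared memory */
    peers[0] = &mine;
    poll_fastboxes();                 /* peer still live: fbox is read   */
    announce_shutdown(&mine);
    poll_fastboxes();                 /* peer now skipped                */
    return 0;
}
```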
There may possibly be another option for this sm btl problem. It looks like in the … This might end up being a whack-a-mole problem, though. It might be simpler just to shut down the progress thread near the beginning of …
@hjelmn @hppritcha I've tried your two proposals: …
I'll close this PR and open a new one (tagged RFC) for the async progress thread feature. Thanks for your comments.
Closed and replaced by #13088