Description
Thanks again for your help with #720. This issue is unrelated, except that #720 led us to write more comprehensive unit tests, which revealed this new, probably unrelated segfault.
Summary of this problem: a segfault occurs when GC is triggered in a multithreaded+MPI context.
How to reproduce: I have created a draft PR that adds a GC.gc() call to one of MPI.jl's existing multithreaded tests: see PR #724
The draft PR is based on the most recent commit where all tests passed (tag 0.20.8). In the output of "test-intel-linux", the salient lines are:
signal (11): Segmentation fault
in expression starting at /home/runner/work/MPI.jl/MPI.jl/test/test_threads.jl:18
ijl_gc_enable at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gc.c:2955
The change we made is in the file test/test_threads.jl, where we added the following if clause:
Threads.@threads for i = 1:N
    reqs[N+i] = MPI.Irecv!(@view(recv_arr[i:i]), comm; source=src, tag=i)
    reqs[i] = MPI.Isend(@view(send_arr[i:i]), comm; dest=dst, tag=i)
    if i == 1
        GC.gc()
    end
end
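For anyone who wants to try this outside the test suite, here is a minimal standalone sketch built around the loop above. The setup around the loop (ring neighbours, N = 10, array contents) is my assumption of what a small reproducer could look like, not a copy of test/test_threads.jl, and the file name repro.jl is hypothetical. It assumes at least two ranks and more than one Julia thread, launched with an mpiexec that matches the MPI library MPI.jl is configured to use.

```julia
# repro.jl (hypothetical file name)
# Launch with something like: mpiexec -n 2 julia --threads=4 repro.jl
using MPI

# Request full multithreading support; the returned value is what MPI actually provides.
provided = MPI.Init(threadlevel=:multiple)

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

const N = 10                   # assumed message count, chosen for illustration
dst = mod(rank + 1, nranks)    # send to the next rank in a ring
src = mod(rank - 1, nranks)    # receive from the previous rank

send_arr = collect(1.0:N)
recv_arr = zeros(N)

if provided == MPI.THREAD_MULTIPLE
    reqs = Array{MPI.Request}(undef, 2N)

    Threads.@threads for i = 1:N
        reqs[N+i] = MPI.Irecv!(@view(recv_arr[i:i]), comm; source=src, tag=i)
        reqs[i] = MPI.Isend(@view(send_arr[i:i]), comm; dest=dst, tag=i)
        if i == 1
            GC.gc()            # force a collection while other threads are inside MPI calls
        end
    end

    MPI.Waitall(reqs)
    @assert recv_arr == send_arr
else
    @warn "MPI does not provide THREAD_MULTIPLE; skipping the threaded exchange" provided
end
```

On an unaffected configuration this should exit quietly; on an affected one we would expect the same signal (11) as in the CI log above.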
We experience similar problems with MPICH 4.0 in our package (https://github.com/Julia-Tempering/Pigeons.jl), but not with MPICH 4.1.
Related discussions
This describes a similar issue in the context of UCX. However, judging from our investigations so far, the problem does not seem to be limited to UCX.
This describes a similar issue in the context of OpenMPI. However, it seems that certain versions of MPICH and Intel MPI (which is MPICH-derived) might be affected by a similar issue.
In light of these two sources, perhaps other environment variables in the style of
Line 133 in 6d513bb
Thank you so much for your time.