MPI implementations intercepting Signals is incompatible with Julia GC safepoint #725

Open
@alexandrebouchard

Thanks again for your help with #720. This one is unrelated, except that issue #720 led us to create more comprehensive unit tests, which revealed this new, probably unrelated segfault.

Summary of this problem: a segfault occurs when GC is triggered in a multithreaded+MPI context.

How to reproduce: I have created a draft PR adding a GC.gc() call to one of MPI.jl's existing multithreaded tests: see PR #724

The draft PR is based on the most recent commit where all tests passed (tag 0.20.8). In the output of "test-intel-linux", the salient lines are:

signal (11): Segmentation fault
in expression starting at /home/runner/work/MPI.jl/MPI.jl/test/test_threads.jl:18
ijl_gc_enable at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gc.c:2955

The change we made is in the file test/test_threads.jl, where we added the if block that calls GC.gc() on the first iteration:

    Threads.@threads for i = 1:N
        reqs[N+i] = MPI.Irecv!(@view(recv_arr[i:i]), comm; source=src, tag=i)
        reqs[i] = MPI.Isend(@view(send_arr[i:i]), comm; dest=dst, tag=i)
        if i == 1
            GC.gc()
        end
    end

We experience similar problems with MPICH 4.0 in our package (https://github.com/Julia-Tempering/Pigeons.jl), but not with MPICH 4.1.

Related discussions

This describes a similar issue in the context of UCX. However, from our investigations so far, the problem does not seem limited to UCX.

This describes a similar issue in the context of OpenMPI. However, it seems that certain versions of MPICH and Intel MPI (which is MPICH-derived) might suffer from a similar issue.

In light of these two sources, perhaps other environment variables in the style of

ENV["UCX_ERROR_SIGNALS"] = "SIGILL,SIGBUS,SIGFPE"

could be set to address this issue? Does anyone have a suggestion on whether that is a reasonable hypothesis? Having limited MPI experience, I am not sure what these environment variables might be.
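For concreteness, here is a minimal sketch of how such a variable could be set from Julia without overriding a value the user has already exported. The helper name set_default_env! is hypothetical (it is not part of MPI.jl or UCX), and the sketch assumes the variable has to be in the process environment before the MPI library is initialized. Note that the signal list leaves out SIGSEGV, which is the signal the issue title ties to the GC safepoint.

```julia
# Hypothetical helper, not part of MPI.jl or UCX: set an environment
# variable only when it is not already defined, so that a value exported
# by the user in the shell takes precedence.
function set_default_env!(name::AbstractString, value::AbstractString)
    if !haskey(ENV, name)
        ENV[name] = value
    end
    return ENV[name]
end

# Ask UCX to install its error handlers only for these signals.
# SIGSEGV is deliberately absent from the list.
set_default_env!("UCX_ERROR_SIGNALS", "SIGILL,SIGBUS,SIGFPE")

# Assumption: MPI would be loaded and initialized only after this point, e.g.
#   using MPI
#   MPI.Init()
```

Whether MPICH or Intel MPI expose an analogous setting is exactly the open question here; the sketch only shows the mechanism for the UCX variable quoted above.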

Thank you so much for your time.
