-
-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix close
error due to race in d_closeall
#248
Conversation
The below code that adds some code to regulate GC behavior can recreate the situation. julia> using Distributed
julia> addprocs(4);
julia> @everywhere begin
using Distributed, DistributedArrays
end
julia> A = rand(1:100, (100,100));
julia> DA = distribute(A);
julia> GC.enable(false)
true
julia> DA = nothing
julia> function DistributedArrays.d_from_weakref_or_d(id)
d = get(DistributedArrays.registry, id, nothing)
GC.enable(true)
GC.gc()
isa(d, WeakRef) && return d.value
return d
end
julia> DistributedArrays.d_closeall()
ERROR: MethodError: no method matching close(::Nothing)
Closest candidates are:
close(::Union{Base.AsyncCondition, Timer})
@ Base asyncevent.jl:162
close(::Union{FileWatching.FileMonitor, FileWatching.FolderMonitor, FileWatching.PollingFileWatcher})
@ FileWatching /data/Work/julia/binaries/julia-1.9.3/share/julia/stdlib/v1.9/FileWatching/src/FileWatching.jl:328
close(::DArray)
@ DistributedArrays ~/.julia/dev/DistributedArrays/src/core.jl:34
...
Stacktrace:
[1] d_closeall()
@ DistributedArrays ~/.julia/dev/DistributedArrays/src/core.jl:47
[2] top-level scope
@ REPL[9]:1 |
It seems possible that `DistributedArrays.d_closeall()` may encounter a condition where it finds a darray id in the `registry`, but the corresponding weakref value is `nothing` because the referenced darray got garbage collected. It has been enountered many times in CI and elsewhere, but is hard to replicate normally. Adding a check for the weakref value, before actually invoking `close` on it, to fix it.
Can we be sure that the remote memory is properly freed if the reference has been garbage collected? |
Yes, I think so, because the finalizer invokes julia> using Distributed
julia> addprocs(4);
julia> @everywhere begin
using Distributed, DistributedArrays
function print_registry_entries()
for k in keys(DistributedArrays.registry)
hasval = !(DistributedArrays.d_from_weakref_or_d(k) === nothing)
println("key $k hasval $hasval")
end
end
end
julia> @everywhere print_registry_entries()
julia> A = rand(1:100, (100,100));
julia> DA = distribute(A);
julia> GC.enable(false)
true
julia> DA = nothing
julia> @everywhere print_registry_entries()
key (1, 1) hasval true
From worker 5: key (1, 1) hasval true
From worker 3: key (1, 1) hasval true
From worker 2: key (1, 1) hasval true
From worker 4: key (1, 1) hasval true
julia> function DistributedArrays.d_from_weakref_or_d(id)
d = get(DistributedArrays.registry, id, nothing)
GC.enable(true)
GC.gc()
isa(d, WeakRef) && return d.value
return d
end
julia> @everywhere print_registry_entries()
key (1, 1) hasval false
From worker 2: key (1, 1) hasval true
From worker 3: key (1, 1) hasval true
From worker 5: key (1, 1) hasval true
From worker 4: key (1, 1) hasval true
julia> DistributedArrays.d_closeall()
julia> @everywhere print_registry_entries() Would it be good to add this as a test? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine as it is
It seems possible that
DistributedArrays.d_closeall()
may encounter a condition where it finds a darray id in theregistry
, but the corresponding weakref value isnothing
because the referenced darray got garbage collected. It has been enountered many times in CI and elsewhere, but is hard to replicate normally. Adding a check for the weakref value, before actually invokingclose
on it, to fix it.fixes #246