
download fails with FDWatcher: bad file descriptor (EBADF) #197

Open

kleinschmidt opened this issue Jun 7, 2022 · 13 comments

Comments
@kleinschmidt

On Julia 1.7.3, I've found that downloads sometimes fail with the following error:

UNHANDLED TASK ERROR: IOError: FDWatcher: bad file descriptor (EBADF)
Stacktrace:
[1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
@ Base ./task.jl:812
[2] wait()
@ Base ./task.jl:872
[3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
@ Base ./condition.jl:123
[4] wait(fdw::FileWatching._FDWatcher; readable::Bool, writable::Bool)
@ FileWatching /usr/local/julia/share/julia/stdlib/v1.7/FileWatching/src/FileWatching.jl:533
[5] wait
@ /usr/local/julia/share/julia/stdlib/v1.7/FileWatching/src/FileWatching.jl:504 [inlined]
[6] macro expansion
@ /usr/local/julia/share/julia/stdlib/v1.7/Downloads/src/Curl/Multi.jl:166 [inlined]

The line this points to in the Downloads.jl source is

events = try wait(watcher)

"Sometimes" here means "after millions of S3 requests in the span of multiple days of runtime with retry around the actual request-making code". (retry using the default settings, so with the default ExponentialBackOff schedule with a single retry). When this error occurred, it occurred multiple times, on multiple different pods (which by design are accessing different s3 URIs but still in the same region), so I'm wondering if it is somehow related to the "connection pool corruption" issue w/ AWS.jl. Another possibly relevant bit of context is that the code that actually is making the requests is actually doing an asyncmap over dozens (<100) of small s3 GET requests.

I'm afraid this happened in a long-running job that I can't interact with directly and don't have a reprex that I can share, but wanted to open an issue in case someone else has seen this or has advice on how to debug or what other information would be useful!
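Roughly, the request pattern looks like this (a minimal sketch; get_object and uris are hypothetical stand-ins for the real AWS.jl/S3 call and the actual list of URIs, which I can't share):

using Downloads

# Hypothetical stand-in for the real S3 GET (the real code goes through AWS.jl).
get_object(uri) = Downloads.download(uri, IOBuffer())

# retry with the default settings: default ExponentialBackOff schedule, single retry.
robust_get = retry(get_object)

# dozens (<100) of small GETs issued concurrently
results = asyncmap(robust_get, uris)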

@ericphanson
Contributor

"connection pool corruption" issue w/ AWS.jl

Ref JuliaCloud/AWS.jl#552

When this error occurred, it occurred multiple times, on multiple different pods (which by design are accessing different s3 URIs but still in the same region), so I'm wondering if it is somehow related to the "connection pool corruption" issue w/ AWS.jl

I don't think that AWS#552 would explain this if they are happening on different pods, since each pod should have its own downloader object, so they aren't sharing any connection pools between them. But maybe it could be something like: AWS flips out at some point in time, sends back garbage to each pod independently at the same time, that corrupts the downloader object in each pod's AWS module somehow, and then because we don't retry with fresh downloaders (which is what AWS#552 is about) they each fail on the retried attempts as well, so they all fail at the same time. (?)
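If the downloader object really is getting corrupted, one possible mitigation (a sketch of the idea only, not something AWS.jl does today; the function name is hypothetical) would be to construct a fresh Downloader on every attempt:

using Downloads

function get_with_fresh_downloader(uri)
    # A new Downloader means a new curl multi handle and connection pool,
    # so a corrupted pool from an earlier attempt can't be reused.
    dl = Downloads.Downloader()
    return Downloads.download(uri, IOBuffer(); downloader=dl)
end

robust_get = retry(get_with_fresh_downloader)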

@kleinschmidt
Author

Yup, the latter was my hunch.

@StefanKarpinski
Member

@vtjnash: is there something we could do here? I'm guessing it's the sock object that's messed up here, not just the watcher object that's constructed from it. Given that, it seems like the only option is to error and terminate the download. I'm not entirely sure what the right way to do that is; perhaps passing CURL_CSELECT_ERR into curl_multi_socket_action?
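For context, roughly what signalling the socket error to libcurl could look like via the LibCURL.jl bindings (a sketch only; multi_handle and sock are assumed to be the relevant curl multi handle and socket, and Downloads.jl's internal wrappers may differ):

using LibCURL

# Report the socket as errored so curl fails the associated transfer
# instead of continuing to wait on a dead file descriptor.
running = Ref{Cint}(0)
curl_multi_socket_action(multi_handle, sock, CURL_CSELECT_ERR, running)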

@vtjnash
Member

vtjnash commented Jun 9, 2022

Isn't this a duplicate of #186?

@StefanKarpinski
Member

Possibly. @kleinschmidt, @ericphanson, can you try using a newer version of Downloads? You can load a newer version by dev'ing this package and then changing the UUID. Or if possible just try Julia 1.8.
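Roughly, the workflow is (an approximate sketch; exact steps may vary):

# In the Pkg REPL:
pkg> dev https://github.com/JuliaLang/Downloads.jl

# Then edit the dev'd copy's Project.toml and replace the `uuid` entry with a
# freshly generated one (e.g. using UUIDs; uuid4()), so the bundled stdlib
# version isn't substituted when you load it with `using Downloads`.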

@kleinschmidt
Author

Will do, although without a clear reproducer I'm not sure how far we'll get! I take it the fix for #186 didn't get backported to 1.7.3?

@StefanKarpinski
Member

Not yet, but it could be.

@StefanKarpinski
Member

Well, actually, at this point a backport probably doesn't make sense because there's unlikely to be a 1.7.4 release, so I think the fix will be to go to 1.8 for this.

@albheim

albheim commented Jul 5, 2022

I get the same error, but on my local computer, and pretty much every time I install or update something. This happens on both 1.7.3 and 1.8.0-rc1.

EDIT: I think I might not have restarted my computer in a while, and after restarting I haven't seen it again so far.
EDIT2: It's back on both 1.7 and 1.8.

@diversable

I'm still getting the same error with Julia 1.8.1...

@giordano
Contributor

Note that even if #187 fixed this issue, the fix didn't make it into Julia v1.8: the commit of Downloads.jl used in Julia v1.8.1 is 2a21b15, which predates #187.

@ericphanson
Contributor

I am seeing this in CI logs on 1.8.0 as well:

Unhandled Task ERROR: IOError: FDWatcher: bad file descriptor (EBADF)
Stacktrace:
 [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
   @ Base ./task.jl:871
 [2] wait()
   @ Base ./task.jl:931
 [3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
   @ Base ./condition.jl:124
 [4] _wait(fdw::FileWatching._FDWatcher, mask::FileWatching.FDEvent)
   @ FileWatching /opt/hostedtoolcache/julia/1.8.0/x64/share/julia/stdlib/v1.8/FileWatching/src/FileWatching.jl:535
 [5] wait(fdw::FileWatching.FDWatcher)
   @ FileWatching /opt/hostedtoolcache/julia/1.8.0/x64/share/julia/stdlib/v1.8/FileWatching/src/FileWatching.jl:563
 [6] macro expansion
   @ /opt/hostedtoolcache/julia/1.8.0/x64/share/julia/stdlib/v1.8/Downloads/src/Curl/Multi.jl:166 [inlined]
 [7] (::Downloads.Curl.var"#40#46"{Int32, FileWatching.FDWatcher, Downloads.Curl.Multi})()
   @ Downloads.Curl ./task.jl:484
Unhandled Task ERROR: IOError: FDWatcher: bad file descriptor (EBADF)
Stacktrace:
 [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
   @ Base ./task.jl:871
 [2] wait()
   @ Base ./task.jl:931
 [3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
   @ Base ./condition.jl:124
 [4] _wait(fdw::FileWatching._FDWatcher, mask::FileWatching.FDEvent)
   @ FileWatching /opt/hostedtoolcache/julia/1.8.0/x64/share/julia/stdlib/v1.8/FileWatching/src/FileWatching.jl:535
 [5] wait(fdw::FileWatching.FDWatcher)
   @ FileWatching /opt/hostedtoolcache/julia/1.8.0/x64/share/julia/stdlib/v1.8/FileWatching/src/FileWatching.jl:563
 [6] macro expansion
   @ /opt/hostedtoolcache/julia/1.8.0/x64/share/julia/stdlib/v1.8/Downloads/src/Curl/Multi.jl:166 [inlined]
 [7] (::Downloads.Curl.var"#40#46"{Int32, FileWatching.FDWatcher, Downloads.Curl.Multi})()
   @ Downloads.Curl ./task.jl:484

My tests didn't actually fail though; maybe it happened somewhere where I have retries.

I see Julia v1.8.2 is on the same Downloads commit as 1.8.1, so it sounds like that won't be a fix either. Maybe we can backport it for 1.8.3?

@StefanKarpinski
Member

I don't think we've added any features to Downloads since then, so we can probably just bump Downloads on 1.8 to the latest version.
