Fix AsyncioConnection race conditions causing EBADF errors (#614) by dkropachev · Pull Request #697 · scylladb/python-driver

dkropachev · 2026-02-12T17:49:08Z

Summary

Fix close() to wait for _close() completion from non-event-loop threads, eliminating the window where is_closed=True but the socket fd is still open (root cause of EBADF)
Set last_error on server EOF in handle_read() so factory() detects dead connections instead of returning them
Add is_closed/is_defunct guard in push() to reject writes on closed connections
Treat BrokenPipeError/ConnectionResetError in handle_write() as clean peer disconnections instead of defuncting, and skip defunct() in both I/O handlers if the connection is already shutting down
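
The first bullet can be illustrated with a minimal sketch. It is not the driver's code: `SketchConnection`, `start_loop_thread`, `fd_released`, and the 5 second timeout are hypothetical names and values chosen for illustration. It only shows the general pattern of scheduling cleanup on the loop with `asyncio.run_coroutine_threadsafe` and blocking on the returned future when the caller is not the loop thread.

```python
import asyncio
import threading


def start_loop_thread():
    """Run a fresh event loop in a daemon thread (illustrative helper)."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


class SketchConnection:
    """Hypothetical stand-in for AsyncioConnection, not the real class."""

    def __init__(self, loop, loop_thread_id):
        self._loop = loop
        self._loop_thread_id = loop_thread_id
        self._lock = threading.Lock()
        self.is_closed = False
        self.fd_released = False  # stands in for "socket fd is closed"

    async def _close(self):
        # Placeholder for the real cleanup (remove reader/writer, close socket).
        await asyncio.sleep(0.01)
        self.fd_released = True

    def close(self):
        with self._lock:
            if self.is_closed:
                return
            self.is_closed = True
        future = asyncio.run_coroutine_threadsafe(self._close(), self._loop)
        # Blocking on the loop thread itself would deadlock, so wait only
        # when called from some other thread.
        if threading.get_ident() != self._loop_thread_id:
            future.result(timeout=5)
```

With this shape, a caller on another thread that sees `close()` return can rely on the cleanup having finished, which is the window the summary describes closing.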

Test plan

New unit tests in tests/unit/io/test_asyncio_race_614.py cover all four race conditions using real socket pairs
Full unit test suite passes with no regressions
Integration tests with TLS and node restart scenarios

fruch · 2026-02-22T07:13:30Z

@dkropachev would recommend testing SSL with this fix; we know SSL was failing/stuck with asyncio connection, hence all the fuss about it being the default or not

Lorak-mmk · 2026-02-23T11:01:00Z

we know SSL was failing/stuck with asyncio connection

It was not failing because of race conditions, but because setting up SSL socket has a totally different API in asyncio, and it was just never implemented here.

Lorak-mmk · 2026-02-23T11:06:10Z

+                try:
+                    self._socket.close()
+                except OSError:
+                    # Ignore if socket is already closed
+                    pass


Why would the socket be already closed? self._socket.close() is only called in this one place. _close is only called from close, which checks and sets is_closed, so _close has no way of running multiple times.

Replied here - #697 (comment)

Lorak-mmk · 2026-02-23T11:08:32Z

+                except Exception:
+                    # It is not critical if it fails, driver can keep working,
+                    # but it should not be happening, so logged as error
+                    log.error("Unexpected error removing reader for %s",
+                              self.endpoint, exc_info=True)
+


In the previous version you were catching OSError, and explained that it is thrown when the socket is closed. Again, I don't see how this is possible, because the only place where the socket is closed is after this code.

You're right, the socket can't be "already closed" here since _close() is the only place that closes it and it's guarded by is_closed. The comment was misleading — fixed it. The except OSError is still there as a safety net because close() can technically fail for other OS-level reasons (e.g. EIO on certain transports), even on first call.

Lorak-mmk · 2026-02-23T11:08:46Z

+            fd = self._socket.fileno()
+            if fd >= 0:


Why? When can fd be 0?

socket.fileno() returns -1 when the socket is already closed. The fd >= 0 check guards against calling remove_writer/remove_reader with an invalid fd. The >= 0 rather than > 0 is because 0 is technically a valid file descriptor (though rare for sockets — it would only happen if stdin was closed before the socket was created).
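
A quick way to check the `fileno()` behavior described above, using a local socket pair. The function name `fileno_before_and_after_close` is made up for this sketch.

```python
import socket


def fileno_before_and_after_close():
    """Return the fd reported by a socket before and after close()."""
    left, right = socket.socketpair()
    try:
        before = left.fileno()  # a valid descriptor, i.e. >= 0
        left.close()
        after = left.fileno()   # -1 once the socket object is closed
    finally:
        right.close()
    return before, after
```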

Lorak-mmk · 2026-02-23T11:09:12Z

+# Errno values that indicate the remote peer has disconnected.
+_PEER_DISCONNECT_ERRNOS = frozenset((
+    errno.ENOTCONN, errno.ESHUTDOWN,
+    errno.ECONNRESET, errno.ECONNABORTED,
+    errno.EBADF,
+))


Won't it be a ConnectionError in all those cases?

No, not all of them. Python only maps specific errno values to ConnectionError subclasses:

ECONNRESET → ConnectionResetError ✓

ECONNABORTED → ConnectionAbortedError ✓

ESHUTDOWN → BrokenPipeError ✓

ENOTCONN → plain OSError ✗

EBADF → plain OSError ✗

So catching only ConnectionError would miss ENOTCONN and EBADF.
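
The mapping above can be checked directly, since constructing `OSError` with an errno and a message returns the matching subclass when one exists. `exception_class_for` is a hypothetical helper name used only in this sketch.

```python
import errno


def exception_class_for(code):
    """Return the exception class Python picks for a given errno value."""
    return type(OSError(code, "simulated error"))


def is_caught_by_connection_error(code):
    """True if `except ConnectionError` would catch an OSError with this errno."""
    return issubclass(exception_class_for(code), ConnectionError)
```

For ENOTCONN and EBADF the second function returns False, which is why the diff above checks errno values rather than relying on `except ConnectionError` alone.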

Lorak-mmk · 2026-02-23T11:10:57Z

+        if self.is_closed or self.is_defunct:
+            raise ConnectionShutdown(
+                "Connection to %s is already closed" % self.endpoint)


As far as I can tell, push is only called from Connection::send_msg. send_msg already starts with:

    if self.is_defunct:
        raise ConnectionShutdown("Connection to %s is defunct" % self.endpoint)
    elif self.is_closed:
        raise ConnectionShutdown("Connection to %s is closed" % self.endpoint)
    elif not self._socket_writable:
        raise ConnectionBusy("Connection %s is overloaded" % self.endpoint)

So what does this check give us?

you are right, let's remove it.

Fix race conditions in AsyncioConnection that cause "[Errno 9] Bad file descriptor" errors during node restarts, especially with TLS:

1. close() now waits for _close() to complete when called from outside the event loop thread, eliminating the window where is_closed=True but the socket fd is still open.
2. handle_read() sets last_error on server EOF so factory() detects dead connections instead of returning them to callers.
3. handle_write() treats peer disconnections as clean close instead of defuncting, and both I/O handlers skip defunct() if the connection is already shutting down.

Peer-disconnect detection is extracted into _is_peer_disconnect() helper covering platform-specific behaviors:

- Windows: ProactorEventLoop raises plain OSError with winerror 10054 (WSAECONNRESET) or 10053 (WSAECONNABORTED) instead of ConnectionResetError. Detection uses ConnectionError base class plus winerror check.
- macOS: Raises OSError(57) ENOTCONN when writing to a peer-disconnected socket, which is not a ConnectionError subclass. Detection uses errno-based checks for ENOTCONN, ESHUTDOWN, ECONNRESET, and ECONNABORTED.
- Windows _close(): ProactorEventLoop does not support remove_reader/remove_writer (raises NotImplementedError). These calls are wrapped so the socket is always closed regardless, and try/finally ensures connected_event is always set even if cleanup fails.
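
A rough sketch of what a peer-disconnect check along these lines could look like. This is an assumption built from the commit message, not the helper from the PR: the function name `is_peer_disconnect`, the constant names, and the order of the checks are illustrative, and the errno set follows the commit message (the earlier diff snippet also listed EBADF).

```python
import errno

# Errno values named in the commit message as peer disconnections.
PEER_DISCONNECT_ERRNOS = frozenset((
    errno.ENOTCONN, errno.ESHUTDOWN,
    errno.ECONNRESET, errno.ECONNABORTED,
))

# Windows socket error codes: WSAECONNABORTED and WSAECONNRESET.
PEER_DISCONNECT_WINERRORS = frozenset((10053, 10054))


def is_peer_disconnect(exc):
    """Best-effort guess at whether exc means the remote side went away."""
    # Covers ConnectionResetError, ConnectionAbortedError, BrokenPipeError.
    if isinstance(exc, ConnectionError):
        return True
    if isinstance(exc, OSError):
        # Plain OSError with a disconnect errno (e.g. ENOTCONN on macOS).
        if exc.errno in PEER_DISCONNECT_ERRNOS:
            return True
        # `winerror` only exists on Windows, so read it defensively.
        if getattr(exc, "winerror", None) in PEER_DISCONNECT_WINERRORS:
            return True
    return False
```

A write handler could then branch on `is_peer_disconnect(err)` to close the connection quietly and call `defunct()` only for other errors.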

github-code-quality Bot found potential problems Feb 12, 2026

Comment thread tests/unit/io/test_asyncio_race_614.py Fixed

Comment thread cassandra/io/asyncioreactor.py Fixed

github-code-quality Bot found potential problems Feb 12, 2026

Comment thread cassandra/io/asyncioreactor.py Fixed

Comment thread cassandra/io/asyncioreactor.py Fixed

github-code-quality Bot found potential problems Feb 12, 2026

Comment thread cassandra/io/asyncioreactor.py Fixed

Comment thread cassandra/io/asyncioreactor.py Fixed

Comment thread cassandra/io/asyncioreactor.py Fixed

dkropachev force-pushed the fix/asyncio-close-race-614 branch 3 times, most recently from 46524ac to 7b815c8 Compare February 16, 2026 13:09

dkropachev self-assigned this Feb 16, 2026

dkropachev requested review from Lorak-mmk and sylwiaszunejko February 16, 2026 13:26

dkropachev marked this pull request as ready for review February 16, 2026 13:26

Lorak-mmk requested changes Feb 16, 2026

dkropachev force-pushed the fix/asyncio-close-race-614 branch from 2b9d555 to 85d08e1 Compare February 17, 2026 04:52

github-code-quality Bot found potential problems Feb 17, 2026

Comment thread cassandra/io/asyncioreactor.py Fixed

Comment thread cassandra/io/asyncioreactor.py Fixed

Comment thread cassandra/io/asyncioreactor.py Fixed

dkropachev force-pushed the fix/asyncio-close-race-614 branch from 85d08e1 to 776d2d3 Compare February 17, 2026 12:55

github-code-quality Bot found potential problems Feb 17, 2026

Comment thread cassandra/io/asyncioreactor.py Fixed

dkropachev force-pushed the fix/asyncio-close-race-614 branch from d73e85d to 9915064 Compare February 17, 2026 20:40

dkropachev requested a review from Lorak-mmk February 17, 2026 20:40

Lorak-mmk requested changes Feb 23, 2026

dkropachev force-pushed the fix/asyncio-close-race-614 branch from 9915064 to 4bfcd27 Compare March 13, 2026 15:49

Lorak-mmk force-pushed the master branch from f2a9e87 to 763af09 Compare June 15, 2026 10:57
