Skip to content

Fix AsyncioConnection race conditions causing EBADF errors (#614)#697

Open
dkropachev wants to merge 1 commit into
masterfrom
fix/asyncio-close-race-614
Open

Fix AsyncioConnection race conditions causing EBADF errors (#614)#697
dkropachev wants to merge 1 commit into
masterfrom
fix/asyncio-close-race-614

Conversation

@dkropachev

@dkropachev dkropachev commented Feb 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Fix close() to wait for _close() completion from non-event-loop threads, eliminating the window where is_closed=True but the socket fd is still open (root cause of EBADF)
  • Set last_error on server EOF in handle_read() so factory() detects dead connections instead of returning them
  • Add is_closed/is_defunct guard in push() to reject writes on closed connections
  • Treat BrokenPipeError/ConnectionResetError in handle_write() as clean peer disconnections instead of defuncting, and skip defunct() in both I/O handlers if the connection is already shutting down

Test plan

  • New unit tests in tests/unit/io/test_asyncio_race_614.py cover all four race conditions using real socket pairs
  • Full unit test suite passes with no regressions
  • Integration tests with TLS and node restart scenarios

Comment thread tests/unit/io/test_asyncio_race_614.py Fixed
Comment thread cassandra/io/asyncioreactor.py Fixed
Comment thread cassandra/io/asyncioreactor.py Fixed
Comment thread cassandra/io/asyncioreactor.py Fixed
Comment thread cassandra/io/asyncioreactor.py Fixed
Comment thread cassandra/io/asyncioreactor.py Fixed
Comment thread cassandra/io/asyncioreactor.py Fixed
@dkropachev dkropachev force-pushed the fix/asyncio-close-race-614 branch 3 times, most recently from 46524ac to 7b815c8 Compare February 16, 2026 13:09
@dkropachev dkropachev self-assigned this Feb 16, 2026
@dkropachev dkropachev marked this pull request as ready for review February 16, 2026 13:26
Comment thread tests/unit/io/test_asyncio_race_614.py Outdated
Comment thread cassandra/io/asyncioreactor.py Outdated
Comment thread cassandra/io/asyncioreactor.py Outdated
Comment thread cassandra/io/asyncioreactor.py Outdated
Comment thread cassandra/io/asyncioreactor.py Outdated
Comment thread cassandra/io/asyncioreactor.py Outdated
Comment thread cassandra/io/asyncioreactor.py Outdated
Comment thread cassandra/io/asyncioreactor.py
@dkropachev dkropachev force-pushed the fix/asyncio-close-race-614 branch from 2b9d555 to 85d08e1 Compare February 17, 2026 04:52
Comment thread cassandra/io/asyncioreactor.py Fixed
Comment thread cassandra/io/asyncioreactor.py Fixed
Comment thread cassandra/io/asyncioreactor.py Fixed
@dkropachev dkropachev force-pushed the fix/asyncio-close-race-614 branch from 85d08e1 to 776d2d3 Compare February 17, 2026 12:55
Comment thread cassandra/io/asyncioreactor.py Fixed
@dkropachev dkropachev force-pushed the fix/asyncio-close-race-614 branch from d73e85d to 9915064 Compare February 17, 2026 20:40
@dkropachev dkropachev requested a review from Lorak-mmk February 17, 2026 20:40
@fruch

fruch commented Feb 22, 2026

Copy link
Copy Markdown

@dkropachev would recommand testing SSL with this fix, we know SSL was failing/stuck with asyncio connection, hence all the fuss with it being a default or not

@Lorak-mmk

Copy link
Copy Markdown

we know SSL was failing/stuck with asyncio connection

It was not failing because of race conditions, but because setting up SSL socket has a totally different API in asyncio, and it was just never implemented here.

Comment on lines +205 to +209
try:
self._socket.close()
except OSError:
# Ignore if socket is already closed
pass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the socket be already closed? self._socket.close() is only called in this one place. _close is only called from close, which checks and sets is_closed, so _close has no way of running multiple times.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied here - #697 (comment)

Comment on lines +199 to +204
except Exception:
# It is not critical if it fails, driver can keep working,
# but it should not be happening, so logged as error
log.error("Unexpected error removing reader for %s",
self.endpoint, exc_info=True)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In previous version you were catching OsError, and explained that it is thrown when the socket is closed. Again, I don't see how this is possible, because the only place where the socket is being closed is after this code.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the socket can't be "already closed" here since _close() is the only place that closes it and it's guarded by is_closed. The comment was misleading — fixed it. The except OSError is still there as a safety net because close() can technically fail for other OS-level reasons (e.g. EIO on certain transports), even on first call.

Comment on lines +174 to +175
fd = self._socket.fileno()
if fd >= 0:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? When can fd be 0?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

socket.fileno() returns -1 when the socket is already closed. The fd >= 0 check guards against calling remove_writer/remove_reader with an invalid fd. The >= 0 rather than > 0 is because 0 is technically a valid file descriptor (though rare for sockets — it would only happen if stdin was closed before the socket was created).

Comment on lines +16 to +21
# Errno values that indicate the remote peer has disconnected.
_PEER_DISCONNECT_ERRNOS = frozenset((
errno.ENOTCONN, errno.ESHUTDOWN,
errno.ECONNRESET, errno.ECONNABORTED,
errno.EBADF,
))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't it be a ConnectionError in all those cases?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not all of them. Python only maps specific errno values to ConnectionError subclasses:

  • ECONNRESETConnectionResetError
  • ECONNABORTEDConnectionAbortedError
  • ESHUTDOWNBrokenPipeError
  • ENOTCONN → plain OSError
  • EBADF → plain OSError

So catching only ConnectionError would miss ENOTCONN and EBADF.

Comment thread cassandra/io/asyncioreactor.py Outdated
Comment on lines +224 to +226
if self.is_closed or self.is_defunct:
raise ConnectionShutdown(
"Connection to %s is already closed" % self.endpoint)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, push is only called from Connection::send_msg. send_msg already starts with:

        if self.is_defunct:
            raise ConnectionShutdown("Connection to %s is defunct" % self.endpoint)
        elif self.is_closed:
            raise ConnectionShutdown("Connection to %s is closed" % self.endpoint)
        elif not self._socket_writable:
            raise ConnectionBusy("Connection %s is overloaded" % self.endpoint)

So what does this check give us?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, let's remove it.

Fix race conditions in AsyncioConnection that cause "[Errno 9] Bad
file descriptor" errors during node restarts, especially with TLS:

1. close() now waits for _close() to complete when called from outside
   the event loop thread, eliminating the window where is_closed=True
   but the socket fd is still open.

2. handle_read() sets last_error on server EOF so factory() detects dead
   connections instead of returning them to callers.

3. handle_write() treats peer disconnections as clean close instead of
   defuncting, and both I/O handlers skip defunct() if the connection
   is already shutting down.

Peer-disconnect detection is extracted into _is_peer_disconnect() helper
covering platform-specific behaviors:

- Windows: ProactorEventLoop raises plain OSError with winerror 10054
  (WSAECONNRESET) or 10053 (WSAECONNABORTED) instead of
  ConnectionResetError. Detection uses ConnectionError base class plus
  winerror check.

- macOS: Raises OSError(57) ENOTCONN when writing to a
  peer-disconnected socket, which is not a ConnectionError subclass.
  Detection uses errno-based checks for ENOTCONN, ESHUTDOWN,
  ECONNRESET, and ECONNABORTED.

- Windows _close(): ProactorEventLoop does not support
  remove_reader/remove_writer (raises NotImplementedError). These calls
  are wrapped so the socket is always closed regardless, and try/finally
  ensures connected_event is always set even if cleanup fails.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants