
Segfault in free-threaded Python 3.14t during cluster shutdown (logging race in Cythonized cluster.so) #717

Description

@dkropachev

Summary

The "test libev (3.14t)" integration test job segfaults roughly 95% of the way through the cqlengine tests (during test_ifexists.py) on free-threaded Python 3.14t. The crash is a thread-safety race: logging calls format Host/EndPoint objects while those objects are being concurrently torn down during cluster shutdown.

Stack Trace

Fatal Python error: Segmentation fault

<Cannot show all threads while the GIL is disabled>
Stack (most recent call first):
  File ".../logging/__init__.py", line 1154 in emit
  File ".../concurrent/futures/thread.py", line 73 in run
  ...

Current thread's C stack trace (most recent call first):
  ... at _PyUnicodeWriter_WriteStr+0x77
  ... cassandra/cluster.cpython-314t-x86_64-linux-gnu.so, at +0x13b1cc
  ... cassandra/cluster.cpython-314t-x86_64-linux-gnu.so, at +0x14771a (PyObject_VectorcallMethod)
  ... cassandra/cluster.cpython-314t-x86_64-linux-gnu.so, at +0x1121f9

Root Cause

The crash is a race condition between cluster shutdown and executor threads doing logging with %s/%r formatting of Host objects. With the GIL disabled in 3.14t, this is no longer safe.

The race:

  • Thread A (main): Cluster.shutdown() → Session.shutdown() → iterates/clears _pools and shuts down pools, potentially triggering cleanup of Host/EndPoint objects
  • Thread B (executor worker): still running a submitted task (e.g. run_add_or_renew_pool(), on_down_potentially_blocking(), or a future callback), hits a logging call like log.debug("... %s", host), which calls Host.__str__() → str(self.endpoint) → DefaultEndPoint.__str__() → "%s:%d" % (self._address, self._port)

The segfault occurs in _PyUnicodeWriter_WriteStr because the endpoint's _address string (or the endpoint object itself) is being garbage collected by Thread A while Thread B is trying to format it.
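
The formatting path matters here because the standard logging module does not format %s arguments at the call site. It formats them when the record is emitted, on the thread that made the logging call, and only if the level is enabled. The sketch below shows that with toy classes. FakeHost, FakeEndPoint and ListHandler are hypothetical stand-ins written for illustration, not the driver's classes, and the sketch does not reproduce the crash itself.

```python
import logging


class FakeEndPoint:
    """Hypothetical stand-in for DefaultEndPoint (not the driver's class)."""

    def __init__(self, address, port):
        self._address = address
        self._port = port
        self.str_calls = 0  # counts how often formatting reaches this object

    def __str__(self):
        self.str_calls += 1
        return "%s:%d" % (self._address, self._port)


class FakeHost:
    """Hypothetical stand-in for Host; delegates __str__ to its endpoint."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def __str__(self):
        return str(self.endpoint)


class ListHandler(logging.Handler):
    """Collects formatted messages so the demo can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # getMessage() is where "msg % args" runs, i.e. where __str__ is called.
        self.messages.append(record.getMessage())


def demo():
    log = logging.getLogger("shutdown_race_demo")
    log.propagate = False
    log.setLevel(logging.DEBUG)
    handler = ListHandler()
    log.addHandler(handler)
    try:
        host = FakeHost(FakeEndPoint("127.0.0.1", 9042))

        # Level enabled: the host is formatted during emit.
        log.debug("Added pool for host %s to session", host)
        calls_when_enabled = host.endpoint.str_calls

        # Level disabled: the host object is never touched.
        log.setLevel(logging.INFO)
        log.debug("Added pool for host %s to session", host)
        calls_when_disabled = host.endpoint.str_calls - calls_when_enabled
    finally:
        log.removeHandler(handler)
    return handler.messages, calls_when_enabled, calls_when_disabled
```

One consequence, if this reading is right: the race would only be reachable when DEBUG (or the relevant level) is enabled for the driver's logger, which may explain why it shows up in CI runs with verbose logging.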

Shutdown order in Cluster.shutdown() (cluster.py:1772):

  1. self.is_shutdown = True
  2. self.scheduler.shutdown()
  3. self.control_connection.shutdown()
  4. Session shutdown → pool shutdown
  5. self.executor.shutdown() ← executor tasks may still be running during steps 2-4

The executor is shut down last, so in-flight tasks submitted before is_shutdown was set can still be executing during pool/session teardown.
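
As a rough illustration of why the ordering matters, here is a toy model in which the executor is drained before shared state is torn down. ToyCluster, submit_task and shutdown_drain_first are hypothetical names invented for this sketch; it only models the ordering idea with concurrent.futures and says nothing about whether draining first is safe in the real driver (for example, if in-flight tasks block on connections that are only closed later).

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyCluster:
    """Hypothetical miniature of the shutdown sequence (not the driver's Cluster)."""

    def __init__(self):
        self.executor = ThreadPoolExecutor(max_workers=2)
        self.pools = {"127.0.0.1:9042": "pool"}
        self.is_shutdown = False
        self.observed = []  # snapshots of pools as seen by executor tasks

    def submit_task(self, started, release):
        def task():
            started.set()              # signal that the task is in flight
            release.wait(timeout=5)    # simulate work that outlives is_shutdown
            self.observed.append(dict(self.pools))

        return self.executor.submit(task)

    def shutdown_drain_first(self):
        self.is_shutdown = True
        # Drain in-flight tasks BEFORE tearing down the state they read.
        self.executor.shutdown(wait=True)
        self.pools.clear()


def demo():
    cluster = ToyCluster()
    started = threading.Event()
    release = threading.Event()
    cluster.submit_task(started, release)
    started.wait(timeout=5)
    # Let the task finish shortly after shutdown has begun waiting on it.
    threading.Timer(0.05, release.set).start()
    cluster.shutdown_drain_first()
    return cluster.observed, cluster.pools
```

In this model the in-flight task always sees the pools intact, because teardown cannot start until the executor has finished.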

Likely logging call sites involved:

  • cluster.py:3247: log.debug("Removed connection pool for %r", host) in remove_pool()
  • cluster.py:3236: log.debug("Added pool for host %s to session", host) in run_add_or_renew_pool()
  • cluster.py:1955-1958: log.debug("... %s", host) in _start_reconnector(), called from @run_in_executor decorated on_down_potentially_blocking()
  • cluster.py:1843-1852: log.error/debug/info("... %s", host) in _on_up_future_completed()

Observed In

logs_58180451190.zip

Possible Fixes

  1. Reorder shutdown: shut down (or at least drain) the executor before shutting down sessions/pools, so no executor tasks are running during teardown
  2. Defensive string caching: cache str(host) / repr(host) results so they don't access mutable state during formatting
  3. Guard logging calls: check is_shutdown before logging in executor-submitted callbacks, or catch exceptions in __str__/__repr__
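
Options 2 and 3 could look roughly like the sketch below. CachedEndPoint and debug_unless_shutdown are hypothetical names made up for illustration, not existing driver APIs. Both are mitigations under assumptions rather than complete fixes: a cached string still lives on the endpoint object, so it does not help if the object itself is freed, and the is_shutdown check is a check-then-act that only narrows the race window.

```python
import logging


class CachedEndPoint:
    """Hypothetical endpoint that precomputes its string form once.

    __str__ and __repr__ read a single immutable attribute instead of
    re-formatting _address and _port on every logging call.
    """

    __slots__ = ("_address", "_port", "_str")

    def __init__(self, address, port):
        self._address = address
        self._port = port
        self._str = "%s:%d" % (address, port)  # computed while state is valid

    def __str__(self):
        return self._str

    def __repr__(self):
        return "<CachedEndPoint %s>" % self._str


def debug_unless_shutdown(owner, log, msg, *args):
    """Hypothetical guard: skip the log call once owner.is_shutdown is set.

    Returns True if the message was passed to the logger, False if skipped.
    The check and the log call are not atomic, so this reduces but does not
    eliminate the chance of formatting an object during teardown.
    """
    if getattr(owner, "is_shutdown", False):
        return False
    log.debug(msg, *args)
    return True
```

A call site inside an executor task would then use debug_unless_shutdown(self, log, "Added pool for host %s to session", host) in place of the direct log.debug call.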
