
Fix explicit wait_for_schema_agreement() failing #837

Open
nyh wants to merge 2 commits into scylladb:master from nyh:fix-604

Conversation


@nyh nyh commented May 3, 2026

The Python driver's wait_for_schema_agreement() function has an optional connection parameter, and whether or not this parameter is given corresponds to two different use cases:

  1. connection is given when the user's DDL request generated a "RESULT_KIND_SCHEMA_CHANGE" response, and the driver is required to check schema agreement through that same node.
  2. connection is not given, so the driver may connect to any node, and will typically use its "control connection".

The second use case sometimes failed if one of the nodes had recently gone down. The problem is that the control connection could be the one to the node that went down, and the driver had not yet set up a new control connection. wait_for_schema_agreement() then hit an error on that one specific control connection and, instead of retrying on a new control connection, gave up immediately - which was wrong. The first patch in this series fixes this wait_for_schema_agreement(connection=None) case, and the second patch adds reproducing tests using in-process MockConnection/MockCluster objects.

This fix will improve, for example, ScyllaDB's dtest. dtest's create_ks_query() runs session.cluster.control_connection.wait_for_schema_agreement(wait_time=120). It is arguable whether it needs to do this (since the DDL that preceded it will have already waited for schema agreement!), but it does. And that used to fail - rarely, but not rarely enough - when one of the nodes went down just before this function was called.
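For reference, a minimal sketch of the call pattern in question, assuming a locally reachable multi-node cluster (nothing below is part of the patch):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Stop one node here (outside this script, e.g. with ccm or nodetool), ideally
# the one the control connection happens to be attached to, then:
agreed = session.cluster.control_connection.wait_for_schema_agreement(wait_time=120)

# Before this series, the call could raise ConnectionShutdown instead of retrying
# on the reconnected control connection; with the fix it returns True once the
# remaining nodes agree on the schema.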

Fixes #604.

nyh added 2 commits May 3, 2026 18:36
…nection

When wait_for_schema_agreement() is called without an explicit connection
(e.g., via the public API or from dtest's create_ks()), it uses the
control connection. If that connection is closed concurrently — for
example because a node was stopped — a ConnectionShutdown exception is
raised and propagated to the caller, causing an unrecoverable failure.

The root cause is that the connection was captured once at the start of
the function and never refreshed. The control connection can reconnect
to another node at any time, but the function was unaware of this.

Fix this by distinguishing between two call modes:

1. Explicit connection (passed after DDL RESULT_KIND_SCHEMA_CHANGE):
   The caller must verify schema agreement through the specific
   coordinator node. A ConnectionShutdown here is a real error and is
   still re-raised.

2. Control-connection mode (connection=None): There is no hard
   requirement on which node is queried. When a ConnectionShutdown
   occurs, wait briefly for the control connection to reconnect, pick
   up the new self._connection, and retry within the existing timeout
   budget. This matches the behavior for OperationTimedOut, which
   already retried in a loop.

This avoids the need for callers to work around the bug by discarding
and recreating their session after a node stop, as seen for example
in several scylla-dtest workarounds (e.g., SCYLLADB-1256).
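Schematically, the fixed control flow looks roughly like the sketch below. This is a simplified illustration, not the actual patch: _schemas_agree() is a placeholder for the existing peers/local comparison, and details such as preloaded results, per-query timeouts and the OperationTimedOut retry are omitted.

import time
from cassandra.connection import ConnectionShutdown

def wait_for_schema_agreement(self, connection=None, wait_time=None):
    explicit_connection = connection is not None
    if connection is None:
        connection = self._connection
    deadline = time.time() + (wait_time or 10.0)   # default here is illustrative
    while time.time() < deadline:
        try:
            if self._schemas_agree(connection):     # placeholder for the real check
                return True
        except ConnectionShutdown:
            if explicit_connection:
                # DDL path: agreement must be verified through this exact node.
                raise
            # Control-connection mode: give the reconnect a moment to finish,
            # then retry on whatever the current control connection is.
            time.sleep(0.2)
            connection = self._connection
        else:
            time.sleep(0.2)
    return False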

Fixes: scylladb#604
…ction

Add two unit tests to ControlConnectionTest that cover the bug fixed in
the previous commit (scylladb#604):

test_wait_for_schema_agreement_recovers_from_closed_control_connection:
  Simulates a node being stopped while wait_for_schema_agreement() is
  running in control-connection mode (no explicit connection passed).
  The mock initial connection raises ConnectionShutdown on its first
  query and simultaneously installs a replacement connection, mirroring
  what the driver's reconnect logic does in production.
  Before the fix this exception propagated to the caller.
  After the fix the function transparently picks up the new control
  connection and returns True.

test_wait_for_schema_agreement_explicit_connection_raises_on_shutdown:
  Guards the DDL code path, where the caller passes an explicit
  connection to ensure schema agreement is verified through the DDL
  coordinator. In this case ConnectionShutdown must still be re-raised
  — there is no safe fallback to a different node.
  This test passes both before and after the fix, confirming that the
  fix does not weaken the explicit-connection path.
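For orientation, the first test boils down to roughly the following shape. This is a sketch only; make_control_connection() and make_mock_connection() are hypothetical stand-ins for the MockCluster/MockConnection helpers this test file already provides.

from cassandra.connection import ConnectionShutdown

def test_recovery_sketch():
    cc = make_control_connection()        # hypothetical fixture (MockCluster + MockConnection)
    old_conn = cc._connection
    new_conn = make_mock_connection()     # hypothetical replacement connection

    def shutdown_and_reconnect(*args, **kwargs):
        # Simulate the node stopping mid-call: the in-flight query fails and,
        # as the driver's reconnect logic would do, a replacement control
        # connection is installed at the same time.
        cc._connection = new_conn
        raise ConnectionShutdown("node was stopped")

    # The control connection issues its peers/local queries through the mock
    # connection; make the first such query fail as described above.
    old_conn.wait_for_responses.side_effect = shutdown_and_reconnect

    # Before the fix this raised ConnectionShutdown to the caller; after it,
    # the wait transparently picks up new_conn and returns True.
    assert cc.wait_for_schema_agreement() is True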

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Copilot AI left a comment


Pull request overview

This PR fixes ControlConnection.wait_for_schema_agreement(connection=None) to recover when the control connection is closed during the schema agreement polling loop (e.g., a node goes down) by retrying using an updated/reconnected control connection, and adds unit test coverage for the regression.

Changes:

  • Retry schema agreement checks on ConnectionShutdown when no explicit connection was provided, switching to a refreshed control connection when available.
  • Preserve existing behavior when an explicit connection is passed (propagate ConnectionShutdown).
  • Add unit tests covering both the recovery path (implicit control connection) and the explicit-connection failure path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File descriptions:

  • cassandra/cluster.py: Adds ConnectionShutdown recovery logic for the implicit-control-connection schema agreement path.
  • tests/unit/test_control_connection.py: Adds regression tests for control connection shutdown recovery and explicit connection behavior.


Comment thread: tests/unit/test_control_connection.py

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

tests/unit/test_control_connection.py:26

  • There are now two separate imports from cassandra.connection (ConnectionShutdown and EndPoint/DefaultEndPoint/DefaultEndPointFactory). Consider combining them into a single import statement to avoid duplication and keep imports organized (helps linters like isort/flake8).
from cassandra import OperationTimedOut, SchemaTargetType, SchemaChangeType
from cassandra.connection import ConnectionShutdown
from cassandra.protocol import ResultMessage, RESULT_KIND_ROWS
from cassandra.cluster import ControlConnection, _Scheduler, ProfileManager, EXEC_PROFILE_DEFAULT, ExecutionProfile
from cassandra.pool import Host
from cassandra.connection import EndPoint, DefaultEndPoint, DefaultEndPointFactory
from cassandra.policies import (SimpleConvictionPolicy, RoundRobinPolicy,
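The combined form the suggestion points at would be, roughly:

from cassandra.connection import (ConnectionShutdown, EndPoint, DefaultEndPoint,
                                  DefaultEndPointFactory)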



nyh commented May 4, 2026

@avikivity @mykaul this is the kind of issue that has caused numerous different Scylla tests over the years to sporadically fail (one of the recent ones: https://scylladb.atlassian.net/browse/SCYLLADB-1256), and it's important to finally fix it.

@sylwiaszunejko sylwiaszunejko requested a review from Lorak-mmk May 4, 2026 10:30

@Lorak-mmk Lorak-mmk left a comment


Looks good, thanks!

@Lorak-mmk

I'll wait with merging so that @dkropachev can take a look too.


@sylwiaszunejko sylwiaszunejko left a comment


@dkropachev ping, could you also take a look?


dkropachev commented May 6, 2026

@Lorak-mmk, I wonder, doesn't it have the same correctness problem?
I think that we shouldn't mix two cases:

  1. When the driver calls it, it knows exactly which connection/host the schema change came from, so we check schema agreement on exactly that host.
  2. When it is called from outside the driver, we don't know where the schema update happened, so we need to check the schema on all the nodes (probably with some scope: dc or whole cluster).

And I feel that the two cases need separate APIs: for the first one the current API on the CC is fine; the second one should definitely be on the Session.
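As a purely hypothetical sketch of what such a Session-level entry point might look like (no such API exists in the driver today):

def wait_for_schema_agreement(self, wait_time=None, scope='cluster'):
    """Hypothetical Session API: poll the schema version of every live node
    (or every node in the local DC when scope='dc') until they all match or
    wait_time expires; return True on agreement, False otherwise."""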

@Lorak-mmk

I wonder, doesn't it have the same correctness problem?

No, this PR doesn't introduce any new correctness problem. It only touches the case without an explicit connection (so the case where the user called it).
In that case, the current semantics is that it awaits schema agreement on the control connection. This PR makes it resistant to a CC reconnection happening during this wait. I see no bug - the semantics is still that after the call returns, there is schema agreement on the current CC (the only thing that changes is that the CC may be different than before the call).

What you propose (checking the schema on all nodes) is imo a good idea for a semantics change - but it's a separate task that should not block this PR.
Btw, this is already how we do the schema agreement check in the Rust Driver - we issue a system.local query to all nodes, instead of checking system.peers and system.local on a single node.
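For illustration, the all-nodes style of check described above could be expressed with the Python driver's public APIs roughly as follows. This is a sketch only, not what this PR or the driver does; routing each query through a throwaway single-host whitelist profile is just one way to target a specific node.

from cassandra.cluster import ExecutionProfile
from cassandra.policies import WhiteListRoundRobinPolicy

def schema_versions_per_node(cluster, session):
    # Ask every known node for its own schema_version, routing each query to
    # exactly one host via a single-host whitelist profile.
    versions = {}
    for host in cluster.metadata.all_hosts():
        name = 'only-%s' % host.address
        cluster.add_execution_profile(
            name,
            ExecutionProfile(
                load_balancing_policy=WhiteListRoundRobinPolicy([host.address])))
        row = session.execute("SELECT schema_version FROM system.local",
                              execution_profile=name).one()
        versions[host.address] = row.schema_version
    return versions

def all_nodes_agree(cluster, session):
    # Agreement means every node reports the same schema version.
    return len(set(schema_versions_per_node(cluster, session).values())) == 1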


dkropachev commented May 6, 2026

I wonder, doesn't it have the same correctness problem?

No, this PR doesn't introduce any new correctness problem. It only touches the case without an explicit connection (so the case where the user called it). In that case, the current semantics is that it awaits schema agreement on the control connection. This PR makes it resistant to a CC reconnection happening during this wait. I see no bug - the semantics is still that after the call returns, there is schema agreement on the current CC (the only thing that changes is that the CC may be different than before the call).

What you propose (checking the schema on all nodes) is imo a good idea for a semantics change - but it's a separate task that should not block this PR. Btw, this is already how we do the schema agreement check in the Rust Driver - we issue a system.local query to all nodes, instead of checking system.peers and system.local on a single node.

I did not put it clearly. What I was saying is that the user should not call session.cluster.control_connection.wait_for_schema_agreement at all; in fact it should be private, since there is no way the user can know whether the control connection has enough information to conclude that the cluster agrees on the schema. If we keep it this way it will lead to false positives.
In other words, this PR fixes a case that should not even be handled by this API; instead we should implement a Session API and update dtest to use it instead of the current one.

@Lorak-mmk

What I was saying is that the user should not call session.cluster.control_connection.wait_for_schema_agreement

I agree that control_connection should not be accessed. Unfortunately it is.

since there is no way the user can know whether the control connection has enough information to conclude that the cluster agrees on the schema

And I also agree that the semantics of the current API may not be very useful in some situations, and that we should introduce a better one.
Still, this does not conflict with this PR. We can both introduce a new API and make the current one a bit better.



Development

Successfully merging this pull request may close these issues.

wait_for_schema_agreement fails when connection is closed
