Commit fb3ce33
session: fix pool renewal race causing double statement execution
When two or more nodes are bootstrapped concurrently, the Python driver
can execute the same CQL statement twice, causing spurious "already
exists" errors in the caller. This has shown up as flaky test
failures across the ScyllaDB test suite for the past two years, and
has been worked around by using idempotent DDL forms (IF NOT EXISTS /
IF EXISTS) in dozens of tests.
Root cause
----------
The race unfolds as follows:
1. Two on_add notifications arrive at roughly the same time, one for
each new node. Each one calls session.add_or_renew_pool(), which
submits run_add_or_renew_pool() to the thread pool and returns.
Both submissions are in-flight concurrently.
2. The first add_or_renew_pool() finishes and calls _finalize_add(),
which notifies load-balancing policies and then calls
session.update_created_pools() for every live session.
3. update_created_pools() iterates all known hosts. For the second
host, whose run_add_or_renew_pool() has not yet completed, it sees
self._pools.get(host) == None (or a shut-down pool) and therefore
submits *another* run_add_or_renew_pool() for that host.
4. Now two tasks are connecting to the same host. The first one
finishes and installs pool-A in self._pools, then runs a statement
(e.g. CREATE ROLE) that is in-flight on pool-A.
5. The second task finishes, reads the stale `previous = self._pools.get(host)`
value (captured *before* the lock was taken — another bug), installs
pool-B and then shuts down pool-A. The in-flight CREATE ROLE request
is orphaned; the driver retries it on pool-B. The server executes it
a second time and returns "Role ... already exists".
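
The sequence above can be replayed deterministically with a toy model.
This is only an illustrative sketch: ToyPool, ToySession and run_task
are hypothetical names, the thread pool is replaced by a list of queued
tasks that are run by hand, and the stale read from step 5 is not
modelled because everything runs on one thread.

```python
class ToyPool:
    """Hypothetical stand-in for a connection pool."""

    def __init__(self, name):
        self.name = name
        self.is_shutdown = False

    def shutdown(self):
        self.is_shutdown = True


class ToySession:
    """Hypothetical model of the pre-fix bookkeeping, not the driver's code."""

    def __init__(self):
        self.pools = {}
        self.submitted = []  # hosts with a queued pool-creation task

    def add_or_renew_pool(self, host):
        # Pre-fix behaviour: always queue a new creation task.
        self.submitted.append(host)

    def update_created_pools(self, known_hosts):
        # Any host without a live pool gets another creation task.
        for host in known_hosts:
            pool = self.pools.get(host)
            if pool is None or pool.is_shutdown:
                self.add_or_renew_pool(host)

    def run_task(self, host, pool_name):
        # Stands in for one queued creation task finishing.
        new_pool = ToyPool(pool_name)
        previous = self.pools.get(host)
        self.pools[host] = new_pool
        if previous is not None:
            previous.shutdown()  # in-flight requests on it are orphaned
        return new_pool


session = ToySession()
session.add_or_renew_pool("node-1")                 # step 1
session.add_or_renew_pool("node-2")                 # step 1
session.run_task("node-1", "pool-1")                # step 2
session.update_created_pools(["node-1", "node-2"])  # step 3
duplicate_submissions = session.submitted.count("node-2")
pool_a = session.run_task("node-2", "pool-A")       # step 4
pool_b = session.run_task("node-2", "pool-B")       # step 5
```

In this model node-2 ends up with two queued tasks, and pool-A is shut
down as soon as the second task installs pool-B.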
Fix
---
Three coordinated changes to cassandra/cluster.py:
* Session.__init__: add self._pending_pool_futures = {}, a dict mapping
host -> Future for any in-flight pool creation, guarded by _lock.
* add_or_renew_pool: before submitting run_add_or_renew_pool(), check
_pending_pool_futures under _lock. If an in-flight future already
exists for the host, return it immediately — this is the primary fix
that prevents the duplicate submission from update_created_pools.
Additionally, move the `previous = self._pools.get(host)` read inside
the lock so the live-pool check is atomic with the installation of the
new pool: if a concurrent creation has already installed a live pool
by the time we finish connecting, discard our new pool instead of
replacing the live one (defense-in-depth).
Cleanup of _pending_pool_futures is handled by a done_callback
registered on the future immediately after it is stored, both
operations performed under _lock. The callback only removes the entry
if it still points at the same future it was registered on, so a
concurrent remove_pool followed by a new add_or_renew_pool is not
affected. This guarantees cleanup under all exit paths including
unhandled exceptions inside run_add_or_renew_pool, and avoids the
race where a fast-completing task pops the key before the outer code
has stored the future.
* remove_pool: clear _pending_pool_futures[host] under _lock so that
if a host is removed and immediately re-added, add_or_renew_pool
submits a fresh creation rather than reusing a stale done future.
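
The three changes can be sketched together as follows. This is a
minimal sketch under stated assumptions, not the code in
cassandra/cluster.py: PoolRegistry, StubPool, slow_connect and
_clear_pending are hypothetical names, the lock is assumed to be
reentrant (a done callback can run inline on the submitting thread if
the task has already finished), and submitting while holding the lock
is an assumption of this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StubPool:
    """Hypothetical pool with just enough surface for the sketch."""

    def __init__(self):
        self.is_shutdown = False

    def shutdown(self):
        self.is_shutdown = True


class PoolRegistry:
    """Hypothetical stand-in for the Session pool bookkeeping."""

    def __init__(self, executor, connect):
        self._lock = threading.RLock()  # assumption: reentrant lock
        self._executor = executor
        self._connect = connect
        self._pools = {}
        self._pending_pool_futures = {}  # host -> in-flight Future

    def add_or_renew_pool(self, host):
        with self._lock:
            existing = self._pending_pool_futures.get(host)
            if existing is not None:
                return existing  # primary fix: reuse the in-flight creation
            future = self._executor.submit(self._run_add_or_renew_pool, host)
            self._pending_pool_futures[host] = future
            future.add_done_callback(
                lambda f, h=host: self._clear_pending(h, f))
            return future

    def _run_add_or_renew_pool(self, host):
        new_pool = self._connect(host)  # slow part, done outside the lock
        with self._lock:
            previous = self._pools.get(host)  # read under the lock
            if previous is not None and not previous.is_shutdown:
                to_close = new_pool  # a live pool won the race: keep it
            else:
                self._pools[host] = new_pool
                to_close = previous
        if to_close is not None:
            to_close.shutdown()
        return True

    def remove_pool(self, host):
        with self._lock:
            self._pending_pool_futures.pop(host, None)
            pool = self._pools.pop(host, None)
        if pool is not None:
            pool.shutdown()

    def _clear_pending(self, host, future):
        with self._lock:
            # Identity guard: only remove the entry this callback belongs to.
            if self._pending_pool_futures.get(host) is future:
                del self._pending_pool_futures[host]


gate = threading.Event()


def slow_connect(host):
    gate.wait(timeout=5)  # keep the first creation in flight
    return StubPool()


with ThreadPoolExecutor(max_workers=2) as executor:
    registry = PoolRegistry(executor, slow_connect)
    first = registry.add_or_renew_pool("node-2")
    second = registry.add_or_renew_pool("node-2")  # e.g. update_created_pools
    same_future = first is second
    gate.set()
    first.result(timeout=5)
```

The second call gets the first call's future back, so only one
connection attempt is made for the host, and the pending entry is gone
once the executor has drained.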
Tests
-----
Five new unit tests are added in PoolRenewalRaceTest
(tests/unit/test_cluster.py). They exercise the new code paths without
requiring a real cluster connection by constructing a minimal Session
via object.__new__ and mocking the executor and profile manager:
* test_add_or_renew_pool_reuses_inflight_future: places a pending
Future in _pending_pool_futures and verifies that add_or_renew_pool
returns it without submitting a new task to the executor.
* test_add_or_renew_pool_discards_duplicate_when_live_pool_exists:
exercises the real production code path by patching HostConnection
to a lightweight stub and using a synchronous executor shim that
runs the submitted callable inline. Pre-installs a live pool for
the host, then calls add_or_renew_pool() and asserts that the live
pool is not replaced and the newly connected stub pool is shut down.
* test_remove_pool_clears_pending_future: verifies that remove_pool
clears _pending_pool_futures so the next add_or_renew_pool call
submits a fresh task.
* test_done_callback_clears_pending_future: verifies that the
done_callback fires and removes the entry from _pending_pool_futures
once the future completes.
* test_done_callback_does_not_clear_newer_future: verifies the identity
guard — an old future's callback does not evict a newer future that
was installed in its place after a remove_pool + add_or_renew_pool.
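
For illustration, the synchronous executor shim and the identity guard
could be exercised along these lines. This is a standalone sketch, not
the tests from tests/unit/test_cluster.py: InlineExecutor,
clear_if_same and IdentityGuardSketch are hypothetical names, and a
plain dict stands in for _pending_pool_futures.

```python
import unittest
from concurrent.futures import Future


class InlineExecutor:
    """Hypothetical shim: runs the callable inline, returns a done Future."""

    def __init__(self):
        self.submitted = 0

    def submit(self, fn, *args, **kwargs):
        self.submitted += 1
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


def clear_if_same(pending, host, future):
    # Identity guard: drop the entry only if it is still this future.
    if pending.get(host) is future:
        del pending[host]


class IdentityGuardSketch(unittest.TestCase):
    def test_callback_clears_own_future(self):
        pending = {}
        future = Future()
        pending["host"] = future
        future.add_done_callback(
            lambda f: clear_if_same(pending, "host", f))
        future.set_result(True)
        self.assertNotIn("host", pending)

    def test_old_callback_keeps_newer_future(self):
        pending = {}
        old, new = Future(), Future()
        pending["host"] = old
        old.add_done_callback(
            lambda f: clear_if_same(pending, "host", f))
        pending["host"] = new  # as after remove_pool + add_or_renew_pool
        old.set_result(True)
        self.assertIs(pending["host"], new)

    def test_inline_executor_runs_synchronously(self):
        executor = InlineExecutor()
        future = executor.submit(lambda: 42)
        self.assertTrue(future.done())
        self.assertEqual(future.result(), 42)
        self.assertEqual(executor.submitted, 1)
```

Saved to a file, the sketch can be run with python -m unittest.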
Fixes: #3171
Parent: cd9f525
2 files changed: 255 additions and 4 deletions