-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Fix BackupS3BlobCorrectness race condition with backup workers #12482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Here is running the single test only 100k times:
|
Convert to draft till sure of this fix (100k runs of the test doesn't guarantee 100k of mixed tests will pass) |
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
cf02d2c
to
aebfe72
Compare
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
The test was failing intermittently because checkAndDisableBackupWorkers() (called automatically after submitBackup) would disable backup workers before the snapshot could be created, causing restore_invalid_version errors. Fixes: - Re-enable backup workers immediately after submitBackup() to ensure snapshot creation completes - Set waitForQuiescenceBegin=false to avoid unnecessary quietDatabase() delay
…tBackup so backup agents don't go away
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general question on how to debug mock S3 tests: how can you check the files stored on mock S3 after the test failure? In the past, I've been using the backup files left on the disk to aid debugging. For mock S3, the files are cached in memory? Then it might be hard to debug its usage. Is there anyway to make it easy?
// Wait for backup to become restorable if it's still in progress | ||
// This handles cases where cluster recoveries delay snapshot completion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this a separate actor for clarity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np
true, | ||
StopWhenDone{ !stopDifferentialDelay }, | ||
UsePartitionedLog::False, | ||
UsePartitionedLog::True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if this was False
, then we are using backup V1, which doesn't need backup workers. Why would the test fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that having it false, after calling submitBackup, when we came back, the backup workers were being shut down which disrupted snapshot range tasks... they wouldn't complete... No snapshot manifest was written so recovery would never run (If you have an alternative, I should try, I'd be happy to give it a go).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried without setting UsePartitionedLog to true and removing the backup worker stuff and
20251022-020938-stack_backup_9-04e2387024d4e9d8 compressed=True data_size=55712465 duration=6193024 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:06:49 sanity=False started=100000 stopped=20251022-031627 submitted=20251022-020938 timeout=5400 username=stack_backup_9
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Yeah. Mocks3 is memory based. Tests verify what was stored on s3 by variations on reading back what they wrote (in the s3client tests its explicit; in the backup tests its backup, clear, restore, and then verify the chain of keyvalues we've been writing). Up to this, if test fails, I've been looking at http back and forth headers or dumping content if necessary to figure where we went wrong. I can look at adding persistence. Intentionally I've avoided it up to this in effort at making sure we have constant read/write so tests can pass determinism checks. |
…he UsePartitionedLog=True code
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Ran 250k
Two failures:
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
The test was failing intermittently because checkAndDisableBackupWorkers() (called automatically after submitBackup) would disable backup workers before the snapshot could be created, causing restore_invalid_version errors.
Fixes: