Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_peer_recovery is flaky because of unreliable writing of system_id in safekeeper's WAL #10596

Open
alexanderlaw opened this issue Jan 30, 2025 · 0 comments
Labels
a/test/flaky/investigated a/test/flaky Area: related to flaky tests c/storage/safekeeper Component: storage: safekeeper t/bug Issue Type: Bug

Comments

@alexanderlaw
Copy link
Contributor

While investigating a recent failure of test_peer_recovery:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10559/13030789349/index.html#/testresult/f73891f617c521e9

test_runner/regress/test_wal_acceptor.py:1472: in test_peer_recovery
    cmp_sk_wal([sk1, sk2], tenant_id, timeline_id)
test_runner/regress/test_wal_acceptor.py:1284: in cmp_sk_wal
    assert (mismatch, not_regular) == (
E   AssertionError: WAL segs /tmp/test_output/test_peer_recovery[release-pg15]-1/repo/safekeepers/sk1/42fdc4b6fbf4334a922994b0a6c92297/dcffdf8a5cfeaada20697ff12e7eed5c/000000010000000000000001 and /tmp/test_output/test_peer_recovery[release-pg15]-1/repo/safekeepers/sk2/42fdc4b6fbf4334a922994b0a6c92297/dcffdf8a5cfeaada20697ff12e7eed5c/000000010000000000000001 on sks 1 and 2 are not identic
E   assert (['0000000100...0000001'], []) == ([], [])
E     At index 0 diff: ['000000010000000000000001'] != []
E     Full diff:
E     - ([], [])
E     + (['000000010000000000000001'], [])

I reproduced it locally (within 100 iterations with 10 test instances running in parallel) and discovered, that the difference between sk1 and sk2 WAL is always at offset 0x18, e. g.:

--- .../test_output/test_peer_recovery_4[release-pg15]/repo/safekeepers/sk1/9035d8c94ba074867572ea7258046648/8ae4021ce4d1668ddc8aed7dd6b94b7f/000000010000000000000001.hex   2025-01-30 14:21:25.466694628 +0200
+++ .../test_output/test_peer_recovery_4[release-pg15]/repo/safekeepers/sk2/9035d8c94ba074867572ea7258046648/8ae4021ce4d1668ddc8aed7dd6b94b7f/000000010000000000000001.hex   2025-01-30 14:21:25.758689647 +0200
@@ -1,5 +1,5 @@
 00000000: 10d1 0200 0100 0000 0000 0001 0000 0000  ................
-00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
+00000010: 0000 0000 0000 0000 7516 6e5b bd6e 9b67  ........u.n[.n.g
 00000020: 0000 0001 0020 0000 0000 0000 0000 0000  ..... ..........
 00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................

or:

--- .../test_output/test_peer_recovery_4[release-pg15]/repo/safekeepers/sk1/639027be5aa883cf798993a9e984b01b/c30980843ada44c730bacd0fa6af834e/000000010000000000000001.hex   2025-01-30 15:37:00.663555859 +0200
+++ .../test_output/test_peer_recovery_4[release-pg15]/repo/safekeepers/sk2/639027be5aa883cf798993a9e984b01b/c30980843ada44c730bacd0fa6af834e/000000010000000000000001.hex   2025-01-30 15:37:00.952550954 +0200
@@ -1,5 +1,5 @@
 00000000: 10d1 0200 0100 0000 0000 0001 0000 0000  ................
-00000010: 0000 0000 0000 0000 ebf8 b5c2 7580 9b67  ............u..g
+00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000020: 0000 0001 0020 0000 0000 0000 0000 0000  ..... ..........
 00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................

As far as I can see, it's the location of system_id stored in XLogLongPageHeaderData.

That is, the test fails when one of the two safekeepers has a zero system_id while the other has non-zero. When both system_ids are zero, this goes unnoticed.

Really, with the following addition for the test:

--- a/test_runner/regress/test_wal_acceptor.py
+++ b/test_runner/regress/test_wal_acceptor.py
@@ -1249,6 +1249,13 @@ def cmp_sk_wal(sks: list[Safekeeper], tenant_id: TenantId, timeline_id: Timeline
             term_flush_lsns[0] == tfl
         ), f"(last_log_term, flush_lsn) are not equal on sks {sks[0].id} and {sk.id}: {term_flush_lsns[0]} != {tfl}"
 
+    sk0 = sks[0]
+    for seg in sk0.list_segments(tenant_id, timeline_id):
+        log.info(f"seg: {seg}")
+        f1 =  sk0.timeline_dir(tenant_id, timeline_id) / seg
+        res = subprocess.run(f"xxd {f1} | grep '00000010: 0000 0000 0000 0000 0000 0000 0000 0000'", shell=True)
+        assert (res.returncode != 0), "safekeeper's segment contains zero system_id"
+

we can see it immediately failing as below:

POETRY_VIRTUALENVS_PATH=virtualenvs DEFAULT_PG_VERSION=17 BUILD_TYPE=release poetry run pytest test_runner/regress/test_wal_acceptor.py::test_peer_recovery -- --preserve-database-files
...
>           assert (res.returncode != 0), "safekeeper's segment contains zero system_id"
E           AssertionError: safekeeper's segment contains zero system_id
E           assert 0 != 0
E            +  where 0 = CompletedProcess(args="xxd /t/neon/test_output/test_peer_recovery[release-pg17]/repo/safekeepers/sk1/6679cc263df24fd6556327a5996fd1f3/05f34240206e62a0bbd96c1b39b61f48/000000010000000000000001 | grep '00000010: 0000 0000 0000 0000 0000 0000 0000 0000'", returncode=0).returncode

xxd test_output/test_peer_recovery[release-pg17]/repo/safekeepers/sk1/6679cc263df24fd6556327a5996fd1f3/05f34240206e62a0bbd96c1b39b61f48/000000010000000000000001 | head -2
00000000: 16d1 0200 0100 0000 0000 0001 0000 0000  ................
00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................

This means that a correct system_id is stored in a safekeeper's segment quite rarely, only when handle_greeting() with non-zero msg.system_id called before PhysicalStorage::new() (and before initialize_first_segment()).

@alexanderlaw alexanderlaw added the t/bug Issue Type: Bug label Jan 30, 2025
@arpad-m arpad-m added the c/storage/safekeeper Component: storage: safekeeper label Jan 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/test/flaky/investigated a/test/flaky Area: related to flaky tests c/storage/safekeeper Component: storage: safekeeper t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

2 participants