test_peer_recovery is flaky because of unreliable writing of system_id in safekeeper's WAL #10596

alexanderlaw · 2025-01-30T20:01:52Z

While investigating a recent failure of test_peer_recovery:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10559/13030789349/index.html#/testresult/f73891f617c521e9

test_runner/regress/test_wal_acceptor.py:1472: in test_peer_recovery
    cmp_sk_wal([sk1, sk2], tenant_id, timeline_id)
test_runner/regress/test_wal_acceptor.py:1284: in cmp_sk_wal
    assert (mismatch, not_regular) == (
E   AssertionError: WAL segs /tmp/test_output/test_peer_recovery[release-pg15]-1/repo/safekeepers/sk1/42fdc4b6fbf4334a922994b0a6c92297/dcffdf8a5cfeaada20697ff12e7eed5c/000000010000000000000001 and /tmp/test_output/test_peer_recovery[release-pg15]-1/repo/safekeepers/sk2/42fdc4b6fbf4334a922994b0a6c92297/dcffdf8a5cfeaada20697ff12e7eed5c/000000010000000000000001 on sks 1 and 2 are not identic
E   assert (['0000000100...0000001'], []) == ([], [])
E     At index 0 diff: ['000000010000000000000001'] != []
E     Full diff:
E     - ([], [])
E     + (['000000010000000000000001'], [])

I reproduced it locally (within 100 iterations with 10 test instances running in parallel) and discovered, that the difference between sk1 and sk2 WAL is always at offset 0x18, e. g.:

--- .../test_output/test_peer_recovery_4[release-pg15]/repo/safekeepers/sk1/9035d8c94ba074867572ea7258046648/8ae4021ce4d1668ddc8aed7dd6b94b7f/000000010000000000000001.hex   2025-01-30 14:21:25.466694628 +0200
+++ .../test_output/test_peer_recovery_4[release-pg15]/repo/safekeepers/sk2/9035d8c94ba074867572ea7258046648/8ae4021ce4d1668ddc8aed7dd6b94b7f/000000010000000000000001.hex   2025-01-30 14:21:25.758689647 +0200
@@ -1,5 +1,5 @@
 00000000: 10d1 0200 0100 0000 0000 0001 0000 0000  ................
-00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
+00000010: 0000 0000 0000 0000 7516 6e5b bd6e 9b67  ........u.n[.n.g
 00000020: 0000 0001 0020 0000 0000 0000 0000 0000  ..... ..........
 00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................

or:

--- .../test_output/test_peer_recovery_4[release-pg15]/repo/safekeepers/sk1/639027be5aa883cf798993a9e984b01b/c30980843ada44c730bacd0fa6af834e/000000010000000000000001.hex   2025-01-30 15:37:00.663555859 +0200
+++ .../test_output/test_peer_recovery_4[release-pg15]/repo/safekeepers/sk2/639027be5aa883cf798993a9e984b01b/c30980843ada44c730bacd0fa6af834e/000000010000000000000001.hex   2025-01-30 15:37:00.952550954 +0200
@@ -1,5 +1,5 @@
 00000000: 10d1 0200 0100 0000 0000 0001 0000 0000  ................
-00000010: 0000 0000 0000 0000 ebf8 b5c2 7580 9b67  ............u..g
+00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000020: 0000 0001 0020 0000 0000 0000 0000 0000  ..... ..........
 00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................

As far as I can see, it's the location of system_id stored in XLogLongPageHeaderData.

That is, the test fails when one of the two safekeepers has a zero system_id while the other has non-zero. When both system_ids are zero, this goes unnoticed.

Really, with the following addition for the test:

--- a/test_runner/regress/test_wal_acceptor.py
+++ b/test_runner/regress/test_wal_acceptor.py
@@ -1249,6 +1249,13 @@ def cmp_sk_wal(sks: list[Safekeeper], tenant_id: TenantId, timeline_id: Timeline
             term_flush_lsns[0] == tfl
         ), f"(last_log_term, flush_lsn) are not equal on sks {sks[0].id} and {sk.id}: {term_flush_lsns[0]} != {tfl}"
 
+    sk0 = sks[0]
+    for seg in sk0.list_segments(tenant_id, timeline_id):
+        log.info(f"seg: {seg}")
+        f1 =  sk0.timeline_dir(tenant_id, timeline_id) / seg
+        res = subprocess.run(f"xxd {f1} | grep '00000010: 0000 0000 0000 0000 0000 0000 0000 0000'", shell=True)
+        assert (res.returncode != 0), "safekeeper's segment contains zero system_id"
+

we can see it immediately failing as below:

POETRY_VIRTUALENVS_PATH=virtualenvs DEFAULT_PG_VERSION=17 BUILD_TYPE=release poetry run pytest test_runner/regress/test_wal_acceptor.py::test_peer_recovery -- --preserve-database-files
...
>           assert (res.returncode != 0), "safekeeper's segment contains zero system_id"
E           AssertionError: safekeeper's segment contains zero system_id
E           assert 0 != 0
E            +  where 0 = CompletedProcess(args="xxd /t/neon/test_output/test_peer_recovery[release-pg17]/repo/safekeepers/sk1/6679cc263df24fd6556327a5996fd1f3/05f34240206e62a0bbd96c1b39b61f48/000000010000000000000001 | grep '00000010: 0000 0000 0000 0000 0000 0000 0000 0000'", returncode=0).returncode

xxd test_output/test_peer_recovery[release-pg17]/repo/safekeepers/sk1/6679cc263df24fd6556327a5996fd1f3/05f34240206e62a0bbd96c1b39b61f48/000000010000000000000001 | head -2
00000000: 16d1 0200 0100 0000 0000 0001 0000 0000  ................
00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................

This means that a correct system_id is stored in a safekeeper's segment quite rarely, only when handle_greeting() with non-zero msg.system_id called before PhysicalStorage::new() (and before initialize_first_segment()).

The text was updated successfully, but these errors were encountered:

alexanderlaw added the t/bug Issue Type: Bug label Jan 30, 2025

arpad-m added the c/storage/safekeeper Component: storage: safekeeper label Jan 30, 2025

alexanderlaw added a/test/flaky Area: related to flaky tests a/test/flaky/investigated labels Jan 30, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_peer_recovery is flaky because of unreliable writing of system_id in safekeeper's WAL #10596

test_peer_recovery is flaky because of unreliable writing of system_id in safekeeper's WAL #10596

alexanderlaw commented Jan 30, 2025

test_peer_recovery is flaky because of unreliable writing of system_id in safekeeper's WAL #10596

test_peer_recovery is flaky because of unreliable writing of system_id in safekeeper's WAL #10596

Comments

alexanderlaw commented Jan 30, 2025