test_peer_recovery is flaky because of unreliable writing of system_id in safekeeper's WAL #10596
Labels: a/test/flaky/investigated, a/test/flaky (Area: related to flaky tests), c/storage/safekeeper (Component: storage: safekeeper), t/bug (Issue Type: Bug)
While investigating a recent failure of test_peer_recovery:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10559/13030789349/index.html#/testresult/f73891f617c521e9
I reproduced it locally (within 100 iterations, with 10 test instances running in parallel) and discovered that the difference between the sk1 and sk2 WAL is always at offset 0x18, e.g.:
or:
As far as I can see, it's the location of system_id stored in XLogLongPageHeaderData.
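To illustrate why 0x18 points at system_id, here is a minimal sketch that mirrors the PostgreSQL page header layout with Python's `struct` module. It assumes a little-endian platform with 8-byte alignment (so the short header is padded to 24 bytes); the helper name `read_system_id` is hypothetical and not part of the test suite.

```python
import struct

# XLogPageHeaderData: xlp_magic (uint16), xlp_info (uint16), xlp_tli (uint32),
# xlp_pageaddr (uint64), xlp_rem_len (uint32), plus 4 bytes of alignment padding.
# Assumption: little-endian, 8-byte alignment.
SHORT_HEADER = struct.Struct("<HHIQI4x")

# XLogLongPageHeaderData appends: xlp_sysid (uint64), xlp_seg_size (uint32),
# xlp_xlog_blcksz (uint32).
LONG_HEADER_TAIL = struct.Struct("<QII")

# xlp_sysid starts right after the padded short header.
SYSID_OFFSET = SHORT_HEADER.size


def read_system_id(segment_bytes: bytes) -> int:
    """Return xlp_sysid from the first page of a WAL segment (hypothetical helper)."""
    sysid, _seg_size, _blcksz = LONG_HEADER_TAIL.unpack_from(segment_bytes, SYSID_OFFSET)
    return sysid


if __name__ == "__main__":
    print(hex(SYSID_OFFSET))
```

Applied to the first bytes of the same segment on sk1 and sk2, such a helper would show whether the mismatch is a zero versus non-zero system_id.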
That is, the test fails when one of the two safekeepers has a zero system_id while the other has non-zero. When both system_ids are zero, this goes unnoticed.
Indeed, with the following addition to the test:
we can see it immediately failing as below:
This means that a correct system_id is stored in a safekeeper's segment quite rarely: only when handle_greeting() with a non-zero msg.system_id is called before PhysicalStorage::new() (and therefore before initialize_first_segment()).
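The ordering dependency can be pictured with a toy model. This is not the safekeeper's code: `ToySafekeeper` and `run` are hypothetical, and the model assumes that the first segment header is written once with whatever system_id is known at that moment and is not rewritten later.

```python
class ToySafekeeper:
    """Toy model only; method names echo the issue, the logic is an assumption."""

    def __init__(self) -> None:
        self.known_system_id = 0        # 0 means "not learned yet"
        self.segment_system_id = None   # value written into the first segment header

    def handle_greeting(self, system_id: int) -> None:
        # A non-zero system_id from the greeting becomes known to the safekeeper.
        if system_id != 0:
            self.known_system_id = system_id

    def initialize_first_segment(self) -> None:
        # Assumption: the header is written once and never patched afterwards.
        if self.segment_system_id is None:
            self.segment_system_id = self.known_system_id


def run(order, system_id=0x1234):
    """Apply the steps in the given order and return the system_id in the segment."""
    sk = ToySafekeeper()
    for step in order:
        if step == "greeting":
            sk.handle_greeting(system_id)
        else:
            sk.initialize_first_segment()
    return sk.segment_system_id


if __name__ == "__main__":
    print(hex(run(["greeting", "init"])))  # greeting arrives first
    print(hex(run(["init", "greeting"])))  # segment initialized first
```

In this model only the "greeting first" order leaves a non-zero system_id in the segment, which matches the behaviour described above.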