KAFKA-18981: Fix flaky test QuorumControllerTest.testMinIsrUpdateWithElr #20645
[KAFKA-18981] Deflake testMinIsrUpdateWithElr by maintaining broker heartbeats during waits
What
This PR removes nondeterminism in testMinIsrUpdateWithElr by ensuring broker sessions do not expire while the test is waiting or sleeping, using a lightweight background heartbeat pumper.
JIRA
KAFKA-18981
Reproduction of flakiness
The test is stable until you insert a sleep equal to the controller session timeout immediately after topic creation. With the original code:
In testMinIsrUpdateWithElr, add the line right after:
Insert:
Run the test loop a few times:
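Why this flakes
A wait of roughly sessionTimeoutMillis with no heartbeats is long enough for the broker that should stay unfenced (broker 1) to lapse its session and get fenced transiently.
How (the fix)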
Add a minimal background heartbeat pumper inside the test:
The pumper sends heartbeats every periodMs = max(50ms, sessionTimeoutMillis / 3) to guarantee multiple heartbeats per timeout window.
Brokers 2 and 3 are removed from brokersToKeepUnfenced right before the waitForCondition that fences brokers 2 and 3.
This preserves the test's intent (only broker 1 remains unfenced; brokers 2 and 3 fence) while eliminating timing races introduced by sleeps/awaits.
Verification of stability
With the heartbeat pumper in place and the flakiness sleep enabled, the loop below completed without failures:
Note: I ran the loop 50 times (versus 5 when reproducing the flakiness) to increase confidence in the fix.
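For illustration, the background heartbeat pumper described under How (the fix) might look roughly like the sketch below. This is not the code from the PR: the class name HeartbeatPumper, the stopHeartbeating method, and the sendHeartbeat callback are hypothetical stand-ins (the real test would send heartbeats through the controller under test). Only the period formula and the brokersToKeepUnfenced name come from the description above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;

/** Hypothetical test helper: periodically heartbeats a mutable set of brokers. */
final class HeartbeatPumper implements AutoCloseable {
    // Brokers that should keep receiving heartbeats; safe to mutate from the test thread.
    private final Set<Integer> brokersToKeepUnfenced = ConcurrentHashMap.newKeySet();

    // Single daemon thread so a forgotten close() cannot keep the JVM alive.
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "heartbeat-pumper");
            t.setDaemon(true);
            return t;
        });

    /** periodMs = max(50ms, sessionTimeoutMillis / 3), as stated in the description. */
    static long periodMs(long sessionTimeoutMillis) {
        return Math.max(50L, sessionTimeoutMillis / 3);
    }

    /**
     * @param sendHeartbeat assumed callback that sends one heartbeat for a broker id;
     *                      in a real test this would call into the controller.
     */
    HeartbeatPumper(long sessionTimeoutMillis,
                    Set<Integer> initialBrokers,
                    IntConsumer sendHeartbeat) {
        brokersToKeepUnfenced.addAll(initialBrokers);
        scheduler.scheduleAtFixedRate(() -> {
            for (int brokerId : brokersToKeepUnfenced) {
                try {
                    sendHeartbeat.accept(brokerId);
                } catch (RuntimeException e) {
                    // Swallow: an uncaught exception would cancel all future runs.
                }
            }
        }, 0L, periodMs(sessionTimeoutMillis), TimeUnit.MILLISECONDS);
    }

    /** Stop heartbeating a broker so its session can expire and it gets fenced. */
    void stopHeartbeating(int brokerId) {
        brokersToKeepUnfenced.remove(brokerId);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

In a test, such a helper would typically be opened in a try-with-resources block around the waits, with stopHeartbeating(2) and stopHeartbeating(3) called just before the wait that expects brokers 2 and 3 to be fenced. Dividing the timeout by three means a single delayed heartbeat still leaves room for another before the session expires, and the 50 ms floor avoids a busy loop when the timeout is very small.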
Scope