Skip to content

Conversation

Anshul-creator
Copy link

@Anshul-creator Anshul-creator commented Oct 6, 2025

[KAFKA-18981] Deflake testMinIsrUpdateWithElr by maintaining broker heartbeats during waits

What

This PR removes nondeterminism in testMinIsrUpdateWithElr by ensuring broker sessions do not expire while the test is waiting/sleeping, through a lightweight background heartbeat pumper.

JIRA

KAFKA-18981

Reproduction of flakiness

The test is stable until you insert a sleep equal to the controller session timeout immediately after topic creation. With the original code:

  1. In testMinIsrUpdateWithElr, add the line right after:

    CreateTopicsResponseData createTopicsResponseData = active.createTopics(
        ANONYMOUS_CONTEXT, createTopicsRequestData, Set.of("foo", "bar")).get();

    Insert:

    Thread.sleep(sessionTimeoutMillis);
  2. Run the test loop a few times:

    pass=0; fail=0; for i in {1..5}; do \
     echo "=== Run $i ==="; \
     ./gradlew :metadata:test \
     --tests 'org.apache.kafka.controller.QuorumControllerTest' \
     -Pkafka.test.run.flaky=true \
     -PmaxParallelForks=1 -Dorg.gradle.workers.max=1 -Dtest.forkEvery=1 \
     --rerun-tasks --no-daemon --no-build-cache && pass=$((pass+1)) || fail=$((fail+1)); \
    done; echo "Summary: $pass pass / $fail fail"

Why this flakes

  • The controller fences brokers that miss heartbeats for sessionTimeoutMillis.
  • After initially unfencing all brokers, the test performs a sleep/wait before beginning the fencing phase.
  • That pause allows the survivor broker (id 1) to lapse its session and get fenced transiently.
  • Subsequent ISR/ELR assertions sometimes observe this unintended state, causing nondeterministic failures.

How (the fix)

Add a minimal background heartbeat pumper inside the test:

  • Schedule rate: periodMs = max(50ms, sessionTimeoutMillis / 3) to guarantee multiple heartbeats per timeout window.
  • Phase 1 (stabilize): heartbeat all brokers while creating topics and applying the initial broker-level config.
  • Phase 2 (fencing): set a flag to heartbeat only brokersToKeepUnfenced right before waitForCondition that fences brokers 2 and 3.

This preserves the test’s intent (only broker 1 remains unfenced; brokers 2,3 fence) while eliminating timing races introduced by sleeps/awaits.

Verification of stability

With the heartbeat pumper in place and the flakiness sleep enabled, the loop below completed without failures:

pass=0; fail=0; for i in {1..50}; do \
 echo "=== Run $i ==="; \
 ./gradlew :metadata:test \
 --tests 'org.apache.kafka.controller.QuorumControllerTest' \
 -Pkafka.test.run.flaky=true \
 -PmaxParallelForks=1 -Dorg.gradle.workers.max=1 -Dtest.forkEvery=1 \
 --rerun-tasks --no-daemon --no-build-cache && pass=$((pass+1)) || fail=$((fail+1)); \
done; echo "Summary: $pass pass / $fail fail"

Note: I ran the loop 50 times (compared to the 5 times in reproducing the flakiness case) to increase trust in the fix

Scope

  • Test-only change.
  • No production behavior/APIs altered.
  • Keeps existing assertions and ISR/ELR expectations intact.

@github-actions github-actions bot added triage PRs from the community tests Test fixes (including flaky tests) kraft labels Oct 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kraft tests Test fixes (including flaky tests) triage PRs from the community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant