Fix Paxos proposer busy loop, livelock and make tests deterministic #23
Open
saiashok0981 wants to merge 1 commit into
Issue 1: Livelock, CPU Busy Loop and Infinite Proposing in Paxos Implementation
Issue Description
Title: Bug: Proposer enters CPU busy loop on prepare rejection and lacks termination mechanism
Description: The current Paxos Proposer implementation in candidates/siddhantprateek/paxos/main.go has multiple issues causing high CPU utilization, infinite execution, and test flakiness:
Busy Loop on Rejection: If the proposer fails to obtain a prepare quorum (e.g. prepareResponsesCount < quorum), the loop immediately runs again with the same proposalNum. Because the proposal number is not incremented and there is no backoff sleep, the proposer tightly loops forever, consuming 100% of a CPU core.
Infinite Proposing: Once consensus is reached and the learners have been notified of the decided value, the Proposer keeps looping and issuing proposals with ever higher numbers instead of terminating.
Flaky and Arbitrary Test Sleep: The tests in candidates/siddhantprateek/paxos/main_test.go rely on arbitrary time.Sleep calls (e.g. 1 second and 100 milliseconds) for coordination, which is brittle, slow, and prone to flakiness under resource-constrained runners.
Proposer Stuck at proposalNum = 0: In TestPaxosProposer, the proposer is initialized with proposalNum = 0. The acceptor's promisedNum is also 0 (the zero value of the struct field), so the check n > a.promisedNum evaluates 0 > 0, which is false. Every prepare request is rejected and the proposer never gets past the prepare phase.
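The proposalNum = 0 case can be reproduced in isolation. The following is a minimal sketch, not the code from main.go: the Acceptor type and its Prepare method are hypothetical stand-ins, and only the n > a.promisedNum comparison is taken from the description above.

```go
package main

import "fmt"

// Acceptor is a hypothetical stand-in for the acceptor in main.go.
// Only the promisedNum field and the strict comparison come from the issue text.
type Acceptor struct {
	promisedNum int // zero value is 0, i.e. nothing promised yet
}

// Prepare promises n only if it is strictly greater than the highest
// proposal number promised so far.
func (a *Acceptor) Prepare(n int) bool {
	if n > a.promisedNum {
		a.promisedNum = n
		return true
	}
	return false
}

func main() {
	a := &Acceptor{}
	fmt.Println(a.Prepare(0)) // false: 0 > 0 does not hold
	fmt.Println(a.Prepare(1)) // true: 1 > 0
}
```

With a strict comparison, a proposer that starts at 0 and never increments can never collect a promise, which matches the stuck behavior described above.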
Expected Behavior:
Proposers should back off briefly (e.g., sleep 10ms) when failing to get a quorum or failing the accept phase to prevent CPU busy loops.
Proposers should increment their proposal number on prepare failure to ensure progress in subsequent rounds.
Proposers should terminate once consensus has been decided.
Tests should use event-driven channels (decided channel in Learners) for deterministic and fast execution.
Proposers in tests should start with proposalNum >= 1.
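The expected proposer behavior can be sketched as a retry loop. This is an illustrative sketch under assumptions, not the PR's implementation: the Acceptor and Proposer types, their fields, and the return values of Propose are hypothetical, and the accept phase and learner notification are collapsed into the return statement. Only the 10ms backoff, the increment on prepare failure, the stop channel, and termination on success come from the description.

```go
package main

import (
	"fmt"
	"time"
)

// Acceptor is a hypothetical stand-in for the real acceptor.
type Acceptor struct{ promisedNum int }

// Prepare promises n only if it exceeds the highest number promised so far.
func (a *Acceptor) Prepare(n int) bool {
	if n > a.promisedNum {
		a.promisedNum = n
		return true
	}
	return false
}

// Proposer holds only the fields this sketch needs (assumed names).
type Proposer struct {
	proposalNum int
	acceptors   []*Acceptor
	stop        chan struct{}
}

// Stop asks a running Propose loop to exit at its next iteration.
func (p *Proposer) Stop() { close(p.stop) }

// Propose retries the prepare phase until it wins a quorum or is stopped.
// It returns the winning proposal number and true, or 0 and false if stopped.
func (p *Proposer) Propose() (int, bool) {
	quorum := len(p.acceptors)/2 + 1
	for {
		select {
		case <-p.stop:
			return 0, false
		default:
		}
		promises := 0
		for _, a := range p.acceptors {
			if a.Prepare(p.proposalNum) {
				promises++
			}
		}
		if promises >= quorum {
			// Stand-in for the accept phase and learner notification:
			// the loop ends once the round succeeds.
			return p.proposalNum, true
		}
		p.proposalNum++                   // use a higher number next round
		time.Sleep(10 * time.Millisecond) // back off instead of spinning
	}
}

func main() {
	acceptors := []*Acceptor{{promisedNum: 3}, {promisedNum: 3}, {promisedNum: 3}}
	p := &Proposer{proposalNum: 1, acceptors: acceptors, stop: make(chan struct{})}
	n, ok := p.Propose()
	fmt.Println(n, ok) // 4 true: rounds 1, 2 and 3 are rejected, round 4 wins
}
```

A plain increment is enough for a single proposer. With several competing proposers, proposal numbers would also need to be unique per proposer, which this sketch does not address.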
Pull Request Description
Title: Fix Paxos proposer busy loop, livelock, and make tests deterministic
Branch: paxos-livelock-fix
Description of Changes:
Termination & Backoff: Added a stop channel to Proposer and implemented Stop(). Added time.Sleep(10 * time.Millisecond) backoffs in the Propose() loop when failing to reach prepare/accept quorums.
Progress Guarantee: Incremented the proposalNum when the prepare phase fails to obtain a quorum, ensuring subsequent attempts use a higher number.
Consensus Termination: Exited the Propose() loop gracefully as soon as consensus is reached and all learners have received the decided value.
Deterministic Testing: Added a decided channel to Learner to signal when consensus has been achieved. Replaced all brittle time.Sleep calls in main_test.go with fast, event-driven channel select statements.
Fixed Proposer Initialization: Updated proposer instantiation in tests to start with proposalNum >= 1 so that the acceptor's promisedNum is exceeded on the first prepare call.
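The event-driven waiting pattern that replaces time.Sleep might look roughly like the sketch below. The Learner type, NewLearner, Learn, and waitDecided are hypothetical names chosen for illustration. Only the idea of a decided channel on the learner and a channel select in the test comes from the description, and the timeout branch is an assumption added so that a failing run cannot hang.

```go
package main

import (
	"fmt"
	"time"
)

// Learner is a hypothetical stand-in; only the decided channel matters here.
type Learner struct {
	decided chan string
}

// NewLearner creates a learner with a buffered channel so that the first
// notification never blocks the sender.
func NewLearner() *Learner {
	return &Learner{decided: make(chan string, 1)}
}

// Learn signals the decided value; later duplicate notifications are dropped.
func (l *Learner) Learn(v string) {
	select {
	case l.decided <- v:
	default: // a value has already been signalled
	}
}

// waitDecided blocks until the learner reports a value or the timeout expires.
func waitDecided(l *Learner, timeout time.Duration) (string, bool) {
	select {
	case v := <-l.decided:
		return v, true
	case <-time.After(timeout):
		return "", false
	}
}

func main() {
	l := NewLearner()
	go l.Learn("value-A") // stands in for the proposer notifying the learner
	v, ok := waitDecided(l, time.Second)
	fmt.Println(v, ok)
}
```

The caller returns as soon as the value arrives, so the timeout only bounds the failure case rather than setting the running time of every test.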