Fix Paxos proposer busy loop, livelock and make tests deterministic #23

Open
saiashok0981 wants to merge 1 commit into BiniWorld:main from saiashok0981:paxos-livelock-fix

@saiashok0981

Issue 1: Livelock, CPU Busy Loop and Infinite Proposing in Paxos Implementation
Issue Description
Title: Bug: Proposer enters CPU busy loop on prepare rejection and lacks termination mechanism

Description: The current Paxos Proposer implementation in candidates/siddhantprateek/paxos/main.go has multiple issues causing high CPU utilization, infinite execution, and test flakiness:

Busy Loop on Rejection: If the proposer fails to obtain a prepare quorum (i.e. prepareResponsesCount < quorum), the loop immediately runs again with the same proposalNum. Because the proposal number is not incremented and there is no backoff sleep, the proposer spins in a tight loop forever, consuming 100% of a CPU core.
Infinite Proposing: Once consensus is reached and learners are notified of the decided value, the Proposer keeps looping and proposing higher numbers indefinitely without ever terminating.
Flaky and Arbitrary Test Sleep: The tests in candidates/siddhantprateek/paxos/main_test.go rely on arbitrary time.Sleep calls (e.g. 1 second and 100 milliseconds) for coordination, which is brittle, slow, and prone to flakiness on resource-constrained runners.
Proposer Stuck at proposalNum = 0: In TestPaxosProposer, the proposer is initialized with proposalNum = 0. Since the acceptor's promisedNum also starts at 0 (the Go zero value for the field), the check n > a.promisedNum (0 > 0) is false, so the first prepare is rejected and the proposer never gets past that step.
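
To make the proposalNum = 0 case concrete, here is a minimal sketch of the comparison described above. The `acceptor` type and `prepare` method are hypothetical stand-ins, not the code in main.go; only the `promisedNum` field and the `n > a.promisedNum` check are taken from the issue text.

```go
package main

import "fmt"

// acceptor is a hypothetical stand-in for the acceptor in main.go.
// Only promisedNum and the strict comparison come from the issue text.
type acceptor struct {
	promisedNum int // zero value is 0 when the struct is created
}

// prepare grants a promise only for a strictly higher proposal number.
func (a *acceptor) prepare(n int) bool {
	if n > a.promisedNum {
		a.promisedNum = n
		return true
	}
	return false
}

func main() {
	fresh := &acceptor{}
	fmt.Println(fresh.prepare(0)) // false: 0 > 0 does not hold
	fmt.Println(fresh.prepare(1)) // true: 1 > 0 holds
}
```

Under these assumptions, a proposer that starts at 0 and never increments can never be promised anything by a fresh acceptor.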
Expected Behavior:

Proposers should back off briefly (e.g., sleep 10ms) when failing to get a quorum or failing the accept phase to prevent CPU busy loops.
Proposers should increment their proposal number on prepare failure to ensure progress in subsequent rounds.
Proposers should terminate once consensus has been decided.
Tests should use event-driven channels (decided channel in Learners) for deterministic and fast execution.
Proposers in tests should start with proposalNum >= 1.
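
The expected behavior above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patched main.go: the `acceptor`, `proposer`, and `propose` names are hypothetical, the accept phase is omitted, and the backoff uses a `select` on `time.After` (instead of a plain `time.Sleep`) so that a stop signal is noticed during the wait.

```go
package main

import (
	"fmt"
	"time"
)

// acceptor and proposer are hypothetical types for illustration only.
type acceptor struct{ promisedNum int }

func (a *acceptor) prepare(n int) bool {
	if n > a.promisedNum {
		a.promisedNum = n
		return true
	}
	return false
}

type proposer struct {
	proposalNum int
	acceptors   []*acceptor
	stop        chan struct{} // closed to ask the proposer to give up
}

// propose retries the prepare phase until a quorum promises or stop is
// closed. It returns the proposal number that won the quorum.
func (p *proposer) propose() (int, bool) {
	quorum := len(p.acceptors)/2 + 1
	for {
		promises := 0
		for _, a := range p.acceptors {
			if a.prepare(p.proposalNum) {
				promises++
			}
		}
		if promises >= quorum {
			return p.proposalNum, true // terminate instead of looping on
		}
		p.proposalNum++ // use a higher number in the next round
		select {
		case <-p.stop:
			return 0, false
		case <-time.After(10 * time.Millisecond): // backoff, no busy loop
		}
	}
}

func main() {
	acceptors := []*acceptor{{promisedNum: 3}, {promisedNum: 3}, {promisedNum: 0}}
	p := &proposer{proposalNum: 1, acceptors: acceptors, stop: make(chan struct{})}
	n, ok := p.propose()
	fmt.Println(n, ok) // prints: 4 true
}
```

In this run, rounds 1 through 3 each collect a single promise, so the proposer backs off and increments; round 4 exceeds every acceptor's promisedNum and the loop returns.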
Pull Request Description
Title: Fix Paxos proposer busy loop, livelock, and make tests deterministic

Branch: paxos-livelock-fix

Description of Changes:

Termination & Backoff: Added a stop channel to Proposer and implemented Stop(). Added time.Sleep(10 * time.Millisecond) backoffs in the Propose() loop when failing to reach prepare/accept quorums.
Progress Guarantee: Incremented the proposalNum when the prepare phase fails to obtain a quorum, ensuring subsequent attempts use a higher number.
Consensus Termination: Exited the Propose() loop gracefully as soon as consensus is reached and all learners have received the decided value.
Deterministic Testing: Added a decided channel to Learner to signal when consensus has been achieved. Replaced all brittle time.Sleep calls in main_test.go with fast, event-driven channel select statements.
Fixed Proposer Initialization: Updated proposer instantiation in tests to start with proposalNum >= 1 so that the acceptor's promisedNum is exceeded on the first prepare call.
