polygon/sync: add post-syncToTip catch-up mode for event loop lag recovery #120
Open
madumas wants to merge 1 commit into 0xPolygon:release/3.2
…overy

When the Polygon sync event loop falls behind (tip age > 30s), break out and re-enter syncToTip, which uses efficient waypoint-based batch downloading. This prevents a death spiral where the event loop processes blocks one at a time (~2s/block) while new blocks arrive every 2s, causing the node to fall progressively further behind.

The fix refactors Run() to loop: syncToTip -> initialiseCcb -> runEventLoop, where runEventLoop returns needsCatchUp=true when lastTipAge exceeds 30s after processing block events (not milestones, which are finality metadata).

Fixes 0xPolygon#116, 0xPolygon#112
Fixes #116, #112
The Polygon sync event loop processes blocks one at a time via p2p tip events. On bor-mainnet with its 2-second block time, this creates a harmful feedback loop: when a forkchoice cycle takes slightly longer than 2s (due to execution, DB commit, or Heimdall overhead), the node falls behind by one block. The next cycle now has to process that extra block, making it even slower, which accumulates more blocks, becoming a death spiral that can leave the node thousands of blocks behind within hours.
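To get a feel for how quickly the backlog grows, here is a rough back-of-the-envelope sketch. It assumes the simplest possible model: a constant per-block processing cost and a fixed 2s block time. The function names and the 2.5s per-block cost are hypothetical and chosen only for illustration; the compounding effect described above would make real drift worse than this linear estimate.

```go
package main

import "fmt"

// blocksBehind estimates how many blocks a node lags after elapsedSec seconds
// when a new block arrives every blockTimeSec seconds but each block takes
// perBlockSec seconds to process one at a time. Hypothetical model, not node code.
func blocksBehind(elapsedSec, blockTimeSec, perBlockSec float64) float64 {
	produced := elapsedSec / blockTimeSec
	processed := elapsedSec / perBlockSec
	if processed >= produced {
		return 0 // the node keeps up with the chain
	}
	return produced - processed
}

// tipAgeSec converts that backlog into an approximate tip age in seconds.
func tipAgeSec(elapsedSec, blockTimeSec, perBlockSec float64) float64 {
	return blocksBehind(elapsedSec, blockTimeSec, perBlockSec) * blockTimeSec
}

func main() {
	const blockTime = 2.0 // bor-mainnet block time from the description
	const perBlock = 2.5  // assumed per-block cost, for illustration only
	for _, hours := range []float64{1, 6, 12} {
		lag := blocksBehind(hours*3600, blockTime, perBlock)
		fmt.Printf("after %2.0fh: ~%.0f blocks behind\n", hours, lag)
	}
	fmt.Printf("tip age after 150s: ~%.0fs\n", tipAgeSec(150, blockTime, perBlock))
}
```

Under these assumptions the node loses about 360 blocks per hour (1800 produced, 1440 processed), and the tip age reaches 30s after roughly 150 seconds of sustained lag.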
The current architecture has no recovery mechanism once in this state: the event loop keeps processing blocks one by one, while syncToTip (which uses efficient waypoint-based batch downloading) is never re-entered.
The fix
Refactor Run() into an outer catch-up loop: syncToTip → initialiseCcb → runEventLoop → re-enter if behind.
- Track lastTipAge (time since tip block timestamp) in commitExecution().
- When lastTipAge > 30s after a block event, runEventLoop returns needsCatchUp=true, breaking back to syncToTip, which can process hundreds of blocks per cycle via waypoint batching.
Design decisions
Age check on block events only, not milestones. Milestones are finality metadata that can arrive in bursts. Checking lastTipAge after a milestone could trigger a false catch-up while the node is actually at the tip.

No tight-loop risk. Each re-entry goes through syncToTip → initialiseCcb → runEventLoop. The node must actually drift 30s behind before triggering again, preventing thrashing between modes.

initialCycle stays false on re-entries. Setting it to true would activate aggressive pruning and other first-boot-only code paths. The trade-off is conservative pruning during catch-ups, which is acceptable for short recovery windows.

Fresh canonical chain builder on each re-entry. initialiseCcb is called after every syncToTip, so we never carry stale CCB state across catch-up boundaries.

30s threshold accounts for span rotations. Every ~128 blocks (~256s), the producer set update adds ~12s of overhead. The threshold is set well above this transient spike to avoid spurious triggers.
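The control flow described under "The fix", together with the block-events-only age check, can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in polygon/sync/sync.go: the syncer struct, the event type, the function-valued hooks, and demo are all hypothetical stand-ins, and the tip age is passed in with each event rather than computed in commitExecution().

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// catchUpThreshold mirrors the 30s tip-age limit from the description.
const catchUpThreshold = 30 * time.Second

type eventKind int

const (
	blockEvent eventKind = iota
	milestoneEvent
)

// event is a hypothetical stand-in for the sync package's tip events.
type event struct {
	kind   eventKind
	tipAge time.Duration // tip age observed after handling this event
}

// syncer uses hypothetical hooks in place of the real Sync methods.
type syncer struct {
	syncToTip     func(ctx context.Context) error
	initialiseCcb func(ctx context.Context) error
	events        <-chan event
}

// runEventLoop handles events one at a time and reports needsCatchUp=true
// once a block event leaves the tip older than catchUpThreshold.
func (s *syncer) runEventLoop(ctx context.Context) (needsCatchUp bool, err error) {
	for {
		select {
		case <-ctx.Done():
			return false, ctx.Err()
		case ev, ok := <-s.events:
			if !ok {
				return false, nil // event source closed, stop cleanly
			}
			// Milestones are skipped: only block events can trigger catch-up.
			if ev.kind == blockEvent && ev.tipAge > catchUpThreshold {
				return true, nil
			}
		}
	}
}

// Run is the outer catch-up loop: batch sync, rebuild state, then follow the tip.
func (s *syncer) Run(ctx context.Context) error {
	for {
		if err := s.syncToTip(ctx); err != nil {
			return err
		}
		if err := s.initialiseCcb(ctx); err != nil {
			return err
		}
		needsCatchUp, err := s.runEventLoop(ctx)
		if err != nil || !needsCatchUp {
			return err
		}
	}
}

// demo feeds three events and returns how often each hook ran.
func demo() (syncs, ccbInits int, err error) {
	events := make(chan event, 3)
	events <- event{kind: milestoneEvent, tipAge: 45 * time.Second} // ignored
	events <- event{kind: blockEvent, tipAge: 5 * time.Second}      // at tip
	events <- event{kind: blockEvent, tipAge: 31 * time.Second}     // lagging
	close(events)
	s := &syncer{
		syncToTip:     func(context.Context) error { syncs++; return nil },
		initialiseCcb: func(context.Context) error { ccbInits++; return nil },
		events:        events,
	}
	err = s.Run(context.Background())
	return syncs, ccbInits, err
}

func main() {
	syncs, ccbInits, err := demo()
	fmt.Println(syncs, ccbInits, err)
}
```

In this sketch the stale milestone is ignored, the second block event triggers exactly one re-entry, and both hooks run twice, so the canonical chain builder is rebuilt after every batch sync.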
Production data
Tested on bor-mainnet nodes running v3.3.7:
Changes
- polygon/sync/sync.go: extract runEventLoop() method, add outer catch-up loop in Run(), track lastTipAge
- polygon/sync/sync_test.go: unit tests for tip age tracking and threshold constant