polygon/sync: add post-syncToTip catch-up mode for event loop lag recovery #120

Open
madumas wants to merge 1 commit into 0xPolygon:release/3.2 from ellipfra:fix/post-synctotip-catchup-mode

@madumas madumas commented Feb 6, 2026

Fixes #116, #112

The Polygon sync event loop processes blocks one at a time via p2p tip events. On bor-mainnet, with its 2-second block time, this creates a harmful feedback loop: when a forkchoice cycle takes slightly longer than 2s (because of execution, DB commit, or Heimdall overhead), the node falls behind by one block. The next cycle has to process that extra block, which makes it slower still, so more blocks accumulate. The result is a death spiral that can leave the node thousands of blocks behind within hours.

The current architecture has no recovery mechanism once in this state: the event loop keeps processing blocks one-by-one while syncToTip (which uses efficient waypoint-based batch downloading) is never re-entered.
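To make the arithmetic concrete, here is a toy model (not code from this PR) of one-at-a-time processing. It assumes a constant per-block cycle time, which is more optimistic than the behaviour described above, where cycles get slower as the backlog grows. The function name `backlogAfter` and its parameters are hypothetical.

```go
package main

// backlogAfter returns how many blocks a node is behind after elapsedMs
// milliseconds, assuming the chain produces one block every blockTimeMs and
// the node needs a constant cycleMs to process each block.
// Toy model only: the real event loop's cycle time is not constant.
func backlogAfter(elapsedMs, blockTimeMs, cycleMs int64) int64 {
	produced := elapsedMs / blockTimeMs // blocks the chain has produced
	processed := elapsedMs / cycleMs    // blocks the node has processed
	if processed >= produced {
		return 0 // the node keeps up with the tip
	}
	return produced - processed
}
```

With a 2000 ms block time and a 2200 ms cycle, `backlogAfter(3_600_000, 2000, 2200)` gives 164 blocks after one hour. Under this model the backlog never shrinks while the cycle time stays above the block time, which is why a separate batch path is needed to recover.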

The fix

Refactor Run() into an outer catch-up loop: syncToTip → initialiseCcb → runEventLoop → re-enter if behind.

  • Track lastTipAge (time since tip block timestamp) in commitExecution()
  • After processing block events (not milestones, which are finality metadata), check if lastTipAge > 30s
  • If so, runEventLoop returns needsCatchUp=true, breaking back out to syncToTip, which can process hundreds of blocks per cycle via waypoint batching
  • Once caught up, the node re-enters the event loop at the tip
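The control flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in polygon/sync/sync.go: the `syncSteps` interface, the `run` function, the `fakeSteps` type and the `simulate` helper are all hypothetical, and the real method signatures may differ. It also assumes that an event loop returning without `needsCatchUp` ends the run.

```go
package main

import "context"

// syncSteps is a hypothetical stand-in for the three phases named in the
// PR description; the real methods and their signatures may differ.
type syncSteps interface {
	syncToTip(ctx context.Context) error
	initialiseCcb(ctx context.Context) error
	// runEventLoop reports needsCatchUp=true when the tip has aged past
	// the threshold.
	runEventLoop(ctx context.Context) (needsCatchUp bool, err error)
}

// run re-enters syncToTip each time the event loop reports it fell behind.
func run(ctx context.Context, s syncSteps) error {
	for {
		if err := s.syncToTip(ctx); err != nil {
			return err
		}
		// A fresh canonical chain builder after every syncToTip.
		if err := s.initialiseCcb(ctx); err != nil {
			return err
		}
		needsCatchUp, err := s.runEventLoop(ctx)
		if err != nil {
			return err
		}
		if !needsCatchUp {
			return nil // assumption: a clean exit ends the run
		}
	}
}

// fakeSteps counts calls and reports lag a fixed number of times.
type fakeSteps struct {
	catchUpsLeft       int
	syncCalls, ccbCalls int
}

func (f *fakeSteps) syncToTip(context.Context) error     { f.syncCalls++; return nil }
func (f *fakeSteps) initialiseCcb(context.Context) error { f.ccbCalls++; return nil }
func (f *fakeSteps) runEventLoop(context.Context) (bool, error) {
	if f.catchUpsLeft > 0 {
		f.catchUpsLeft--
		return true, nil
	}
	return false, nil
}

// simulate drives run with a fake that falls behind catchUps times.
func simulate(catchUps int) (syncCalls, ccbCalls int, err error) {
	f := &fakeSteps{catchUpsLeft: catchUps}
	err = run(context.Background(), f)
	return f.syncCalls, f.ccbCalls, err
}
```

In this sketch, two catch-ups mean three passes through syncToTip and initialiseCcb: the initial pass plus one per re-entry.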

Design decisions

  • Age check on block events only, not milestones. Milestones are finality metadata that can arrive in bursts. Checking lastTipAge after a milestone could trigger a false catch-up while the node is actually at the tip.

  • No tight-loop risk. Each re-entry goes through syncToTip → initialiseCcb → runEventLoop. The node must actually drift 30s behind before triggering again, preventing thrashing between modes.

  • initialCycle stays false on re-entries. Setting it to true would activate aggressive pruning and other first-boot-only code paths. The trade-off is conservative pruning during catch-ups, which is acceptable for short recovery windows.

  • Fresh canonical chain builder on each re-entry. initialiseCcb is called after every syncToTip, so we never carry stale CCB state across catch-up boundaries.

  • 30s threshold accounts for span rotations. Every ~128 blocks (~256s), the producer set update adds ~12s of overhead. The threshold is set well above this transient spike to avoid spurious triggers.
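The first and last decisions above amount to a small predicate, sketched here. The names `eventKind`, `eventNewBlock`, `eventNewMilestone`, `catchUpThreshold`, `spanRotationOverhead` and `shouldCatchUp` are illustrative assumptions and not necessarily the identifiers used in the PR. The 12s figure is the approximate overhead quoted in the description.

```go
package main

import "time"

// eventKind is a hypothetical tag for the events the loop handles.
type eventKind int

const (
	eventNewBlock eventKind = iota
	eventNewMilestone
)

const (
	// catchUpThreshold is the 30s tip-age limit from the PR description.
	catchUpThreshold = 30 * time.Second
	// spanRotationOverhead is the approximate transient spike quoted in
	// the description (roughly every 128 blocks, about 256s at 2s/block).
	spanRotationOverhead = 12 * time.Second
)

// shouldCatchUp reports whether the event loop should hand control back
// to syncToTip. Only block events are considered; milestones are finality
// metadata and never trigger a catch-up on their own.
func shouldCatchUp(kind eventKind, lastTipAge time.Duration) bool {
	if kind != eventNewBlock {
		return false
	}
	return lastTipAge > catchUpThreshold
}
```

A span-rotation spike of about 12s stays well under the threshold, so `shouldCatchUp(eventNewBlock, spanRotationOverhead)` is false, and a milestone event returns false regardless of tip age.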

Production data

Tested on bor-mainnet nodes running v3.3.7:

  • Before: 28 unrecoverable lag events in 9 hours on a cold node, with age drifting to minutes behind
  • After: catch-up triggers appropriately, recovers via syncToTip in 2-5 minutes, steady-state age 5-7s

Changes

  • polygon/sync/sync.go: extract runEventLoop() method, add outer catch-up loop in Run(), track lastTipAge
  • polygon/sync/sync_test.go: unit tests for tip age tracking and threshold constant
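As a rough idea of what tip age tracking involves, the helper below derives the age from a block timestamp. It is a sketch under assumptions, not the code in the PR: the name `tipAge` is hypothetical, and it assumes the header timestamp is in Unix seconds, as in Ethereum-style headers.

```go
package main

import "time"

// tipAge returns how far the tip block's timestamp lies behind the current
// time. Both arguments are Unix seconds. A header timestamp at or ahead of
// the local clock (clock skew) is clamped to zero rather than underflowing.
func tipAge(nowUnix, headerTime uint64) time.Duration {
	if headerTime >= nowUnix {
		return 0
	}
	return time.Duration(nowUnix-headerTime) * time.Second
}
```

A test for this shape of helper would typically cover a block older than the local clock, a block with the same timestamp, and a block slightly ahead of it.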

…overy

When the Polygon sync event loop falls behind (tip age > 30s), break out
and re-enter syncToTip which uses efficient waypoint-based batch downloading.

This prevents a death spiral where the event loop processes blocks one at a
time (~2s/block) while new blocks arrive every 2s, causing the node to fall
progressively further behind.

The fix refactors Run() to loop: syncToTip -> initialiseCcb -> runEventLoop,
where runEventLoop returns needsCatchUp=true when lastTipAge exceeds 30s
after processing block events (not milestones, which are finality metadata).

Fixes 0xPolygon#116, 0xPolygon#112
@madumas madumas marked this pull request as ready for review February 6, 2026 18:06

Development

Successfully merging this pull request may close these issues.

Polygon sync enters feedback loop when milestones accumulate