polygon/sync: add post-syncToTip catch-up mode for event loop lag recovery by madumas · Pull Request #120 · 0xPolygon/erigon

madumas · 2026-02-06T18:05:29Z

The Polygon sync event loop processes blocks one at a time via p2p tip events. On bor-mainnet, with its 2-second block time, this creates a feedback loop: when a forkchoice cycle takes slightly longer than 2s (due to execution, DB commit, or Heimdall overhead), the node falls behind by one block. The next cycle then has to process that extra block, which makes it slower still, so more blocks accumulate. The result is a death spiral that can leave the node thousands of blocks behind within hours.

The current architecture has no recovery mechanism once in this state: the event loop keeps processing blocks one-by-one while syncToTip (which uses efficient waypoint-based batch downloading) is never re-entered.

The fix

Refactor Run() into an outer catch-up loop: syncToTip → initialiseCcb → runEventLoop → re-enter if behind.

Track lastTipAge (time since tip block timestamp) in commitExecution()
After processing block events (not milestones, which are finality metadata), check if lastTipAge > 30s
If so, runEventLoop returns needsCatchUp=true, breaking back to syncToTip which can process hundreds of blocks per cycle via waypoint batching
Once caught up, the node re-enters the event loop at the tip

Design decisions

Age check on block events only, not milestones. Milestones are finality metadata that can arrive in bursts. Checking lastTipAge after a milestone could trigger a false catch-up while the node is actually at the tip.
No tight-loop risk. Each re-entry goes through syncToTip → initialiseCcb → runEventLoop. The node must actually drift 30s behind before triggering again, preventing thrashing between modes.
initialCycle stays false on re-entries. Setting it to true would activate aggressive pruning and other first-boot-only code paths. The trade-off is conservative pruning during catch-ups, which is acceptable for short recovery windows.
Fresh canonical chain builder on each re-entry. initialiseCcb is called after every syncToTip, so we never carry stale CCB state across catch-up boundaries.
30s threshold accounts for span rotations. Every ~128 blocks (~256s), the producer set update adds ~12s of overhead. The threshold is set well above this transient spike to avoid spurious triggers.

Production data

Tested on bor-mainnet nodes running v3.3.7:

Before: 28 unrecoverable lag events in 9 hours on a cold node, with tip age drifting to minutes behind
After: catch-up triggers once tip age exceeds 30s, recovers via syncToTip in 2-5 minutes, steady-state tip age 5-7s

Changes

polygon/sync/sync.go: extract runEventLoop() method, add outer catch-up loop in Run(), track lastTipAge
polygon/sync/sync_test.go: unit tests for tip age tracking and threshold constant

…overy

When the Polygon sync event loop falls behind (tip age > 30s), break out and re-enter syncToTip which uses efficient waypoint-based batch downloading. This prevents a death spiral where the event loop processes blocks one at a time (~2s/block) while new blocks arrive every 2s, causing the node to fall progressively further behind.

The fix refactors Run() to loop: syncToTip -> initialiseCcb -> runEventLoop, where runEventLoop returns needsCatchUp=true when lastTipAge exceeds 30s after processing block events (not milestones, which are finality metadata).

Fixes 0xPolygon#116, 0xPolygon#112

madumas marked this pull request as ready for review February 6, 2026 18:06

madumas wants to merge 1 commit into 0xPolygon:release/3.2 from ellipfra:fix/post-synctotip-catchup-mode
