
fix(traffic): bound the Vercel sync drain so a dense/slow window can't wedge a source #681

Open
arberx wants to merge 1 commit into main from fix/vercel-sync-resilience

@arberx arberx commented Jun 4, 2026

Problem

A Vercel traffic sync drains its request-logs window synchronously inside the sync request. In two cases that drain ran for many minutes. The caller timed out (surfacing as a misleading "could not connect"), the run stayed stuck in the running state, and the source ingested nothing until a manual reset:

  • A drifted watermark (the source sat idle while its traffic-sync schedule was paused or missing) makes the sync request a pull reaching back as far as the 30-day default window.
  • A dense or briefly-slow window needs many adaptive sub-window pulls.

The adapter already had a per-fetch 30s timeout + retry, but nothing bounded the TOTAL drain time; the only overall limit was a cap of 5000 sub-windows.

Fix

Two bounds on the incremental Vercel sync:

  1. Window cap (VERCEL_MAX_SYNC_WINDOW_MS = 24h): clamp the start forward so a drifted watermark can't request a multi-day pull. The skipped pre-cap span is surfaced via warn (a backfill recovers it), never silently dropped.

  2. Drain wall-clock deadline (DEFAULT_VERCEL_SYNC_DEADLINE_MS = 4m, override vercelSyncDeadlineMs): drainVercelTrafficEvents stops before starting a sub-window once the budget elapses and reports drainedThroughMs (the last fully-drained boundary). The route then:

    • partial progress → commits the partial window and advances lastSyncedAt only to drainedThroughMs, so the next sync resumes from that boundary. This is safe because the incremental rollup is additive. A dense backlog therefore converges over several syncs instead of one unbounded grind;
    • zero progress → fails the run (visible) rather than orphaning a running row.
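
To make the deadline behaviour concrete, here is a minimal sketch of a bounded drain and the route's three outcomes. It is not the code in this PR: `drainWithDeadline`, `decideOutcome`, the `DrainResult` and `Outcome` types, the fixed `stepMs` (the real drain is adaptive), and the injected `now` clock are all assumptions made for illustration. Only `drainedThroughMs` and `deadlineReached` are names taken from the description above.

```typescript
// Hypothetical result shape; only drainedThroughMs and deadlineReached are named in the PR.
interface DrainResult {
  drainedThroughMs: number; // last fully drained boundary
  deadlineReached: boolean;
  events: number;
}

// Hypothetical pull callback: fetches one sub-window and returns how many events it ingested.
type PullSubWindow = (startMs: number, endMs: number) => Promise<number>;

async function drainWithDeadline(
  startMs: number,
  endMs: number,
  stepMs: number, // fixed step for simplicity; the real drain sizes sub-windows adaptively
  deadlineMs: number,
  pull: PullSubWindow,
  now: () => number = Date.now,
): Promise<DrainResult> {
  const startedAt = now();
  let cursor = startMs;
  let events = 0;
  while (cursor < endMs) {
    // Check the budget BEFORE starting a sub-window, so the cursor is always a clean boundary.
    if (now() - startedAt >= deadlineMs) {
      return { drainedThroughMs: cursor, deadlineReached: true, events };
    }
    const next = Math.min(cursor + stepMs, endMs);
    events += await pull(cursor, next);
    cursor = next;
  }
  return { drainedThroughMs: endMs, deadlineReached: false, events };
}

type Outcome =
  | { kind: "complete"; advanceTo: number }
  | { kind: "partial"; advanceTo: number }
  | { kind: "failed" };

// Maps a drain result to what the route does with lastSyncedAt.
function decideOutcome(startMs: number, result: DrainResult): Outcome {
  if (!result.deadlineReached) {
    return { kind: "complete", advanceTo: result.drainedThroughMs };
  }
  if (result.drainedThroughMs > startMs) {
    // Partial progress: advance only to the last fully drained boundary.
    return { kind: "partial", advanceTo: result.drainedThroughMs };
  }
  // Zero progress: fail visibly instead of leaving a running row behind.
  return { kind: "failed" };
}
```

Injecting `now` keeps the deadline testable without real waiting, which is one way the "already-passed deadline" and "mid-window deadline" cases could be exercised deterministically.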

Retention handling is unchanged and still takes precedence (a clamped-to-tail window fails so lastSyncedAt never advances across missing history).

No API surface change (internal route/adapter options only), so no SDK regen. Patch bump 4.70.0 → 4.70.1.

Tests

  • integration-vercel drain unit: a full drain reports drainedThroughMs == endDate and deadlineReached: false; an already-passed deadline stops before the first pull (zero progress); a mid-window deadline stops with partial progress and reports the boundary.
  • api-routes route: a zero budget fails the run without advancing the watermark; a 5-day-drifted watermark is capped to the last 24h.
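
The 5-day-drift case above comes down to simple arithmetic, sketched here under stated assumptions. `clampSyncWindow`, `ClampedWindow`, and `skippedMs` are hypothetical names, not the route's actual code; only the constant name `VERCEL_MAX_SYNC_WINDOW_MS` and its 24h value come from this PR.

```typescript
const VERCEL_MAX_SYNC_WINDOW_MS = 24 * 60 * 60 * 1000; // 24h, as stated in the PR

// Hypothetical shape for the clamped window.
interface ClampedWindow {
  startMs: number;
  endMs: number;
  skippedMs: number; // span before the cap, which a backfill would need to recover
}

function clampSyncWindow(
  watermarkMs: number,
  nowMs: number,
  maxWindowMs: number = VERCEL_MAX_SYNC_WINDOW_MS,
): ClampedWindow {
  const earliestStart = nowMs - maxWindowMs;
  // Clamp the start forward; a watermark inside the cap is left untouched.
  const startMs = Math.max(watermarkMs, earliestStart);
  return { startMs, endMs: nowMs, skippedMs: startMs - watermarkMs };
}

const DAY_MS = 24 * 60 * 60 * 1000;
const nowMs = Date.UTC(2026, 5, 4); // months are 0-indexed, so this is 4 June 2026
const drifted = clampSyncWindow(nowMs - 5 * DAY_MS, nowMs);

if (drifted.skippedMs > 0) {
  // Surface the skipped span rather than dropping it silently.
  console.warn(`sync window capped; skipped ${drifted.skippedMs / DAY_MS} day(s) before the cap`);
}
```

With a watermark five days old, the window becomes the last 24h and four days are reported as skipped. A watermark that is only a few hours old passes through unchanged with `skippedMs` equal to 0.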

Full suite 1140 passing; typecheck clean; 0 lint errors.

Follow-up (not in this PR)

The sync route is still synchronous, so a manual traffic sync CLI call can still hit its own HTTP timeout even though the daemon now bounds and completes the work. Making the sync route async (return runId immediately, like backfill does) would remove the misleading "could not connect" entirely; worth a separate change.

🤖 Generated with Claude Code

…t wedge a source

A drifted watermark or a dense/slow request-logs window made the synchronous
Vercel sync drain run for many minutes — timing out the caller and leaving the
run stuck 'running', the source ingesting nothing until a manual reset.

Two bounds on the incremental Vercel sync, on top of the existing per-fetch
30s timeout + retry (which never bounded the TOTAL drain):

- Window cap (VERCEL_MAX_SYNC_WINDOW_MS = 24h): clamp the start forward so a
  watermark that drifted past the cap can't request a multi-day pull. The
  skipped span is surfaced (warn), not silent — a backfill recovers it.

- Drain wall-clock deadline (DEFAULT_VERCEL_SYNC_DEADLINE_MS = 4m, override via
  vercelSyncDeadlineMs): the adaptive drain stops before a sub-window once the
  budget elapses and reports how far it got. The route commits that partial
  window and advances lastSyncedAt only to there (the additive rollup makes a
  partial window safe), so the next sync resumes from the boundary instead of
  one sync grinding unbounded. If nothing drained before the budget the run
  fails (visible) instead of orphaning a 'running' row.

No API surface change (internal options only), so no SDK regen. 4.70.0 -> 4.70.1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>