fix(worker): count bank-serialized consolidation ops in [PENDING_BREAKDOWN] so a stuck-bank deadlock stops reporting phantom claimable (#2359) #2360

Closed
r266-tech wants to merge 1 commit into vectorize-io:main from r266-tech:fix/pending-breakdown-bank-serialized-2359
@r266-tech

Contributor

Addresses #2359

The reporter sees consolidation stalled in v0.8.3 with a [PENDING_BREAKDOWN]
line reporting claimable=1 assigned=0 while no worker ever claims the op. This
PR makes that diagnostic honest; the full root cause and the remaining
(state-mutating) half are written up below for a maintainer call.

Root cause — a permanent bank-serialization deadlock

  1. claim_batch builds busy_bank_ids = SELECT DISTINCT bank_id WHERE operation_type='consolidation' AND status='processing'
    (ops_postgresql.py), and the consolidation claim excludes those banks via
    bank_id != ALL($1::text[]) (_claim_consolidation_plain).
  2. An op stuck in status='processing' under a dead/foreign worker_id
    keeps its bank in busy_bank_ids forever, so the pending op that new
    /consolidate POSTs dedupe to for that same bank is excluded on every poll.
  3. The only reset of stuck processing rows, recover_own_tasks, filters
    WHERE status='processing' AND worker_id=$1 AND result_metadata->>'batch_id' IS NULL (this worker only, at startup only). A foreign/dead worker_id is
    never reclaimed by any live worker, and the run loop has no periodic
    stale/orphan reaper. The repo has no worker liveness/heartbeat/registry, so
    "reclaim ops whose worker is dead" can only be keyed on claimed_at age.
  4. The op also can't be cleared by hand: cancel_operation rejects non-pending
    ops and retry_failed_consolidation only clears unit flags, not the
    async_operations row.
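
For illustration, the deadlock in steps 1 to 3 can be reproduced with a small in-memory model. This is only a sketch: the `Op` dataclass and the Python helpers are hypothetical stand-ins for the SQL in ops_postgresql.py, and the `batch_id` condition of recover_own_tasks is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Op:
    # Hypothetical in-memory stand-in for an async_operations row.
    op_id: str
    bank_id: str
    operation_type: str
    status: str  # 'pending' or 'processing'
    worker_id: Optional[str] = None


def busy_bank_ids(ops: List[Op]) -> Set[str]:
    # Mirrors: SELECT DISTINCT bank_id WHERE operation_type='consolidation'
    #          AND status='processing'
    return {
        o.bank_id
        for o in ops
        if o.operation_type == "consolidation" and o.status == "processing"
    }


def claimable_consolidations(ops: List[Op]) -> List[Op]:
    # Mirrors the bank exclusion: bank_id != ALL(busy_bank_ids)
    busy = busy_bank_ids(ops)
    return [
        o
        for o in ops
        if o.operation_type == "consolidation"
        and o.status == "pending"
        and o.bank_id not in busy
    ]


def recover_own_tasks(ops: List[Op], worker_id: str) -> int:
    # Resets only rows owned by the calling worker (simplified model).
    recovered = 0
    for o in ops:
        if o.status == "processing" and o.worker_id == worker_id:
            o.status, o.worker_id = "pending", None
            recovered += 1
    return recovered


ops = [
    Op("op-stuck", "bank-a", "consolidation", "processing", worker_id="dead-worker"),
    Op("op-waiting", "bank-a", "consolidation", "pending"),
]

# A live worker starts up: it owns nothing, so nothing is reset ...
recovered = recover_own_tasks(ops, "live-worker")
# ... and the pending op stays excluded because its bank is still busy.
still_claimable = claimable_consolidations(ops)
```

In this model `recovered` is 0 and `still_claimable` is empty on every poll, which is the state described in the issue.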

What this PR changes (diagnostic only)

[PENDING_BREAKDOWN]'s claimable was
total - payload_null - retry_blocked - assigned — it never modelled the
bank-serialization exclusion, so a permanently bank-blocked op showed up as
phantom claimable. This adds a bank_serialized bucket that mirrors the claim
query's bank exclusion (bank_id is NOT NULL by schema, so bank_id IN (busy) is the exact negation of bank_id != ALL(busy)) and subtracts it from
claimable (floored at 0). The reporter's line now reads
consolidation: total=1 claimable=0 ... bank_serialized=1, pointing straight at
the stuck bank instead of an impossible "claimable" op.

This is read-only logging — it makes the deadlock legible but does not by
itself reclaim the orphaned op.
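
A minimal sketch of the adjusted arithmetic, assuming the breakdown is available as a dict of per-bucket counts. The function name `claimable_count` and the dict shape are illustrative, not the worker's actual code, and flooring the legacy case as well is an assumption of this sketch.

```python
def claimable_count(bucket: dict) -> int:
    """Return the claimable count for one [PENDING_BREAKDOWN] bucket.

    Hypothetical helper: a missing 'bank_serialized' key falls back to the
    legacy formula (total - payload_null - retry_blocked - assigned).
    """
    claimable = (
        bucket["total"]
        - bucket.get("payload_null", 0)
        - bucket.get("retry_blocked", 0)
        - bucket.get("assigned", 0)
        - bucket.get("bank_serialized", 0)
    )
    # Floor at 0 so overlapping buckets can never yield a negative count.
    return max(claimable, 0)


# The reporter's lone deadlocked op: previously claimable=1, now claimable=0.
reporter = {
    "total": 1,
    "payload_null": 0,
    "retry_blocked": 0,
    "assigned": 0,
    "bank_serialized": 1,
}
reporter_claimable = claimable_count(reporter)

# A bucket without the new key keeps the legacy result.
legacy = {"total": 3, "payload_null": 1, "retry_blocked": 0, "assigned": 1}
legacy_claimable = claimable_count(legacy)
```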

The remaining half (deferred for a maintainer call)

To actually unstick it, an orphaned processing consolidation op needs to be
reset to pending so its bank un-serializes. Since there's no heartbeat, the
only signal is claimed_at age — e.g. broaden a recover-style reset to any
worker, gated on claimed_at older than a bounded threshold, run periodically
in the loop or as one of the existing maintenance routines. I deliberately did
not hard-code that threshold here: it's TOCTOU-sensitive (a legitimately
long-running op must not be double-claimed) and the right bound depends on the
deployment's max consolidation time, which is your call. Happy to follow up with
that change in whatever shape you prefer.
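
To make the proposal concrete, here is one possible shape of an age-gated reset, modelled in memory. Everything in it is hypothetical: `ProcessingOp`, `reclaim_stale`, and the `max_age` parameter are illustrative names, and the one-hour threshold in the example is arbitrary, not a recommendation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ProcessingOp:
    # Hypothetical in-memory stand-in for an async_operations row.
    op_id: str
    status: str
    worker_id: Optional[str]
    claimed_at: Optional[datetime]


def reclaim_stale(ops: List[ProcessingOp], now: datetime, max_age: timedelta) -> List[str]:
    """Reset 'processing' ops claimed longer ago than max_age, for any worker."""
    reclaimed = []
    for op in ops:
        if op.status != "processing" or op.claimed_at is None:
            continue
        if now - op.claimed_at > max_age:
            # Back to pending so the bank stops counting as busy.
            op.status, op.worker_id, op.claimed_at = "pending", None, None
            reclaimed.append(op.op_id)
    return reclaimed


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ops = [
    ProcessingOp("op-orphan", "processing", "dead-worker", now - timedelta(hours=6)),
    ProcessingOp("op-fresh", "processing", "live-worker", now - timedelta(minutes=5)),
    ProcessingOp("op-pending", "pending", None, None),
]
reclaimed = reclaim_stale(ops, now, max_age=timedelta(hours=1))
```

Against a real database the check and the reset would presumably need to be a single conditional UPDATE rather than a read followed by a write, so that an op still being processed is not reset between the two steps.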

Tests

tests/test_worker.py::TestPendingBreakdownClaimable (no DB) pins the new
formula: a bank-serialized op is not counted as claimable, the reporter's lone
deadlocked op reports claimable=0, buckets without the new key keep the legacy
formula, and the result is floored at 0. The SQL predicate was cross-checked
against _claim_consolidation_plain and the bank_id NOT NULL schema
constraint.
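
Why the NOT NULL constraint matters for that cross-check can be seen in a simplified model of SQL three-valued logic, with `None` standing in for NULL. The helpers `ne_all` and `in_set` are illustrative only, and the model assumes a non-empty busy list with no NULL elements.

```python
from typing import Optional, Sequence


def ne_all(value: Optional[str], busy: Sequence[str]) -> Optional[bool]:
    # Models `value != ALL(busy)`; comparisons against NULL are unknown (None).
    if value is None:
        return None
    return all(value != b for b in busy)


def in_set(value: Optional[str], busy: Sequence[str]) -> Optional[bool]:
    # Models `value IN (busy)`; comparisons against NULL are unknown (None).
    if value is None:
        return None
    return any(value == b for b in busy)


busy = ["bank-a"]

# For non-NULL bank_ids the two predicates are exact negations, so every
# pending op lands in exactly one of: passes the claim filter / bank_serialized.
for bank_id in ("bank-a", "bank-b"):
    assert in_set(bank_id, busy) == (not ne_all(bank_id, busy))

# A NULL bank_id would satisfy neither predicate and fall out of both counts,
# which is why the equivalence relies on the NOT NULL schema constraint.
null_result = (ne_all(None, busy), in_set(None, busy))
```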

…KDOWN]

A pending consolidation op whose bank already has a status='processing'
consolidation is serialized out of the claim query
(bank_id != ALL(busy_bank_ids)) until that bank frees. The
[PENDING_BREAKDOWN] diagnostic did not account for this, so such an op was
reported as claimable even though no worker could ever pick it up — the
exact 'claimable=1 assigned=0 but never claimed' signal in vectorize-io#2359.

Add a bank_serialized bucket mirroring the claim query's bank-exclusion
predicate (bank_id is NOT NULL by schema, so the IN-set test is the exact
negation of bank_id != ALL(busy)) and subtract it from claimable (floored at
0). Read-only diagnostic change; it makes the stuck-bank deadlock legible but
does not on its own reclaim the orphaned 'processing' op (see PR description).