fix(worker): count bank-serialized consolidation ops in [PENDING_BREAKDOWN] so a stuck-bank deadlock stops reporting phantom claimable (#2359)#2360
Closed
r266-tech wants to merge 1 commit into
Conversation
…KDOWN] A pending consolidation op whose bank already has a status='processing' consolidation is serialized out of the claim query (bank_id != ALL(busy_bank_ids)) until that bank frees. The [PENDING_BREAKDOWN] diagnostic did not account for this, so such an op was reported as claimable even though no worker could ever pick it up — the exact 'claimable=1 assigned=0 but never claimed' signal in vectorize-io#2359. Add a bank_serialized bucket mirroring the claim query's bank-exclusion predicate (bank_id is NOT NULL by schema, so the IN-set test is the exact negation of bank_id != ALL(busy)) and subtract it from claimable (floored at 0). Read-only diagnostic change; it makes the stuck-bank deadlock legible but does not on its own reclaim the orphaned 'processing' op (see PR description).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Addresses #2359
The reporter sees consolidation stalled in v0.8.3 with a
[PENDING_BREAKDOWN]line reporting
claimable=1 assigned=0while no worker ever claims the op. ThisPR makes that diagnostic honest; the full root cause and the remaining
(state-mutating) half are written up below for a maintainer call.
Root cause — a permanent bank-serialization deadlock
claim_batchbuildsbusy_bank_ids = SELECT DISTINCT bank_id WHERE operation_type='consolidation' AND status='processing'(
ops_postgresql.py), and the consolidation claim excludes those banks viabank_id != ALL($1::text[])(_claim_consolidation_plain).status='processing'under a dead/foreignworker_idkeeps its bank in
busy_bank_idsforever, so the pending op that new/consolidatePOSTs dedupe to for that same bank is excluded on every poll.processingrows,recover_own_tasks, filtersWHERE status='processing' AND worker_id=$1 AND result_metadata->>'batch_id' IS NULL— this worker, at startup only. A foreign/deadworker_idisnever reclaimed by any live worker, and the run loop has no periodic
stale/orphan reaper. The repo has no worker liveness/heartbeat/registry, so
"reclaim ops whose worker is dead" can only be keyed on
claimed_atage.cancel_operationrejects non-pendingops and
retry_failed_consolidationonly clears unit flags, not theasync_operationsrow.What this PR changes (diagnostic only)
[PENDING_BREAKDOWN]'sclaimablewastotal - payload_null - retry_blocked - assigned— it never modelled thebank-serialization exclusion, so a permanently bank-blocked op showed up as
phantom
claimable. This adds abank_serializedbucket that mirrors the claimquery's bank exclusion (
bank_idisNOT NULLby schema, sobank_id IN (busy)is the exact negation ofbank_id != ALL(busy)) and subtracts it fromclaimable(floored at 0). The reporter's line now readsconsolidation: total=1 claimable=0 ... bank_serialized=1, pointing straight atthe stuck bank instead of an impossible "claimable" op.
This is read-only logging — it makes the deadlock legible but does not by
itself reclaim the orphaned op.
The remaining half (deferred for a maintainer call)
To actually unstick it, an orphaned
processingconsolidation op needs to bereset to
pendingso its bank un-serializes. Since there's no heartbeat, theonly signal is
claimed_atage — e.g. broaden arecover-style reset to anyworker, gated on
claimed_atolder than a bounded threshold, run periodicallyin the loop or as one of the existing maintenance routines. I deliberately did
not hard-code that threshold here: it's TOCTOU-sensitive (a legitimately
long-running op must not be double-claimed) and the right bound depends on the
deployment's max consolidation time, which is your call. Happy to follow up with
that change in whatever shape you prefer.
Tests
tests/test_worker.py::TestPendingBreakdownClaimable(no DB) pins the newformula: a bank-serialized op is not counted as claimable, the reporter's lone
deadlocked op reports
claimable=0, buckets without the new key keep the legacyformula, and the result is floored at 0. The SQL predicate was cross-checked
against
_claim_consolidation_plainand thebank_idNOT NULLschemaconstraint.