fix: prevent concurrent state overwrites (lost update bug) #1532
cristiam86 merged 4 commits into main
Conversation
Root cause: when two consensus workers process transactions for the same contract concurrently, update_contract_state() does a read-modify-write on the full JSONB blob. The second commit silently overwrites the first's changes because it reads stale state. This caused 336+ lost submissions in Rally production (March 2026).

Fixes applied:
1. Advisory locks in all claim CTEs (claim_next_transaction, claim_next_finalization, claim_next_appeal) via pg_try_advisory_xact_lock(hashtext(to_address)) — closes the TOCTOU window where two workers both pass the NOT EXISTS check before either commits blocked_at.
2. SELECT FOR UPDATE + populate_existing() in update_contract_state() — serializes concurrent writes and ensures fresh state is read from the DB, preventing cross-field clobber (accepted vs finalized).

Also adds:
- DB-level regression tests (test_concurrent_state_update.py)
- E2E state integrity test (Counter contract + invariant check)
- CI runs with CONSENSUS_WORKERS=3 for realistic multi-worker testing
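The failure mode is easy to see in isolation. Below is a minimal sketch (plain Python, not the project's code) of why a full-blob read-modify-write loses one of two concurrent updates, and why serializing the read fixes it:

```python
def simulate(serialize: bool) -> dict:
    """Two workers each read the whole state blob, change one field,
    and write the whole blob back. `serialize` stands in for
    SELECT ... FOR UPDATE: the second worker re-reads fresh state
    only after the first has committed."""
    db = {"accepted": 0, "finalized": 0}  # the row as stored in the DB

    if serialize:
        snap_a = dict(db); snap_a["accepted"] += 1; db.update(snap_a)   # A: read fresh, write
        snap_b = dict(db); snap_b["finalized"] += 1; db.update(snap_b)  # B: read fresh, write
    else:
        snap_a = dict(db)                 # A reads
        snap_b = dict(db)                 # B reads the same, soon-stale state
        snap_a["accepted"] += 1
        db.update(snap_a)                 # A commits its full blob
        snap_b["finalized"] += 1
        db.update(snap_b)                 # B commits a stale blob: A's change is lost
    return db

assert simulate(False) == {"accepted": 0, "finalized": 1}  # lost update
assert simulate(True) == {"accepted": 1, "finalized": 1}   # both changes survive
```

The advisory lock prevents the two workers from entering this interleaving in the first place; FOR UPDATE guarantees a fresh read even if they do.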
📝 Walkthrough

Adds per-contract advisory locks to consensus claim queries, applies row-level FOR UPDATE locking when updating contract state, introduces concurrency regression and load tests to detect lost-update scenarios, and updates CI to run state-integrity tests against a multi-worker Docker Compose setup.
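The row-level locking half of the change (take the write lock before reading, so the second writer always sees committed state) can be sketched with stdlib sqlite3, using BEGIN IMMEDIATE as a stand-in for Postgres's SELECT ... FOR UPDATE; the table and column names here are illustrative, not the project's schema:

```python
import os
import sqlite3
import tempfile
import threading

# Illustrative schema; the real state lives in contract_processor.py.
path = os.path.join(tempfile.mkdtemp(), "state.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE current_state (id INTEGER PRIMARY KEY, accepted INT, finalized INT)")
con.execute("INSERT INTO current_state VALUES (1, 0, 0)")
con.commit()
con.close()

def bump(field: str) -> None:
    # BEGIN IMMEDIATE takes the write lock BEFORE the read, so a second
    # worker blocks here and then reads fresh state: the same discipline
    # SELECT ... FOR UPDATE enforces in Postgres.
    worker = sqlite3.connect(path, timeout=10, isolation_level=None)
    worker.execute("BEGIN IMMEDIATE")
    accepted, finalized = worker.execute(
        "SELECT accepted, finalized FROM current_state WHERE id = 1"
    ).fetchone()
    row = {"accepted": accepted, "finalized": finalized}
    row[field] += 1  # read-modify-write of the whole "blob"
    worker.execute(
        "UPDATE current_state SET accepted = ?, finalized = ? WHERE id = 1",
        (row["accepted"], row["finalized"]),
    )
    worker.execute("COMMIT")
    worker.close()

t_a = threading.Thread(target=bump, args=("accepted",))
t_b = threading.Thread(target=bump, args=("finalized",))
t_a.start(); t_b.start()
t_a.join(); t_b.join()
```

Without the up-front lock, both workers could read (0, 0) and the second commit would erase the first field's increment.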
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant Consensus as Consensus Worker
    participant DB as PostgreSQL
    participant Indexer as Indexer/API
    rect rgba(200, 230, 255, 0.5)
        Note over Client,Indexer: State integrity test flow (deploy + N parallel increments)
    end
    Client->>Consensus: Submit deploy tx
    Consensus->>DB: INSERT tx / persist
    DB-->>Consensus: tx stored
    Consensus->>Indexer: Index tx (async)
    Indexer-->>Client: Deployment confirmed (contract address)
    par Dispatch N increment txs
        Client->>Consensus: Submit increment tx (ephemeral sender)
    and
        Client->>Consensus: Submit increment tx (other sender)
    end
    rect rgba(200, 230, 255, 0.5)
        Note over Consensus,DB: Claim handling with advisory lock
    end
    Consensus->>DB: pg_try_advisory_xact_lock(hashtext(contract))
    DB-->>Consensus: LOCK ACQUIRED / DENIED
    Consensus->>DB: SELECT CurrentState ... FOR UPDATE
    DB-->>Consensus: CurrentState row (locked)
    Consensus->>DB: UPDATE CurrentState (merge fields)
    DB-->>Consensus: UPDATE OK
    Note over Consensus: Release locks at tx end
    Consensus->>Indexer: Index increment tx (async)
    Indexer-->>Client: Receipt / finalization
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/consensus/worker.py (1)
Lines 300-309: ⚠️ Potential issue | 🟠 Major: Serialize appeals behind any in-flight work for the same contract.

This NOT EXISTS only excludes other blocked appeals. Once a regular transaction or finalization has committed its blocked_at, this query can still claim an appeal for the same to_address, even though the appeal handlers below still pass a ContractProcessor and can mutate the same row again. Please align this predicate with the broader same-contract exclusion used in the other claim paths.

Suggested fix
```diff
         AND NOT EXISTS (
-            -- Ensure no other appeal for same contract is being processed
+            -- Ensure no other work for same contract is being processed
             SELECT 1 FROM transactions t2
             WHERE t2.to_address = t.to_address
-              AND t2.appealed = true
               AND t2.blocked_at IS NOT NULL
               AND t2.blocked_at > NOW() - CAST(:timeout AS INTERVAL)
               AND t2.hash != t.hash
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/consensus/worker.py` around lines 300 - 309, The NOT EXISTS predicate currently only excludes other blocked appeals by checking t2.appealed = true; change it to exclude any in-flight work for the same contract by removing the appealed filter so the subquery simply checks for any row with the same to_address, blocked_at IS NOT NULL, recent blocked_at (NOW() - CAST(:timeout AS INTERVAL)) and t2.hash != t.hash; update the subquery in the claim SQL (the SELECT 1 FROM transactions t2 ... WHERE t2.to_address = t.to_address ...) so it no longer requires t2.appealed = true, thereby aligning it with the same-contract exclusion used elsewhere (keep the pg_try_advisory_xact_lock(hashtext(t.to_address)) call as-is).
🧹 Nitpick comments (2)
tests/load/test_state_integrity.py (1)
Lines 42-57: Avoid the "first address-looking value wins" fallback.

If to_address/recipient are absent, scanning raw.values() can just as easily return the deployer or another 20-byte field, and the rest of the script will exercise the wrong contract. Prefer the deterministic address from the deployment result/receipt instead of heuristically walking the transaction payload.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/load/test_state_integrity.py` around lines 42 - 57, the current get_contract_address function should not fall back to the first address-like value in the transaction payload; instead call the transaction receipt API to deterministically obtain the deployed contract address. Modify get_contract_address to: 1) keep the initial rpc_call to eth_getTransactionByHash only for validation if desired, 2) perform rpc_call(api_url, "eth_getTransactionReceipt", [tx_hash]) and extract the contractAddress field from the receipt (or its canonical key if different), 3) return that contractAddress if present and non-zero, and 4) remove the heuristic loop over raw.values(); if no contractAddress is present, raise a RuntimeError as before. Reference get_contract_address and rpc_call to locate changes.

.github/workflows/load-test-oha.yml (1)
Lines 87-90: Please verify the job actually reaches the multi-worker path.

This step requests CONSENSUS_WORKERS: 3, but the workflow never asserts that three consensus-worker containers are running before the integrity test. Adding an explicit post-start replica-count check would prevent this regression from silently passing on a single-worker setup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/load-test-oha.yml around lines 87 - 90, The workflow sets CONSENSUS_WORKERS: 3 in the "Run Docker Compose with multiple workers" step but never verifies that three consensus-worker containers are actually up; add a follow-up step that queries the running containers for the consensus-worker service (e.g., via docker compose ps or docker ps filtered by name) and compares the count against the CONSENSUS_WORKERS value (3) and fails the job if the counts mismatch so the integrity test only runs when the expected replica count is reached.
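The receipt-based lookup suggested for test_state_integrity.py above could look roughly like this (a sketch: rpc_call here is a stand-in for the script's helper, and the contractAddress key is an assumption about the node's response shape):

```python
from typing import Callable

def get_contract_address(rpc_call: Callable[[str, list], dict], tx_hash: str) -> str:
    """Resolve the deployed contract address from the transaction receipt
    instead of scanning the tx payload for anything address-shaped.
    Sketch only: the rpc_call signature and receipt key are assumptions."""
    receipt = rpc_call("eth_getTransactionReceipt", [tx_hash]) or {}
    address = receipt.get("contractAddress")
    if not address or int(address, 16) == 0:
        raise RuntimeError(f"no contract address in receipt for {tx_hash}")
    return address

# Usage with a stubbed transport:
def fake_rpc(method: str, params: list) -> dict:
    assert method == "eth_getTransactionReceipt"
    return {"contractAddress": "0x" + "ab" * 20}

assert get_contract_address(fake_rpc, "0xdeadbeef") == "0x" + "ab" * 20
```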
ℹ️ Review info

Configuration used: defaults | Review profile: CHILL | Plan: Pro

📒 Files selected for processing (7)
- .github/workflows/load-test-oha.yml
- backend/consensus/worker.py
- backend/database_handler/contract_processor.py
- tests/db-sqlalchemy/test_concurrent_state_update.py
- tests/load/contracts/counter.py
- tests/load/run_state_integrity_test.sh
- tests/load/test_state_integrity.py
```python
t_a = threading.Thread(target=worker_a)
t_b = threading.Thread(target=worker_b)
t_a.start()
t_b.start()
t_a.join(timeout=10)
t_b.join(timeout=10)

assert not errors, f"Worker errors: {errors}"
```
Fail fast if a writer thread never exits.
These joins time out, but the test never asserts that both threads actually finished. If the new lock path blocks or deadlocks, the suite can keep going with half-written state or hang at process shutdown.
Suggested fix

```diff
 t_a.join(timeout=10)
 t_b.join(timeout=10)
+assert not t_a.is_alive(), "worker A did not finish"
+assert not t_b.is_alive(), "worker B did not finish"
```

Apply the same guard to the second two-thread case below.
Also applies to: 182-189
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/db-sqlalchemy/test_concurrent_state_update.py` around lines 115 - 122,
The test currently joins t_a and t_b with a timeout but doesn't fail if a thread
stayed alive; after t_a.join(timeout=10) and t_b.join(timeout=10) add assertions
that both threads finished (e.g., assert not t_a.is_alive() and assert not
t_b.is_alive() with a clear message like "Writer thread did not exit") so the
test fails fast on deadlock; apply the same change to the other two-thread case
that uses t_a/t_b and worker_a/worker_b later in the file.
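A reusable form of this guard (a sketch, not the test file's code) that joins with a timeout and then fails fast if any writer thread is still alive:

```python
import threading
import time

def join_or_fail(threads: list, timeout: float) -> None:
    """Join each thread, then assert it actually exited. A timed join
    alone returns silently on deadlock, letting the test continue with
    half-written state."""
    for t in threads:
        t.join(timeout=timeout)
    stuck = [t.name for t in threads if t.is_alive()]
    assert not stuck, f"writer threads did not exit: {stuck}"

# Usage mirroring the test's two worker threads:
t_a = threading.Thread(target=lambda: time.sleep(0.01), name="worker_a")
t_b = threading.Thread(target=lambda: time.sleep(0.01), name="worker_b")
t_a.start(); t_b.start()
join_or_fail([t_a, t_b], timeout=10)
```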
```python
from genlayer import *


class Counter(gl.Contract):
    count: int

    def __init__(self):
        self.count = 0
```
Replace the star import here.
from genlayer import * is already tripping Ruff F403/F405, and this file only uses the gl namespace. While touching it, please also add the missing -> None on __init__.
Suggested fix

```diff
-from genlayer import *
+import genlayer as gl
@@
-    def __init__(self):
+    def __init__(self) -> None:
```

As per coding guidelines, **/*.py: Include type hints in all Python code.
🧰 Tools
🪛 Ruff (0.15.5)
[error] 3-3: from genlayer import * used; unable to detect undefined names
(F403)
[error] 6-6: gl may be undefined, or defined from star imports
(F405)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/load/contracts/counter.py` around lines 3 - 10, Replace the wildcard
import by importing only the gl namespace used (e.g., change "from genlayer
import *" to import gl from genlayer) and add the missing return type on the
Counter.__init__ signature (make __init__(self) -> None) so the class Counter
(subclassing gl.Contract) has an explicit import and correct type hint.
```python
print(f" FAIL: LOST UPDATE DETECTED")
print(f" Lost {lost}/{succeeded} increments ({loss_rate:.1f}% loss rate)")
print()
print(f" This confirms the concurrent state overwrite bug.")
print(f" See: Rally2/docs/genvm-state-mismatch-bug.md")
```
Drop the inert f prefixes.
These literals have no placeholders, so Ruff will flag them as F541.
🧰 Tools
🪛 Ruff (0.15.5)
[error] 230-230: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 233-233: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 234-234: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/load/test_state_integrity.py` around lines 230 - 234, In the failing
test prints, several print(...) calls use inert f-strings with no placeholders
(e.g., the prints that output " FAIL: LOST UPDATE DETECTED", the blank print(),
" This confirms the concurrent state overwrite bug.", and " See:
Rally2/docs/genvm-state-mismatch-bug.md"); remove the unnecessary leading f from
those string literals so they are plain string literals (keep the other prints
that interpolate lost/succeeded/loss_rate as f-strings).
The endpoint load test step runs sim_deleteAllValidators and sim_clearDbTables, leaving no validators for the subsequent state integrity test which then fails.
🧹 Nitpick comments (1)
backend/consensus/worker.py (1)
Line 418: Consider using pg_catalog.hashtextextended() for 64-bit advisory lock keys to improve collision distribution.

hashtext() returns a 32-bit hash; when implicitly cast to bigint for pg_try_advisory_xact_lock(), the upper 32 bits remain zero, concentrating keys and increasing false serialization risk at scale.

PostgreSQL 12+ supports hashtextextended(text, bigint), which generates a true 64-bit key. When using it, explicitly apply COLLATE "C" to ensure deterministic hashing across environments:

pg_try_advisory_xact_lock(pg_catalog.hashtextextended(t.to_address COLLATE "C", 0))

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/consensus/worker.py` at line 418, Replace the 32-bit hash usage with PostgreSQL's 64-bit hash function: change the call to pg_try_advisory_xact_lock(hashtext(t.to_address)) to use pg_catalog.hashtextextended and a deterministic collation, i.e. call pg_try_advisory_xact_lock(pg_catalog.hashtextextended(t.to_address COLLATE "C", 0)); update references to hashtext in this query to hashtextextended and ensure COLLATE "C" is applied to t.to_address to avoid environment-dependent hashing.
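The key-space difference can be illustrated with stand-in hashes (sha256 here; Postgres's actual hashtext/hashtextextended are different functions, so only the bit widths carry over):

```python
import hashlib

def key32(text: str) -> int:
    """Stand-in for hashtext(): a 32-bit lock key. Cast to bigint for
    pg_try_advisory_xact_lock, the upper 32 bits stay zero, so every
    contract shares a 2**32 key space."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")

def key64(text: str) -> int:
    """Stand-in for hashtextextended(text, 0): a full 64-bit key space."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

addresses = ["0x%040x" % i for i in range(200)]
assert all(key32(a) < 2**32 for a in addresses)   # 32-bit keys only
assert any(key64(a) >= 2**32 for a in addresses)  # 64-bit keys use the upper bits
```

With many contracts, packing all advisory keys into 32 bits raises the odds that two unrelated contracts hash to the same key and falsely serialize against each other.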
ℹ️ Review info

Configuration used: defaults | Review profile: CHILL | Plan: Pro

📒 Files selected for processing (1)
- backend/consensus/worker.py
🎉 This PR is included in version 0.111.0 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
Summary
Fixes the lost update bug where concurrent consensus workers processing transactions for the same contract silently overwrite each other's state changes. This caused 336+ lost submissions in Rally production (March 2026).
Root cause
update_contract_state() does a read-modify-write on the full JSONB current_state blob. When two workers call it concurrently for the same contract, the second commit overwrites the first's changes because:

- The NOT EXISTS check on blocked_at runs under READ COMMITTED isolation — two workers can both pass it before either commits
- update_contract_state() reads the state row without FOR UPDATE, so the second reader gets stale data
- Writes replace the whole accepted/finalized dict, not individual fields

Fixes
- pg_try_advisory_xact_lock(hashtext(to_address)) in all three claim CTEs (claim_next_transaction, claim_next_finalization, claim_next_appeal) — atomically prevents two workers from claiming transactions for the same contract within the same SQL transaction window
- SELECT ... FOR UPDATE + populate_existing() in update_contract_state() — serializes concurrent writes to the same contract row and ensures SQLAlchemy refreshes from the DB (not the identity map cache). Prevents cross-field clobber (accepted writer doesn't clobber finalized)

Tests
- DB-level regression tests (tests/db-sqlalchemy/test_concurrent_state_update.py): 3 tests using threading barriers to force concurrent execution
  - test_concurrent_accepted_updates_preserve_both (xfail — same-field replacement by design, prevented upstream by advisory locks)
  - test_concurrent_accepted_and_finalized_preserve_both — PASSES with the FOR UPDATE fix
  - test_sequential_updates_preserve_all_state — sanity baseline
- E2E state integrity test (tests/load/test_state_integrity.py): deploys a Counter contract, fires N concurrent increments, asserts get_count() == N
- CI runs with CONSENSUS_WORKERS=3 for realistic multi-worker load testing

Known remaining risks (follow-up)
Reviewed by Codex (gpt-5.3). This fix closes race path 1 (TOCTOU in NOT EXISTS). Two other paths remain:
1. blocked_at lease expiry — default 30-min timeout with no heartbeat. If consensus exceeds 30 min (possible: leader 10min + validator 15min × 6 retries = 90min worst case), another worker treats the tx as dead and reclaims it. Fix: add a heartbeat during processing.
2. recover_stuck_transactions() — runs every 60s in every worker loop, resets expired blocked_at txs back to PENDING while the original worker may still be running. Fix: add worker_id/version fencing on downstream writes.

Both are lower probability than path 1 (which matches CryptoColaX's exact scenario — 4-min consensus, 11s apart) and are best addressed in follow-up PRs.

Evidence
Evidence
See Rally2/docs/genvm-state-mismatch-bug.md for full investigation: