Prevent database corruption during stateroot computation #3705
Conversation
txFrame = parentFrame.txFrameBegin

# Update the snapshot before processing the block so that any vertexes in snapshots
# from lower levels than the baseTxFrame are removed from the snapshot before running
this comment looks somewhat concerning - there should be no vertices from below baseTxFrame in memory at any point - when we persist, we move everything below baseTxFrame to the on-disk database and empty the baseTxFrame, breaking its "link" with the txframe below. In fact, once we've performed persist, any "levels" below baseTxFrame are no longer connected and should not be reachable, so how come they're still being added to?
The data in each snapshot contains a level (integer). Every time we create a snapshot on a txFrame, the snapshot is moved from the older frame, but the level at which the vertex originated is kept in the snapshot, so that even after we persist and clean up all the lower-level txFrames (as you've mentioned), the vertex data from the lower level can remain in the snapshot which is part of a newer (not yet persisted) txFrame.
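The mechanism described here can be illustrated with a toy Python model (not the actual Nim/Aristo code; the class and field names are made up for illustration): snapshot entries migrate to the newer frame when a child frame is created, but each entry keeps the level it originated at, so lower-level data can survive inside a newer frame's snapshot.

```python
# Toy model of txFrame snapshots where each entry is tagged with its
# origin level. Names (TxFrame, begin_child, write) are illustrative only.

class TxFrame:
    def __init__(self, level, parent=None):
        self.level = level
        self.parent = parent
        self.snapshot = {}  # vertex_id -> (data, origin_level)

    def begin_child(self):
        child = TxFrame(self.level + 1, self)
        # The snapshot moves to the newer frame, keeping origin levels intact.
        child.snapshot = dict(self.snapshot)
        self.snapshot = {}
        return child

    def write(self, vid, data):
        # A write records the level of the frame it happened in.
        self.snapshot[vid] = (data, self.level)

base = TxFrame(level=0)
base.write("v1", "old-data")   # v1 originates at level 0
tx1 = base.begin_child()
tx1.write("v2", "new-data")    # v2 originates at level 1

# Even if the level-0 frame is persisted and discarded, tx1's snapshot
# still carries v1 tagged with origin level 0:
print(tx1.snapshot["v1"])      # ('old-data', 0)
```

This is the situation the comment describes: after persist drops the lower frames, the newer frame's snapshot still holds entries whose origin level is below the new base.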
Note that in this scenario (when importing blocks via forked chain sync) the txFrame that gets persisted (during updateBase) lags behind the txFrame at the head, which holds the snapshot. In other words, we are not persisting the txFrame which is holding the snapshot but rather an older txFrame with a lower level.
Ah, I think I see the problem - this used to be cleared out by persist but I guess the way snapshots are being built now influences it.
in that case, shouldn't this be fixable by comparing the level of the snapshot with the basetxframe level and discarding on access if it's lower? that would remove the need to call updateSnapshot (which feels more like a workaround - the level stuff should be transparent to the user of the api, it's an internal optimization mechanism)
or maybe not even discarding, but returning "dblevel" in that scenario .. ie if a snapshot was written to the database and there's no "newer" information in higher levels, it seems to me it should still be valid? what am I missing?
Well, the main point of this change is that we shouldn't be writing vertexes to the db during the stateroot computation that are older/at a lower level than the database level. My reasoning is that it could cause db corruption if the vertex being written happens to be older than what is already in the database. If we never deal with multiple branches/forks then this is probably safe because the full history of writes would be in the snapshot, but in practice, when dealing with multiple branches and reorgs, we have a dag of txFrames and multiple distinct snapshots. If I return dbLevel, then the older vertex would get written to the db as was done before this change, so there's no point in doing that.
in that case, shouldn't this be fixable by comparing the level of the snapshot with the basetxframe level and discarding on access if it's lower? that would remove the need to call updateSnapshot (which feels more like a workaround - the level stuff should be transparent to the user of the api, it's an internal optimization mechanism)
@arnetheduck Yes I like this idea. I've updated the aristo layers to filter out any vertexes and hash keys that have a level lower than the base level. Would be good to get another review from you after these latest changes.
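The filter-on-access idea could be sketched roughly as follows in Python (a toy model, not the actual aristo layer code; `fetch` and `base_level` are illustrative names): when reading through snapshot layers, entries whose origin level is below the persisted base level are skipped, so stale pre-persist data can never shadow what is in the database.

```python
# Hypothetical sketch of filtering snapshot entries by origin level.

def fetch(snapshot, vid, base_level):
    """Return snapshot data for vid only if it did not originate below base."""
    entry = snapshot.get(vid)
    if entry is None:
        return None                 # not in memory; caller falls back to db
    data, origin_level = entry
    if origin_level < base_level:
        return None                 # stale: originated below the baseTxFrame
    return data

snapshot = {"v1": ("stale", 0), "v2": ("fresh", 2)}
assert fetch(snapshot, "v1", base_level=1) is None    # filtered out
assert fetch(snapshot, "v2", base_level=1) == "fresh" # still valid
```

The design point from the discussion: the level bookkeeping stays internal to the layer, so callers of the API never see (or accidentally persist) entries that predate the base.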
tests/test_aristo/test_tx_frame.nim
try:
  discard tx4.fetchStateRoot().get()
except AssertionDefect:
  # Should throw assertion defect here because tx4 is at lower level than
  # the baseTxFrame level.
  discard
Suggested change (replacing the try/except block above):

expect(Defect):
  discard tx4.fetchStateRoot()
tests/test_aristo/test_tx_frame.nim
block:
  let batch = db.putBegFn().expect("working batch")
  db.persist(batch, tx3) # after this the baseTxFrame is at level 4
4 or 3?
Typo, should be 3. Will update.
I wonder if it would make sense to set level to int.low or something similar in dispose (or .. maybe use 0 as an invalid level value and make sure level is offset on "valid" construction to avoid 0) - then we could check for that value in strategic places to detect use-after-dispose, which would be another source of bugs of this kind.
Not sure where to check though, such that it doesn't impact perf - rstack / get operations need as tight a loop as possible since they get absurd amounts of lookup traffic, but maybe someplace else would make sense.
Anyway, it's not necessary for this PR, but it could be added as another safeguard.
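The sentinel idea suggested here could look something like this toy Python sketch (all names - `DISPOSED`, `Frame`, `persist_check` - are made up for illustration; the actual code is Nim): dispose poisons the frame's level, and the check lives in persist, which already walks all entries, rather than in the hot get path.

```python
# Illustrative sketch of a use-after-dispose sentinel on the frame level.

DISPOSED = -(2**31)  # stand-in for int.low

class Frame:
    def __init__(self, level):
        self.level = level

    def dispose(self):
        # Poison the level so later misuse is detectable.
        self.level = DISPOSED

    def persist_check(self):
        # persist already examines all entries, so this check is cheap
        # compared to putting it in the per-lookup get path.
        assert self.level != DISPOSED, "use-after-dispose detected"

f = Frame(level=3)
f.dispose()
try:
    f.persist_check()
except AssertionError:
    print("caught use-after-dispose")
```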
Co-authored-by: Jacek Sieka <[email protected]>
# Update the snapshot before processing the block so that any vertexes in snapshots
# from lower levels than the baseTxFrame are removed from the snapshot before running
# the stateroot computation.
c.updateSnapshot(blk, txFrame)
@jangko Will moving this updateSnapshot call up above processBlock cause any issues that you are aware of? It seems ok in my testing.
if stopAtSnapshot and tx.snapshot.level.isSome():
  break
if stopAtBaseLevel and tx.level <= tx.db.baseTxFrame().level:
I can't think of any valid sequence of txframe iteration that would lead to this scenario - as such, this would probably be better off as an assert. The question though is whether this is a good place for it - in particular, the get operation runs very frequently, so I'm not convinced we want to burden it with the performance implications we'd see here - instead, we could focus on catching this in persist, which has to examine all entries anyway.
Good point, I agree that it shouldn't occur under normal conditions, but it probably could occur when dealing with multiple heads/forks: if base is updated/advanced forward on one chain, the txFrames on another chain would become invalid, so if they get used before getting cleaned up then it might happen.
If this did happen it would be a bug in our code and not something that we need to support. I agree that it should be an assert so that we can catch these types of issues, and considering that these code paths are hotspots, I'll go with my previous change.
This fix is related to this issue: #3699
I have found that when computing the stateroot it is possible for in-memory vertexes (those with a level lower than the baseTxFrame) to be written to the database, and this is very likely the cause of the database corruption. When computing the stateroot we should only write directly to the database if the vertex was just fetched from the database.
I confirmed that this scenario was occurring in the forked chain tests (but without the data corruption) when processing a single linear chain of blocks. The problem is basically that the snapshot data can contain vertexes older than base after the base is updated. It seems likely that when dealing with multiple branches/forks, this scenario could lead to database corruption.
With this change, when computing the stateroot we only allow writing vertexes to the database that were fetched directly from the database.
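The rule the PR description states can be summed up in a small Python sketch (illustrative only - `DB_LEVEL`, `maybe_write_to_db`, and the dict-as-database are stand-ins, not the actual implementation): during stateroot computation, a vertex is written back to the database only if it was itself just fetched from the database, never if it came from an in-memory snapshot.

```python
# Hypothetical sketch: gate db writes during stateroot computation on the
# source of the vertex data.

DB_LEVEL = -1  # stand-in marker meaning "this entry came from the database"

def maybe_write_to_db(db, vid, data, source_level):
    """Write only vertexes whose data was fetched directly from the db."""
    if source_level == DB_LEVEL:
        db[vid] = data    # safe: cannot overwrite newer persisted data
        return True
    return False          # in-memory (snapshot) vertex: keep it out of the db

db = {"v1": "persisted"}
assert maybe_write_to_db(db, "v1", "recomputed", DB_LEVEL) is True
assert maybe_write_to_db(db, "v2", "from-snapshot", 0) is False
assert "v2" not in db
```

This captures why returning dbLevel for stale snapshot entries (as discussed above) would defeat the fix: it would make the old vertex look database-sourced and let it be written again.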