Conversation

bhartnett (Contributor) commented Sep 23, 2025

This fix is related to this issue: #3699

I have found that when computing the stateroot it is possible for in-memory vertexes (at a level lower than the baseTxFrame) to be written to the database, and this is very likely the cause of the database corruption. When computing the stateroot, we should only write directly to the database if the vertex was just fetched from the database.

I confirmed that in the forked chain tests this scenario was occurring (though without the data corruption) when processing a single linear chain of blocks. The problem is basically that snapshot data can contain vertexes older than base after the base is updated. It seems likely that when dealing with multiple branches/forks, this scenario could lead to database corruption.

With this change, when computing the stateroot we only allow writing vertexes to the database that were fetched directly from the database.
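The rule described above can be modeled in a short sketch. This is hypothetical Python, not the Nim/Aristo API: `TxFrame`, `writable_during_stateroot`, and the snapshot layout are invented names, standing in for the idea that only entries at or above the persisted base (database) level may be written back during the stateroot computation.

```python
# Hypothetical model of the invariant this PR enforces; all names are
# illustrative and do not come from the Aristo codebase.

class TxFrame:
    def __init__(self, level, parent=None):
        self.level = level
        self.parent = parent
        # snapshot maps vertex-id -> (payload, level the entry originated at)
        self.snapshot = {}

def writable_during_stateroot(entry_level, base_level):
    """Only entries at or above the base (database) level may be written
    back to the database while computing the state root."""
    return entry_level >= base_level

base_level = 3  # level already persisted to the database
frame = TxFrame(level=5)
frame.snapshot = {
    "vtx_a": ("payload_a", 2),  # stale: predates the persisted base
    "vtx_b": ("payload_b", 5),  # fresh: above the base
}

# Filter the snapshot, as the fix does, before the stateroot computation.
frame.snapshot = {k: v for k, v in frame.snapshot.items()
                  if writable_during_stateroot(v[1], base_level)}
```

With this guard in place, only `vtx_b` survives into the write path; the stale `vtx_a` can no longer clobber newer data already in the database.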

```nim
txFrame = parentFrame.txFrameBegin

# Update the snapshot before processing the block so that any vertexes in snapshots
# from lower levels than the baseTxFrame are removed from the snapshot before running
# the stateroot computation.
```
arnetheduck (Member) commented:

this comment looks somewhat concerning - there should be no vertices from below baseTxFrame in memory at any point. When we persist, we move everything below baseTxFrame to the on-disk database and empty the baseTxFrame, breaking its "link" with the txFrame below. In fact, once we've performed persist, any "levels" below baseTxFrame are no longer connected and should not be reachable - so how come they're still being added to?

bhartnett (Contributor, author) replied:

The data in each snapshot contains a level (integer). Every time we create a snapshot on a txFrame, the snapshot is moved from the older frame, but the level at which each vertex originated is kept in the snapshot. So even after we persist and clean up all the lower-level txFrames (as you've mentioned), the vertex data from the lower level can remain in the snapshot, which is part of a newer (not yet persisted) txFrame.
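To illustrate why stale entries survive persist, here is a toy Python model; the names and data layout are invented for illustration and are not taken from Aristo:

```python
# Toy model of how snapshot entries keep the level they originated at when
# a snapshot is moved to a newer txFrame, so persisting the lower frames
# leaves stale level tags behind.

snapshot = {}  # vertex-id -> (payload, origin level)

def record(vertex, payload, level):
    snapshot[vertex] = (payload, level)

record("v1", "old", level=2)  # written while frame level 2 was live
record("v2", "new", level=6)  # written at level 6

# Persist everything up to level 4: the frames below are freed, but the
# snapshot (now owned by a newer, unpersisted frame) is not rewritten.
base_level = 4
after = dict(snapshot)

# Entries tagged below the new base are exactly the stale ones this PR
# filters out before the stateroot computation.
stale = [v for v, (_, lvl) in after.items() if lvl < base_level]
```

The `v1` entry remains in the snapshot with its old tag even though level 2 has long since been persisted and its frame discarded.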

bhartnett (Contributor, author) commented Sep 23, 2025:

Note that in this scenario (when importing blocks via forked chain sync) the txFrame that gets persisted (during updateBase) lags behind the txFrame at the head which holds the snapshot. In other words, we are not persisting the txFrame holding the snapshot but rather an older txFrame with a lower level.

arnetheduck (Member) replied:

Ah, I think I see the problem - this used to be cleared out by persist but I guess the way snapshots are being built now influences it.

in that case, shouldn't this be fixable by comparing the level of the snapshot with the basetxframe level and discarding on access if it's lower? that would remove the need to call updateSnapshot (which feels more like a workaround - the level stuff should be transparent to the user of the api, it's an internal optimization mechanism)

arnetheduck (Member) added:

or maybe not even discarding, but returning "dbLevel" in that scenario... i.e. if a snapshot was written to the database and there's no "newer" information in higher levels, it seems to me it should still be valid? What am I missing?

bhartnett (Contributor, author) replied:

Well, the main point of this change is that during the stateroot computation we shouldn't be writing vertexes to the db that are older than (at a lower level than) the database level. My reasoning is that it could cause the db corruption if the vertex being written happens to be older than what is already in the database. If we never deal with multiple branches/forks then this is probably safe, because the full history of writes would be in the snapshot, but in practice, when dealing with multiple branches and reorgs, we have a dag of txFrames and multiple distinct snapshots. If I return dbLevel, then the older vertex would get written to the db as was done before this change, so there's no point in doing that.

bhartnett (Contributor, author) replied:

> in that case, shouldn't this be fixable by comparing the level of the snapshot with the basetxframe level and discarding on access if it's lower? that would remove the need to call updateSnapshot (which feels more like a workaround - the level stuff should be transparent to the user of the api, it's an internal optimization mechanism)

@arnetheduck Yes, I like this idea. I've updated the aristo layers to filter out any vertexes and hash keys that have a level lower than the base level. Would be good to get another review from you after these latest changes.
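A minimal sketch of the filter-on-access idea, in Python with invented names (`layer_get`, plain dicts standing in for the layers and the database): a snapshot hit tagged below the base level is ignored and the lookup falls through to the database.

```python
# Hedged sketch of "filter on access": snapshot entries below the
# persisted base level are treated as absent. Names are illustrative,
# not the Aristo API.

def layer_get(snapshot, key, base_level, db):
    hit = snapshot.get(key)
    if hit is not None:
        payload, level = hit
        if level >= base_level:
            return payload
        # Entry predates the persisted base: ignore it.
    return db.get(key)

db = {"k": "db-value"}
snap = {"k": ("stale-value", 1)}

# The stale snapshot entry (level 1 < base 3) is skipped; the database
# value wins.
resolved = layer_get(snap, "k", base_level=3, db=db)
```

The same lookup with a base level at or below the entry's tag would return the snapshot value, so no valid in-memory data is lost - only entries already superseded by the persisted base are dropped.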

Comment on lines 269 to 274
```nim
try:
  discard tx4.fetchStateRoot().get()
except AssertionDefect:
  # Should throw assertion defect here because tx4 is at lower level than
  # the baseTxFrame level.
  discard
```
arnetheduck (Member) commented:

Suggested change:

```nim
expect(Defect):
  discard tx4.fetchStateRoot()
```


```nim
block:
  let batch = db.putBegFn().expect("working batch")
  db.persist(batch, tx3) # after this the baseTxFrame is at level 4
```
arnetheduck (Member) commented:

4 or 3?

bhartnett (Contributor, author) replied:

Typo, should be 3. Will update.

arnetheduck (Member) left a comment:

I wonder if it would make sense to set level to int.low or something similar in dispose (or .. maybe use 0 as an invalid level value and make sure level is offset on "valid" construction to avoid 0) - then we could check for that value in strategic places to detect use-after-dispose, which would be another source of bugs of this kind.

Not sure where to check though, such that it doesn't impact perf - rstack / get operations need as tight a loop as possible since they get absurd amounts of lookup traffic, but maybe someplace else would make sense.

Anyway, it's not necessarily for this PR, but it could be added as another safeguard.
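The safeguard suggested above could look roughly like this Python sketch. The names, the sentinel choice, and where the check lives are all assumptions for illustration, not the actual codebase: dispose stamps an invalid level, and a cheap check outside the hot get path detects use-after-dispose.

```python
# Sketch of the suggested safeguard: reserve an invalid level value
# (0 here, with valid levels offset to avoid it) that dispose() writes,
# so later use of the frame can be detected.

INVALID_LEVEL = 0

class TxFrame:
    def __init__(self, level):
        # "Valid" construction must never use the sentinel value.
        assert level != INVALID_LEVEL, "valid frames use offset levels"
        self.level = level

    def dispose(self):
        self.level = INVALID_LEVEL

    def check_live(self):
        # Placed outside hot lookup loops (e.g. in persist, which scans
        # all entries anyway) to keep the cost off the get path.
        assert self.level != INVALID_LEVEL, "use after dispose"

frame = TxFrame(level=5)
frame.dispose()
try:
    frame.check_live()
    caught = False
except AssertionError:
    caught = True
```

The point of the sentinel is that the check is a single comparison, so it can be sprinkled into strategic (non-hot) places without measurable cost.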

```nim
# Update the snapshot before processing the block so that any vertexes in snapshots
# from lower levels than the baseTxFrame are removed from the snapshot before running
# the stateroot computation.
c.updateSnapshot(blk, txFrame)
```
bhartnett (Contributor, author) commented:

@jangko Will moving this updateSnapshot call up above processBlock cause any issues that you are aware of? It seems ok in my testing.


```nim
if stopAtSnapshot and tx.snapshot.level.isSome():
  break
if stopAtBaseLevel and tx.level <= tx.db.baseTxFrame().level:
```
arnetheduck (Member) commented:

I can't think of any valid sequence of txFrame iteration that would lead to this scenario - as such, this would probably be better off as an assert. The question though is whether this is a good place for it - in particular, the get operation runs very frequently, so I'm not convinced we want to burden it with the performance implications we'd see here. Instead, we could focus on catching this in persist, which has to examine all entries anyway.

bhartnett (Contributor, author) replied:

Good point, I agree that it shouldn't occur under normal conditions. It probably could occur when dealing with multiple heads/forks though: if base is updated/advanced forward on one chain, the txFrames on another chain would become invalid, so if they get used before getting cleaned up then it might happen.

If this did happen it would be a bug in our code and not something that we need to support. I agree that it should be an assert so that we can catch these types of issues, though considering that these code paths are hotspots, I'll go with my previous change.

bhartnett merged commit 814810e into master on Sep 24, 2025 (20 checks passed).
bhartnett deleted the fix-corrupted-state branch on September 24, 2025 at 13:22.