-
Notifications
You must be signed in to change notification settings - Fork 76
Create the storage abstraction for the new forest #700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
6f31889 to
042b935
Compare
265ab72 to
84056a0
Compare
84056a0 to
5556cd8
Compare
| /// All methods are intended to handle potential storage errors by returning the trait's [`Result`] | ||
| /// type. Suggested errors are documented for each method, but these may be changed by any concrete | ||
| /// implementation of the trait. | ||
| pub trait Storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we are overcomplicating this interface quite a bit. Specifically, I think we should be able to leave it up to the specific storage implementation to decide how to manage in-memory vs. on-disk data and how to implement transaction boundaries.
I'd rather make the storage interface simpler and move the complexity into the storage implementations. There are very few things that I think we need from the storage:
- Batch mutation: given a tree root and a list of updates (both insertions and removals), update the underlying tree and return
MutationSetwith inverse mutations. This could be equivalent to a combination ofLargeSmt::compute_mutations()andLargeSmt::apply_mutations_with_reversion(). - Lookups: ability to get an
SmtProoffor a given key in a given tree - basically equivalent toLargeSmt::open().
There may be other utility methods that could be required (e.g., getting all tree roots or something similar), but these would be the main ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm misunderstanding your point here, but that really seems to me like effectively pushing all of the logic for the forest itself into the storage. I was under the impression that we wanted to keep storage dumb as rocks, where providing these operations requires the storage to be more complex, which goes against our previous discussions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "storage" is not the right term here. Basically, at the SmtForest level, I'd like to end up with two high-level components:
- In-memory histories - i.e.,
histories: Map<Word, History> - Latest versions of SMTs. This component would have the interface similar I described above.
Maybe there would be another layer within the second component that makes it customizable - but we could also decide on the interface for this later. For example, the simple in-memory implementation of the second component could be:
pub struct InMemorySmtStorage {
maps: Map<Word, Smt>,
}The RocksDB-based one would look quite different of course - but I'm not sure we need to re-invent much for the in-memory approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will summarise the discussion on Slack here when we get somewhere, but my instinct here is that this separation will reduce the available scope for data-parallelism, which is something that the future persistence will rely on heavily.
More to come...
|
|
||
| /// The backing storage for the large SMT forest, providing the necessary methods for reading | ||
| /// from and writing to an arbitrary storage structure while allowing the forest itself to be | ||
| /// storage agnostic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this interface could be simplified by pushing more logic into the forest and keeping the storage trait minimal.
For comparison, the minibytes crate from Sapling (https://github.com/facebook/sapling/blob/13b6958fbbf863cb4b08d09d158caff95e834518/eden/scm/lib/minibytes/src/lib.rs) uses a tiny BytesOwner trait that's basically just AsRef<[u8]> + Send + Sync + 'static. All the slicing, cloning, and lifetime logic lives in the Bytes wrapper type, not the trait. The owner just provides raw bytes.
That pattern seems to align with what @bobbinth is suggesting above: let the storage be a simple data store, and let the forest handle tree semantics. The existing SmtStorage trait in large/storage/mod.rs does something similar with its apply(StorageUpdates) method, it takes a batch of changes rather than exposing fine-grained primitives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately treating storage simply as bytes is not sufficient here. Storage has to have some knowledge of what operations are required to be able to efficiently perform access patterns for performance.
I want to keep storage as stupid as possible, and indeed I think what I have proposed here is that. It doesn't perform any actual tree operations—which is why the equivalent to apply cannot exist and we use transactions instead—but it provides semantically-relevant operations to allow the actual forest to perform the tree operations in a performant way.
You point to SmtStorage, but it also provides a whole host of finer-grained methods that are used by LargeSmt where relevant.
| /// - [`StorageError::Leaf`] if a malformed leaf is found during the query. | ||
| fn entry_count(&self) -> Result<Word>; | ||
|
|
||
| /// Sets the number of **unique entries** that are currently stored across all leaf nodes for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use usize instead of Word? Everywhere else in the codebase entry_count is a usize:
large/storage/mod.rs:49:fn entry_count(&self) -> Result<usize, StorageError>large/mod.rs:286:entry_count: usizelarge/batch_ops.rs: usesisizefor deltas
A Word is 256 bits which seems like overkill for a count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I picked Word because technically there can be usize.
| /// did not allocate to the caller. | ||
| fn commit(&self, handle: Self::TransactionHandle) -> Result<()>; | ||
|
|
||
| /// Gets a handle to the storage as if it only consists of the data for the tree with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there intentionally no rollback or abort method? The docs mention that backends supporting transactions should guard operations on there being an active transaction, but if something fails between begin() and commit(), how does the caller signal that the transaction should be abandoned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mostly considering data-consistency for persistent data stores in the case of crashes, rather than any kind of transaction failure that could occur at runtime. Can you explain the kind of scenarios that you are envisioning?
6ada5f6 to
1571406
Compare
This commit contains the `Storage` and `StoredTreeHandle` traits that encapsulate the operations that must be provided by a storage backend. They are designed so that the backend provides simple data operations, while the actual logic resides in the forest itself. This gives possible implementations enough leeway to be efficient while also providing the necessary functionality. With no concrete implementations included in this commit, this serves primarily as documentation of the design space. The first concrete implementation of `Storage` and `StoredTreeHandle` will come in the commit that contains the `SmtForest` type itself.
1571406 to
416bd71
Compare
|
Closing in favour of an upcoming PR with a different approach. |
Describe your changes
This commit contains the
StorageandStoredTreeHandletraits that encapsulate the operations that must be provided by a storage backend. They are designed so that the backend provides simple data operations, while the actual logic resides in the forest itself. This gives possible implementations enough leeway to be efficient while also providing the necessary functionality.With no concrete implementations included in this commit, this serves primarily as documentation of the design space. The first concrete implementation of
StorageandStoredTreeHandlewill come in the commit that contains theSmtForesttype itself.A few places in the current code are marked as
#[allow(dead_code)]. This is to prevent warnings from failing the build while still allowing the private code to be merged incrementally.I also argue that, as this is internal code, it does not require a changelog entry. Please let me know if you disagree.
Checklist before requesting a review
nextaccording to naming convention.