-
Notifications
You must be signed in to change notification settings - Fork 586
Rework, speed up root ledger population from genesis #17874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Instead of having a populate_root method to populate an existing, empty root ledger from genesis, the genesis ledger now has a create_root method. This method assumes the responsibility for creating the root ledger itself. The create_root method can thus checkpoint the genesis ledger directly when the genesis ledger is backed by a database.
5f2699f to
af98edb
Compare
|
!ci-build-me |
|
!ci-nightly-me |
|
Currently waiting on https://buildkite.com/o-1-labs-2/mina-end-to-end-nightlies/builds/3900 |
| module Utils = struct | ||
| let populate_root_with_backing_root genesis_mask ~src ~dest = | ||
| let open Or_error.Let_syntax in | ||
| (* Create a new [Ledger.Root.t] ledger with the components of a root |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| (* Create a new [Ledger.Root.t] ledger with the components of a root | |
| (** Create a new [Ledger.Root.t] ledger with the components of a root |
| (* Certain database initialization methods, e.g. creation from a checkpoint, | ||
| depend on the parent directory existing and the target directory _not_ | ||
| existing. *) | ||
| let%bind () = Mina_stdlib_unix.File_system.remove_dir t.directory in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is specific reason to chose these two to Mina_stdlib_unix.File_system.create_dir ~clear_if_exists:true (a single call with same semantics) or to Mina_stdlib_unix.File_system.rmrf t.directory; Core.Unix.mkdir_p t.directory; (same code, no async)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will double-check. I kind of remember having a reason, but I might have been mistaken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe one API is buggy. We haven't fix it and add unit test for it.
|
It looks like the rosetta docker image was built successfully, then there was some docker error at the end: I retried those failed components, and then got this in the devnet connectivity test: I don't think that's related to this PR. I'm going to try one more time. |
|
It apparently failed again: Regardless, the failure in #17848 is no longer showing up, and it looks like the rosetta status is not going synced->bootstrap->synced any more. So I think this PR succeeds in that regard. ( I looked at the rosetta devnet connectivity test in the nightly CI runs, and we do seem to get close to this threshold already. For example, in this one from last night there was this reading: I'll look at a few more tests to see if we have gone over recently. |
|
The last retry of the devnet rosetta connectivity test in https://buildkite.com/o-1-labs-2/mina-end-to-end-nightlies/builds/3900 failed here: I think this was added two weeks ago in #17734, so it might not be fully stable in the first place? |
glyh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall
| create_root | ||
| ~config:(Instance.Config.snarked_ledger t) | ||
| ~depth:t.ledger_depth () | ||
| |> Or_error.ok_exn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Report an error message here, instead of a failure?
| Instance.close instance ; x | ||
|
|
||
| let reset_to_genesis_exn t ~precomputed_values = | ||
| (** Clear the factory directory and recreate the snarked ledger instance for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to move this doc comment to interface instead.
| (* Certain database initialization methods, e.g. creation from a checkpoint, | ||
| depend on the parent directory existing and the target directory _not_ | ||
| existing. *) | ||
| let%bind () = Mina_stdlib_unix.File_system.remove_dir t.directory in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe one API is buggy. We haven't fix it and add unit test for it.
This: However, is a bit concerning. This test is added recently where Darsiuz attempted to fix memleak. |
|
!ci-docker-me |
This PR is a revival of #17659. (I was going to reopen that one, but github won't allow me to do that now that I've force-pushed to rebase the branch onto the latest
compatibleto fix the merge conflicts that have accumulated since I opened that one).Explain your changes:
Instead of having a
populate_rootmethod to populate an existing, empty root ledger from genesis, the genesis ledger now has acreate_rootmethod. This method assumes the responsibility for creating the root ledger itself. This design allows thecreate_rootmethod to checkpoint the genesis ledger directly when the genesis ledger is backed by a database, which is much faster than using thetransfer_accountsmethod from the functor inledger_transfer.ml. This should resolve #17570 as a result, because we no longer rehash the genesis ledger and genesis staking epoch ledger when bootstrapping from genesis.My reasoning for why we can skip this hashing:
tar.gzfile that it downloads from S3, and this is the only time it handles a genesis ledger database from an external sourcetransfer_accountsfromledger_transfer.mlon the genesis ledger databases and not some explicit ledger database integrity check methodIt's possible that we still want this kind of strict, slow checking, but I'd argue that it should be explicit (not an accidental side-effect of the
transfer_accountsmethod) and also that it should be written in such a way that the daemon does not pause for the whole duration of the check. Maybe it should also optional and turned off by default, if we even want it.Also, it should mostly resolve an issue we've been seeing in the nightly rosetta tests; because the daemon was unresponsive during this rehashing, the initial best tip network query would always fail when trying to bootstrap from genesis while connecting to mainnet. This would cause the daemon to say it was synced while its best tip was at genesis and cause some very confusing behaviour in rosetta. The daemon would stay in this state for quite a number of minutes (delaying startup as a result) until it eventually reverted to Bootstrap and continued startup. This initial query failure should now no longer occur except in very specific poor networking conditions, hopefully.
Explain how you tested your changes:
The nightly tests should cover this change, especially my claim that it will resolve the rosetta nightly test failures.
I also added some log lines to indicate moments when the daemon was resetting to genesis and populating a root ledger with genesis data. I got this result when I synced a daemon to mainnet with an empty config directory:
So, this does seem to be significantly faster, as expected.
Checklist: