@cjjdespres
Member

This PR is a revival of #17659. (I was going to reopen that one, but GitHub won't let me now that I've force-pushed to rebase the branch onto the latest compatible, which I did to fix the merge conflicts that had accumulated since I opened it.)

Explain your changes:

Instead of having a populate_root method to populate an existing, empty root ledger from genesis, the genesis ledger now has a create_root method. This method assumes the responsibility for creating the root ledger itself. This design allows the create_root method to checkpoint the genesis ledger directly when the genesis ledger is backed by a database, which is much faster than using the transfer_accounts method from the functor in ledger_transfer.ml. This should resolve #17570 as a result, because we no longer rehash the genesis ledger and genesis staking epoch ledger when bootstrapping from genesis.
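As a rough illustration of the difference between the two paths, here is a toy model using only the OCaml standard library. Every name in it (`genesis_ledger`, `transfer_accounts`, `checkpoint`, the `rehashed` counter) is a hypothetical stand-in rather than the real Mina API, and a `Hashtbl` stands in for a ledger database. It only aims to show why letting `create_root` build the root itself allows the database-backed case to skip per-account work.

```ocaml
(* Toy model: all names are hypothetical stand-ins, not the real Mina API. *)
type account = { id : int; balance : int }

(* A genesis ledger is either held in memory or backed by a database.
   A [Hashtbl] plays the role of the on-disk ledger database here. *)
type genesis_ledger =
  | In_memory of account list
  | Database of (int, account) Hashtbl.t

(* [rehashed] counts how many accounts were re-inserted one at a time. *)
type root = { accounts : (int, account) Hashtbl.t; rehashed : int }

(* Slow path: insert accounts one by one, doing per-account work. *)
let transfer_accounts accounts =
  let tbl = Hashtbl.create 16 in
  let rehashed =
    List.fold_left
      (fun n a -> Hashtbl.replace tbl a.id a; n + 1)
      0 accounts
  in
  { accounts = tbl; rehashed }

(* Fast path: copy the database wholesale, with no per-account work. *)
let checkpoint db = { accounts = Hashtbl.copy db; rehashed = 0 }

(* [create_root] owns root creation, so it can pick the fast path
   whenever the genesis ledger is database-backed. *)
let create_root = function
  | In_memory accounts -> Ok (transfer_accounts accounts)
  | Database db -> Ok (checkpoint db)
```

With a populate-style interface the caller hands over an already-created empty root, so the only generic option is the account-by-account transfer; moving creation into the method is what makes the checkpoint shortcut available.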

My reasoning for why we can skip this hashing:

  • The daemon already checks the SHA256 of the tar.gz file that it downloads from S3, and this is the only time it handles a genesis ledger database from an external source
  • The existing rehashing appears unintended, in that it's a side-effect of using the transfer_accounts from ledger_transfer.ml on the genesis ledger databases and not some explicit ledger database integrity check method
  • The daemon does not do this kind of explicit ledger database integrity checking elsewhere when loading ledger databases from the mina config directory. (It does do a faster ledger sync integrity check when that particular feature is enabled, but that's it).

It's possible that we still want this kind of strict, slow checking, but I'd argue that it should be explicit (not an accidental side-effect of the transfer_accounts method) and also that it should be written in such a way that the daemon does not pause for the whole duration of the check. Maybe it should also be optional and turned off by default, if we even want it.
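To make that suggestion a bit more concrete, here is a minimal sketch of what an explicit, opt-in, non-blocking check could look like. It is purely illustrative: `verify`, `step`, `check_state` and the `on_yield` callback are hypothetical names, nothing like this exists in the codebase as far as this PR shows, and MD5 from the stdlib `Digest` module merely stands in for the real ledger hashing.

```ocaml
(* Hypothetical sketch: none of these names exist in the Mina codebase. *)
type progress = { remaining : string list; acc : string }
type check_state = In_progress of progress | Done of string

(* Fold one item into the running digest. MD5 via the stdlib [Digest]
   module is used purely for illustration. *)
let absorb acc item = Digest.to_hex (Digest.string (acc ^ item))

(* Process at most [chunk] items, then hand control back to the caller. *)
let rec step ~chunk = function
  | Done _ as d -> d
  | In_progress { remaining = []; acc } -> Done acc
  | In_progress _ as s when chunk <= 0 -> s
  | In_progress { remaining = x :: rest; acc } ->
      step ~chunk:(chunk - 1)
        (In_progress { remaining = rest; acc = absorb acc x })

(* The check is off by default. When enabled, [on_yield] runs between
   chunks so the caller can service other work (e.g. network queries). *)
let verify ?(enabled = false) ~chunk ~on_yield items =
  if not enabled then None
  else
    let chunk = max 1 chunk in
    let rec loop state =
      match step ~chunk state with
      | Done digest -> Some digest
      | In_progress _ as s ->
          on_yield ();
          loop s
    in
    loop (In_progress { remaining = items; acc = "" })
```

In the daemon the yield point would presumably be an Async deferred rather than a plain callback; the point of the sketch is only that the check is requested explicitly and is split into bounded chunks.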

Also, it should mostly resolve an issue we've been seeing in the nightly rosetta tests; because the daemon was unresponsive during this rehashing, the initial best tip network query would always fail when trying to bootstrap from genesis while connecting to mainnet. This would cause the daemon to say it was synced while its best tip was at genesis and cause some very confusing behaviour in rosetta. The daemon would stay in this state for quite a number of minutes (delaying startup as a result) until it eventually reverted to Bootstrap and continued startup. This initial query failure should now no longer occur except in very specific poor networking conditions, hopefully.

Explain how you tested your changes:

The nightly tests should cover this change, especially my claim that it will resolve the rosetta nightly test failures.

I also added some log lines to indicate moments when the daemon was resetting to genesis and populating a root ledger with genesis data. I got this result when I synced a daemon to mainnet with an empty config directory:

2025-09-29 16:13:45 UTC [Debug] Resetting snarked_root in "/home/despresc/.mina-config/root" to genesis

2025-09-29 16:13:45 UTC [Debug] Finished resetting snarked_root genesis

So, this does seem to be significantly faster, as expected.

Checklist:

  • Dependency versions are unchanged
  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
  • Tests were added for the new behavior
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? List them

Instead of having a populate_root method to populate an existing, empty
root ledger from genesis, the genesis ledger now has a create_root
method. This method assumes the responsibility for creating the root
ledger itself. The create_root method can thus checkpoint the genesis
ledger directly when the genesis ledger is backed by a database.
@cjjdespres cjjdespres requested a review from a team as a code owner September 29, 2025 16:18
@cjjdespres cjjdespres force-pushed the cjjdespres/rework-genesis-population branch from 5f2699f to af98edb Compare September 29, 2025 16:19
@cjjdespres
Member Author

!ci-build-me

@cjjdespres
Member Author

!ci-nightly-me

@cjjdespres
Member Author

module Utils = struct
let populate_root_with_backing_root genesis_mask ~src ~dest =
let open Or_error.Let_syntax in
(* Create a new [Ledger.Root.t] ledger with the components of a root
Member

Suggested change
- (* Create a new [Ledger.Root.t] ledger with the components of a root
+ (** Create a new [Ledger.Root.t] ledger with the components of a root

(* Certain database initialization methods, e.g. creation from a checkpoint,
depend on the parent directory existing and the target directory _not_
existing. *)
let%bind () = Mina_stdlib_unix.File_system.remove_dir t.directory in
Member

I wonder if there is a specific reason to choose these two calls over Mina_stdlib_unix.File_system.create_dir ~clear_if_exists:true (a single call with the same semantics) or over Mina_stdlib_unix.File_system.rmrf t.directory; Core.Unix.mkdir_p t.directory; (same code, no async)?

Member Author

I will double-check. I kind of remember having a reason, but I might have been mistaken.

Member

I believe one API is buggy. We haven't fixed it or added a unit test for it.

@cjjdespres
Member Author

cjjdespres commented Sep 29, 2025

It looks like the rosetta docker image was built successfully, then there was some docker error at the end:

+ docker push gcr.io/o1labs-192920/mina-archive:3.3.0-alpha1-cjjdespres-rework-genesis-population-af98edb-bullseye-devnet
The push refers to repository [gcr.io/o1labs-192920/mina-archive]
bab5e0c94398: Pushed
34cb647cdf3b: Pushed
d3b3852f9c75: Pushed
bb437b897c9d: Pushed
8b12b97d539f: Pushed
0bb9f665a67e: Pushed
c251bdf6b590: Pushed
21e3a62c5ba4: Pushed
427a1be88273: Pushed
6e12a734996b: Layer already exists
3.3.0-alpha1-cjjdespres-rework-genesis-population-af98edb-bullseye-devnet: digest: sha256:154e1e0893f0db3ea05a116036d81471b923566065315883ee7f1cb89a2c5602 size: 2414
+ docker tag gcr.io/o1labs-192920/mina-archive:3.3.0-alpha1-cjjdespres-rework-genesis-population-af98edb-bullseye-devnet gcr.io/o1labs-192920/mina-archive:af98edb-bullseye-devnet
Error response from daemon: No such image: gcr.io/o1labs-192920/mina-archive:3.3.0-alpha1-cjjdespres-rework-genesis-population-af98edb-bullseye-devnet
🚨 Error: The command exited with status 1

I retried those failed components, and then got this in the devnet connectivity test:

2025-09-29 18:54:23 - 🔄 Current TPS: 3.16, 📈 Average TPS: 3.15, 📊 Total Requests: 1058
  📊 Memory Usage:
   - 🐘 PostgreSQL: 2547.68 MB
   - 📦 Mina-archive: 147.484 MB
   - 🌹 Mina-rosetta: 224.766 MB
❌ MEMORY THRESHOLD EXCEEDED: PostgreSQL using 2547.74 MB (threshold: 2500 MB)

I don't think that's related to this PR. I'm going to try one more time.

@cjjdespres
Member Author

cjjdespres commented Sep 29, 2025

It apparently failed again:

2025-09-29 19:18:20 - 🔄 Current TPS: 3.15, 📈 Average TPS: 3.15, 📊 Total Requests: 962
  📊 Memory Usage:
   - 🐘 PostgreSQL: 2531.34 MB
   - 📦 Mina-archive: 148.965 MB
   - 🌹 Mina-rosetta: 223.141 MB
❌ MEMORY THRESHOLD EXCEEDED: PostgreSQL using 2531.28 MB (threshold: 2500 MB)

Regardless, the failure in #17848 is no longer showing up, and it looks like the rosetta status is not going synced->bootstrap->synced any more. So I think this PR succeeds in that regard. (EDIT: I'm not sure the mainnet connectivity test is run in the ci-nightly-me suite, only the devnet one, so we might want to check and see if the nightlies start succeeding after this is merged. Never mind: the devnet version was also failing in the same way.)


I looked at the rosetta devnet connectivity test in the nightly CI runs, and we do seem to get close to this threshold already. For example, in this one from last night there was this reading:

2025-09-28 04:49:43 - 🔄 Current TPS: 2.13, 📈 Average TPS: 2.81, 📊 Total Requests: 726
  📊 Memory Usage:
   - 🐘 PostgreSQL: 2372.63 MB
   - 📦 Mina-archive: 132.75 MB
   - 🌹 Mina-rosetta: 222.914 MB

I'll look at a few more tests to see if we have gone over recently.

@cjjdespres
Member Author

The last retry of the devnet rosetta connectivity test in https://buildkite.com/o-1-labs-2/mina-end-to-end-nightlies/builds/3900 failed here:

+ echo -e '\033[0;31mTimeout waiting for new blocks after double upgrade. Test failed.\033[0m'
Timeout waiting for new blocks after double upgrade. Test failed.
+ exit 1
🚨 Error: The command exited with status 1
user command error: exit status 1

I think this was added two weeks ago in #17734, so it might not be fully stable in the first place?

Member

@glyh glyh left a comment

LGTM overall

create_root
~config:(Instance.Config.snarked_ledger t)
~depth:t.ledger_depth ()
|> Or_error.ok_exn
Member

Report an error message here, instead of a failure?
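A minimal sketch of the two options, using the stdlib `result` type in place of `Or_error`. The `create_root` below is a hypothetical stand-in that fails on a non-positive depth, and `with_root_exn` / `with_root` are illustrative names, not functions from this PR:

```ocaml
(* Hypothetical stand-in for a fallible [create_root]. *)
let create_root ~depth =
  if depth > 0 then Ok (Printf.sprintf "root(depth=%d)" depth)
  else Error "ledger depth must be positive"

(* Raising variant, analogous to piping through [Or_error.ok_exn]. *)
let with_root_exn ~depth f =
  match create_root ~depth with
  | Ok root -> f root
  | Error msg -> failwith msg

(* Propagating variant: attach context and return the error to the
   caller, which can then log it or decide how to recover. *)
let with_root ~depth f =
  match create_root ~depth with
  | Ok root -> Ok (f root)
  | Error msg -> Error ("could not create snarked root: " ^ msg)
```

The propagating variant changes the return type, so every caller would need to handle the `Error` case; whether that is worth it here depends on whether any caller can do something useful with the error.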

Instance.close instance ; x

let reset_to_genesis_exn t ~precomputed_values =
(** Clear the factory directory and recreate the snarked ledger instance for
Member

I suggest moving this doc comment to the interface instead.

@glyh
Member

glyh commented Sep 30, 2025

@cjjdespres

  • The double-upgrade failure was very common when I tested yesterday. Yes, it was added recently.
  • Error response from daemon: No such image: gcr.io/o1labs-192920/mina-archive:3.3.0-alpha1-cjjdespres-rework-genesis-population-af98edb-bullseye-devnet: this is common because the buildkite jobs don't have their dependency relations sorted out correctly. rosetta should depend on both daemon and archive, but that is not specified in our pipeline.

This, however, is a bit concerning:

❌ MEMORY THRESHOLD EXCEEDED: PostgreSQL using 2531.28 MB (threshold: 2500 MB)

This test was added recently, when Darsiuz attempted to fix a memory leak.

@cjjdespres cjjdespres merged commit 0367fc0 into compatible Sep 30, 2025
55 checks passed
@cjjdespres cjjdespres deleted the cjjdespres/rework-genesis-population branch September 30, 2025 02:39
@glyh
Member

glyh commented Sep 30, 2025

!ci-docker-me

@glyh
Member

glyh commented Sep 30, 2025

Development

Successfully merging this pull request may close these issues.

We appear to re-hash the genesis ledger when resetting the snarked root

4 participants