
Daemon can crash on subsequent startup if bootstrap epoch ledger sync is interrupted #17847

@cjjdespres

Description

Tested with a recent `compatible` build, so it's still an issue, and with a build based on release 3.1.0 (commit a35e440), so it's not caused by the recent changes we've been making to the epoch snapshot code. In both cases I did a full reproduction (both the interrupted sync and the subsequent startup) with a single build.

If a daemon is set to sync to devnet (for instance) with an empty config directory, it will need to download epoch ledger snapshots from the network. If the daemon shuts down relatively soon into this process (even a controlled shutdown with `mina client stop` will trigger it), the database is left in an inconsistent state, and this crash happens on the next startup:

```
2025-09-22 15:19:34 UTC [Info] Loading epoch ledger from disk: $location
  location: "/home/despresc/.mina-config/epoch_ledger3e3cd614-c37c-339f-e574-433c7db8cc2c"
2025-09-22 15:19:34 UTC [Error] Could not send error report: Node_error_service was not configured

2025-09-22 15:19:34 UTC [Error] Could not send error report: Node_error_service was not configured

2025-09-22 15:19:34 UTC [Fatal] Unhandled top-level exception: $exn
Generating crash report
  exn: {
  "sexp": [
    "monitor.ml.Error",
    "Option.value_exn None",
    [
      "Raised at Base__Error.raise in file \"src/error.ml\" (inlined), line 8, characters 14-30",
      "Called from Base__Option.value_exn in file \"src/option.ml\", line 136, characters 4-21",
      "Called from Merkle_ledger__Database.Make.iteri.(fun) in file \"src/lib/merkle_ledger/database.ml\", line 580, characters 42-64",
      "Called from Base__Sequence.iter.loop in file \"src/sequence.ml\", line 357, characters 6-9",
      "Called from Consensus__Proof_of_stake.Make_str.compute_delegatee_table in file \"src/lib/consensus/proof_of_stake.ml\", line 62, characters 4-623",
      "Called from Base__Result.try_with in file \"src/result.ml\", line 195, characters 9-15",
      "Caught by monitor coda"
    ]
  ],
  "backtrace": [
    "Raised at Stdlib.failwith in file \"stdlib.ml\", line 29, characters 17-33",
    "Called from O1trace.exec_thread in file \"src/lib/o1trace/o1trace.ml\", line 82, characters 6-27",
    "Called from Consensus__Proof_of_stake.Make_str.Data.Local_state.create.(fun) in file \"src/lib/consensus/proof_of_stake.ml\", line 546, characters 18-130",
    "Called from Mina_cli_entrypoint.setup_daemon.(fun).mina_initialization_deferred.(fun) in file \"src/app/cli/src/cli_entrypoint/mina_cli_entrypoint.ml\", line 1146, characters 12-652",
    "Called from Async_kernel__Deferred0.bind.(fun) in file \"src/deferred0.ml\", line 54, characters 64-69",
    "Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5",
    "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"
  ]
}
```

I'm reasonably confident that this is failing because the database is in the state mentioned in #17819: the epoch ledgers have been partially synced to the network, so there are gaps in the account data relative to what `Db.num_accounts` implies.
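To illustrate the suspected failure mode, here is a toy sketch (not Mina code, and the names are invented for illustration): the on-disk metadata claims more accounts than were actually written before the sync was interrupted, so an iteration that unwraps every lookup, as `Merkle_ledger.Database.iteri` does with `Option.value_exn`, blows up at the first gap.

```ocaml
(* Toy model of a partially synced database: only accounts 0 and 1
   were written before shutdown, but the metadata says there are 4. *)
let accounts : (int, string) Hashtbl.t = Hashtbl.create 8

let () =
  Hashtbl.replace accounts 0 "acct0";
  Hashtbl.replace accounts 1 "acct1"

(* What the on-disk metadata claims after the interrupted sync. *)
let num_accounts = 4

let iteri_accounts ~f =
  for i = 0 to num_accounts - 1 do
    match Hashtbl.find_opt accounts i with
    | Some a -> f i a
    | None ->
      (* Stands in for [Option.value_exn None] in database.ml. *)
      failwith
        (Printf.sprintf "account %d missing, but num_accounts = %d" i
           num_accounts)
  done

let () =
  try iteri_accounts ~f:(fun i a -> Printf.printf "%d: %s\n" i a)
  with Failure msg -> Printf.printf "crash: %s\n" msg
```

The real crash is the same shape: `compute_delegatee_table` iterates the epoch ledger database up to the recorded account count and hits a missing entry.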

The code that does this lives in `sync_local_state` in `proof_of_stake.ml`. Specifically, I believe it's the sync protocol implemented in `Mina_ledger.Sync_ledger.Root.create` that is being interrupted. (For 3.1.0 this was `Mina_ledger.Sync_ledger.Db.create`.)

Other than the crash itself, my other (vaguer) worry is that the crash implies the daemon has loaded a database that isn't completely synced and is trying to operate on it as if it were.

This might also affect the snarked root; I haven't tested interrupting that sync yet. That ledger also seems to be synced in place, so it might be susceptible to the same kind of bug. However, it's possible that we never examine the snarked root database during bootstrap the way we examine the epoch snapshots; in practice we might just throw away the incompletely synced snarked root, or resume the incomplete sync.


Possible fixes, off the top of my head (not all of which necessarily need to be done):

  1. Don't sync the epoch ledger snapshots (or, probably, the snarked root) in place. Instead, sync them into a temporary directory inside the mina config directory, then `Sys.rename` them into their final location once the sync is done. We only sync each ledger once, in one place each, so the code surrounding the sync could probably be modified this way without too much trouble.
  2. Add some kind of flag to the database indicating its status, so we can explicitly mark it as "sync in progress". The flag would be set just as we start syncing and removed at the end, by which point the database should be fully consistent.
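Option 1 could look roughly like the sketch below. All of the names here (`sync_then_install`, its arguments, the `.syncing` suffix) are hypothetical, not Mina's actual API; the point is just the download-to-temp-then-rename shape.

```ocaml
(* Hypothetical sketch of option 1: sync into a temporary directory
   under the config dir, then rename it into place only after the sync
   completes.  A crash mid-sync leaves behind only the ".syncing"
   directory, which is safe to discard on the next startup. *)
let sync_then_install ~config_dir ~ledger_name ~(sync : string -> unit) =
  let final_dir = Filename.concat config_dir ledger_name in
  let tmp_dir = Filename.concat config_dir (ledger_name ^ ".syncing") in
  (* Throw away any leftover from a previously interrupted sync. *)
  if Sys.file_exists tmp_dir then
    ignore (Sys.command (Printf.sprintf "rm -rf %s" (Filename.quote tmp_dir)));
  Sys.mkdir tmp_dir 0o755;
  (* Run the (interruptible) sync against the temp location. *)
  sync tmp_dir;
  (* rename(2) is atomic within a filesystem, so the final directory
     either doesn't exist or holds a fully synced ledger. *)
  Sys.rename tmp_dir final_dir
```

Since the temp directory lives inside the config directory, the rename stays on one filesystem and so keeps its atomicity; putting it in `/tmp` would risk a cross-device rename failure.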
