
Daemon can crash on subsequent startup if bootstrap epoch ledger sync is interrupted #17847

@cjjdespres

Description

Tested with a recent `compatible` build, so it's still an issue, and with a build based on release 3.1.0 (commit a35e440), so it's not caused by the recent changes we've been making to the epoch snapshot code. In both cases I did a full reproduction (both the interrupted sync and the subsequent startup) with a single build.

If a daemon is set to sync to devnet (for instance) with an empty config directory, it will need to download epoch ledger snapshots from the network. If the daemon shuts down relatively soon into this process (even a controlled shutdown with `mina client stop` will trigger it), the database is left in an inconsistent state, and this crash happens on the next startup:

```
2025-09-22 15:19:34 UTC [Info] Loading epoch ledger from disk: $location
  location: "/home/despresc/.mina-config/epoch_ledger3e3cd614-c37c-339f-e574-433c7db8cc2c"
2025-09-22 15:19:34 UTC [Error] Could not send error report: Node_error_service was not configured

2025-09-22 15:19:34 UTC [Error] Could not send error report: Node_error_service was not configured

2025-09-22 15:19:34 UTC [Fatal] Unhandled top-level exception: $exn
Generating crash report
  exn: {
  "sexp": [
    "monitor.ml.Error",
    "Option.value_exn None",
    [
      "Raised at Base__Error.raise in file \"src/error.ml\" (inlined), line 8, characters 14-30",
      "Called from Base__Option.value_exn in file \"src/option.ml\", line 136, characters 4-21",
      "Called from Merkle_ledger__Database.Make.iteri.(fun) in file \"src/lib/merkle_ledger/database.ml\", line 580, characters 42-64",
      "Called from Base__Sequence.iter.loop in file \"src/sequence.ml\", line 357, characters 6-9",
      "Called from Consensus__Proof_of_stake.Make_str.compute_delegatee_table in file \"src/lib/consensus/proof_of_stake.ml\", line 62, characters 4-623",
      "Called from Base__Result.try_with in file \"src/result.ml\", line 195, characters 9-15",
      "Caught by monitor coda"
    ]
  ],
  "backtrace": [
    "Raised at Stdlib.failwith in file \"stdlib.ml\", line 29, characters 17-33",
    "Called from O1trace.exec_thread in file \"src/lib/o1trace/o1trace.ml\", line 82, characters 6-27",
    "Called from Consensus__Proof_of_stake.Make_str.Data.Local_state.create.(fun) in file \"src/lib/consensus/proof_of_stake.ml\", line 546, characters 18-130",
    "Called from Mina_cli_entrypoint.setup_daemon.(fun).mina_initialization_deferred.(fun) in file \"src/app/cli/src/cli_entrypoint/mina_cli_entrypoint.ml\", line 1146, characters 12-652",
    "Called from Async_kernel__Deferred0.bind.(fun) in file \"src/deferred0.ml\", line 54, characters 64-69",
    "Called from Async_kernel__Job_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5",
    "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47"
  ]
}
```

I'm reasonably confident that this is failing because the database is in the state mentioned in #17819: the epoch ledgers have been partially synced to the network, so there are gaps in the account data relative to what `Db.num_accounts` implies.
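To illustrate the suspected failure mode, here is a toy sketch (not Mina code, and the names are invented for illustration): the on-disk metadata claims more accounts than were actually written before the sync was interrupted, so an iteration that unwraps every lookup, as `Merkle_ledger.Database.iteri` does with `Option.value_exn`, blows up at the first gap.

```ocaml
(* Toy model of a partially synced database: only accounts 0 and 1
   were written before shutdown, but the metadata says there are 4. *)
let accounts : (int, string) Hashtbl.t = Hashtbl.create 8

let () =
  Hashtbl.replace accounts 0 "acct0";
  Hashtbl.replace accounts 1 "acct1"

(* What the on-disk metadata claims after the interrupted sync. *)
let num_accounts = 4

let iteri_accounts ~f =
  for i = 0 to num_accounts - 1 do
    match Hashtbl.find_opt accounts i with
    | Some a -> f i a
    | None ->
      (* Stands in for [Option.value_exn None] in database.ml. *)
      failwith
        (Printf.sprintf "account %d missing, but num_accounts = %d" i
           num_accounts)
  done

let () =
  try iteri_accounts ~f:(fun i a -> Printf.printf "%d: %s\n" i a)
  with Failure msg -> Printf.printf "crash: %s\n" msg
```

The real crash is the same shape: `compute_delegatee_table` iterates the epoch ledger database up to the recorded account count and hits a missing entry.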

The code that does this lives in `sync_local_state` in `proof_of_stake.ml`. Specifically, I believe it's the sync protocol implemented in `Mina_ledger.Sync_ledger.Root.create` that is being interrupted. (For 3.1.0 this was `Mina_ledger.Sync_ledger.Db.create`.)

Other than the crash itself, my other (vaguer) worry is that the crash implies the daemon has loaded a database that isn't completely synced and is trying to operate on it as if it were.

This might also affect the snarked root; I haven't tested interrupting that sync yet. That ledger also seems to be synced in place, so it might be susceptible to the same kind of bug. However, it's possible that we never examine the snarked root database during bootstrap the way we examine the epoch snapshots; in practice we might just throw away the incompletely synced snarked root, or resume the incomplete sync.


Possible fixes, off the top of my head (not all of which necessarily need to be done):

  1. Don't sync the epoch ledger snapshots (or, probably, the snarked root) in place. Instead, sync them into a temporary directory inside the mina config directory, then `Sys.rename` them into their final location once the sync is done. We only sync each ledger once, in one place each, so the code surrounding the sync could probably be modified this way without too much trouble.
  2. Add some kind of flag to the database indicating its status, so we can explicitly mark it as "sync in progress". The flag would be set just as we start syncing and removed at the end, by which point the database should be fully consistent.
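Option 1 could look roughly like the sketch below. All of the names here (`sync_then_install`, its arguments, the `.syncing` suffix) are hypothetical, not Mina's actual API; the point is just the download-to-temp-then-rename shape.

```ocaml
(* Hypothetical sketch of option 1: sync into a temporary directory
   under the config dir, then rename it into place only after the sync
   completes.  A crash mid-sync leaves behind only the ".syncing"
   directory, which is safe to discard on the next startup. *)
let sync_then_install ~config_dir ~ledger_name ~(sync : string -> unit) =
  let final_dir = Filename.concat config_dir ledger_name in
  let tmp_dir = Filename.concat config_dir (ledger_name ^ ".syncing") in
  (* Throw away any leftover from a previously interrupted sync. *)
  if Sys.file_exists tmp_dir then
    ignore (Sys.command (Printf.sprintf "rm -rf %s" (Filename.quote tmp_dir)));
  Sys.mkdir tmp_dir 0o755;
  (* Run the (interruptible) sync against the temp location. *)
  sync tmp_dir;
  (* rename(2) is atomic within a filesystem, so the final directory
     either doesn't exist or holds a fully synced ledger. *)
  Sys.rename tmp_dir final_dir
```

Since the temp directory lives inside the config directory, the rename stays on one filesystem and so keeps its atomicity; putting it in `/tmp` would risk a cross-device rename failure.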
