Local Statesync Plan
Context
Live migration: a temporary migration DB syncs from the live execution DB via statesync, then cuts over. Both DBs share the same storage device. After cutover, the migration DB becomes primary and the old DB's seq chunks are recycled into the shared pool. Each DB needs its own metadata/root-offset backing and its own owned set of seq chunks, but seq ownership can be dynamic through the shared pool. The design should reuse the existing device/chunking model instead of repartitioning the whole pool from scratch for every migration.
Phase 0: Decouple DB Metadata Backing From Fixed CNV Chunk Positions
The current code assumes the first conventional chunks have fixed meanings:
- `AsyncIO` hardcodes CNV chunk 0 as the metadata chunk
- `UpdateAux` hardcodes CNV chunk 0 for `db_metadata` and CNV chunks 1+ for root offsets
- the CLI restore/import path also hardcodes CNV chunk 0
This is too coupled to the old single-DB format. Before multi-DB support, we should make "where this DB's metadata/root offsets live" an explicit backing choice rather than "whichever CNV chunks happen to be first in the pool."
Phase 0 outcome:
- CNV chunks no longer need to live at special positions in `storage_pool`
- `db_id=1` can still point at the legacy on-pool backing
- `db_id=2` can point at sidecar-backed metadata/root offsets
- all open paths resolve metadata/root-offset backing explicitly
Phase 0 Non-Goals
Phase 0 is not the shared-free-list refactor. Do not touch seq allocation policy yet:
- do not change `fast`/`slow`/`free_list` behavior yet
- do not add the global pool freelist yet
- do not add dynamic seq ownership transfer yet
- do not repartition the pool footer or change chunk geometry
The goal is only to remove the positional assumptions around metadata/root-offset backing.
Cleanest Implementation Boundary
The clean cut is:
- make `AsyncIO` effectively seq-only
- move metadata/root-offset backing selection into MPT open/setup code
- make `UpdateAux::set_io()` map metadata/root offsets from an explicit backing descriptor rather than inferring them from `pool.chunk(cnv, 0)` and `pool.chunk(cnv, 1+)`
Important observation from the current code:
- `AsyncIO` only touches CNV in its constructor, to register chunk-0 FDs with io_uring
- metadata/root-offset access is not performed through async reads/writes; `UpdateAux::set_io()` mmaps those backings directly
That means Phase 0 should not try to teach storage_pool about multiple logical DBs yet. The smallest clean refactor is to resolve DB backing once during open, then pass that resolved backing into UpdateAux.
Recommended Backing Model
Add an explicit backing descriptor, resolved once per DB open:
```cpp
struct resolved_db_backing
{
    uint8_t db_id; // 1 = legacy primary, 2 = migration DB

    struct mapped_chunk_ref
    {
        int read_fd;
        int write_fd;
        uint64_t base_offset;
        uint64_t capacity;
        uint32_t logical_id;
    };

    mapped_chunk_ref metadata_chunk; // holds both metadata copies
    std::vector<mapped_chunk_ref> root_offset_chunks; // each holds both copies
};
```
Two backing modes are enough for v1:
- legacy pool-backed DB 1:
  - metadata chunk = pool CNV chunk 0
  - root-offset chunks = pool CNV chunks 1..N
- sidecar-backed DB 2:
  - metadata chunk = dedicated sidecar file
  - root-offset chunks = dedicated sidecar files
Do not add an arbitrary CNV-placement feature in Phase 0. Sidecar-backed DB 2 is enough to break the hardcoded single-DB assumption cleanly.
Preserve The Existing On-Disk Copy Layout
Do not redesign the metadata format in Phase 0. Reuse the current "two copies in one CNV-sized backing object" layout:
- metadata backing stores copy A in the first half, copy B in the second half
- each root-offset backing does the same
For sidecar-backed DB 2, size each sidecar file exactly like one CNV chunk on the pool and keep the same half-and-half layout. This keeps the mmap logic almost identical to today.
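The half-and-half copy layout can be pinned down with a pair of offset helpers. This is a sketch; the helper names are hypothetical and assume a backing sized to exactly one CNV chunk:

```cpp
#include <cstdint>

// Hypothetical helpers (illustrative names, not existing API): each
// metadata/root-offset backing is one CNV-chunk-sized object holding two
// copies, copy A in the first half and copy B in the second half.
// `capacity` is the full backing size in bytes.
constexpr uint64_t copy_a_offset(uint64_t /*capacity*/) { return 0; }
constexpr uint64_t copy_b_offset(uint64_t capacity) { return capacity / 2; }
constexpr uint64_t copy_size(uint64_t capacity) { return capacity / 2; }
```

Keeping both copies inside one backing object is what lets the sidecar files reuse today's mmap logic unchanged.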
Do Not Treat root_offsets.storage_.cnv_chunk_id As A Physical Pool Chunk ID Anymore
This field should become a logical backing ID:
- for legacy DB 1, the logical IDs still happen to be 1..N and resolve to pool CNV chunks
- for sidecar DB 2, the logical IDs also use 1..N but resolve to sidecar root-offset files
That avoids a db-metadata format change in Phase 0. UpdateAux::map_root_offsets() should resolve through resolved_db_backing.root_offset_chunks, not by directly calling pool.chunk(storage_pool::cnv, stored_id).
Exact Phase 0 Code Changes
0.1 Add an explicit backing spec to config
File: `category/mpt/ondisk_db_config.hpp`
Add:
- `uint8_t db_id{1};`
- a small optional backing override for non-legacy DBs, e.g.:
```cpp
struct DbBackingPaths
{
    std::filesystem::path metadata_path;
    std::vector<std::filesystem::path> root_offset_paths;
};
```
Then:
- `OnDiskDbConfig` gets `db_id` and `std::optional<DbBackingPaths> backing_paths`
- `ReadOnlyOnDiskDbConfig` gets the same
Rule:
- `db_id=1` with no `backing_paths` means the legacy pool-backed DB
- `db_id=2` must provide `backing_paths`
- `root_offsets_chunk_count` only affects pool creation in the legacy pool-backed case; for sidecar-backed DB 2 it controls how many sidecar root-offset files to create, not how many CNV chunks the pool has
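The rule can be captured as a small validation helper. This is a sketch; `BackingMode` and `select_backing_mode` are hypothetical names, not existing API:

```cpp
#include <stdexcept>
#include <cstdint>

enum class BackingMode { LegacyPool, Sidecar };

// Hypothetical check of the v1 rule: db_id=1 with no explicit paths is the
// legacy pool-backed DB; db_id=2 must provide sidecar backing paths.
// Every other combination is rejected at open time.
BackingMode select_backing_mode(uint8_t db_id, bool has_backing_paths)
{
    if (db_id == 1 && !has_backing_paths) {
        return BackingMode::LegacyPool;
    }
    if (db_id == 2 && has_backing_paths) {
        return BackingMode::Sidecar;
    }
    throw std::invalid_argument("unsupported db_id/backing_paths combination");
}
```

Validating once at open time keeps the rest of the open path free of per-mode special cases.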
0.2 Resolve backing once in AsyncIOContext
Files:
- `category/mpt/db.hpp`
- `category/mpt/db.cpp`
- a new helper file is recommended, e.g. `category/mpt/db_backing.hpp/.cpp`
Add:
- `resolved_db_backing backing;` to `AsyncIOContext`
- a helper `resolve_db_backing(storage_pool&, options)` returning `resolved_db_backing`
resolve_db_backing(...) should also own sidecar lifecycle for writable opens:
- create/truncate sidecar metadata/root-offset files when opening a fresh writable DB 2
- reopen existing sidecar files for append/open-existing mode
- size every sidecar file to exactly one CNV chunk capacity
Why here:
- every open path already funnels through `AsyncIOContext`
- once `AsyncIOContext` owns the resolved backing, both RO and RW worker-thread paths inherit it automatically
- this is also where the current `pool_options.num_cnv_chunks = root_offsets_chunk_count + 1` logic must become conditional on using the legacy pool-backed DB format
0.3 Stop making AsyncIO responsible for metadata backing
Files:
- `category/async/io.hpp`
- `category/async/io.cpp`
Change:
- remove `cnv_chunk_` from `AsyncIO`
- stop registering CNV chunk 0 with io_uring
- keep `AsyncIO` focused on seq chunks only
This is the cleanest boundary because no async code currently uses CNV for metadata/root-offset access after construction.
0.4 Change UpdateAux::set_io() to take the resolved backing
Files:
- `category/mpt/trie.hpp`
- `category/mpt/update_aux.cpp`
Signature change:
```cpp
void set_io(
    AsyncIO &,
    resolved_db_backing const &,
    std::optional<uint64_t> history_length = {});
```
Constructor overloads should follow the same pattern.
Implementation split in `update_aux.cpp`:
- `map_db_metadata_from_backing(...)`
- `map_root_offsets_from_backing(...)`
- `initialize_new_db_from_backing(...)`
Specific replacements inside the current `set_io()`:
- replace `pool.chunk(cnv, 0)` with `backing.metadata_chunk`
- replace the current "for each stored `cnv_chunk_id` -> `pool.chunk(cnv, id)`" logic with a lookup in `backing.root_offset_chunks`
- on fresh DB init, allocate/zero the configured root-offset backings instead of hardcoding CNV chunks 1..N
0.5 Keep fresh DB initialization identical except for backing selection
In UpdateAux::set_io():
- keep the double-copy metadata initialization
- keep the history/ring sizing rules
- keep the existing free/fast/slow initialization
- only change where metadata/root-offset bytes are written
That means Claude should avoid touching:
- list semantics
- chunk insertion order
- `capacity_in_free_list`
- seq-chunk initialization
0.6 Update every constructor path that builds UpdateAux
Call sites that must pass the new backing:
- `Db::ROOnDiskBlocking`
- `OnDiskWithWorkerThreadImpl::DbAsyncWorker` (both RO and RW constructors)
- any direct `UpdateAux` construction in tests/tooling
This is the spot where hidden default-DB paths usually survive, so Phase 0 is not done until every one of these passes the same resolved backing.
0.7 Update CLI restore/import to use the same backing helpers
File: `category/mpt/cli_tool_impl.cpp`
Minimum requirement for Phase 0:
- remove the direct `pool->chunk(cnv, 0)` metadata mapping
- use the same metadata-backing resolution helper used by `AsyncIOContext`
It is okay if full archive/export support for sidecar-backed DB 2 is deferred, but restore/open code must stop assuming metadata always lives in pool CNV chunk 0.
Phase 0 Validation
Add or update tests for these exact cases:
- legacy DB 1 still initializes and reopens with pool CNV chunk 0 metadata and pool CNV chunks 1..N root offsets
- sidecar-backed DB 2 initializes and reopens using sidecar metadata/root-offset files
- the RO open path uses the same resolved backing as the RW open path
- worker-thread opens (`RODb`, RW worker) use the same resolved backing as direct opens
- `UpdateAux::set_io()` no longer contains any hardcoded `pool.chunk(storage_pool::cnv, 0)` or `for (n = 2; ...) pool.chunk(storage_pool::cnv, n)` assumptions
Phase 1: Shared Pool Free List
Extend the existing storage_pool chunk management (not a new system) with a global seq-chunk free list.
Important design correction: split responsibilities cleanly. Pool metadata owns the global free list of unowned chunks. Triedb db_metadata remains durable mmapped state for per-DB placement and reuse of chunks already held by that DB. fast/slow stay DB-local implementation details, and the current DB-local free_list is repurposed as a DB-local recycle/reserve list rather than a global free-space list. Ownership is implicit: if a seq chunk is on the pool free list, it is unowned; otherwise it belongs to some DB.
Implicit Ownership
If a chunk is in the global free list, it is unowned. If it is not in the global free list, it is owned by some DB. Within that DB, the chunk may be active in fast/slow or sitting on the DB-local recycle list ready for reuse.
Global Freelist: Pool-Level Lock + Index-Linked List
```cpp
struct pool_freelist {
    std::atomic<uint8_t> lock;        // small pool-level spinlock or mutex wrapper
    uint16_t head;                    // first free chunk index (0xFFFF = empty)
    uint16_t next_free[MAX_CHUNKS];   // singly-linked list via indices
};
```
This lock only protects pool free-list transitions. In the common single-DB case it is touched only when a chunk is acquired from or returned to the global free list, so steady-state overhead should be close to zero.
Practical note for v1:
- use a simple sidecar spinlock or mutex byte; do not try to invent a lock-free allocator here
- pool free-list operations are not on the hot path of every node write
- the hot path should stay DB-local through the recycle list
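A minimal sketch of the index-linked free list under a single pool-level lock, using `std::mutex` in place of the spinlock byte (`kMaxChunks` is an assumed bound for illustration; the real structure lives in the sidecar):

```cpp
#include <cstdint>
#include <mutex>

constexpr uint16_t kEmpty = 0xFFFF;   // empty-list sentinel, as in the struct above
constexpr size_t kMaxChunks = 4096;   // assumption for the sketch

struct pool_freelist_sketch
{
    std::mutex lock;                   // stands in for the spinlock byte
    uint16_t head = kEmpty;
    uint16_t next_free[kMaxChunks] = {};

    // Return a chunk index to the pool; it becomes unowned.
    void push(uint16_t chunk)
    {
        std::lock_guard<std::mutex> g(lock);
        next_free[chunk] = head;
        head = chunk;
    }

    // Take an unowned chunk, or kEmpty if the pool is exhausted.
    uint16_t pop()
    {
        std::lock_guard<std::mutex> g(lock);
        if (head == kEmpty) {
            return kEmpty;
        }
        uint16_t const chunk = head;
        head = next_free[chunk];
        return chunk;
    }
};
```

Both operations are O(1) and touch the lock only on pool transitions, which is why steady-state single-DB overhead stays near zero.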
DB Identity and Metadata Backing
Do not add a pool catalog. Existing pools derive chunk geometry from the current footer layout, and existing pools cannot grow num_cnv_chunks in place. For the same reason, do not extend the existing pool footer with the new global free-list structure on existing pools. Instead, v1 uses:
- the existing pool footer unchanged
- a small sidecar file for global free-list metadata
- a fixed DB convention:
```cpp
struct db_open_spec {
    uint8_t db_id; // 1 = existing primary, 2 = migration DB
    std::span<std::filesystem::path const> metadata_backing_paths;
    std::optional<uint32_t> chunk_limit;
};
```
V1 assumptions:
- `db_id=1` uses the existing on-pool metadata/root-offset backing
- `db_id=2` uses explicitly provided sidecar-backed metadata/root-offset files
- callers must pass `db_id` through every open path
- cutover updates the process/config that chooses which DB backing is treated as primary
Global Free-List Recovery
The sidecar free-list metadata is authoritative during normal runtime, but it must be rebuildable. On migration-tool startup, or after any detected dirty/corrupt sidecar state:
- Open both DB metadata backings under exclusive migration lock
- Walk each DB's `fast`/`slow`/local-recycle lists
- Mark those chunks as in-use
- Rebuild the global pool free list from the remaining seq chunks
This gives a simple crash-recovery rule without adding explicit owner arrays.
Recovery assumptions for Claude:
- the sidecar is authoritative only while it is clean
- if the sidecar is missing, dirty, or fails validation, rebuild it from DB metadata instead of trying to repair it incrementally
- rebuild requires exclusive migration/process ownership of both DB handles for the duration of the scan
Operations
Allocate a chunk for DB X:
- Acquire the pool free-list lock
- Pop `head`
- Release the lock
- Hand the chunk to DB X, which may place it directly into `fast`/`slow` or park it on the DB-local recycle list first
Reuse a chunk inside DB X:
- Prefer popping from DB X's local recycle list
- If that is empty and `chunk_limit` allows, allocate from the global pool free list
- If that is empty and `chunk_limit` would be exceeded, compact/reclaim within DB X and retry the local recycle list before touching the pool again
Free a chunk from DB X:
- Remove the chunk from DB X's `fast`/`slow` metadata
- Trim/destroy chunk contents
- Return it to DB X's local recycle list by default
- Only explicit shrink/destroy paths return chunks to the global pool free list
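The allocate/reuse preference order above can be sketched as follows. Everything here is illustrative: `db_allocator` and its deque-backed pool stand in for the real `UpdateAux`/`storage_pool` APIs, and the pool access would be lock-protected in real code:

```cpp
#include <cstdint>
#include <deque>
#include <optional>

struct db_allocator
{
    std::deque<uint16_t> recycle;   // DB-local recycle list (hot path)
    std::deque<uint16_t> *pool;     // shared global pool free list
    uint32_t owned;                 // seq chunks currently owned by this DB
    uint32_t chunk_limit;           // per-DB policy cap

    // Returns a chunk for this DB, or nullopt if the caller must
    // compact/reclaim within the DB and retry.
    std::optional<uint16_t> allocate()
    {
        if (!recycle.empty()) {                      // 1. local reuse
            uint16_t const c = recycle.front();
            recycle.pop_front();
            return c;
        }
        if (owned < chunk_limit && !pool->empty()) { // 2. pool, if under cap
            uint16_t const c = pool->front();        //    (lock-protected in real code)
            pool->pop_front();
            ++owned;
            return c;
        }
        return std::nullopt;                         // 3. compact and retry
    }
};
```

Note how the cap only gates pool allocation; local recycling is always allowed, which is what keeps a capped migration DB making progress.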
Destroy DB X (bulk reclaim):
- DB X must already be quiesced and all DB handles closed
- Acquire the pool free-list lock
- Walk DB X's owned seq chunks, trim them, remove them from DB X metadata, and push them onto the global free list
- Release the lock
Bootstrap DB 2 on an existing pool:
- Quiesce DB 1 and close helper handles long enough to take a consistent snapshot
- Reinterpret DB 1's existing `free_list` as DB 1's local recycle list
- Carve out an initial budget for DB 2 by moving selected chunks from DB 1's recycle list into the global pool free list, then letting DB 2 allocate them
- Create DB 2 sidecar metadata/root-offset backing
- Open DB 1 with `db_id=1` and DB 2 with `db_id=2`
- From this point on, each DB prefers its own recycle list; the pool is used when a DB needs new unowned chunks
Bootstrap assumptions:
- v1 does not need a perfect automatic rebalance policy; a fixed initial carve-out is enough
- carve-out should only move chunks that are already empty and on DB 1's recycle list
- do not steal active `fast`/`slow` chunks during bootstrap
- if the first deployment is allowed to hard-reset the pool, this entire bootstrap path can be skipped and both DBs can start from a freshly initialized layout
Cutover (migration complete):
- Stop source execution writes and stop issuing new statesync work
- Drain in-flight statesync/server reads, then close source RW/RO handles (`TrieDb`, the source `Db`, `sctx.ro`, server/client contexts)
- Reopen DB 2 as the live execution DB / make it the configured primary
- Destroy DB 1: all its seq chunks return to the global pool free list
- DB 2 can now allocate from the expanded free list
Per-DB Size Configuration
chunk_limit is a per-DB policy. It can live in config and/or be mirrored in DB metadata for observability. When a DB approaches its limit, the policy should prefer compaction and reuse from that DB's recycle list instead of pulling new chunks from the global pool. The temporary migration DB can be capped while the old primary is live. After cutover, the promoted DB just allocates from the same global free list without the cap.
CNV backing is separate and fixed at DB creation time. Existing pools cannot grow num_cnv_chunks in place, so v1 still must not depend on repartitioning the pool's CNV region. For existing pools, the current primary DB keeps its existing CNV assignment and the temporary migration DB uses dedicated sidecar-backed metadata/root-offset storage on the same device.
On-Disk Layout
Device:
```
┌─────────────────────────────────────────────────┐
│ Existing pool layout remains in place           │
│   DB 1: existing CNV assignment                 │
│   DB 2: sidecar metadata/root offsets (v1)      │
├─────────────────────────────────────────────────┤
│ SEQ chunks                                      │
│   Interleaved across both DBs                   │
│   ownership implicit from pool free list vs use │
│   DB-local recycle lists stay in DB metadata    │
├─────────────────────────────────────────────────┤
│ chunk_bytes_used[N] (existing, atomic<uint32>)  │
│ metadata_t (existing, 64 bytes)                 │
├─────────────────────────────────────────────────┤
│ Pool freelist sidecar (NEW, head + next_free)   │
└─────────────────────────────────────────────────┘
```
Changes to Existing Code
| File |
Change |
Effort |
async/storage_pool.hpp |
Add pool-level global free-list helpers plus allocate_chunk(db_id) / free_chunk_to_pool(chunk) APIs. Do not add DB-placement semantics here. |
Medium |
async/storage_pool.cpp |
Implement global free-list operations under the pool lock, bootstrap migration from the existing DB-local free list, and final reclaim back into the pool. Keep the existing pool footer unchanged. |
Medium |
async/io.hpp / async/io.cpp |
Make AsyncIO use the explicit db_id/path-selected metadata backing instead of hardcoding CNV chunk 0. |
Medium |
mpt/db.hpp |
AsyncIOContext accepts optional shared storage_pool*, and all open-path config structs carry db_id. |
Small |
mpt/db.cpp |
AsyncIOContext, Db, and RODb constructors: use shared pool if provided and resolve metadata backing from explicit db_id/path convention. |
Small |
mpt/db.cpp (DbAsyncWorker) |
Forward shared pool and db_id to worker threads so helper opens do not fall back to the default DB. |
Small |
mpt/update_aux.cpp |
set_io() reads metadata backing from explicit db_id/path convention instead of hardcoding chunk 0. map_root_offsets() uses the configured backing. Keep a DB-local recycle list in db_metadata for owned reusable chunks; fast/slow remain in db_metadata, and new chunks come from the pool only when the recycle list is empty and policy allows. |
Large |
mpt/ondisk_db_config.hpp |
Add db_id and optional chunk_limit to both OnDiskDbConfig and ReadOnlyOnDiskDbConfig. Valid DB IDs start at 1, and callers must pass it explicitly for every migration-related open. |
Trivial |
mpt/cli_tool_impl.cpp |
Remove remaining hardcoded CNV chunk 0 assumptions in archive/import tooling so tooling can open non-primary DB slots correctly. |
Small |
cmd/monad_local_statesync.cpp |
Hold the external migration lock, rebuild allocator state on open, drive statesync, promote migration DB, then reclaim old primary. Optionally host DB 1 + DB 2 on one migration-owned worker thread. |
Medium |
cmd/monad_pool_freelist_sidecar.* (new) |
Read/write the global free-list sidecar and rebuild it from DB metadata when dirty or missing. |
Medium |
Phase 1 Assumptions To Keep Tight
- Do not redesign `db_metadata` in Phase 1. Keep the list structure and chunk-info entries, but reinterpret `free_list` as the DB-local recycle list.
- Do not make `storage_pool` aware of `fast`/`slow`.
- Do not try to support arbitrary live chunk migration between DBs yet; only pool allocation, local reuse, bootstrap carve-out, and bulk reclaim.
- Keep the sidecar free-list format boring: header + head + next array + dirty bit/version.
Backwards Compatibility
Single-DB: use db_id=1. The global pool lock is only touched when a chunk is acquired or returned, so steady-state single-DB execution should see near-zero overhead. Behavior is otherwise identical to current code.
Suggested Implementation Steps
Phase 0A: Introduce explicit metadata/root-offset backing selection
Goal: remove the hardcoded "CNV chunk 0 / CNV chunk 1+" assumptions before touching allocation.
Scope:
- `async/io.hpp`
- `async/io.cpp`
- `mpt/db.hpp`
- `mpt/db.cpp`
- `mpt/update_aux.cpp`
- `mpt/ondisk_db_config.hpp`
- `mpt/cli_tool_impl.cpp`
Behavior:
- add an explicit `db_id`
- make metadata/root-offset backing explicit per DB open
- keep single-DB behavior unchanged for `db_id=1`
- no shared freelist yet
Phase 1A: Add global pool free-list sidecar for the existing single DB
Goal: introduce the pool-level free-list sidecar and recovery logic without changing open-path identity or statesync.
Scope:
- `async/storage_pool.hpp`
- `async/storage_pool.cpp`
- `cmd/monad_pool_freelist_sidecar.*`
- `async/test/storage_pool.cpp`
Behavior:
- keep `db_id=1` only
- bootstrap the sidecar by reinterpreting the existing DB-local `free_list` as DB-local recycle space and moving globally free chunks into the pool sidecar
- no second DB yet
- no statesync yet
Implementation notes:
- this phase should only touch `storage_pool` plus sidecar code
- it should not yet change `replace_node_writer()` or `UpdateAux::append/remove()`
- for a brand-new pool, initialize the sidecar directly from all free seq chunks
- for an existing pool, populate the sidecar from the DB-local `free_list` once, under exclusive ownership
- after Phase 1A, the rest of triedb may still behave as if `free_list` is the allocator source; that semantic flip happens in Phase 1B
Phase 1B: Switch triedb allocation from DB-local free list to local recycle + pool allocate
Goal: keep single-DB behavior green while changing the actual allocator boundary.
Scope:
- `mpt/detail/db_metadata.hpp`
- `mpt/update_aux.cpp`
- `mpt/trie.cpp`
- `mpt/trie.hpp`
- `mpt/cli_tool_impl.cpp`
- `mpt/test/*` touching free-list assumptions
Behavior:
- the DB-local `free_list` becomes a local recycle list
- node-writer rollover stops pulling from `db_metadata()->free_list_end()` and allocates from `storage_pool` when the local recycle list is empty
- `capacity_in_free_list` and related free-space reporting are updated to reflect the new meaning, or split into local-recycle vs global-pool metrics
- still single-DB only
Implementation notes:
- the first allocator hot spots are:
  - `UpdateAuxImpl::append()` / `remove()` in `category/mpt/update_aux.cpp`
  - `replace_node_writer_to_start_at_new_chunk()` in `category/mpt/trie.cpp`
  - `replace_node_writer()` in `category/mpt/trie.cpp`
- keep the existing list-manipulation helpers if possible; change their meaning before renaming them
- the safest sequence is:
  - add pool allocation/free APIs
  - add `UpdateAux` helpers such as `pop_recycle_chunk()` / `allocate_chunk_for_writer()`
  - switch the two node-writer rollover sites to those helpers
  - only then update accounting/reporting
- do not mix multi-DB logic into Phase 1B; keep all tests single-DB
Phase 2: Multi-DB open/backing plumbing
Goal: make it possible to open DB 1 and DB 2 on the same underlying device before statesync exists.
Scope:
- tests / tooling / `monad_mpt` setup for two DBs on one device
- an optional shared migration-owned worker thread
Behavior:
- `monad_mpt` can initialize DB 1 + DB 2 on the same block device
- DB 1 and DB 2 have distinct metadata/root-offset backings
- DB 1 and DB 2 can both own seq chunks and allocate through the shared pool
- basic two-DB open/write/isolation works
Implementation notes:
- Phase 2 is where command/config plumbing grows up, not where allocator semantics change again
- the key callers to update are:
  - `cmd/monad/main.cpp`
  - `cmd/monad_cli.cpp`
  - any `monad_mpt` setup path that currently only knows one DB per pool
- initialize DB 2 by creating its sidecar metadata/root-offset files and then opening it through the normal `Db` path with `db_id=2`
- the isolation test should prove:
  - DB 1 and DB 2 can both write
  - each DB sees only its own metadata/root history
  - both allocate from the shared pool without corrupting each other
- if shared-worker mode is used, keep it outside the generic `Db` interface: it should be a migration-tool-owned composition of two normal DB contexts
Phase 3: Local statesync and cutover
Goal: build the migration workflow on top of the allocator and multi-DB plumbing.
Scope:
- `cmd/monad_local_statesync.cpp`
- `cmd/CMakeLists.txt`
- statesync tests / migration integration tests
Behavior:
- source DB and migration DB run concurrently
- progressive statesync fills DB 2
- cutover quiesces DB 1, promotes DB 2, reclaims DB 1 back into the global pool
Critical process assumption:
- v1 should assume the migration tool/process owns both DB handles and the pool free-list sidecar during the migration workflow
- do not support an unrelated external writer mutating pool ownership state concurrently with the migration tool
- if that assumption changes later, the allocator protocol must become inter-process authoritative rather than rebuild-on-open
Phase 3 Details: Multi-DB Local Statesync
In-Memory Bridge (~40 lines)
From test_statesync.cpp:49-139: monad_statesync_client/monad_statesync_server_network structs + four function pointers.
Server Setup
- Open a shared `storage_pool` over the device
- Open the source DB (`db_id=1`, writable, shared pool) → `TrieDb` → `monad_statesync_server_context`
- Clone the pool read-only with `db_id=1` as well → set as `sctx.ro`
Specific mapping to current code:
- server deletion tracking lives in `category/statesync/statesync_server_context.cpp`
- finalized deletes are accumulated through `monad_statesync_server_context::commit()` plus `finalize()`
- the local migration command should reuse those existing hooks instead of inventing a second deletion log
Commit Blocks Through Server Context
Implement a small replay loop in monad_local_statesync.cpp that loads finalized blocks sequentially and commits them through the server-side execution path so monad_statesync_server_context accumulates FinalizedDeletions. There is no existing commit_sequential(...) helper to reuse as-is.
Client Setup + Progressive Statesync
- Open the dest DB (`db_id=2`, writable, shared pool, `chunk_limit=N`)
- Create a `monad_statesync_client_context`
- Wire server ↔ client, `handle_new_peer` × 256
- Progressive sync: `handle_target(Ti)` → drain → repeat
- `finalize()`
Specific mapping to current code:
- use the same bridge pattern as `category/statesync/test/test_statesync.cpp`
- use the same target-driving pattern as the `sync_from_some`-style tests
- keep the first implementation strictly in-process; do not add sockets, RPC, or a separate transport
Optional Shared Worker Thread
If hardware is tight, the migration process may optionally host DB 1 and DB 2 on a single shared triedb worker thread instead of one DbAsyncWorker per DB. In that mode, one thread owns two AsyncIOContext / UpdateAux pairs and dispatches work by db_id. This should stay a migration-tool concern rather than a required change to the generic Db API.
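The dispatch-by-`db_id` idea can be sketched as follows. `DbContext` is a hypothetical stand-in for an `AsyncIOContext`/`UpdateAux` pair, and the routing table is migration-tool-owned, not part of the generic `Db` API:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Stand-in for one DB's AsyncIOContext/UpdateAux pair.
struct DbContext { std::string name; };

// One worker thread owns both DB contexts and routes work by db_id.
struct shared_worker
{
    std::map<uint8_t, DbContext> dbs; // db_id -> context

    DbContext &route(uint8_t db_id)
    {
        auto it = dbs.find(db_id);
        assert(it != dbs.end() && "unknown db_id");
        return it->second;
    }
};
```

Keeping this composition in the migration tool means the generic `Db` interface never learns about multi-DB dispatch.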
Cutover
- Statesync reaches the target → the dest DB has valid state
- Stop source writes, drain statesync/server work, and close source-side handles
- Reconfigure the live process so the dest DB (`db_id=2`) becomes the primary DB backing
- Destroy the source DB → all its seq chunks return to the global pool free list
- Reopen the dest DB as the live execution DB
Cutover assumptions:
- reclaimed DB 1 seq chunks must not be returned to the pool until all DB 1 handles are closed
- this includes helper RO opens and any `sctx.ro` clone
- promotion should be a config/process-level switch, not an in-place mutation of DB IDs
Files to Create/Modify
- Create: `cmd/monad_local_statesync.cpp`
- Modify: `cmd/CMakeLists.txt`
Reference
- `test_statesync.cpp:49-196` — bridge pattern
- `test_statesync.cpp:311-491` — progressive targets (`sync_from_some`)
- `statesync_server_context.cpp:38-105` — deletion tracking
- `statesync_client.cpp:101-157` — client lifecycle
Verification
- Build: `cmake --build build-claude -j$(nproc)`
- Existing statesync tests: `ctest --test-dir build-claude -R statesync --timeout 30`
- Single-DB backwards compat: existing `monad`/`monad-cli` behavior unchanged
- Two-DB pool test: create two DBs on the same device, write to both, verify isolation
- Bootstrap migration test: reinterpret DB 1's current free list as a local recycle list, carve an initial budget into the global pool, then open DB 1 and DB 2 concurrently
- Local-reuse test: a DB compacts and reuses its own recycled chunks without touching the pool
- Global free-list test: when a DB needs new unowned chunks, both DBs can allocate from the pool, and explicit shrink/destroy returns chunks globally
- Bulk reclaim test: destroy one DB, verify its chunks are returned, and that the other DB can use them after cutover
- Local statesync end-to-end with cutover
- Quiesce test: verify that source handles are closed before reclaimed chunks are reused by the dest DB
- `db_id` plumbing test: open the same pool through the writable DB, `RODb`, worker-thread DB, and `sctx.ro`, and verify each path resolves the intended DB slot
- Shared-worker test: run DB 1 + DB 2 through one migration-owned worker thread and verify request routing / shutdown are correct
Recommended execution order for Claude:
- land Phase 0 and keep all existing single-DB tests green
- land Phase 1A and validate sidecar rebuild/init separately
- land Phase 1B and validate allocator rollover separately
- land Phase 2 with a focused two-DB open/isolation test
- only then land Phase 3 statesync/cutover
Design References
- LMDB: use dual meta pages and a single authoritative metadata plane. In the same spirit, pool ownership/free-list state should have one authority, and RO/RW opens should both resolve the same `db_id`.
- ZFS: treat migration like clone promotion plus checkpointed cutover. The temporary migration DB becomes primary, and reclamation of the old primary happens only after the destructive step boundary.
- memcached: keep the allocator state simple and central, and recycle pages/chunks through one free structure rather than many duplicated ownership views.
- Linux kernel: use a stable ID registry plus grace-period-style reclamation. `db_id` assignment should look like an IDA/XArray-style registry, and old-primary chunk reclaim should happen only after all readers are gone.
Local Statesync Plan
Context
Live migration: a temporary migration DB syncs from the live execution DB via statesync, then cuts over. Both DBs share the same storage device. After cutover, the migration DB becomes primary and the old DB's seq chunks are recycled into the shared pool. Each DB needs its own metadata/root-offset backing and its own owned set of seq chunks, but seq ownership can be dynamic through the shared pool. The design should reuse the existing device/chunking model instead of repartitioning the whole pool from scratch for every migration.
Phase 0: Decouple DB Metadata Backing From Fixed CNV Chunk Positions
The current code assumes the first conventional chunks have fixed meanings:
AsyncIOhardcodes CNV chunk 0 as the metadata chunkUpdateAuxhardcodes CNV chunk 0 fordb_metadataand CNV chunk 1+ for root offsetsThis is too coupled to the old single-DB format. Before multi-DB support, we should make "where this DB's metadata/root offsets live" an explicit backing choice rather than "whichever CNV chunks happen to be first in the pool."
Phase 0 outcome:
storage_pooldb_id=1can still point at the legacy on-pool backingdb_id=2can point at sidecar-backed metadata/root offsetsPhase 0 Non-Goals
Phase 0 is not the shared-free-list refactor. Do not touch seq allocation policy yet:
fast/slow/free_listbehavior yetThe goal is only to remove the positional assumptions around metadata/root-offset backing.
Cleanest Implementation Boundary
The clean cut is:
AsyncIOeffectively seq-onlyUpdateAux::set_io()map metadata/root-offsets from an explicit backing descriptor rather than inferring them frompool.chunk(cnv, 0)andpool.chunk(cnv, 1+)Important observation from the current code:
AsyncIOonly touches CNV in its constructor to register chunk-0 FDs with io_uringUpdateAux::set_io()mmaps those backings directlyThat means Phase 0 should not try to teach
storage_poolabout multiple logical DBs yet. The smallest clean refactor is to resolve DB backing once during open, then pass that resolved backing intoUpdateAux.Recommended Backing Model
Add an explicit backing descriptor, resolved once per DB open:
Two backing modes are enough for v1:
Do not add an arbitrary CNV-placement feature in Phase 0. Sidecar-backed DB 2 is enough to break the hardcoded single-DB assumption cleanly.
Preserve The Existing On-Disk Copy Layout
Do not redesign the metadata format in Phase 0. Reuse the current "two copies in one CNV-sized backing object" layout:
For sidecar-backed DB 2, size each sidecar file exactly like one CNV chunk on the pool and keep the same half-and-half layout. This keeps the mmap logic almost identical to today.
Do Not Treat
root_offsets.storage_.cnv_chunk_idAs A Physical Pool Chunk ID AnymoreThis field should become a logical backing ID:
1..Nand resolve to pool CNV chunks1..Nbut resolve to sidecar root-offset filesThat avoids a db-metadata format change in Phase 0.
UpdateAux::map_root_offsets()should resolve throughresolved_db_backing.root_offset_chunks, not by directly callingpool.chunk(storage_pool::cnv, stored_id).Exact Phase 0 Code Changes
0.1 Add an explicit backing spec to config
File: `category/mpt/ondisk_db_config.hpp`

Add:
`uint8_t db_id{1};`

Then:
- `OnDiskDbConfig` gets `db_id` and `std::optional<DbBackingPaths> backing_paths`
- `ReadOnlyOnDiskDbConfig` gets the same

Rule:
- `db_id=1` with no `backing_paths` means legacy pool-backed DB
- `db_id=2` must provide `backing_paths`
- `root_offsets_chunk_count` only affects pool creation for the legacy pool-backed case; for sidecar-backed DB 2 it controls how many sidecar root-offset files to create, not how many CNV chunks the pool has

0.2 Resolve backing once in `AsyncIOContext`

Files:
- `category/mpt/db.hpp`
- `category/mpt/db.cpp`
- `category/mpt/db_backing.hpp` / `.cpp`

Add:
- a `resolved_db_backing backing;` member on `AsyncIOContext`
- `resolve_db_backing(storage_pool&, options)` returning `resolved_db_backing`
- `resolve_db_backing(...)` should also own sidecar lifecycle for writable opens

Why here:
- once `AsyncIOContext` owns the resolved backing, both RO and RW worker-thread paths inherit it automatically
- the `pool_options.num_cnv_chunks = root_offsets_chunk_count + 1` logic must become conditional on using the legacy pool-backed DB format

0.3 Stop making `AsyncIO` responsible for metadata backing

Files:
- `category/async/io.hpp`
- `category/async/io.cpp`

Change:
- remove `cnv_chunk_` handling from `AsyncIO`
- keep `AsyncIO` focused on seq chunks only

This is the cleanest boundary because no async code currently uses CNV for metadata/root-offset access after construction.
0.4 Change `UpdateAux::set_io()` to take the resolved backing

Files:
- `category/mpt/trie.hpp`
- `category/mpt/update_aux.cpp`

Signature change:
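A hedged sketch of the signature direction, using stand-in stub types rather than the real `AsyncIO` / `UpdateAux` classes:

```cpp
#include <cassert>
#include <cstdint>

// Stub stand-ins for the real classes, only to show the call shape.
struct AsyncIO {};

struct resolved_db_backing
{
    uint32_t metadata_chunk{0};
};

struct UpdateAux
{
    AsyncIO *io{nullptr};
    resolved_db_backing backing{};

    // Old shape: void set_io(AsyncIO*), which inferred CNV chunk 0 internally.
    // New shape: the resolved backing travels with the call site.
    void set_io(AsyncIO *io_, resolved_db_backing const &backing_)
    {
        io = io_;
        backing = backing_;
        // map_db_metadata_from_backing(backing) and
        // map_root_offsets_from_backing(backing) would run here instead of
        // pool.chunk(cnv, 0) lookups.
    }
};
```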
Constructor overloads should follow the same pattern.
Implementation split in `update_aux.cpp`:
- `map_db_metadata_from_backing(...)`
- `map_root_offsets_from_backing(...)`
- `initialize_new_db_from_backing(...)`

Specific replacements inside current `set_io()`:
- replace `pool.chunk(cnv, 0)` with `backing.metadata_chunk`
- replace the `for each stored cnv_chunk_id -> pool.chunk(cnv, id)` logic with a lookup in `backing.root_offset_chunks` keyed by logical IDs `1..N`

0.5 Keep fresh DB initialization identical except for backing selection
In `UpdateAux::set_io()`, keep the fresh-DB initialization sequence exactly as it is today; only the backing object it writes into is selected differently.

That means Claude should avoid touching:
- `capacity_in_free_list`

0.6 Update every constructor path that builds `UpdateAux`

Files:
- `category/mpt/db.cpp`

Call sites that must pass the new backing:
- `Db::ROOnDiskBlocking`
- `OnDiskWithWorkerThreadImpl::DbAsyncWorker` (both RO and RW constructors)
- `UpdateAux` construction in tests/tooling

This is the spot where hidden default-DB paths usually survive, so Phase 0 is not done until every one of these passes the same resolved backing.
0.7 Update CLI restore/import to use the same backing helpers
File: `category/mpt/cli_tool_impl.cpp`

Minimum requirement for Phase 0:
- replace the `pool->chunk(cnv, 0)` metadata mapping with the backing resolved by `AsyncIOContext`

It is okay if full archive/export support for sidecar-backed DB 2 is deferred, but restore/open code must stop assuming metadata always lives in pool CNV chunk 0.
Phase 0 Validation
Add or update tests for these exact cases:
- legacy pool-backed opens resolve logical IDs `1..N` to on-pool root offsets; sidecar-backed opens resolve the same IDs to sidecar root offsets
- worker-thread opens (`RODb`, RW worker) use the same resolved backing as direct opens
- `UpdateAux::set_io()` no longer contains any hardcoded `pool.chunk(storage_pool::cnv, 0)` or `for (n = 2; ...) pool.chunk(storage_pool::cnv, n)` assumptions

Phase 1: Shared Pool Free List
Extend the existing `storage_pool` chunk management (not a new system) with a global seq-chunk free list.

Important design correction: split responsibilities cleanly. Pool metadata owns the global free list of unowned chunks. Triedb `db_metadata` remains durable mmapped state for per-DB placement and reuse of chunks already held by that DB. `fast` / `slow` stay DB-local implementation details, and the current DB-local `free_list` is repurposed as a DB-local recycle/reserve list rather than a global free-space list. Ownership is implicit: if a seq chunk is on the pool free list, it is unowned; otherwise it belongs to some DB.

Implicit Ownership
If a chunk is in the global free list, it is unowned. If it is not in the global free list, it is owned by some DB. Within that DB, the chunk may be active in `fast` / `slow` or sitting on the DB-local recycle list ready for reuse.

Global Freelist: Pool-Level Lock + Index-Linked List
This lock only protects pool free-list transitions. In the common single-DB case it is touched only when a chunk is acquired from or returned to the global free list, so steady-state overhead should be close to zero.
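A minimal sketch of an index-linked free list guarded by a single pool-level mutex. This is illustrative only; the real list would live in the pool's sidecar metadata and persist across restarts:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <optional>
#include <vector>

// Index-linked free list: next_free[i] chains free chunks, so the list needs
// no extra allocation beyond one slot per chunk. One pool-level mutex guards
// only acquire/release transitions.
struct pool_freelist
{
    static constexpr uint32_t npos = UINT32_MAX;
    std::vector<uint32_t> next_free; // per-chunk "next free chunk" index
    uint32_t head{npos};
    std::mutex lock;

    explicit pool_freelist(uint32_t num_chunks) : next_free(num_chunks, npos) {}

    // Return a chunk to the pool: it becomes unowned.
    void release(uint32_t chunk)
    {
        std::lock_guard<std::mutex> g(lock);
        next_free[chunk] = head;
        head = chunk;
    }

    // Acquire an unowned chunk for some DB; nullopt when the pool is exhausted.
    std::optional<uint32_t> acquire()
    {
        std::lock_guard<std::mutex> g(lock);
        if (head == npos) {
            return std::nullopt;
        }
        uint32_t const chunk = head;
        head = next_free[chunk];
        next_free[chunk] = npos;
        return chunk;
    }
};
```

Note the steady-state property the text relies on: a DB that is neither acquiring nor returning chunks never touches the mutex.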
Practical note for v1:
DB Identity and Metadata Backing
Do not add a pool catalog. Existing pools derive chunk geometry from the current footer layout, and existing pools cannot grow `num_cnv_chunks` in place. For the same reason, do not extend the existing pool footer with the new global free-list structure on existing pools. Instead, v1 keeps the global free list in a sidecar file next to the pool.

V1 assumptions:
- `db_id=1` uses the existing on-pool metadata/root-offset backing
- `db_id=2` uses explicitly provided sidecar-backed metadata/root-offset files
- callers pass an explicit `db_id` through every open path

Global Free-List Recovery
The sidecar free-list metadata is authoritative during normal runtime, but it must be rebuildable. On migration-tool startup, or after any detected dirty/corrupt sidecar state:
- enumerate every seq chunk in the pool
- walk each DB's `fast` / `slow` / local-recycle lists to collect owned chunks
- rebuild the global free list from the chunks no DB references

This gives a simple crash-recovery rule without adding explicit owner arrays.
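The rebuild rule can be sketched as a set difference over the chunks each DB claims (hypothetical helper; the real scan would walk the mmapped `db_metadata` lists):

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Every seq chunk not referenced by any DB's fast/slow/local-recycle lists is,
// by the implicit-ownership rule, globally free.
inline std::vector<uint32_t> rebuild_global_free_list(
    uint32_t num_seq_chunks,
    std::vector<std::vector<uint32_t>> const &per_db_owned_chunks)
{
    std::set<uint32_t> owned;
    for (auto const &db : per_db_owned_chunks) {
        owned.insert(db.begin(), db.end());
    }
    std::vector<uint32_t> free_chunks;
    for (uint32_t c = 0; c < num_seq_chunks; ++c) {
        if (owned.count(c) == 0) {
            free_chunks.push_back(c);
        }
    }
    return free_chunks;
}
```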
Recovery assumptions for Claude:
Operations
Allocate chunk for DB X:
- pop the `head` of the global free list under the pool lock
- the DB may place the chunk directly into `fast` / `slow` or park it on the DB-local recycle list first

Reuse chunk inside DB X:
- prefer the DB-local recycle list
- if it is empty and `chunk_limit` allows, allocate from the global pool free list
- if `chunk_limit` would be exceeded, compact / reclaim within DB X and retry the local recycle list before touching the pool again

Free chunk from DB X:
- unlink the chunk from the DB's `fast` / `slow` metadata
- either park it on the DB-local recycle list or push it back onto the global pool free list

Destroy DB X (bulk reclaim):
- return every chunk owned by DB X to the global free list under the pool lock
Bootstrap DB 2 on an existing pool:
- reinterpret the existing `free_list` as DB 1's local recycle list
- open DB 1 with `db_id=1` and DB 2 with `db_id=2`

Bootstrap assumptions:
- DB 1 keeps exclusive ownership of its `fast` / `slow` chunks during bootstrap

Cutover (migration complete):
- tear down every context that still references the old DB (`TrieDb`, source `Db`, `sctx.ro`, server/client contexts)

Per-DB Size Configuration
`chunk_limit` is a per-DB policy. It can live in config and/or be mirrored in DB metadata for observability. When a DB approaches its limit, the policy should prefer compaction and reuse from that DB's recycle list instead of pulling new chunks from the global pool. The temporary migration DB can be capped while the old primary is live. After cutover, the promoted DB just allocates from the same global free list without the cap.

CNV backing is separate and fixed at DB creation time. Existing pools cannot grow `num_cnv_chunks` in place, so v1 still must not depend on repartitioning the pool's CNV region. For existing pools, the current primary DB keeps its existing CNV assignment and the temporary migration DB uses dedicated sidecar-backed metadata/root-offset storage on the same device.

On-Disk Layout
Changes to Existing Code
- `async/storage_pool.hpp`: add `allocate_chunk(db_id)` / `free_chunk_to_pool(chunk)` APIs. Do not add DB-placement semantics here.
- `async/storage_pool.cpp`: implement the global free list behind those APIs.
- `async/io.hpp` / `async/io.cpp`: use `db_id`/path-selected metadata backing instead of hardcoding CNV chunk 0.
- `mpt/db.hpp`: `AsyncIOContext` accepts an optional shared `storage_pool*`, and all open-path config structs carry `db_id`.
- `mpt/db.cpp`: `AsyncIOContext`, `Db`, and `RODb` constructors use the shared pool if provided and resolve metadata backing from the explicit `db_id`/path convention.
- `mpt/db.cpp` (`DbAsyncWorker`): pass `db_id` to worker threads so helper opens do not fall back to the default DB.
- `mpt/update_aux.cpp`: `set_io()` reads metadata backing from the explicit `db_id`/path convention instead of hardcoding chunk 0. `map_root_offsets()` uses the configured backing. Keep a DB-local recycle list in `db_metadata` for owned reusable chunks; `fast` / `slow` remain in `db_metadata`, and new chunks come from the pool only when the recycle list is empty and policy allows.
- `mpt/ondisk_db_config.hpp`: add `db_id` and optional `chunk_limit` to both `OnDiskDbConfig` and `ReadOnlyOnDiskDbConfig`. Valid DB IDs start at 1, and callers must pass it explicitly for every migration-related open.
- `mpt/cli_tool_impl.cpp`: route restore/import through the same backing helpers.
- `cmd/monad_local_statesync.cpp`: the new migration tool.
- `cmd/monad_pool_freelist_sidecar.*` (new): sidecar creation and rebuild tooling.

Phase 1 Assumptions To Keep Tight
- do not change the on-disk layout of `db_metadata` in Phase 1: keep the list structure and chunk-info entries, but reinterpret `free_list` as DB-local recycle
- do not make `storage_pool` aware of `fast` / `slow`

Backwards Compatibility
Single-DB: use `db_id=1`. The global pool lock is only touched when a chunk is acquired or returned, so steady-state single-DB execution should see near-zero overhead. Behavior is otherwise identical to current code.

Suggested Implementation Steps
Phase 0A: Introduce explicit metadata/root-offset backing selection
Goal: remove the hardcoded "CNV chunk 0 / CNV chunk 1+" assumptions before touching allocation.
Scope:
- `async/io.hpp`
- `async/io.cpp`
- `mpt/db.hpp`
- `mpt/db.cpp`
- `mpt/update_aux.cpp`
- `mpt/ondisk_db_config.hpp`
- `mpt/cli_tool_impl.cpp`

Behavior:
- all open paths carry an explicit `db_id`
- behavior for `db_id=1` is unchanged

Phase 1A: Add global pool free-list sidecar for the existing single DB
Goal: introduce the pool-level free-list sidecar and recovery logic without changing open-path identity or statesync.
Scope:
- `async/storage_pool.hpp`
- `async/storage_pool.cpp`
- `cmd/monad_pool_freelist_sidecar.*`
- `async/test/storage_pool.cpp`

Behavior:
- `db_id=1` only
- a one-time conversion reinterprets the current `free_list` as DB-local recycle space and moves globally free chunks into the pool sidecar

Implementation notes:
- keep the changes inside `storage_pool` plus sidecar code
- do not touch `replace_node_writer()` or `UpdateAux::append/remove()` yet
- convert the existing `free_list` once, under exclusive ownership
- triedb still treats `free_list` as the allocator source; that semantic flip happens in Phase 1B

Phase 1B: Switch triedb allocation from DB-local free list to local recycle + pool allocate
Goal: keep single-DB behavior green while changing the actual allocator boundary.
Scope:
- `mpt/detail/db_metadata.hpp`
- `mpt/update_aux.cpp`
- `mpt/trie.cpp`
- `mpt/trie.hpp`
- `mpt/cli_tool_impl.cpp`
- `mpt/test/*` touching free-list assumptions

Behavior:
- `free_list` becomes a local recycle list
- the writer reuses from `db_metadata()->free_list_end()` first and allocates from `storage_pool` when local recycle is empty
- `capacity_in_free_list` and related free-space reporting are updated to reflect the new meaning or split into local-recycle vs global-pool metrics

Implementation notes:
- `UpdateAuxImpl::append()` / `remove()` in `category/mpt/update_aux.cpp`
- `replace_node_writer_to_start_at_new_chunk()` in `category/mpt/trie.cpp`
- `replace_node_writer()` in `category/mpt/trie.cpp`
- `UpdateAux` helpers such as `pop_recycle_chunk()` / `allocate_chunk_for_writer()`

Phase 2: Multi-DB open/backing plumbing
Goal: make it possible to open DB 1 and DB 2 on the same underlying device before statesync exists.
Scope:
- `monad_mpt` setup for two DBs on one device

Behavior:
- `monad_mpt` can initialize DB 1 + DB 2 on the same block device

Implementation notes:
- `cmd/monad/main.cpp` / `cmd/monad_cli.cpp`: extend the `monad_mpt` setup path that currently only knows one DB per pool
- open the migration `Db` path with `db_id=2`
- do not widen the generic `Db` interface: it should be a migration-tool-owned composition of two normal DB contexts

Phase 3: Local statesync and cutover
Goal: build the migration workflow on top of the allocator and multi-DB plumbing.
Scope:
- `cmd/monad_local_statesync.cpp`
- `cmd/CMakeLists.txt`

Behavior:
Critical process assumption:
Phase 3: Multi-DB Local Statesync
In-Memory Bridge (~40 lines)
From `test_statesync.cpp:49-139`: `monad_statesync_client` / `monad_statesync_server_network` structs + four function pointers.

Server Setup
- open one shared `storage_pool` over the device
- open DB 1 writable, wrap it in `TrieDb` → `monad_statesync_server_context`
- open DB 1 read-only with `db_id=1` as well → set as `sctx.ro`

Specific mapping to current code:
- `category/statesync/statesync_server_context.cpp`
- `monad_statesync_server_context::commit()` plus `finalize()`

Commit Blocks Through Server Context
Implement a small replay loop in `monad_local_statesync.cpp` that loads finalized blocks sequentially and commits them through the server-side execution path so `monad_statesync_server_context` accumulates `FinalizedDeletions`. There is no existing `commit_sequential(...)` helper to reuse as-is.

Client Setup + Progressive Statesync
- create the `monad_statesync_client_context` for DB 2
- `handle_new_peer` × 256
- `handle_target(Ti)` → drain → repeat
- `finalize()`

Specific mapping to current code:
- `category/statesync/test/test_statesync.cpp`
- the `sync_from_some` style tests

Optional Shared Worker Thread
If hardware is tight, the migration process may optionally host DB 1 and DB 2 on a single shared triedb worker thread instead of one `DbAsyncWorker` per DB. In that mode, one thread owns two `AsyncIOContext` / `UpdateAux` pairs and dispatches work by `db_id`. This should stay a migration-tool concern rather than a required change to the generic `Db` API.

Cutover
- the migration DB (`db_id=2`) becomes the primary DB backing

Cutover assumptions:
- reclaim waits until every reader has released its old `sctx.ro` clone

Files to Create/Modify
- `cmd/monad_local_statesync.cpp`
- `cmd/CMakeLists.txt`

Reference
- `test_statesync.cpp:49-196` — bridge pattern
- `test_statesync.cpp:311-491` — progressive targets (`sync_from_some`)
- `statesync_server_context.cpp:38-105` — deletion tracking
- `statesync_client.cpp:101-157` — client lifecycle

Verification
- `cmake --build build-claude -j$(nproc)`
- `ctest --test-dir build-claude -R statesync --timeout 30`
- `monad` / `monad-cli` unchanged
- `db_id` plumbing test: open the same pool through writable DB, `RODb`, worker-thread DB, and `sctx.ro`, and verify each path resolves the intended DB slot

Recommended execution order for Claude:
Design References
`db_id`. `db_id` assignment should look like an IDA/XArray-style registry, and old-primary chunk reclaim should happen only after all readers are gone.