diff --git a/README.md b/README.md index 39f8e86..0e4efad 100644 --- a/README.md +++ b/README.md @@ -96,6 +96,24 @@ For R2, add: --r2-account-id ``` +## Clone From Snapshot + +Create a new export lineage from a published remote snapshot. Before the first snapshot, reads are seeded from the source snapshot lazily. The first snapshot of the cloned export uploads a full base blob for the target export and becomes generation `1`. + +```bash +cargo run -- clone \ + --export-id vm02 \ + --cache-dir /var/lib/nbd-server/vm02 \ + --bucket my-bucket \ + --prefix exports/vm02 \ + --source-prefix exports/vm01 \ + --source-snapshot-id 2 \ + --listen 127.0.0.1:10810 \ + --admin-sock /tmp/nbd-server-vm02.sock +``` + +Clone always resolves against a published remote snapshot. If a running `vm01` process has local dirty writes, those writes are ignored by clone. + ## Attach With `nbd-client` The server speaks fixed-newstyle NBD and supports `OPT_GO` for a single export named exactly like `--export-id`. diff --git a/design/clone-from-snapshot.md b/design/clone-from-snapshot.md new file mode 100644 index 0000000..ad5a961 --- /dev/null +++ b/design/clone-from-snapshot.md @@ -0,0 +1,381 @@ +# Clone From Snapshot Design + +## Status + +Proposed. + +This document describes a new clone workflow for creating a new export lineage from an existing export snapshot without introducing long-term remote object sharing between the two exports. + +## Design Principles + +- A snapshot is immutable once published. +- Clone always starts from a published remote snapshot, never from local dirty cache state. +- A cloned export is a new lineage whose own generation starts at `0`. +- The source export and the cloned export must not share mutable remote state. + +## Summary + +We want to support a workflow like: + +- source export: `vm01` +- source snapshot: `2` +- target export: `vm02` + +The target export should behave like a new volume: + +- reads initially see the contents of `vm01@2` +- local writes are tracked only in `vm02`'s cache +- `vm02` starts at generation `0` +- the first snapshot taken on `vm02` becomes `vm02` generation `1` +- that first `vm02` snapshot uploads a full base blob for `vm02` +- after that point, `vm02` is fully independent from `vm01` + +This is intentionally simpler than a Git-like shared-object clone. It avoids cross-export object sharing, so existing per-export garbage collection remains valid. + +## Problem + +The current server can: + +- create a new empty export +- open an existing export at `current` or a specific snapshot generation +- lazily materialize data into a local cache +- publish snapshots and compactions back to object storage + +It cannot create a new export lineage from an existing snapshot. + +The more advanced design would let cloned exports share immutable remote objects until they diverge. That is attractive for storage efficiency, but it creates a real ownership problem: + +- `vm02` would reference objects under `vm01` +- later snapshots or compactions on `vm01` could garbage-collect those objects +- `vm02` would break unless we add global reachability tracking or cross-export reference counting + +That complexity is out of scope for the next step. + +## Goals + +- Support creating a new export lineage from an existing export snapshot. +- Restrict clone sources to immutable remote snapshots only. +- Keep the clone lazy on the read path before the first target snapshot. +- Make the first target snapshot generation `1`. +- Make the cloned export start at generation `0`. 
+- Avoid long-term remote object sharing between source and target exports. +- Preserve the existing mental model that one export lineage owns its own remote objects. +- Keep current snapshot and compaction behavior unchanged for ordinary exports. + +## Non-Goals + +- Cloning from unsnapshotted local dirty state. +- Cross-export remote object deduplication. +- Shared immutable object graphs between exports. +- Global garbage collection across multiple exports. +- Branch merge or rebase semantics. +- Multiple exports in one server process. + +## Chosen Approach + +### High-level model + +Add an explicit clone operation that creates a new local export backed by: + +- a target export identity and target storage prefix +- a local cache directory for the target export +- a seed read source taken from a source export snapshot + +The seed read source is used only for: + +- cache misses before the first target snapshot +- partial write materialization before the first target snapshot + +The target export's own lineage generation starts at `0`, regardless of the source snapshot generation. + +### Key rule + +The first snapshot on the target export always uploads a full base blob for the target export. + +That means: + +- `vm02` generation `1` is self-contained +- `vm02` generation `1` does not reference `vm01` objects +- after `vm02` generation `1`, normal incremental snapshot behavior resumes + +This avoids the need for cross-export reference tracking. + +## Example Lifecycle + +### Clone setup + +Assume: + +- source export: `vm01` +- source snapshot: `2` +- target export: `vm02` + +Clone initialization: + +1. Load the published remote manifest for `vm01@2`. +2. Create a fresh local cache for `vm02`. +3. Record that `vm02` has: + - target lineage generation `0` + - seed read source `vm01@2` +4. Do not create any remote objects for `vm02` yet. + +If a running `vm01` process has local dirty data layered on top of `vm01@2`, that local dirty state is ignored by clone. `clone vm01@2 -> vm02` always means "clone from the published remote snapshot 2", not "clone the current in-memory or local-cache state of a running vm01 server". + +### Reads before first `vm02` snapshot + +On a cache miss: + +1. resolve the chunk from the seed manifest `vm01@2` +2. fetch from object storage +3. write into `vm02` local cache +4. mark resident locally + +### Writes before first `vm02` snapshot + +On a write: + +1. if the chunk is fully overwritten, write locally and mark dirty +2. if the chunk is partially overwritten and not resident, materialize it from the seed manifest first +3. write locally and mark dirty + +This is the same local behavior as ordinary exports. The only difference is the remote read source. + +### First `vm02` snapshot + +When `vm02` takes its first snapshot: + +1. block writes +2. materialize all missing chunks from the seed manifest into local cache +3. sync local data +4. upload a full base blob to `exports/vm02/snapshots/1/base.blob` +5. publish `exports/vm02/snapshots/1/manifest.json` +6. publish `exports/vm02/refs/current.json` +7. clear local dirty bits +8. drop the seed read source from runtime state + +After step 8, `vm02` is a normal independent export. 
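+
+A minimal sketch of the resulting snapshot decision, mirroring the logic this change adds to `src/export.rs`; journaling, write gating, and exact method signatures are abbreviated, and `snapshot_sketch` is an illustrative name rather than the real method:
+
+```rust
+// Simplified sketch: the first snapshot of a clone must publish a full base blob.
+impl Export {
+    async fn snapshot_sketch(&self) -> Result<SnapshotResponse> {
+        // The target's own lineage counter, not the seed snapshot's generation.
+        let next_generation = self.cache.manifest_generation() + 1;
+        let read_source = self.read_source.read().await.clone();
+
+        if self.cache.manifest_generation() == 0 {
+            match read_source.as_ref() {
+                // Ordinary new export: no seed, a sparse initial snapshot is enough.
+                ReadSource::Zero { .. } => {
+                    self.publish_initial_sparse_snapshot(next_generation).await
+                }
+                // Cloned export: pull every missing chunk from the seed manifest,
+                // then upload a self-contained full base blob as generation 1.
+                ReadSource::Manifest(_) => {
+                    self.materialize_all_chunks().await?;
+                    self.publish_full_snapshot(next_generation).await
+                }
+            }
+        } else {
+            // After generation 1 the clone behaves like any other export.
+            self.publish_delta_snapshot(next_generation).await
+        }
+    }
+}
+```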
+
+## Remote Layout
+
+The target export continues to use the existing remote layout:
+
+- `exports/vm02/refs/current.json`
+- `exports/vm02/snapshots/1/base.blob`
+- `exports/vm02/snapshots/1/manifest.json`
+- later:
+  - `exports/vm02/snapshots/N/delta.blob`
+  - `exports/vm02/snapshots/N/manifest.json`
+
+No remote object under `exports/vm02/...` should reference `exports/vm01/...`.
+
+## Runtime State Changes
+
+The current runtime state assumes one remote head whose generation also determines the next snapshot generation.
+
+Clone support needs to split these concepts:
+
+- `read_source`
+  - where cache misses are fetched from
+  - either zero, a manifest from the target export, or a seed manifest from another export
+
+- `lineage_generation`
+  - the target export's own snapshot generation counter
+  - for clones, starts at `0`
+
+Today, these two concepts are conflated in `RemoteHead`.
+
+For clone support, they should become separate fields in the export runtime state.
+
+## Local Metadata
+
+The cache metadata should continue to belong to the target export only.
+
+For a clone, local metadata should store at least:
+
+- target export id
+- image size
+- chunk size
+- target lineage generation
+- normal dirty and resident bitmaps
+
+The seed source does not need to be persisted in `cache.meta` if clone is represented as a dedicated startup mode that reconstructs the seed state from explicit CLI or API input.
+
+If we later want clone state to survive restart before the first snapshot without external arguments, we will need to persist seed source metadata separately.
+
+That persistence is not required for the first implementation.
+
+## Snapshot Semantics
+
+### Source export
+
+No behavior change.
+
+### Target export before first snapshot
+
+- local writes are allowed
+- `FLUSH` still means local durability only
+- remote lineage generation remains `0`
+
+### Target export first snapshot
+
+- always full base upload
+- always generation `1`
+- allowed even when there were no writes
+
+The last point matters. A clone may be created and then snapshotted without any local modifications. We still need a self-contained `vm02` generation `1`.
+
+### Target export after first snapshot
+
+Normal incremental behavior resumes:
+
+- dirty-only delta upload
+- sparse manifest rewrite
+- normal compaction behavior
+
+## Clone Source Validation
+
+Clone input must resolve to one immutable remote source:
+
+- `source_export_id`
+- either:
+  - `source_snapshot_id`, or
+  - remote `current`
+
+Both resolve to a published remote manifest.
+
+The clone operation must reject:
+
+- any request to clone local unsnapshotted dirty state
+- any request that depends on a running process's in-memory cache as the source of truth
+
+If `vm01` is currently opened and has local dirty writes, that does not block cloning `vm01@`. The clone still resolves only against the published remote manifest for that snapshot.
+
+This keeps clone semantics stable and reproducible.
+
+## Garbage Collection
+
+This approach deliberately preserves the current ownership model:
+
+- `vm01` owns `exports/vm01/...`
+- `vm02` owns `exports/vm02/...`
+
+Because the first `vm02` snapshot creates a full independent base blob, the current per-export garbage collection model can remain in place.
+
+No global GC redesign is required for this step.
+
+## API and CLI Shape
+
+The most natural surface is a new explicit clone command.
+ +Example: + +```bash +nbd-server clone \ + --source-export-id vm01 \ + --source-snapshot-id 2 \ + --target-export-id vm02 \ + --cache-dir /var/lib/nbd-server/vm02 \ + --bucket my-bucket \ + --prefix exports/vm02 \ + --listen 127.0.0.1:10810 \ + --admin-sock /tmp/nbd-server-vm02.sock +``` + +Alternative: + +- a local admin API that prepares the clone metadata and then starts a new server instance + +For the first implementation, the CLI command is simpler and easier to reason about. + +## Failure and Recovery + +### Before first target snapshot + +All authoritative state is local: + +- local dirty cache is authoritative +- the seed source is only a read source + +If the process crashes: + +- restart can continue from local cache if clone startup parameters are provided again +- or the local cache can be reset and the clone recreated + +### During first target snapshot + +Behavior should follow the existing full snapshot journal model: + +- stage snapshot intent locally +- upload full base blob +- upload manifest +- publish current ref +- clear dirty bits + +If the first target snapshot fails midway: + +- local dirty state remains authoritative +- target generation stays `0` +- retry is safe + +## Trade-offs + +### Advantages + +- Simple ownership model. +- No cross-export sharing. +- Current GC model remains usable. +- No manifest schema changes required. +- The clone behaves like a real new volume after generation `1`. + +### Disadvantages + +- The first target snapshot is expensive because it must upload a full base blob. +- Clone storage efficiency is worse than a shared-object approach. +- Pre-first-snapshot clone restart behavior may need explicit startup parameters. + +This is acceptable for the next step because it minimizes correctness risk. + +## Implementation Plan + +1. Introduce a clone startup path. +2. Split runtime state into: + - target lineage generation + - remote read source +3. Represent the seed read source as a manifest loaded from another export snapshot. +4. Reuse existing lazy read and write materialization logic against that seed source. +5. Change first snapshot behavior for cloned exports: + - materialize all chunks + - upload full base blob + - publish generation `1` +6. After generation `1`, switch the export into ordinary mode. +7. Add tests for: + - clone read from source snapshot + - partial write on clone before first snapshot + - first clone snapshot creates generation `1` + - first clone snapshot creates only target-owned remote objects + - source export GC does not affect the cloned export + +## Open Questions + +### Clone state persistence before first snapshot + +Do we want a clone to survive restart without having to provide: + +- source export id +- source snapshot id +- target export id + +If yes, we need a small persisted clone-state record in the target cache directory. + +### Read-only clone mode + +For debugging and inspection, a read-only clone mode may be useful. It is not required for the first implementation. + +### Naming + +We may eventually want to separate: + +- remote export identity +- local NBD export name + +That is orthogonal to the clone design described here. 
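+
+As a concrete illustration of the restart question above: the clone-state record only has to name the seed manifest, and it only has to live from clone initialization until the first target snapshot is published. A simplified, self-contained sketch of that lifecycle follows; the field names and the `clone.seed.json` file name match the `CloneSeedRecord` added in `src/export.rs` by this change, while the helper functions here are illustrative only and error handling is abbreviated:
+
+```rust
+use std::path::Path;
+
+use serde::{Deserialize, Serialize};
+
+// Persisted in the target cache directory as `clone.seed.json`.
+#[derive(Serialize, Deserialize)]
+struct CloneSeedRecord {
+    version: u32,
+    // Key of the published source manifest, e.g. vm01's snapshot 2 manifest.
+    source_manifest_key: String,
+}
+
+// On `clone` with a fresh cache directory: resolve the source, persist the record.
+fn init_clone_seed(cache_dir: &Path, source_manifest_key: String) -> std::io::Result<()> {
+    let record = CloneSeedRecord { version: 1, source_manifest_key };
+    let bytes = serde_json::to_vec_pretty(&record).expect("record is serializable");
+    std::fs::write(cache_dir.join("clone.seed.json"), bytes)
+}
+
+// On restart before the first target snapshot: reload it to rebuild the seed read source.
+fn resume_clone_seed(cache_dir: &Path) -> std::io::Result<Option<CloneSeedRecord>> {
+    let path = cache_dir.join("clone.seed.json");
+    if !path.exists() {
+        return Ok(None);
+    }
+    let bytes = std::fs::read(path)?;
+    Ok(serde_json::from_slice(&bytes).ok())
+}
+
+// After the first target snapshot publishes generation 1, the seed is no longer needed.
+fn drop_clone_seed(cache_dir: &Path) {
+    let _ = std::fs::remove_file(cache_dir.join("clone.seed.json"));
+}
+```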
diff --git a/src/config.rs b/src/config.rs index f2f9ecd..d03cbf7 100644 --- a/src/config.rs +++ b/src/config.rs @@ -16,6 +16,7 @@ pub struct Cli { pub enum Commands { Create(ServerConfigArgs), Open(ServerConfigArgs), + Clone(CloneConfigArgs), } #[derive(Debug, Clone, Parser)] @@ -48,6 +49,34 @@ pub struct ServerConfigArgs { pub size: Option, } +#[derive(Debug, Clone, Parser)] +pub struct CloneConfigArgs { + #[arg(long)] + pub export_id: String, + #[arg(long)] + pub cache_dir: PathBuf, + #[arg(long)] + pub bucket: String, + #[arg(long)] + pub prefix: String, + #[arg(long)] + pub source_prefix: String, + #[arg(long)] + pub source_snapshot_id: Option, + #[arg(long)] + pub listen: SocketAddr, + #[arg(long)] + pub admin_sock: PathBuf, + #[arg(long, value_enum, default_value_t = StorageBackendKind::S3)] + pub storage_backend: StorageBackendKind, + #[arg(long, default_value = "us-east-1")] + pub region: String, + #[arg(long)] + pub endpoint_url: Option, + #[arg(long)] + pub r2_account_id: Option, +} + #[derive(Debug, Clone, Copy, PartialEq, Eq, ValueEnum)] pub enum StorageBackendKind { S3, @@ -64,6 +93,12 @@ pub struct StorageConfig { pub r2_account_id: Option, } +#[derive(Debug, Clone)] +pub struct CloneSourceConfig { + pub prefix: String, + pub snapshot_id: Option, +} + #[derive(Debug, Clone)] pub struct ServerConfig { pub export_id: String, @@ -74,6 +109,7 @@ pub struct ServerConfig { pub chunk_size: u64, pub snapshot_id: Option, pub image_size: Option, + pub clone_source: Option, } impl From for ServerConfig { @@ -94,6 +130,33 @@ impl From for ServerConfig { chunk_size: value.chunk_size, snapshot_id: value.snapshot_id, image_size: value.size, + clone_source: None, + } + } +} + +impl From for ServerConfig { + fn from(value: CloneConfigArgs) -> Self { + Self { + export_id: value.export_id, + cache_dir: value.cache_dir, + storage: StorageConfig { + backend: value.storage_backend, + bucket: value.bucket, + prefix: value.prefix.trim_end_matches('/').to_string(), + region: value.region, + endpoint_url: value.endpoint_url, + r2_account_id: value.r2_account_id, + }, + listen: value.listen, + admin_sock: value.admin_sock, + chunk_size: CHUNK_SIZE, + snapshot_id: None, + image_size: None, + clone_source: Some(CloneSourceConfig { + prefix: value.source_prefix.trim_end_matches('/').to_string(), + snapshot_id: value.source_snapshot_id, + }), } } } diff --git a/src/export.rs b/src/export.rs index e058a02..f0dcde8 100644 --- a/src/export.rs +++ b/src/export.rs @@ -1,6 +1,6 @@ use std::collections::{BTreeMap, BTreeSet}; -use std::fs::{OpenOptions, remove_file}; -use std::io::Write; +use std::fs::{File, OpenOptions, remove_file}; +use std::io::{Read, Write}; use std::path::Path; use std::sync::Arc; use std::sync::Mutex as StdMutex; @@ -11,7 +11,7 @@ use serde::{Deserialize, Serialize}; use tokio::sync::{Mutex, RwLock}; use crate::cache::LocalCache; -use crate::config::ServerConfig; +use crate::config::{CloneSourceConfig, ServerConfig}; use crate::error::{Error, Result}; use crate::journal::{JournalOperation, JournalRecord}; use crate::manifest::{ @@ -20,19 +20,15 @@ use crate::manifest::{ use crate::remote::StorageBackend; #[derive(Clone)] -enum RemoteHead { - Zero { - generation: u64, - image_size: u64, - chunk_size: u64, - }, +enum ReadSource { + Zero { image_size: u64, chunk_size: u64 }, Manifest(Manifest), } -impl RemoteHead { +impl ReadSource { fn generation(&self) -> u64 { match self { - Self::Zero { generation, .. } => *generation, + Self::Zero { .. 
} => 0, Self::Manifest(manifest) => manifest.generation, } } @@ -52,13 +48,6 @@ impl RemoteHead { Self::Manifest(manifest) => manifest.chunk_location(index), } } - - fn manifest(&self) -> Option<&Manifest> { - match self { - Self::Manifest(manifest) => Some(manifest), - Self::Zero { .. } => None, - } - } } #[derive(Debug, Serialize)] @@ -100,17 +89,31 @@ struct CurrentRef { manifest_key: String, } +#[derive(Debug, Clone, Serialize, Deserialize)] +struct CloneSeedRecord { + version: u32, + source_manifest_key: String, +} + +#[derive(Clone)] +struct ResolvedManifest { + manifest_key: String, + manifest: Manifest, +} + const SINGLE_PUT_LIMIT_BYTES: u64 = 5 * 1024 * 1024 * 1024; pub struct Export { config: ServerConfig, cache: Arc, remote: Arc, - remote_head: RwLock>, + read_source: RwLock>, + published_manifest: RwLock>, write_gate: RwLock<()>, operation_running: AtomicBool, operation_name: StdMutex>, journal_path: std::path::PathBuf, + clone_seed_path: std::path::PathBuf, chunk_locks: Vec>, } @@ -124,6 +127,11 @@ impl Export { "--snapshot-id is only valid with the open command".to_string(), )); } + if config.clone_source.is_some() { + return Err(Error::InvalidRequest( + "clone source is only valid with the clone command".to_string(), + )); + } if config.image_size.is_none() { return Err(Error::InvalidRequest( "--size is required when creating a new export".to_string(), @@ -150,14 +158,15 @@ impl Export { let chunk_total = cache.chunk_count(); Ok(Arc::new(Self { journal_path: config.cache_dir.join("snapshot.journal.json"), + clone_seed_path: config.cache_dir.join("clone.seed.json"), config, cache: Arc::new(cache), remote, - remote_head: RwLock::new(Arc::new(RemoteHead::Zero { - generation: 0, + read_source: RwLock::new(Arc::new(ReadSource::Zero { image_size, chunk_size, })), + published_manifest: RwLock::new(None), write_gate: RwLock::new(()), operation_running: AtomicBool::new(false), operation_name: StdMutex::new(None), @@ -166,18 +175,14 @@ impl Export { } pub async fn open(config: ServerConfig, remote: Arc) -> Result> { - let manifest_key = if let Some(snapshot_id) = config.snapshot_id { - manifest_key(&config.storage.prefix, snapshot_id) - } else { - let current_ref: CurrentRef = serde_json::from_slice( - &remote - .get_object(¤t_ref_key(&config.storage.prefix)) - .await?, - )?; - current_ref.manifest_key - }; - let manifest: Manifest = serde_json::from_slice(&remote.get_object(&manifest_key).await?)?; - manifest.validate()?; + if config.clone_source.is_some() { + return Err(Error::InvalidRequest( + "clone source is only valid with the clone command".to_string(), + )); + } + let resolved = + resolve_manifest(&*remote, &config.storage.prefix, config.snapshot_id).await?; + let manifest = resolved.manifest; let cache = if config.cache_dir.join("cache.meta").exists() { let cache = LocalCache::open(&config.cache_dir)?; @@ -198,10 +203,12 @@ impl Export { let chunk_total = cache.chunk_count(); Ok(Arc::new(Self { journal_path: config.cache_dir.join("snapshot.journal.json"), + clone_seed_path: config.cache_dir.join("clone.seed.json"), config, cache: Arc::new(cache), remote, - remote_head: RwLock::new(Arc::new(RemoteHead::Manifest(manifest))), + read_source: RwLock::new(Arc::new(ReadSource::Manifest(manifest.clone()))), + published_manifest: RwLock::new(Some(manifest)), write_gate: RwLock::new(()), operation_running: AtomicBool::new(false), operation_name: StdMutex::new(None), @@ -209,6 +216,95 @@ impl Export { })) } + pub async fn clone_from_snapshot( + config: ServerConfig, + remote: Arc, + ) 
-> Result> { + if config.snapshot_id.is_some() { + return Err(Error::InvalidRequest( + "--snapshot-id is only valid with the open command".to_string(), + )); + } + if config.image_size.is_some() { + return Err(Error::InvalidRequest( + "--size is not valid with the clone command".to_string(), + )); + } + let clone_source = config + .clone_source + .clone() + .ok_or_else(|| Error::InvalidRequest("missing clone source".to_string()))?; + + let clone_seed_path = config.cache_dir.join("clone.seed.json"); + let cache_exists = config.cache_dir.join("cache.meta").exists(); + if cache_exists { + let cache = LocalCache::open(&config.cache_dir)?; + Self::recover_local_state(&config.cache_dir, &cache)?; + if cache.manifest_generation() != 0 { + return Err(Error::InvalidRequest( + "clone can only be resumed before the first target snapshot; use open for an initialized export".to_string(), + )); + } + let seed = CloneSeedRecord::load(&clone_seed_path)?.ok_or_else(|| { + Error::InvalidRequest( + "cache directory is missing clone.seed.json; cannot safely resume clone state" + .to_string(), + ) + })?; + let resolved = resolve_manifest_by_key(&*remote, &seed.source_manifest_key).await?; + cache.validate_layout( + &config.export_id, + resolved.manifest.image_size, + resolved.manifest.chunk_size, + )?; + cache.set_clean_shutdown(false)?; + let chunk_total = cache.chunk_count(); + return Ok(Arc::new(Self { + journal_path: config.cache_dir.join("snapshot.journal.json"), + clone_seed_path, + config, + cache: Arc::new(cache), + remote, + read_source: RwLock::new(Arc::new(ReadSource::Manifest(resolved.manifest))), + published_manifest: RwLock::new(None), + write_gate: RwLock::new(()), + operation_running: AtomicBool::new(false), + operation_name: StdMutex::new(None), + chunk_locks: (0..chunk_total).map(|_| Mutex::new(())).collect(), + })); + } else { + let resolved = resolve_manifest_for_clone(&*remote, &clone_source).await?; + CloneSeedRecord { + version: 1, + source_manifest_key: resolved.manifest_key.clone(), + } + .persist(&clone_seed_path)?; + let cache = LocalCache::create( + &config.cache_dir, + config.export_id.clone(), + resolved.manifest.image_size, + resolved.manifest.chunk_size, + )?; + Self::recover_local_state(&config.cache_dir, &cache)?; + cache.set_clean_shutdown(false)?; + + let chunk_total = cache.chunk_count(); + return Ok(Arc::new(Self { + journal_path: config.cache_dir.join("snapshot.journal.json"), + clone_seed_path, + config, + cache: Arc::new(cache), + remote, + read_source: RwLock::new(Arc::new(ReadSource::Manifest(resolved.manifest))), + published_manifest: RwLock::new(None), + write_gate: RwLock::new(()), + operation_running: AtomicBool::new(false), + operation_name: StdMutex::new(None), + chunk_locks: (0..chunk_total).map(|_| Mutex::new(())).collect(), + })); + } + } + pub fn image_size(&self) -> u64 { self.cache.image_size() } @@ -318,7 +414,7 @@ impl Export { } pub async fn status(&self) -> Status { - let remote_head = self.remote_head.read().await; + let read_source = self.read_source.read().await; let operation_state = self .operation_name .lock() @@ -334,7 +430,7 @@ impl Export { resident_chunks: self.cache.resident_count(), dirty_chunks: self.cache.dirty_count(), snapshot_generation: self.cache.manifest_generation(), - remote_head_generation: remote_head.generation(), + remote_head_generation: read_source.generation(), operation_state, } } @@ -364,17 +460,42 @@ impl Export { let _snapshot_guard = self.write_gate.write().await; self.cache.set_snapshot_in_progress(true)?; 
self.cache.sync_data()?; - let remote_head = self.remote_head.read().await.clone(); - let next_generation = remote_head.generation() + 1; - let previous_keys = remote_head - .manifest() + let next_generation = self.cache.manifest_generation() + 1; + let read_source = self.read_source.read().await.clone(); + let published_manifest = self.published_manifest.read().await.clone(); + let previous_keys = published_manifest + .as_ref() .map(|manifest| manifest.referenced_object_keys()); - let result = if remote_head.generation() == 0 { - self.publish_initial_sparse_snapshot(next_generation).await + let result = if self.cache.manifest_generation() == 0 { + match read_source.as_ref() { + ReadSource::Zero { .. } => { + self.publish_initial_sparse_snapshot(next_generation).await + } + ReadSource::Manifest(_) => { + self.materialize_all_chunks().await?; + self.cache.sync_data()?; + self.publish_full_snapshot( + next_generation, + JournalOperation::Snapshot, + format!( + "{}/snapshots/{next_generation}/base.blob", + self.config.storage.prefix + ), + None, + ) + .await + } + } } else { - self.publish_delta_snapshot(next_generation, remote_head, previous_keys) - .await + self.publish_delta_snapshot( + next_generation, + published_manifest.ok_or_else(|| { + Error::InvalidManifest("missing published manifest".to_string()) + })?, + previous_keys, + ) + .await }; self.finish_publish(result) @@ -385,10 +506,12 @@ impl Export { self.cache.set_snapshot_in_progress(true)?; self.materialize_all_chunks().await?; self.cache.sync_data()?; - let remote_head = self.remote_head.read().await.clone(); - let generation = remote_head.generation() + 1; - let previous_keys = remote_head - .manifest() + let generation = self.cache.manifest_generation() + 1; + let previous_keys = self + .published_manifest + .read() + .await + .as_ref() .map(|manifest| manifest.referenced_object_keys()); let result = self @@ -429,6 +552,9 @@ impl Export { clear_journal?; clear_flag?; if response.snapshot_created { + if self.cache.manifest_generation() == 0 && response.generation == 1 { + remove_file(&self.clone_seed_path).ok(); + } self.cache.set_manifest_generation(response.generation)?; self.cache.clear_dirty_all()?; } @@ -608,14 +734,14 @@ impl Export { async fn publish_delta_snapshot( &self, generation: u64, - remote_head: Arc, + published_manifest: Manifest, previous_keys: Option>, ) -> Result { let dirty = self.cache.dirty_indices(); if dirty.is_empty() { return Ok(SnapshotResponse { snapshot_created: false, - generation: remote_head.generation(), + generation: self.cache.manifest_generation(), garbage_collected_objects: 0, }); } @@ -686,10 +812,7 @@ impl Export { object_key = %delta_key, "finished delta snapshot upload" ); - let manifest = remote_head - .manifest() - .ok_or_else(|| Error::InvalidManifest("missing remote manifest".to_string()))? 
- .with_new_ref(generation, delta_key, replacements)?; + let manifest = published_manifest.with_new_ref(generation, delta_key, replacements)?; let gc = self.publish_manifest(manifest, previous_keys).await?; remove_file(&delta_path).ok(); @@ -724,7 +847,8 @@ impl Export { .await?; let new_keys = manifest.referenced_object_keys(); - *self.remote_head.write().await = Arc::new(RemoteHead::Manifest(manifest)); + *self.read_source.write().await = Arc::new(ReadSource::Manifest(manifest.clone())); + *self.published_manifest.write().await = Some(manifest); Ok(self .garbage_collect_unreferenced_objects(previous_keys, &new_keys) .await) @@ -779,8 +903,8 @@ impl Export { } async fn fetch_chunk_bytes(&self, index: u64) -> Result> { - let remote_head = self.remote_head.read().await.clone(); - let location = remote_head.chunk_location(index)?; + let read_source = self.read_source.read().await.clone(); + let location = read_source.chunk_location(index)?; match location.source { ChunkSource::Zero => Ok(vec![0_u8; location.logical_len as usize]), ChunkSource::Ref => Ok(self @@ -820,6 +944,61 @@ impl Export { } } +impl CloneSeedRecord { + fn load(path: &Path) -> Result> { + if !path.exists() { + return Ok(None); + } + let mut bytes = Vec::new(); + File::open(path)?.read_to_end(&mut bytes)?; + Ok(Some(serde_json::from_slice(&bytes)?)) + } + + fn persist(&self, path: &Path) -> Result<()> { + if let Some(parent) = path.parent() { + std::fs::create_dir_all(parent)?; + } + let mut file = File::create(path)?; + file.write_all(&serde_json::to_vec_pretty(self)?)?; + file.sync_all()?; + Ok(()) + } +} + +async fn resolve_manifest( + remote: &dyn StorageBackend, + prefix: &str, + snapshot_id: Option, +) -> Result { + let manifest_key = if let Some(snapshot_id) = snapshot_id { + manifest_key(prefix, snapshot_id) + } else { + let current_ref: CurrentRef = + serde_json::from_slice(&remote.get_object(¤t_ref_key(prefix)).await?)?; + current_ref.manifest_key + }; + resolve_manifest_by_key(remote, &manifest_key).await +} + +async fn resolve_manifest_for_clone( + remote: &dyn StorageBackend, + clone_source: &CloneSourceConfig, +) -> Result { + resolve_manifest(remote, &clone_source.prefix, clone_source.snapshot_id).await +} + +async fn resolve_manifest_by_key( + remote: &dyn StorageBackend, + manifest_key: &str, +) -> Result { + let manifest: Manifest = serde_json::from_slice(&remote.get_object(manifest_key).await?)?; + manifest.validate()?; + Ok(ResolvedManifest { + manifest_key: manifest_key.to_string(), + manifest, + }) +} + fn current_ref_key(prefix: &str) -> String { format!("{prefix}/refs/current.json") } @@ -903,6 +1082,31 @@ mod tests { chunk_size: 4, snapshot_id: None, image_size: Some(8), + clone_source: None, + } + } + + fn clone_config(dir: &Path) -> ServerConfig { + ServerConfig { + export_id: "clone".to_string(), + cache_dir: dir.join("clone-cache"), + storage: crate::config::StorageConfig { + backend: crate::config::StorageBackendKind::S3, + bucket: "bucket".to_string(), + prefix: "exports/clone".to_string(), + region: "us-east-1".to_string(), + endpoint_url: None, + r2_account_id: None, + }, + listen: "127.0.0.1:10810".parse().unwrap(), + admin_sock: dir.join("clone-admin.sock"), + chunk_size: 4, + snapshot_id: None, + image_size: None, + clone_source: Some(crate::config::CloneSourceConfig { + prefix: "exports/export".to_string(), + snapshot_id: Some(2), + }), } } @@ -1053,6 +1257,118 @@ mod tests { assert_eq!(export.read(0, 8).await.unwrap(), b"abcdefgh"); } + #[tokio::test] + async fn 
clone_reads_from_source_snapshot_and_starts_at_generation_zero() { + let dir = tempdir().unwrap(); + let remote = Arc::new(MemoryRemote::default()); + remote + .put_bytes( + "exports/export/base/full.blob", + Bytes::from_static(b"abcdefgh"), + ) + .await + .unwrap(); + remote + .put_bytes( + "exports/export/snapshots/2/manifest.json", + Bytes::from_static( + br#"{ + "version":2, + "export_id":"export", + "generation":2, + "image_size":8, + "chunk_size":4, + "chunk_count":2, + "created_at":"2026-03-07T00:00:00Z", + "base_ref":1, + "refs":[{"id":1,"path":"exports/export/base/full.blob"}], + "entries":[] + }"#, + ), + ) + .await + .unwrap(); + + let clone = Export::clone_from_snapshot(clone_config(dir.path()), remote) + .await + .unwrap(); + + let status = clone.status().await; + assert_eq!(status.snapshot_generation, 0); + assert_eq!(status.remote_head_generation, 2); + assert_eq!(clone.read(0, 8).await.unwrap(), b"abcdefgh"); + } + + #[tokio::test] + async fn first_clone_snapshot_uploads_full_base_blob_and_cuts_over_to_target_lineage() { + let dir = tempdir().unwrap(); + let remote = Arc::new(MemoryRemote::default()); + remote + .put_bytes( + "exports/export/base/full.blob", + Bytes::from_static(b"abcdefgh"), + ) + .await + .unwrap(); + remote + .put_bytes( + "exports/export/snapshots/2/manifest.json", + Bytes::from_static( + br#"{ + "version":2, + "export_id":"export", + "generation":2, + "image_size":8, + "chunk_size":4, + "chunk_count":2, + "created_at":"2026-03-07T00:00:00Z", + "base_ref":1, + "refs":[{"id":1,"path":"exports/export/base/full.blob"}], + "entries":[] + }"#, + ), + ) + .await + .unwrap(); + + let clone = Export::clone_from_snapshot(clone_config(dir.path()), remote.clone()) + .await + .unwrap(); + clone.write(0, b"WXYZ", false).await.unwrap(); + + let snapshot = clone.snapshot().await.unwrap(); + assert_eq!(snapshot.generation, 1); + + let status = clone.status().await; + assert_eq!(status.snapshot_generation, 1); + assert_eq!(status.remote_head_generation, 1); + + let manifest = remote + .get_object("exports/clone/snapshots/1/manifest.json") + .await + .unwrap(); + let manifest: serde_json::Value = serde_json::from_slice(&manifest).unwrap(); + assert_eq!(manifest["generation"], 1); + assert_eq!(manifest["base_ref"], 1); + assert_eq!( + manifest["refs"][0]["path"], + "exports/clone/snapshots/1/base.blob" + ); + assert_eq!( + remote + .get_object("exports/clone/snapshots/1/base.blob") + .await + .unwrap(), + Bytes::from_static(b"WXYZefgh") + ); + assert!( + !dir.path() + .join("clone-cache") + .join("clone.seed.json") + .exists() + ); + } + #[tokio::test] async fn reopening_same_cache_dir_on_new_snapshot_invalidates_clean_resident_chunks() { let dir = tempdir().unwrap(); diff --git a/src/main.rs b/src/main.rs index c028ce5..f41e881 100644 --- a/src/main.rs +++ b/src/main.rs @@ -4,6 +4,12 @@ use nbd_server::config::ServerConfig; use nbd_server::remote::build_storage_backend; use nbd_server::{Cli, Commands}; +enum StartupMode { + Create, + Open, + Clone, +} + #[tokio::main] async fn main() -> nbd_server::Result<()> { tracing_subscriber::fmt() @@ -13,16 +19,19 @@ async fn main() -> nbd_server::Result<()> { .init(); let cli = Cli::parse(); - let (config, create_mode) = match cli.command { - Commands::Create(args) => (ServerConfig::from(args), true), - Commands::Open(args) => (ServerConfig::from(args), false), + let (config, mode) = match cli.command { + Commands::Create(args) => (ServerConfig::from(args), StartupMode::Create), + Commands::Open(args) => (ServerConfig::from(args), 
StartupMode::Open), + Commands::Clone(args) => (ServerConfig::from(args), StartupMode::Clone), }; let remote = build_storage_backend(&config.storage).await?; - let export = if create_mode { - nbd_server::export::Export::create(config.clone(), remote).await? - } else { - nbd_server::export::Export::open(config.clone(), remote).await? + let export = match mode { + StartupMode::Create => nbd_server::export::Export::create(config.clone(), remote).await?, + StartupMode::Open => nbd_server::export::Export::open(config.clone(), remote).await?, + StartupMode::Clone => { + nbd_server::export::Export::clone_from_snapshot(config.clone(), remote).await? + } }; let admin_socket = config.admin_sock.clone();