refactor(mesh): delete v1 internals (-10k lines) #1486
Conversation
PR #1476 cut every gateway-side caller of v1 mesh. The internal v1 modules stayed because controller/ping_server/service still held v1 fields and dispatched v1 wire messages onto v1 stores nobody read from. This commit finishes the cleanup.

Deleted: sync.rs, stores.rs, tree_ops.rs, collector.rs, consistent_hash.rs, rate_limit_window.rs, node_state_machine.rs, tests/comprehensive.rs.

Trimmed: controller.rs (v1 fields + dispatch arms + periodic ticks); service.rs (sync_manager / stores / NodeStateMachine / start_rate_limit_task); ping_server.rs (delete create_snapshot_chunks, rewrite sync_stream from 1047 lines to ~170); flow_control.rs (MessageSizeValidator); metrics.rs (record_batch_sent and friends); crdt_kv (drop upsert/try_upsert/try_upsert_if from CrdtOrMap).

Build: cargo clippy --all-targets -p smg-mesh -p smg clean. Tests: 112 smg-mesh + 835 smg pass. Net: roughly -9,300 lines.

Signed-off-by: Chang Su <[email protected]>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

This PR removes the legacy v1 gossip incremental state-sync stack (collector, stores, tree ops, rate-limit window, related metrics/flow-control and tests) and refactors the controller, ping server, and service layers to use the v2 StreamBatch sync path via MeshKV; v1 wire messages are ignored with debug logs.

Changes: Mesh v1 → v2 stream-batch migration
```mermaid
sequenceDiagram
    participant Controller as Controller
    participant GossipService as GossipService
    participant MeshKV as MeshKV
    participant Peer as Peer
    Controller->>MeshKV: with_mesh_kv() attach
    Controller->>GossipService: share current_stream_batch
    loop periodic round
        Controller->>MeshKV: collect_round_batch()
        MeshKV-->>Controller: RoundBatch
        Controller->>GossipService: expose stream entries
    end
    GossipService->>Peer: send StreamBatch (periodic/drain)
    Peer->>GossipService: send StreamBatch / Heartbeat / Ack / Nack
    GossipService->>MeshKV: dispatch inbound StreamBatch
    MeshKV-->>GossipService: apply updates
```
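The periodic round in the diagram above can be sketched std-only. `MeshKv`, `RoundBatch`, and `collect_round_batch` here are minimal hypothetical stand-ins mirroring the diagram's names, not the crate's real types:

```rust
use std::sync::Mutex;

#[derive(Clone, Debug, Default, PartialEq)]
struct RoundBatch {
    entries: Vec<(String, Vec<u8>)>, // key → encoded value
}

struct MeshKv {
    pending: Mutex<Vec<(String, Vec<u8>)>>,
}

impl MeshKv {
    fn new() -> Self {
        MeshKv { pending: Mutex::new(Vec::new()) }
    }

    fn write(&self, key: &str, value: &[u8]) {
        self.pending.lock().unwrap().push((key.to_string(), value.to_vec()));
    }

    // Drain everything written since the last round into one batch.
    fn collect_round_batch(&self) -> RoundBatch {
        RoundBatch { entries: std::mem::take(&mut *self.pending.lock().unwrap()) }
    }
}

fn main() {
    let kv = MeshKv::new();
    kv.write("workers/w1", b"alive");
    kv.write("policies/p1", b"v2");

    // One periodic round: collect, then expose entries to the sender side.
    let batch = kv.collect_round_batch();
    assert_eq!(batch.entries.len(), 2);
    // A round with no new writes yields an empty batch.
    assert!(kv.collect_round_batch().entries.is_empty());
}
```

The gossip sender would then emit the exposed entries as a StreamBatch frame, per the diagram's periodic/drain path.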
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 5 checks passed
Clean deletion of v1 mesh internals (~10k lines). Reviewed the surviving files — no dangling references to deleted modules/types, removed public API methods (write_data, read_data, is_ready, state_machine, etc.) have no remaining callers in the gateway, and v1 wire messages are safely logged-and-ignored for mixed-version transition. One minor nit on a stale doc comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/mesh/src/ping_server.rs`:
- Around line 317-334: The code currently calls
update_peer_connections(&peer_id, true) when peer_id is still empty, causing a
phantom metric for ""; remove that early call and only invoke
update_peer_connections once after the real peer id is learned from the first
inbound message (the block that assigns peer_id from msg.peer_id and writes
learned_peer_inbound). Ensure no other early calls use the uninitialized peer_id
(check the scope around incoming.next().await and the variables peer_id,
learned_peer_inbound, and update_peer_connections) so metrics are recorded only
for the actual peer id.
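The fix the prompt describes reduces to a small std-only sketch. `learned_metric_ids` is a hypothetical distillation of the receive loop; the real code would call `update_peer_connections` at the marked point:

```rust
// Simulate the inbound frame ids seen by the stream handler and return the
// ids for which a connection metric would be recorded.
fn learned_metric_ids(inbound: &[&str]) -> Vec<String> {
    let mut recorded = Vec::new();
    let mut peer_id = String::new();
    for id in inbound {
        if peer_id.is_empty() && !id.is_empty() {
            peer_id = id.to_string();
            // This is the single point where update_peer_connections(&peer_id, true)
            // would fire: after the real id is learned, never for "".
            recorded.push(peer_id.clone());
        }
    }
    recorded
}

fn main() {
    // Frames before an id is learned record nothing; no phantom "" metric.
    assert!(learned_metric_ids(&["", ""]).is_empty());
    // The metric fires exactly once, with the learned id.
    assert_eq!(learned_metric_ids(&["", "node-a", "node-a"]), vec!["node-a"]);
}
```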
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ef904632-69e3-4548-8a26-f826d0992b4f
📒 Files selected for processing (19)
- crates/mesh/src/collector.rs
- crates/mesh/src/consistent_hash.rs
- crates/mesh/src/controller.rs
- crates/mesh/src/crdt_kv/crdt.rs
- crates/mesh/src/crdt_kv/mod.rs
- crates/mesh/src/flow_control.rs
- crates/mesh/src/lib.rs
- crates/mesh/src/metrics.rs
- crates/mesh/src/node_state_machine.rs
- crates/mesh/src/ping_server.rs
- crates/mesh/src/rate_limit_window.rs
- crates/mesh/src/service.rs
- crates/mesh/src/stores.rs
- crates/mesh/src/sync.rs
- crates/mesh/src/tests/comprehensive.rs
- crates/mesh/src/tests/mod.rs
- crates/mesh/src/tests/test_utils.rs
- crates/mesh/src/topology.rs
- crates/mesh/src/tree_ops.rs
💤 Files with no reviewable changes (13)
- crates/mesh/src/node_state_machine.rs
- crates/mesh/src/sync.rs
- crates/mesh/src/topology.rs
- crates/mesh/src/rate_limit_window.rs
- crates/mesh/src/tree_ops.rs
- crates/mesh/src/stores.rs
- crates/mesh/src/collector.rs
- crates/mesh/src/consistent_hash.rs
- crates/mesh/src/lib.rs
- crates/mesh/src/flow_control.rs
- crates/mesh/src/tests/comprehensive.rs
- crates/mesh/src/crdt_kv/crdt.rs
- crates/mesh/src/metrics.rs
Code Review
This pull request performs a significant refactoring of the mesh synchronization logic, removing the legacy CentralCollector, PeerWatermark, NodeStateMachine, and TopologyManager components in favor of a more streamlined stream-based approach. The GossipService and MeshController have been updated to remove these dependencies, and the StateStores and MeshSyncManager have been removed as part of this cleanup. My feedback highlights that the removal of the 60-second idle timeout in sync_stream could lead to resource leaks from inactive clients, and I recommend re-introducing an idle timeout mechanism.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce8100aa7e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Follow-up review nits on PR #1486's `sync_stream` rewrite:

- `service.rs:70` (claude): the `/// Get state machine` doc comment was orphaned to `should_serve()` when `state_machine()` got deleted. Drop it.
- `ping_server.rs:316` (gemini): the old `sync_stream` wrapped `incoming.next()` in `tokio::time::timeout(STREAM_IDLE_TIMEOUT, …)` so an idle client couldn't pin the server-side task and mpsc channel. The rewrite dropped the wrapper; restore it.
- `ping_server.rs:342` (claude): drop the redundant `let next = match …; let msg = next;` and bind `msg` directly.
- `ping_server.rs:348` (codex, P1): the rewrite dropped main's peer-identity-stability check. Restore: bind `peer_id` to the first non-empty inbound id, then break out of the loop on any later frame whose `msg.peer_id` differs (including empty). Same warn-and-close semantics as main.

CodeRabbit also flagged a phantom-metric pattern in the same file (`update_peer_connections("", true)` runs before peer_id is learned). That code is carried over from `main` unchanged, not a regression introduced by this PR; leaving as-is for a focused follow-up.

Build: cargo clippy --all-targets -p smg-mesh clean. Tests: 112 smg-mesh pass.

Signed-off-by: Chang Su <[email protected]>
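The P1 peer-identity-stability rule distills to a small function. `frames_before_close` is a hypothetical stand-in for the receive loop, returning how many frames are processed before the warn-and-close would trigger:

```rust
// Bind to the first non-empty inbound id, then close the stream on any later
// frame whose id differs (including empty), mirroring main's semantics.
fn frames_before_close(ids: &[&str]) -> usize {
    let mut peer_id: Option<&str> = None;
    for (i, id) in ids.iter().enumerate() {
        match peer_id {
            // Learn the identity from the first non-empty frame.
            None if !id.is_empty() => peer_id = Some(*id),
            None => {} // still anonymous, keep reading
            // Identity changed (or went empty): warn and close here.
            Some(bound) if *id != bound => return i,
            Some(_) => {}
        }
    }
    ids.len()
}

fn main() {
    // Id learned at frame 1; frame 3 switches ids, so the stream closes there.
    assert_eq!(frames_before_close(&["", "a", "a", "b"]), 3);
    // A later empty id also counts as an identity change.
    assert_eq!(frames_before_close(&["a", ""]), 1);
    // A stable peer keeps the stream open for all frames.
    assert_eq!(frames_before_close(&["a", "a", "a"]), 3);
}
```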
Force-pushed: ce8100a to 5e8da6f
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/mesh/src/ping_server.rs`:
- Around line 242-256: The current logic sets last_stream_batch (from
stream_batch_handle.read().clone()) before the code knows whether
learned_peer_sender is Some, which can cause targeted entries to be skipped once
a peer is later learned; instead, defer updating last_stream_batch until after
targeted delivery for the current peer has been attempted (i.e., only set
last_stream_batch after learned_peer_sender is Some and after emitting
publish_to for stream_batch.targeted_entries), or alternatively add separate
seen bookkeeping for targeted delivery (track which RoundBatch or targeted entry
tuples have been delivered) so the broadcast freshness check (last_stream_batch,
stream_batch_handle) does not suppress later peer-dependent sends; update the
code around last_stream_batch, learned_peer_sender, has_targeted, and
stream_batch.targeted_entries to implement one of these approaches.
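The "defer updating last_stream_batch" option reads roughly as follows. `Batch`, `deliver_targeted`, and the round-number freshness marker are simplified assumptions standing in for `RoundBatch`, the sender loop, and `last_stream_batch`:

```rust
#[derive(Clone)]
struct Batch {
    round: u64,
    targeted: Vec<(String, String)>, // (target peer, entry)
}

fn deliver_targeted(
    batch: &Batch,
    last_seen_round: &mut Option<u64>,
    learned_peer: Option<&str>,
) -> Vec<String> {
    // Freshness check: skip only rounds already handled for a known peer.
    if *last_seen_round == Some(batch.round) {
        return Vec::new();
    }
    let mut sent = Vec::new();
    if let Some(peer) = learned_peer {
        for (target, entry) in &batch.targeted {
            if target.as_str() == peer {
                sent.push(entry.clone());
            }
        }
        // Advance the marker only after targeted delivery has been attempted,
        // so a batch seen while the peer was still anonymous is not lost.
        *last_seen_round = Some(batch.round);
    }
    sent
}

fn main() {
    let mut last_seen = None;
    let batch = Batch {
        round: 7,
        targeted: vec![("node-a".into(), "entry-1".into())],
    };
    // Batch arrives before the peer id is learned: nothing sent, marker untouched.
    assert!(deliver_targeted(&batch, &mut last_seen, None).is_empty());
    assert_eq!(last_seen, None);
    // Once the peer is learned, the same batch still delivers its targeted entry.
    assert_eq!(deliver_targeted(&batch, &mut last_seen, Some("node-a")), vec!["entry-1"]);
    assert_eq!(last_seen, Some(7));
}
```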
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 82d3df78-cd9d-42ff-a342-abf375aa6d3a
📒 Files selected for processing (2)
- crates/mesh/src/ping_server.rs
- crates/mesh/src/service.rs
Follow-up to the v1 trim (claude nit on r3237365164). After the v1 recording functions were deleted, `init_mesh_metrics()` still registered `describe_*!` calls for metrics with no writers, and `ConvergenceTracker` + `record_convergence_latency` were orphaned when `NodeStateMachine` went.

Removed describes (no recorder remains): router_mesh_convergence_ms, router_mesh_batches_total, router_mesh_bytes_total, router_mesh_snapshot_trigger_total, router_mesh_snapshot_duration_seconds, router_mesh_snapshot_bytes_total, router_mesh_sync_batch_bytes, router_mesh_store_workers, router_mesh_store_policies, router_mesh_store_memberships, router_mesh_store_apps.

Removed code: ConvergenceTracker struct + impls; record_convergence_latency.

Retained describes whose recorders are still alive (`update_peer_connections`, `record_peer_reconnect`, `record_ack`, `record_nack`, `record_sync_round_duration`) and the four `#[expect(dead_code)]`-marked drift / cardinality gauges that are explicitly kept as scaffolding.

Build: cargo clippy --all-targets -p smg-mesh clean. Tests: 112 smg-mesh pass.

Signed-off-by: Chang Su <[email protected]>
`BackpressureController` and `BACKPRESSURE_THRESHOLD` shipped in the original v1 mesh contribution (4e108ef, Tony Lu, 2026-01-14) as scaffolding for a "warn when channel ≥80% full" feature. Both methods (`can_send`, `remaining_capacity`) were `#[expect(dead_code)]` from the start and never wired anywhere: no call site logs or emits a metric based on channel-pressure lookahead.

The hard-backpressure path the codebase actually relies on is `mpsc::Sender::try_send` returning `TrySendError::Full` → drop-with-application-retry, which doesn't need this helper. If a warning is wanted in future, modern tokio exposes `Sender::capacity()` / `Sender::max_capacity()` directly; a helper struct isn't needed.

Also dropped the `pub backpressure_threshold: f64` field from the `MeshConfig` example in `mesh-v2-implementation-spec.md §10` since it was the docs-side counterpart to this dead scaffolding.

Build: cargo clippy --all-targets -p smg-mesh clean. Tests: 112 smg-mesh pass.

Signed-off-by: Chang Su <[email protected]>
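If the ≥80% warning were ever wanted, the whole check reduces to two numbers. This sketch uses plain integers standing in for tokio's `Sender::capacity()` (remaining slots) and `Sender::max_capacity()`:

```rust
const BACKPRESSURE_THRESHOLD: f64 = 0.8;

// remaining/max stand in for tokio's Sender::capacity() / Sender::max_capacity().
fn channel_pressured(remaining: usize, max: usize) -> bool {
    let used = max - remaining;
    used as f64 / max as f64 >= BACKPRESSURE_THRESHOLD
}

fn main() {
    assert!(channel_pressured(2, 10)); // 8/10 used: at the 80% threshold
    assert!(!channel_pressured(5, 10)); // 5/10 used: below threshold
    assert!(channel_pressured(0, 10)); // full channel
}
```

A caller would log or bump a gauge when this returns true, which is exactly the feature the dead helper struct was scaffolding for.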
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 893486b2d5
```rust
                break;
            }
        }
        StreamMessageType::Ack => record_ack(&peer_id, true),
```
Preserve ACK success flag when recording metrics
This branch records every inbound Ack as success, which drops the ack.success information carried in the payload and makes failed ACKs indistinguishable from successful ones in router_mesh_peer_ack_total. In environments where peers send Ack { success: false } (e.g., partial failures or protocol errors), this will under-report failures and can mask real sync-health regressions in monitoring.
Addressed in 9b2c9f7. The arm now decodes the StreamAck payload and forwards ack.success to record_ack — matches main's behaviour (record_ack(&peer_id, ack.success) at ping_server.rs:1450 on main). Missing/wrong-variant payloads are still skipped (no metric recorded), so the previous trim's tendency to silently mark every Ack as success is gone.
The sync_stream v2 trim collapsed the inbound `Ack` arm to a hard-coded `record_ack(&peer_id, true)`, dropping the `success` flag carried in `StreamAck`. On main the arm decodes the payload and forwards `ack.success`, so failed ACKs land under the `status=failure` label of `router_mesh_peer_ack_total`. The trim made failed ACKs indistinguishable from successful ones, which hides real sync-health regressions in monitoring. Restore main's behaviour: read the `StreamAck` payload via `if let` and pass `ack.success` through. Missing/wrong-variant payloads are skipped (no metric recorded), matching main. Signed-off-by: Chang Su <[email protected]>
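The restored arm's logic, reduced to a std-only sketch: the real code matches on the wire `StreamAck` payload, while `Payload` and `ack_metric` here are simplified stand-ins.

```rust
enum Payload {
    Ack { success: bool },
    Other,
}

// Returns the flag that would be forwarded to record_ack(&peer_id, ack.success);
// None means the payload was missing or the wrong variant, so no metric is recorded.
fn ack_metric(payload: Option<&Payload>) -> Option<bool> {
    if let Some(Payload::Ack { success }) = payload {
        Some(*success)
    } else {
        None
    }
}

fn main() {
    // Failed ACKs are no longer collapsed into successes.
    assert_eq!(ack_metric(Some(&Payload::Ack { success: false })), Some(false));
    assert_eq!(ack_metric(Some(&Payload::Ack { success: true })), Some(true));
    // Missing or wrong-variant payloads are skipped, matching main.
    assert_eq!(ack_metric(Some(&Payload::Other)), None);
    assert_eq!(ack_metric(None), None);
}
```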
Description
Problem
PR #1476 cut every gateway-side caller of v1 mesh. The internal v1 modules stayed because `controller.rs`/`ping_server.rs`/`service.rs` still held v1 fields and dispatched v1 wire messages onto v1 stores nobody read from. The mesh crate carried ~9 KLoC of dead-but-compile-only code, and reviewers (and the cluster) couldn't tell which paths were actually live.

Solution
Finish the v1 removal inside the mesh crate. v2 transport (membership, ping/SWIM, stream batches via `MeshKV`) stays intact and keeps gossiping. v1 store-replication paths are gone end-to-end.

Changes
Deleted entire files (orphaned now that the transport stops feeding them):
- `crates/mesh/src/sync.rs`: `MeshSyncManager` and apply-remote-* handlers
- `crates/mesh/src/stores.rs`: `StateStores`, `AppState`, `MembershipState`, `WorkerState`, `PolicyState`, `StoreType`
- `crates/mesh/src/tree_ops.rs`: `TreeState`, `TenantDelta`, lz4 helpers
- `crates/mesh/src/collector.rs`: `CentralCollector`, `PeerWatermark`, v1 `RoundBatch`
- `crates/mesh/src/consistent_hash.rs`
- `crates/mesh/src/rate_limit_window.rs`
- `crates/mesh/src/node_state_machine.rs`
- `crates/mesh/src/topology.rs` (was already orphan, prior commit)
- `crates/mesh/src/tests/comprehensive.rs` (v1 integration tests)

Trimmed transport files to v2-only:
- `controller.rs`: drop `stores`/`sync_manager`/`central_collector`/`current_batch` fields. Delete v1 dispatch arms in `sync_stream` (Incremental/SnapshotChunk/SnapshotRequest/SnapshotComplete), the v1 incremental sender block (`PeerWatermark` filter loop), and the periodic v1 ticks (`checkpoint_tree_states`, `gc_tombstones`, store-size logging, `central_collector.collect`).
- `service.rs`: drop `stores`/`sync_manager` from `MeshServerHandler`/`MeshServerBuilder`/`MeshServer`. Delete `start_rate_limit_task`/`stop_rate_limit_task`/`write_data`/`read_data`/`get_operation_log`/`sync_app_from_log`/`state_machine()`/`is_ready()`. Remove `MeshSyncManager` and `NodeStateMachine` construction from `build()`.
- `ping_server.rs`: delete the entire `create_snapshot_chunks` impl block. Drop v1 fields and builder methods (`with_stores`, `with_sync_manager`, `with_current_batch`). Rewrite `sync_stream` from 1047 lines → ~170: v2-only handler that does heartbeat echo, ack/nack metrics, StreamBatch dispatch, plus a v2 sender that emits broadcast `drain_entries` and targeted entries to the learned peer. v1 wire messages now log at debug and drop.
- `flow_control.rs`: delete unused `MessageSizeValidator`/`MessageSizeError`.
- `metrics.rs`: delete unused functions (`record_batch_sent`, `record_snapshot_*`, `record_sync_batch_bytes`, `record_store_sizes`).
- `crdt_kv/mod.rs`: drop unused `Operation`/`ReplicaId` re-exports.
- `crdt_kv/crdt.rs`: delete `upsert`/`try_upsert`/`try_upsert_if` from `CrdtOrMap` (every caller chain went through `MeshSyncManager` → `StateStores::update/_if`, both now gone).
- `tests/test_utils.rs`: shrink to just `bind_node`/`wait_for` (the rest depended on deleted v1 types).

v2 surface unchanged.
`MeshKV`, `CrdtNamespace`, `StreamNamespace`, `MeshServerBuilder`, `MeshServerHandler`, `ClusterState`, `MergeStrategy`, `EpochCount`, `encode_epoch_count`/`decode_epoch_count`/`merge_epoch_max_wins`: all still re-exported from `lib.rs` and used by the gateway adapters and `service_discovery.rs` exactly as before.
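The "log at debug and drop" handling of v1 frames described in the changes can be sketched like this; `WireMessage` and `handle` are simplified stand-ins for the wire enum and the dispatch arm:

```rust
#[derive(Debug)]
enum WireMessage {
    StreamBatch(Vec<u8>),
    V1Incremental,
    V1SnapshotChunk,
}

// Returns the bytes to dispatch into MeshKV, or None for dropped frames.
fn handle(msg: WireMessage) -> Option<Vec<u8>> {
    match msg {
        WireMessage::StreamBatch(bytes) => Some(bytes),
        // Legacy frames from mixed-version peers: log at debug and drop,
        // so old nodes can keep talking during a rolling upgrade.
        WireMessage::V1Incremental | WireMessage::V1SnapshotChunk => {
            eprintln!("debug: dropping v1 wire message");
            None
        }
    }
}

fn main() {
    assert_eq!(handle(WireMessage::StreamBatch(vec![1, 2])), Some(vec![1, 2]));
    assert_eq!(handle(WireMessage::V1Incremental), None);
    assert_eq!(handle(WireMessage::V1SnapshotChunk), None);
}
```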
- `cargo clippy --all-targets -p smg-mesh -p smg`: clean
- `cargo +nightly fmt --check`: clean
- `cargo test -p smg-mesh --lib`: 112 passed, 1 ignored
- `cargo test -p smg --lib`: 835 passed, 4 ignored

Diff stat (vs. main, both commits in this PR):
Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes

Summary by CodeRabbit
Refactor
Removals
Tests