refactor(mesh): delete v1 internals (-10k lines)#1486

Open
CatherineSue wants to merge 6 commits into main from chore/mesh-v1-deadcode-cleanup

Conversation

Member

@CatherineSue CatherineSue commented May 13, 2026

Description

Problem

PR #1476 cut every gateway-side caller of v1 mesh. The internal v1 modules stayed because controller.rs / ping_server.rs / service.rs still held v1 fields and dispatched v1 wire messages onto v1 stores nobody read from. The mesh crate carried ~9 KLoC of dead-but-compile-only code, and reviewers (and the cluster) couldn't tell which paths were actually live.

Solution

Finish the v1 removal inside the mesh crate. v2 transport (membership, ping/SWIM, stream batches via MeshKV) stays intact and keeps gossiping. v1 store-replication paths are gone end-to-end.

Changes

Deleted entire files (orphaned now that the transport stops feeding them):

  • crates/mesh/src/sync.rs: MeshSyncManager and apply-remote-* handlers
  • crates/mesh/src/stores.rs: StateStores, AppState, MembershipState, WorkerState, PolicyState, StoreType
  • crates/mesh/src/tree_ops.rs: TreeState, TenantDelta, lz4 helpers
  • crates/mesh/src/collector.rs: CentralCollector, PeerWatermark, v1 RoundBatch
  • crates/mesh/src/consistent_hash.rs
  • crates/mesh/src/rate_limit_window.rs
  • crates/mesh/src/node_state_machine.rs
  • crates/mesh/src/topology.rs (already orphaned by a prior commit)
  • crates/mesh/src/tests/comprehensive.rs (v1 integration tests)

Trimmed transport files to v2-only:

  • controller.rs: drop stores / sync_manager / central_collector / current_batch fields. Delete v1 dispatch arms in sync_stream (Incremental/SnapshotChunk/SnapshotRequest/SnapshotComplete), the v1 incremental sender block (PeerWatermark filter loop), and the periodic v1 ticks (checkpoint_tree_states, gc_tombstones, store-size logging, central_collector.collect).
  • service.rs: drop stores / sync_manager from MeshServerHandler / MeshServerBuilder / MeshServer. Delete start_rate_limit_task / stop_rate_limit_task / write_data / read_data / get_operation_log / sync_app_from_log / state_machine() / is_ready(). Remove MeshSyncManager and NodeStateMachine construction from build().
  • ping_server.rs: delete the entire create_snapshot_chunks impl block. Drop v1 fields and builder methods (with_stores, with_sync_manager, with_current_batch). Rewrite sync_stream from 1047 lines → ~170: v2-only handler that does heartbeat echo, ack/nack metrics, StreamBatch dispatch, plus a v2 sender that emits broadcast drain_entries and targeted entries to the learned peer. v1 wire messages now log at debug and drop.
  • flow_control.rs: delete unused MessageSizeValidator / MessageSizeError.
  • metrics.rs: delete unused functions (record_batch_sent, record_snapshot_*, record_sync_batch_bytes, record_store_sizes).
  • crdt_kv/mod.rs: drop unused Operation / ReplicaId re-exports.
  • crdt_kv/crdt.rs: delete upsert / try_upsert / try_upsert_if from CrdtOrMap (every caller chain went through MeshSyncManagerStateStores::update/_if, both now gone).
  • tests/test_utils.rs: shrink to just bind_node / wait_for (the rest depended on deleted v1 types).
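The v2-only sync_stream dispatch described for ping_server.rs above can be sketched as a plain match (illustrative only: WireMessage and Handled are stand-in names, not the crate's real wire types):

```rust
// Stand-in enums for illustration; the real wire types live in the mesh proto.
#[derive(Debug, PartialEq)]
enum WireMessage {
    Heartbeat,
    Ack,
    Nack,
    StreamBatch,
    // v1 variants, still tolerated on the wire for mixed-version clusters
    Incremental,
    SnapshotChunk,
}

#[derive(Debug, PartialEq)]
enum Handled {
    Echoed,     // heartbeat echoed back to the peer
    Metric,     // ack/nack recorded as metrics
    Dispatched, // v2 StreamBatch dispatched onto MeshKV
    DroppedV1,  // v1 message logged at debug and dropped, no store side effects
}

fn handle_inbound(msg: WireMessage) -> Handled {
    match msg {
        WireMessage::Heartbeat => Handled::Echoed,
        WireMessage::Ack | WireMessage::Nack => Handled::Metric,
        WireMessage::StreamBatch => Handled::Dispatched,
        WireMessage::Incremental | WireMessage::SnapshotChunk => Handled::DroppedV1,
    }
}
```

The important property is the last arm: v1 traffic from older peers is tolerated but never touches state.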

v2 surface unchanged. MeshKV, CrdtNamespace, StreamNamespace, MeshServerBuilder, MeshServerHandler, ClusterState, MergeStrategy, EpochCount, encode_epoch_count / decode_epoch_count / merge_epoch_max_wins — all still re-exported from lib.rs and used by the gateway adapters and service_discovery.rs exactly as before.
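For a sense of what the retained merge_epoch_max_wins export does, here is a minimal sketch of a max-wins epoch merge; the EpochCount fields and the function body are assumptions for illustration, not the crate's actual definitions:

```rust
use std::cmp::Ordering;

// Assumed shape: a counter tagged with the epoch it was observed in.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct EpochCount {
    epoch: u64,
    count: u64,
}

// Max-wins merge: a higher epoch wins outright; within the same epoch the
// larger count wins, so replaying a stale counter can never regress state.
fn merge_epoch_max_wins(local: EpochCount, remote: EpochCount) -> EpochCount {
    match local.epoch.cmp(&remote.epoch) {
        Ordering::Greater => local,
        Ordering::Less => remote,
        Ordering::Equal => EpochCount {
            epoch: local.epoch,
            count: local.count.max(remote.count),
        },
    }
}
```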

Test Plan

  • cargo clippy --all-targets -p smg-mesh -p smg — clean
  • cargo +nightly fmt --check — clean
  • cargo test -p smg-mesh --lib — 112 passed, 1 ignored
  • cargo test -p smg --lib — 835 passed, 4 ignored
  • Pre-commit hooks pass

Diff stat (vs. main, both commits in this PR):

19 files changed, 132 insertions(+), 10070 deletions(-)
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • Refactor

    • Sync stack refocused on v2 stream-batch synchronization; controller, ping service, and server simplified to rely on KV-backed stream batches.
  • Removals

    • Legacy incremental/snapshot sync path, local state-store implementation, rate-limit window, and message-size validation removed.
    • Batch/snapshot metric helpers and several CRDT upsert APIs removed; public exports narrowed.
  • Tests

    • Comprehensive integration tests and many test helpers removed.

PR #1476 cut every gateway-side caller of v1 mesh. The internal v1
modules stayed because controller/ping_server/service still held
v1 fields and dispatched v1 wire messages onto v1 stores nobody
read from. This commit finishes the cleanup.

Deleted: sync.rs, stores.rs, tree_ops.rs, collector.rs,
consistent_hash.rs, rate_limit_window.rs, node_state_machine.rs,
tests/comprehensive.rs.

Trimmed: controller.rs (v1 fields + dispatch arms + periodic ticks);
service.rs (sync_manager / stores / NodeStateMachine /
start_rate_limit_task); ping_server.rs (delete create_snapshot_chunks,
rewrite sync_stream from 1047 lines to ~170); flow_control.rs
(MessageSizeValidator); metrics.rs (record_batch_sent and friends);
crdt_kv (drop upsert/try_upsert/try_upsert_if from CrdtOrMap).

Build: cargo clippy --all-targets -p smg-mesh -p smg clean.
Tests: 112 smg-mesh + 835 smg pass.
Net: roughly -9,300 lines.
Signed-off-by: Chang Su <[email protected]>
github-actions bot added the tests (Test changes) and mesh (Mesh crate changes) labels on May 13, 2026

coderabbitai Bot commented May 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


This PR removes the legacy v1 gossip incremental state-sync stack (collector, stores, tree ops, rate-limit window, related metrics/flow-control and tests) and refactors controller, ping server, and service layers to use the v2 StreamBatch sync path via MeshKV; v1 wire messages are ignored with debug logs.

Changes

Mesh v1 → v2 stream-batch migration

  • Module deletions and manifest pruning (crates/mesh/src/collector.rs, stores.rs, tree_ops.rs, rate_limit_window.rs, lib.rs): removes v1 modules and their public types/APIs (CentralCollector, RoundBatch, PeerWatermark, StateStores, TreeState/Delta, RateLimitWindow); lib.rs now declares only the remaining v2-focused modules.
  • Controller, switched to stream rounds only (crates/mesh/src/controller.rs): simplifies MeshController::new (drops stores/sync_manager), removes the central current_batch and v1 collection/checkpoint logic, adds periodic retry-manager cleanup, replaces the v1 incremental sender with periodic broadcasting of current_stream_batch, and ignores v1 inbound message types.
  • GossipService / ping server, v2-only streams (crates/mesh/src/ping_server.rs): drops store/sync/state-machine ownership and snapshot chunking; sync_stream optionally runs a periodic sender that drains current_stream_batch to build and send StreamBatch messages, handles Heartbeat/Ack/Nack/StreamBatch inbound, and logs/ignores v1 message variants.
  • Service wiring and handler simplification (crates/mesh/src/service.rs): removes service-level stores/sync/state-machine and rate-limit task management from MeshServerHandler/MeshServer; shutdown() now only signals; the server builder no longer constructs StateStores/MeshSyncManager; the ping service is wired to the controller's current_stream_batch.
  • CRDT API and exports, flow control, metrics pruning (crates/mesh/src/crdt_kv/crdt.rs, crdt_kv/mod.rs, flow_control.rs, metrics.rs): removes CrdtOrMap upsert variants, narrows re-exports to OperationLog, removes MessageSizeValidator/MessageSizeError, and deletes several batch/snapshot metric helpers.
  • Tests and test helpers reduction (crates/mesh/src/tests/comprehensive.rs, tests/mod.rs, tests/test_utils.rs): deletes the comprehensive integration test module and the test factory helpers for stores/sync/cluster setup; keeps chunking_integration and reduces test utilities to networking/time helpers only.
```mermaid
sequenceDiagram
    participant Controller as Controller
    participant GossipService as GossipService
    participant MeshKV as MeshKV
    participant Peer as Peer

    Controller->>MeshKV: with_mesh_kv() attach
    Controller->>GossipService: share current_stream_batch
    loop periodic round
      Controller->>MeshKV: collect_round_batch()
      MeshKV-->>Controller: RoundBatch
      Controller->>GossipService: expose stream entries
    end
    GossipService->>Peer: send StreamBatch (periodic/drain)
    Peer->>GossipService: send StreamBatch / Heartbeat / Ack / Nack
    GossipService->>MeshKV: dispatch inbound StreamBatch
    MeshKV-->>GossipService: apply updates
```

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

model-gateway, ci

Suggested Reviewers

  • tonyluj
  • llfl
  • slin1237
  • key4ng

"I hopped through code with ear held high,
Old collectors packed, the trees said goodbye.
Streams hum now where deltas once leapt,
Mesh trimmed and tidy — a carrot well-kept. 🥕"

🚥 Pre-merge checks: ✅ 5 passed
  • Description Check: Passed (check skipped because CodeRabbit's high-level summary is enabled)
  • Title check: Passed. The PR title 'refactor(mesh): delete v1 internals (-10k lines)' directly and concisely summarizes the main change: removal of ~10,000 lines of unused v1 mesh internals while preserving v2 transport.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.
  • Linked Issues check: Passed (check skipped because no linked issues were found for this pull request)
  • Out of Scope Changes check: Passed (check skipped because no linked issues were found for this pull request)


Comment thread crates/mesh/src/service.rs Outdated

@claude claude Bot left a comment


Clean deletion of v1 mesh internals (~10k lines). Reviewed the surviving files — no dangling references to deleted modules/types, removed public API methods (write_data, read_data, is_ready, state_machine, etc.) have no remaining callers in the gateway, and v1 wire messages are safely logged-and-ignored for mixed-version transition. One minor nit on a stale doc comment.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/mesh/src/ping_server.rs`:
- Around line 317-334: The code currently calls
update_peer_connections(&peer_id, true) when peer_id is still empty, causing a
phantom metric for ""; remove that early call and only invoke
update_peer_connections once after the real peer id is learned from the first
inbound message (the block that assigns peer_id from msg.peer_id and writes
learned_peer_inbound). Ensure no other early calls use the uninitialized peer_id
(check the scope around incoming.next().await and the variables peer_id,
learned_peer_inbound, and update_peer_connections) so metrics are recorded only
for the actual peer id.
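The fix the prompt describes (record the connection metric only after the real peer id is learned from the first inbound frame) can be modeled like this; update_peer_connections is reduced to collecting ids, and everything else is a simplification for illustration:

```rust
// Simplified model of the suggested fix: never record a metric for the empty
// peer id; record exactly once when the first non-empty id is learned.
fn peer_connection_events(inbound_peer_ids: &[&str]) -> Vec<String> {
    let mut peer_id = String::new();
    let mut recorded = Vec::new();
    for frame_peer in inbound_peer_ids {
        if peer_id.is_empty() && !frame_peer.is_empty() {
            peer_id = frame_peer.to_string();
            // stands in for update_peer_connections(&peer_id, true)
            recorded.push(peer_id.clone());
        }
    }
    recorded
}
```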

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ef904632-69e3-4548-8a26-f826d0992b4f

📥 Commits

Reviewing files that changed from the base of the PR and between 8d5e423 and 489481d.

📒 Files selected for processing (19)
  • crates/mesh/src/collector.rs
  • crates/mesh/src/consistent_hash.rs
  • crates/mesh/src/controller.rs
  • crates/mesh/src/crdt_kv/crdt.rs
  • crates/mesh/src/crdt_kv/mod.rs
  • crates/mesh/src/flow_control.rs
  • crates/mesh/src/lib.rs
  • crates/mesh/src/metrics.rs
  • crates/mesh/src/node_state_machine.rs
  • crates/mesh/src/ping_server.rs
  • crates/mesh/src/rate_limit_window.rs
  • crates/mesh/src/service.rs
  • crates/mesh/src/stores.rs
  • crates/mesh/src/sync.rs
  • crates/mesh/src/tests/comprehensive.rs
  • crates/mesh/src/tests/mod.rs
  • crates/mesh/src/tests/test_utils.rs
  • crates/mesh/src/topology.rs
  • crates/mesh/src/tree_ops.rs
💤 Files with no reviewable changes (13)
  • crates/mesh/src/node_state_machine.rs
  • crates/mesh/src/sync.rs
  • crates/mesh/src/topology.rs
  • crates/mesh/src/rate_limit_window.rs
  • crates/mesh/src/tree_ops.rs
  • crates/mesh/src/stores.rs
  • crates/mesh/src/collector.rs
  • crates/mesh/src/consistent_hash.rs
  • crates/mesh/src/lib.rs
  • crates/mesh/src/flow_control.rs
  • crates/mesh/src/tests/comprehensive.rs
  • crates/mesh/src/crdt_kv/crdt.rs
  • crates/mesh/src/metrics.rs

Comment thread crates/mesh/src/ping_server.rs
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request performs a significant refactoring of the mesh synchronization logic, removing the legacy CentralCollector, PeerWatermark, NodeStateMachine, and TopologyManager components in favor of a more streamlined stream-based approach. The GossipService and MeshController have been updated to remove these dependencies, and the StateStores and MeshSyncManager have been removed as part of this cleanup. My feedback highlights that the removal of the 60-second idle timeout in sync_stream could lead to resource leaks from inactive clients, and I recommend re-introducing an idle timeout mechanism.

Comment thread crates/mesh/src/ping_server.rs
Comment thread crates/mesh/src/ping_server.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce8100aa7e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/mesh/src/ping_server.rs Outdated
Follow-up review nits on PR #1486's `sync_stream` rewrite:

- `service.rs:70` (claude): the `/// Get state machine` doc comment
  was orphaned to `should_serve()` when `state_machine()` got
  deleted. Drop it.
- `ping_server.rs:316` (gemini): the old `sync_stream` wrapped
  `incoming.next()` in `tokio::time::timeout(STREAM_IDLE_TIMEOUT,
  …)` so an idle client couldn't pin the server-side task and
  mpsc channel. The rewrite dropped the wrapper — restore it.
- `ping_server.rs:342` (claude): drop the redundant
  `let next = match …; let msg = next;` and bind `msg` directly.
- `ping_server.rs:348` (codex, P1): the rewrite dropped main's
  peer-identity-stability check. Restore: bind `peer_id` to the
  first non-empty inbound id, then break out of the loop on any
  later frame whose `msg.peer_id` differs (including empty).
  Same warn-and-close semantics as main.

CodeRabbit also flagged a phantom-metric pattern in the same
file (`update_peer_connections("", true)` runs before peer_id
is learned). That code is carried over from `main` unchanged,
not a regression introduced by this PR — leaving as-is for a
focused follow-up.

Build: cargo clippy --all-targets -p smg-mesh clean.
Tests: 112 smg-mesh pass.
Signed-off-by: Chang Su <[email protected]>
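The idle-timeout guard the second nit asks to restore can be modeled synchronously with a std channel; the real code wraps incoming.next().await in tokio::time::timeout, and recv_timeout plays the same role in this sketch:

```rust
use std::sync::mpsc;
use std::time::Duration;

// If no inbound frame arrives within the idle window, the server-side loop
// exits instead of pinning its task and mpsc channel forever.
fn recv_or_idle_close(rx: &mpsc::Receiver<String>, idle: Duration) -> Option<String> {
    match rx.recv_timeout(idle) {
        Ok(frame) => Some(frame),
        Err(_) => None, // idle timeout or disconnect: close the stream
    }
}
```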
@CatherineSue CatherineSue force-pushed the chore/mesh-v1-deadcode-cleanup branch from ce8100a to 5e8da6f on May 13, 2026 at 20:43

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/mesh/src/ping_server.rs`:
- Around line 242-256: The current logic sets last_stream_batch (from
stream_batch_handle.read().clone()) before the code knows whether
learned_peer_sender is Some, which can cause targeted entries to be skipped once
a peer is later learned; instead, defer updating last_stream_batch until after
targeted delivery for the current peer has been attempted (i.e., only set
last_stream_batch after learned_peer_sender is Some and after emitting
publish_to for stream_batch.targeted_entries), or alternatively add separate
seen bookkeeping for targeted delivery (track which RoundBatch or targeted entry
tuples have been delivered) so the broadcast freshness check (last_stream_batch,
stream_batch_handle) does not suppress later peer-dependent sends; update the
code around last_stream_batch, learned_peer_sender, has_targeted, and
stream_batch.targeted_entries to implement one of these approaches.
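One way to read the suggested deferral, as a simplified model (a u64 batch id stands in for the shared RoundBatch handle, and the real send paths are collapsed into a boolean):

```rust
// Track the last batch marked "seen" by the sender loop.
struct SenderState {
    last_seen_batch: Option<u64>,
}

// Only advance the seen marker once targeted delivery is actually possible
// (peer learned); otherwise a batch observed early would be suppressed later.
fn maybe_send(state: &mut SenderState, batch_id: u64, peer_learned: bool) -> bool {
    if state.last_seen_batch == Some(batch_id) {
        return false; // already delivered for this peer
    }
    if !peer_learned {
        return false; // defer without marking seen
    }
    state.last_seen_batch = Some(batch_id);
    true
}
```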

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 82d3df78-cd9d-42ff-a342-abf375aa6d3a

📥 Commits

Reviewing files that changed from the base of the PR and between ce8100a and 5e8da6f.

📒 Files selected for processing (2)
  • crates/mesh/src/ping_server.rs
  • crates/mesh/src/service.rs

Comment thread crates/mesh/src/ping_server.rs
Comment thread crates/mesh/src/metrics.rs
Follow-up to the v1 trim (claude nit on r3237365164). After the v1
recording functions were deleted, `init_mesh_metrics()` still
registered `describe_*!` calls for metrics with no writers, and
`ConvergenceTracker` + `record_convergence_latency` were orphaned
when `NodeStateMachine` went.

Removed describes (no recorder remains):
  router_mesh_convergence_ms
  router_mesh_batches_total
  router_mesh_bytes_total
  router_mesh_snapshot_trigger_total
  router_mesh_snapshot_duration_seconds
  router_mesh_snapshot_bytes_total
  router_mesh_sync_batch_bytes
  router_mesh_store_workers
  router_mesh_store_policies
  router_mesh_store_memberships
  router_mesh_store_apps

Removed code:
  ConvergenceTracker struct + impls
  record_convergence_latency

Retained describes whose recorders are still alive
(`update_peer_connections`, `record_peer_reconnect`, `record_ack`,
`record_nack`, `record_sync_round_duration`) and the four
`#[expect(dead_code)]`-marked drift / cardinality gauges that are
explicitly kept as scaffolding.

Build: cargo clippy --all-targets -p smg-mesh clean.
Tests: 112 smg-mesh pass.
Signed-off-by: Chang Su <[email protected]>
`BackpressureController` and `BACKPRESSURE_THRESHOLD` shipped in
the original v1 mesh contribution (4e108ef, Tony Lu, 2026-01-14)
as scaffolding for a "warn when channel ≥80% full" feature. Both
methods (`can_send`, `remaining_capacity`) were `#[expect(dead_code)]`
from the start and never wired anywhere — no call site logs or
emits a metric based on channel-pressure lookahead.

The hard-backpressure path the codebase actually relies on is
`mpsc::Sender::try_send` returning `TrySendError::Full` →
drop-with-application-retry, which doesn't need this helper. If a
warning is wanted in future, modern tokio exposes
`Sender::capacity()` / `Sender::max_capacity()` directly — a
helper struct isn't needed.

Also dropped the `pub backpressure_threshold: f64` field from the
`MeshConfig` example in `mesh-v2-implementation-spec.md §10`
since it was the docs-side counterpart to this dead scaffolding.

Build: cargo clippy --all-targets -p smg-mesh clean.
Tests: 112 smg-mesh pass.
Signed-off-by: Chang Su <[email protected]>
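The try_send hard-backpressure path this message describes can be sketched with std's bounded channel standing in for tokio's bounded mpsc (tokio's Sender::try_send returns the analogous TrySendError::Full):

```rust
use std::sync::mpsc;

// Drop-with-application-retry: on a full channel the item is dropped and the
// caller's retry logic picks it up later; no lookahead helper is needed.
fn send_or_drop(tx: &mpsc::SyncSender<u64>, item: u64) -> bool {
    match tx.try_send(item) {
        Ok(()) => true,
        Err(mpsc::TrySendError::Full(_)) => false, // backpressure: drop, retry upstream
        Err(mpsc::TrySendError::Disconnected(_)) => false, // receiver gone
    }
}
```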

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 893486b2d5


Comment thread crates/mesh/src/ping_server.rs Outdated

```rust
break;
}
}
StreamMessageType::Ack => record_ack(&peer_id, true),
```


P2: Preserve ACK success flag when recording metrics

This branch records every inbound Ack as success, which drops the ack.success information carried in the payload and makes failed ACKs indistinguishable from successful ones in router_mesh_peer_ack_total. In environments where peers send Ack { success: false } (e.g., partial failures or protocol errors), this will under-report failures and can mask real sync-health regressions in monitoring.


Member Author


Addressed in 9b2c9f7. The arm now decodes the StreamAck payload and forwards ack.success to record_ack — matches main's behaviour (record_ack(&peer_id, ack.success) at ping_server.rs:1450 on main). Missing/wrong-variant payloads are still skipped (no metric recorded), so the previous trim's tendency to silently mark every Ack as success is gone.

The sync_stream v2 trim collapsed the inbound `Ack` arm to a
hard-coded `record_ack(&peer_id, true)`, dropping the `success`
flag carried in `StreamAck`. On main the arm decodes the payload
and forwards `ack.success`, so failed ACKs land under the
`status=failure` label of `router_mesh_peer_ack_total`. The trim
made failed ACKs indistinguishable from successful ones, which
hides real sync-health regressions in monitoring.

Restore main's behaviour: read the `StreamAck` payload via
`if let` and pass `ack.success` through. Missing/wrong-variant
payloads are skipped (no metric recorded), matching main.

Signed-off-by: Chang Su <[email protected]>
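The restored control flow can be sketched with stand-in types (StreamAck and Payload here are illustrative stand-ins, not the real protobuf definitions):

```rust
// Stand-ins for the real wire payloads.
struct StreamAck {
    success: bool,
}

enum Payload {
    Ack(StreamAck),
    Other,
}

// Forward ack.success instead of hard-coding true; a missing or wrong-variant
// payload records nothing, matching main's behaviour.
fn ack_status_to_record(payload: Payload) -> Option<bool> {
    if let Payload::Ack(ack) = payload {
        Some(ack.success)
    } else {
        None
    }
}
```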

Labels

mesh Mesh crate changes tests Test changes
