test(cluster): validator restart inside the EndOfPublish deferred-close grace window by omersadika · Pull Request #1741 · dwallet-labs/ika

omersadika · 2026-06-12T22:56:04Z

Summary

The #1 post-merge fast-follow from the PR #1721 review: nothing exercised a validator restarting inside the EndOfPublish deferred-close grace window — the exact scenario the persisted quorum-anchor round and close-emitted marker exist for. A bug in either forks the final checkpoint of the epoch.

The test

One cluster test, two restarts striking the same epoch-1 close from both directions:

Validator X stops after the mpc_data freeze but before voting EndOfPublish, restarts mid-grace. Exercises the consensus-replay recovery path — and X's silence is what makes the window real at all: with four healthy validators the all_voted short-circuit closes the epoch at the fourth vote, leaving no window. With X down, the live 3-of-4 reach exactly stake quorum and sit out the full 50-round countdown. X is also literally the straggler the grace mechanism was built for.
Validator Y stops inside the window — quorum anchor persisted, close marker not (both observed via two new read accessors) — and restarts immediately. Exercises the persisted-anchor path: resume the countdown from the stored round, don't re-anchor later.

Assertions — determinism, not just liveness

Final epoch-1 checkpoint byte-identical on all 4 validators in both stores: certified and locally computed. The locally-computed half is load-bearing: a validator that closed at the wrong round would locally compute a divergent tail while happily state-syncing the canonical certified one — certified equality alone masks exactly the bug under test.
Handoff attestation cert sets byte-identical across all 4 (the struck epoch's cert formed while X was down and Y was bouncing).
Y's reloaded anchor equals the pre-kill anchor (when the close hasn't already landed by reboot — best-effort, the checkpoint assertions cover the invariant either way).
One more full reconfiguration with everyone back: the handed-off network keys actually work, not just the bookkeeping.

Accessors

Two pub fn reads on AuthorityPerEpochStore for the strike-window polling — end_of_publish_quorum_round(), is_epoch_close_emitted() — mirroring the existing get_frozen_validator_mpc_data_input_set().

Validation

cargo check/clippy clean (test crate + ika-core).
Cluster-test CI run with test_filter=restart_mid_grace dispatched on this branch — link in comments when green.

🤖 Generated with Claude Code

…se grace window The v4 epoch close persists two markers precisely so a restart inside the grace window cannot fork the final checkpoint: the quorum anchor round (a restarted validator counts the grace from the same round as its peers) and the close-emitted marker (a restart cannot re-emit the close at a later commit). Nothing exercised the restart path through that window. This test does, from both directions: - Validator X stops after the mpc_data freeze but before voting EndOfPublish, and restarts mid-grace (replay path; also what makes the window real — with all four healthy the all_voted short-circuit closes the epoch at the fourth vote, leaving nothing to strike). - Validator Y stops inside the window — anchor persisted, close marker not — and restarts immediately (persisted-anchor recovery path). Assertions are cross-validator determinism, not liveness alone: a byte-identical final checkpoint for the struck epoch on every validator in BOTH stores — certified and locally computed (a validator that closed at the wrong round would locally compute a divergent tail while syncing the canonical certified one, so certified equality alone masks exactly the bug under test) — byte-identical handoff cert sets, and one more full reconfiguration proving the handed-off network keys work. Adds two read accessors on AuthorityPerEpochStore for the strike-window polling (end_of_publish_quorum_round, is_epoch_close_emitted), mirroring the existing get_frozen_validator_mpc_data_input_set. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…t-cluster.yaml First CI run timed out at 120s waiting for the EndOfPublish quorum after X's kill. With X down the reconfiguration MPC runs at exactly 3-of-4 threshold with zero slack, and the chain-side EndOfPublish gate (sui_syncer: reconfiguration-completed + session drain + lock) sits behind it — 120s was an assumption, not a measurement. Waiting longer cannot loosen the strike (the grace window only opens AT quorum), so give the poll 300s and capture the diagnosis properly: the workflow gets a rust_log input so the gate-breakdown debug! in sui_syncer ("end-of-publish gate not yet satisfied") is actually visible in the failure replay instead of being swallowed by the hardcoded RUST_LOG=error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Run 3 diagnosed both open questions: - The protocol is fine: with X down the EndOfPublish quorum arrived within the widened poll (reconfiguration completes at 3-of-4, just slower), and the grace machinery behaved. - The respawn panic ("Cannot open DB .../live/epochs: lock hold by current process") was this test's own bug: IkaNodeHandle holds a STRONG Arc<IkaNode>, and handles bound before stop() were still live at start() — Node::stop joins the node thread, but the RocksDB stores live until the last Arc drops, and the test's own handle was that Arc. A real validator restart never sees this (process death releases the lock); it is purely an in-process-swarm hazard. Acquire handles on demand — inside each poll tick, or scoped to one statement — and document the trap on the helper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…rove it final Run 4 executed the full scenario (strike mid-grace at anchor 1523, both restarts clean, all four validators into epoch 2, handoff cert formed and verified) and failed only on my comparison: X — still syncing its certified tail after the restart — was sampled at its highest epoch-1 checkpoint SO FAR (seq 8) against Y's true final (seq 9). Catch-up lag misread as a fork. The helper now returns None until the node's latest certified checkpoint is past the struck epoch; sequence contiguity then makes the walk-back provably land on the true final checkpoint, so every validator independently derives the same boundary before bytes are compared. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Run 5 measured the real latency: the epoch-2 reconfiguration output — in a quiet epoch, the first thing that certifies a dwallet checkpoint — reached quorum ~3 minutes after the restarts (restarted validators re-instantiate adopted keys before participating, and the computation is heavy). The 90s finality poll expired seconds before it landed. Per the budget rule in the testing pitfalls: budgets guard against "never", not "slow". Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

omersadika requested a review from ycscaly as a code owner June 12, 2026 22:56

github-actions Bot added the Area: Rust label Jun 12, 2026

github-actions Bot added the Area: CI label Jun 12, 2026

omersadika and others added 3 commits June 13, 2026 02:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(cluster): validator restart inside the EndOfPublish deferred-close grace window#1741

test(cluster): validator restart inside the EndOfPublish deferred-close grace window#1741
omersadika wants to merge 5 commits into
devfrom
test/restart-mid-grace

omersadika commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

omersadika commented Jun 12, 2026

Summary

The test

Assertions — determinism, not just liveness

Accessors

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant