Skip to content

test(cluster): validator restart inside the EndOfPublish deferred-close grace window#1741

Open
omersadika wants to merge 5 commits into
devfrom
test/restart-mid-grace
Open

test(cluster): validator restart inside the EndOfPublish deferred-close grace window#1741
omersadika wants to merge 5 commits into
devfrom
test/restart-mid-grace

Conversation

@omersadika

Copy link
Copy Markdown
Contributor

Summary

The #1 post-merge fast-follow from the PR #1721 review: nothing exercised a validator restarting inside the EndOfPublish deferred-close grace window — the exact scenario the persisted quorum-anchor round and close-emitted marker exist for. A bug in either forks the final checkpoint of the epoch.

The test

One cluster test, two restarts striking the same epoch-1 close from both directions:

  • Validator X stops after the mpc_data freeze but before voting EndOfPublish, restarts mid-grace. Exercises the consensus-replay recovery path — and X's silence is what makes the window real at all: with four healthy validators the all_voted short-circuit closes the epoch at the fourth vote, leaving no window. With X down, the live 3-of-4 reach exactly stake quorum and sit out the full 50-round countdown. X is also literally the straggler the grace mechanism was built for.
  • Validator Y stops inside the window — quorum anchor persisted, close marker not (both observed via two new read accessors) — and restarts immediately. Exercises the persisted-anchor path: resume the countdown from the stored round, don't re-anchor later.

Assertions — determinism, not just liveness

  • Final epoch-1 checkpoint byte-identical on all 4 validators in both stores: certified and locally computed. The locally-computed half is load-bearing: a validator that closed at the wrong round would locally compute a divergent tail while happily state-syncing the canonical certified one — certified equality alone masks exactly the bug under test.
  • Handoff attestation cert sets byte-identical across all 4 (the struck epoch's cert formed while X was down and Y was bouncing).
  • Y's reloaded anchor equals the pre-kill anchor (when the close hasn't already landed by reboot — best-effort, the checkpoint assertions cover the invariant either way).
  • One more full reconfiguration with everyone back: the handed-off network keys actually work, not just the bookkeeping.

Accessors

Two pub fn reads on AuthorityPerEpochStore for the strike-window polling — end_of_publish_quorum_round(), is_epoch_close_emitted() — mirroring the existing get_frozen_validator_mpc_data_input_set().

Validation

  • cargo check/clippy clean (test crate + ika-core).
  • Cluster-test CI run with test_filter=restart_mid_grace dispatched on this branch — link in comments when green.

🤖 Generated with Claude Code

…se grace window

The v4 epoch close persists two markers precisely so a restart inside
the grace window cannot fork the final checkpoint: the quorum anchor
round (a restarted validator counts the grace from the same round as
its peers) and the close-emitted marker (a restart cannot re-emit the
close at a later commit). Nothing exercised the restart path through
that window. This test does, from both directions:

- Validator X stops after the mpc_data freeze but before voting
  EndOfPublish, and restarts mid-grace (replay path; also what makes
  the window real — with all four healthy the all_voted short-circuit
  closes the epoch at the fourth vote, leaving nothing to strike).
- Validator Y stops inside the window — anchor persisted, close marker
  not — and restarts immediately (persisted-anchor recovery path).

Assertions are cross-validator determinism, not liveness alone: a
byte-identical final checkpoint for the struck epoch on every validator
in BOTH stores — certified and locally computed (a validator that
closed at the wrong round would locally compute a divergent tail while
syncing the canonical certified one, so certified equality alone masks
exactly the bug under test) — byte-identical handoff cert sets, and one
more full reconfiguration proving the handed-off network keys work.

Adds two read accessors on AuthorityPerEpochStore for the strike-window
polling (end_of_publish_quorum_round, is_epoch_close_emitted), mirroring
the existing get_frozen_validator_mpc_data_input_set.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t-cluster.yaml

First CI run timed out at 120s waiting for the EndOfPublish quorum
after X's kill. With X down the reconfiguration MPC runs at exactly
3-of-4 threshold with zero slack, and the chain-side EndOfPublish gate
(sui_syncer: reconfiguration-completed + session drain + lock) sits
behind it — 120s was an assumption, not a measurement. Waiting longer
cannot loosen the strike (the grace window only opens AT quorum), so
give the poll 300s and capture the diagnosis properly: the workflow
gets a rust_log input so the gate-breakdown debug! in sui_syncer
("end-of-publish gate not yet satisfied") is actually visible in the
failure replay instead of being swallowed by the hardcoded
RUST_LOG=error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
omersadika and others added 3 commits June 13, 2026 02:31
Run 3 diagnosed both open questions:

- The protocol is fine: with X down the EndOfPublish quorum arrived
  within the widened poll (reconfiguration completes at 3-of-4, just
  slower), and the grace machinery behaved.

- The respawn panic ("Cannot open DB .../live/epochs: lock hold by
  current process") was this test's own bug: IkaNodeHandle holds a
  STRONG Arc<IkaNode>, and handles bound before stop() were still live
  at start() — Node::stop joins the node thread, but the RocksDB stores
  live until the last Arc drops, and the test's own handle was that
  Arc. A real validator restart never sees this (process death releases
  the lock); it is purely an in-process-swarm hazard.

Acquire handles on demand — inside each poll tick, or scoped to one
statement — and document the trap on the helper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rove it final

Run 4 executed the full scenario (strike mid-grace at anchor 1523, both
restarts clean, all four validators into epoch 2, handoff cert formed
and verified) and failed only on my comparison: X — still syncing its
certified tail after the restart — was sampled at its highest epoch-1
checkpoint SO FAR (seq 8) against Y's true final (seq 9). Catch-up lag
misread as a fork.

The helper now returns None until the node's latest certified
checkpoint is past the struck epoch; sequence contiguity then makes the
walk-back provably land on the true final checkpoint, so every
validator independently derives the same boundary before bytes are
compared.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Run 5 measured the real latency: the epoch-2 reconfiguration output —
in a quiet epoch, the first thing that certifies a dwallet checkpoint —
reached quorum ~3 minutes after the restarts (restarted validators
re-instantiate adopted keys before participating, and the computation
is heavy). The 90s finality poll expired seconds before it landed.
Per the budget rule in the testing pitfalls: budgets guard against
"never", not "slow".

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant