test(cluster): validator restart inside the EndOfPublish deferred-close grace window#1741
Open
omersadika wants to merge 5 commits into
Open
test(cluster): validator restart inside the EndOfPublish deferred-close grace window#1741omersadika wants to merge 5 commits into
omersadika wants to merge 5 commits into
Conversation
…se grace window The v4 epoch close persists two markers precisely so a restart inside the grace window cannot fork the final checkpoint: the quorum anchor round (a restarted validator counts the grace from the same round as its peers) and the close-emitted marker (a restart cannot re-emit the close at a later commit). Nothing exercised the restart path through that window. This test does, from both directions: - Validator X stops after the mpc_data freeze but before voting EndOfPublish, and restarts mid-grace (replay path; also what makes the window real — with all four healthy the all_voted short-circuit closes the epoch at the fourth vote, leaving nothing to strike). - Validator Y stops inside the window — anchor persisted, close marker not — and restarts immediately (persisted-anchor recovery path). Assertions are cross-validator determinism, not liveness alone: a byte-identical final checkpoint for the struck epoch on every validator in BOTH stores — certified and locally computed (a validator that closed at the wrong round would locally compute a divergent tail while syncing the canonical certified one, so certified equality alone masks exactly the bug under test) — byte-identical handoff cert sets, and one more full reconfiguration proving the handed-off network keys work. Adds two read accessors on AuthorityPerEpochStore for the strike-window polling (end_of_publish_quorum_round, is_epoch_close_emitted), mirroring the existing get_frozen_validator_mpc_data_input_set. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t-cluster.yaml
First CI run timed out at 120s waiting for the EndOfPublish quorum
after X's kill. With X down the reconfiguration MPC runs at exactly
3-of-4 threshold with zero slack, and the chain-side EndOfPublish gate
(sui_syncer: reconfiguration-completed + session drain + lock) sits
behind it — 120s was an assumption, not a measurement. Waiting longer
cannot loosen the strike (the grace window only opens AT quorum), so
give the poll 300s and capture the diagnosis properly: the workflow
gets a rust_log input so the gate-breakdown debug! in sui_syncer
("end-of-publish gate not yet satisfied") is actually visible in the
failure replay instead of being swallowed by the hardcoded
RUST_LOG=error.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Run 3 diagnosed both open questions:
- The protocol is fine: with X down the EndOfPublish quorum arrived
within the widened poll (reconfiguration completes at 3-of-4, just
slower), and the grace machinery behaved.
- The respawn panic ("Cannot open DB .../live/epochs: lock hold by
current process") was this test's own bug: IkaNodeHandle holds a
STRONG Arc<IkaNode>, and handles bound before stop() were still live
at start() — Node::stop joins the node thread, but the RocksDB stores
live until the last Arc drops, and the test's own handle was that
Arc. A real validator restart never sees this (process death releases
the lock); it is purely an in-process-swarm hazard.
Acquire handles on demand — inside each poll tick, or scoped to one
statement — and document the trap on the helper.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rove it final Run 4 executed the full scenario (strike mid-grace at anchor 1523, both restarts clean, all four validators into epoch 2, handoff cert formed and verified) and failed only on my comparison: X — still syncing its certified tail after the restart — was sampled at its highest epoch-1 checkpoint SO FAR (seq 8) against Y's true final (seq 9). Catch-up lag misread as a fork. The helper now returns None until the node's latest certified checkpoint is past the struck epoch; sequence contiguity then makes the walk-back provably land on the true final checkpoint, so every validator independently derives the same boundary before bytes are compared. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Run 5 measured the real latency: the epoch-2 reconfiguration output — in a quiet epoch, the first thing that certifies a dwallet checkpoint — reached quorum ~3 minutes after the restarts (restarted validators re-instantiate adopted keys before participating, and the computation is heavy). The 90s finality poll expired seconds before it landed. Per the budget rule in the testing pitfalls: budgets guard against "never", not "slow". Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The #1 post-merge fast-follow from the PR #1721 review: nothing exercised a validator restarting inside the EndOfPublish deferred-close grace window — the exact scenario the persisted quorum-anchor round and close-emitted marker exist for. A bug in either forks the final checkpoint of the epoch.
The test
One cluster test, two restarts striking the same epoch-1 close from both directions:
all_votedshort-circuit closes the epoch at the fourth vote, leaving no window. With X down, the live 3-of-4 reach exactly stake quorum and sit out the full 50-round countdown. X is also literally the straggler the grace mechanism was built for.Assertions — determinism, not just liveness
Accessors
Two
pub fnreads onAuthorityPerEpochStorefor the strike-window polling —end_of_publish_quorum_round(),is_epoch_close_emitted()— mirroring the existingget_frozen_validator_mpc_data_input_set().Validation
cargo check/clippyclean (test crate + ika-core).test_filter=restart_mid_gracedispatched on this branch — link in comments when green.🤖 Generated with Claude Code