Off-chain validator metadata + EndOfPublishV2 #1721
Open
omersadika wants to merge 206 commits into
Foundation for the off-chain validator-metadata read flow. Pure types and no-op consensus dispatch; no behavior change, so the acceptance gate `test_network_dkg_full_flow` still passes.
New types in `ika_types::validator_metadata`:
- ValidatorMpcDataAnnouncement / SignedValidatorMpcDataAnnouncement
- HandoffItemKey (sorted enum: NetworkDkgOutput | NetworkReconfigurationOutput | ValidatorMpcData)
- HandoffAttestation with `items: Vec<(HandoffItemKey, [u8;32])>` sorted strictly ascending: a plain length-prefixed BCS list, no map-aware bindings needed for non-Rust verifiers
- HandoffSignatureMessage (Ed25519 sig by consensus key, NOT protocol key)
- CertifiedHandoffAttestation (Vec<(AuthorityName, Ed25519Signature)>; Ed25519 doesn't aggregate)
- EpochMpcDataReadySignal
IntentScope: +ValidatorMpcDataAnnouncement, +HandoffAttestation.
ConsensusTransactionKind + Key: 3 new variants + constructors + key extraction + Debug arms.
AuthorityPerEpochStore / consensus_handler / consensus_validator wire dispatch as no-ops (actual handlers land in later steps); the per-epoch sender-author match enforces wire-binding for HandoffSignature and EpochMpcDataReadySignal (signer == consensus author), and is a trivial pass for ValidatorMpcDataAnnouncement (the inner BLS sig authenticates the validator's intent independent of the relayer).
Unit tests cover BCS roundtrip + sort stability + ready-signal roundtrip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anemo `ValidatorMetadata` service with one method `GetMpcDataBlob(blob_hash) -> Option<MpcDataBlob>`. Backed by an `InMemoryBlobStore` (RwLock<HashMap<[u8;32], Vec<u8>>>) implementing `MpcDataBlobStorage`. Callers hash-verify returned bytes; the network layer doesn't, and the doc comment on `fetch_blob` says so.
`AuthorityPerpetualTables::mpc_artifact_blobs: DBMap<[u8;32], Vec<u8>>` with insert / get / iter helpers is the cross-restart store. At node startup `create_p2p_network` iterates that table and hydrates the in-memory cache before mounting the anemo server, so a restart keeps serving whatever blobs the validator had persisted.
No producers or consumers wire up yet; those land in subsequent steps. The endpoint just serves whatever's been inserted (initially nothing on a fresh node).
Acceptance gate `test_network_dkg_full_flow` passes (142s). 2 new unit tests in ika-network (`in_memory_blob_store_roundtrip`, `mpc_data_blob_hash_is_deterministic`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Producer side (ika_core::validator_metadata):
- derive_mpc_data_blob(seed) returns the canonical BCS-encoded VersionedMPCData::V1 bytes, the same encoding the CLI submits on chain via set_next_epoch_mpc_data_bytes. Deterministic from seed, so off-chain blobs hash-match chain bytes.
- now_ms() for the announcement timestamp (latest-by-timestamp rule means later calls win, which is correct after a seed rotation).
- sign_validator_mpc_data_announcement(...) builds + BLS-signs the announcement ready for consensus.
Consumer side (AuthorityPerEpochStore):
- New per-epoch table validator_mpc_data_announcements: DBMap<AuthorityName, SignedValidatorMpcDataAnnouncement>.
- record_validator_mpc_data_announcement verifies the BLS sig against self.committee() (current-epoch path only; next-epoch joiner path deferred to step 6) and applies the latest-by-timestamp rule on insert. Replays and stale duplicates are silently dropped.
- get_validator_mpc_data_announcement accessor.
- Consensus dispatch wires the ConsensusTransactionKind::ValidatorMpcDataAnnouncement variant through.
Unit tests in ika-core::validator_metadata:
- derive_mpc_data_blob_is_deterministic
- sign_announcement_verifies_against_signer (covers intent scope + epoch binding + tamper detection).
Acceptance gate test_network_dkg_full_flow still passes (143s). No producers wired up yet; they land in subsequent steps along with the ready-signal freeze.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
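For readers following the latest-by-timestamp rule described in this commit, here is a minimal sketch of how such an insert rule could look. It is not the PR's code: `Announcement` and `AnnouncementTable` are hypothetical stand-ins, signature verification is omitted, and treating an equal timestamp as a replay is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a signed announcement
/// (the real type carries a BLS signature; that part is omitted here).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Announcement {
    pub authority: String,
    pub timestamp_ms: u64,
    pub blob_hash: [u8; 32],
}

/// Hypothetical in-memory table keyed by authority.
#[derive(Default)]
pub struct AnnouncementTable {
    latest: HashMap<String, Announcement>,
}

impl AnnouncementTable {
    /// Applies the latest-by-timestamp rule. Returns `true` when the incoming
    /// announcement was stored, `false` when it was dropped as a replay or as
    /// stale. Assumption: an equal timestamp counts as a replay.
    pub fn record(&mut self, incoming: Announcement) -> bool {
        match self.latest.get(&incoming.authority) {
            Some(existing) if existing.timestamp_ms >= incoming.timestamp_ms => false,
            _ => {
                self.latest.insert(incoming.authority.clone(), incoming);
                true
            }
        }
    }

    pub fn get(&self, authority: &str) -> Option<&Announcement> {
        self.latest.get(authority)
    }
}
```

Dropping silently (returning `false` rather than an error) mirrors the commit's statement that replays and stale duplicates do not fail the consensus path.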
Adds two new epoch tables and a producer helper for the freeze step of the off-chain validator-metadata flow.
`epoch_mpc_data_ready_signals` records, per authority, that this validator has decided its mpc_data input set is sufficient (`>= quorum_threshold` announcements observed). The first incoming signal that crosses quorum triggers `freeze_mpc_data_if_first`, which idempotently snapshots `validator_mpc_data_announcements` into `frozen_validator_mpc_data_input_set`: the immutable, content-addressed view of validator mpc_data used by all downstream consumers (handoff, reconfig, joiner bootstrap).
The signal payload itself is unauthenticated; authorisation is the consensus binding (the authority that submitted the transaction). This is enforced at consensus dispatch in `AuthorityPerEpochStore`.
Producer side: `build_epoch_mpc_data_ready_signal_transaction` wraps the signal in a `ConsensusTransaction` ready for the consensus adapter.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 142.28s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
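The quorum-triggered, idempotent freeze described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `EpochState` is a hypothetical in-memory stand-in for the per-epoch tables, authorities are plain strings, and stake is a simple `u64` per authority.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory stand-in for the per-epoch tables.
pub struct EpochState {
    stake: BTreeMap<String, u64>,
    quorum_threshold: u64,
    announcements: BTreeMap<String, [u8; 32]>,
    ready_signals: BTreeSet<String>,
    frozen: Option<BTreeMap<String, [u8; 32]>>,
}

impl EpochState {
    pub fn new(stake: BTreeMap<String, u64>, quorum_threshold: u64) -> Self {
        Self {
            stake,
            quorum_threshold,
            announcements: BTreeMap::new(),
            ready_signals: BTreeSet::new(),
            frozen: None,
        }
    }

    pub fn record_announcement(&mut self, authority: &str, blob_hash: [u8; 32]) {
        self.announcements.insert(authority.to_string(), blob_hash);
    }

    /// Records a ready signal. Returns `true` only for the call that
    /// actually performed the freeze.
    pub fn record_ready_signal(&mut self, authority: &str) -> bool {
        if !self.stake.contains_key(authority) {
            return false; // non-committee senders are ignored
        }
        self.ready_signals.insert(authority.to_string());
        let signaled: u64 = self.ready_signals.iter().map(|a| self.stake[a]).sum();
        signaled >= self.quorum_threshold && self.freeze_if_first()
    }

    /// Snapshots the announcements once; later calls are no-ops.
    fn freeze_if_first(&mut self) -> bool {
        if self.frozen.is_some() {
            return false;
        }
        self.frozen = Some(self.announcements.clone());
        true
    }

    pub fn frozen(&self) -> Option<&BTreeMap<String, [u8; 32]>> {
        self.frozen.as_ref()
    }
}
```

Announcements that arrive after the freeze still land in the live table in this sketch, but the frozen snapshot no longer changes, which is the property downstream consumers rely on.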
Joining validators (in V_{e+1} but not in V_e) can't submit
directly to consensus because they aren't members of the current
consensus committee. They fan out their signed mpc_data
announcement to every current-committee peer over a new Anemo RPC
`SubmitMpcDataAnnouncement`; one honest relayer is enough to land
the announcement in consensus.
This commit lands the transport only:
- `SubmitMpcDataAnnouncementRequest{Response}` wire types.
- `AnnouncementRelay` trait (impl supplied by the node once epoch
store + consensus adapter are up).
- `AnnouncementRelayHandle` — an `ArcSwapOption` late-binding
holder, installed at first epoch start and re-installed across
epoch boundaries. The Anemo server is constructed at node
startup before any epoch store exists, so install-after-the-fact
is needed.
- Anemo server impl that returns `Rejected` while the relay is
uninstalled (joiners retry) and dispatches to the active relay
otherwise.
- Client helpers: `submit_announcement_to_peer` (single peer) and
`submit_announcement_to_committee` (concurrent fan-out).
Installation of the actual relay impl (which performs signature
verification against the pending active set) is deferred to the
PendingActiveSet step, since the relay needs that verification
before it can safely submit.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.61s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
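The late-binding holder described in the commit above (reject while no relay is installed, dispatch once one is) could look roughly like this. The PR uses `ArcSwapOption`; to keep the sketch dependency-free it substitutes a standard-library `RwLock<Option<Arc<..>>>`, and `RelayHandle`, `SubmitResponse`, and `AcceptAll` are hypothetical names.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical response type; the real wire type is
/// `SubmitMpcDataAnnouncementResponse`.
#[derive(Debug, PartialEq, Eq)]
pub enum SubmitResponse {
    Accepted,
    Rejected,
}

/// Stand-in for the `AnnouncementRelay` trait.
pub trait AnnouncementRelay: Send + Sync {
    fn relay(&self, announcement: &[u8]) -> SubmitResponse;
}

/// Late-binding holder. Uses RwLock instead of ArcSwapOption (a swap made
/// only so the sketch needs no external crate).
#[derive(Default)]
pub struct RelayHandle {
    inner: RwLock<Option<Arc<dyn AnnouncementRelay>>>,
}

impl RelayHandle {
    /// Installed at epoch start, re-installed across epoch boundaries.
    pub fn install(&self, relay: Arc<dyn AnnouncementRelay>) {
        *self.inner.write().unwrap() = Some(relay);
    }

    pub fn clear(&self) {
        *self.inner.write().unwrap() = None;
    }

    /// Server-side entry point: rejects while nothing is installed so that
    /// joiners retry, and dispatches to the active relay otherwise.
    pub fn handle_submit(&self, announcement: &[u8]) -> SubmitResponse {
        // Clone the Arc out so the lock is not held during the relay call.
        let relay = self.inner.read().unwrap().clone();
        match relay {
            Some(r) => r.relay(announcement),
            None => SubmitResponse::Rejected,
        }
    }
}

/// Trivial relay used only to exercise the handle.
pub struct AcceptAll;

impl AnnouncementRelay for AcceptAll {
    fn relay(&self, _announcement: &[u8]) -> SubmitResponse {
        SubmitResponse::Accepted
    }
}
```

The server can be constructed before any epoch store exists because it only holds the handle; the relay implementation arrives later through `install`.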
Replaces the placeholder next-epoch branch in `record_validator_mpc_data_announcement` with real signature verification gated on a `JoinerPubkeyProvider`.
`JoinerPubkeyProvider::is_registered_joiner(&AuthorityName) -> bool` is the trait the Sui-backed lookup will implement; a future step populates it from `validator_set.pending_active_set` plus each entry's `StakingPool.validator_info`'s next-epoch pubkey. Until that lands, `joiner_pubkey_provider` is unset and all next-epoch announcements drop; current-epoch flow is unchanged.
`verify_joiner_announcement` is a pure helper (caller passes `expected_epoch` and the provider). The per-epoch-store method calls it and reacts to the four-way verdict (Accept/UnregisteredJoiner/InvalidSignature/InconsistentEnvelope); only `Accept` proceeds to the latest-by-timestamp insert rule.
The provider is held in an `ArcSwapOption` on `AuthorityPerEpochStore`, swappable across epoch boundaries via `install_joiner_pubkey_provider` / `clear_joiner_pubkey_provider`. `AuthorityName == AuthorityPublicKeyBytes`, so the verifier uses `signed.auth_sig.authority` as the pubkey directly; the provider only authorizes *which* names are joinable.
Tests cover Accept, UnregisteredJoiner, InvalidSignature (tampered blob hash), InconsistentEnvelope (wrong epoch + authority field mismatch), and `StaticJoinerPubkeyProvider` membership semantics.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 148.28s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Lands the canonical, off-chain handoff attestation primitives behind the next-step record/persist plumbing. These are the building blocks each validator runs locally at EndOfPublish (builder + signer) and that every validator runs on incoming consensus signatures (verifier + aggregator).
- `build_handoff_attestation`: sorts items strictly ascending by `HandoffItemKey` (the wire format is a Vec, not a map, so the sort defines the canonical bytes every signer commits to); rejects duplicate keys.
- `hash_next_committee_pubkey_set`: dedup + sort + BCS-encode + Blake2b256 over the next committee's pubkey set. This goes in the attestation header, so verifiers can confirm the cert is bound to the committee they're handing off to.
- `sign_handoff_attestation`: Ed25519 over `bcs(IntentMessage::new(HandoffAttestation, attestation))`, signed with the validator's *consensus* key, NOT BLS. (Joiners look up signers' consensus pubkeys in the prior committee's on-chain validator info.)
- `ConsensusPubkeyProvider` trait + `StaticConsensusPubkeyProvider` for the consensus-pubkey lookup, mirroring the joiner-provider shape from step 6.
- `verify_handoff_signature` returns a four-way verdict (Accept/UnknownSigner/InvalidSignature/AttestationMismatch).
- `HandoffAggregator`: one-shot stake-weighted aggregator that emits `CertifiedHandoffAttestation` the first time signers cross `committee.quorum_threshold()`. Replacements don't double-count; non-committee signers are silently dropped (the consensus path also rejects them at the dispatch site, but the aggregator is defense-in-depth).
- `verify_certified_handoff_attestation`: standalone re-verify against a committee + provider; this is what joiners run during bootstrap on the cert they fetched.
Tests cover sort canonicalization, duplicate-key rejection, pubkey-set hash invariance under reorder and dedup, sign+verify round trip with the four verdict outcomes, aggregator quorum crossing, replacement no-op, non-committee signer no-op, and end-to-end certify-then-re-verify-with-tampered-sig.
Record / persist / EndOfPublish-trigger wiring land in follow-on commits; these helpers are isolated and consumed at those sites.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 143.26s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
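To make the canonicalization step concrete, the sketch below sorts items strictly ascending by key and rejects duplicates, so that every signer would commit to the same byte order. It is an assumption-laden illustration: `ItemKey` mirrors the variant order listed for `HandoffItemKey`, but the payload types (`u64`, `String`) and the function name `canonical_items` are hypothetical.

```rust
/// Hypothetical mirror of `HandoffItemKey`. The derived `Ord` compares by
/// variant order first, then by payload. Payload types are assumptions.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum ItemKey {
    NetworkDkgOutput(u64),
    NetworkReconfigurationOutput(u64),
    ValidatorMpcData(String),
}

/// Sorts items strictly ascending by key and rejects duplicate keys.
/// On a duplicate, returns the offending key as the error.
pub fn canonical_items(
    mut items: Vec<(ItemKey, [u8; 32])>,
) -> Result<Vec<(ItemKey, [u8; 32])>, ItemKey> {
    items.sort_by(|a, b| a.0.cmp(&b.0));
    // After sorting, duplicates are adjacent, so one pass over neighbouring
    // pairs is enough to enforce "strictly ascending".
    for pair in items.windows(2) {
        if pair[0].0 == pair[1].0 {
            return Err(pair[0].0.clone());
        }
    }
    Ok(items)
}
```

Because the output order depends only on the keys, two validators that collect the same items in different orders end up with identical lists.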
Wires the consensus dispatch path for `HandoffSignature` to verify, persist, and aggregate incoming Ed25519 signatures over the epoch's handoff attestation.
Per-epoch state on `AuthorityPerEpochStore`:
- `handoff_signatures: DBMap<AuthorityName, Ed25519Signature>`: durable record of each verified signer's sig. Replays are no-ops via typed-store insert semantics.
- `expected_handoff_attestation: ArcSwapOption<HandoffAttestation>`: this validator's locally-computed attestation, installed by the producer side once mpc_data is frozen + DKG/reconfig digests are known. Until installed, incoming signatures drop silently (`AttestationMismatch` is the only possible verdict).
- `consensus_pubkey_provider: ArcSwapOption<...>`: Ed25519 lookup for signer pubkeys, populated by the same sui_syncer task that feeds the joiner provider.
- `handoff_aggregator: Mutex<Option<HandoffAggregator>>`: in-memory stake accumulator. Rebuilt from persisted signatures when the expected attestation is (re)installed, so restart replay folds prior consensus-ordered signatures back in correctly.
New pure helper in `validator_metadata`:
- `process_handoff_signature` runs `verify_handoff_signature` and, on `Accept`, inserts into the aggregator. Returns one of `Recorded`, `Certified(cert)`, or `Rejected(verdict)`. Three new unit tests cover quorum-crossing, attestation mismatch, and unknown-signer paths.
`PartialEq`/`Eq` added to `HandoffSignatureMessage` and `CertifiedHandoffAttestation` so the record-outcome enum can derive those traits for tests.
Consensus dispatch: the `HandoffSignature` arm now calls `record_handoff_signature`. The returned cert (when quorum just crossed) is intentionally dropped on the floor for now; the perpetual-persist plumbing (step 7c) hangs off a dedicated drain task that pulls from the in-memory aggregator. Dropping is safe because the *next* ordered signature crossing quorum still mints a cert, and restart-replay rebuilds the aggregator.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 142.08s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
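A stake-weighted aggregator with the properties listed for `HandoffAggregator` (no double-counting on replacement, non-committee signers ignored, certificate emitted on the first quorum crossing) might be sketched as below. Signature verification is assumed to have happened already; `Aggregator` and `Outcome` are hypothetical names, and the strictly one-shot emission is an assumption taken from the step-7 description.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Signer is not a committee member; nothing recorded.
    Ignored,
    /// Signature recorded; no certificate emitted by this call.
    Recorded,
    /// Quorum crossed on this call: the collected (signer, signature) pairs.
    Certified(Vec<(String, Vec<u8>)>),
}

/// Hypothetical one-shot, stake-weighted aggregator over verified signatures.
pub struct Aggregator {
    stake: BTreeMap<String, u64>,
    quorum_threshold: u64,
    signatures: BTreeMap<String, Vec<u8>>,
    certified: bool,
}

impl Aggregator {
    pub fn new(stake: BTreeMap<String, u64>, quorum_threshold: u64) -> Self {
        Self { stake, quorum_threshold, signatures: BTreeMap::new(), certified: false }
    }

    pub fn insert(&mut self, signer: &str, signature: Vec<u8>) -> Outcome {
        if !self.stake.contains_key(signer) {
            return Outcome::Ignored;
        }
        // Keyed by signer, so a replacement overwrites rather than adds stake.
        self.signatures.insert(signer.to_string(), signature);
        let total: u64 = self.signatures.keys().map(|s| self.stake[s]).sum();
        if !self.certified && total >= self.quorum_threshold {
            self.certified = true;
            let pairs = self.signatures.iter().map(|(k, v)| (k.clone(), v.clone())).collect();
            return Outcome::Certified(pairs);
        }
        Outcome::Recorded
    }
}
```

Rebuilding such an aggregator after a restart amounts to replaying the persisted signatures through `insert` in their recorded order.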
Closes the handoff write path: once `record_handoff_signature`'s in-memory aggregator crosses quorum, the resulting `CertifiedHandoffAttestation` is immediately persisted into a keep-forever perpetual table.
`AuthorityPerpetualTables`:
- New `certified_handoff_attestations: DBMap<EpochId, CertifiedHandoffAttestation>` table, keyed by the epoch the outgoing committee is handing off *from*.
- `insert_certified_handoff_attestation`, `get_certified_handoff_attestation`, `iter_certified_handoff_attestations` accessors.
The handoff feedback rule (keep certs forever) is load-bearing because a joiner pulling history may need to verify the chain back to whichever cert it has a trusted committee for; skipping any single epoch's cert would permanently break their ability to bootstrap.
`AuthorityPerEpochStore` gains `perpetual_tables_for_handoff: ArcSwapOption<...>` plus `install_perpetual_tables_for_handoff`. `ika-node` installs the perpetual handle directly after constructing the epoch store, so the very first cert produced by consensus lands on disk. When nothing is installed (e.g. unit tests that don't wire perpetual), the record path logs at debug level and keeps going; the cert stays in the in-memory aggregator and joiner-bootstrap consumers will simply miss it.
The `Certified` arm of `record_handoff_signature` now also performs the perpetual write, with the persist failure logged (not propagated); failing the entire consensus-dispatch path on a perpetual-DB hiccup would be far worse than a missing cert.
Tests: 3 new perpetual-table unit tests cover insert/get roundtrip, ordered iteration across epochs, and byte-level idempotency on identical re-writes.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 141.68s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the producer half of the handoff loop: when this validator reaches EndOfPublish, the same task that submits its `EndOfPublish` consensus transaction also builds, installs, signs, and submits its `HandoffSignatureMessage` for the epoch, exactly once.
The trigger pipeline:
1. `compute_handoff_items` (pure): combines frozen mpc_data set + per-network-key DKG output digests + per-network-key reconfig output digests into a sorted Vec<(HandoffItemKey, [u8;32])>. Empty inputs are valid (yields an empty list); this matters because DKG/reconfig digest caching is step 9, and the attestation needs to be signable before then.
2. `AuthorityPerEpochStore::build_local_handoff_attestation`: reads the frozen set, hashes the supplied next-committee pubkey set, calls compute_handoff_items, and builds a well-formed attestation.
3. `AuthorityPerEpochStore::build_local_handoff_signature_transaction`: installs the attestation locally (so the per-epoch record path accepts matching peer signatures), signs it with the consensus key, and wraps it in a `ConsensusTransaction`.
4. `EndOfPublishSender` is upgraded to take the consensus keypair (Arc) + a `Receiver<Committee>` for the next epoch, plus an `AtomicBool` one-shot flag. The handoff submit happens after the EndOfPublish submit on the same tick.
Determinism across validators: identical inputs → identical attestation bytes → matching signatures. The frozen set is already agreed (step 4's quorum freeze); the next-committee pubkey set is read from chain. Until step 9 populates DKG/reconfig digests, every validator computes an attestation with those slots empty, which is still agreed.
The handoff record path (step 7b) was already wired to consume these signatures, and the perpetual persist (step 7c) writes the cert as soon as quorum is reached. With this commit, the cycle runs end-to-end given an actual EndOfPublish trigger.
Tests: 2 new unit tests cover `compute_handoff_items` sorting + empty-input semantics, in addition to the 19 existing helper tests.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 144.29s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the read side that closes the handoff loop: peers can pull a
`CertifiedHandoffAttestation` for any persisted epoch over a new
`ValidatorMetadata::GetCertifiedHandoffAttestation` RPC, and joiners
have a single-hop verification helper that binds the cert to the
specific committee they're trying to join.
Network layer:
- New `GetCertifiedHandoffAttestationRequest { epoch }` wire type.
- New `HandoffCertStorage` trait — the read-only counterpart to
the perpetual store. Server holds an `Arc<C: HandoffCertStorage>`
alongside the existing blob store.
- `ValidatorMetadataServer` is now `Server<S, C>`; the
`build_server(storage, relay, cert_storage)` signature gained the
`cert_storage` arg.
- Joiner-side `fetch_certified_handoff_attestation(network, peer,
epoch)` mirrors the existing `fetch_blob`.
Adapter:
- `AuthorityPerpetualTables` implements `HandoffCertStorage` by
delegating to `get_certified_handoff_attestation` and logging
(not propagating) a perpetual-read error as `None`. The Anemo
hot path can't surface a typed error usefully.
ika-node:
- The perpetual handle is now passed into `build_server` so peers
immediately see every cert that lands on disk (via step 7c's
perpetual persist). No additional installation needed because
`AuthorityPerpetualTables` is constructed eagerly at startup.
Joiner bootstrap helper in `ika-core::validator_metadata`:
- `verify_joiner_bootstrap_cert(cert, prior_committee, prior_
consensus_pubkeys, expected_next_committee_pubkeys)` runs the
full check: pubkey-set-hash binding (so a malicious peer can't
hand a real cert for a different committee), then delegates to
the existing `verify_certified_handoff_attestation` for the
signature/stake check. One-hop only — joiners verify against
the *prior* committee, not back to genesis. (Per handoff design
memo: anchoring trust to the prior committee is sufficient since
the joiner gets there through earlier hops they either already
trust or are themselves bootstrapping from a known anchor.)
Tests: 1 new unit test exercising both the happy path and the
pubkey-set-mismatch refusal.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.31s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Populates the producer-side caches that feed the handoff attestation's `NetworkDkgOutput` / `NetworkReconfigurationOutput` items.
`AuthorityPerEpochStoreTrait` gains two methods, called from the MPC producer at the exact point it builds the consensus output:
- `cache_network_dkg_output(key_id, output_bytes)`
- `cache_network_reconfiguration_output(key_id, output_bytes)`
Concrete `AuthorityPerEpochStore` impl:
- Hashes `output_bytes` to Blake2b256 (matching `mpc_data_blob_hash`'s function so peers can fetch this blob over the existing `GetMpcDataBlob` RPC).
- Writes the digest into one of two new per-epoch tables, `network_dkg_output_digests` or `network_reconfiguration_output_digests`, keyed by `dwallet_network_encryption_key_id`.
- Writes the blob bytes into perpetual `mpc_artifact_blobs` (if the perpetual handle is installed) so cross-restart serves work for free.
- All writes are idempotent on byte-identical replays.
`build_local_handoff_attestation` no longer takes the digest maps as parameters; it reads them straight off the per-epoch store. `EndOfPublishSender::send_handoff_signature` is updated to match.
Producer hook: `DWalletMPCService::new_dwallet_mpc_output`'s User/System branch calls the trait methods for the DKG and reconfig protocols (`!rejected` only; rejected outputs are empty and shouldn't pollute the cache). Cache failures are logged, not propagated: they don't fail the consensus output emit, just degrade peer serveability.
`TestingAuthorityPerEpochStore` gets no-op impls; the integration test gate doesn't exercise attestation contents so an in-memory mirror isn't needed.
Tests: 2 new unit tests cover the per-epoch table semantics: digest roundtrip + replay idempotency, and independence of the DKG vs reconfig caches when keyed by the same key_id.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 141.54s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the per-network-key counterpart to `EpochMpcDataReadySignal`.
Validators can now signal readiness for a specific network key's
DKG (`NetworkKeyDKGReadySignal { authority, network_key_id,
epoch }`) earlier than the epoch-wide signal, because per-key
readiness is a narrower commitment — the validator only needs the
mpc_data required for *this* key, not all reconfig sessions.
Per-epoch state:
- `network_key_dkg_ready_signals: DBMap<(ObjectID, AuthorityName),
()>` — per-key, per-authority votes. Composite key keeps quorums
scoped: the same authority signaling readiness for two keys
produces two independent entries.
Record path:
- `record_network_key_dkg_ready_signal` is idempotent on replays.
Quorum is per-key (sum stake of all authorities that signaled
for `signal.network_key_id`). The first quorum of *any* signal
kind — epoch-wide or per-key — calls `freeze_mpc_data_if_first`,
which is already idempotent on a non-empty frozen set. Per-key
quorums after that point are still recorded (DKG kickoff per key
consumes them) but don't re-freeze.
- `has_network_key_dkg_ready_quorum(network_key_id)` exposes the
per-key quorum state for step 14's session-kickoff gating.
Consensus wiring:
- New `ConsensusTransactionKind::NetworkKeyDKGReadySignal` +
matching `ConsensusTransactionKey` variant.
- `new_network_key_dkg_ready_signal` constructor.
- Sender-authority check at verification time (consensus binding
is the only authentication; no payload signature).
- Metric label + validator pass-through arms.
Producer helper:
- `build_network_key_dkg_ready_signal_transaction(authority,
network_key_id, epoch)` wraps a signal in a
`ConsensusTransaction` ready for submission.
Tests: 1 new unit test on `AuthorityEpochTables`'s
`network_key_dkg_ready_signals` table covers composite-key
scoping + replay idempotency.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.54s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
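The composite-key scoping from the commit above (one vote per key and authority, quorum summed per key) can be illustrated with a small in-memory sketch. `ReadySignals` is a hypothetical stand-in for the `network_key_dkg_ready_signals` table; key IDs are plain `u64` here rather than `ObjectID`, and authorities are strings.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory stand-in for the per-key ready-signal table.
pub struct ReadySignals {
    stake: BTreeMap<String, u64>,
    quorum_threshold: u64,
    /// Composite key (network_key_id, authority): one vote per pair.
    votes: BTreeSet<(u64, String)>,
}

impl ReadySignals {
    pub fn new(stake: BTreeMap<String, u64>, quorum_threshold: u64) -> Self {
        Self { stake, quorum_threshold, votes: BTreeSet::new() }
    }

    /// Returns `true` if the vote is new; replays and non-committee
    /// senders return `false` and change nothing.
    pub fn record(&mut self, network_key_id: u64, authority: &str) -> bool {
        if !self.stake.contains_key(authority) {
            return false;
        }
        self.votes.insert((network_key_id, authority.to_string()))
    }

    /// Sums the stake of every authority that signaled for this key only.
    pub fn has_quorum(&self, network_key_id: u64) -> bool {
        let total: u64 = self
            .votes
            .iter()
            .filter(|(key, _)| *key == network_key_id)
            .map(|(_, authority)| self.stake.get(authority).copied().unwrap_or(0))
            .sum();
        total >= self.quorum_threshold
    }
}
```

Because the key ID is part of the entry, an authority signaling for two keys contributes to two separate tallies and never inflates either one.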
Filters the frozen mpc_data input set down to the union of the current and next committees before it's consumed by handoff cert build (and, in step 14, reconfig MPC). Validators who announced mpc_data this epoch but withdrew before next_committee was selected get dropped: the cert no longer pins their entries and reconfig MPC won't allocate work for them.
`compute_effective_reconfig_input_set(frozen, current, next) -> BTreeMap<AuthorityName, [u8;32]>` is the pure helper; it intersects with the union of both committee membership lists. Both committee inputs are `IntoIterator` so callers can hand it whatever shape they already have (Vec, &[..], `voting_rights` iter).
`AuthorityPerEpochStore::get_effective_reconfig_input_set` reads the frozen set and the current committee from the store and delegates to the pure helper. `build_local_handoff_attestation` now goes through this method instead of pulling `frozen` raw, so cert items reflect the effective set.
Tests: 2 new unit tests cover the intersection semantics: a four-author scenario where staying members, joiners, and withdrawers each take their expected path through the filter, plus the degenerate case where no announcer overlaps the committees.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 143.88s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
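The filter described in this commit is a plain intersection of the frozen set with the union of two membership lists. A minimal sketch, assuming authorities are strings and using a hypothetical function name (`compute_effective_input_set`) rather than the PR's exact signature:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Keeps only the frozen entries whose authority belongs to the current
/// or the next committee. Both committee inputs accept any iterable.
pub fn compute_effective_input_set<I, J>(
    frozen: &BTreeMap<String, [u8; 32]>,
    current: I,
    next: J,
) -> BTreeMap<String, [u8; 32]>
where
    I: IntoIterator<Item = String>,
    J: IntoIterator<Item = String>,
{
    // Union of both membership lists.
    let members: BTreeSet<String> = current.into_iter().chain(next).collect();
    frozen
        .iter()
        .filter(|(name, _)| members.contains(*name))
        .map(|(name, hash)| (name.clone(), *hash))
        .collect()
}
```

An announcer that appears in neither committee (for example, a would-be joiner that withdrew) is dropped, while members of either committee keep their entries.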
Adds the read-side abstraction that lets the sui_syncer prefer locally-cached protocol output blobs over the chain blobs when assembling `DWalletNetworkEncryptionKeyData`. The lightweight fields (id, current_epoch, dkg_at_epoch, state) always come from chain, since those are authoritative, but the large `network_dkg_public_output` and `current_reconfiguration_public_output` blobs can come from the local content-addressed cache populated by step 9's producer caching.
New in `ika-core::validator_metadata`:
- `NetworkKeyBlobSource` trait: `network_dkg_output_blob(key_id)` and `network_reconfiguration_output_blob(key_id)`, both returning `Option<Vec<u8>>`. `None` means "fall back to chain".
- `StaticNetworkKeyBlobSource`: empty-by-default in-memory impl, used by tests and as the typed-empty default.
- `fetch_network_key_data_with_off_chain_blobs(chain_data, source) -> DWalletNetworkEncryptionKeyData`: takes the chain copy, overlays each large blob from `source` if present.
`AuthorityPerEpochStore` implements `NetworkKeyBlobSource` by looking up the per-epoch digest cache from step 9 (`network_dkg_output_digests` / `network_reconfiguration_output_digests`) and then fetching the blob bytes from the perpetual `mpc_artifact_blobs` store. A missing digest *or* a missing blob returns `None`; every step in the chain has the chain fallback behind it.
Syncer wiring (replacing the chain-read in `sui_syncer::sync_dwallet_network_keys` with the wrapper) is the next commit; this one lays the infrastructure.
Tests: 2 new unit tests cover the overlay semantics: partial overlay (DKG from source, reconfig from chain) and the all-fall-back case where the source is empty and the merged data equals the chain copy byte-for-byte.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 142.76s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
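The overlay-with-fallback behaviour can be sketched as follows. The names are simplified, hypothetical stand-ins (`BlobSource`, `StaticBlobSource`, `KeyData`, `overlay_off_chain_blobs`), key IDs are `u64`, and only two lightweight fields are modeled; the point is only that `None` from the source leaves the chain copy in place.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `NetworkKeyBlobSource`; `None` means
/// "fall back to chain".
pub trait BlobSource {
    fn dkg_output_blob(&self, key_id: u64) -> Option<Vec<u8>>;
    fn reconfiguration_output_blob(&self, key_id: u64) -> Option<Vec<u8>>;
}

/// Empty-by-default in-memory source.
#[derive(Default)]
pub struct StaticBlobSource {
    pub dkg: HashMap<u64, Vec<u8>>,
    pub reconfig: HashMap<u64, Vec<u8>>,
}

impl BlobSource for StaticBlobSource {
    fn dkg_output_blob(&self, key_id: u64) -> Option<Vec<u8>> {
        self.dkg.get(&key_id).cloned()
    }
    fn reconfiguration_output_blob(&self, key_id: u64) -> Option<Vec<u8>> {
        self.reconfig.get(&key_id).cloned()
    }
}

/// Hypothetical, reduced version of the network key data.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct KeyData {
    pub id: u64,
    pub current_epoch: u64,
    pub dkg_output: Vec<u8>,
    pub reconfig_output: Vec<u8>,
}

/// Lightweight fields always come from the chain copy; each large blob is
/// taken from `source` when present and from the chain copy otherwise.
pub fn overlay_off_chain_blobs(chain: KeyData, source: &dyn BlobSource) -> KeyData {
    let KeyData { id, current_epoch, dkg_output, reconfig_output } = chain;
    KeyData {
        id,
        current_epoch,
        dkg_output: source.dkg_output_blob(id).unwrap_or(dkg_output),
        reconfig_output: source.reconfiguration_output_blob(id).unwrap_or(reconfig_output),
    }
}
```

With an empty source the function returns the chain copy unchanged, which matches the all-fall-back test case described in the commit.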
Adds the off-chain assembler for the load-bearing
`Committee.class_groups_public_keys_and_proofs` map — the
HashMap reconfig MPC reads to find each committee member's
class-groups encryption key + correctness proof. The new path
decodes blobs locally from the perpetual `mpc_artifact_blobs`
store, keyed by digests pinned in the validators'
`ValidatorMpcDataAnnouncement`s.
The completion gate (per the design memo) is strict:
`assemble_committee_class_groups_off_chain` returns
`OffChainClassGroupsAssembly::Complete(map)` *only* when every
supplied authority resolved successfully — blob found, BCS-
decoded to `VersionedMPCData`, inner bytes decoded to
`ClassGroupsEncryptionKeyAndProof`. Even one missing or
malformed entry forces `Incomplete { missing: [...] }`, and the
caller must fall back to the chain-read path.
Why strict: reconfig MPC reads
`Committee.class_groups_public_keys_and_proofs[authority]`
directly, and a missing/empty entry silently drops that
validator's share without aborting. The existing chain-read path
in `sui_syncer::new_committee` already has this footgun (a
`filter_map` that swallows decode errors per-validator); the
off-chain path *must not* repeat it. Hence: all-or-nothing.
Wiring `sui_syncer::new_committee` to try off-chain first and
fall back on `Incomplete` is the next commit; this commit lands
the pure assembler.
Tests: 3 new unit tests cover (a) the happy path — two seeded
blobs round-trip through `derive_mpc_data_blob` →
`mpc_data_blob_hash` → an in-memory store → assembly back into
the map; (b) missing-blob aborts with the missing authority
listed; (c) corrupt-blob (bytes don't decode as
`VersionedMPCData`) also aborts.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.26s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
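The all-or-nothing gate from the commit above can be sketched like this. It is an illustration, not the PR's assembler: `assemble` and `Assembly` are hypothetical names, and `decode_stub` is a made-up decoder (first byte `1` marks a valid blob) standing in for the real BCS decoding of `VersionedMPCData`.

```rust
use std::collections::{BTreeMap, HashMap};

#[derive(Debug, PartialEq, Eq)]
pub enum Assembly {
    /// Every authority resolved: blob found and decoded.
    Complete(BTreeMap<String, Vec<u8>>),
    /// At least one authority failed; the caller falls back to the chain read.
    Incomplete { missing: Vec<String> },
}

/// Made-up decoder used only for this sketch: a blob is "valid" when its
/// first byte is 1, and the remainder is the decoded key material.
pub fn decode_stub(bytes: &[u8]) -> Option<Vec<u8>> {
    match bytes.split_first() {
        Some((&1, rest)) => Some(rest.to_vec()),
        _ => None,
    }
}

/// Resolves each pinned digest to a blob and decodes it. Unlike a
/// `filter_map`, failures are collected and reported, never silently skipped.
pub fn assemble<F>(
    pinned: &BTreeMap<String, [u8; 32]>,
    blobs: &HashMap<[u8; 32], Vec<u8>>,
    decode: F,
) -> Assembly
where
    F: Fn(&[u8]) -> Option<Vec<u8>>,
{
    let mut map = BTreeMap::new();
    let mut missing = Vec::new();
    for (authority, digest) in pinned {
        match blobs.get(digest).and_then(|b| decode(b.as_slice())) {
            Some(key) => {
                map.insert(authority.clone(), key);
            }
            None => missing.push(authority.clone()),
        }
    }
    if missing.is_empty() {
        Assembly::Complete(map)
    } else {
        Assembly::Incomplete { missing }
    }
}
```

Returning the list of unresolved authorities gives the caller something to log before it falls back, without ever handing a partial map to reconfig MPC.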
DKG and reconfig sessions now wait on the off-chain mpc_data freeze before instantiating. Honest validators that observe the chain event before the consensus-side freeze quorum lands park the request and retry on every subsequent batch cycle until the gate opens.
Gate conditions, evaluated against the per-epoch store:
- `NetworkEncryptionKeyDkg(key_id)` requires `is_mpc_data_frozen() && has_network_key_dkg_ready_quorum(key_id)`. Per-key quorum makes a stronger commitment than the epoch-wide signal: it certifies that this *specific* key has enough peers ready to actually participate.
- `NetworkEncryptionKeyReconfiguration(_)` requires only `is_mpc_data_frozen()`. Reconfig sweeps every key the validator knows about; a per-key gate would deadlock if the per-key quorum needed reconfig output for kickoff.
- Everything else (user DKG, presign, sign, etc.) is unaffected.
`AuthorityPerEpochStoreTrait` gains the two query methods `is_mpc_data_frozen` and `has_network_key_dkg_ready_quorum`, implemented concretely against `frozen_validator_mpc_data_input_set` and `network_key_dkg_ready_signals` respectively. The previously inherent-only `has_network_key_dkg_ready_quorum` is gone; it's now exclusively a trait method.
`TestingAuthorityPerEpochStore`'s impls return `Ok(true)` for both: integration tests don't drive the freeze flow end-to-end and would otherwise deadlock at the gate. Production builds use the real store where these reflect actual consensus-observed state.
In the manager, a new `requests_pending_for_frozen_mpc_data: Vec<DWalletSessionRequest>` queue mirrors the existing pending queues. Drained at the top of every `handle_mpc_request_batch` by re-running each request through `handle_mpc_request`. Requests that don't pass get re-queued; those that do proceed through the existing kickoff path.
Made `DWalletMPCManager.epoch_store` `pub(crate)` so the gate check in `mpc_session.rs` can reach it.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 144.14s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
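The park-and-retry behaviour of the kickoff gate in this commit can be sketched as below. `Request`, `Gate`, and `Manager` are hypothetical simplifications: the gate is a plain struct rather than a per-epoch store query, and "starting" a session just appends to a list.

```rust
/// Hypothetical, reduced request kinds.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Request {
    NetworkDkg(u64),
    Reconfig(u64),
    Other(u64),
}

/// Snapshot of the two gate inputs (stand-in for the per-epoch store queries).
pub struct Gate {
    pub frozen: bool,
    pub ready_keys: Vec<u64>,
}

impl Gate {
    pub fn allows(&self, request: &Request) -> bool {
        match request {
            // Network DKG needs the freeze and the per-key quorum.
            Request::NetworkDkg(key) => self.frozen && self.ready_keys.contains(key),
            // Reconfig needs only the freeze.
            Request::Reconfig(_) => self.frozen,
            // Everything else is unaffected by the gate.
            Request::Other(_) => true,
        }
    }
}

#[derive(Default)]
pub struct Manager {
    pub pending: Vec<Request>,
    pub started: Vec<Request>,
}

impl Manager {
    /// Drains the parked queue first, then handles new requests; anything
    /// the gate still refuses is parked again for the next batch.
    pub fn handle_batch(&mut self, gate: &Gate, incoming: Vec<Request>) {
        let parked = std::mem::take(&mut self.pending);
        for request in parked.into_iter().chain(incoming) {
            if gate.allows(&request) {
                self.started.push(request);
            } else {
                self.pending.push(request);
            }
        }
    }
}
```

Taking the parked queue with `std::mem::take` before re-evaluating avoids re-processing a request twice within one batch.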
Adds the producer-side task without which the off-chain freeze quorum can never be reached, leaving step 14's kickoff gate permanently closed and stalling network DKG / reconfig.
The new `MpcDataAnnouncementSender` (sibling of `EndOfPublishSender` under `sui_connector`) runs once per epoch per validator and:
1. Derives the canonical class-groups `mpc_data` blob from the validator's `RootSeed` (via `derive_mpc_data_blob`, identical bytes to what the CLI submits on chain).
2. Persists the blob into perpetual `mpc_artifact_blobs` so peers can fetch it by digest over the existing `GetMpcDataBlob` RPC.
3. Signs and submits a `ValidatorMpcDataAnnouncement` over consensus. Submission is idempotent: replays use the latest-by-timestamp rule.
4. After its own announcement is in, submits an `EpochMpcDataReadySignal`, one of two signal types whose quorum drives `freeze_mpc_data_if_first`.
5. Submits `NetworkKeyDKGReadySignal` for every known network key (deduped via a `HashSet`).
Each of (3), (4), (5) is gated by its own one-shot flag plus ack-on-success, so a transient consensus-adapter failure causes a retry on the next tick (every 2s) rather than blowing up the task.
Step-14 gate softened to match the design memo's "first quorum of either signal type freezes mpc_data": DKG kickoff now only requires `is_mpc_data_frozen()`, same as reconfig. The per-key signal stays as an alternate freeze trigger but isn't a separate hard requirement, since the sui_syncer skips `AwaitingNetworkDKG` keys from the network-keys snapshot, meaning the producer task can't observe a fresh DKG-target key to signal for until *after* DKG completes, which would deadlock.
Wired from `ika-node::monitor_reconfiguration` alongside `EndOfPublishSender`. `AuthorityState::perpetual_tables()` added to expose the perpetual handle without making the field public. The aborted-on-epoch-end pattern follows `end_of_publish_sender_handle`.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passed (1 test, 143.64s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
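The "one-shot flag plus ack-on-success" pattern mentioned in this commit can be illustrated with a small helper. `OneShot` and `run_once` are hypothetical names; the sketch is synchronous, whereas the real task is presumably async and tick-driven.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical one-shot guard: the flag is set only after the submit
/// closure reports success, so a failed attempt is retried on a later tick.
pub struct OneShot {
    done: AtomicBool,
}

impl OneShot {
    pub fn new() -> Self {
        Self { done: AtomicBool::new(false) }
    }

    /// Returns `Ok(true)` if this call performed the submit, `Ok(false)` if
    /// the step had already succeeded earlier, and `Err(e)` if the submit
    /// failed (the flag stays unset in that case).
    pub fn run_once<E>(&self, submit: impl FnOnce() -> Result<(), E>) -> Result<bool, E> {
        if self.done.load(Ordering::Acquire) {
            return Ok(false);
        }
        submit()?;
        self.done.store(true, Ordering::Release);
        Ok(true)
    }
}
```

A periodic loop would call `run_once` for each step on every tick; steps that already succeeded become cheap no-ops, and a transient failure surfaces as an `Err` the loop can log before trying again.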
Lights up step 6's joiner verify path by installing a
`StaticJoinerPubkeyProvider` on the current epoch store, sourced
from the next-epoch committee snapshot already kept live by
`sui_syncer::sync_next_committee` and exposed via
`next_epoch_committee_receiver`. Without this, every next-epoch
(joiner) `ValidatorMpcDataAnnouncement` drops silently because the
provider field is `None` by default.
The new per-epoch `JoinerPubkeyProviderUpdater` task watches the
receiver, computes the joiner set as `V_{e+1}.voting_rights`'s
authority names, and calls
`AuthorityPerEpochStore::install_joiner_pubkey_provider`. Since
`AuthorityName == AuthorityPublicKeyBytes`, the BLS sig verify in
`verify_joiner_announcement` runs against the announcer's claimed
authority directly — no separate pubkey lookup needed.
Idempotent: `last_installed` cache short-circuits re-installation
when the underlying set is byte-identical to the last one we
installed.
This is a *simplification* of the design memo's "verify against
PendingActiveSet" prescription: we wait until V_{e+1} is selected
on chain instead of reading `PendingActiveSet` directly. Trade-off
— joiners can't announce earlier than V_{e+1} selection, but
reading the `ExtendedField` for PendingActiveSet would require a
new Sui dynamic-field plumbing path that isn't justified for v1.
Early-announce can be added later if join-latency becomes a real
concern.
Spawned alongside the producer task in
`monitor_reconfiguration`; aborted on epoch end via the same
pattern as `end_of_publish_sender_handle`.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 271.18s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
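The idempotent install from the commit above (skip re-installation when the joiner set is unchanged) might look like the sketch below. The 4-byte `AuthorityName`, the `installs` counter, and the `observe` method are assumptions made for illustration; the real updater compares the set it derived from the next-epoch committee against its `last_installed` cache.

```rust
use std::collections::BTreeSet;

/// Hypothetical short stand-in for the real authority public-key bytes.
type AuthorityName = [u8; 4];

#[derive(Default)]
struct JoinerProviderUpdater {
    last_installed: Option<BTreeSet<AuthorityName>>,
    /// Counts real installs; stands in for the call that installs the
    /// provider on the epoch store.
    installs: usize,
}

impl JoinerProviderUpdater {
    /// Called whenever the next-epoch committee receiver yields a value.
    /// Returns true only when a new provider was installed.
    fn observe(&mut self, next_committee: &[AuthorityName]) -> bool {
        let joiners: BTreeSet<AuthorityName> = next_committee.iter().copied().collect();
        if self.last_installed.as_ref() == Some(&joiners) {
            return false; // identical set: short-circuit
        }
        self.installs += 1;
        self.last_installed = Some(joiners);
        true
    }
}
```

Collecting into a `BTreeSet` makes the comparison independent of the order in which the committee members were listed.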
Closes the verify side of step 7's handoff loop. Without this, the
`ConsensusPubkeyProvider` field stays `None` and every incoming
`HandoffSignatureMessage` drops as `UnknownSigner`, meaning no peer's
signature ever counts toward the aggregator's quorum and the cert never
gets minted.
The new `ConsensusPubkeyProviderUpdater` task fetches the current
committee's `StakingPool.validator_info.consensus_pubkey_bytes`
directly via `sui_client.get_system_inner()` →
`active_committee.members` → `get_validators_info_by_ids` →
`verify().consensus_pubkey`. The result is mapped
`AuthorityName -> Ed25519PublicKey` and installed as a
`StaticConsensusPubkeyProvider` on the per-epoch store.
Cadence: 15s (consensus pubkey is fixed at validator registration and
shouldn't change mid-epoch). Idempotent re-install via a
base64-serialized cache key on the last installed map.
Sources the system inner directly rather than plumbing
`system_object_receiver` out of `SuiSyncer`: one extra RPC every 15s is
cheaper than the receiver-broadcast plumbing.
Wired in `monitor_reconfiguration` alongside the joiner-pubkey-provider
updater and the producer task; aborted on epoch end via the same
pattern as `end_of_publish_sender_handle`.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` (1 passed in 209.13s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires step 12's overlay into the chain-read path. The syncer's
`sync_dwallet_network_keys` task now applies
`fetch_network_key_data_with_off_chain_blobs` to every chain copy
before sending it on the watch channel, so consumers see locally-cached
DKG / reconfig output blobs (populated by step 9's producer cache)
instead of fetching them from chain on every re-read.
Plumbing:
- `SuiConnectorService` gains `network_key_blob_source:
Arc<ArcSwapOption<Box<dyn NetworkKeyBlobSource>>>` plus an
`install_network_key_blob_source` method.
- The handle is created (empty) at service construction and passed by
clone into the syncer task, where `sync_dwallet_network_keys` reads it
on each fetch tick.
- New adapter `EpochStoreBlobSource` wraps
`Weak<AuthorityPerEpochStore>` so the long-lived service can hold a
per-epoch reference; the weak upgrade returns `None` cleanly when the
epoch ends, which makes the overlay fall back to the chain blob via
`unwrap_or` on each field.
- `ika-node::monitor_reconfiguration` calls
`sui_connector_service.install_network_key_blob_source(...)` once per
epoch with a fresh `EpochStoreBlobSource` pointing at the new
`cur_epoch_store`. Each install atomically replaces the previous
epoch's source.
The lightweight metadata (id, current_epoch, dkg_at_epoch, state)
always comes from chain; only the two large output blobs may be
overlaid. When no source is installed, behavior is unchanged
byte-for-byte.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` (1 passed in 202.94s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
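The weak-reference fallback in the commit above (overlay when the epoch store is alive and has a cached blob, chain blob otherwise) can be illustrated with the sketch below. `EpochStore`, its `cached_dkg_output` field, and the `overlay` function are simplified stand-ins assumed for this example; only the `Weak` upgrade plus `unwrap_or` shape is taken from the description.

```rust
use std::sync::{Arc, Weak};

/// Hypothetical, heavily reduced stand-in for the per-epoch store.
struct EpochStore {
    cached_dkg_output: Option<Vec<u8>>,
}

/// Adapter a long-lived service can hold without keeping the epoch alive.
struct EpochStoreBlobSource {
    store: Weak<EpochStore>,
}

impl EpochStoreBlobSource {
    fn new(store: &Arc<EpochStore>) -> Self {
        Self { store: Arc::downgrade(store) }
    }

    /// Returns `None` once the epoch store has been dropped.
    fn dkg_output(&self) -> Option<Vec<u8>> {
        self.store.upgrade()?.cached_dkg_output.clone()
    }
}

/// Prefers the locally cached blob, falls back to the chain copy.
fn overlay(chain_blob: Vec<u8>, source: Option<&EpochStoreBlobSource>) -> Vec<u8> {
    source.and_then(|s| s.dkg_output()).unwrap_or(chain_blob)
}
```

With no source installed, `overlay` returns the chain blob untouched, which matches the "unchanged when no source is installed" property stated above.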
Wires step 13's pure assembler
(`assemble_committee_class_groups_off_chain`) into the next-committee
construction path. When the off-chain set covers every committee
member, the resulting class-groups public-keys-and-proofs map comes
straight from validators' own `mpc_data` announcements + the perpetual
blob store instead of refetching from chain. `Incomplete` paths
transparently fall through to the existing
`get_mpc_data_from_validators_pool` read.
New abstractions in `validator_metadata`:
- `OffChainCommitteeClassGroupsSource` trait: single method
`try_assemble_class_groups(&[AuthorityName]) ->
OffChainClassGroupsAssembly`.
- `EpochStoreClassGroupsSource` adapter holds
`Weak<AuthorityPerEpochStore>` (for the per-authority announcement
digest lookup) + `Arc<AuthorityPerpetualTables>` (for the digest→bytes
blob lookup), and delegates to the pure assembler. Returns `Incomplete`
cleanly when the weak upgrade fails (epoch ended).
Plumbing:
- `SuiConnectorService` gains a second
`Arc<ArcSwapOption<Box<dyn OffChainCommitteeClassGroupsSource>>>`
handle with a matching `install_class_groups_source` setter.
- The handle is passed by clone into `SuiSyncer::run` and on to
`sync_next_committee` → `new_committee`, where the off-chain attempt
happens before the chain read.
- `ika-node::monitor_reconfiguration` installs a fresh
`EpochStoreClassGroupsSource` once per epoch right next to the
blob-source install. Each install atomically replaces the previous
epoch's source.
Strict-gate rationale preserved: `new_committee` only short-circuits to
the off-chain map on `Complete`. Any missing authority (joiner whose
announcement hasn't been verified yet, blob not yet replicated, decode
failure) falls through to chain, which is the only safe option since
the load-bearing rule says reconfig MPC silently drops validators with
no class-groups entry.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` (1 passed in 265.04s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
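The strict gate from the commit above (use the off-chain map only when it is `Complete`, otherwise read from chain) can be sketched as follows. The `Assembly` enum, the `u8` authority names, and the `read_chain` closure are assumptions for illustration; the real trait method and result type are the ones named in the commit message.

```rust
use std::collections::BTreeMap;

/// Hypothetical one-byte stand-in for an authority name.
type AuthorityName = u8;
type ClassGroupsMap = BTreeMap<AuthorityName, Vec<u8>>;

#[derive(Debug, PartialEq)]
enum Assembly {
    Complete(ClassGroupsMap),
    Incomplete { missing: Vec<AuthorityName> },
}

/// Pure assembler: succeeds only if every member has an off-chain entry.
fn try_assemble(members: &[AuthorityName], off_chain: &ClassGroupsMap) -> Assembly {
    let missing: Vec<AuthorityName> = members
        .iter()
        .copied()
        .filter(|m| !off_chain.contains_key(m))
        .collect();
    if !missing.is_empty() {
        return Assembly::Incomplete { missing };
    }
    Assembly::Complete(members.iter().map(|m| (*m, off_chain[m].clone())).collect())
}

/// Strict gate: any gap falls through to the chain read, so no member
/// is silently dropped from the resulting map.
fn class_groups_for(
    members: &[AuthorityName],
    off_chain: &ClassGroupsMap,
    read_chain: impl FnOnce() -> ClassGroupsMap,
) -> ClassGroupsMap {
    match try_assemble(members, off_chain) {
        Assembly::Complete(map) => map,
        Assembly::Incomplete { .. } => read_chain(),
    }
}
```

The chain read is passed as a closure so it only runs on the fallback path.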
Wires the consumer side of step 5. The Anemo
`SubmitMpcDataAnnouncement` handler had been returning
`Rejected{"relay not installed"}` for every joiner submission;
this commit installs a concrete relay per epoch so the RPC
actually forwards joiner announcements into consensus.
The relay (`ConsensusBackedAnnouncementRelay` in
`sui_connector::announcement_relay`) runs three steps:
1. Cheap envelope checks — refuses unless
`announcement.epoch == next_epoch`, since current-epoch
announcements come from members who can submit themselves
directly.
2. Joiner verify via the pure
`validator_metadata::verify_joiner_announcement` against the
per-epoch store's installed `JoinerPubkeyProvider` (populated
by the joiner-provider syncer from step 6). Rejection here
stops a malicious peer from using us as a spam pipe.
3. Wraps in `ConsensusTransaction::new_validator_mpc_data_announcement`
and submits via the consensus adapter.
Plumbing:
- `P2pComponents` gains a `mpc_announcement_relay` field
(`Arc<AnnouncementRelayHandle>`) so the long-lived handle the
Anemo server already holds is also reachable from
`monitor_reconfiguration`.
- `IkaNode` stashes the same handle so the per-epoch install
loop can swap relays without re-touching the network layer.
- New `AuthorityPerEpochStore::joiner_pubkey_provider()` getter
exposes the installed provider for the relay's verify step
(mirrors the existing install/clear pair).
Install point: alongside the other per-epoch installs in
`monitor_reconfiguration`. Each epoch's relay holds
`Weak<AuthorityPerEpochStore>` so it naturally fails closed when
the epoch ends (returns "epoch ended" until the new epoch's
relay replaces it).
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 247.16s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
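The three relay steps listed in the commit above can be sketched roughly like this. Everything here is a simplified assumption: the `Announcement` struct, the `signature_valid` flag (standing in for the BLS verification result), the rejection strings, and the `submit` closure (standing in for the consensus adapter).

```rust
/// Hypothetical, simplified announcement. The real type carries a BLS
/// signature; `signature_valid` stands in for its verification result.
struct Announcement {
    epoch: u64,
    signer: u8,
    signature_valid: bool,
}

#[derive(Debug, PartialEq)]
enum RelayOutcome {
    Accepted,
    Rejected(&'static str),
}

fn relay(
    announcement: &Announcement,
    next_epoch: u64,
    joiners: Option<&[u8]>,
    submit: impl FnOnce(&Announcement) -> bool,
) -> RelayOutcome {
    // 1. Cheap envelope check: only next-epoch announcements are relayed.
    if announcement.epoch != next_epoch {
        return RelayOutcome::Rejected("not a next-epoch announcement");
    }
    // 2. Joiner verify against the installed provider, if any.
    let Some(joiners) = joiners else {
        return RelayOutcome::Rejected("joiner provider not installed");
    };
    if !joiners.contains(&announcement.signer) || !announcement.signature_valid {
        return RelayOutcome::Rejected("joiner verification failed");
    }
    // 3. Forward into consensus.
    if submit(announcement) {
        RelayOutcome::Accepted
    } else {
        RelayOutcome::Rejected("consensus submit failed")
    }
}
```

Ordering the checks from cheapest to most expensive means a spamming peer is rejected before any consensus submission is attempted.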
Reorganizes the four files that have no Sui RPC dependency and
shouldn't have been under `sui_connector/`. They all just hold a
`Weak<AuthorityPerEpochStore>` + an `Arc<dyn SubmitToConsensus>` and
run as per-epoch background tasks that emit `ConsensusTransaction`s;
that's a different responsibility from `sui_connector/` (which talks to
Sui RPC).
Moved (identical bytes):
- `sui_connector/end_of_publish_sender.rs` →
`epoch_tasks/end_of_publish_sender.rs`
- `sui_connector/mpc_data_announcement_sender.rs` →
`epoch_tasks/mpc_data_announcement_sender.rs`
- `sui_connector/joiner_pubkey_provider_updater.rs` →
`epoch_tasks/joiner_pubkey_provider_updater.rs`
- `sui_connector/announcement_relay.rs` →
`epoch_tasks/announcement_relay.rs`
Kept in `sui_connector/`:
- `consensus_pubkey_provider_updater.rs`: actually calls
`sui_client.get_system_inner()` + `get_validators_info_by_ids`, so it
belongs with the Sui-side updaters.
The four moved files use only `crate::` paths internally so no import
edits inside them; the only external rename is in
`ika-node/src/lib.rs` (s/sui_connector/epoch_tasks/ on four call
sites).
Module layout follows the CLAUDE.md `xxx.rs` convention: new
`crates/ika-core/src/epoch_tasks.rs` declares the four submodules,
files live in `epoch_tasks/`. No `mod.rs`.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` (1 passed in 144.80s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three structural changes so the handoff loop is generic and not phrased
as a validator-metadata feature:
1) Types extracted to `ika-types::handoff`. `HandoffItemKey`,
`HandoffAttestation`, `HandoffSignatureMessage`, and
`CertifiedHandoffAttestation` move out of `validator_metadata.rs`.
`validator_metadata.rs` keeps only the four validator-specific types
(`ValidatorMpcDataAnnouncement`, `SignedValidatorMpcDataAnnouncement`,
`EpochMpcDataReadySignal`, `NetworkKeyDKGReadySignal`). Cross-crate
import sites updated.
2) `HandoffSignatureSender` extracted from `EndOfPublishSender`. The
latter shrinks back to "submit EndOfPublish on the local trigger" and
nothing else. The new sender lives in
`epoch_tasks/handoff_signature_sender.rs` and runs on the same
`end_of_publish_receiver` independently. ika-node spawns both
side-by-side and aborts both on epoch end.
3) `HandoffItemsBuilder` trait + concrete
`MpcDataHandoffItemsBuilder`. Item contributors plug in via the trait;
`AuthorityPerEpochStore::build_local_handoff_attestation` now takes
`&[Arc<dyn HandoffItemsBuilder>]` and folds each contribution into the
attestation. Today only the MPC-data builder is registered (via
`default_handoff_items_builders`); new features (NOA, sui-state
pinning, etc.) can append their own builder without touching the
producer or aggregator.
`HandoffItemKey` stays a typed enum for now; moving to opaque byte keys
was the fourth level I called out and explicitly deferred. Adding a new
item kind still requires a variant bump, which is the right trade-off
while the variant count is small.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` (1 passed in 295.42s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
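The builder-folding idea from the commit above might be sketched as below. The trait method name `build_items`, the two-variant `ItemKey` enum, and `FixedBuilder` are hypothetical; the sketch only shows contributions from several builders being merged into one key-sorted list, rejecting duplicate keys so the result is strictly ascending.

```rust
use std::sync::Arc;

/// Hypothetical reduced key enum; derived `Ord` follows variant order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ItemKey {
    NetworkDkgOutput(u32),
    ValidatorMpcData(u32),
}

type Item = (ItemKey, [u8; 32]);

/// Assumed shape of a contributor; the real trait may differ.
trait HandoffItemsBuilder {
    fn build_items(&self) -> Vec<Item>;
}

/// Test-style builder that returns a fixed list of items.
struct FixedBuilder(Vec<Item>);

impl HandoffItemsBuilder for FixedBuilder {
    fn build_items(&self) -> Vec<Item> {
        self.0.clone()
    }
}

/// Folds every builder's items into one list sorted by key. Returns
/// `None` if two builders contribute the same key.
fn build_local_attestation(builders: &[Arc<dyn HandoffItemsBuilder>]) -> Option<Vec<Item>> {
    let mut items: Vec<Item> = builders.iter().flat_map(|b| b.build_items()).collect();
    items.sort_by_key(|(key, _)| *key);
    let strictly_ascending = items.windows(2).all(|w| w[0].0 < w[1].0);
    strictly_ascending.then_some(items)
}
```

A new feature would add another `HandoffItemsBuilder` implementation to the slice without changing the folding function.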
The module name `validator_metadata` was misleading — it bundled
three orthogonal P2P endpoints that have nothing to do with
"validator metadata" in the dictionary sense. Rename to
`mpc_artifacts` and split into purpose-named submodules:
- `mpc_artifacts/blob_store.rs` — content-addressed `mpc_data`
blob storage (`MpcDataBlobStorage`, `InMemoryBlobStore`,
`mpc_data_blob_hash`, `GetMpcDataBlobRequest`, `MpcDataBlob`,
`fetch_blob`).
- `mpc_artifacts/announcement_relay.rs` — joiner announcement
forwarding (`AnnouncementRelay`, `AnnouncementRelayHandle`,
`SubmitMpcDataAnnouncement{Request,Response}`,
`submit_announcement_to_peer`,
`submit_announcement_to_committee`).
- `mpc_artifacts/handoff_cert.rs` — handoff cert retrieval
(`HandoffCertStorage`, `GetCertifiedHandoffAttestationRequest`,
`fetch_certified_handoff_attestation`).
- `mpc_artifacts/server.rs` — Anemo `ValidatorMetadata` impl,
unchanged behavior (moved + import paths fixed).
- `mpc_artifacts.rs` — top-level module: `mod generated`,
submodule declarations, re-exports of every public surface so
external callers still write `ika_network::mpc_artifacts::X`
without caring which submodule X lives in, and the public
`build_server` constructor.
Anemo service wire name stays `ValidatorMetadata` (and the
codegen include stays `ika.ValidatorMetadata.rs`) — the
rename is internal-only, no protocol break. Tests for each
submodule moved next to their code (blob_store + relay tests).
External rename: `ika_network::validator_metadata` →
`ika_network::mpc_artifacts` across ika-core, ika-node, ika-types
inline paths, and ika-network's own build.rs request_type /
response_type paths.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 265.88s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a single `off_chain_validator_metadata` feature flag and bumps
`MAX_PROTOCOL_VERSION` from 4 to 5; the flag flips on at v5. All
off-chain pipeline hooks now check this flag and fall back to legacy
chain-only behavior when false. The Sui-style protocol-version advance
means every validator switches together at the exact consensus round
the network advances to v5: no mixed-version freeze-quorum stalls, no
asymmetric blob caches, no divergent handoff attestations.
Six gates, all failing closed to legacy:
1. Producer tasks self-exit on `run()` when the flag is false:
`MpcDataAnnouncementSender`, `HandoffSignatureSender`,
`JoinerPubkeyProviderUpdater`, `ConsensusPubkeyProviderUpdater`. Each
reads
`epoch_store.protocol_config().off_chain_validator_metadata_enabled()`
once at task start.
2. ika-node `monitor_reconfiguration` reads the flag once per epoch and
skips spawning the four tasks, the relay install, and the two
`SuiConnectorService` source installs
(`install_network_key_blob_source`, `install_class_groups_source`) when
off, which saves the spawn churn even though the tasks self-gate.
`EndOfPublishSender` stays unconditional since it's core-protocol.
3. Consumer record paths bail early when the flag is false (defensive,
so a stray new-kind `ConsensusTransaction` from a peer can't allocate
state): `record_validator_mpc_data_announcement`,
`record_epoch_mpc_data_ready_signal`,
`record_network_key_dkg_ready_signal`, `record_handoff_signature`.
4. Step-14 kickoff gate `off_chain_gate_passes` evaluates to `true`
(legacy behavior) when the flag is off. Otherwise gates on
`is_mpc_data_frozen()`. New trait method
`off_chain_validator_metadata_enabled` on
`AuthorityPerEpochStoreTrait` so the gate site can reach the flag
through the trait object. `TestingAuthorityPerEpochStore` returns
`true` to preserve existing integration-test behavior.
5. Step-9 producer cache hook in
`DWalletMPCService::new_dwallet_mpc_output` skips when the flag is off,
which leaves the digest tables empty so the syncer overlay path
naturally falls through to chain-only reads.
6. Syncer overlays (`sync_dwallet_network_keys`, `new_committee`) don't
need explicit flag checks: when the flag is off, ika-node skips
`install_*_source`, the source handles stay None inside
`SuiConnectorService`, and the existing source-handle checks fall
through to chain.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` (1 passed in 313.64s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
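Gate 4 in the commit above reduces to a small boolean rule, sketched here. The free-function form and the parameter names are assumptions; in the PR the flag is reached through a trait method on the epoch store.

```rust
/// Sketch of the step-14 kickoff gate: with the flag off the gate is a
/// pass-through (legacy behavior); with it on, kickoff waits for the
/// mpc_data freeze.
fn off_chain_gate_passes(flag_enabled: bool, mpc_data_frozen: bool) -> bool {
    !flag_enabled || mpc_data_frozen
}
```

The only blocking combination is "flag on, not yet frozen", which is the fail-closed-to-legacy property the six gates share.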
Brings PR #1 (cleanup, ika-benchmark removal), PR #2 (bootstrap
library), PR #3 (ika-test-cluster), and the Inkrypto
cryptography-private bump (post-PR-#1707
`ValidatorEncryptionKeysAndProofs` shape: class-groups + per-curve PVSS
HPKE).
Merge resolutions:
* `authority_per_epoch_store.rs`: take origin/dev's tuple key
`DBMap<(SessionIdentifier, u16), AssignedPresign>` for
`assigned_presigns_schnorrkel_substrate` (PR #1707 fix) AND keep the
seven off-chain metadata fields from this branch.
* `pnpm-lock.yaml`: keep upstream `sdk/signature-mpc-wasm/pkg: {}`
entry; the stale stashed `sdk/ows/...` entries are already removed.
* `protocol-config/lib.rs`: keep `MAX_PROTOCOL_VERSION = 4`. Merge
`network_encryption_key_version = Some(3)` and
`reconfiguration_message_version = Some(3)` into the v4 arm so the
Inkrypto crypto activates at the current MAX. The v5 arm
(`noa_checkpoints = true`) is commented out as a forward-looking
reference. Rewrote the version-history comment with one line per
version. User's manual `internal_presign_sessions = false` at v4
preserved.
* Off-chain pipeline PVSS extension: the Inkrypto bump expanded
`Committee::new` with three new PVSS HashMaps (secp256k1, secp256r1,
ristretto). Extended `OffChainCommitteeClassGroupsSource` to assemble
all four maps from the same blob bytes via the shape-tolerant
`decode_validator_encryption_keys`. Validators publishing under
mainnet-v1.1.8 shape contribute only class-groups; post-PR-#1707
validators contribute the full bundle, matching chain-fallback
semantics in `sui_syncer::new_committee`.
* Test-only `Committee::new` call sites in `validator_metadata.rs`:
pass three empty PVSS maps to satisfy the new 8-arg signature.
* Protocol-config snapshots regenerated for v3/v4 (off-chain flag
flipped on at v4, crypto-v3 active at v4) plus v5 snap files kept on
disk as forward-looking reference for the commented v5 arm.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` passes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Make `request_add_validator_candidate`, `request_add_validator`, and `stake_ika` `pub` in `ika-swarm-config::sui_client` so the upcoming `IkaTestCluster` joiner helper can reuse the battle-tested PTB builders rather than duplicating them. No behavior change — same functions, broader visibility. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d of on rayon
msim's `Handle::spawn` re-resolves the CURRENT simulated node at spawn
time, so a rayon-side completion send whose originating node was torn
down mid-compute (an epoch swap in the simulation) panics at
`NodeHandle::current().unwrap()` and rayon-core aborts the whole
process; the smoke test died ~36 minutes in. Crypto is sequential under
msim anyway (the `parallel` feature is dropped in that profile), so
under `cfg(msim)` compute inline in the calling task and await the
channel send there: the send then dies cleanly WITH the node, dropping
the now-moot result. The non-msim path is unchanged.
Verified: `MSIM_DISABLE_WATCHDOG=1 cargo simtest -p ika-test-cluster
test_swarm_reaches_epoch_2` now PASSES (3h27m, single-threaded msim,
the documented trade-off), where it previously aborted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…stantiation
The instrumented-localnet decomposition showed the dominant CI stall:
instantiate_agreed_keys_from_voted_data awaited the network-key
instantiation inline, which is minutes-scale on weak hardware (262-321s
per key on the CI runner pod, where four concurrent instantiations
contend), freezing every session on the validator at each epoch
boundary and backing up completed computations behind the frozen loop
(23,880s of accumulated pickup lag in one run, more than all compute
combined).
Split it: the per-round step only SPAWNS due instantiations on the
rayon pool (tracked per key id, no duplicate spawns); completions are
polled once per service ITERATION via
poll_pending_network_key_instantiations. The poll deliberately does not
live inside the per-round drain: with no new consensus rounds arriving,
a completed key would otherwise never install (the integration-test
harness shape, and a live-validator hazard during round stalls).
Integration tests that asserted installation after a single post-vote
iteration now converge via
run_service_loops_until_network_key_installed (bounded by the
computation-wait budget).
Validated: network_dkg (3), network_dkg_bwd_compat (3), and
create_dwallet_test pass locally in release.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
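The spawn-then-poll split described in the commit above can be illustrated with the sketch below. It swaps in `std::thread` and an `mpsc` channel for the rayon pool, and the `Instantiator` type, its field names, and the doubled-id "result" are placeholders assumed for this example.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical tracker for background key instantiations.
struct Instantiator {
    in_flight: HashSet<u32>,
    installed: HashMap<u32, u64>,
    tx: Sender<(u32, u64)>,
    rx: Receiver<(u32, u64)>,
}

impl Instantiator {
    fn new() -> Self {
        let (tx, rx) = channel();
        Self { in_flight: HashSet::new(), installed: HashMap::new(), tx, rx }
    }

    /// Per-round step: only spawns, never blocks. Returns false when the
    /// key is already installed or already being computed.
    fn spawn_if_due(&mut self, key_id: u32) -> bool {
        if self.installed.contains_key(&key_id) || !self.in_flight.insert(key_id) {
            return false;
        }
        let tx = self.tx.clone();
        thread::spawn(move || {
            let value = u64::from(key_id) * 2; // placeholder for the expensive work
            let _ = tx.send((key_id, value)); // receiver may be gone; ignore
        });
        true
    }

    /// Per-iteration step: installs whatever has finished, without
    /// waiting. Runs even when no new rounds arrive.
    fn poll(&mut self) -> usize {
        let mut completed = 0;
        while let Ok((key_id, value)) = self.rx.try_recv() {
            self.in_flight.remove(&key_id);
            self.installed.insert(key_id, value);
            completed += 1;
        }
        completed
    }
}
```

Because `poll` is independent of round arrival, a key that finishes during a round stall is still installed on the next loop iteration.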
Two facts make parallel clusters viable: nextest's process-per-test isolates the publish flow's process-global set_current_dir (parallel `cargo test` threads race on cwd and corrupt each other's contract publishes), and each test cluster only consumes ~2-4 effective CPUs (serial-chain class-groups crypto), so the suite is latency-bound and parallel clusters mostly interleave waiting. Captured per-test output replaces the multi-GB interleaved --nocapture log; --no-fail-fast keeps one wedged cluster from hiding the rest of the suite. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tiation
The instantiation measures 5.5x slower per thread on the ika-k8s-large
runner than on a workstation while targeted bignum microbenches run at
parity, so either one sub-operation is pathological on that platform or
the build executes different work. Timing each of the nine sub-calls
(per-curve protocol + decryption-share parameters, NOA DKG) localizes
the gap to a concrete operation.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…evel The integration-test tracing subscriber (fmt().try_init()) caps at INFO and ignores RUST_LOG, so debug-level timings never emit there; nine info lines per instantiation is acceptable noise. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Lets the same single-test measurement run on ubuntu-latest as a platform A/B against ika-k8s-large: a hosted x86 VM at workstation pace implicates the k8s pod environment; one that is equally slow implicates x86 codegen of the class-groups hot loops. Concurrency keyed by runner so the A/B runs in parallel with the self-hosted measurement. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The class-groups parameter precomputation profile is uniform across all eight per-curve sub-calls and scales NEGATIVELY with concurrency on the runner (4-way concurrent instantiations: ~0.73x aggregate of a single one) — the signature of allocation-churn serialization, and the ika binaries set no #[global_allocator], so Linux runs glibc malloc. Preloading jemalloc isolates the allocator's share of both the 5.5x per-thread gap and the concurrency collapse. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…announcement subsystems specs/ holds protocol-level behavioral contracts (actors, messages, decision rules, invariants, failure modes) written to be readable without the code open. Seeded with the two subsystems this PR introduces: the off-chain validator mpc-data pipeline (announcements, ready signals, freeze, assembly) and the cross-epoch handoff (attestation, EndOfPublish V2, certificate, joiner bootstrap, prepare-then-start barrier). CLAUDE.md instructs updating the relevant spec in the same PR as any behavior change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Separates the build from the measured run so the counters cover the test, not rustc. user-vs-real CPU time splits "burning more cycles" from "threads blocked"; perf instructions-vs-cycles (where the pod permits perf_event_open) splits "executing more work" from "same work at lower IPC". Mac reference for the single network-DKG test: 5.30T instructions, 1.32T cycles (IPC 4.0), 373s CPU / 162s wall, 6.2M minor faults, 3.3GB peak RSS. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ness root cause
RUST_BACKTRACE=1 (set workflow-wide for panic diagnostics) makes
std::backtrace::Backtrace::capture() perform a full DWARF unwind behind
a process-global mutex. class-groups constructs backtrace-carrying
errors EAGERLY on the success path of every group operation
(ok_or(Error::from(..)) at ~15-20 sites per nucomp/nudupl), i.e.
millions of captures per network-key instantiation. Measured on the
same single test: 2020s CPU on the runner vs 373s on a workstation
whose shell leaves RUST_BACKTRACE unset (the capture is then a ~0.17us
stub), with NEGATIVE 4-way scaling from the global lock convoy (23x
sys-time inflation). Reproduced in both directions: exporting
RUST_BACKTRACE=1 locally collapses the same test.
RUST_LIB_BACKTRACE=0 keeps panic backtraces while disabling library
captures. The durable fix belongs upstream in cryptography-private
(ok_or_else / a non-capturing error type on the hot paths) so operator
environments that export RUST_BACKTRACE=1 don't silently pay this.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
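The eager-versus-lazy distinction behind the commit above is a general property of `Option::ok_or` and `Option::ok_or_else`: the argument to `ok_or` is evaluated before the call, even when the option is `Some`. The sketch below uses a hypothetical `CostlyError` whose constructor bumps a counter in place of capturing a backtrace.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how often the "expensive" error constructor runs.
static CAPTURES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical error type; the counter increment stands in for a
/// backtrace capture.
#[derive(Debug)]
struct CostlyError;

impl CostlyError {
    fn capture() -> Self {
        CAPTURES.fetch_add(1, Ordering::Relaxed);
        CostlyError
    }
}

/// Eager: the error is constructed on every call, success included.
fn eager(value: Option<u64>) -> Result<u64, CostlyError> {
    value.ok_or(CostlyError::capture())
}

/// Lazy: the error is constructed only when `value` is `None`.
fn lazy(value: Option<u64>) -> Result<u64, CostlyError> {
    value.ok_or_else(CostlyError::capture)
}
```

On a hot path that almost always succeeds, the lazy form pays the construction cost only on the rare failure.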
The Sui and ika swarms pick "available" ports by probing and bind them later; with nextest running each test in its own process, two concurrently-booting clusters can probe the same free port and the loser panics with EADDRINUSE at node start (seen as test_validator_removed_at_epoch_2 dying 0.25s into the first 8-way run). A fixed-port listener serves as a dependency-free cross-process mutex over the boot window (held until every node listener is bound); the OS releases it whenever the holder exits, panics included, so a dead test can't wedge the suite. The long test bodies run unlocked and fully parallel. Not compiled under msim (simulated per-node ports). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
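A fixed-port listener used as a cross-process mutex, as the commit above describes, could be sketched like this. The `BootLock` type, the `acquire_boot_lock` function, the loopback address, and the 50ms retry interval are assumptions; the sketch relies on a second bind to an already-listening port failing with `AddrInUse`.

```rust
use std::io;
use std::net::{Ipv4Addr, TcpListener};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// The lock is held for as long as the listener stays bound. The OS
/// releases the port when the process exits, panics included.
struct BootLock {
    _listener: TcpListener,
}

/// Retries binding the fixed port until it succeeds or `timeout`
/// elapses. Any error other than "address in use" is returned at once.
fn acquire_boot_lock(port: u16, timeout: Duration) -> io::Result<BootLock> {
    let deadline = Instant::now() + timeout;
    loop {
        match TcpListener::bind((Ipv4Addr::LOCALHOST, port)) {
            Ok(listener) => return Ok(BootLock { _listener: listener }),
            Err(e) if e.kind() == io::ErrorKind::AddrInUse && Instant::now() < deadline => {
                sleep(Duration::from_millis(50));
            }
            Err(e) => return Err(e),
        }
    }
}
```

A test would hold the returned guard across the boot window and drop it once every node listener is bound, leaving the long test body unlocked.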
…entifiers
Internal-presign session identifiers baked in the consensus round at
which each validator happened to instantiate them. Network-key
installation now completes asynchronously (at a wall-clock-dependent
moment per validator), so validators instantiate the same logical
session while processing DIFFERENT rounds — each derived a private
session id, every session had one participant, quorum never formed, the
presign pool stayed permanently empty, and user sessions wedged the
epoch-advance gate ("epoch 2 was blocked"; global-presign object never
created). Observed directly: 36 distinct session ids for 9 logical
presigns across 4 localnet validators.
Drop the round from the identifier preimage — uniqueness comes from
(epoch, sequence number, session type) — and make the sequence-number
assignment walk deterministic orders: sorted network-key ids (was
HashMap order) and sorted curves/algorithms in
supported_curve_to_signature_algorithms (was nested HashMap order,
which only ever aligned because localnets/tests run all validators in
ONE process sharing one lazy_static instance; real multi-process
committees would have diverged on day one).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
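The determinism fix in the commit above has two parts: a preimage built only from (epoch, sequence number, session type), and sequence numbers assigned by walking sorted keys. The sketch below illustrates both under assumed types (`u32` key ids, `u8` curve tags, little-endian encoding); the real identifier derivation and hashing are not shown.

```rust
use std::collections::HashMap;

/// Preimage without the consensus round. Field widths and byte order
/// are assumptions for illustration.
fn session_preimage(epoch: u64, sequence_number: u64, session_type: u8) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(17);
    bytes.extend_from_slice(&epoch.to_le_bytes());
    bytes.extend_from_slice(&sequence_number.to_le_bytes());
    bytes.push(session_type);
    bytes
}

/// Assigns sequence numbers by walking network-key ids and curves in
/// sorted order, so the result does not depend on HashMap iteration
/// order or on insertion order.
fn assign_sequence_numbers(keys: &HashMap<u32, Vec<u8>>) -> Vec<((u32, u8), u64)> {
    let mut key_ids: Vec<&u32> = keys.keys().collect();
    key_ids.sort();
    let mut assigned = Vec::new();
    let mut next = 0u64;
    for id in key_ids {
        let mut curves = keys[id].clone();
        curves.sort();
        for curve in curves {
            assigned.push(((*id, curve), next));
            next += 1;
        }
    }
    assigned
}
```

Two validators that hold the same logical inputs then derive the same sequence numbers and preimages regardless of which round they happen to be processing.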
…a-v2
# Conflicts:
# crates/ika-core/src/dwallet_mpc/mpc_manager.rs
84f47bd to
6634936
…per consensus round
adopt_cert_verified_keys and the instantiation spawn lived inside the
per-consensus-round drain, so they only ran when fresh rounds arrived.
That deadlocks the key-arrives-after-request bootstrap: a validator
that receives the network key AFTER a session request can never adopt
it, because no validator can emit a consensus round WITHOUT the key,
and without a round there is no adoption. Production masks it because
rounds flow continuously; the missing_network_key integration test
reproduced it deterministically (zero adoptions across the whole run,
every party's request parked forever).
Move both to once-per-iteration in run_service_loop_iteration: their
inputs (the overlay watch and the persisted handoff cert) do not depend
on round content, the adoption pass early-returns in O(1) when neither
input changed, and the round-free internal-presign session identifiers
removed the only determinism coupling to adoption position.
Also switch the test's tracing init to telemetry-subscribers' env-aware
config: the plain fmt() subscriber caps at INFO and silently ignores
RUST_LOG, which had been hiding this exact failure from debug tracing.
network_key_received_after_start_event now passes (was a deterministic
wedge that survived 10x iteration budgets); regression set green:
network_dkg (3), create_dwallet, internal_presign (3).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
tikv-jemallocator as the global allocator (jemalloc default feature) in all five binaries: ika-node, ika-validator, ika-fullnode, ika-notifier, and the ika CLI (which hosts whole localnet swarms). Better fragmentation behavior than glibc malloc for long-running RocksDB-heavy processes, and arch-independent. The ika-node Dockerfile previously attempted jemalloc via LD_PRELOAD, but the assignment lived inside a RUN layer and never persisted — production containers were silently running glibc malloc. The broken block and the libjemalloc-dev package are removed; the allocator now ships inside the binary. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
docs/off-chain-metadata-v2-review.md and PR-1721-action-plan.md were working documents of the branch review (explicitly in-progress, verdicts pinned to intermediate commit hashes); their durable content graduated into specs/. The content survives in git history and the PR discussion. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… to measured values, propagate the keepers
Removals (experiment debris from the solved slowness investigation):
the allocator/jemalloc A/B input and LD_PRELOAD harness, the
runner/ubuntu-latest A/B input with its concurrency-group keying, the
perf-stat wrapper (perf_event_open is blocked on the pods) and
linux-tools-generic, the target-cpu=native build flags and rustflags
inputs (experimentally disproven), and every comment attributing
slowness to weak hardware / codegen / memory bandwidth.
Retunes to post-fix measured values: cluster test_threads default 8→4
(8-way OOM-killed the 96Gi runner pod; 13/13 in ~35min at 4), cluster
timeout 420→150, TS suite timeout 360→180 (full suite ~60min, readiness
~10min), TS readiness cap 40→20 minutes.
Propagation of the keepers: RUST_LIB_BACKTRACE=0 beside
RUST_BACKTRACE=1 in publish-typescript-sdk.yml; the IPv4+retry apt
pattern to ci.yaml and simtest.yaml; run_attempt-suffixed log artifacts
so reruns don't collide.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…omments, vestiges
- Rename instantiate_agreed_keys_from_voted_data →
instantiate_adopted_network_keys: the NetworkKeyData consensus vote it
was named for was removed this PR; keys are adopted from the local
overlay gated by the prior epoch's handoff cert. Test comments
narrating the deleted vote/tally/quorum flow rewritten to the actual
adopt → spawn → poll flow, and the stale "Consensus-voted network key"
log message fixed.
- Drop the always-empty accumulated_new_key_ids return from
process_consensus_rounds_from_storage (instantiation completions
surface via the per-iteration poll); stop_on_epoch_end! simplified.
- Comments measured during the eager-backtrace incident lose their
platform-tier quantifications ("minutes-scale on weak hardware",
"platform-specific slowdown", "CI runners are slower"); the
load-bearing rationales stay.
- missing_network_key's tracing init switched to
telemetry_subscribers::init_for_testing() — the TelemetryConfig::init()
variant panics when another in-process test already set the global
subscriber (the single failure in the 45-test CI run).
- Self-contained msim comments at the remaining capture/re-enter rayon
sites (the orchestrator.rs cross-references were orphaned by its
inline-under-msim change); CLAUDE.md's simtest section updated to
prefer inline-under-msim for new code, and gains a section on running
the heavy suites via the dispatchable CI workflows instead of locally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Process artifact of the branch review, same treatment as the review walkthrough doc; the durable content lives in specs/. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ain subsystems From a 39-finding verified audit (3 angles: info-spam, lifecycle log coverage, metrics coverage; every finding adversarially checked against the code). Log spam eliminated (production runs at info): the permanent 5s overlay-incomplete warn on healthy fullnodes/notifiers (~17k lines/day) is debug with a validators-only 60th-tick escalation; the pre-freeze assembly-retry warn AND its per-tick error! double-log are debug with a 30th-tick escalation; per-tick "assembled/sent committee" infos dedup on content change; barrier/cert-read/re-submit/byzantine-fetch warns are throttled or once-per-state; the ready-signal deadline warn moved behind the re-emit gate. Lifecycle coverage added at info/warn: handoff CERT FORMED (was fully silent), cert-digest mismatch skips in adoption (the security gate was a bare `continue`), ready-signal and EndOfPublish quorum anchors, joiner announcement relay + acceptance, local attestation install, NOA/presign starvation (throttled), buffered-signature drains, boot-lock contention. Metrics (26 new, registered on the existing registries, bounded labels, re-seeded at epoch-store open so restarts don't false-alarm): mpc_data freeze epoch + excluded count, ready signals/stake/validated peers, announcements received, handoff cert epoch + signatures collected/stake/buffered/rejected, internal presign pool size + global presign queue depth + served counter, network-key instantiation in-flight/failures/per-sub-call duration histogram (DKG and reconfiguration paths), blob store size/evictions vs the 512MiB cap, P2P blob fetch outcomes, joiner bootstrap outcomes; the barrier duration histogram re-bucketed to its real 1s-30min range. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…_network_key telemetry_subscribers::init_for_testing() still panics when a sibling test's fmt().try_init() installed the global subscriber first (its Lazy guard only protects against double-init of itself) — the single failure in two consecutive 44/45 CI runs at --test-threads=4. Use fmt().with_env_filter(RUST_LOG or info).try_init(): honors RUST_LOG when this test wins the init race (the debugging need), silently defers when it loses (the parallel-suite need). Validated under the exact contention shape: 4 in-process tests at 2 threads, all green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e lock target
The epoch-close wedge (cascading TS-suite timeouts when sessions land astride lock_last_active_session_sequence_number) was an on-chain counter OVERSHOOT, reproduced and confirmed against chain state:
- Move's all_current_epoch_sessions_completed is a strict equality (completed_sessions_count == locked target) and complete_user_session has no lock check, so completing any user session beyond the frozen target wedges the epoch permanently: the counter never decreases.
- The global-presign serving path popped presigns from the internal pool and completed them on-chain with no lock check (unlike MPC user sessions, which gate computation on the synced target). Reproduction: target frozen at 0, 97 sessions completed anyway, end-of-publish predicate false forever, epoch unhealably stuck.
- Admission rejections had the same hole: a quorum'd Rejected counts as completed on-chain, and a malformed user request rejected after the lock would overshoot identically.
Gate both at consensus SUBMISSION, not serving/checkpoint build: checkpoint contents must be a deterministic function of consensus, and the lock view is wall-clock fullnode state, so gating at build would fork checkpoints. Gating what each validator votes for is sound: the chain target is monotone within an epoch and frozen by the lock, so quorum agreement implies an honest validator observed the target covering the request, and an agreed request can never overshoot.
- get_unsent_presign_requests holds back votes for requests beyond the locally-synced target; they retry every round as the target advances and re-enter next epoch via the uncompleted-events re-pull, exactly like lock-gated MPC user sessions.
- Admission rejections defer through pending_rejected_sessions with the same gate. Computation-failure rejections need no gate: computation only runs once the target covers the session.
Regression tests: a beyond-target global presign (with a stocked pool that would otherwise serve it) and a beyond-target admission rejection must not produce checkpoint messages until the target covers them.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
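The submission gate described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `LockView`, `PendingCompletion`, `partition_by_lock`, and `epoch_can_close` are hypothetical names, and treating "covered" as `sequence number <= synced target` is an assumption about the sequence-number semantics.

```rust
/// Hypothetical local view of the chain's session lock target.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct LockView {
    /// Highest user-session sequence number the synced target covers (assumed semantics).
    pub synced_target: u64,
}

/// A pending user-session completion, e.g. a global presign or an admission rejection.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct PendingCompletion {
    pub session_sequence_number: u64,
}

/// Split pending completions into (safe to vote for now, held back for a later round).
/// Held-back items are not dropped; the caller retries them as the target advances.
pub fn partition_by_lock(
    pending: Vec<PendingCompletion>,
    lock: LockView,
) -> (Vec<PendingCompletion>, Vec<PendingCompletion>) {
    pending
        .into_iter()
        .partition(|p| p.session_sequence_number <= lock.synced_target)
}

/// Mirror of the strict-equality close predicate: because the completed counter
/// never decreases, any overshoot past the locked target is permanent.
pub fn epoch_can_close(completed_sessions_count: u64, locked_target: u64) -> bool {
    completed_sessions_count == locked_target
}
```

With the reproduction numbers quoted above (target frozen at 0, 97 sessions completed), `epoch_can_close(97, 0)` is false and stays false, which is why the gate has to sit before the vote rather than after it.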
…ale result Second epoch-close wedge found by the same reproduction rig (undershoot direction this time): handle_computation_results_and_submit_to_consensus iterated the batch of completed computation results with `return` (not `continue`) in its missing-session and non-active-session guards. One result for a session that went non-active while its computation was in flight — routine under load, e.g. it completed via the peers' output quorum — silently dropped EVERY other session's round messages and outputs in the same batch on that validator. Reproduced live: at an epoch boundary burst the guard fired on two validators within the same window, dropping their round-two messages for all nine internal presign sessions; with two of four parties' messages gone the threshold became unreachable forever, the internal pool never refilled, the locked-set global presigns could not be served, and the epoch wedged with completed_sessions_count below the locked target. Diagnostic signature: total MPC silence while consensus and presign votes still flow, orchestrator drained (started == completed), identical stuck session sets on every validator, and the "received a computation update for a non-active session" warn at the stall onset. Pre-existing since the guard was introduced (2025-10-29, #1588); the batch-abandoning returns become per-item skips. Also set the harness lock target in missing_network_key: the harness never syncs it from a chain, and that test's flow completes via a rejection response, which is now (correctly) held back until the target covers the session. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
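The `return` versus `continue` difference at the heart of this fix can be shown in a few lines. The sketch below is illustrative only: `collect_live_messages` and its argument types are hypothetical stand-ins, not the signature of handle_computation_results_and_submit_to_consensus.

```rust
use std::collections::{HashMap, HashSet};

/// Collect round messages for sessions that are still active.
/// `results` maps a (hypothetical) session id to its computed round message.
pub fn collect_live_messages(
    results: &HashMap<u64, Vec<u8>>,
    active_sessions: &HashSet<u64>,
) -> Vec<(u64, Vec<u8>)> {
    let mut out = Vec::new();
    for (session_id, message) in results {
        if !active_sessions.contains(session_id) {
            // Per-item skip: only this stale result is dropped.
            // A `return` here would abandon every remaining entry in the batch.
            continue;
        }
        out.push((*session_id, message.clone()));
    }
    // Sort so the output does not depend on HashMap iteration order.
    out.sort_by_key(|(id, _)| *id);
    out
}
```

Because the guard is evaluated per entry, a stale result for one session cannot affect the messages of any other session in the same batch.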
- computation_results_batch_survives_stale_entries (integration): feeds handle_computation_results_and_submit_to_consensus a batch mixing six results for live internal presign sessions with six results for missing sessions, and requires every live session's round message to reach consensus. Under the old batch-abandoning `return`, HashMap iteration order drops at least one live message unless all six real entries happen to come first (probability < 0.2%). Widens the handler to pub(crate) for the direct-call test. - test_global_presigns_complete_across_epoch_switches (cluster): streams global presigns across two epoch boundaries — the traffic shape that reproduced both wedges (overshoot via the previously ungated pool-serving path, undershoot via batch-dropped round messages) — then requires epochs to keep advancing and every submitted user session to drain to completed on-chain. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
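The "probability < 0.2%" bound quoted for the integration test can be checked under a simplifying assumption: that the 12 batch entries (six live, six stale) are visited in a uniformly random order. HashMap iteration order is unspecified rather than guaranteed uniform, so this is an estimate, not an exact figure.

$$
P(\text{all six live results precede every stale one}) = \frac{6!\,6!}{12!} = \frac{1}{\binom{12}{6}} = \frac{1}{924} \approx 0.108\%
$$

That is below the stated 0.2%, so under this assumption the old batch-abandoning `return` would fail the test in more than 99.8% of runs.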
…BACKTRACE guard cryptography-private #575 (lazy error construction on class-groups arithmetic hot paths) merged as the only commit between the old pin and the new one, so the bump carries exactly that fix. With eager Backtrace::capture() gone upstream, the RUST_LIB_BACKTRACE=0 workaround in the workflows is no longer needed; library-error backtraces return to CI logs. Validated by A/B on the crypto-heavy network-DKG instantiation path under the exact CI env (RUST_BACKTRACE=1): 161.6s without the guard vs 162.3s with it — identical within noise, on the path that previously ran 5x slower. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The earlier sweep (cdd1757) raised retryUntil's default budget to ~10 minutes precisely because 30-attempt (~60s) caps surfaced as spurious "Condition not met after 30 attempts" failures, but 16 call sites across helpers.ts, make-public-share-and-sign, imported-key-make-public-share- and-sign, and all-combinations-future-sign still passed explicit 30/2000 or 30/1000 overrides. One of them just failed a TS suite run (the ECDSASecp256r1 make-public case at 60s) while every surrounding test passed — a session that crosses an epoch boundary legitimately needs minutes, more so now that beyond-the-lock sessions correctly wait for the next epoch. Use the helper defaults everywhere. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Measured on a full Test Cluster run: 57% of the log (2,415 of 4,211 lines) was the cargo Downloaded/Compiling stream inside the test step. Build now happens in its own step — collapsed in the UI when green — so the test step carries only nextest/test progress and failure replays, plus --cargo-quiet on the nextest run for the residual cargo lines. Same split for integration-tests-ci (where it also lets `time` keep covering only test execution). Failure replays deliberately stay inline: when the runner pod dies (OOM/eviction) the artifact upload never happens and the live log is the only surviving evidence. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The lock-gating fixes introduced a decision rule (gate every user-session consensus submission on the synced lock target, never checkpoint content) and rest on a cross-epoch invariant (the strict-equality close predicate makes overshoot permanently unrecoverable). Per the specs maintenance rule, write them down: the frozen target's mechanics, why the equality cuts both ways, the quorum-safety argument for submission gating, the per-completion-path rule table, and the batch-processing rule for computation results. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
Re-lands the off-chain validator-metadata pipeline on top of dev (full clean-slate rebuild from the original branch), adds the EndOfPublishV2 consensus message that bundles the handoff signature with the EndOfPublish vote, and closes the final propagation gap (P2P blob fetch) so chain reads are eliminated for mpc_data / network-key / reconfig blobs in v4.
Gated entirely by `off_chain_validator_metadata` (and the related `network_encryption_key_version` / `reconfiguration_message_version` bumps); v3 stays on the original chain-driven flow.
Pipeline (all v4-gated)
- `MpcDataAnnouncementSender` broadcasts each validator's mpc_data digest via consensus + `EpochMpcDataReadySignal` per epoch + `NetworkKeyDKGReadySignal` per key. Validator's own blob is mirrored into both the perpetual store and the in-memory cache backing the local Anemo `GetMpcDataBlob` server.
- … `validator_mpc_data_announcements` table; `EpochMpcDataReadySignal` quorum freezes the snapshot into `frozen_validator_mpc_data_input_set` (idempotent on first quorum).
- `PeerBlobFetcher` task pulls peer mpc_data blobs over Anemo, hash-verifies, writes to perpetual + in-memory cache so this validator can also serve other peers without restart.
- … `CertifiedHandoffAttestation`s; Anemo endpoint for joiners to bootstrap from a cert.
- `JoinerPubkeyProvider` (next-epoch committee) + `SubmitMpcDataAnnouncement` relay RPC.
- … `get_network_encryption_key_with_full_data_by_epoch`, `get_mpc_data_from_validators_pool`) in v4: overlay fills bytes from local producer cache + P2P fetches.
EndOfPublishV2
New consensus message variant `EndOfPublishV2 { authority, handoff_signature }` that bundles the validator's signed handoff attestation with its EndOfPublish vote at exactly the consensus point EOP fires. Eliminates the V1 race where a separate `HandoffSignature` consensus tx could arrive at peers out of order with EndOfPublish and produce divergent aggregator state across the committee. Producer emits V2 instead of V1 + separate handoff when `off_chain_validator_metadata` is on; consumer side splits the bundled message back into its two parts and routes each through the existing v1 paths.
Protocol version
Bumps `MAX_PROTOCOL_VERSION` to 4. v4 activates: `internal_presign_sessions`, `off_chain_validator_metadata`, `consensus_skip_gced_blocks_in_direct_finalization`, `network_encryption_key_version = 3`, `reconfiguration_message_version = 3`.
Test plan
- `cargo test --release -p ika-core test_network_dkg_full_flow` passes (181s).
- `off_chain_metadata_v4_does_not_read_blobs_from_chain` passes (256s): chain blob reads delta = 0 across an epoch transition under v4.
- … `handoff_signature.signer == authority`) is the right semantics.
- `PeerBlobFetcher` polling cadence (2s) and the gating (skip self, skip already-cached) are appropriate.
Review-hardening update (multi-pass byzantine/restart review)
A multi-agent critical review (byzantine-validator, honest-mistake, restart, and doc-clarity passes) surfaced correctness gaps that are now fixed. All work is v4-gated; v3 paths are untouched at runtime.
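The first fix listed under Fixes, the write-through/read-through `BlobCache`, can be sketched in miniature. This is a hedged illustration under stated assumptions: `DurableStore` is a hypothetical in-memory stand-in for the perpetual table, `BlobCacheSketch` is not the PR's type, and whether the real read path repopulates the in-memory cache is not stated in the description, so the sketch does not do so.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

pub type BlobHash = [u8; 32];

/// Hypothetical stand-in for the perpetual (durable) table.
#[derive(Default)]
pub struct DurableStore {
    rows: RwLock<HashMap<BlobHash, Vec<u8>>>,
}

impl DurableStore {
    pub fn put(&self, hash: BlobHash, bytes: &[u8]) {
        self.rows.write().unwrap().insert(hash, bytes.to_vec());
    }

    pub fn get(&self, hash: &BlobHash) -> Option<Vec<u8>> {
        self.rows.read().unwrap().get(hash).cloned()
    }
}

/// One owner for both tiers, so no call site can forget to mirror a write.
#[derive(Default)]
pub struct BlobCacheSketch {
    pub durable: DurableStore,
    pub memory: RwLock<HashMap<BlobHash, Vec<u8>>>,
}

impl BlobCacheSketch {
    /// Write-through: persist durably first, then populate the in-memory cache.
    pub fn insert(&self, hash: BlobHash, bytes: Vec<u8>) {
        self.durable.put(hash, &bytes);
        self.memory.write().unwrap().insert(hash, bytes);
    }

    /// Read-through: try the cache, then fall back to the durable store.
    pub fn get(&self, hash: &BlobHash) -> Option<Vec<u8>> {
        // The read guard is released at the end of this statement.
        let cached = self.memory.read().unwrap().get(hash).cloned();
        if cached.is_some() {
            return cached;
        }
        self.durable.get(hash)
    }
}
```

The fallback in `get` is the part that matters for the bug described: a blob that exists only in the durable tier is still returned, without waiting for a restart to hydrate the cache.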
Fixes
- Write-through/read-through `BlobCache`: the perpetual store and the in-memory cache backing the Anemo server were synced by convention at each call site; a forgotten mirror left a durably-stored blob (e.g. a DKG/reconfiguration output) unservable until restart. `BlobCache` owns both behind one `insert` (durable-then-cache) and one read-through `get` (cache, then perpetual fallback). The fallback is the structural fix: a perpetual-only blob is now servable without a restart.
- No BLS in the off-chain pipeline; explicit self/relayed announcement kinds: the single `ValidatorMpcDataAnnouncement` consensus kind (with its implicit sender≠signer exemption and BLS payload signature) is split into `ValidatorMpcDataAnnouncement` (self-submission, no signature; consensus block-author authenticates, enforced by `sender == validator`) and `RelayedValidatorMpcDataAnnouncement` (carries the joiner's Ed25519 consensus-key signature, verified against its next-epoch consensus pubkey). `JoinerPubkeyProvider` now returns the Ed25519 pubkey, fetched from next-epoch validator info via the Sui client. `epoch` returns to the announcement body so the joiner signature binds it (no cross-epoch replay).
- Joiner announcement fan-out + P2P retry (was unwired): `submit_announcement_to_committee` existed but nothing called it, so a joiner never announced and could never enter the next epoch's working set. New `JoinerAnnouncementSender` signs (Ed25519) and fans out to current-committee peers with retry until `f+1` distinct peers accept (one honest relayer guaranteed) or a bounded budget is exhausted; wired into a node-startup watcher that fires when the node observes itself in `V_{e+1}` but not the current committee.
- Freeze delayed until joiners are attestable (closes the joiner-coverage gap): the freeze fired on the first quorum of ready signals, which validators emitted early (before `V_{e+1}` is published mid-epoch), so joiners were never in the frozen set, the next committee's class-groups map, or the handoff cert. Validators now withhold their ready signal until `V_{e+1}` is published and all its members are locally validated (or an epoch-clock deadline as a liveness backstop). This is a producer emit-gate change: the deterministic single-freeze mechanism is preserved; the deadline is wall-clock and only affects when each validator emits, while the freeze snapshot is still computed deterministically at the consensus-ordered quorum point.
- Producer self-heals via confirmation retries: the producer marked itself done on submit handoff, not confirmation (`submit_to_consensus` returns Ok as soon as the background submit task is spawned, which can still fail to sequence at the epoch boundary or on crash). It now caches an idempotent announcement (same key on re-send → consensus dedup) and re-submits until its own entry appears in the table.
- Empty-input off-chain assembly + frozen-set-as-truth: `assemble_committee_class_groups_off_chain` no longer returns `Complete` with empty maps; `try_assemble_class_groups` treats the frozen set as the post-freeze source of truth so a never-announcer can't stall assembly (the delayed freeze ensures the frozen set includes joiners).
- Smaller hardening: sentinel `timestamp_ms == 0` rejected at sign + record; self-attestation gated on own-blob health; cert duplicate-signer / quorum-boundary tests; doc sweep fixing stale "chain fallback" / "NetworkKeyDKGReadySignal freezes" / plan-phase references.
Tests
- … `blob_cache`, `validator_metadata`, `epoch_tasks`) + 9 ika-types; clippy clean across ika-core/types/network/node.
- … `off_chain_metadata` ✅ (271s), `joiner::test_joiner_added_at_epoch_2` ✅ (262s).
- … `network_dkg` (incl. bwd-compat) 7/7 ✅ (1776s).
- … `cd42e9c015` (Cargo Test Check, Format, all Move formatters pass).
In-PR completion: cert-bootstrap consumer + F4-1 cycle-break (#4, #6)
Two items previously slated as follow-ups, now done in-PR.
#4: Joiner cert-bootstrap consumer wired in
The handoff-cert Anemo endpoint had no consumer. `JoinerBootstrapVerifier` (`epoch_tasks/joiner_bootstrap_verifier.rs`) now fetches the prior-epoch `CertifiedHandoffAttestation` from a peer (`HandoffCertSource` trait, `P2pHandoffCertSource` impl) and verifies it against the prior committee via `verify_joiner_bootstrap_cert(cert, expected_prior_epoch, prior_committee, prior_consensus_pubkeys, expected_next_committee_pubkeys)`, epoch-bound so a stale cert can't be replayed. Wired into `monitor_reconfiguration`, gated on off-chain metadata + `epoch >= 1` + self absent from the prior committee (i.e. an actual joiner). Non-halting: a failed fetch/verify logs `error!` and retries on a bounded budget rather than stalling startup. 4 unit tests cover fetch / retry / verify-loop.
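The structural part of an epoch-bound cert check (epoch binding, distinct known signers, quorum stake) might look like the sketch below. It is an illustration under assumptions, not `verify_joiner_bootstrap_cert` itself: `CertSketch`, `CertCheckError`, and `check_cert_shape` are hypothetical names, the Ed25519 signature verification is deliberately left out, rejecting unknown signers is an assumed policy, and the quorum threshold is passed in rather than derived from the committee.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative stand-in for a certified handoff attestation.
/// Signature bytes are omitted; only the fields this sketch checks are kept.
pub struct CertSketch {
    pub epoch: u64,
    pub signers: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CertCheckError {
    WrongEpoch { expected: u64, got: u64 },
    UnknownSigner(String),
    DuplicateSigner(String),
    BelowQuorum { stake: u64, required: u64 },
}

/// Returns the total signing stake if the cert is bound to the expected epoch,
/// every signer is a distinct member of the prior committee, and quorum is met.
pub fn check_cert_shape(
    cert: &CertSketch,
    expected_prior_epoch: u64,
    prior_committee_stake: &HashMap<String, u64>,
    quorum_stake: u64,
) -> Result<u64, CertCheckError> {
    // Epoch binding: a cert from any other epoch is rejected outright.
    if cert.epoch != expected_prior_epoch {
        return Err(CertCheckError::WrongEpoch {
            expected: expected_prior_epoch,
            got: cert.epoch,
        });
    }
    let mut seen: HashSet<&str> = HashSet::new();
    let mut stake = 0u64;
    for signer in &cert.signers {
        let weight = prior_committee_stake
            .get(signer)
            .ok_or_else(|| CertCheckError::UnknownSigner(signer.clone()))?;
        // Each signer may contribute stake only once.
        if !seen.insert(signer.as_str()) {
            return Err(CertCheckError::DuplicateSigner(signer.clone()));
        }
        stake += *weight;
    }
    if stake < quorum_stake {
        return Err(CertCheckError::BelowQuorum { stake, required: quorum_stake });
    }
    Ok(stake)
}
```

A caller in the non-halting style described above would wrap this in a bounded retry loop and log on failure instead of propagating the error into startup.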
F4-1 cycle-break (the blocker the dedicated test caught)
The dedicated test surfaced a real deadlock: the joiner watcher and the producer's freeze emit-gate both keyed off the assembled next-epoch committee, but the assembly can't complete until the joiner announces, and the joiner can't learn it's a joiner until the assembly publishes. Fixed by publishing a lightweight chain view of the next-epoch committee (members + stake, empty class-groups) over a new `chain_next_epoch_committee_receiver` as soon as Sui selects it (`sui_syncer::sync_next_committee`, before off-chain assembly). The joiner watcher and the freeze emit-gate consume this chain view; the joiner now demonstrably fans its mpc_data out (confirmed in logs; it never did before). Pubkey-provider poll tightened 15s→5s and joiner fan-out retry 10s→3s so the relay path can complete inside the freeze window.
#6: Dedicated F4-1 cluster test (added, passing)
`test_joiner_lands_in_next_committee_class_groups` asserts a mid-epoch joiner lands in the next committee's `class_groups_public_keys_and_proofs`. It did its job: driving it to green surfaced three distinct, interacting defects in off-chain joiner integration, all now fixed (commit `5a241701d1`). It passes end-to-end (322s; epoch-1→2 freeze is `frozen=5 excluded=0`):
- Ready-signal canonicalization stripped joiners (root cause). Incoming `EpochMpcDataReadySignal`s were canonicalized against the current committee (`drop weight==0`). A next-epoch joiner has current-committee weight 0, so it was filtered out of the recorded signal even though every validator correctly attested it (emit-time `vcount=5`). The freeze partition, which decides next-epoch membership, then saw zero attestations for the joiner and excluded it (`frozen=4 excluded=1`). Fix: treat announcers as valid attestation targets in canonicalization; a joiner that announced has a signed announcement consensus-ordered before any ready signal attesting it, so this is padding-safe.
- The joiner's blob had no path to the committee. The relay forwarded only the digest, and the peer blob fetcher pulls only from current-committee peers (excluding the joiner), so no validator could obtain the joiner's blob to validate it. Fix: the joiner pushes its blob on the fan-out RPC; the relayer verifies (hash + structural decode) and caches it write-through, and the committee resolves it via the existing content-addressed P2P fetch (joiner→committee direction, no dial-back).
- Poll cadences too coarse for the freeze window. The integration path must finish inside `[epoch/2, 3·epoch/4]`. The fixed multi-second cadences (10s chain-committee sync, 5s pubkey refresh, 3s fan-out, 2s blob fetch/producer) overrun that window in a short test epoch. New `epoch_scaled_poll_interval` scales each to ~1% of the epoch, clamped to the production default, which makes it a no-op at production epoch lengths.
A fail-fast timeout was added on the joiner's epoch-2 wait so an excluded joiner fails the test with a clear message instead of hanging.
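The scaling rule described for `epoch_scaled_poll_interval` (about 1% of the epoch, never above the production default) is small enough to sketch directly. The function below is a hypothetical reconstruction from that one-line description; the real signature and any lower bound it applies are not given in the PR text.

```rust
use std::time::Duration;

/// Scale a poll interval to roughly 1% of the epoch, capped at the production default.
/// Assumption: no lower floor is applied; the PR description does not mention one.
pub fn epoch_scaled_poll_interval_sketch(
    epoch_duration: Duration,
    production_default: Duration,
) -> Duration {
    (epoch_duration / 100).min(production_default)
}
```

For a 200-second test epoch and a 10-second production default this yields 2 seconds; for a 24-hour epoch, 1% is 864 seconds, so the cap applies and the production default is returned unchanged.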
Stale-share sign-failure fix: prepare-then-start handoff gate (+ epoch-advance robustness)
Heavy-load + slow-network testing surfaced two issues, now fixed (v4-gated where applicable; v3 runtime paths untouched).
Root cause — stale network-key shares after reconfiguration
Per-party adoption-time tracing showed the off-chain reconfiguration output is adopted at staggered times across validators (60–130s lags under load). The new epoch's MPC service started immediately at the reconfigure seam while the handoff data was still arriving asynchronously, so a validator entering epoch N could begin signing while still holding epoch N-1's `(t,n)` decryption-key sharing; combined with peers who already adopted epoch N's sharing, the threshold sign round failed with `FailedToAdvanceMPC(InvalidParameters)` (~50–130s after each advance).
Fixes
- prepare-then-start barrier (`ika-node::monitor_reconfiguration`): before starting the new epoch's MPC components, install the new epoch's network-key blob-source overlay first (else the barrier deadlocks: `network_keys_receiver` is fed from whichever overlay is installed, and the per-iteration install runs only in the next loop iteration), then block until (a) the cross-epoch handoff cert anchoring the new epoch is present and re-verified against the signing committee (a second verification at consume-time on top of the one done before it is persisted, fail-closed on a tampered local cert DB) and (b) every tracked network key has surfaced its reconfiguration output for the new epoch (`current_epoch >= next_epoch`, non-empty output). Blocks indefinitely (safety-first; a stalled validator that is visibly not signing beats one signing with the wrong shares). Only a validator in the new epoch prepares. Clear logs (INFO entry/exit, WARN every ~10s with the breakdown) + metrics `ika_handoff_prepare_{waiting,retries_total,duration_seconds}`.
- advance_epoch session-completion gate (`sui_executor`): under load `sessions_manager::advance_epoch` MoveAborts with `ENotAllCurrentEpochSessionsCompleted` (code 6) when a network-key system session starts after the quorum's `received_end_of_publish` snapshot; the notifier retried for an hour then `panic!`d the validator → quorum loss → cascade. Now re-checks `SessionsManager::all_current_epoch_sessions_completed()` (new, unit-tested) against just-synced state and holds the tick rather than submitting a doomed tx; the hour-long panic only guards genuinely fatal submission errors.
- Integration-test poll timeouts: raised the SDK `#pollUntilCondition` default timeout (30s → 10m) and `retryUntil` default attempts (30 → 600), and removed the now-redundant per-call `{ timeout: 600000 }` overrides so every poll-site uses one default. MPC rounds legitimately take minutes under load; the short defaults were turning slowness into spurious `Timeout waiting for…` failures (never crypto failures).
- Pre-existing test-compile fix (`new_validator_mpc_data_announcement` missing `blob` arg) that broke the ika-core test build.
Tests
- … `SessionsManager::all_current_epoch_sessions_completed` + `all_network_keys_ready_for_epoch` truth tables.
- … `off_chain_metadata` ✅, `protocol_version_transition` (v3→v4) ✅: barrier engages, verifies, and clears at every transition; no `InvalidParameters`, no `WaitingForNetworkKey` wedge, no panic.
- … `all-combinations-future-sign` 21/21 (previously produced 3 `InvalidParameters` sign failures), plus dwallet-creation 8/8, make-public-share-and-sign 7/7, global-presign 5/5, imported-key 6/6, dwallet-sign-during-dkg 10/10: zero sign failures across every run, reaching epoch 8+ under the heavy suite that previously crashed at epoch 7. Every remaining intermittent test failure was a polling timeout under CPU starvation, not a crypto failure (addressed by the timeout-default change above).
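The two conditions of the prepare-then-start barrier can be written as a pure predicate, which is also the shape a truth-table test would exercise. This is a sketch under assumptions: `NetworkKeyStatus`, `all_network_keys_ready_for_epoch_sketch`, and `may_start_epoch_sketch` are hypothetical, and treating an empty key set as vacuously ready is a choice made here, not something the PR description specifies.

```rust
/// Hypothetical per-key view: the epoch its latest reconfiguration output targets,
/// and the output bytes themselves.
pub struct NetworkKeyStatus {
    pub current_epoch: u64,
    pub reconfiguration_output: Vec<u8>,
}

/// Condition (b): every tracked key has a non-empty output for the new epoch.
/// Assumption: with no tracked keys this is vacuously true.
pub fn all_network_keys_ready_for_epoch_sketch(keys: &[NetworkKeyStatus], next_epoch: u64) -> bool {
    keys.iter()
        .all(|k| k.current_epoch >= next_epoch && !k.reconfiguration_output.is_empty())
}

/// The barrier clears only when the handoff cert is present and re-verified (a)
/// and every tracked key is ready (b). The caller keeps polling until this is true.
pub fn may_start_epoch_sketch(
    handoff_cert_verified: bool,
    keys: &[NetworkKeyStatus],
    next_epoch: u64,
) -> bool {
    handoff_cert_verified && all_network_keys_ready_for_epoch_sketch(keys, next_epoch)
}
```

Keeping the predicate free of I/O means the blocking loop around it only has to re-read state and re-evaluate, and a key still carrying the previous epoch's output keeps the barrier closed regardless of the cert.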