Off-chain validator metadata + EndOfPublishV2#1721

Open
omersadika wants to merge 206 commits into dev from feat/off-chain-metadata-v2

Conversation

@omersadika omersadika commented May 25, 2026

Contributor

Summary

Re-lands the off-chain validator-metadata pipeline on top of dev (full clean-slate rebuild from the original branch), adds the EndOfPublishV2 consensus message that bundles the handoff signature with the EndOfPublish vote, and closes the final propagation gap (P2P blob fetch) so chain reads are eliminated for mpc_data / network-key / reconfig blobs in v4.

Gated entirely by off_chain_validator_metadata (and the related network_encryption_key_version / reconfiguration_message_version bumps); v3 stays on the original chain-driven flow.

Pipeline (all v4-gated)

  • Producer side: MpcDataAnnouncementSender broadcasts each validator's mpc_data digest via consensus + EpochMpcDataReadySignal per epoch + NetworkKeyDKGReadySignal per key. Validator's own blob is mirrored into both the perpetual store and the in-memory cache backing the local Anemo GetMpcDataBlob server.
  • Consensus side: per-epoch validator_mpc_data_announcements table; EpochMpcDataReadySignal quorum freezes the snapshot into frozen_validator_mpc_data_input_set (idempotent on first quorum).
  • P2P side: new PeerBlobFetcher task pulls peer mpc_data blobs over Anemo, hash-verifies, writes to perpetual + in-memory cache so this validator can also serve other peers without restart.
  • Handoff path: pure attestation/signature/aggregate helpers; per-epoch record + per-validator perpetual CertifiedHandoffAttestations; Anemo endpoint for joiners to bootstrap from a cert.
  • Joiner path: late-binding JoinerPubkeyProvider (next-epoch committee) + SubmitMpcDataAnnouncement relay RPC.
  • Sui syncer: prefers off-chain class-groups assembly; skips chain blob reads (get_network_encryption_key_with_full_data_by_epoch, get_mpc_data_from_validators_pool) in v4 — overlay fills bytes from local producer cache + P2P fetches.

EndOfPublishV2

New consensus message variant EndOfPublishV2 { authority, handoff_signature } that bundles the validator's signed handoff attestation with its EndOfPublish vote at exactly the consensus point EOP fires. Eliminates the V1 race where a separate HandoffSignature consensus tx could arrive at peers out of order with EndOfPublish and produce divergent aggregator state across the committee. Producer emits V2 instead of V1 + separate handoff when off_chain_validator_metadata is on; consumer side splits the bundled message back into its two parts and routes each through the existing v1 paths.
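
The consumer-side split and the wire-author check can be sketched roughly as follows. This is a minimal illustration with simplified stand-in types, not the PR's code: only `EndOfPublishV2`, `authority`, `handoff_signature`, and the `signer == authority` rule come from the description above; `split_end_of_publish_v2`, `SplitEndOfPublish`, and `SplitError` are hypothetical names.

```rust
/// Stand-in for the validator identity (assumption: the real type is richer).
type Authority = [u8; 32];

/// Simplified stand-in for the signed handoff message carried inside V2.
#[derive(Debug, Clone, PartialEq, Eq)]
struct HandoffSignatureMessage {
    signer: Authority,
    signature: Vec<u8>,
}

/// Simplified stand-in for the bundled consensus message.
#[derive(Debug, Clone, PartialEq, Eq)]
struct EndOfPublishV2 {
    authority: Authority,
    handoff_signature: HandoffSignatureMessage,
}

/// Hypothetical container for the two V1-shaped parts of a V2 message.
#[derive(Debug, PartialEq, Eq)]
struct SplitEndOfPublish {
    end_of_publish_vote: Authority,
    handoff_signature: HandoffSignatureMessage,
}

#[derive(Debug, PartialEq, Eq)]
enum SplitError {
    /// The bundled signature was not produced by the voting authority.
    SignerMismatch,
}

/// Splits a V2 message into its vote and its handoff signature, rejecting
/// messages whose handoff signer differs from the voting authority.
fn split_end_of_publish_v2(msg: EndOfPublishV2) -> Result<SplitEndOfPublish, SplitError> {
    if msg.handoff_signature.signer != msg.authority {
        return Err(SplitError::SignerMismatch);
    }
    Ok(SplitEndOfPublish {
        end_of_publish_vote: msg.authority,
        handoff_signature: msg.handoff_signature,
    })
}
```

Because both parts travel in one consensus transaction, every validator observes them at the same consensus position, which is the property the V1 pair of transactions could not guarantee.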

Protocol version

Bumps MAX_PROTOCOL_VERSION to 4. v4 activates: internal_presign_sessions, off_chain_validator_metadata, consensus_skip_gced_blocks_in_direct_finalization, network_encryption_key_version = 3, reconfiguration_message_version = 3.

Test plan

  • Acceptance gate: cargo test --release -p ika-core test_network_dkg_full_flow passes (181s).
  • V2 unit tests: BCS round-trip + V1/V2 key distinctness pass.
  • Protocol-config snapshot tests pass.
  • 10-epoch churn cluster test passes (~21 min; 14 aggregate certs across the cluster, all 10 epoch transitions advance within timeout).
  • New cluster test off_chain_metadata_v4_does_not_read_blobs_from_chain passes (256s) — chain blob reads delta = 0 across an epoch transition under v4.
  • Existing joiner/remove/multi-epoch user-session cluster tests pass.
  • Reviewer: please double-check the EndOfPublishV2 producer/consumer gating, and confirm that the wire-author check (handoff_signature.signer == authority) has the right semantics.
  • Reviewer: confirm the PeerBlobFetcher polling cadence (2s) and the gating (skip self, skip already-cached) are appropriate.

Review-hardening update (multi-pass byzantine/restart review)

A multi-agent critical review (byzantine-validator, honest-mistake, restart, and doc-clarity passes) surfaced correctness gaps that are now fixed. All work is v4-gated; v3 paths are untouched at runtime.

Fixes

  1. Write-through/read-through BlobCache — the perpetual store and the in-memory cache backing the Anemo server were synced by convention at each call site; a forgotten mirror left a durably-stored blob (e.g. a DKG/reconfiguration output) unservable until restart. BlobCache owns both behind one insert (durable-then-cache) and one read-through get (cache, then perpetual fallback). The fallback is the structural fix: a perpetual-only blob is now servable without a restart.

  2. No BLS in the off-chain pipeline; explicit self/relayed announcement kinds — the single ValidatorMpcDataAnnouncement consensus kind (with its implicit sender≠signer exemption and BLS payload signature) is split into ValidatorMpcDataAnnouncement (self-submission, no signature — consensus block-author authenticates, enforced by sender == validator) and RelayedValidatorMpcDataAnnouncement (carries the joiner's Ed25519 consensus-key signature, verified against its next-epoch consensus pubkey). JoinerPubkeyProvider now returns the Ed25519 pubkey, fetched from next-epoch validator info via the Sui client. epoch returns to the announcement body so the joiner signature binds it (no cross-epoch replay).

  3. Joiner announcement fan-out + P2P retry (was unwired): submit_announcement_to_committee existed but nothing called it, so a joiner never announced and could never enter the next epoch's working set. The new JoinerAnnouncementSender signs the announcement (Ed25519) and fans it out to current-committee peers, retrying until f+1 distinct peers accept (which guarantees at least one honest relayer) or a bounded budget is exhausted. It is wired into a node-startup watcher that fires when the node observes itself in V_{e+1} but not in the current committee.

  4. Freeze delayed until joiners are attestable (closes the joiner-coverage gap) — the freeze fired on the first quorum of ready signals, which validators emitted early (before V_{e+1} is published mid-epoch), so joiners were never in the frozen set, the next committee's class-groups map, or the handoff cert. Validators now withhold their ready signal until V_{e+1} is published and all its members are locally validated (or an epoch-clock deadline as a liveness backstop). This is a producer emit-gate change — the deterministic single-freeze mechanism is preserved; the deadline is wall-clock and only affects when each validator emits, while the freeze snapshot is still computed deterministically at the consensus-ordered quorum point.

  5. Producer self-heals via confirmation retries — the producer marked itself done on submit handoff, not confirmation (submit_to_consensus returns Ok as soon as the background submit task is spawned, which can still fail to sequence at the epoch boundary or on crash). It now caches an idempotent announcement (same key on re-send → consensus dedup) and re-submits until its own entry appears in the table.

  6. Empty-input off-chain assembly + frozen-set-as-truth: assemble_committee_class_groups_off_chain no longer returns Complete with empty maps; try_assemble_class_groups treats the frozen set as the post-freeze source of truth, so a validator that never announced can't stall assembly (the delayed freeze ensures the frozen set includes joiners).

  7. Smaller hardening — sentinel timestamp_ms == 0 rejected at sign + record; self-attestation gated on own-blob health; cert duplicate-signer / quorum-boundary tests; doc sweep fixing stale "chain fallback" / "NetworkKeyDKGReadySignal freezes" / plan-phase references.
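
As an illustration of fix 1, a write-through/read-through cache might look roughly like the sketch below. It only assumes what the text states (durable-then-cache insert, cache-then-perpetual get). The `DurableBlobStore` trait, `FakeDurableStore`, and the `String` error type are hypothetical stand-ins for the perpetual tables and their error handling.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type BlobHash = [u8; 32];

/// Hypothetical durable-store interface (stand-in for the perpetual tables).
trait DurableBlobStore {
    fn put(&self, hash: BlobHash, bytes: Vec<u8>) -> Result<(), String>;
    fn get(&self, hash: &BlobHash) -> Result<Option<Vec<u8>>, String>;
}

/// In-memory fake, used only to exercise the sketch.
#[derive(Default)]
struct FakeDurableStore {
    inner: RwLock<HashMap<BlobHash, Vec<u8>>>,
}

impl DurableBlobStore for FakeDurableStore {
    fn put(&self, hash: BlobHash, bytes: Vec<u8>) -> Result<(), String> {
        self.inner.write().map_err(|e| e.to_string())?.insert(hash, bytes);
        Ok(())
    }
    fn get(&self, hash: &BlobHash) -> Result<Option<Vec<u8>>, String> {
        Ok(self.inner.read().map_err(|e| e.to_string())?.get(hash).cloned())
    }
}

/// Owns both layers so no call site has to remember to mirror a write.
struct BlobCache<D: DurableBlobStore> {
    durable: D,
    memory: RwLock<HashMap<BlobHash, Vec<u8>>>,
}

impl<D: DurableBlobStore> BlobCache<D> {
    fn new(durable: D) -> Self {
        Self { durable, memory: RwLock::new(HashMap::new()) }
    }

    /// Write-through: persist first, then populate the in-memory cache.
    fn insert(&self, hash: BlobHash, bytes: Vec<u8>) -> Result<(), String> {
        self.durable.put(hash, bytes.clone())?;
        self.memory.write().map_err(|e| e.to_string())?.insert(hash, bytes);
        Ok(())
    }

    /// Read-through: serve from memory, fall back to the durable store and
    /// backfill the cache so the blob is servable without a restart.
    fn get(&self, hash: &BlobHash) -> Result<Option<Vec<u8>>, String> {
        let cached = self.memory.read().map_err(|e| e.to_string())?.get(hash).cloned();
        if let Some(bytes) = cached {
            return Ok(Some(bytes));
        }
        match self.durable.get(hash)? {
            Some(bytes) => {
                self.memory.write().map_err(|e| e.to_string())?.insert(*hash, bytes.clone());
                Ok(Some(bytes))
            }
            None => Ok(None),
        }
    }
}
```

Persisting before caching means a crash between the two steps leaves a durable blob that the read-through path can still recover, rather than a cached blob that vanishes on restart.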

Tests

  • Unit: 75 ika-core (blob_cache, validator_metadata, epoch_tasks) + 9 ika-types; clippy clean across ika-core/types/network/node.
  • v4 cluster (exercise freeze-delay + joiner fan-out end-to-end): off_chain_metadata ✅ (271s), joiner::test_joiner_added_at_epoch_2 ✅ (262s).
  • v3 regression: network_dkg (incl. bwd-compat) 7/7 ✅ (1776s).
  • CI green at HEAD cd42e9c015 (Cargo Test Check, Format, all Move formatters pass).

In-PR completion: cert-bootstrap consumer + F4-1 cycle-break (#4, #6)

Two items previously slated as follow-ups, now done in-PR.

#4 — Joiner cert-bootstrap consumer wired in

The handoff-cert Anemo endpoint had no consumer. JoinerBootstrapVerifier
(epoch_tasks/joiner_bootstrap_verifier.rs) now fetches the prior-epoch
CertifiedHandoffAttestation from a peer (HandoffCertSource trait,
P2pHandoffCertSource impl) and verifies it against the prior committee
via verify_joiner_bootstrap_cert(cert, expected_prior_epoch, prior_committee, prior_consensus_pubkeys, expected_next_committee_pubkeys)
— epoch-bound so a stale cert can't be replayed. Wired into
monitor_reconfiguration, gated on off-chain metadata + epoch >= 1 +
self absent from the prior committee (i.e. an actual joiner). Non-halting:
a failed fetch/verify logs error! and retries on a bounded budget rather
than stalling startup. 4 unit tests cover fetch / retry / verify-loop.

F4-1 cycle-break (the blocker the dedicated test caught)

The dedicated test surfaced a real deadlock: the joiner watcher and the
producer's freeze emit-gate both keyed off the assembled next-epoch
committee — but the assembly can't complete until the joiner announces,
and the joiner can't learn it's a joiner until the assembly publishes.
Fixed by publishing a lightweight chain view of the next-epoch
committee (members + stake, empty class-groups) over a new
chain_next_epoch_committee_receiver as soon as Sui selects it
(sui_syncer::sync_next_committee, before off-chain assembly). The joiner
watcher and the freeze emit-gate consume this chain view; the joiner now
demonstrably fans its mpc_data out (confirmed in logs — it never did
before). Pubkey-provider poll tightened 15s→5s and joiner fan-out retry
10s→3s so the relay path can complete inside the freeze window.

#6 — Dedicated F4-1 cluster test (added, passing)

test_joiner_lands_in_next_committee_class_groups asserts a mid-epoch
joiner lands in the next committee's class_groups_public_keys_and_proofs.
It did its job — driving it to green surfaced three distinct, interacting
defects in off-chain joiner integration, all now fixed (commit
5a241701d1). It passes end-to-end (322s; epoch-1→2 freeze is
frozen=5 excluded=0):

  1. Ready-signal canonicalization stripped joiners (root cause). Incoming
    EpochMpcDataReadySignals were canonicalized against the current
    committee (drop weight==0). A next-epoch joiner has current-committee
    weight 0, so it was filtered out of the recorded signal even though
    every validator correctly attested it (emit-time vcount=5). The freeze
    partition — which decides next-epoch membership — then saw zero
    attestations for the joiner and excluded it (frozen=4 excluded=1). Fix:
    treat announcers as valid attestation targets in canonicalization; a
    joiner that announced has a signed announcement consensus-ordered before
    any ready signal attesting it, so this is padding-safe.

  2. The joiner's blob had no path to the committee. The relay forwarded
    only the digest, and the peer blob fetcher pulls only from
    current-committee peers (excluding the joiner) — so no validator could
    obtain the joiner's blob to validate it. Fix: the joiner pushes its blob
    on the fan-out RPC; the relayer verifies (hash + structural decode) and
    caches it write-through, and the committee resolves it via the existing
    content-addressed P2P fetch (joiner→committee direction, no dial-back).

  3. Poll cadences too coarse for the freeze window. The integration path
    must finish inside [epoch/2, 3·epoch/4]. The fixed multi-second cadences
    (10s chain-committee sync, 5s pubkey refresh, 3s fan-out, 2s blob
    fetch/producer) overrun that window in a short test epoch. New
    epoch_scaled_poll_interval scales each to ~1% of the epoch, clamped to
    the production default — a no-op at production epoch lengths.
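
The scaling rule in item 3 can be sketched as below. Only the name `epoch_scaled_poll_interval`, the ~1% factor, and the clamp to the production default come from the description; the two-argument signature is an assumption.

```rust
use std::time::Duration;

/// Scales a poll interval to roughly 1% of the epoch length, never exceeding
/// the production default (hypothetical signature).
fn epoch_scaled_poll_interval(epoch_duration: Duration, production_default: Duration) -> Duration {
    let scaled = epoch_duration / 100;
    scaled.min(production_default)
}
```

With a 60-second test epoch and a 10-second production default this yields 600 ms, while a 24-hour epoch yields 864 s before clamping and therefore stays at the 10-second default.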

A fail-fast timeout was added on the joiner's epoch-2 wait so an excluded
joiner fails the test with a clear message instead of hanging.


Stale-share sign-failure fix: prepare-then-start handoff gate (+ epoch-advance robustness)

Heavy-load + slow-network testing surfaced two issues, now fixed (v4-gated where applicable; v3 runtime paths untouched).

Root cause — stale network-key shares after reconfiguration

Per-party adoption-time tracing showed the off-chain reconfiguration output is adopted at staggered times across validators (60–130s lags under load). The new epoch's MPC service started immediately at the reconfigure seam while the handoff data was still arriving asynchronously, so a validator entering epoch N could begin signing while still holding epoch N-1's (t,n) decryption-key sharing — combined with peers who already adopted epoch N's sharing, the threshold sign round failed with FailedToAdvanceMPC(InvalidParameters) (~50–130s after each advance).

Fixes

  1. prepare-then-start barrier (ika-node::monitor_reconfiguration) — before starting the new epoch's MPC components, install the new epoch's network-key blob-source overlay first (else the barrier deadlocks: network_keys_receiver is fed from whichever overlay is installed, and the per-iteration install runs only in the next loop iteration), then block until (a) the cross-epoch handoff cert anchoring the new epoch is present and re-verified against the signing committee — a second verification at consume-time on top of the one done before it is persisted, fail-closed on a tampered local cert DB — and (b) every tracked network key has surfaced its reconfiguration output for the new epoch (current_epoch >= next_epoch, non-empty output). Blocks indefinitely (safety-first; a stalled validator that is visibly not signing beats one signing with the wrong shares). Only a validator in the new epoch prepares. Clear logs (INFO entry/exit, WARN every ~10s with the breakdown) + metrics ika_handoff_prepare_{waiting,retries_total,duration_seconds}.

  2. advance_epoch session-completion gate (sui_executor) — under load sessions_manager::advance_epoch MoveAborts with ENotAllCurrentEpochSessionsCompleted (code 6) when a network-key system session starts after the quorum's received_end_of_publish snapshot; the notifier retried for an hour then panic!d the validator → quorum loss → cascade. Now re-checks SessionsManager::all_current_epoch_sessions_completed() (new, unit-tested) against just-synced state and holds the tick rather than submitting a doomed tx; the hour-long panic only guards genuinely fatal submission errors.

  3. Integration-test poll timeouts — raised the SDK #pollUntilCondition default timeout (30s → 10m) and retryUntil default attempts (30 → 600), and removed the now-redundant per-call { timeout: 600000 } overrides so every poll-site uses one default. MPC rounds legitimately take minutes under load; the short defaults were turning slowness into spurious Timeout waiting for… failures (never crypto failures).

  4. Pre-existing test-compile fix (new_validator_mpc_data_announcement missing blob arg) that broke the ika-core test build.
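
The readiness condition behind the prepare-then-start barrier in fix 1 can be sketched as a pure predicate. The name `all_network_keys_ready_for_epoch` and the per-key condition (current_epoch >= next_epoch, non-empty output) come from the text. `NetworkKeyStatus`, `handoff_prepared`, and the vacuous-truth behavior for an empty key list are assumptions made for illustration.

```rust
type EpochId = u64;

/// Hypothetical per-key view of what the validator has locally adopted.
struct NetworkKeyStatus {
    /// Epoch of the latest reconfiguration output surfaced for this key.
    current_epoch: EpochId,
    /// Raw reconfiguration output bytes; empty means "not yet received".
    reconfiguration_output: Vec<u8>,
}

/// True when every tracked key has a non-empty output for `next_epoch` or later.
/// Assumption: with no tracked keys the check passes vacuously.
fn all_network_keys_ready_for_epoch(keys: &[NetworkKeyStatus], next_epoch: EpochId) -> bool {
    keys.iter()
        .all(|k| k.current_epoch >= next_epoch && !k.reconfiguration_output.is_empty())
}

/// Barrier condition: the handoff cert is verified AND all keys are ready.
fn handoff_prepared(cert_verified: bool, keys: &[NetworkKeyStatus], next_epoch: EpochId) -> bool {
    cert_verified && all_network_keys_ready_for_epoch(keys, next_epoch)
}
```

A caller would poll `handoff_prepared` in a loop and start the new epoch's MPC components only once it returns true, which matches the block-indefinitely behavior described above.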

Tests

  • Unit: SessionsManager::all_current_epoch_sessions_completed + all_network_keys_ready_for_epoch truth tables.
  • v4 cluster: off_chain_metadata ✅, protocol_version_transition (v3→v4) ✅ — barrier engages, verifies, and clears at every transition; no InvalidParameters, no WaitingForNetworkKey wedge, no panic.
  • TS integration (fresh networks): all-combinations-future-sign 21/21 (previously produced 3 InvalidParameters sign failures), plus dwallet-creation 8/8, make-public-share-and-sign 7/7, global-presign 5/5, imported-key 6/6, dwallet-sign-during-dkg 10/10 — zero sign failures across every run, reaching epoch 8+ under the heavy suite that previously crashed at epoch 7. Every remaining intermittent test failure was a polling timeout under CPU starvation, not a crypto failure (addressed by the timeout-default change above).

omersadika and others added 30 commits May 17, 2026 16:16
Foundation for the off-chain validator-metadata read flow. Pure
types and no-op consensus dispatch — no behavior change, so the
acceptance gate `test_network_dkg_full_flow` still passes.

New types in `ika_types::validator_metadata`:
- ValidatorMpcDataAnnouncement / SignedValidatorMpcDataAnnouncement
- HandoffItemKey (sorted enum: NetworkDkgOutput | NetworkReconfigurationOutput | ValidatorMpcData)
- HandoffAttestation with `items: Vec<(HandoffItemKey, [u8;32])>` sorted strictly ascending — plain length-prefixed BCS list, no map-aware bindings needed for non-Rust verifiers
- HandoffSignatureMessage (Ed25519 sig by consensus key, NOT protocol key)
- CertifiedHandoffAttestation (Vec<(AuthorityName, Ed25519Signature)>; Ed25519 doesn't aggregate)
- EpochMpcDataReadySignal
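
For illustration, the strictly-ascending item ordering of `HandoffAttestation` can be sketched like this. The variant names come from the list above; the `u64` payloads, `canonicalize_items`, and `BuildError` are assumptions, since the real key contents are not shown here.

```rust
/// Simplified stand-in: derived `Ord` compares variant order first, then payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum HandoffItemKey {
    NetworkDkgOutput(u64),
    NetworkReconfigurationOutput(u64),
    ValidatorMpcData(u64),
}

type Digest = [u8; 32];

#[derive(Debug, PartialEq, Eq)]
enum BuildError {
    DuplicateKey(HandoffItemKey),
}

/// Sorts items by key and rejects duplicates, so every signer commits to the
/// same byte sequence when the list is serialized as a plain vector.
fn canonicalize_items(
    mut items: Vec<(HandoffItemKey, Digest)>,
) -> Result<Vec<(HandoffItemKey, Digest)>, BuildError> {
    items.sort_by(|a, b| a.0.cmp(&b.0));
    for pair in items.windows(2) {
        if pair[0].0 == pair[1].0 {
            return Err(BuildError::DuplicateKey(pair[0].0));
        }
    }
    Ok(items)
}
```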

IntentScope: +ValidatorMpcDataAnnouncement, +HandoffAttestation.

ConsensusTransactionKind + Key: 3 new variants + constructors +
key extraction + Debug arms. AuthorityPerEpochStore /
consensus_handler / consensus_validator wire dispatch as no-ops
(actual handlers land in later steps); the per-epoch sender-author
match enforces wire-binding for HandoffSignature and
EpochMpcDataReadySignal (signer == consensus author), and is a
trivial pass for ValidatorMpcDataAnnouncement (the inner BLS sig
authenticates the validator's intent independent of the relayer).

Unit tests cover BCS roundtrip + sort stability + ready-signal
roundtrip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anemo `ValidatorMetadata` service with one method
`GetMpcDataBlob(blob_hash) -> Option<MpcDataBlob>`. Backed by an
`InMemoryBlobStore` (RwLock<HashMap<[u8;32], Vec<u8>>>) implementing
`MpcDataBlobStorage`. Callers hash-verify returned bytes — the
network layer doesn't, and the doc comment on `fetch_blob` says so.

`AuthorityPerpetualTables::mpc_artifact_blobs: DBMap<[u8;32], Vec<u8>>`
with insert / get / iter helpers — the cross-restart store. At node
startup `create_p2p_network` iterates that table and hydrates the
in-memory cache before mounting the anemo server, so a restart
keeps serving whatever blobs the validator had persisted.

No producers or consumers are wired up yet; those land in subsequent
steps. The endpoint just serves whatever's been inserted (initially
nothing on a fresh node).

Acceptance gate `test_network_dkg_full_flow` passes (142s).
2 new unit tests in ika-network (`in_memory_blob_store_roundtrip`,
`mpc_data_blob_hash_is_deterministic`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Producer side (ika_core::validator_metadata):
- derive_mpc_data_blob(seed) returns the canonical BCS-encoded
  VersionedMPCData::V1 bytes — same encoding the CLI submits on
  chain via set_next_epoch_mpc_data_bytes. Deterministic from
  seed, so off-chain blobs hash-match chain bytes.
- now_ms() for the announcement timestamp (latest-by-timestamp
  rule means later calls win, which is correct after a seed
  rotation).
- sign_validator_mpc_data_announcement(...) builds + BLS-signs the
  announcement ready for consensus.

Consumer side (AuthorityPerEpochStore):
- New per-epoch table validator_mpc_data_announcements:
  DBMap<AuthorityName, SignedValidatorMpcDataAnnouncement>.
- record_validator_mpc_data_announcement verifies the BLS sig
  against self.committee() (current-epoch path only — next-epoch
  joiner path deferred to step 6) and applies the
  latest-by-timestamp rule on insert. Replays and stale duplicates
  are silently dropped.
- get_validator_mpc_data_announcement accessor.
- Consensus dispatch wires the ConsensusTransactionKind::
  ValidatorMpcDataAnnouncement variant through.
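
The latest-by-timestamp insert rule can be sketched as follows, with signature verification omitted. The `Announcement` struct and the `HashMap` table are simplified stand-ins for the signed announcement type and the per-epoch `DBMap`; the function name is hypothetical.

```rust
use std::collections::HashMap;

type AuthorityName = [u8; 32];

/// Simplified stand-in for an already-verified announcement.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Announcement {
    validator: AuthorityName,
    blob_hash: [u8; 32],
    timestamp_ms: u64,
}

/// Stores `incoming` only if it is strictly newer than the stored entry.
/// Returns true when the table was updated; replays (equal timestamp) and
/// stale duplicates (older timestamp) are dropped.
fn record_latest_by_timestamp(
    table: &mut HashMap<AuthorityName, Announcement>,
    incoming: Announcement,
) -> bool {
    let is_stale = table
        .get(&incoming.validator)
        .map_or(false, |existing| existing.timestamp_ms >= incoming.timestamp_ms);
    if is_stale {
        return false;
    }
    table.insert(incoming.validator, incoming);
    true
}
```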

Unit tests in ika-core::validator_metadata:
- derive_mpc_data_blob_is_deterministic
- sign_announcement_verifies_against_signer (covers intent
  scope + epoch binding + tamper detection).

Acceptance gate test_network_dkg_full_flow still passes (143s).
No producers are wired up yet; they land in subsequent steps along
with the ready-signal freeze.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two new epoch tables and a producer helper for the freeze step
of the off-chain validator-metadata flow.

`epoch_mpc_data_ready_signals` records, per authority, that this
validator has decided its mpc_data input set is sufficient (`>=
quorum_threshold` announcements observed). The first incoming signal
that crosses quorum triggers `freeze_mpc_data_if_first`, which
idempotently snapshots `validator_mpc_data_announcements` into
`frozen_validator_mpc_data_input_set` — the immutable, content-
addressed view of validator mpc_data used by all downstream
consumers (handoff, reconfig, joiner bootstrap).
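
A rough sketch of the idempotent freeze-on-first-quorum behavior is shown below. It assumes a stake-weighted quorum and uses in-memory maps in place of the epoch tables; `EpochState` and `record_ready_signal` are hypothetical names, and only the "snapshot once, on the first signal that crosses quorum" rule comes from the description.

```rust
use std::collections::{BTreeMap, BTreeSet};

type AuthorityName = u8; // stand-in identity for the sketch
type BlobHash = [u8; 32];

struct EpochState {
    stake: BTreeMap<AuthorityName, u64>,
    quorum_threshold: u64,
    announcements: BTreeMap<AuthorityName, BlobHash>,
    ready_signals: BTreeSet<AuthorityName>,
    /// `None` until the first quorum; immutable afterwards.
    frozen: Option<BTreeMap<AuthorityName, BlobHash>>,
}

impl EpochState {
    /// Records a ready signal from `sender` (authenticated by consensus).
    /// Returns true only on the single call that performs the freeze.
    fn record_ready_signal(&mut self, sender: AuthorityName) -> bool {
        if !self.stake.contains_key(&sender) {
            return false; // not a committee member
        }
        self.ready_signals.insert(sender);
        let signalled: u64 = self.ready_signals.iter().map(|a| self.stake[a]).sum();
        if signalled >= self.quorum_threshold && self.frozen.is_none() {
            // Snapshot the announcements table exactly once.
            self.frozen = Some(self.announcements.clone());
            return true;
        }
        false
    }
}
```

Because signals are consensus-ordered, every validator crosses quorum on the same signal and therefore snapshots the same table contents.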

The signal payload itself is unauthenticated; authorisation is the
consensus binding (the authority that submitted the transaction).
This is enforced at consensus dispatch in `AuthorityPerEpochStore`.

Producer side: `build_epoch_mpc_data_ready_signal_transaction` wraps
the signal in a `ConsensusTransaction` ready for the consensus
adapter.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.28s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Joining validators (in V_{e+1} but not in V_e) can't submit
directly to consensus because they aren't members of the current
consensus committee. They fan out their signed mpc_data
announcement to every current-committee peer over a new Anemo RPC
`SubmitMpcDataAnnouncement`; one honest relayer is enough to land
the announcement in consensus.

This commit lands the transport only:
- `SubmitMpcDataAnnouncementRequest{Response}` wire types.
- `AnnouncementRelay` trait (impl supplied by the node once epoch
  store + consensus adapter are up).
- `AnnouncementRelayHandle` — an `ArcSwapOption` late-binding
  holder, installed at first epoch start and re-installed across
  epoch boundaries. The Anemo server is constructed at node
  startup before any epoch store exists, so install-after-the-fact
  is needed.
- Anemo server impl that returns `Rejected` while the relay is
  uninstalled (joiners retry) and dispatches to the active relay
  otherwise.
- Client helpers: `submit_announcement_to_peer` (single peer) and
  `submit_announcement_to_committee` (concurrent fan-out).
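
The late-binding holder can be sketched with the standard library alone. The commit uses `ArcSwapOption`; the version below substitutes `RwLock<Option<Arc<...>>>` so that it stays self-contained, and the trait method, the `SubmitResponse` enum, and `AcceptAll` are simplified assumptions rather than the actual wire types.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, PartialEq, Eq)]
enum SubmitResponse {
    Accepted,
    Rejected,
}

/// Hypothetical relay interface; the node supplies the real implementation.
trait AnnouncementRelay: Send + Sync {
    fn relay(&self, announcement: &[u8]) -> SubmitResponse;
}

/// Holder constructed at server startup, before any relay exists.
#[derive(Default)]
struct AnnouncementRelayHandle {
    active: RwLock<Option<Arc<dyn AnnouncementRelay>>>,
}

impl AnnouncementRelayHandle {
    /// Installs (or replaces) the relay, e.g. at each epoch start.
    fn install(&self, relay: Arc<dyn AnnouncementRelay>) {
        *self.active.write().unwrap() = Some(relay);
    }

    fn clear(&self) {
        *self.active.write().unwrap() = None;
    }

    /// Dispatches to the active relay, or rejects so the joiner retries.
    fn submit(&self, announcement: &[u8]) -> SubmitResponse {
        // Clone the Arc out so the lock is not held during the relay call.
        let relay = self.active.read().unwrap().clone();
        match relay {
            Some(relay) => relay.relay(announcement),
            None => SubmitResponse::Rejected,
        }
    }
}

/// Trivial relay used only to exercise the sketch.
struct AcceptAll;

impl AnnouncementRelay for AcceptAll {
    fn relay(&self, _announcement: &[u8]) -> SubmitResponse {
        SubmitResponse::Accepted
    }
}
```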

Installation of the actual relay impl (which performs signature
verification against the pending active set) is deferred to the
PendingActiveSet step, since the relay needs that verification
before it can safely submit.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.61s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the placeholder next-epoch branch in
`record_validator_mpc_data_announcement` with real signature
verification gated on a `JoinerPubkeyProvider`.

`JoinerPubkeyProvider::is_registered_joiner(&AuthorityName) -> bool`
is the trait the Sui-backed lookup will implement; a future step
populates it from `validator_set.pending_active_set` plus each
entry's `StakingPool.validator_info`'s next-epoch pubkey. Until
that lands, `joiner_pubkey_provider` is unset and all next-epoch
announcements drop — current-epoch flow is unchanged.

`verify_joiner_announcement` is a pure helper (caller passes
`expected_epoch` and the provider). The per-epoch-store method
calls it and reacts to the four-way verdict
(Accept/UnregisteredJoiner/InvalidSignature/InconsistentEnvelope);
only `Accept` proceeds to the latest-by-timestamp insert rule.

The provider is held in an `ArcSwapOption` on
`AuthorityPerEpochStore`, swappable across epoch boundaries via
`install_joiner_pubkey_provider` / `clear_joiner_pubkey_provider`.
`AuthorityName == AuthorityPublicKeyBytes`, so the verifier uses
`signed.auth_sig.authority` as the pubkey directly — the provider
only authorizes *which* names are joinable.

Tests cover Accept, UnregisteredJoiner, InvalidSignature (tampered
blob hash), InconsistentEnvelope (wrong epoch + authority field
mismatch), and `StaticJoinerPubkeyProvider` membership semantics.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 148.28s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Lands the canonical, off-chain handoff attestation primitives
behind the next-step record/persist plumbing. These are the
building blocks each validator runs locally at EndOfPublish
(builder + signer) and that every validator runs on incoming
consensus signatures (verifier + aggregator).

- `build_handoff_attestation`: sorts items strictly ascending by
  `HandoffItemKey` (the wire format is a Vec, not a map, so the
  sort defines the canonical bytes every signer commits to);
  rejects duplicate keys.
- `hash_next_committee_pubkey_set`: dedup + sort + BCS-encode +
  Blake2b256 over the next committee's pubkey set. This goes in
  the attestation header, so verifiers can confirm the cert is
  bound to the committee they're handing off to.
- `sign_handoff_attestation`: Ed25519 over
  `bcs(IntentMessage::new(HandoffAttestation, attestation))` —
  signed with the validator's *consensus* key, NOT BLS. (Joiners
  look up signers' consensus pubkeys in the prior committee's
  on-chain validator info.)
- `ConsensusPubkeyProvider` trait + `StaticConsensusPubkeyProvider`
  for the consensus-pubkey lookup, mirroring the joiner-provider
  shape from step 6.
- `verify_handoff_signature` returns a four-way verdict
  (Accept/UnknownSigner/InvalidSignature/AttestationMismatch).
- `HandoffAggregator`: one-shot stake-weighted aggregator that
  emits `CertifiedHandoffAttestation` the first time signers
  cross `committee.quorum_threshold()`. Replacements don't
  double-count; non-committee signers are silently dropped (the
  consensus path also rejects them at the dispatch site, but the
  aggregator is defense-in-depth).
- `verify_certified_handoff_attestation`: standalone re-verify
  against a committee + provider — what joiners run during
  bootstrap on the cert they fetched.
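
The aggregator semantics described above (one-shot, stake-weighted, no double-counting, non-committee signers ignored) can be sketched like this. Signature verification is assumed to have happened before `insert`; `Committee`, `Certificate`, and the field names are simplified stand-ins, not the PR's types.

```rust
use std::collections::BTreeMap;

type AuthorityName = u8; // stand-in identity for the sketch
type Signature = Vec<u8>;

struct Committee {
    stake: BTreeMap<AuthorityName, u64>,
    quorum_threshold: u64,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Certificate {
    signatures: Vec<(AuthorityName, Signature)>,
}

struct HandoffAggregator {
    committee: Committee,
    signatures: BTreeMap<AuthorityName, Signature>,
    accumulated_stake: u64,
    certified: bool,
}

impl HandoffAggregator {
    fn new(committee: Committee) -> Self {
        Self { committee, signatures: BTreeMap::new(), accumulated_stake: 0, certified: false }
    }

    /// Adds an already-verified signature. Returns a certificate only on the
    /// insert that first crosses the quorum threshold.
    fn insert(&mut self, signer: AuthorityName, signature: Signature) -> Option<Certificate> {
        // Non-committee signers contribute nothing.
        let stake = *self.committee.stake.get(&signer)?;
        // A replacement from the same signer does not add stake again.
        if self.signatures.insert(signer, signature).is_none() {
            self.accumulated_stake += stake;
        }
        if self.certified || self.accumulated_stake < self.committee.quorum_threshold {
            return None;
        }
        self.certified = true;
        let signatures = self.signatures.iter().map(|(a, s)| (*a, s.clone())).collect();
        Some(Certificate { signatures })
    }
}
```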

Tests cover sort canonicalization, duplicate-key rejection,
pubkey-set hash invariance under reorder and dedup, sign+verify
round trip with the four verdict outcomes, aggregator quorum
crossing, replacement no-op, non-committee signer no-op, and
end-to-end certify-then-re-verify-with-tampered-sig.

Record / persist / EndOfPublish-trigger wiring land in
follow-on commits; these helpers are isolated and consumed at
those sites.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.26s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the consensus dispatch path for `HandoffSignature` to verify,
persist, and aggregate incoming Ed25519 signatures over the epoch's
handoff attestation.

Per-epoch state on `AuthorityPerEpochStore`:
- `handoff_signatures: DBMap<AuthorityName, Ed25519Signature>` —
  durable record of each verified signer's sig. Replays are
  no-ops via typed-store insert semantics.
- `expected_handoff_attestation: ArcSwapOption<HandoffAttestation>`
  — this validator's locally-computed attestation, installed by
  the producer side once mpc_data is frozen + DKG/reconfig digests
  are known. Until installed, incoming signatures drop silently
  (`AttestationMismatch` is the only possible verdict).
- `consensus_pubkey_provider: ArcSwapOption<...>` — Ed25519 lookup
  for signer pubkeys, populated by the same sui_syncer task that
  feeds the joiner provider.
- `handoff_aggregator: Mutex<Option<HandoffAggregator>>` — in-memory
  stake accumulator. Rebuilt from persisted signatures when the
  expected attestation is (re)installed, so restart replay folds
  prior consensus-ordered signatures back in correctly.

New pure helper in `validator_metadata`:
- `process_handoff_signature` runs `verify_handoff_signature` and,
  on `Accept`, inserts into the aggregator. Returns one of
  `Recorded`, `Certified(cert)`, or `Rejected(verdict)`. Three new
  unit tests cover quorum-crossing, attestation mismatch, and
  unknown-signer paths.

`PartialEq`/`Eq` added to `HandoffSignatureMessage` and
`CertifiedHandoffAttestation` so the record-outcome enum can derive
those traits for tests.

Consensus dispatch: the `HandoffSignature` arm now calls
`record_handoff_signature`. The returned cert (when quorum just
crossed) is intentionally dropped on the floor for now — the
perpetual-persist plumbing (step 7c) hangs off a dedicated drain
task that pulls from the in-memory aggregator. Dropping is safe
because the *next* ordered signature crossing quorum still mints a
cert, and restart-replay rebuilds the aggregator.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.08s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the handoff write path: once `record_handoff_signature`'s
in-memory aggregator crosses quorum, the resulting
`CertifiedHandoffAttestation` is immediately persisted into a
keep-forever perpetual table.

`AuthorityPerpetualTables`:
- New `certified_handoff_attestations: DBMap<EpochId,
  CertifiedHandoffAttestation>` table, keyed by the epoch the
  outgoing committee is handing off *from*.
- `insert_certified_handoff_attestation`,
  `get_certified_handoff_attestation`,
  `iter_certified_handoff_attestations` accessors.

The handoff feedback rule (keep certs forever) is load-bearing
because a joiner pulling history may need to verify the chain back
to whichever cert it has a trusted committee for; skipping any
single epoch's cert would permanently break their ability to
bootstrap.

`AuthorityPerEpochStore` gains
`perpetual_tables_for_handoff: ArcSwapOption<...>` plus
`install_perpetual_tables_for_handoff`. `ika-node` installs the
perpetual handle directly after constructing the epoch store, so
the very first cert produced by consensus lands on disk. When
nothing is installed (e.g. unit tests that don't wire perpetual),
the record path logs at debug level and keeps going — the cert
stays in the in-memory aggregator and joiner-bootstrap consumers
will simply miss it.

The `Certified` arm of `record_handoff_signature` now also
performs the perpetual write, with the persist failure logged
(not propagated) — failing the entire consensus-dispatch path on
a perpetual-DB hiccup would be far worse than a missing cert.

Tests: 3 new perpetual-table unit tests cover insert/get
roundtrip, ordered iteration across epochs, and byte-level
idempotency on identical re-writes.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 141.68s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the producer half of the handoff loop: when this validator
reaches EndOfPublish, the same task that submits its
`EndOfPublish` consensus transaction also builds, installs, signs,
and submits its `HandoffSignatureMessage` for the epoch — exactly
once.

The trigger pipeline:
1. `compute_handoff_items` (pure): combines frozen mpc_data set +
   per-network-key DKG output digests + per-network-key reconfig
   output digests into a sorted Vec<(HandoffItemKey, [u8;32])>.
   Empty inputs are valid (yields an empty list) — important
   because DKG/reconfig digest caching is step 9, and the
   attestation needs to be signable before then.
2. `AuthorityPerEpochStore::build_local_handoff_attestation`:
   reads the frozen set, hashes the supplied next-committee
   pubkey set, calls compute_handoff_items, and builds a
   well-formed attestation.
3. `AuthorityPerEpochStore::build_local_handoff_signature_transaction`:
   installs the attestation locally (so the per-epoch record path
   accepts matching peer signatures), signs it with the consensus
   key, and wraps it in a `ConsensusTransaction`.
4. `EndOfPublishSender` is upgraded to take the consensus keypair
   (Arc) + a `Receiver<Committee>` for the next epoch, plus an
   `AtomicBool` one-shot flag. The handoff submit happens after
   the EndOfPublish submit on the same tick.

Determinism across validators: identical inputs → identical
attestation bytes → matching signatures. The frozen set is
already agreed (step 4's quorum freeze); the next-committee
pubkey set is read from chain. Until step 9 populates DKG/reconfig
digests, every validator computes an attestation with those slots
empty — still agreed.

The handoff record path (step 7b) was already wired to consume
these signatures, and the perpetual persist (step 7c) writes the
cert as soon as quorum is reached. With this commit, the cycle
runs end-to-end given an actual EndOfPublish trigger.

Tests: 2 new unit tests cover `compute_handoff_items` sorting +
empty-input semantics, in addition to the existing 19 helper
tests.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 144.29s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the read side that closes the handoff loop: peers can pull a
`CertifiedHandoffAttestation` for any persisted epoch over a new
`ValidatorMetadata::GetCertifiedHandoffAttestation` RPC, and joiners
have a single-hop verification helper that binds the cert to the
specific committee they're trying to join.

Network layer:
- New `GetCertifiedHandoffAttestationRequest { epoch }` wire type.
- New `HandoffCertStorage` trait — the read-only counterpart to
  the perpetual store. Server holds an `Arc<C: HandoffCertStorage>`
  alongside the existing blob store.
- `ValidatorMetadataServer` is now `Server<S, C>`; the
  `build_server(storage, relay, cert_storage)` signature gained the
  `cert_storage` arg.
- Joiner-side `fetch_certified_handoff_attestation(network, peer,
  epoch)` mirrors the existing `fetch_blob`.

Adapter:
- `AuthorityPerpetualTables` implements `HandoffCertStorage` by
  delegating to `get_certified_handoff_attestation`. A perpetual-read
  error is logged and returned as `None` rather than propagated, since
  the Anemo hot path can't surface a typed error usefully.

ika-node:
- The perpetual handle is now passed into `build_server` so peers
  immediately see every cert that lands on disk (via step 7c's
  perpetual persist). No additional installation needed because
  `AuthorityPerpetualTables` is constructed eagerly at startup.

Joiner bootstrap helper in `ika-core::validator_metadata`:
- `verify_joiner_bootstrap_cert(cert, prior_committee, prior_
  consensus_pubkeys, expected_next_committee_pubkeys)` runs the
  full check: pubkey-set-hash binding (so a malicious peer can't
  hand a real cert for a different committee), then delegates to
  the existing `verify_certified_handoff_attestation` for the
  signature/stake check. One-hop only — joiners verify against
  the *prior* committee, not back to genesis. (Per handoff design
  memo: anchoring trust to the prior committee is sufficient since
  the joiner gets there through earlier hops they either already
  trust or are themselves bootstrapping from a known anchor.)

Tests: 1 new unit test exercising both the happy path and the
pubkey-set-mismatch refusal.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.31s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Populates the producer-side caches that feed the handoff
attestation's `NetworkDkgOutput` / `NetworkReconfigurationOutput`
items.

`AuthorityPerEpochStoreTrait` gains two methods, called from the
MPC producer at the exact point it builds the consensus output:
- `cache_network_dkg_output(key_id, output_bytes)`
- `cache_network_reconfiguration_output(key_id, output_bytes)`

Concrete `AuthorityPerEpochStore` impl:
- Hashes `output_bytes` to Blake2b256 (matching `mpc_data_blob_hash`'s
  function so peers can fetch this blob over the existing
  `GetMpcDataBlob` RPC).
- Writes the digest into one of two new per-epoch tables —
  `network_dkg_output_digests` or
  `network_reconfiguration_output_digests` — keyed by
  `dwallet_network_encryption_key_id`.
- Writes the blob bytes into perpetual `mpc_artifact_blobs` (if
  the perpetual handle is installed) so cross-restart serves work
  for free.
- All writes are idempotent on byte-identical replays.

`build_local_handoff_attestation` no longer takes the digest maps
as parameters; it reads them straight off the per-epoch store.
`EndOfPublishSender::send_handoff_signature` is updated to match.

Producer hook: `DWalletMPCService::new_dwallet_mpc_output`'s
User/System branch calls the trait methods for the DKG and
reconfig protocols (`!rejected` only — rejected outputs are
empty and shouldn't pollute the cache). Cache failures are
logged, not propagated — they don't fail the consensus output
emit, just degrade peer serveability.

`TestingAuthorityPerEpochStore` gets no-op impls; the integration
test gate doesn't exercise attestation contents so an in-memory
mirror isn't needed.

Tests: 2 new unit tests cover the per-epoch table semantics —
digest roundtrip + replay idempotency, and independence of the
DKG vs reconfig caches when keyed by the same key_id.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 141.54s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the per-network-key counterpart to `EpochMpcDataReadySignal`.
Validators can now signal readiness for a specific network key's
DKG (`NetworkKeyDKGReadySignal { authority, network_key_id,
epoch }`) earlier than the epoch-wide signal, because per-key
readiness is a narrower commitment — the validator only needs the
mpc_data required for *this* key, not all reconfig sessions.

Per-epoch state:
- `network_key_dkg_ready_signals: DBMap<(ObjectID, AuthorityName),
  ()>` — per-key, per-authority votes. Composite key keeps quorums
  scoped: the same authority signaling readiness for two keys
  produces two independent entries.

Record path:
- `record_network_key_dkg_ready_signal` is idempotent on replays.
  Quorum is per-key (sum stake of all authorities that signaled
  for `signal.network_key_id`). The first quorum of *any* signal
  kind — epoch-wide or per-key — calls `freeze_mpc_data_if_first`,
  which is already idempotent on a non-empty frozen set. Per-key
  quorums after that point are still recorded (DKG kickoff per key
  consumes them) but don't re-freeze.
- `has_network_key_dkg_ready_quorum(network_key_id)` exposes the
  per-key quorum state for step 14's session-kickoff gating.
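
A rough sketch of the per-key quorum bookkeeping, assuming an in-memory set in place of the `DBMap`, a `u64` in place of `ObjectID`, a `String` authority name, and a quorum threshold supplied by the caller (the real threshold comes from the committee and is not shown here):

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type KeyId = u64;            // stand-in for ObjectID
pub type AuthorityName = String; // stand-in for the real authority type

pub struct ReadySignals {
    votes: BTreeSet<(KeyId, AuthorityName)>,
    stake: BTreeMap<AuthorityName, u64>,
    quorum_threshold: u64, // assumed to be provided by the committee
}

impl ReadySignals {
    pub fn new(stake: BTreeMap<AuthorityName, u64>, quorum_threshold: u64) -> Self {
        Self { votes: BTreeSet::new(), stake, quorum_threshold }
    }

    /// Records a vote. Returns true only if the vote is new and the
    /// authority is a known committee member, so replays are no-ops.
    pub fn record(&mut self, key: KeyId, authority: &str) -> bool {
        self.stake.contains_key(authority) && self.votes.insert((key, authority.to_string()))
    }

    /// Sums the stake of every authority that signaled for `key` only.
    pub fn has_quorum(&self, key: KeyId) -> bool {
        let total: u64 = self
            .votes
            .iter()
            .filter(|(k, _)| *k == key)
            .filter_map(|(_, authority)| self.stake.get(authority))
            .sum();
        total >= self.quorum_threshold
    }
}
```

The composite `(key, authority)` entry is what keeps quorums scoped: a vote for one key never contributes stake to another key's total.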

Consensus wiring:
- New `ConsensusTransactionKind::NetworkKeyDKGReadySignal` +
  matching `ConsensusTransactionKey` variant.
- `new_network_key_dkg_ready_signal` constructor.
- Sender-authority check at verification time (consensus binding
  is the only authentication; no payload signature).
- Metric label + validator pass-through arms.

Producer helper:
- `build_network_key_dkg_ready_signal_transaction(authority,
  network_key_id, epoch)` wraps a signal in a
  `ConsensusTransaction` ready for submission.

Tests: 1 new unit test on `AuthorityEpochTables`'s
`network_key_dkg_ready_signals` table covers composite-key
scoping + replay idempotency.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.54s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Filters the frozen mpc_data input set down to the union of the
current and next committees before it's consumed by handoff cert
build (and, in step 14, reconfig MPC). Validators who announced
mpc_data this epoch but withdrew before next_committee was
selected get dropped — the cert no longer pins their entries and
reconfig MPC won't allocate work for them.

`compute_effective_reconfig_input_set(frozen, current, next) ->
BTreeMap<AuthorityName, [u8;32]>` is the pure helper; it
intersects with the union of both committee membership lists.
Both committee inputs are `IntoIterator` so callers can hand it
whatever shape they already have (Vec, &[..], `voting_rights`
iter).
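
A minimal sketch of the filter, assuming `AuthorityName` is a plain `String` for illustration (the real type is a public-key byte wrapper):

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type AuthorityName = String; // stand-in for the real authority type

/// Keeps only the frozen entries whose announcer belongs to the
/// current committee, the next committee, or both.
pub fn compute_effective_reconfig_input_set<C, N>(
    frozen: &BTreeMap<AuthorityName, [u8; 32]>,
    current: C,
    next: N,
) -> BTreeMap<AuthorityName, [u8; 32]>
where
    C: IntoIterator<Item = AuthorityName>,
    N: IntoIterator<Item = AuthorityName>,
{
    // Union of both membership lists.
    let members: BTreeSet<AuthorityName> = current.into_iter().chain(next).collect();
    frozen
        .iter()
        .filter(|(name, _)| members.contains(*name))
        .map(|(name, digest)| (name.clone(), *digest))
        .collect()
}
```

With `frozen = {stay, join, leave, withdrawn}`, `current = [stay, leave]`, and `next = [stay, join]`, only `withdrawn` is dropped.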

`AuthorityPerEpochStore::get_effective_reconfig_input_set` reads
the frozen set and the current committee from the store and
delegates to the pure helper. `build_local_handoff_attestation`
now goes through this method instead of pulling `frozen` raw,
so cert items reflect the effective set.

Tests: 2 new unit tests cover the intersection semantics —
a four-authority scenario where staying members, joiners, and
withdrawers each take their expected path through the filter, plus
the degenerate case where no announcer overlaps the committees.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.88s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the read-side abstraction that lets the sui_syncer prefer
locally-cached protocol output blobs over the chain blobs when
assembling `DWalletNetworkEncryptionKeyData`. The lightweight
fields (id, current_epoch, dkg_at_epoch, state) always come from
chain — those are authoritative — but the large
`network_dkg_public_output` and
`current_reconfiguration_public_output` blobs can come from the
local content-addressed cache populated by step 9's producer
caching.

New in `ika-core::validator_metadata`:
- `NetworkKeyBlobSource` trait: `network_dkg_output_blob(key_id)`
  and `network_reconfiguration_output_blob(key_id)`, both
  returning `Option<Vec<u8>>`. `None` means "fall back to chain".
- `StaticNetworkKeyBlobSource` — empty-by-default in-memory impl,
  used by tests and as the typed-empty default.
- `fetch_network_key_data_with_off_chain_blobs(chain_data,
  source) -> DWalletNetworkEncryptionKeyData`: takes the chain
  copy, overlays each large blob from `source` if present.

`AuthorityPerEpochStore` implements `NetworkKeyBlobSource` by
looking up the per-epoch digest cache from step 9
(`network_dkg_output_digests` / `network_reconfiguration_output_
digests`) and then fetching the blob bytes from the perpetual
`mpc_artifact_blobs` store. A missing digest *or* a missing blob
returns `None`, so every lookup step has the chain copy as its
fallback.
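
The overlay semantics can be sketched as below. `NetworkKeyData` is a simplified stand-in for `DWalletNetworkEncryptionKeyData`, the `u64` key id is an assumption, and `overlay_off_chain_blobs` is an illustrative name rather than the function added in this commit.

```rust
use std::collections::HashMap;

pub trait NetworkKeyBlobSource {
    fn network_dkg_output_blob(&self, key_id: u64) -> Option<Vec<u8>>;
    fn network_reconfiguration_output_blob(&self, key_id: u64) -> Option<Vec<u8>>;
}

/// Empty-by-default in-memory source, in the spirit of the static impl.
#[derive(Default)]
pub struct StaticNetworkKeyBlobSource {
    pub dkg: HashMap<u64, Vec<u8>>,
    pub reconfig: HashMap<u64, Vec<u8>>,
}

impl NetworkKeyBlobSource for StaticNetworkKeyBlobSource {
    fn network_dkg_output_blob(&self, key_id: u64) -> Option<Vec<u8>> {
        self.dkg.get(&key_id).cloned()
    }
    fn network_reconfiguration_output_blob(&self, key_id: u64) -> Option<Vec<u8>> {
        self.reconfig.get(&key_id).cloned()
    }
}

// Simplified stand-in: only two metadata fields and the two large blobs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NetworkKeyData {
    pub id: u64,
    pub current_epoch: u64,
    pub network_dkg_public_output: Vec<u8>,
    pub current_reconfiguration_public_output: Vec<u8>,
}

/// Metadata always comes from the chain copy; each large blob comes
/// from `source` when present and from the chain copy otherwise.
pub fn overlay_off_chain_blobs(
    chain_data: NetworkKeyData,
    source: &dyn NetworkKeyBlobSource,
) -> NetworkKeyData {
    let NetworkKeyData {
        id,
        current_epoch,
        network_dkg_public_output,
        current_reconfiguration_public_output,
    } = chain_data;
    NetworkKeyData {
        id,
        current_epoch,
        network_dkg_public_output: source
            .network_dkg_output_blob(id)
            .unwrap_or(network_dkg_public_output),
        current_reconfiguration_public_output: source
            .network_reconfiguration_output_blob(id)
            .unwrap_or(current_reconfiguration_public_output),
    }
}
```

With an empty source the function returns the chain copy unchanged, which matches the all-fall-back test case described in this commit.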

Syncer wiring (replacing the chain-read in
`sui_syncer::sync_dwallet_network_keys` with the wrapper) is the
next commit; this one lays the infrastructure.

Tests: 2 new unit tests cover the overlay semantics — partial
overlay (DKG from source, reconfig from chain) and the
all-fall-back case where the source is empty and the merged data
equals the chain copy byte-for-byte.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.76s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the off-chain assembler for the load-bearing
`Committee.class_groups_public_keys_and_proofs` map — the
HashMap reconfig MPC reads to find each committee member's
class-groups encryption key + correctness proof. The new path
decodes blobs locally from the perpetual `mpc_artifact_blobs`
store, keyed by digests pinned in the validators'
`ValidatorMpcDataAnnouncement`s.

The completion gate (per the design memo) is strict:
`assemble_committee_class_groups_off_chain` returns
`OffChainClassGroupsAssembly::Complete(map)` *only* when every
supplied authority resolved successfully — blob found, BCS-
decoded to `VersionedMPCData`, inner bytes decoded to
`ClassGroupsEncryptionKeyAndProof`. Even one missing or
malformed entry forces `Incomplete { missing: [...] }`, and the
caller must fall back to the chain-read path.

Why strict: reconfig MPC reads
`Committee.class_groups_public_keys_and_proofs[authority]`
directly, and a missing/empty entry silently drops that
validator's share without aborting. The existing chain-read path
in `sui_syncer::new_committee` already has this footgun (a
`filter_map` that swallows decode errors per-validator); the
off-chain path *must not* repeat it. Hence: all-or-nothing.
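
A sketch of the all-or-nothing gate under simplifying assumptions: authority names are `String`s, blobs live in an in-memory `HashMap`, and the two-stage BCS decode is collapsed into a caller-supplied `decode` closure. The function name is illustrative.

```rust
use std::collections::{BTreeMap, HashMap};

pub type AuthorityName = String; // stand-in for the real authority type
pub type Digest = [u8; 32];

#[derive(Debug, PartialEq, Eq)]
pub enum OffChainClassGroupsAssembly<T> {
    Complete(HashMap<AuthorityName, T>),
    Incomplete { missing: Vec<AuthorityName> },
}

/// Resolves every authority or reports the ones that failed.
/// A missing digest, a missing blob, and a decode failure are all
/// treated the same way: the authority is listed as missing.
pub fn assemble_all_or_nothing<T>(
    authorities: &[AuthorityName],
    announced_digests: &BTreeMap<AuthorityName, Digest>,
    blobs: &HashMap<Digest, Vec<u8>>,
    decode: impl Fn(&[u8]) -> Option<T>,
) -> OffChainClassGroupsAssembly<T> {
    let mut map = HashMap::new();
    let mut missing = Vec::new();
    for authority in authorities {
        let decoded = announced_digests
            .get(authority)
            .and_then(|digest| blobs.get(digest))
            .and_then(|bytes| decode(bytes.as_slice()));
        match decoded {
            Some(value) => {
                map.insert(authority.clone(), value);
            }
            None => missing.push(authority.clone()),
        }
    }
    if missing.is_empty() {
        OffChainClassGroupsAssembly::Complete(map)
    } else {
        OffChainClassGroupsAssembly::Incomplete { missing }
    }
}
```

Unlike a per-validator `filter_map`, this shape never hands the caller a partial map, so a decode error cannot silently shrink the committee.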

Wiring `sui_syncer::new_committee` to try off-chain first and
fall back on `Incomplete` is the next commit; this commit lands
the pure assembler.

Tests: 3 new unit tests cover (a) the happy path — two seeded
blobs round-trip through `derive_mpc_data_blob` →
`mpc_data_blob_hash` → an in-memory store → assembly back into
the map; (b) missing-blob aborts with the missing authority
listed; (c) corrupt-blob (bytes don't decode as
`VersionedMPCData`) also aborts.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.26s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DKG and reconfig sessions now wait on the off-chain mpc_data
freeze before instantiating. Honest validators that observe the
chain event before the consensus-side freeze quorum lands park
the request and retry on every subsequent batch cycle until the
gate opens.

Gate conditions, evaluated against the per-epoch store:
- `NetworkEncryptionKeyDkg(key_id)` requires
  `is_mpc_data_frozen() && has_network_key_dkg_ready_quorum(key_id)`.
  Per-key quorum makes a stronger commitment than the epoch-wide
  signal: it certifies that this *specific* key has enough peers
  ready to actually participate.
- `NetworkEncryptionKeyReconfiguration(_)` requires only
  `is_mpc_data_frozen()`. Reconfig sweeps every key the validator
  knows about; a per-key gate would deadlock if the per-key
  quorum needed reconfig output for kickoff.
- Everything else (user DKG, presign, sign, etc.) is unaffected.

`AuthorityPerEpochStoreTrait` gains the two query methods
`is_mpc_data_frozen` and `has_network_key_dkg_ready_quorum`,
implemented concretely against `frozen_validator_mpc_data_input_set`
and `network_key_dkg_ready_signals` respectively. The previously
inherent-only `has_network_key_dkg_ready_quorum` is gone — it's
now exclusively a trait method.

`TestingAuthorityPerEpochStore`'s impls return `Ok(true)` for
both: integration tests don't drive the freeze flow end-to-end
and would otherwise deadlock at the gate. Production builds use
the real store where these reflect actual consensus-observed
state.

In the manager, a new `requests_pending_for_frozen_mpc_data:
Vec<DWalletSessionRequest>` queue mirrors the existing pending
queues. Drained at the top of every `handle_mpc_request_batch`
by re-running each request through `handle_mpc_request`. Requests
that don't pass get re-queued; those that do proceed through the
existing kickoff path.
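
The park-and-retry behavior can be sketched as follows. `Request` and `Manager` are reduced stand-ins for `DWalletSessionRequest` and `DWalletMPCManager`, and the gate state is held in plain fields instead of being queried from the per-epoch store.

```rust
use std::collections::HashSet;

// Reduced stand-in for `DWalletSessionRequest`; payloads are key ids.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Request {
    NetworkDkg(u64),
    NetworkReconfiguration(u64),
    Sign(u64),
}

#[derive(Default)]
pub struct Manager {
    pub mpc_data_frozen: bool,
    pub ready_quorum_keys: HashSet<u64>,
    pub pending_for_frozen_mpc_data: Vec<Request>,
    pub started: Vec<Request>,
}

impl Manager {
    fn gate_passes(&self, request: &Request) -> bool {
        match request {
            // DKG needs the freeze and the per-key quorum.
            Request::NetworkDkg(key) => {
                self.mpc_data_frozen && self.ready_quorum_keys.contains(key)
            }
            // Reconfig needs only the freeze.
            Request::NetworkReconfiguration(_) => self.mpc_data_frozen,
            // Everything else is unaffected by the gate.
            Request::Sign(_) => true,
        }
    }

    pub fn handle_request(&mut self, request: Request) {
        if self.gate_passes(&request) {
            self.started.push(request);
        } else {
            self.pending_for_frozen_mpc_data.push(request);
        }
    }

    /// Drains the parked queue first, then processes the new batch.
    /// Requests that still fail the gate are parked again.
    pub fn handle_batch(&mut self, batch: Vec<Request>) {
        let parked = std::mem::take(&mut self.pending_for_frozen_mpc_data);
        for request in parked.into_iter().chain(batch) {
            self.handle_request(request);
        }
    }
}
```

Taking the queue with `std::mem::take` before re-running the requests avoids iterating a vector that `handle_request` may push back into.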

Made `DWalletMPCManager.epoch_store` `pub(crate)` so the gate
check in `mpc_session.rs` can reach it.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 144.14s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the producer-side task without which the off-chain freeze
quorum can never be reached, leaving step 14's kickoff gate
permanently closed and stalling network DKG / reconfig.

The new `MpcDataAnnouncementSender` (sibling of
`EndOfPublishSender` under `sui_connector`) runs once per epoch
per validator and:
1. Derives the canonical class-groups `mpc_data` blob from the
   validator's `RootSeed` (via `derive_mpc_data_blob` — identical
   bytes to what the CLI submits on chain).
2. Persists the blob into perpetual `mpc_artifact_blobs` so
   peers can fetch it by digest over the existing
   `GetMpcDataBlob` RPC.
3. Signs and submits a `ValidatorMpcDataAnnouncement` over
   consensus. Submission is idempotent — replays use the latest-
   by-timestamp rule.
4. After its own announcement is in, submits an
   `EpochMpcDataReadySignal` — one of two signal types whose
   quorum drives `freeze_mpc_data_if_first`.
5. Submits `NetworkKeyDKGReadySignal` for every known network
   key (deduped via a `HashSet`).

Each of (3), (4), (5) is gated by its own one-shot flag plus
ack-on-success, so a transient consensus-adapter failure causes
a retry on the next tick (every 2s) rather than blowing up the
task.
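
One way to picture the one-shot flag with ack-on-success is the sketch below. `OneShot` and `try_once` are hypothetical names for illustration; the real task is async and ticks on a timer, which is omitted here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// One-shot guard that is only marked done after a successful submit.
pub struct OneShot {
    done: AtomicBool,
}

impl OneShot {
    pub const fn new() -> Self {
        Self { done: AtomicBool::new(false) }
    }

    pub fn is_done(&self) -> bool {
        self.done.load(Ordering::Acquire)
    }

    /// Runs `submit` unless an earlier call already succeeded.
    /// Returns Ok(true) if this call submitted, Ok(false) if it was
    /// skipped, and the error (leaving the flag unset) on failure so
    /// the next tick retries.
    pub fn try_once<E>(&self, submit: impl FnOnce() -> Result<(), E>) -> Result<bool, E> {
        if self.is_done() {
            return Ok(false);
        }
        submit()?;
        self.done.store(true, Ordering::Release);
        Ok(true)
    }
}
```

Setting the flag only after `submit` returns `Ok` is the point of the pattern: a failed attempt leaves the step eligible for the next tick instead of being silently skipped.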

Step-14 gate softened to match the design memo's "first quorum
of either signal type freezes mpc_data" — DKG kickoff now only
requires `is_mpc_data_frozen()`, same as reconfig. The per-key
signal stays as an alternate freeze trigger but isn't a separate
hard requirement, since the sui_syncer skips
`AwaitingNetworkDKG` keys from the network-keys snapshot,
meaning the producer task can't observe a fresh DKG-target key
to signal for until *after* DKG completes — which would
deadlock.

Wired from `ika-node::monitor_reconfiguration` alongside
`EndOfPublishSender`. `AuthorityState::perpetual_tables()` added
to expose the perpetual handle without making the field public.

The abort-on-epoch-end pattern follows
`end_of_publish_sender_handle`.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.64s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Lights up step 6's joiner verify path by installing a
`StaticJoinerPubkeyProvider` on the current epoch store, sourced
from the next-epoch committee snapshot already kept live by
`sui_syncer::sync_next_committee` and exposed via
`next_epoch_committee_receiver`. Without this, every next-epoch
(joiner) `ValidatorMpcDataAnnouncement` drops silently because the
provider field is `None` by default.

The new per-epoch `JoinerPubkeyProviderUpdater` task watches the
receiver, computes the joiner set as `V_{e+1}.voting_rights`'s
authority names, and calls
`AuthorityPerEpochStore::install_joiner_pubkey_provider`. Since
`AuthorityName == AuthorityPublicKeyBytes`, the BLS sig verify in
`verify_joiner_announcement` runs against the announcer's claimed
authority directly — no separate pubkey lookup needed.

Idempotent: `last_installed` cache short-circuits re-installation
when the underlying set is byte-identical to the last one we
installed.

This is a *simplification* of the design memo's "verify against
PendingActiveSet" prescription: we wait until V_{e+1} is selected
on chain instead of reading `PendingActiveSet` directly. Trade-off
— joiners can't announce earlier than V_{e+1} selection, but
reading the `ExtendedField` for PendingActiveSet would require a
new Sui dynamic-field plumbing path that isn't justified for v1.
Early-announce can be added later if join-latency becomes a real
concern.

Spawned alongside the producer task in
`monitor_reconfiguration`; aborted on epoch end via the same
pattern as `end_of_publish_sender_handle`.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 271.18s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the verify side of step 7's handoff loop. Without this, the
`ConsensusPubkeyProvider` field stays `None` and every incoming
`HandoffSignatureMessage` drops as `UnknownSigner` — meaning no
peer's signature ever counts toward the aggregator's quorum and the
cert never gets minted.

The new `ConsensusPubkeyProviderUpdater` task fetches the current
committee's `StakingPool.validator_info.consensus_pubkey_bytes`
directly via `sui_client.get_system_inner()` →
`active_committee.members` →
`get_validators_info_by_ids` → `verify().consensus_pubkey`. The
result is mapped `AuthorityName -> Ed25519PublicKey` and installed
as a `StaticConsensusPubkeyProvider` on the per-epoch store.

Cadence: 15s (consensus pubkey is fixed at validator registration
and shouldn't change mid-epoch). Idempotent re-install via a
base64-serialized cache key on the last installed map.

Sources the system inner directly rather than plumbing
`system_object_receiver` out of `SuiSyncer` — one extra RPC every
15s is cheaper than the receiver-broadcast plumbing.

Wired in `monitor_reconfiguration` alongside the
joiner-pubkey-provider updater and the producer task; aborted on
epoch end via the same pattern as `end_of_publish_sender_handle`.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 209.13s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires step 12's overlay into the chain-read path. The syncer's
`sync_dwallet_network_keys` task now applies
`fetch_network_key_data_with_off_chain_blobs` to every chain copy
before sending it on the watch channel, so consumers see locally-
cached DKG / reconfig output blobs (populated by step 9's
producer cache) instead of fetching them from chain on every
re-read.

Plumbing:
- `SuiConnectorService` gains
  `network_key_blob_source: Arc<ArcSwapOption<Box<dyn
  NetworkKeyBlobSource>>>` plus an
  `install_network_key_blob_source` method.
- The handle is created (empty) at service construction and
  passed by clone into the syncer task, where
  `sync_dwallet_network_keys` reads it on each fetch tick.
- New adapter `EpochStoreBlobSource` wraps
  `Weak<AuthorityPerEpochStore>` so the long-lived service can
  hold a per-epoch reference; the weak upgrade returns `None`
  cleanly when the epoch ends, which makes the overlay fall back
  to the chain blob via `unwrap_or` on each field.
- `ika-node::monitor_reconfiguration` calls
  `sui_connector_service.install_network_key_blob_source(...)`
  once per epoch with a fresh `EpochStoreBlobSource` pointing at
  the new `cur_epoch_store`. Each install atomically replaces the
  previous epoch's source.
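
The `Weak`-backed fallback can be illustrated with a small sketch. `EpochStore` here is a one-field stand-in for `AuthorityPerEpochStore`, and `dkg_blob_or_chain` is a hypothetical helper showing the `unwrap_or` step for a single field.

```rust
use std::sync::{Arc, Weak};

// One-field stand-in for `AuthorityPerEpochStore`.
pub struct EpochStore {
    pub dkg_blob: Option<Vec<u8>>,
}

/// Holds a weak reference so a long-lived service never keeps an
/// ended epoch's store alive.
pub struct EpochStoreBlobSource {
    store: Weak<EpochStore>,
}

impl EpochStoreBlobSource {
    pub fn new(store: &Arc<EpochStore>) -> Self {
        Self { store: Arc::downgrade(store) }
    }

    /// Returns `None` when the epoch store is gone or has no blob.
    pub fn network_dkg_output_blob(&self) -> Option<Vec<u8>> {
        self.store.upgrade()?.dkg_blob.clone()
    }
}

/// Prefers the locally cached blob and falls back to the chain copy.
pub fn dkg_blob_or_chain(source: &EpochStoreBlobSource, chain_blob: Vec<u8>) -> Vec<u8> {
    source.network_dkg_output_blob().unwrap_or(chain_blob)
}
```

Once the last `Arc` to the epoch store is dropped, `upgrade()` yields `None` and the caller sees the chain blob without any explicit teardown.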

The lightweight metadata (id, current_epoch, dkg_at_epoch, state)
always comes from chain — only the two large output blobs may be
overlaid. When no source is installed, behavior is unchanged
byte-for-byte.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 202.94s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires step 13's pure assembler (`assemble_committee_class_groups_off_chain`)
into the next-committee construction path. When the off-chain set
covers every committee member, the resulting class-groups
public-keys-and-proofs map comes straight from validators' own
`mpc_data` announcements + the perpetual blob store instead of
refetching from chain. `Incomplete` paths transparently fall
through to the existing `get_mpc_data_from_validators_pool` read.

New abstractions in `validator_metadata`:
- `OffChainCommitteeClassGroupsSource` trait — single method
  `try_assemble_class_groups(&[AuthorityName]) ->
  OffChainClassGroupsAssembly`.
- `EpochStoreClassGroupsSource` adapter holds
  `Weak<AuthorityPerEpochStore>` (for the per-authority
  announcement digest lookup) + `Arc<AuthorityPerpetualTables>`
  (for the digest→bytes blob lookup), and delegates to the pure
  assembler. Returns `Incomplete` cleanly when the weak upgrade
  fails (epoch ended).

Plumbing:
- `SuiConnectorService` gains a second
  `Arc<ArcSwapOption<Box<dyn OffChainCommitteeClassGroupsSource>>>`
  handle with a matching `install_class_groups_source` setter.
- The handle is passed by clone into `SuiSyncer::run` and on to
  `sync_next_committee` → `new_committee`, where the off-chain
  attempt happens before the chain read.
- `ika-node::monitor_reconfiguration` installs a fresh
  `EpochStoreClassGroupsSource` once per epoch right next to the
  blob-source install. Each install atomically replaces the
  previous epoch's source.

Strict-gate rationale preserved: `new_committee` only short-
circuits to the off-chain map on `Complete`. Any missing
authority — joiner whose announcement hasn't been verified yet,
blob not yet replicated, decode failure — falls through to chain,
which is the only safe option since the load-bearing rule says
reconfig MPC silently drops validators with no class-groups
entry.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 265.04s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the consumer side of step 5. The Anemo
`SubmitMpcDataAnnouncement` handler had been returning
`Rejected{"relay not installed"}` for every joiner submission;
this commit installs a concrete relay per epoch so the RPC
actually forwards joiner announcements into consensus.

The relay (`ConsensusBackedAnnouncementRelay` in
`sui_connector::announcement_relay`) runs three steps:
1. Cheap envelope checks — refuses unless
   `announcement.epoch == next_epoch`, since current-epoch
   announcements come from members who can submit themselves
   directly.
2. Joiner verify via the pure
   `validator_metadata::verify_joiner_announcement` against the
   per-epoch store's installed `JoinerPubkeyProvider` (populated
   by the joiner-provider syncer from step 6). Rejection here
   stops a malicious peer from using us as a spam pipe.
3. Wraps in `ConsensusTransaction::new_validator_mpc_data_announcement`
   and submits via the consensus adapter.

Plumbing:
- `P2pComponents` gains a `mpc_announcement_relay` field
  (`Arc<AnnouncementRelayHandle>`) so the long-lived handle the
  Anemo server already holds is also reachable from
  `monitor_reconfiguration`.
- `IkaNode` stashes the same handle so the per-epoch install
  loop can swap relays without re-touching the network layer.
- New `AuthorityPerEpochStore::joiner_pubkey_provider()` getter
  exposes the installed provider for the relay's verify step
  (mirrors the existing install/clear pair).

Install point: alongside the other per-epoch installs in
`monitor_reconfiguration`. Each epoch's relay holds
`Weak<AuthorityPerEpochStore>` so it naturally fails closed when
the epoch ends (returns "epoch ended" until the new epoch's
relay replaces it).

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 247.16s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reorganizes the four files that have no Sui RPC dependency and
shouldn't have been under `sui_connector/`. They all just hold a
`Weak<AuthorityPerEpochStore>` + an `Arc<dyn SubmitToConsensus>`
and run as per-epoch background tasks that emit
`ConsensusTransaction`s; that's a different responsibility from
`sui_connector/` (which talks to Sui RPC).

Moved (identical bytes):
- `sui_connector/end_of_publish_sender.rs` →
  `epoch_tasks/end_of_publish_sender.rs`
- `sui_connector/mpc_data_announcement_sender.rs` →
  `epoch_tasks/mpc_data_announcement_sender.rs`
- `sui_connector/joiner_pubkey_provider_updater.rs` →
  `epoch_tasks/joiner_pubkey_provider_updater.rs`
- `sui_connector/announcement_relay.rs` →
  `epoch_tasks/announcement_relay.rs`

Kept in `sui_connector/`:
- `consensus_pubkey_provider_updater.rs` — actually calls
  `sui_client.get_system_inner()` + `get_validators_info_by_ids`,
  so it belongs with the Sui-side updaters.

The four moved files use only `crate::` paths internally, so they
need no import edits; the only external rename is in
`ika-node/src/lib.rs` (s/sui_connector/epoch_tasks/ on four
call sites).

Module layout follows the CLAUDE.md `xxx.rs` convention:
new `crates/ika-core/src/epoch_tasks.rs` declares the four
submodules, files live in `epoch_tasks/`. No `mod.rs`.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 144.80s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three structural changes so the handoff loop is generic and not
phrased as a validator-metadata feature:

1) Types extracted to `ika-types::handoff`.
   `HandoffItemKey`, `HandoffAttestation`,
   `HandoffSignatureMessage`, and `CertifiedHandoffAttestation`
   move out of `validator_metadata.rs`. `validator_metadata.rs`
   keeps only the four validator-specific types
   (`ValidatorMpcDataAnnouncement`,
   `SignedValidatorMpcDataAnnouncement`,
   `EpochMpcDataReadySignal`, `NetworkKeyDKGReadySignal`).
   Cross-crate import sites updated.

2) `HandoffSignatureSender` extracted from `EndOfPublishSender`.
   The latter shrinks back to "submit EndOfPublish on the local
   trigger" and nothing else. The new sender lives in
   `epoch_tasks/handoff_signature_sender.rs` and runs on the same
   `end_of_publish_receiver` independently. ika-node spawns both
   side-by-side and aborts both on epoch end.

3) `HandoffItemsBuilder` trait + concrete
   `MpcDataHandoffItemsBuilder`. Item contributors plug in via the
   trait; `AuthorityPerEpochStore::build_local_handoff_attestation`
   now takes `&[Arc<dyn HandoffItemsBuilder>]` and folds each
   contribution into the attestation. Today only the MPC-data
   builder is registered (via `default_handoff_items_builders`);
   new features (NOA, sui-state pinning, etc.) can append their
   own builder without touching the producer or aggregator.
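
   A rough sketch of the fold, assuming a `build_items` method name, a `(String, [u8; 32])` item shape in place of `(HandoffItemKey, digest)`, and a final sort to keep the output order-independent. None of these details are confirmed by this commit message.

```rust
use std::sync::Arc;

// Stand-in for `(HandoffItemKey, [u8; 32])`.
pub type Item = (String, [u8; 32]);

pub trait HandoffItemsBuilder {
    // Method name is assumed for this example.
    fn build_items(&self) -> Vec<Item>;
}

/// Trivial builder that returns a fixed list, used for illustration.
pub struct StaticBuilder(pub Vec<Item>);

impl HandoffItemsBuilder for StaticBuilder {
    fn build_items(&self) -> Vec<Item> {
        self.0.clone()
    }
}

/// Collects every builder's contribution and sorts the result so the
/// registration order of builders does not affect the output.
pub fn fold_handoff_items(builders: &[Arc<dyn HandoffItemsBuilder>]) -> Vec<Item> {
    let mut items: Vec<Item> = builders.iter().flat_map(|b| b.build_items()).collect();
    items.sort();
    items
}
```

   A new feature would register one more `Arc<dyn HandoffItemsBuilder>` in the slice; the fold itself stays unchanged.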

`HandoffItemKey` stays a typed enum for now — moving to opaque
byte keys was the fourth level I called out and explicitly
deferred. Adding a new item kind still requires a variant bump,
which is the right trade-off while the variant count is small.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 295.42s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The module name `validator_metadata` was misleading — it bundled
three orthogonal P2P endpoints that have nothing to do with
"validator metadata" in the dictionary sense. Rename to
`mpc_artifacts` and split into purpose-named submodules:

- `mpc_artifacts/blob_store.rs` — content-addressed `mpc_data`
  blob storage (`MpcDataBlobStorage`, `InMemoryBlobStore`,
  `mpc_data_blob_hash`, `GetMpcDataBlobRequest`, `MpcDataBlob`,
  `fetch_blob`).
- `mpc_artifacts/announcement_relay.rs` — joiner announcement
  forwarding (`AnnouncementRelay`, `AnnouncementRelayHandle`,
  `SubmitMpcDataAnnouncement{Request,Response}`,
  `submit_announcement_to_peer`,
  `submit_announcement_to_committee`).
- `mpc_artifacts/handoff_cert.rs` — handoff cert retrieval
  (`HandoffCertStorage`, `GetCertifiedHandoffAttestationRequest`,
  `fetch_certified_handoff_attestation`).
- `mpc_artifacts/server.rs` — Anemo `ValidatorMetadata` impl,
  unchanged behavior (moved + import paths fixed).
- `mpc_artifacts.rs` — top-level module: `mod generated`,
  submodule declarations, re-exports of every public surface so
  external callers still write `ika_network::mpc_artifacts::X`
  without caring which submodule X lives in, and the public
  `build_server` constructor.

Anemo service wire name stays `ValidatorMetadata` (and the
codegen include stays `ika.ValidatorMetadata.rs`) — the
rename is internal-only, no protocol break. Tests for each
submodule moved next to their code (blob_store + relay tests).

External rename: `ika_network::validator_metadata` →
`ika_network::mpc_artifacts` across ika-core, ika-node, ika-types
inline paths, and ika-network's own build.rs request_type /
response_type paths.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 265.88s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a single `off_chain_validator_metadata` feature flag and
bumps `MAX_PROTOCOL_VERSION` from 4 to 5; the flag flips on at v5.
All off-chain pipeline hooks now check this flag and fall back to
legacy chain-only behavior when false. The Sui-style protocol-
version advance means every validator switches together at the
exact consensus round the network advances to v5 — no mixed-
version freeze-quorum stalls, no asymmetric blob caches, no
divergent handoff attestations.

Six gates, all failing closed to legacy:

1. Producer tasks self-exit on `run()` when the flag is false:
   `MpcDataAnnouncementSender`, `HandoffSignatureSender`,
   `JoinerPubkeyProviderUpdater`,
   `ConsensusPubkeyProviderUpdater`. Each reads
   `epoch_store.protocol_config().off_chain_validator_metadata_enabled()`
   once at task start.

2. ika-node `monitor_reconfiguration` reads the flag once per
   epoch and skips spawning the four tasks, the relay install,
   and the two `SuiConnectorService` source installs
   (`install_network_key_blob_source`,
   `install_class_groups_source`) when off — saves the spawn
   churn even though the tasks self-gate. `EndOfPublishSender`
   stays unconditional since it's core-protocol.

3. Consumer record paths bail early when the flag is false —
   defensive, so a stray new-kind `ConsensusTransaction` from a
   peer can't allocate state:
   `record_validator_mpc_data_announcement`,
   `record_epoch_mpc_data_ready_signal`,
   `record_network_key_dkg_ready_signal`,
   `record_handoff_signature`.

4. Step-14 kickoff gate `off_chain_gate_passes` evaluates to
   `true` (legacy behavior) when the flag is off. Otherwise
   gates on `is_mpc_data_frozen()`. New trait method
   `off_chain_validator_metadata_enabled` on
   `AuthorityPerEpochStoreTrait` so the gate site can reach the
   flag through the trait object. `TestingAuthorityPerEpochStore`
   returns `true` to preserve existing integration-test behavior.

5. Step-9 producer cache hook in
   `DWalletMPCService::new_dwallet_mpc_output` skips when the
   flag is off — leaves the digest tables empty so the syncer
   overlay path naturally falls through to chain-only reads.

6. Syncer overlays
   (`sync_dwallet_network_keys`, `new_committee`) don't need
   explicit flag checks: when the flag is off, ika-node skips
   `install_*_source`, the source handles stay None inside
   `SuiConnectorService`, and the existing source-handle checks
   fall through to chain.
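
Gates 3 and 4 can be sketched as follows. This is a minimal illustration, not the branch's code: the `EpochStore` trait, `FakeStore`, and `record_announcement` are hypothetical stand-ins, and only the two method names and `off_chain_gate_passes` come from the description above.

```rust
/// Hypothetical stand-in for the per-epoch store trait object.
trait EpochStore {
    fn off_chain_validator_metadata_enabled(&self) -> bool;
    fn is_mpc_data_frozen(&self) -> bool;
}

/// Gate 4: passes unconditionally when the flag is off (legacy behavior),
/// otherwise waits for the frozen mpc_data snapshot.
fn off_chain_gate_passes(store: &dyn EpochStore) -> bool {
    !store.off_chain_validator_metadata_enabled() || store.is_mpc_data_frozen()
}

/// Gate 3 (assumed shape): a record path that allocates nothing when off.
fn record_announcement(store: &dyn EpochStore, table: &mut Vec<u64>, digest: u64) -> bool {
    if !store.off_chain_validator_metadata_enabled() {
        return false; // bail early, no state is allocated
    }
    table.push(digest);
    true
}

/// Test double used only for this sketch.
struct FakeStore {
    enabled: bool,
    frozen: bool,
}

impl EpochStore for FakeStore {
    fn off_chain_validator_metadata_enabled(&self) -> bool {
        self.enabled
    }
    fn is_mpc_data_frozen(&self) -> bool {
        self.frozen
    }
}

fn main() {
    let legacy = FakeStore { enabled: false, frozen: false };
    let waiting = FakeStore { enabled: true, frozen: false };
    let mut table = Vec::new();
    println!("legacy gate passes: {}", off_chain_gate_passes(&legacy));
    println!("v5 gate before freeze: {}", off_chain_gate_passes(&waiting));
    println!("legacy record stored: {}", record_announcement(&legacy, &mut table, 42));
}
```

Both helpers fail closed to legacy: with the flag off the kickoff gate never blocks and the record path never grows the table.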

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 313.64s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Brings in PR #1 (cleanup, ika-benchmark removal), PR #2 (bootstrap library),
PR #3 (ika-test-cluster), and the Inkrypto cryptography-private bump
(post-PR-#1707 `ValidatorEncryptionKeysAndProofs` shape:
class-groups + per-curve PVSS HPKE).

Merge resolutions:

* `authority_per_epoch_store.rs`: take origin/dev's tuple key
  `DBMap<(SessionIdentifier, u16), AssignedPresign>` for
  `assigned_presigns_schnorrkel_substrate` (PR #1707 fix) AND keep the
  seven off-chain metadata fields from this branch.

* `pnpm-lock.yaml`: keep upstream `sdk/signature-mpc-wasm/pkg: {}`
  entry; the stale stashed `sdk/ows/...` entries are already removed.

* `protocol-config/lib.rs`: keep `MAX_PROTOCOL_VERSION = 4`. Merge
  `network_encryption_key_version = Some(3)` and
  `reconfiguration_message_version = Some(3)` into the v4 arm so the
  Inkrypto crypto activates at the current MAX. The v5 arm
  (`noa_checkpoints = true`) is commented out as a forward-looking
  reference. Rewrote the version-history comment with one line per
  version. User's manual `internal_presign_sessions = false` at v4
  preserved.

* Off-chain pipeline PVSS extension: the Inkrypto bump expanded
  `Committee::new` with three new PVSS HashMaps (secp256k1, secp256r1,
  ristretto). Extended `OffChainCommitteeClassGroupsSource` to
  assemble all four maps from the same blob bytes via the shape-
  tolerant `decode_validator_encryption_keys`. Validators publishing
  under mainnet-v1.1.8 shape contribute only class-groups; post-PR-#1707
  validators contribute the full bundle — matching chain-fallback
  semantics in `sui_syncer::new_committee`.

* Test-only `Committee::new` call sites in `validator_metadata.rs`:
  pass three empty PVSS maps to satisfy the new 8-arg signature.

* Protocol-config snapshots regenerated for v3/v4 (off-chain flag
  flipped on at v4, crypto-v3 active at v4) plus v5 snap files kept
  on disk as forward-looking reference for the commented v5 arm.

Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Make `request_add_validator_candidate`, `request_add_validator`, and
`stake_ika` `pub` in `ika-swarm-config::sui_client` so the upcoming
`IkaTestCluster` joiner helper can reuse the battle-tested PTB
builders rather than duplicating them. No behavior change — same
functions, broader visibility.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
omersadika and others added 13 commits June 11, 2026 15:04
…d of on rayon

msim's `Handle::spawn` re-resolves the CURRENT simulated node at spawn
time, so a rayon-side completion send whose originating node was torn
down mid-compute (an epoch swap in the simulation) panics at
`NodeHandle::current().unwrap()` and rayon-core aborts the whole
process — the smoke test died ~36 minutes in. Crypto is sequential
under msim anyway (the `parallel` feature is dropped in that profile),
so under `cfg(msim)` compute inline in the calling task and await the
channel send there: the send then dies cleanly WITH the node, dropping
the now-moot result. The non-msim path is unchanged.

Verified: `MSIM_DISABLE_WATCHDOG=1 cargo simtest -p ika-test-cluster
test_swarm_reaches_epoch_2` now PASSES (3h27m, single-threaded msim —
the documented trade-off), where it previously aborted.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…stantiation

The instrumented-localnet decomposition showed the dominant CI stall:
instantiate_agreed_keys_from_voted_data awaited the network-key
instantiation inline — minutes-scale on weak hardware (262-321s per key
on the CI runner pod, where four concurrent instantiations contend),
freezing every session on the validator at each epoch boundary and
backing up completed computations behind the frozen loop (23,880s of
accumulated pickup lag in one run — more than all compute combined).

Split it: the per-round step only SPAWNS due instantiations on the rayon
pool (tracked per key id, no duplicate spawns); completions are polled
once per service ITERATION via poll_pending_network_key_instantiations.
The poll deliberately does not live inside the per-round drain — with no
new consensus rounds arriving, a completed key would otherwise never
install (the integration-test harness shape, and a live-validator hazard
during round stalls).

Integration tests that asserted installation after a single post-vote
iteration now converge via run_service_loops_until_network_key_installed
(bounded by the computation-wait budget).
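
The spawn-then-poll split can be sketched roughly as below. This is an assumption-laden illustration: it uses plain `std::thread` and an `mpsc` channel in place of the rayon pool, and `PendingInstantiations`, `spawn_if_due`, and `poll` are hypothetical names, not the service's real types.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;
use std::time::Duration;

type KeyId = u32;

/// Hypothetical tracker: one in-flight instantiation per key id.
struct PendingInstantiations {
    in_flight: HashSet<KeyId>,
    installed: HashMap<KeyId, u64>,
    tx: Sender<(KeyId, u64)>,
    rx: Receiver<(KeyId, u64)>,
}

impl PendingInstantiations {
    fn new() -> Self {
        let (tx, rx) = channel();
        Self { in_flight: HashSet::new(), installed: HashMap::new(), tx, rx }
    }

    /// Per-round step: only spawns. Returns false for a duplicate request.
    fn spawn_if_due(&mut self, key_id: KeyId) -> bool {
        if self.installed.contains_key(&key_id) || !self.in_flight.insert(key_id) {
            return false;
        }
        let tx = self.tx.clone();
        thread::spawn(move || {
            let value = u64::from(key_id) * 2; // placeholder for the heavy work
            let _ = tx.send((key_id, value));
        });
        true
    }

    /// Per-iteration step: non-blocking drain of finished instantiations.
    fn poll(&mut self) -> usize {
        let mut completed = 0;
        while let Ok((key_id, value)) = self.rx.try_recv() {
            self.in_flight.remove(&key_id);
            self.installed.insert(key_id, value);
            completed += 1;
        }
        completed
    }
}

fn main() {
    let mut pending = PendingInstantiations::new();
    pending.spawn_if_due(7);
    // The service loop keeps iterating even when no consensus round arrives.
    while !pending.installed.contains_key(&7) {
        pending.poll();
        thread::sleep(Duration::from_millis(1));
    }
    println!("installed key 7 -> {}", pending.installed[&7]);
}
```

Because `poll` runs on every loop iteration rather than inside the round drain, a finished key installs even if no further rounds arrive.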

Validated: network_dkg (3), network_dkg_bwd_compat (3), and
create_dwallet_test pass locally in release.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two facts make parallel clusters viable: nextest's process-per-test
isolates the publish flow's process-global set_current_dir (parallel
`cargo test` threads race on cwd and corrupt each other's contract
publishes), and each test cluster only consumes ~2-4 effective CPUs
(serial-chain class-groups crypto), so the suite is latency-bound and
parallel clusters mostly interleave waiting. Captured per-test output
replaces the multi-GB interleaved --nocapture log; --no-fail-fast keeps
one wedged cluster from hiding the rest of the suite.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tiation

The instantiation measures 5.5x slower per thread on the ika-k8s-large
runner than on a workstation while targeted bignum microbenches run at
parity — so either one sub-operation is pathological on that platform or
the build executes different work. Timing each of the nine sub-calls
(per-curve protocol + decryption-share parameters, NOA DKG) localizes
the gap to a concrete operation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…evel

The integration-test tracing subscriber (fmt().try_init()) caps at INFO
and ignores RUST_LOG, so debug-level timings never emit there; nine
info lines per instantiation is acceptable noise.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Lets the same single-test measurement run on ubuntu-latest as a
platform A/B against ika-k8s-large: a hosted x86 VM at workstation pace
implicates the k8s pod environment; one that is equally slow implicates
x86 codegen of the class-groups hot loops. Concurrency keyed by runner
so the A/B runs in parallel with the self-hosted measurement.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The class-groups parameter precomputation profile is uniform across all
eight per-curve sub-calls and scales NEGATIVELY with concurrency on the
runner (4-way concurrent instantiations: ~0.73x aggregate of a single
one) — the signature of allocation-churn serialization, and the ika
binaries set no #[global_allocator], so Linux runs glibc malloc.
Preloading jemalloc isolates the allocator's share of both the 5.5x
per-thread gap and the concurrency collapse.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…announcement subsystems

specs/ holds protocol-level behavioral contracts (actors, messages,
decision rules, invariants, failure modes) written to be readable
without the code open. Seeded with the two subsystems this PR
introduces: the off-chain validator mpc-data pipeline (announcements,
ready signals, freeze, assembly) and the cross-epoch handoff
(attestation, EndOfPublish V2, certificate, joiner bootstrap,
prepare-then-start barrier). CLAUDE.md instructs updating the relevant
spec in the same PR as any behavior change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Separates the build from the measured run so the counters cover the
test, not rustc. user-vs-real CPU time splits "burning more cycles"
from "threads blocked"; perf instructions-vs-cycles (where the pod
permits perf_event_open) splits "executing more work" from "same work
at lower IPC". Mac reference for the single network-DKG test: 5.30T
instructions, 1.32T cycles (IPC 4.0), 373s CPU / 162s wall, 6.2M minor
faults, 3.3GB peak RSS.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ness root cause

RUST_BACKTRACE=1 (set workflow-wide for panic diagnostics) makes
std::backtrace::Backtrace::capture() perform a full DWARF unwind behind
a process-global mutex. class-groups constructs backtrace-carrying
errors EAGERLY on the success path of every group operation
(ok_or(Error::from(..)) at ~15-20 sites per nucomp/nudupl), i.e.
millions of captures per network-key instantiation. Measured on the
same single test: 2020s CPU on the runner vs 373s on a workstation
whose shell leaves RUST_BACKTRACE unset (the capture is then a ~0.17us
stub), with NEGATIVE 4-way scaling from the global lock convoy
(23x sys-time inflation). Reproduced in both directions: exporting
RUST_BACKTRACE=1 locally collapses the same test.

RUST_LIB_BACKTRACE=0 keeps panic backtraces while disabling library
captures. The durable fix belongs upstream in cryptography-private
(ok_or_else / a non-capturing error type on the hot paths) so operator
environments that export RUST_BACKTRACE=1 don't silently pay this.
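
The eager-versus-lazy difference is a property of `Option::ok_or` itself and can be shown without the crypto code. In this sketch `CostlyError` is a hypothetical stand-in for a backtrace-carrying error, and a counter replaces the actual capture:

```rust
use std::cell::Cell;

thread_local! {
    // Counts how many times the "expensive" error was constructed.
    static CAPTURES: Cell<u32> = Cell::new(0);
}

/// Hypothetical error whose constructor stands in for Backtrace::capture().
#[derive(Debug)]
struct CostlyError;

impl CostlyError {
    fn new() -> Self {
        CAPTURES.with(|c| c.set(c.get() + 1));
        CostlyError
    }
}

fn captures() -> u32 {
    CAPTURES.with(|c| c.get())
}

/// Eager: the argument is evaluated before ok_or runs, even for Some(..).
fn eager(value: Option<u64>) -> Result<u64, CostlyError> {
    value.ok_or(CostlyError::new())
}

/// Lazy: the constructor only runs on the None path.
fn lazy(value: Option<u64>) -> Result<u64, CostlyError> {
    value.ok_or_else(CostlyError::new)
}

fn main() {
    for _ in 0..1000 {
        eager(Some(1)).unwrap();
    }
    println!("eager constructions on the success path: {}", captures());
    let before = captures();
    for _ in 0..1000 {
        lazy(Some(1)).unwrap();
    }
    println!("lazy constructions on the success path: {}", captures() - before);
}
```

The eager form pays the construction cost on every call, which is why the cost scales with the number of group operations rather than with the number of failures.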

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The Sui and ika swarms pick "available" ports by probing and bind them
later; with nextest running each test in its own process, two
concurrently-booting clusters can probe the same free port and the
loser panics with EADDRINUSE at node start (seen as
test_validator_removed_at_epoch_2 dying 0.25s into the first 8-way
run). A fixed-port listener serves as a dependency-free cross-process
mutex over the boot window (held until every node listener is bound);
the OS releases it whenever the holder exits, panics included, so a
dead test can't wedge the suite. The long test bodies run unlocked and
fully parallel. Not compiled under msim (simulated per-node ports).
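
A fixed-port listener used as a cross-process mutex might look like the sketch below. The port number, `BootLock`, and both function names are assumptions made for illustration; the test helper's real names and port are not shown in this message.

```rust
use std::io;
use std::net::{Ipv4Addr, TcpListener};
use std::thread;
use std::time::Duration;

/// Hypothetical port reserved for the boot lock.
const BOOT_LOCK_PORT: u16 = 47631;

/// Holding this value holds the lock. Dropping it, or the process dying
/// for any reason, closes the socket and frees the port.
struct BootLock {
    _listener: TcpListener,
}

/// One attempt: Ok(None) means another process currently holds the lock.
fn try_acquire_boot_lock(port: u16) -> io::Result<Option<BootLock>> {
    match TcpListener::bind((Ipv4Addr::LOCALHOST, port)) {
        Ok(listener) => Ok(Some(BootLock { _listener: listener })),
        Err(e) if e.kind() == io::ErrorKind::AddrInUse => Ok(None),
        Err(e) => Err(e),
    }
}

/// Blocks, retrying at a fixed interval, until the lock is acquired.
fn acquire_boot_lock(port: u16, retry: Duration) -> io::Result<BootLock> {
    loop {
        if let Some(lock) = try_acquire_boot_lock(port)? {
            return Ok(lock);
        }
        thread::sleep(retry);
    }
}

fn main() -> io::Result<()> {
    let lock = acquire_boot_lock(BOOT_LOCK_PORT, Duration::from_millis(50))?;
    println!("boot window: probe ports and bind node listeners here");
    drop(lock); // release before the long test body runs
    println!("boot lock released");
    Ok(())
}
```

Only the boot window is serialized; once the guard is dropped the test body runs without any coordination.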

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…entifiers

Internal-presign session identifiers baked in the consensus round at
which each validator happened to instantiate them. Network-key
installation now completes asynchronously (at a wall-clock-dependent
moment per validator), so validators instantiate the same logical
session while processing DIFFERENT rounds. Each validator derived a
private session id, so every session had one participant, quorum never
formed, the presign pool stayed permanently empty, and user sessions
wedged the epoch-advance gate ("epoch 2 was blocked"; global-presign
object never created). Observed directly: 36 distinct session ids for
9 logical presigns across 4 localnet validators.

Drop the round from the identifier preimage — uniqueness comes from
(epoch, sequence number, session type) — and make the sequence-number
assignment walk deterministic orders: sorted network-key ids (was
HashMap order) and sorted curves/algorithms in
supported_curve_to_signature_algorithms (was nested HashMap order,
which only ever aligned because localnets/tests run all validators in
ONE process sharing one lazy_static instance; real multi-process
committees would have diverged on day one).
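
The two determinism fixes can be sketched together. Everything here is illustrative: the integer ids, the byte layout of the preimage, and both function names are assumptions, not the branch's actual encoding.

```rust
use std::collections::HashMap;

/// Assumed preimage layout: (epoch, sequence number, session type).
/// The consensus round is deliberately not an input.
fn session_preimage(epoch: u64, sequence_number: u64, session_type: u8) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(17);
    bytes.extend_from_slice(&epoch.to_le_bytes());
    bytes.extend_from_slice(&sequence_number.to_le_bytes());
    bytes.push(session_type);
    bytes
}

/// Walks key id -> curve -> algorithm in sorted order so every process
/// assigns the same sequence numbers regardless of HashMap iteration order.
/// Returns (sequence number, key id, curve, algorithm) tuples.
fn assign_sequence_numbers(
    keys: &HashMap<u32, HashMap<u32, Vec<u32>>>,
    first_sequence_number: u64,
) -> Vec<(u64, u32, u32, u32)> {
    let mut out = Vec::new();
    let mut next = first_sequence_number;
    let mut key_ids: Vec<&u32> = keys.keys().collect();
    key_ids.sort();
    for key_id in key_ids {
        let curves = &keys[key_id];
        let mut curve_ids: Vec<&u32> = curves.keys().collect();
        curve_ids.sort();
        for curve in curve_ids {
            let mut algorithms = curves[curve].clone();
            algorithms.sort();
            for algorithm in algorithms {
                out.push((next, *key_id, *curve, algorithm));
                next += 1;
            }
        }
    }
    out
}

fn main() {
    let mut keys = HashMap::new();
    keys.insert(2u32, HashMap::from([(1u32, vec![20u32, 10]), (0, vec![5])]));
    keys.insert(1, HashMap::from([(3, vec![7])]));
    for (seq, key_id, curve, algorithm) in assign_sequence_numbers(&keys, 0) {
        let preimage = session_preimage(2, seq, 1);
        println!("seq {seq}: key {key_id} curve {curve} alg {algorithm} preimage {preimage:?}");
    }
}
```

Two validators that build the same maps in different insertion orders get identical assignments, which is the property the unsorted HashMap walk could not guarantee across processes.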

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…a-v2

# Conflicts:
#	crates/ika-core/src/dwallet_mpc/mpc_manager.rs
@omersadika omersadika force-pushed the feat/off-chain-metadata-v2 branch from 84f47bd to 6634936 Compare June 11, 2026 19:23
omersadika and others added 5 commits June 12, 2026 00:37
…per consensus round

adopt_cert_verified_keys and the instantiation spawn lived inside the
per-consensus-round drain, so they only ran when fresh rounds arrived.
That deadlocks the key-arrives-after-request bootstrap: a validator that
receives the network key AFTER a session request can never adopt it —
no validator can emit a consensus round WITHOUT the key, and without a
round there is no adoption. Production masks it because rounds flow
continuously; the missing_network_key integration test reproduced it
deterministically (zero adoptions across the whole run, every party's
request parked forever).

Move both to once-per-iteration in run_service_loop_iteration: their
inputs (the overlay watch and the persisted handoff cert) do not depend
on round content, the adoption pass early-returns in O(1) when neither
input changed, and the round-free internal-presign session identifiers
removed the only determinism coupling to adoption position.

Also switch the test's tracing init to telemetry-subscribers' env-aware
config — the plain fmt() subscriber caps at INFO and silently ignores
RUST_LOG, which had been hiding this exact failure from debug tracing.

network_key_received_after_start_event now passes (was a deterministic
wedge that survived 10x iteration budgets); regression set green:
network_dkg (3), create_dwallet, internal_presign (3).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
tikv-jemallocator as the global allocator (jemalloc default feature) in
all five binaries: ika-node, ika-validator, ika-fullnode, ika-notifier,
and the ika CLI (which hosts whole localnet swarms). Better
fragmentation behavior than glibc malloc for long-running RocksDB-heavy
processes, and arch-independent.

The ika-node Dockerfile previously attempted jemalloc via LD_PRELOAD,
but the assignment lived inside a RUN layer and never persisted —
production containers were silently running glibc malloc. The broken
block and the libjemalloc-dev package are removed; the allocator now
ships inside the binary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
docs/off-chain-metadata-v2-review.md and PR-1721-action-plan.md were
working documents of the branch review (explicitly in-progress, verdicts
pinned to intermediate commit hashes); their durable content graduated
into specs/. The content survives in git history and the PR discussion.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… to measured values, propagate the keepers

Removals (experiment debris from the solved slowness investigation):
the allocator/jemalloc A/B input and LD_PRELOAD harness, the
runner/ubuntu-latest A/B input with its concurrency-group keying, the
perf-stat wrapper (perf_event_open is blocked on the pods) and
linux-tools-generic, the target-cpu=native build flags and rustflags
inputs (experimentally disproven), and every comment attributing
slowness to weak hardware / codegen / memory bandwidth.

Retunes to post-fix measured values: cluster test_threads default 8→4
(8-way OOM-killed the 96Gi runner pod; 13/13 in ~35min at 4), cluster
timeout 420→150, TS suite timeout 360→180 (full suite ~60min,
readiness ~10min), TS readiness cap 40→20 minutes.

Propagation of the keepers: RUST_LIB_BACKTRACE=0 beside RUST_BACKTRACE=1
in publish-typescript-sdk.yml; the IPv4+retry apt pattern to ci.yaml and
simtest.yaml; run_attempt-suffixed log artifacts so reruns don't collide.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…omments, vestiges

- Rename instantiate_agreed_keys_from_voted_data →
  instantiate_adopted_network_keys: the NetworkKeyData consensus vote it
  was named for was removed this PR; keys are adopted from the local
  overlay gated by the prior epoch's handoff cert. Test comments
  narrating the deleted vote/tally/quorum flow rewritten to the actual
  adopt → spawn → poll flow, and the stale "Consensus-voted network key"
  log message fixed.
- Drop the always-empty accumulated_new_key_ids return from
  process_consensus_rounds_from_storage (instantiation completions
  surface via the per-iteration poll); stop_on_epoch_end! simplified.
- Comments measured during the eager-backtrace incident lose their
  platform-tier quantifications ("minutes-scale on weak hardware",
  "platform-specific slowdown", "CI runners are slower"); the
  load-bearing rationales stay.
- missing_network_key's tracing init switched to
  telemetry_subscribers::init_for_testing() — the TelemetryConfig::init()
  variant panics when another in-process test already set the global
  subscriber (the single failure in the 45-test CI run).
- Self-contained msim comments at the remaining capture/re-enter rayon
  sites (the orchestrator.rs cross-references were orphaned by its
  inline-under-msim change); CLAUDE.md's simtest section updated to
  prefer inline-under-msim for new code, and gains a section on running
  the heavy suites via the dispatchable CI workflows instead of locally.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
omersadika and others added 10 commits June 12, 2026 01:38
Process artifact of the branch review, same treatment as the review
walkthrough doc; the durable content lives in specs/.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ain subsystems

From a 39-finding verified audit (3 angles: info-spam, lifecycle log
coverage, metrics coverage; every finding adversarially checked against
the code).

Log spam eliminated (production runs at info): the permanent 5s
overlay-incomplete warn on healthy fullnodes/notifiers (~17k lines/day)
is debug with a validators-only 60th-tick escalation; the pre-freeze
assembly-retry warn AND its per-tick error! double-log are debug with a
30th-tick escalation; per-tick "assembled/sent committee" infos dedup on
content change; barrier/cert-read/re-submit/byzantine-fetch warns are
throttled or once-per-state; the ready-signal deadline warn moved behind
the re-emit gate.

Lifecycle coverage added at info/warn: handoff CERT FORMED (was fully
silent), cert-digest mismatch skips in adoption (the security gate was a
bare `continue`), ready-signal and EndOfPublish quorum anchors, joiner
announcement relay + acceptance, local attestation install, NOA/presign
starvation (throttled), buffered-signature drains, boot-lock contention.

Metrics (26 new, registered on the existing registries, bounded labels,
re-seeded at epoch-store open so restarts don't false-alarm): mpc_data
freeze epoch + excluded count, ready signals/stake/validated peers,
announcements received, handoff cert epoch + signatures
collected/stake/buffered/rejected, internal presign pool size + global
presign queue depth + served counter, network-key instantiation
in-flight/failures/per-sub-call duration histogram (DKG and
reconfiguration paths), blob store size/evictions vs the 512MiB cap,
P2P blob fetch outcomes, joiner bootstrap outcomes; the barrier duration
histogram re-bucketed to its real 1s-30min range.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…_network_key

telemetry_subscribers::init_for_testing() still panics when a sibling
test's fmt().try_init() installed the global subscriber first (its Lazy
guard only protects against double-init of itself) — the single failure
in two consecutive 44/45 CI runs at --test-threads=4. Use
fmt().with_env_filter(RUST_LOG or info).try_init(): honors RUST_LOG when
this test wins the init race (the debugging need), silently defers when
it loses (the parallel-suite need). Validated under the exact contention
shape: 4 in-process tests at 2 threads, all green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e lock target

The epoch-close wedge (cascading TS-suite timeouts when sessions land
astride lock_last_active_session_sequence_number) was an on-chain
counter OVERSHOOT, reproduced and confirmed against chain state:

- Move's all_current_epoch_sessions_completed is a strict equality
  (completed_sessions_count == locked target) and complete_user_session
  has no lock check, so completing any user session beyond the frozen
  target wedges the epoch permanently — the counter never decreases.
- The global-presign serving path popped presigns from the internal
  pool and completed them on-chain with no lock check (unlike MPC user
  sessions, which gate computation on the synced target). Reproduction:
  target frozen at 0, 97 sessions completed anyway, end-of-publish
  predicate false forever, epoch unhealably stuck.
- Admission rejections had the same hole: a Rejected response that
  reaches quorum counts as completed on-chain, so a malformed user
  request rejected after the lock would overshoot identically.

Gate both at consensus SUBMISSION, not serving/checkpoint build:
checkpoint contents must be a deterministic function of consensus, and
the lock view is wall-clock fullnode state — gating at build would fork
checkpoints. Gating what each validator votes for is sound: the chain
target is monotone within an epoch and frozen by the lock, so quorum
agreement implies an honest validator observed the target covering the
request — an agreed request can never overshoot.

- get_unsent_presign_requests holds back votes for requests beyond the
  locally-synced target; they retry every round as the target advances
  and re-enter next epoch via the uncompleted-events re-pull, exactly
  like lock-gated MPC user sessions.
- Admission rejections defer through pending_rejected_sessions with the
  same gate. Computation-failure rejections need no gate: computation
  only runs once the target covers the session.
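
The interaction between the strict-equality predicate and the submission gate can be modeled in a few lines. This is a Rust model of the behavior described above, not the Move code: `EpochSessions` and `split_by_target` are hypothetical, and treating "beyond the target" as `sequence number > target` is an assumption.

```rust
/// Rust model of the on-chain counters (the real logic lives in Move).
struct EpochSessions {
    completed_sessions_count: u64,
    locked_target: u64,
}

impl EpochSessions {
    /// Strict equality: an overshoot can never become true again.
    fn all_current_epoch_sessions_completed(&self) -> bool {
        self.completed_sessions_count == self.locked_target
    }

    /// No lock check, mirroring the described complete_user_session.
    fn complete_user_session(&mut self) {
        self.completed_sessions_count += 1;
    }
}

/// Submission gate: returns (vote now, hold back) by sequence number.
fn split_by_target(sequence_numbers: &[u64], synced_target: u64) -> (Vec<u64>, Vec<u64>) {
    sequence_numbers
        .iter()
        .copied()
        .partition(|&n| n <= synced_target)
}

fn main() {
    let mut chain = EpochSessions { completed_sessions_count: 0, locked_target: 2 };
    let (vote_now, held_back) = split_by_target(&[1, 2, 3, 4], chain.locked_target);
    for _ in &vote_now {
        chain.complete_user_session();
    }
    println!("held back until the target advances: {held_back:?}");
    println!("epoch can close: {}", chain.all_current_epoch_sessions_completed());

    // Without the gate, one extra completion wedges the epoch for good.
    chain.complete_user_session();
    println!("after overshoot: {}", chain.all_current_epoch_sessions_completed());
}
```

Held-back requests are not dropped; in the described design they retry each round and re-enter next epoch if the target never covers them.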

Regression tests: a beyond-target global presign (with a stocked pool
that would otherwise serve it) and a beyond-target admission rejection
must not produce checkpoint messages until the target covers them.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ale result

Second epoch-close wedge found by the same reproduction rig (undershoot
direction this time): handle_computation_results_and_submit_to_consensus
iterated the batch of completed computation results with `return` (not
`continue`) in its missing-session and non-active-session guards. One
result for a session that went non-active while its computation was in
flight — routine under load, e.g. it completed via the peers' output
quorum — silently dropped EVERY other session's round messages and
outputs in the same batch on that validator.

Reproduced live: at an epoch boundary burst the guard fired on two
validators within the same window, dropping their round-two messages for
all nine internal presign sessions; with two of four parties' messages
gone the threshold became unreachable forever, the internal pool never
refilled, the locked-set global presigns could not be served, and the
epoch wedged with completed_sessions_count below the locked target.
Diagnostic signature: total MPC silence while consensus and presign
votes still flow, orchestrator drained (started == completed), identical
stuck session sets on every validator, and the "received a computation
update for a non-active session" warn at the stall onset.

Pre-existing since the guard was introduced (2025-10-29, #1588); the
batch-abandoning returns become per-item skips.
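
The `return` versus `continue` difference reduces to the toy below. It assumes results are submitted one at a time as the loop walks the batch; the function names and the use of bare session ids are illustrative only.

```rust
use std::collections::HashSet;

/// Old shape: the first stale result abandons the remainder of the batch.
fn submit_abandoning(batch: &[u32], active: &HashSet<u32>) -> Vec<u32> {
    let mut submitted = Vec::new();
    for session in batch {
        if !active.contains(session) {
            return submitted; // everything after this entry is lost
        }
        submitted.push(*session);
    }
    submitted
}

/// Fixed shape: a stale result is skipped and the loop keeps going.
fn submit_skipping(batch: &[u32], active: &HashSet<u32>) -> Vec<u32> {
    let mut submitted = Vec::new();
    for session in batch {
        if !active.contains(session) {
            continue; // per-item skip
        }
        submitted.push(*session);
    }
    submitted
}

fn main() {
    let active = HashSet::from([1, 2, 3]);
    let batch = [1, 99, 2, 3]; // 99 went non-active while computing
    println!("abandoning: {:?}", submit_abandoning(&batch, &active)); // [1]
    println!("skipping:   {:?}", submit_skipping(&batch, &active)); // [1, 2, 3]
}
```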

Also set the harness lock target in missing_network_key: the harness
never syncs it from a chain, and that test's flow completes via a
rejection response, which is now (correctly) held back until the target
covers the session.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- computation_results_batch_survives_stale_entries (integration): feeds
  handle_computation_results_and_submit_to_consensus a batch mixing six
  results for live internal presign sessions with six results for
  missing sessions, and requires every live session's round message to
  reach consensus. Under the old batch-abandoning `return`, HashMap
  iteration order drops at least one live message unless all six real
  entries happen to come first (probability < 0.2%). Widens the handler
  to pub(crate) for the direct-call test.

- test_global_presigns_complete_across_epoch_switches (cluster):
  streams global presigns across two epoch boundaries — the traffic
  shape that reproduced both wedges (overshoot via the previously
  ungated pool-serving path, undershoot via batch-dropped round
  messages) — then requires epochs to keep advancing and every
  submitted user session to drain to completed on-chain.
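
The "< 0.2%" bound in the first test's description can be checked with a short calculation, assuming every interleaving of the six live and six missing entries is equally likely (an idealization of HashMap iteration order). All six live entries come first in exactly one of the C(12, 6) arrangements:

```rust
/// Binomial coefficient C(n, k) using exact integer arithmetic.
fn binomial(n: u64, k: u64) -> u64 {
    let k = k.min(n - k);
    let mut result = 1u64;
    for i in 0..k {
        // The running product is always divisible by (i + 1).
        result = result * (n - i) / (i + 1);
    }
    result
}

fn main() {
    let arrangements = binomial(12, 6);
    let probability = 1.0 / arrangements as f64;
    println!("arrangements: {arrangements}"); // 924
    println!("chance the old code passes by luck: {:.4}%", probability * 100.0);
}
```

That gives 1/924, roughly 0.11%, which is consistent with the stated bound.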

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…BACKTRACE guard

cryptography-private #575 (lazy error construction on class-groups
arithmetic hot paths) merged as the only commit between the old pin and
the new one, so the bump carries exactly that fix.

With eager Backtrace::capture() gone upstream, the RUST_LIB_BACKTRACE=0
workaround in the workflows is no longer needed; library-error
backtraces return to CI logs. Validated by A/B on the crypto-heavy
network-DKG instantiation path under the exact CI env (RUST_BACKTRACE=1):
161.6s without the guard vs 162.3s with it — identical within noise, on
the path that previously ran 5x slower.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The earlier sweep (cdd1757) raised retryUntil's default budget to ~10
minutes precisely because 30-attempt (~60s) caps surfaced as spurious
"Condition not met after 30 attempts" failures, but 16 call sites across
helpers.ts, make-public-share-and-sign, imported-key-make-public-share-
and-sign, and all-combinations-future-sign still passed explicit 30/2000
or 30/1000 overrides. One of them just failed a TS suite run (the
ECDSASecp256r1 make-public case at 60s) while every surrounding test
passed — a session that crosses an epoch boundary legitimately needs
minutes, more so now that beyond-the-lock sessions correctly wait for
the next epoch. Use the helper defaults everywhere.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Measured on a full Test Cluster run: 57% of the log (2,415 of 4,211
lines) was the cargo Downloaded/Compiling stream inside the test step.
Build now happens in its own step — collapsed in the UI when green — so
the test step carries only nextest/test progress and failure replays,
plus --cargo-quiet on the nextest run for the residual cargo lines.
Same split for integration-tests-ci (where it also lets `time` keep
covering only test execution).

Failure replays deliberately stay inline: when the runner pod dies
(OOM/eviction) the artifact upload never happens and the live log is
the only surviving evidence.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The lock-gating fixes introduced a decision rule (gate every
user-session consensus submission on the synced lock target, never
checkpoint content) and rest on a cross-epoch invariant (the
strict-equality close predicate makes overshoot permanently
unrecoverable). Per the specs maintenance rule, write them down: the
frozen target's mechanics, why the equality cuts both ways, the
quorum-safety argument for submission gating, the per-completion-path
rule table, and the batch-processing rule for computation results.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>