
fix(dwallet-mpc): network-key adoption gates: no committee-unapproved parameter set may go live #1735

Merged: omersadika merged 1 commit into dev from fix/stale-network-key-adoption on Jun 12, 2026


@omersadika


Summary

Fixes the false-malicious conviction failure mode surfaced during the epoch-close wedge investigation (see PR #1721's "known issue" note): a validator that installs a different network-key parameter set than its peers honestly computes byte-divergent MPC outputs, and the output-quorum byte-equality tally convicts it as malicious. This was observed live, including the node logging "node recognized itself as malicious". One conviction silently converts a 4-validator committee into a zero-redundancy 3-of-4; the network keeps running degraded with no error-level signal until any further loss freezes MPC.
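
To make the conviction mechanism concrete, here is a minimal sketch of a byte-equality tally. It is not the code from this PR: `ValidatorId`, `tally_dissenters`, and the shape of the report list are hypothetical, and the sketch assumes `quorum` is a strict majority so that at most one output can reach it.

```rust
use std::collections::HashMap;

/// Hypothetical validator identifier; the real type is not shown in this PR.
type ValidatorId = u32;

/// Tally reported outputs by exact byte equality. If some output reaches
/// `quorum` votes, return the validators whose bytes differ from it.
/// Returns `None` while no output has reached quorum.
/// Assumes `quorum` is a strict majority of the committee.
fn tally_dissenters(
    reports: &[(ValidatorId, Vec<u8>)],
    quorum: usize,
) -> Option<Vec<ValidatorId>> {
    // Count votes per distinct byte string.
    let mut votes: HashMap<&[u8], usize> = HashMap::new();
    for (_, output) in reports {
        *votes.entry(output.as_slice()).or_insert(0) += 1;
    }
    // Find the output that reached quorum, if any.
    let (winner, _) = votes.into_iter().find(|(_, count)| *count >= quorum)?;
    // Everyone whose bytes differ from the quorum output is flagged.
    let mut dissenters: Vec<ValidatorId> = reports
        .iter()
        .filter(|(_, output)| output.as_slice() != winner)
        .map(|(id, _)| *id)
        .collect();
    dissenters.sort_unstable();
    Some(dissenters)
}
```

In a 4-validator committee with a quorum of 3, a single validator that honestly derived its output from a different parameter set is the only dissenter, so a tally like this cannot tell it apart from a malicious one.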

Root context: at every epoch-N entry, early-starting validators transiently see stale/incomplete overlay data (the epoch-entry stale-mpc_data race). Two adoption holes let that transient state become an installed parameter set:

  1. Empty-reconfiguration-output adoption — an overlay entry whose reconfiguration output is transiently empty slipped through the initial-DKG adoption branch even when the prior epoch's handoff cert pins a reconfiguration digest for the key, instantiating DKG-derived parameters the committee never agreed to run this epoch. Now skipped (warn once per cert digest, deduped) and retried until the cert-pinned bytes resolve locally — the overlay re-merges every sync tick and the prepare-then-start barrier installs the pinned blob by digest.
  2. Stale-epoch pre-spawn rejection — adopted data whose current_epoch metadata mismatches the manager's epoch was only rejected ~10s after the parameter derivation burnt on the rayon pool, and the doomed in-flight instantiation blocked the same key's correct data behind it, widening the entry key gap during which sessions park. Now rejected before spawning.
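
The two gates can be sketched as a single pre-spawn decision. This is an illustration under stated assumptions, not the PR's implementation: `OverlayEntry`, `Adoption`, `decide_adoption`, and `toy_digest` are hypothetical names, the digest is a toy byte sum, the order of the two checks is a guess, and treating a digest mismatch the same way as an empty output is inferred from the test description rather than confirmed.

```rust
/// Hypothetical stand-in for a cert-pinned digest type.
type Digest = u64;

/// Hypothetical view of one overlay entry for a network key.
struct OverlayEntry {
    current_epoch: u64,
    reconfiguration_output: Vec<u8>,
}

#[derive(Debug, PartialEq)]
enum Adoption {
    /// Safe to spawn parameter derivation from this entry.
    Adopt,
    /// Entry metadata names a different epoch: reject before spawning.
    RejectStaleEpoch,
    /// The cert pins a digest the local bytes do not match yet:
    /// skip now and retry on a later sync tick.
    SkipUntilPinnedBytesResolve,
}

/// Toy digest (sum of bytes) used only for this illustration.
fn toy_digest(bytes: &[u8]) -> Digest {
    bytes.iter().map(|b| *b as u64).sum()
}

/// Decide whether an overlay entry may be installed.
/// `pinned` is the reconfiguration digest pinned by the prior epoch's
/// handoff cert, or `None` when the cert pins nothing for this key.
fn decide_adoption(
    entry: &OverlayEntry,
    manager_epoch: u64,
    pinned: Option<Digest>,
    digest_of: impl Fn(&[u8]) -> Digest,
) -> Adoption {
    // Gate 2: cheap epoch check before any expensive derivation is spawned.
    if entry.current_epoch != manager_epoch {
        return Adoption::RejectStaleEpoch;
    }
    // Gate 1: with a pinned digest, an empty or non-matching output
    // must not fall through to the initial-DKG branch.
    match pinned {
        Some(expected)
            if entry.reconfiguration_output.is_empty()
                || digest_of(&entry.reconfiguration_output) != expected =>
        {
            Adoption::SkipUntilPinnedBytesResolve
        }
        _ => Adoption::Adopt,
    }
}
```

Both checks run before any derivation work starts, so a doomed instantiation never occupies the slot that the key's correct data needs.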

Plus an observability fix that blinded two post-mortems: a session whose protocol-cryptographic-data generation fails was silently skipped every 20ms service tick (.ok()?). The skip semantics stay; it now logs once per session (deduped).
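
A minimal sketch of the skip-but-log-once pattern, assuming a set of already-reported session IDs. `SessionId`, `LogOnce`, and `tick` are hypothetical, and log lines are collected in a `Vec<String>` here only to keep the example self-contained.

```rust
use std::collections::HashSet;

/// Hypothetical session identifier.
type SessionId = u64;

/// Remembers which sessions were already reported, so a failure that
/// recurs on every service tick is logged only once.
#[derive(Default)]
struct LogOnce {
    seen: HashSet<SessionId>,
}

impl LogOnce {
    /// Returns true the first time `id` is seen and false afterwards.
    fn first_time(&mut self, id: SessionId) -> bool {
        self.seen.insert(id)
    }
}

/// One service tick: sessions whose data generation fails are still
/// skipped, but each failing session produces exactly one log line.
/// Returns the sessions that are ready to spawn.
fn tick(
    sessions: &[SessionId],
    generate: impl Fn(SessionId) -> Result<Vec<u8>, String>,
    dedupe: &mut LogOnce,
    log: &mut Vec<String>,
) -> Vec<(SessionId, Vec<u8>)> {
    sessions
        .iter()
        .filter_map(|&id| match generate(id) {
            Ok(data) => Some((id, data)),
            Err(err) => {
                if dedupe.first_time(id) {
                    log.push(format!("session {id}: data generation failed: {err}"));
                }
                None // skip semantics unchanged
            }
        })
        .collect()
}
```

Compared with a bare `.ok()?`, the error value is inspected before it is dropped, which is what makes the failure visible in a post-mortem.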

Tests

  • New network_key_adoption.rs: both gates with positive controls — the cert-pinned empty/mismatching/matching overlay progression (self-validating: if the cert weren't consumed, the mismatch assertion would fail via the blind-adopt path), and stale-vs-current epoch spawn discrimination.
  • Regression battery green: beyond_lock_target 2/2, computation_results_batch 1/1, missing_network_key 1/1; clippy clean on touched code.
  • Spec updated: specs/handoff.md certificate-consumption section documents both guards and why divergence ends in conviction.

What this does NOT fix (tracked separately)

The residual epoch-entry race (validators still fetch the prior epoch's overlay at entry) and the CI-only "parked sessions never compute after the key installs" link — see the tracking issue.

🤖 Generated with Claude Code

…ttee didn't agree on

A validator that installs different network-key parameters than its
peers honestly computes byte-divergent MPC outputs and is convicted
malicious by the output-quorum byte-equality tally — observed live
("node recognized itself as malicious"), silently dropping a
4-validator committee to zero-redundancy 3-of-4 until a later loss
froze MPC entirely. Root: the epoch-entry stale-mpc_data race pairs
with two adoption holes:

- An overlay entry whose reconfiguration output is transiently EMPTY
  slipped through the initial-DKG adoption branch while the prior
  epoch's handoff cert pins a reconfiguration digest for the key,
  instantiating DKG-derived parameters the committee never agreed to
  run this epoch. Skip and retry until the cert-pinned bytes resolve.
- Adopted data whose current_epoch metadata mismatches the manager's
  epoch was only rejected ~10s AFTER the parameter derivation burnt on
  the rayon pool — and the doomed in-flight instantiation blocked the
  same key's correct data, widening the entry key gap during which
  sessions park. Reject before spawning.

Also un-silence the computation-spawn loop: a session whose protocol-
cryptographic-data generation fails was skipped every 20ms tick with no
log at any level, which blinded two wedge post-mortems. The skip stays
(correct); it now logs once per session.

Regression tests for both gates (cert-pinned-empty-overlay adoption and
stale-epoch pre-spawn rejection) with positive controls; spec'd in
handoff.md's certificate-consumption section.

Salvaged from an interrupted background agent's worktree; reviewed,
completed (module registration, spec), and validated: new tests 2/2,
beyond_lock_target 2/2, computation_results_batch 1/1,
missing_network_key 1/1, clippy clean on touched code.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@omersadika


Tracking issue for the residual race + the CI-only parked-sessions link: #1736

@omersadika omersadika merged commit 6e983a0 into dev Jun 12, 2026
12 checks passed
@omersadika omersadika deleted the fix/stale-network-key-adoption branch June 12, 2026 21:20
omersadika added a commit that referenced this pull request Jun 12, 2026
#4)

The v4 pipeline's three designed halt/block modes are safety-first by
design, so a blocked validator looks healthy from outside. The metrics
all exist (verified against the registries); what was missing is the
alerting contract: rules for ika_handoff_prepare_waiting (barrier
blocked > 2x epoch) and off_chain_assembly_wedged (EverythingExcluded —
the one mode with no self-heal), the log-based signal for the joiner
bootstrap fail-closed halt, the operator action for each, and the
secondary dashboard signals (pruner advancement, presign-queue drain,
rejected handoff signatures).

Also merges origin/dev (PRs #1734/#1735): git's rename detection carried
the new adoption-guards section of specs/handoff.md into the relocated
dev-docs/specs/handoff.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>