test(ika-upgrade-test): out-of-process cross-binary upgrade harness #1727

Open
ycscaly wants to merge 28 commits into feat/off-chain-metadata-v2 from feat/ika-upgrade-test

ycscaly commented Jun 5, 2026

Contributor

Summary

New additive crate crates/ika-upgrade-test — an out-of-process test harness that spawns real, separately-compiled ika-validator binaries against an external sui localnet, swaps binaries on validators across epochs, and drives dWallet workloads. Unlike ika-test-cluster (in-process IkaNode, one binary), it can host genuinely different binaries in one committee. No changes to ika-node / ika-swarm.

Implements docs/cross-binary-upgrade-testing*.md; see docs/cross-binary-upgrade-testing-results.md for the full write-up.

Tests (all opt-in via env flags; need real binaries + a workspace-tag sui)

| Test | Status | Proves |
|---|---|---|
| tests/smoke.rs (go/no-go) | ✅ ~396s | 4 out-of-process validators + notifier, network DKG, reach epoch 2 |
| tests/cross_binary.rs | ✅ ~722s | Boot 4 on a v3-only binary, swap all to dev, capability vote advances v3 → v4 |
| tests/workload.rs | ✅ ~415s | Full user DKG → Presign → Sign completes on-chain |

The cross-binary run exercises, out of process: protocol-vote arithmetic, mid-epoch reconfiguration MPC across the swap, mixed-committee wire compat, and on-disk compat (restart on a new binary against the old RocksDB).

Crate layout

  • sui.rs: spawn external sui start --with-faucet --force-regenesis (waits for RPC and faucet).
  • cluster.rs: chain bootstrap via init_ika_on_sui + ValidatorConfigBuilder + a notifier fullnode; NodeConfig → YAML → child; on-chain wait_for_epoch / protocol-version via IkaClient.
  • process.rs: ValidatorProcess (spawn / stop / swap_binary), health via the admin RPC.
  • binary.rs: BinarySpec (path / tag / sha / branch) + a sha-keyed git worktree build cache honoring each commit's pinned toolchain.
  • scenario.rs: imperative DSL (start / wait_for_epoch / stop_and_swap / expect_protocol_version).
  • workload.rs: drives a user dWallet lifecycle by orchestrating the canonical ika CLI; confirms completion on-chain.
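
The imperative DSL can be pictured as an ordered list of steps. This is a minimal sketch, not the crate's actual API: `Step`, `Scenario`, `v3_to_v4_scenario`, and all signatures are hypothetical, and only the four step names come from the layout above. The epoch numbers are illustrative.

```rust
/// Hypothetical step type; the variants mirror the step names listed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Step {
    Start { validators: usize },
    WaitForEpoch(u64),
    StopAndSwap { validator: usize, binary: String },
    ExpectProtocolVersion(u64),
}

/// Hypothetical builder that only records steps; a real runner would execute
/// each one against live child processes.
#[derive(Debug, Default)]
pub struct Scenario {
    steps: Vec<Step>,
}

impl Scenario {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn start(mut self, validators: usize) -> Self {
        self.steps.push(Step::Start { validators });
        self
    }

    pub fn wait_for_epoch(mut self, epoch: u64) -> Self {
        self.steps.push(Step::WaitForEpoch(epoch));
        self
    }

    pub fn stop_and_swap(mut self, validator: usize, binary: &str) -> Self {
        self.steps.push(Step::StopAndSwap {
            validator,
            binary: binary.to_string(),
        });
        self
    }

    pub fn expect_protocol_version(mut self, version: u64) -> Self {
        self.steps.push(Step::ExpectProtocolVersion(version));
        self
    }

    pub fn steps(&self) -> &[Step] {
        &self.steps
    }
}

/// Shape of the cross-binary run: boot four validators, swap every one to the
/// new binary, then expect the vote to land on v4 at the next epoch.
pub fn v3_to_v4_scenario(new_binary: &str) -> Scenario {
    let mut scenario = Scenario::new().start(4).wait_for_epoch(2);
    for validator in 0..4 {
        scenario = scenario.stop_and_swap(validator, new_binary);
    }
    scenario.wait_for_epoch(3).expect_protocol_version(4)
}
```

Recording steps as data keeps the scenario readable in the test and leaves process handling to the runner.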

Key finding: mainnet-v1.1.8 → dev is not a naive rolling swap

Running the harness with the real mainnet-v1.1.8 ika-node fails at boot for the expected reason, not a harness bug: v1.1.8 links class_groups from dwallet-labs/inkrypto, dev from dwallet-labs/cryptography-private, and v4 publishes the combined ValidatorEncryptionKeysAndProofs where v1.1.8 expects the bare ClassGroupsEncryptionKeyAndProof — the exact incompatibility flagged in validator_initialization_config.rs (⚠️ MAINNET WIRE-FORMAT INCOMPATIBILITY ⚠️). The v1.1.8 node loads its config, connects to Sui, reads the contracts, then fails decoding the on-chain validator record (class groups public key … remaining input). So the harness genuinely runs a different binary and fails on the documented wire-format divergence.

To demonstrate a successful heterogeneous upgrade, the green cross_binary run uses an OLD binary that is a dev build pinned to MAX_PROTOCOL_VERSION=3 (same crypto, differs only in advertised protocol version) — disclosed in the test and results doc.

Notes

  • Builds use --no-default-features to drop enforce-minimum-cpu (panics on < 16-core hosts).
  • The workload uses a dedicated funded user (faucet SUI + IKA transfer) to avoid contention with the notifier; register-encryption-key precedes create; v4 genesis for internal_presign_sessions; long epoch to clear the mid-epoch reconfiguration; sign confirmed via the coordinator's on-chain completed-session count.

🤖 Generated with Claude Code

ycscaly commented Jun 7, 2026

Contributor Author

Update: hardening + findings from a real mainnet-v1.1.8 → dev run

Pushed 8672c96, which hardens the harness based on driving an actual v1.1.8 → dev rolling swap (not the same-crypto MAX=3 stand-in):

  • Graceful SIGTERM swap. stop() previously hard-SIGKILLed the node. That interrupts it mid-consensus-round and leaves dwallet-MPC replay state partial on disk, crashing the next binary on replay (consensus round mismatch ...). A swap is a planned restart — it now sends SIGTERM (handled by node_runner's wait_termination), waits for clean exit, SIGKILL only as fallback.
  • Wait for the real epoch boundary. The genesis network DKG runs during epoch 1, so the scenario waits for epoch 2 before swapping (the epoch counter advancing to N is the completion signal for N-1). Documented in cluster.rs.
  • set_buffer_stake(0) step. With n=4 the default 50% buffer rounds up to requiring all four votes; a rolling swap can leave one validator's fresh capability uncommitted at the boundary tally. Dropping the buffer to a quorum is the realistic behavior on larger committees.
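
The graceful-stop behavior in the first bullet can be sketched as follows. This is an illustration under stated assumptions rather than the harness code: `stop_gracefully` and `StopOutcome` are hypothetical names, and SIGTERM is delivered through the Unix `kill` utility to keep the sketch dependency-free (the real `stop()` may use a signal API instead).

```rust
use std::io;
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// How the child ended up stopping (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum StopOutcome {
    /// The process exited on its own after SIGTERM.
    Graceful,
    /// The grace period elapsed and SIGKILL was needed.
    Killed,
}

/// Send SIGTERM, wait up to `grace` for a clean exit, then fall back to SIGKILL.
pub fn stop_gracefully(child: &mut Child, grace: Duration) -> io::Result<StopOutcome> {
    // Assumption: a Unix host with the `kill` utility on PATH.
    // The utility's exit status is ignored; the wait loop below decides the outcome.
    let _ = Command::new("kill")
        .arg("-TERM")
        .arg(child.id().to_string())
        .status()?;

    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        // try_wait reaps the child without blocking once it has exited.
        if child.try_wait()?.is_some() {
            return Ok(StopOutcome::Graceful);
        }
        sleep(Duration::from_millis(50));
    }

    // Fallback only: on Unix, Child::kill sends SIGKILL.
    child.kill()?;
    child.wait()?;
    Ok(StopOutcome::Killed)
}
```

Returning the outcome lets a scenario log, or fail on, a swap that needed the SIGKILL fallback, since that is the case that risks partial replay state on disk.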

What the real run established (in order)

  1. validator class-groups key shape — needs bare-shape publication + shape-tolerant verify;
  2. ^ the graceful-shutdown and epoch-wait fixes above;
  3. protocol-vote buffer stake — handled by the new step;
  4. internal-MPC-output dense round assert — filed + fixed in fix(dwallet-mpc): gate post-v1.1.8 consensus-output streams (write + read) for cross-binary replay #1728 (independent dev bug; affects any v3→v4 transition);
  5. network-key handoff cert (structural blocker), filed on Off-chain validator metadata + EndOfPublishV2 #1721: the off-chain-metadata PR replaces the network-key consensus vote with a per-epoch handoff cert that v1.1.8 epochs don't produce, so dev validators can't adopt the network key inherited from a v1.1.8 epoch. Not harness-fixable.

Net: the harness takes a real v1.1.8 → dev swap through boot → genesis DKG → clean binary swap → epoch advance → protocol-vote-ready, and pinpoints the remaining blocker as a dev-side backward-compat gap (#1721), with #1728 fixing one dev bug along the way.

ycscaly and others added 11 commits June 9, 2026 08:13
Add the research plan and counter-proposal for an out-of-process harness
that runs validators on different compiled binaries (dev vs mainnet-v1.1.8)
and tests them across epoch boundaries. Add a CLAUDE.md rule to express
plans as ordering/dependencies, never durations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…harness

New additive crate (no changes to ika-node/ika-swarm). Spawns real
ika-validator child processes against an external sui localnet, driven via
the existing admin RPC and the coordinator contract; epochs advance on a
short genesis epoch_duration_ms (force-close-epoch is a dead admin constant).

- binary.rs:   BinarySpec (path/tag/sha/branch) + sha-keyed build cache via
               git worktree, honoring each commit's pinned toolchain.
- process.rs:  ValidatorProcess (spawn/stop/swap_binary, admin-RPC health).
- sui.rs:      external `sui start --with-faucet --force-regenesis` localnet.
- cluster.rs:  bring-up reusing init_ika_on_sui + ValidatorConfigBuilder +
               notifier fullnode; NodeConfig -> YAML -> child; on-chain
               wait_for_epoch / protocol-version reads via IkaClient.
- workload.rs / scenario.rs / bin: typed contracts + imperative DSL + CLI;
               submission wiring and scenario runner land next.

Compiles and clippy-clean. Runtime go/no-go (epochs advance on real
binaries) and the workload driver are the next tasks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reach epoch 2

The smoke test brings up an external sui localnet + 4 real ika-validator
child processes + a notifier, completes network DKG, and advances to epoch 2
entirely out-of-process. test result: ok (396s on an 8-core host).

Fixes found by driving it green:
- sui.rs: wait for the faucet (:9123), not just the RPC — init_ika_on_sui
  hits the faucet immediately and it comes up a beat after the RPC.
- binary.rs / prebuilt bins: build ika-validator with --no-default-features
  to drop `enforce-minimum-cpu` (panics on < 16-core hosts).
- cluster.rs: chdir into the per-run base around init_ika_on_sui so the
  chain-keyed Pub.localnet.toml dies with the run (stale pubfile rejected the
  next regenesis chain).
- process.rs: health-check and admin reads use response *text*, not JSON —
  the ika admin server returns (StatusCode, String) debug text. This was why
  a healthy validator was never recognized.
- workload.rs: real Curve25519 DKG submission path (compiles); network-key
  bytes via the full-data fetch (the key stores a TableVec, not bytes).
- scenario.rs: real imperative runner (start/wait/swap/assert).

Run: RUN_UPGRADE_SMOKE=1 IKA_VALIDATOR_BIN=... IKA_NOTIFIER_BIN=... SUI_BIN=...
  cargo test --release -p ika-upgrade-test --test smoke -- --nocapture

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ad tests

- workload.rs: issue_dkg_and_confirm — submit a real Curve25519 user DKG and
  block until the coordinator's user completed_sessions_count rises (honest
  on-chain completion; no per-dWallet Move decoding, for which no Rust type
  exists). user_session_counts reads
  DWalletCoordinatorInnerV1.sessions_manager.user_sessions_keeper.
- tests/workload.rs: single-binary cluster + one DKG completes on-chain.
- tests/cross_binary.rs: v1.1.8 -> dev rolling upgrade; genesis v3, swap all
  four to dev, assert the protocol vote advances to v4 (gated: both binaries
  support v3, only all-dev supports v4).

Tests are opt-in (RUN_WORKLOAD_TEST / RUN_CROSS_BINARY); all compile + clippy
clean. cross_binary awaits a built v1.1.8 ika-node binary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Scenario::with_base_dir keeps node logs after a failure (default temp dir is
cleaned on drop, which hid the v1.1.8 boot panic). with_epoch_timeout for
slower heterogeneous runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cross_binary test passes end-to-end (722s): 4 out-of-process validators
boot on a v3-only binary, complete network DKG, then all swap to dev (v3..v4)
and the capability vote advances v3 -> v4 at the next epoch. Exercises the
protocol-vote arithmetic, mid-epoch reconfiguration across the swap, mixed-
committee wire compat, and on-disk compat (restart on new binary, old RocksDB).

Tuning that made it pass: 10-min epochs and swap-all-then-one-transition, to
avoid the known sui_executor gas-coin-contention epoch wedge (short epochs +
swap churn froze the notifier's advance-epoch executor) and to keep each swap
clear of the mid-epoch reconfiguration window. Scenario gains
with_epoch_duration_ms / with_epoch_timeout.

The OLD binary is a dev build pinned to MAX_PROTOCOL_VERSION=3 (same crypto as
dev, differs only in advertised version). The literal mainnet-v1.1.8 ika-node
is crypto-incompatible (inkrypto vs cryptography-private class_groups; v4 key-
shape change) and cannot share a committee with dev — a finding documented in
the test, confirming the dual-pin premise of docs/plan-update-crypto-latest.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
workload.rs: issue_dkg returns the txn digest (completion is confirmed via the
coordinator session counter, not a per-dWallet read); protocol-pp fetch retries
on a partial-TableVec decode error. tests/workload.rs (GREEN) proves the
submission path end-to-end: protocol params from the on-chain network key,
centralized Curve25519 party, coordinator txn executes.

KNOWN GAP (documented in the test + results doc): the coordinator ignores the
emitted event ("not a DWalletSessionEvent"), so the session does not complete —
the driver must call register_encryption_key before the DKG (as the TS SDK
does). Presign/Sign build on a completed DKG and are not implemented.

docs/cross-binary-upgrade-testing-results.md summarizes what was built, the
green go/no-go + cross-binary(v3->v4) runs, the v1.1.8 crypto-incompat finding,
the epoch-wedge tuning, and the workload gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on-chain

tests/workload.rs passes (~415s): a user dWallet completes register-encryption-
key -> DKG(Active) -> presign(verified) -> sign, the sign confirmed on-chain via
the coordinator's user completed_sessions_count. Proves the session-lifecycle
invariant (sessions started in an epoch complete; no silent drops).

The driver orchestrates the canonical `ika` CLI (the tested Rust client) rather
than re-deriving the user-side 2PC. Making it reliably green surfaced real
system properties, all handled:
- dedicated, separately-funded user (faucet SUI + IKA transfer) — sharing the
  publisher key with the notifier causes coin-lock contention;
- register-encryption-key before create (encrypted DKG borrows the user key
  from the coordinator);
- v4 genesis (internal_presign_sessions is a v4 feature);
- 30-min epoch so the lifecycle runs clear of the mid-epoch reconfiguration;
- confirm sign via on-chain completed-count, not the CLI's racy --wait poll.

Adds shared-crypto + fastcrypto deps for the IKA-funding transfer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…it, buffer-stake override

Hardening surfaced while driving a real mainnet-v1.1.8 -> dev rolling swap:

- process.rs: swap now stops the node with SIGTERM (which ika-node's
  `wait_termination` handles for an orderly shutdown) and waits for clean exit,
  with SIGKILL only as a fallback. The previous hard SIGKILL interrupted the
  node mid-consensus-round and left dwallet-MPC replay state partial on disk,
  which crashed the next binary on replay (`consensus round mismatch ...`). A
  binary swap is a planned restart, not a crash, so it must be graceful.

- cluster.rs: document that the epoch counter advancing to N is itself the
  completion signal for epoch N-1 (reconfiguration into a new epoch is gated on
  that epoch's network-key MPC finishing), so callers wait for the epoch *after*
  the work they depend on rather than polling key state.

- scenario.rs / process.rs: add a `set_buffer_stake(bps)` step
  (POST /set-override-buffer-stake) so a quorum, not unanimity, advances the
  protocol version. With n=4 the default 50% buffer rounds up to requiring all
  four votes; a rolling swap can leave one validator's fresh capability
  uncommitted at the boundary tally.

- cross_binary.rs: wait for epoch 2 before swapping (the genesis network DKG
  runs during epoch 1, so epoch 2 guarantees it finished under the old binary),
  drop the buffer stake to a quorum, then wait for epoch 3.
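
The buffer-stake arithmetic behind the `set_buffer_stake` step can be worked through with a small sketch. The constants are assumptions modeled on the Sui convention (10,000 total voting power, 6,667 quorum, buffer rounded up) and are not taken from the Ika source. Equal stake per validator is also assumed, and the function names are hypothetical.

```rust
/// Assumed constants (Sui-style convention), not read from the Ika source.
pub const TOTAL_VOTING_POWER: u64 = 10_000;
pub const QUORUM_THRESHOLD: u64 = 6_667;

/// Stake required to advance the protocol version: the quorum plus a buffer
/// equal to `buffer_stake_bps` basis points of the non-quorum remainder,
/// rounded up.
pub fn effective_threshold(buffer_stake_bps: u64) -> u64 {
    let remainder = TOTAL_VOTING_POWER - QUORUM_THRESHOLD;
    // Ceiling division by 10,000 basis points.
    let buffer = (remainder * buffer_stake_bps + 9_999) / 10_000;
    QUORUM_THRESHOLD + buffer
}

/// Number of equal-stake validators (out of `n`) whose votes are needed to
/// reach the effective threshold.
pub fn votes_needed(n: u64, buffer_stake_bps: u64) -> u64 {
    let per_validator = TOTAL_VOTING_POWER / n;
    let threshold = effective_threshold(buffer_stake_bps);
    // Ceiling division: smallest vote count whose stake meets the threshold.
    (threshold + per_validator - 1) / per_validator
}
```

Under these assumptions a 50% buffer (5,000 bps) gives a threshold of 8,334, so with four validators at 2,500 each three votes (7,500) fall short and all four are required. A buffer of 0 lowers the threshold to 6,667, which three votes clear.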

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t; gate internal-presign read on v4

Wire the cross-binary upgrade path against the off-chain-metadata branch so a
mixed-version committee survives a rolling binary swap:

- verify_validator_keys decodes whatever class-groups key shape is on-chain
  (bare mainnet-v1.1.8 `ClassGroupsEncryptionKeyAndProof` or the post-bump
  combined `ValidatorEncryptionKeysAndProofs`) via
  `decode_validator_encryption_keys`, comparing only the class-groups component
  that identifies the seed. PVSS keys are verified off-chain on the assembly path.
- validator_initialization_config publishes the BARE mainnet-v1.1.8 shape
  on-chain (the richer bundle travels off-chain via validator P2P), so a
  v1.1.8 binary can still decode the record during the upgrade window.
- The internal-presign output read is gated on `internal_presign_sessions_enabled()`
  (a v4 feature) so a pre-v4 node mid rolling-upgrade skips the sparse stream
  instead of panicking on the dense per-round assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- bin default scenario `rolling_majority_then_minority` mirrors the proven-good
  config (10-min epochs + 1800s wait timeout + `set_buffer_stake(0)` before the
  upgrade-crossing wait) so the v3->v4 vote can land under n=4.
- `wait_for_epoch` logs a failed `current_epoch` read instead of silently
  treating it as epoch 0 until the deadline.
- Document that the workload sign-completion check (coordinator user
  `completed_sessions_count`) is sound only because the harness drives a single
  dedicated user with one sign in flight.
- results doc: caveat the cross_binary GREEN row (version-only swap; the real
  v1.1.8 crypto-boundary swap is not exercised) and note the single-instance /
  fixed-port (9000/9123) constraint.
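
The `wait_for_epoch` change (log a failed read instead of treating it as epoch 0) could look roughly like the sketch below. The signature is hypothetical: the epoch read is passed in as a closure so the example stays self-contained, whereas the harness reads it through IkaClient.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `current_epoch` until it reaches `target` or `timeout` elapses.
/// A failed read is logged and retried rather than being counted as epoch 0.
pub fn wait_for_epoch<F>(
    mut current_epoch: F,
    target: u64,
    timeout: Duration,
    poll: Duration,
) -> Result<u64, String>
where
    F: FnMut() -> Result<u64, String>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match current_epoch() {
            // Reaching epoch N signals that the work of epoch N-1 completed.
            Ok(epoch) if epoch >= target => return Ok(epoch),
            Ok(_) => {}
            Err(e) => eprintln!("current_epoch read failed (will retry): {e}"),
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out waiting for epoch {target}"));
        }
        sleep(poll);
    }
}
```

Keeping the error visible in the log separates "the chain is not advancing" from "the RPC is unreachable", which otherwise look identical until the deadline.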

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ycscaly changed the base branch from dev to feat/off-chain-metadata-v2 June 9, 2026 08:18
ycscaly force-pushed the feat/ika-upgrade-test branch from 8672c96 to f858d37 Compare June 9, 2026 08:19
ycscaly and others added 13 commits June 9, 2026 13:11
…te + read)

Same fix landed on dev via #1728, applied here on the off-chain-metadata file
structure. The consensus-output replay loop asserts each per-round table's
record round equals the driver round (`dwallet_mpc_messages`). Tables added
after mainnet-v1.1.8 are sparse when a dev binary replays a v1.1.8-written
RocksDB (rolling binary swap), tripping the assertion. Gate write + read of each
on the feature that introduced it:

- internal_presign_sessions: dwallet_internal_mpc_outputs,
  global_presign_requests, idle_status_updates
- noa_checkpoints: verified_system_checkpoint_messages, noa_observations,
  sui_chain_observation_updates

(`network_key_data_messages` is already removed on this branch by the off-chain
work, so it needs no gate here.) Validated by the cross-binary upgrade harness:
mainnet-v1.1.8 -> dev rolling swap reaches protocol v4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	crates/ika-core/src/authority/authority_per_epoch_store.rs
#	crates/ika-core/src/dwallet_mpc/dwallet_mpc_service.rs
…ore driving

The workload waited for epoch 1 (genesis, reached immediately) and drove the
dWallet lifecycle right away, with a 30-minute epoch. At v4 the genesis
network-key DKG is gated on the off-chain mpc_data freeze, whose ready-signal —
with no next-epoch committee published yet at genesis — only fires at the
3/4*epoch_duration deadline (~22 min on a 30-min epoch). So the network key
wasn't on-chain when the CLI tried to derive protocol parameters, and the
driver's ~2-min retry budget gave up long before.

Wait for epoch 2 instead of 1: the counter advancing to 2 is itself the
completion signal for the genesis network DKG (reconfiguration into epoch 2
reshares that key, which can't happen until the DKG finished), so the key is
guaranteed readable — and don't drive the lifecycle before then, when it could
only fail. Shorten the epoch to 4 min so the freeze deadline (~3 min) is
reachable while still clearing validator bring-up + announcement recording
(~90s) and leaving the lifecycle room inside the next epoch.
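
The timing argument can be checked with a few lines of arithmetic. This is a sketch: the function names are hypothetical, and the 3/4 fraction is the only input taken from the message above.

```rust
use std::time::Duration;

/// Fallback freeze deadline described above: three quarters of the epoch.
pub fn freeze_deadline(epoch_duration: Duration) -> Duration {
    epoch_duration * 3 / 4
}

/// True when a retry budget that starts at epoch start outlasts the deadline.
pub fn budget_covers_deadline(epoch_duration: Duration, retry_budget: Duration) -> bool {
    retry_budget >= freeze_deadline(epoch_duration)
}
```

For a 30-minute epoch the deadline is 1,350 s (22.5 min), far beyond a roughly 2-minute retry budget. For a 4-minute epoch it is 180 s.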

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… mid-epoch at the v3->v4 boundary

The off-chain network-key blob overlay is keyed by key ID only, so the
moment this epoch's mid-epoch reconfiguration finalizes locally, the
syncer's merged key data starts carrying the output produced *for the
next epoch's committee* — shares encrypted to next-epoch party IDs,
which need not align with this epoch's (on-chain committee order is not
stable across epochs). In steady-state v4 the cert anchor in
`adopt_cert_verified_keys` rejects it (the prior epoch's handoff cert
pins the output produced FOR the current epoch), but the first v4 epoch
after the v3->v4 upgrade has no prior cert, fell into the cert-less
boundary path, and adopted blindly — every validator then failed
decryption with ClassGroup(Decryption) using this epoch's identity on
next-epoch-dealt shares.

Guard the boundary path the same way the cert anchor does: skip
adoption when the reconfiguration output's digest matches the one this
epoch's own reconfiguration session recorded (epoch-keyed perpetual
digest, new point lookup). The next epoch's manager adopts and decrypts
it with next-epoch identity at epoch start, as in steady state.

Also hoist the last-failed check in the instantiation filter so it
applies to every branch: previously the `Some(prev)` branch re-selected
the failing bytes every poll tick (they differ from the last
*successfully* instantiated ones by definition), re-running a doomed
~18s class-groups decrypt per tick and starving the service loop —
checkpoints (including EndOfPublish) never certified, wedging epoch
advance behind the decryption failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ejections

The stale-gas recovery (drop the cached gas ref, floor the re-fetch at
the rejected version) only ran when the rejection arrived inside
`tx_response.errors`. But the fullnode also rejects at the JSON-RPC
layer (ServerError -32002, "Transaction needs to be rebuilt ... object
unavailable for consumption"), which surfaces as `Err` from
`execute_transaction_block_with_effects` and bailed out before the
recovery code — the cached gas ref survived, so every
`retry_with_max_elapsed_time!` attempt rebuilt the byte-identical stale
tx and re-rejected, wedging checkpoint delivery to Sui for the full
one-hour window (observed: dwallet checkpoints stuck behind a gas coin
advanced by the shared publisher address in the test cluster, blocking
DKG settlement, mid-epoch reconfiguration, and epoch advance).

Factor the recovery into
`NotifierSubmitState::handle_possible_stale_gas_rejection` and apply it
on both paths. Note `IkaError` derives strum's `AsRefStr`, so
`err.as_ref()` yields only the variant name — match the
`SuiClientTxFailureGeneric` payload to get the actual message.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…grade boundary

Genesis at v3 (MIN) instead of v4 (MAX): a v4 *genesis* network DKG is
rejected forever (PVSS keys only arrive through the next-committee-only
off-chain assembly), so the supported path — and the one mainnet takes —
is genesis v3, then upgrade into v4 via the capability vote. The test
now waits for epoch 2 (v3 genesis DKG + reshare done), zeroes the
buffer stake so the 4-validator vote tallies at bare quorum, waits for
epoch 3, asserts protocol >= v4, and only then drives the
DKG -> Presign -> Sign lifecycle. This exercises the cert-less v3->v4
reconfiguration-adoption boundary fixed in the previous commits.

Remove HANDOFF.md — the reshare-decrypt bug it described is fixed and
the workload test is green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-gas recovery

Clippy's `unnecessary_to_owned` suggests `err.as_ref()` for
`&err.to_string()` here; it compiles because `IkaError` derives strum's
`AsRefStr`, but that returns only the variant name — never the rejection
markers — silently disabling the recovery.
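
The pitfall can be reproduced without strum. In the sketch below `DemoError` is a hypothetical stand-in for `IkaError`, and its `AsRef<str>` impl is written by hand to return what the `AsRefStr` derive returns by default: the variant name only. The marker string is taken from the rejection text quoted in the earlier commit.

```rust
use std::fmt;

/// Hypothetical stand-in error; only the variant name comes from the commits above.
#[derive(Debug)]
pub enum DemoError {
    SuiClientTxFailureGeneric(String),
}

// Display carries the payload, as a typical error message would.
impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemoError::SuiClientTxFailureGeneric(msg) => write!(f, "transaction failed: {msg}"),
        }
    }
}

// Hand-written equivalent of the derive's default output: variant name, no payload.
impl AsRef<str> for DemoError {
    fn as_ref(&self) -> &str {
        match self {
            DemoError::SuiClientTxFailureGeneric(_) => "SuiClientTxFailureGeneric",
        }
    }
}

/// Marker text from the JSON-RPC rejection quoted earlier.
pub const STALE_GAS_MARKER: &str = "needs to be rebuilt";

/// Broken check: searches the variant name, which never contains the marker.
pub fn stale_gas_via_as_ref(err: &DemoError) -> bool {
    AsRef::<str>::as_ref(err).contains(STALE_GAS_MARKER)
}

/// Working check: matches the variant and searches its payload.
pub fn stale_gas_via_payload(err: &DemoError) -> bool {
    match err {
        DemoError::SuiClientTxFailureGeneric(msg) => msg.contains(STALE_GAS_MARKER),
    }
}
```

Both functions compile and type-check identically, which is why the lint-suggested rewrite disabled the recovery without any visible error.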

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dators

Internal presign sessions get their sequence number from a single shared
counter, assigned in iteration order over (network key id) x (curve) x
(signature algorithm), and the sequence number is bound into the session
identifier transcript. Both iteration sources were unordered:

- SUPPORTED_CURVES_TO_SIGNATURE_ALGORITHMS_TO_HASH_SCHEMES was a
  HashMap<u32, HashMap<u32, Vec<u32>>> — iteration order is random per
  process (RandomState), so each validator walked curves/algorithms in a
  different order;
- the agreed network key ids were iterated straight off a HashMap.

Each validator therefore derived *different* session identifiers for the
same (curve, algorithm) work. Those sessions could never reach quorum, so
they never completed, and the instantiated != completed gate then blocked
that algorithm's pool top-ups for the entire epoch. Once a user presign
request locked onto the starved pool, the EndOfPublish condition was
unsatisfiable and the epoch could not advance.

Observed live: in a 4-validator run the validators logged three distinct
top-up orders, and exactly the sequence numbers whose (curve, algorithm)
assignment happened to agree on 3+ validators completed — the rest hung
forever, the ECDSA pool stayed empty all epoch, and the run timed out.
A previous green run was a per-process-seed coin flip.

Fix: BTreeMap at both nesting levels of the static, and collect the
agreed key ids into a BTreeSet before the instantiation loop. Pre-existing
bug from the internal sessions instantiation logic (#1638), not specific
to this branch.
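
A minimal sketch of the ordered assignment follows. The nested map mirrors the shape of the static described above, but `assign_sequence_numbers`, the `Assignment` alias, and the `u64` stand-in for network key ids are assumptions made for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// (network key id, curve, signature algorithm) -> sequence number.
/// The u64 key id is a placeholder for whatever id type the node uses.
pub type Assignment = BTreeMap<(u64, u32, u32), u64>;

/// Walk keys x curves x algorithms in sorted order, handing out sequence
/// numbers from one shared counter. BTreeMap and BTreeSet iterate in key
/// order, so every process derives the same assignment.
pub fn assign_sequence_numbers(
    key_ids: &BTreeSet<u64>,
    supported: &BTreeMap<u32, BTreeMap<u32, Vec<u32>>>,
) -> Assignment {
    let mut next = 0u64;
    let mut out = Assignment::new();
    for key in key_ids {
        for (curve, algorithms) in supported {
            for algorithm in algorithms.keys() {
                out.insert((*key, *curve, *algorithm), next);
                next += 1;
            }
        }
    }
    out
}
```

With `HashMap` in either position the same loop compiles unchanged, but the visit order depends on a per-process random seed, which is the divergence described above.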

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The reconfiguration overlay (next-epoch network key data computed off-chain
during reconfiguration) was stored bare and adopted based on a
produced-this-epoch digest guard. Instead, store the target epoch alongside
the key data and adopt it only when it matches the epoch actually being
entered — epoch-correct by construction, no guard heuristics.

- validator_metadata.rs: overlay entries carry the epoch they were computed
  for; lookups take the target epoch.
- authority_per_epoch_store.rs / authority_perpetual_tables.rs: persist and
  reload the epoch alongside the overlay data; drop the digest-guard
  plumbing.
- mpc_manager.rs / sui_syncer.rs: pass the target epoch through adoption and
  ignore overlay data for any other epoch.

Validated by a full workload run: genesis at v3, upgrade into v4 at epoch 3,
v3 -> v4 reshare decrypts cleanly on all validators, DKG -> Presign -> Sign
lifecycle green.
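
The adoption rule reduces to a single comparison. In this sketch `OverlayEntry` and `adopt_for_epoch` are hypothetical names, and the real overlay entries and lookups in validator_metadata.rs may be shaped differently.

```rust
/// Hypothetical overlay entry: key data plus the epoch it was computed for.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OverlayEntry {
    pub target_epoch: u64,
    pub key_data: Vec<u8>,
}

/// Return the overlay data only when it targets the epoch being entered;
/// data computed for any other epoch is ignored.
pub fn adopt_for_epoch(entry: Option<&OverlayEntry>, entering_epoch: u64) -> Option<&[u8]> {
    entry
        .filter(|e| e.target_epoch == entering_epoch)
        .map(|e| e.key_data.as_slice())
}
```

Carrying the epoch in the stored value replaces the digest comparison with an equality check, so an entry computed for the next committee cannot be adopted early.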

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two upgrades to the cross-binary rolling-upgrade test:

1. Committee changes at every epoch boundary, with a different committee
   size each epoch (4 -> 3 -> 5 -> 4): a validator removal coincides with
   the v3 -> v4 protocol bump, two brand-new validators join via the full
   candidate -> stake -> activate flow (their class-groups keys registered
   on-chain, so the v4 reshare encrypts to parties that never held the
   key), and a final removal reshapes 5 -> 4. Every boundary after the
   first is a real reshare to a different party set.
   - sui_client.rs: request_add_validator / request_remove_validator /
     candidate registration helpers with explicit sender, shared version,
     and cap so the test can drive membership without touching the active
     wallet address.
   - network_config_builder.rs: configurable min_validator_count (the dip
     to 3 is below the protocol default of 4).
   - scenario.rs / cluster.rs / process.rs: join_validator,
     remove_validator, expect_committee_size scenario steps; spawn /
     stop / swap of individual validators on different binaries.

2. Rough per-protocol MPC timing report (mpc_timings.rs): scrape the MPC
   duration metrics from each validator after the v3 (old binary) and v4
   (new binary, churned committee) workload runs, and print a comparison
   table at the end of the run. Informational and flagged, not asserted —
   wall-clock on a loaded developer machine is too noisy to gate on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…set post-upgrade

The cross-binary churn test wedged at the v3 workload: genesis wrote the
full GlobalPresignConfig, routing ECDSA presigns to the global path, which
is served exclusively from the validators' internal presign pool — and that
pool only fills once internal_presign_sessions activates at protocol v4.
At v3 the presign was unservable forever.

Genesis now takes GenesisGlobalPresignConfig (Full | Empty). Empty is the
mainnet-v1.1.8 on-chain state (the config object must still exist — the
coordinator reads it with a bare dynamic-field borrow). The cross-binary
scenario uses Empty at genesis and a new SetGlobalPresignConfig step right
after the v4 upgrade is confirmed — the same operational ordering a real
mainnet rollout must follow: set_global_presign_config only after v4
activates, or ECDSA presigns stall network-wide until it does.

Existing genesis-at-v4 tests keep Full (exact current behavior). Also
rewords the cross_binary doc comment: the literal v1.1.8 binary failing on
harness genesis is a registration-shape artifact (post-#1707 bundle bytes),
not a production-direction gap — the new binary reads v1.1.8 keys via the
shape-tolerant decode.

Verified in the churn run: v3 workload completed (vs infinite wedge),
v3→v4 vote passed, post-upgrade config set succeeded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ycscaly and others added 4 commits June 10, 2026 22:01
…le coin

transfer_one_ika took the publisher's first IKA coin — the genesis supply
coin (ika_supply_id) — and transferred it whole to the workload user. The
first churn run to stake a joiner after a workload exposed it: stake_ika
splits the joining stake from ika_supply_id signed by the publisher, which
no longer owned it ("Transaction was not signed by the correct sender").
A second workload on the same cluster would have failed the same way
("publisher owns no IKA").

Pay a fixed 1000-IKA allowance to the workload user instead; the supply
coin stays with the publisher.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s activates

The global-presign pipeline is gated on the internal_presign_sessions
feature flag (protocol v4) on its consensus side, but session intake
diverted every global presign request to the pool unconditionally. Below
v4 that strands the request: no pool to serve it, no MPC session spawned,
and the session is locked into its epoch: all_current_epoch_sessions_completed
blocks advance_epoch, so the epoch can never end and v4 can
never activate. Mainnet's GlobalPresignConfig is already populated (every
production ECDSA presign routes to global), so a single presign request
in flight after the upgrade restart would have wedged the network at v3
permanently.

Gate the diversion on the same flag: pre-activation, the request falls
through to a user-requested MPC session — the v1.1.8 serving behavior,
whose input (dwallet-output-less presign computation) and output
(RespondDWalletPresign with no dwallet_id, VersionedPresignOutput::V2)
paths are intact on this branch.
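
The gate amounts to one extra condition at session intake. A sketch with hypothetical names follows (`PresignRoute`, `route_presign`); it also assumes that non-global presign requests always run as user sessions.

```rust
/// Where a presign request is served (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PresignRoute {
    /// Served from the validators' internal presign pool (v4 feature).
    InternalPool,
    /// Served by spawning a user-requested MPC session (v1.1.8 behavior).
    UserSession,
}

/// Divert to the pool only when the request is global AND the feature is active.
pub fn route_presign(is_global_request: bool, internal_presign_sessions_enabled: bool) -> PresignRoute {
    if is_global_request && internal_presign_sessions_enabled {
        PresignRoute::InternalPool
    } else {
        PresignRoute::UserSession
    }
}
```

The unconditional diversion described above corresponds to dropping the second operand, which leaves a global request below v4 with no pool and no session.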

Caught by the new v118_upgrade rehearsal: genesis a 4-validator committee
on the literal mainnet-v1.1.8 ika-node with the verified mainnet-shape
populated GlobalPresignConfig, run the mainnet user flow at v3 (DKG with
Universal output, global presign as a user session, sign), atomically
swap all validators to the local build, and probe the pre-activation
window with a workload that must complete its global presign at v3 via
the fallback before the boundary. The run then crosses into v4 (the local
binaries reshare the 1.1.8-created network key), serves a pool-backed
global presign, and completes one more clean reshare.

Also corrects GenesisGlobalPresignConfig and cross_binary docs: Full is
the actual verified mainnet on-chain state, Empty is a harness
arrangement (and the only targeted-presign coverage), not the mainnet
shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ith window-delta tables

The scraped MPC duration metrics are cumulative per process, so a later
snapshot blends everything the validator ever ran — a v3-protocol reshare
and a v4-protocol reshare land in one row and the ratio table reads 1.00x.
Add a window table to each comparison: (sum2-sum1)/(completions2-completions1)
between consecutive snapshots isolates just the work done between them
(skipped across a swap, where the counter reset makes the delta negative).
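
The window computation can be sketched as below. `Snapshot` and `window_mean` are hypothetical names; the formula is the one stated above, and a window is skipped whenever either counter moved backwards or no completions occurred.

```rust
/// Cumulative duration sum and completion count scraped at one point in time.
#[derive(Debug, Clone, Copy)]
pub struct Snapshot {
    pub sum_secs: f64,
    pub completions: u64,
}

/// Mean duration of the work done between two snapshots:
/// (sum2 - sum1) / (completions2 - completions1).
/// Returns None when the counters reset (for example across a binary swap)
/// or when nothing completed in the window.
pub fn window_mean(earlier: Snapshot, later: Snapshot) -> Option<f64> {
    if later.completions <= earlier.completions || later.sum_secs < earlier.sum_secs {
        return None;
    }
    let completed = (later.completions - earlier.completions) as f64;
    Some((later.sum_secs - earlier.sum_secs) / completed)
}
```

For example, going from (10 s, 2 completions) to (40 s, 4 completions) yields 15 s per completion for the window, while the cumulative mean at the second snapshot would read 10 s.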

Extend the v118 rehearsal past epoch 4 with two new snapshots:
- v4-reshare: the first reshare executed under the v4 reconfiguration
  math (reconfiguration_message_version = 3, PVSS HPKE) — the epoch 2->3
  reshare still ran the v3 protocol, so the previous run never measured
  v4 reconfiguration cleanly;
- local-v4-settled: a full lifecycle after the pools finished their
  initial fill, pricing v4 DKG / pool presign / sign without the
  boundary work.

Run is green (1105s): the v4-math reshare window prices at 53.2s/7.4s/30.8s/9.6s
per round vs the local binary's v3-math reshare at 9.5s/2.5s/8.6s/2.8s —
with continuous internal-presign pool top-ups sharing the cores.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	crates/ika-core/src/dwallet_mpc/mpc_manager.rs