test(ika-upgrade-test): out-of-process cross-binary upgrade harness by ycscaly · Pull Request #1727 · dwallet-labs/ika

ycscaly · 2026-06-05T09:14:47Z

Summary

New additive crate crates/ika-upgrade-test — an out-of-process test harness that spawns real, separately-compiled ika-validator binaries against an external sui localnet, swaps binaries on validators across epochs, and drives dWallet workloads. Unlike ika-test-cluster (in-process IkaNode, one binary), it can host genuinely different binaries in one committee. No changes to ika-node / ika-swarm.

Implements docs/cross-binary-upgrade-testing*.md; see docs/cross-binary-upgrade-testing-results.md for the full write-up.

Tests (all opt-in via env flags; need real binaries + a workspace-tag `sui`)

| Test | Status | Proves |
| --- | --- | --- |
| `tests/smoke.rs` (go/no-go) | ✅ ~396s | 4 out-of-process validators + notifier, network DKG, reach epoch 2 |
| `tests/cross_binary.rs` | ✅ ~722s | Boot 4 on a v3-only binary, swap all to dev, capability vote advances v3 → v4 |
| `tests/workload.rs` | ✅ ~415s | Full user DKG → Presign → Sign completes on-chain |

The cross-binary run exercises, out of process: protocol-vote arithmetic, mid-epoch reconfiguration MPC across the swap, mixed-committee wire compat, and on-disk compat (restart on a new binary against the old RocksDB).
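The opt-in gating could look roughly like the following. This is a minimal sketch, assuming a flag counts as enabled only when it is set to `1` (the one value the run command in this PR shows); `is_opted_in` and `opted_in_from_env` are hypothetical names, not the harness's actual helpers.

```rust
use std::env;

/// Returns true when `flag` resolves to "1" through `lookup`.
/// Assumption: only the literal value "1" enables a test.
fn is_opted_in(flag: &str, lookup: impl Fn(&str) -> Option<String>) -> bool {
    matches!(lookup(flag).as_deref(), Some("1"))
}

/// Convenience wrapper that reads the real process environment.
fn opted_in_from_env(flag: &str) -> bool {
    is_opted_in(flag, |name| env::var(name).ok())
}
```

A gated test would begin with something like `if !opted_in_from_env("RUN_UPGRADE_SMOKE") { return; }`, so a plain `cargo test` skips the long-running runs. Taking the lookup as a closure keeps the check testable without mutating the process environment.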

Crate layout

- sui.rs: spawn external `sui start --with-faucet --force-regenesis` (waits for RPC and faucet).
- cluster.rs: chain bootstrap via init_ika_on_sui + ValidatorConfigBuilder + a notifier fullnode; NodeConfig → YAML → child; on-chain wait_for_epoch / protocol-version via IkaClient.
- process.rs: ValidatorProcess with spawn / stop / swap_binary, health via the admin RPC.
- binary.rs: BinarySpec (path / tag / sha / branch) + a sha-keyed git worktree build cache honoring each commit's pinned toolchain.
- scenario.rs: imperative DSL (start / wait_for_epoch / stop_and_swap / expect_protocol_version).
- workload.rs: drives a user dWallet lifecycle by orchestrating the canonical ika CLI; confirms completion on-chain.
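To illustrate the sha-keyed cache idea from binary.rs, here is a small sketch. The variant payload types, the `binary_location` helper, and the `<cache_root>/<sha>/ika-validator` layout are all assumptions made for illustration; the real crate may structure this differently.

```rust
use std::path::{Path, PathBuf};

/// How a validator binary is named. The four variants mirror the
/// path / tag / sha / branch forms listed above; payload types are assumed.
#[derive(Debug, Clone, PartialEq)]
enum BinarySpec {
    Path(PathBuf),
    Tag(String),
    Sha(String),
    Branch(String),
}

/// Where a binary lives once its spec has been resolved to a commit sha.
/// `resolved_sha` is ignored for `Path` specs, which are used as-is.
fn binary_location(spec: &BinarySpec, cache_root: &Path, resolved_sha: &str) -> PathBuf {
    match spec {
        BinarySpec::Path(p) => p.clone(),
        // Tags and branches can move, so the cache key is always the sha
        // they currently resolve to, never the symbolic name.
        BinarySpec::Tag(_) | BinarySpec::Sha(_) | BinarySpec::Branch(_) => {
            cache_root.join(resolved_sha).join("ika-validator")
        }
    }
}
```

Keying on the sha means a tag and a branch that point at the same commit share one build, while a branch that advances gets a fresh cache entry.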

Key finding: `mainnet-v1.1.8` ↔ `dev` is not a naive rolling swap

Running the harness with the real mainnet-v1.1.8 ika-node fails at boot for the expected reason, not a harness bug: v1.1.8 links class_groups from dwallet-labs/inkrypto, dev from dwallet-labs/cryptography-private, and v4 publishes the combined ValidatorEncryptionKeysAndProofs where v1.1.8 expects the bare ClassGroupsEncryptionKeyAndProof — the exact incompatibility flagged in validator_initialization_config.rs (⚠️ MAINNET WIRE-FORMAT INCOMPATIBILITY ⚠️). The v1.1.8 node loads its config, connects to Sui, reads the contracts, then fails decoding the on-chain validator record (class groups public key … remaining input). So the harness genuinely runs a different binary and fails on the documented wire-format divergence.

To demonstrate a successful heterogeneous upgrade, the green cross_binary run uses an OLD binary that is a dev build pinned to MAX_PROTOCOL_VERSION=3 (same crypto, differs only in advertised protocol version) — disclosed in the test and results doc.

Notes

Builds use --no-default-features to drop enforce-minimum-cpu (panics on < 16-core hosts).
The workload uses a dedicated funded user (faucet SUI + IKA transfer) to avoid contention with the notifier; register-encryption-key precedes create; v4 genesis for internal_presign_sessions; long epoch to clear the mid-epoch reconfiguration; sign confirmed via the coordinator's on-chain completed-session count.

🤖 Generated with Claude Code

ycscaly · 2026-06-07T08:45:44Z

Update: hardening + findings from a real `mainnet-v1.1.8` → `dev` run

Pushed 8672c96 — harness hardening surfaced by driving an actual v1.1.8 → dev rolling swap (not the same-crypto MAX=3 stand-in):

Graceful SIGTERM swap. stop() previously hard-SIGKILLed the node. That interrupts it mid-consensus-round and leaves dwallet-MPC replay state partial on disk, crashing the next binary on replay (consensus round mismatch ...). A swap is a planned restart — it now sends SIGTERM (handled by node_runner's wait_termination), waits for clean exit, SIGKILL only as fallback.
Wait for the real epoch boundary. The genesis network DKG runs during epoch 1, so the scenario waits for epoch 2 before swapping (the epoch counter advancing to N is the completion signal for N-1). Documented in cluster.rs.
set_buffer_stake(0) step. With n=4 the default 50% buffer rounds up to requiring all four votes; a rolling swap can leave one validator's fresh capability uncommitted at the boundary tally. Dropping the buffer to a quorum is the realistic behavior on larger committees.
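The n=4 arithmetic behind the buffer-stake step can be checked with a short sketch. It assumes the Sui-style convention (total voting power 10,000, quorum 6,667, buffer expressed in basis points of the stake above quorum, rounded up); that ika uses exactly these constants is an assumption, and both function names are hypothetical.

```rust
/// Stake (out of `total`) a protocol version needs before it is adopted:
/// the quorum plus `buffer_bps` of the stake above quorum, rounded up.
fn effective_threshold(total: u64, quorum: u64, buffer_bps: u64) -> u64 {
    let above_quorum = total - quorum;
    let buffer = (above_quorum * buffer_bps + 9_999) / 10_000; // ceiling division
    quorum + buffer
}

/// Smallest number of equal-stake validators whose combined stake
/// meets `threshold`.
fn votes_needed(n: u64, total: u64, threshold: u64) -> u64 {
    let stake_each = total / n;
    (threshold + stake_each - 1) / stake_each
}
```

Under these assumptions, a 5,000 bps buffer gives a threshold of 8,334; with four validators at 2,500 each, three votes (7,500) fall short, so all four are required. With the buffer at 0 the threshold is the bare quorum of 6,667 and three votes suffice.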

What the real run established (in order)

1. validator class-groups key shape: needs bare-shape publication + shape-tolerant verify;
2. the graceful-shutdown and epoch-wait fixes above;
3. protocol-vote buffer stake: handled by the new step;
4. internal-MPC-output dense round assert: filed + fixed in #1728, "fix(dwallet-mpc): gate post-v1.1.8 consensus-output streams (write + read) for cross-binary replay" (independent dev bug; affects any v3→v4 transition);
5. network-key handoff cert: structural blocker, filed on #1721 ("Off-chain validator metadata + EndOfPublishV2"). The off-chain-metadata PR replaces the network-key consensus vote with a per-epoch handoff cert that v1.1.8 epochs don't produce, so dev validators can't adopt the network key inherited from a v1.1.8 epoch. Not harness-fixable.

Net: the harness takes a real v1.1.8 → dev swap through boot → genesis DKG → clean binary swap → epoch advance → protocol-vote-ready, and pinpoints the remaining blocker as a dev-side backward-compat gap (#1721), with #1728 fixing one dev bug along the way.

Add the research plan and counter-proposal for an out-of-process harness that runs validators on different compiled binaries (dev vs mainnet-v1.1.8) and tests them across epoch boundaries. Add a CLAUDE.md rule to express plans as ordering/dependencies, never durations. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…harness

New additive crate (no changes to ika-node/ika-swarm). Spawns real ika-validator child processes against an external sui localnet, driven via the existing admin RPC and the coordinator contract; epochs advance on a short genesis epoch_duration_ms (force-close-epoch is a dead admin constant).

- binary.rs: BinarySpec (path/tag/sha/branch) + sha-keyed build cache via git worktree, honoring each commit's pinned toolchain.
- process.rs: ValidatorProcess (spawn/stop/swap_binary, admin-RPC health).
- sui.rs: external `sui start --with-faucet --force-regenesis` localnet.
- cluster.rs: bring-up reusing init_ika_on_sui + ValidatorConfigBuilder + notifier fullnode; NodeConfig -> YAML -> child; on-chain wait_for_epoch / protocol-version reads via IkaClient.
- workload.rs / scenario.rs / bin: typed contracts + imperative DSL + CLI; submission wiring and scenario runner land next.

Compiles and clippy-clean. Runtime go/no-go (epochs advance on real binaries) and the workload driver are the next tasks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…reach epoch 2

The smoke test brings up an external sui localnet + 4 real ika-validator child processes + a notifier, completes network DKG, and advances to epoch 2 entirely out-of-process. test result: ok (396s on an 8-core host).

Fixes found by driving it green:

- sui.rs: wait for the faucet (:9123), not just the RPC; init_ika_on_sui hits the faucet immediately and it comes up a beat after the RPC.
- binary.rs / prebuilt bins: build ika-validator with --no-default-features to drop `enforce-minimum-cpu` (panics on < 16-core hosts).
- cluster.rs: chdir into the per-run base around init_ika_on_sui so the chain-keyed Pub.localnet.toml dies with the run (stale pubfile rejected the next regenesis chain).
- process.rs: health-check and admin reads use response *text*, not JSON; the ika admin server returns (StatusCode, String) debug text. This was why a healthy validator was never recognized.
- workload.rs: real Curve25519 DKG submission path (compiles); network-key bytes via the full-data fetch (the key stores a TableVec, not bytes).
- scenario.rs: real imperative runner (start/wait/swap/assert).

Run:

    RUN_UPGRADE_SMOKE=1 IKA_VALIDATOR_BIN=... IKA_NOTIFIER_BIN=... SUI_BIN=... cargo test --release -p ika-upgrade-test --test smoke -- --nocapture

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ad tests

- workload.rs: issue_dkg_and_confirm submits a real Curve25519 user DKG and blocks until the coordinator's user completed_sessions_count rises (honest on-chain completion; no per-dWallet Move decoding, for which no Rust type exists). user_session_counts reads DWalletCoordinatorInnerV1.sessions_manager.user_sessions_keeper.
- tests/workload.rs: single-binary cluster + one DKG completes on-chain.
- tests/cross_binary.rs: v1.1.8 -> dev rolling upgrade; genesis v3, swap all four to dev, assert the protocol vote advances to v4 (gated: both binaries support v3, only all-dev supports v4).

Tests are opt-in (RUN_WORKLOAD_TEST / RUN_CROSS_BINARY); all compile + clippy clean. cross_binary awaits a built v1.1.8 ika-node binary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Scenario::with_base_dir keeps node logs after a failure (default temp dir is cleaned on drop, which hid the v1.1.8 boot panic). with_epoch_timeout for slower heterogeneous runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The cross_binary test passes end-to-end (722s): 4 out-of-process validators boot on a v3-only binary, complete network DKG, then all swap to dev (v3..v4) and the capability vote advances v3 -> v4 at the next epoch. Exercises the protocol-vote arithmetic, mid-epoch reconfiguration across the swap, mixed-committee wire compat, and on-disk compat (restart on new binary, old RocksDB).

Tuning that made it pass: 10-min epochs and swap-all-then-one-transition, to avoid the known sui_executor gas-coin-contention epoch wedge (short epochs + swap churn froze the notifier's advance-epoch executor) and to keep each swap clear of the mid-epoch reconfiguration window. Scenario gains with_epoch_duration_ms / with_epoch_timeout.

The OLD binary is a dev build pinned to MAX_PROTOCOL_VERSION=3 (same crypto as dev, differs only in advertised version). The literal mainnet-v1.1.8 ika-node is crypto-incompatible (inkrypto vs cryptography-private class_groups; v4 key-shape change) and cannot share a committee with dev, a finding documented in the test, confirming the dual-pin premise of docs/plan-update-crypto-latest.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

workload.rs: issue_dkg returns the txn digest (completion is confirmed via the coordinator session counter, not a per-dWallet read); protocol-pp fetch retries on a partial-TableVec decode error. tests/workload.rs (GREEN) proves the submission path end-to-end: protocol params from the on-chain network key, centralized Curve25519 party, coordinator txn executes. KNOWN GAP (documented in the test + results doc): the coordinator ignores the emitted event ("not a DWalletSessionEvent"), so the session does not complete — the driver must call register_encryption_key before the DKG (as the TS SDK does). Presign/Sign build on a completed DKG and are not implemented. docs/cross-binary-upgrade-testing-results.md summarizes what was built, the green go/no-go + cross-binary(v3->v4) runs, the v1.1.8 crypto-incompat finding, the epoch-wedge tuning, and the workload gap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…on-chain

tests/workload.rs passes (~415s): a user dWallet completes register-encryption-key -> DKG(Active) -> presign(verified) -> sign, the sign confirmed on-chain via the coordinator's user completed_sessions_count. Proves the session-lifecycle invariant (sessions started in an epoch complete; no silent drops). The driver orchestrates the canonical `ika` CLI (the tested Rust client) rather than re-deriving the user-side 2PC.

Making it reliably green surfaced real system properties, all handled:

- dedicated, separately-funded user (faucet SUI + IKA transfer): sharing the publisher key with the notifier causes coin-lock contention;
- register-encryption-key before create (encrypted DKG borrows the user key from the coordinator);
- v4 genesis (internal_presign_sessions is a v4 feature);
- 30-min epoch so the lifecycle runs clear of the mid-epoch reconfiguration;
- confirm sign via on-chain completed-count, not the CLI's racy --wait poll.

Adds shared-crypto + fastcrypto deps for the IKA-funding transfer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…it, buffer-stake override

Hardening surfaced while driving a real mainnet-v1.1.8 -> dev rolling swap:

- process.rs: swap now stops the node with SIGTERM (which ika-node's `wait_termination` handles for an orderly shutdown) and waits for clean exit, with SIGKILL only as a fallback. The previous hard SIGKILL interrupted the node mid-consensus-round and left dwallet-MPC replay state partial on disk, which crashed the next binary on replay (`consensus round mismatch ...`). A binary swap is a planned restart, not a crash, so it must be graceful.
- cluster.rs: document that the epoch counter advancing to N is itself the completion signal for epoch N-1 (reconfiguration into a new epoch is gated on that epoch's network-key MPC finishing), so callers wait for the epoch *after* the work they depend on rather than polling key state.
- scenario.rs / process.rs: add a `set_buffer_stake(bps)` step (POST /set-override-buffer-stake) so a quorum, not unanimity, advances the protocol version. With n=4 the default 50% buffer rounds up to requiring all four votes; a rolling swap can leave one validator's fresh capability uncommitted at the boundary tally.
- cross_binary.rs: wait for epoch 2 before swapping (the genesis network DKG runs during epoch 1, so epoch 2 guarantees it finished under the old binary), drop the buffer stake to a quorum, then wait for epoch 3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
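The SIGTERM-then-SIGKILL stop described in this commit can be sketched with the standard library alone. This is not the harness's process.rs: `graceful_stop` and `StopOutcome` are hypothetical names, the code is Unix-only, and it shells out to `sh -c "kill -TERM <pid>"` because `std` has no API for sending SIGTERM.

```rust
use std::io;
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum StopOutcome {
    /// The process exited on its own after SIGTERM.
    Graceful,
    /// The grace period elapsed and the process was SIGKILLed.
    Killed,
}

/// Send SIGTERM, wait up to `grace` for exit, then fall back to SIGKILL.
fn graceful_stop(mut child: Child, grace: Duration) -> io::Result<StopOutcome> {
    // Ask for an orderly shutdown first (uses the shell's `kill` builtin).
    Command::new("sh")
        .arg("-c")
        .arg(format!("kill -TERM {}", child.id()))
        .status()?;

    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        if child.try_wait()?.is_some() {
            return Ok(StopOutcome::Graceful);
        }
        sleep(Duration::from_millis(50));
    }

    child.kill()?; // SIGKILL on Unix
    child.wait()?; // reap the process so no zombie is left behind
    Ok(StopOutcome::Killed)
}
```

Returning the outcome lets the caller log or fail when the fallback fired, since a SIGKILLed node is exactly the case that left partial replay state on disk.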

…t; gate internal-presign read on v4

Wire the cross-binary upgrade path against the off-chain-metadata branch so a mixed-version committee survives a rolling binary swap:

- verify_validator_keys decodes whatever class-groups key shape is on-chain (bare mainnet-v1.1.8 `ClassGroupsEncryptionKeyAndProof` or the post-bump combined `ValidatorEncryptionKeysAndProofs`) via `decode_validator_encryption_keys`, comparing only the class-groups component that identifies the seed. PVSS keys are verified off-chain on the assembly path.
- validator_initialization_config publishes the BARE mainnet-v1.1.8 shape on-chain (the richer bundle travels off-chain via validator P2P), so a v1.1.8 binary can still decode the record during the upgrade window.
- The internal-presign output read is gated on `internal_presign_sessions_enabled()` (a v4 feature) so a pre-v4 node mid rolling-upgrade skips the sparse stream instead of panicking on the dense per-round assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- bin default scenario `rolling_majority_then_minority` mirrors the proven-good config (10-min epochs + 1800s wait timeout + `set_buffer_stake(0)` before the upgrade-crossing wait) so the v3->v4 vote can land under n=4.
- `wait_for_epoch` logs a failed `current_epoch` read instead of silently treating it as epoch 0 until the deadline.
- Document that the workload sign-completion check (coordinator user `completed_sessions_count`) is sound only because the harness drives a single dedicated user with one sign in flight.
- results doc: caveat the cross_binary GREEN row (version-only swap; the real v1.1.8 crypto-boundary swap is not exercised) and note the single-instance / fixed-port (9000/9123) constraint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
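The `wait_for_epoch` behavior described above (log a failed read, keep polling, never mistake an error for epoch 0) might look like this. A sketch only: the epoch source is passed in as a closure standing in for the IkaClient read, and the signature is an assumption rather than the harness's own.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `read_epoch` until it reports at least `target` or `timeout` elapses.
/// A failed read is logged and retried; it is never treated as epoch 0.
fn wait_for_epoch<F>(
    mut read_epoch: F,
    target: u64,
    timeout: Duration,
    poll: Duration,
) -> Result<u64, String>
where
    F: FnMut() -> Result<u64, String>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match read_epoch() {
            Ok(epoch) if epoch >= target => return Ok(epoch),
            Ok(_) => {} // not there yet, keep polling
            Err(err) => eprintln!("current_epoch read failed (will retry): {err}"),
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out waiting for epoch {target}"));
        }
        sleep(poll);
    }
}
```

Because the counter reaching N is the completion signal for epoch N-1, a caller that depends on the genesis network DKG would wait for `target = 2` rather than polling key state.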

…te + read)

Same fix landed on dev via #1728, applied here on the off-chain-metadata file structure.

The consensus-output replay loop asserts each per-round table's record round equals the driver round (`dwallet_mpc_messages`). Tables added after mainnet-v1.1.8 are sparse when a dev binary replays a v1.1.8-written RocksDB (rolling binary swap), tripping the assertion. Gate write + read of each on the feature that introduced it:

- internal_presign_sessions: dwallet_internal_mpc_outputs, global_presign_requests, idle_status_updates
- noa_checkpoints: verified_system_checkpoint_messages, noa_observations, sui_chain_observation_updates

(`network_key_data_messages` is already removed on this branch by the off-chain work, so it needs no gate here.)

Validated by the cross-binary upgrade harness: mainnet-v1.1.8 -> dev rolling swap reaches protocol v4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

# Conflicts: # crates/ika-core/src/authority/authority_per_epoch_store.rs # crates/ika-core/src/dwallet_mpc/dwallet_mpc_service.rs

… feat/ika-upgrade-test

…ore driving The workload waited for epoch 1 (genesis, reached immediately) and drove the dWallet lifecycle right away, with a 30-minute epoch. At v4 the genesis network-key DKG is gated on the off-chain mpc_data freeze, whose ready-signal — with no next-epoch committee published yet at genesis — only fires at the 3/4*epoch_duration deadline (~22 min on a 30-min epoch). So the network key wasn't on-chain when the CLI tried to derive protocol parameters, and the driver's ~2-min retry budget gave up long before. Wait for epoch 2 instead of 1: the counter advancing to 2 is itself the completion signal for the genesis network DKG (reconfiguration into epoch 2 reshares that key, which can't happen until the DKG finished), so the key is guaranteed readable — and don't drive the lifecycle before then, when it could only fail. Shorten the epoch to 4 min so the freeze deadline (~3 min) is reachable while still clearing validator bring-up + announcement recording (~90s) and leaving the lifecycle room inside the next epoch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
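The deadline figures quoted in this commit follow directly from the 3/4 rule. A tiny sketch, with `freeze_deadline` as a hypothetical helper name:

```rust
use std::time::Duration;

/// Point in the epoch at which the ready-signal fires when no next-epoch
/// committee has been published: three quarters of the epoch duration.
fn freeze_deadline(epoch_duration: Duration) -> Duration {
    epoch_duration * 3 / 4
}
```

A 30-minute epoch gives 22.5 minutes (the "~22 min" above), far beyond a ~2-minute retry budget, while a 4-minute epoch gives 3 minutes.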

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… mid-epoch at the v3->v4 boundary The off-chain network-key blob overlay is keyed by key ID only, so the moment this epoch's mid-epoch reconfiguration finalizes locally, the syncer's merged key data starts carrying the output produced *for the next epoch's committee* — shares encrypted to next-epoch party IDs, which need not align with this epoch's (on-chain committee order is not stable across epochs). In steady-state v4 the cert anchor in `adopt_cert_verified_keys` rejects it (the prior epoch's handoff cert pins the output produced FOR the current epoch), but the first v4 epoch after the v3->v4 upgrade has no prior cert, fell into the cert-less boundary path, and adopted blindly — every validator then failed decryption with ClassGroup(Decryption) using this epoch's identity on next-epoch-dealt shares. Guard the boundary path the same way the cert anchor does: skip adoption when the reconfiguration output's digest matches the one this epoch's own reconfiguration session recorded (epoch-keyed perpetual digest, new point lookup). The next epoch's manager adopts and decrypts it with next-epoch identity at epoch start, as in steady state. Also hoist the last-failed check in the instantiation filter so it applies to every branch: previously the `Some(prev)` branch re-selected the failing bytes every poll tick (they differ from the last *successfully* instantiated ones by definition), re-running a doomed ~18s class-groups decrypt per tick and starving the service loop — checkpoints (including EndOfPublish) never certified, wedging epoch advance behind the decryption failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ejections The stale-gas recovery (drop the cached gas ref, floor the re-fetch at the rejected version) only ran when the rejection arrived inside `tx_response.errors`. But the fullnode also rejects at the JSON-RPC layer (ServerError -32002, "Transaction needs to be rebuilt ... object unavailable for consumption"), which surfaces as `Err` from `execute_transaction_block_with_effects` and bailed out before the recovery code — the cached gas ref survived, so every `retry_with_max_elapsed_time!` attempt rebuilt the byte-identical stale tx and re-rejected, wedging checkpoint delivery to Sui for the full one-hour window (observed: dwallet checkpoints stuck behind a gas coin advanced by the shared publisher address in the test cluster, blocking DKG settlement, mid-epoch reconfiguration, and epoch advance). Factor the recovery into `NotifierSubmitState::handle_possible_stale_gas_rejection` and apply it on both paths. Note `IkaError` derives strum's `AsRefStr`, so `err.as_ref()` yields only the variant name — match the `SuiClientTxFailureGeneric` payload to get the actual message. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…grade boundary Genesis at v3 (MIN) instead of v4 (MAX): a v4 *genesis* network DKG is rejected forever (PVSS keys only arrive through the next-committee-only off-chain assembly), so the supported path — and the one mainnet takes — is genesis v3, then upgrade into v4 via the capability vote. The test now waits for epoch 2 (v3 genesis DKG + reshare done), zeroes the buffer stake so the 4-validator vote tallies at bare quorum, waits for epoch 3, asserts protocol >= v4, and only then drives the DKG -> Presign -> Sign lifecycle. This exercises the cert-less v3->v4 reconfiguration-adoption boundary fixed in the previous commits. Remove HANDOFF.md — the reshare-decrypt bug it described is fixed and the workload test is green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
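The ordering this commit describes can be written down as data and checked. The `Step` enum below is a hypothetical stand-in for the scenario DSL (whose real steps include wait_for_epoch, set_buffer_stake, and expect_protocol_version); it only models the sequence, it does not drive a cluster.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Step {
    WaitForEpoch(u64),
    SetBufferStake(u64),
    ExpectProtocolVersionAtLeast(u64),
    RunWorkload,
}

/// The v3-genesis-then-upgrade flow, in the order given above.
fn v3_to_v4_workload_plan() -> Vec<Step> {
    vec![
        Step::WaitForEpoch(2),                 // v3 genesis DKG + reshare done
        Step::SetBufferStake(0),               // tally the vote at bare quorum
        Step::WaitForEpoch(3),                 // cross the upgrade boundary
        Step::ExpectProtocolVersionAtLeast(4), // confirm v4 is active
        Step::RunWorkload,                     // DKG -> Presign -> Sign
    ]
}

/// True when the workload runs only after a check that v4 (or later) is active.
fn workload_is_gated_on_v4(plan: &[Step]) -> bool {
    let check_at = plan
        .iter()
        .position(|s| matches!(s, Step::ExpectProtocolVersionAtLeast(v) if *v >= 4));
    let workload_at = plan.iter().position(|s| *s == Step::RunWorkload);
    matches!((check_at, workload_at), (Some(c), Some(w)) if c < w)
}
```

Expressing the plan as a list makes the dependency explicit: the lifecycle is only driven once the protocol assertion has passed.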

…e-gas recovery Clippy's `unnecessary_to_owned` suggests `err.as_ref()` for `&err.to_string()` here; it compiles because `IkaError` derives strum's `AsRefStr`, but that returns only the variant name — never the rejection markers — silently disabling the recovery. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
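The pitfall is easy to reproduce without strum. In this sketch `IkaErrorSketch` is a hypothetical one-variant stand-in for `IkaError`, its `AsRef<str>` impl is hand-written to return the variant name the way the commit says the derive does, and the `Display` text and the `looks_like_stale_gas` marker check are illustrative assumptions.

```rust
use std::fmt;

#[derive(Debug)]
enum IkaErrorSketch {
    SuiClientTxFailureGeneric(String),
}

// Hand-written stand-in for a derived, variant-name-only `AsRef<str>`.
impl AsRef<str> for IkaErrorSketch {
    fn as_ref(&self) -> &str {
        match self {
            IkaErrorSketch::SuiClientTxFailureGeneric(_) => "SuiClientTxFailureGeneric",
        }
    }
}

impl fmt::Display for IkaErrorSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            IkaErrorSketch::SuiClientTxFailureGeneric(msg) => write!(f, "tx failure: {msg}"),
        }
    }
}

/// Checks for a marker taken from the rejection text quoted in this PR.
fn looks_like_stale_gas(message: &str) -> bool {
    message.contains("needs to be rebuilt")
}
```

Passing `err.as_ref()` to the check sees only `"SuiClientTxFailureGeneric"` and never matches, whereas `&err.to_string()` or the matched payload carries the actual rejection message.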

…dators

Internal presign sessions get their sequence number from a single shared counter, assigned in iteration order over (network key id) x (curve) x (signature algorithm), and the sequence number is bound into the session identifier transcript. Both iteration sources were unordered:

- SUPPORTED_CURVES_TO_SIGNATURE_ALGORITHMS_TO_HASH_SCHEMES was a HashMap<u32, HashMap<u32, Vec<u32>>>; iteration order is random per process (RandomState), so each validator walked curves/algorithms in a different order;
- the agreed network key ids were iterated straight off a HashMap.

Each validator therefore derived *different* session identifiers for the same (curve, algorithm) work. Those sessions could never reach quorum, so they never completed, and the instantiated != completed gate then blocked that algorithm's pool top-ups for the entire epoch. Once a user presign request locked onto the starved pool, the EndOfPublish condition was unsatisfiable and the epoch could not advance.

Observed live: in a 4-validator run the validators logged three distinct top-up orders, and exactly the sequence numbers whose (curve, algorithm) assignment happened to agree on 3+ validators completed; the rest hung forever, the ECDSA pool stayed empty all epoch, and the run timed out. A previous green run was a per-process-seed coin flip.

Fix: BTreeMap at both nesting levels of the static, and collect the agreed key ids into a BTreeSet before the instantiation loop. Pre-existing bug from the internal sessions instantiation logic (#1638), not specific to this branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
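A sketch of why the ordered collections fix this: with `BTreeMap` and `BTreeSet`, the walk over (network key id) x (curve) x (signature algorithm) is a function of the contents alone, so every process hands out the same sequence numbers. The `u32` key ids and the `assign_sequence_numbers` helper are illustrative assumptions, not the code in ika-core.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// (network key id, curve, signature algorithm) -> sequence number.
type Assignment = BTreeMap<(u32, u32, u32), u64>;

/// Hand out sequence numbers from one shared counter, walking the inputs
/// in sorted order so the result does not depend on insertion order or on
/// any per-process hash seed.
fn assign_sequence_numbers(
    key_ids: &BTreeSet<u32>,
    curves_to_algorithms: &BTreeMap<u32, BTreeMap<u32, Vec<u32>>>,
) -> Assignment {
    let mut next = 0u64;
    let mut out = Assignment::new();
    for key_id in key_ids {
        for (curve, algorithms) in curves_to_algorithms {
            for algorithm in algorithms.keys() {
                out.insert((*key_id, *curve, *algorithm), next);
                next += 1;
            }
        }
    }
    out
}
```

Two validators that build the same maps in different insertion orders get identical assignments here; with `HashMap` and its default `RandomState`, the iteration order, and therefore the numbering, can differ per process.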

The reconfiguration overlay (next-epoch network key data computed off-chain during reconfiguration) was stored bare and adopted based on a produced-this-epoch digest guard. Instead, store the target epoch alongside the key data and adopt it only when it matches the epoch actually being entered: epoch-correct by construction, no guard heuristics.

- validator_metadata.rs: overlay entries carry the epoch they were computed for; lookups take the target epoch.
- authority_per_epoch_store.rs / authority_perpetual_tables.rs: persist and reload the epoch alongside the overlay data; drop the digest-guard plumbing.
- mpc_manager.rs / sui_syncer.rs: pass the target epoch through adoption and ignore overlay data for any other epoch.

Validated by a full workload run: genesis at v3, upgrade into v4 at epoch 3, v3 -> v4 reshare decrypts cleanly on all validators, DKG -> Presign -> Sign lifecycle green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
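The epoch-tagged adoption rule reduces to a single comparison. In this sketch `OverlayEntry` and `adoptable` are hypothetical names and the key data is an opaque byte vector; the real types in validator_metadata.rs are not shown in this PR text.

```rust
#[derive(Debug, Clone, PartialEq)]
struct OverlayEntry {
    /// Epoch whose committee this key data was computed for.
    target_epoch: u64,
    key_data: Vec<u8>,
}

/// Return the overlay data only when it was produced for the epoch
/// being entered; data for any other epoch is ignored.
fn adoptable(entry: &OverlayEntry, entering_epoch: u64) -> Option<&[u8]> {
    (entry.target_epoch == entering_epoch).then(|| entry.key_data.as_slice())
}
```

An entry computed for epoch 4 is invisible to a manager still running epoch 3, which is the case the earlier digest guard had to detect heuristically.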

Two upgrades to the cross-binary rolling-upgrade test:

1. Committee changes at every epoch boundary, with a different committee size each epoch (4 -> 3 -> 5 -> 4): a validator removal coincides with the v3 -> v4 protocol bump, two brand-new validators join via the full candidate -> stake -> activate flow (their class-groups keys registered on-chain, so the v4 reshare encrypts to parties that never held the key), and a final removal reshapes 5 -> 4. Every boundary after the first is a real reshare to a different party set.
   - sui_client.rs: request_add_validator / request_remove_validator / candidate registration helpers with explicit sender, shared version, and cap so the test can drive membership without touching the active wallet address.
   - network_config_builder.rs: configurable min_validator_count (the dip to 3 is below the protocol default of 4).
   - scenario.rs / cluster.rs / process.rs: join_validator, remove_validator, expect_committee_size scenario steps; spawn / stop / swap of individual validators on different binaries.
2. Rough per-protocol MPC timing report (mpc_timings.rs): scrape the MPC duration metrics from each validator after the v3 (old binary) and v4 (new binary, churned committee) workload runs, and print a comparison table at the end of the run. Informational and flagged, not asserted: wall-clock on a loaded developer machine is too noisy to gate on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…set post-upgrade The cross-binary churn test wedged at the v3 workload: genesis wrote the full GlobalPresignConfig, routing ECDSA presigns to the global path, which is served exclusively from the validators' internal presign pool — and that pool only fills once internal_presign_sessions activates at protocol v4. At v3 the presign was unservable forever. Genesis now takes GenesisGlobalPresignConfig (Full | Empty). Empty is the mainnet-v1.1.8 on-chain state (the config object must still exist — the coordinator reads it with a bare dynamic-field borrow). The cross-binary scenario uses Empty at genesis and a new SetGlobalPresignConfig step right after the v4 upgrade is confirmed — the same operational ordering a real mainnet rollout must follow: set_global_presign_config only after v4 activates, or ECDSA presigns stall network-wide until it does. Existing genesis-at-v4 tests keep Full (exact current behavior). Also rewords the cross_binary doc comment: the literal v1.1.8 binary failing on harness genesis is a registration-shape artifact (post-#1707 bundle bytes), not a production-direction gap — the new binary reads v1.1.8 keys via the shape-tolerant decode. Verified in the churn run: v3 workload completed (vs infinite wedge), v3→v4 vote passed, post-upgrade config set succeeded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…le coin transfer_one_ika took the publisher's first IKA coin — the genesis supply coin (ika_supply_id) — and transferred it whole to the workload user. The first churn run to stake a joiner after a workload exposed it: stake_ika splits the joining stake from ika_supply_id signed by the publisher, which no longer owned it ("Transaction was not signed by the correct sender"). A second workload on the same cluster would have failed the same way ("publisher owns no IKA"). Pay a fixed 1000-IKA allowance to the workload user instead; the supply coin stays with the publisher. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…s activates

The global-presign pipeline is gated on the internal_presign_sessions feature flag (protocol v4) on its consensus side, but session intake diverted every global presign request to the pool unconditionally. Below v4 that strands the request: no pool to serve it, no MPC session spawned, and the session is locked into its epoch. all_current_epoch_sessions_completed blocks advance_epoch, so the epoch can never end and v4 can never activate. Mainnet's GlobalPresignConfig is already populated (every production ECDSA presign routes to global), so a single presign request in flight after the upgrade restart would have wedged the network at v3 permanently.

Gate the diversion on the same flag: pre-activation, the request falls through to a user-requested MPC session, the v1.1.8 serving behavior, whose input (dwallet-output-less presign computation) and output (RespondDWalletPresign with no dwallet_id, VersionedPresignOutput::V2) paths are intact on this branch.

Caught by the new v118_upgrade rehearsal: genesis a 4-validator committee on the literal mainnet-v1.1.8 ika-node with the verified mainnet-shape populated GlobalPresignConfig, run the mainnet user flow at v3 (DKG with Universal output, global presign as a user session, sign), atomically swap all validators to the local build, and probe the pre-activation window with a workload that must complete its global presign at v3 via the fallback before the boundary. The run then crosses into v4 (the local binaries reshare the 1.1.8-created network key), serves a pool-backed global presign, and completes one more clean reshare.

Also corrects GenesisGlobalPresignConfig and cross_binary docs: Full is the actual verified mainnet on-chain state, Empty is a harness arrangement (and the only targeted-presign coverage), not the mainnet shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
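The gating rule in this commit amounts to routing on one flag. The sketch below is illustrative only: `PresignRoute` and `route_global_presign` are hypothetical, and modelling `internal_presign_sessions_enabled` as "protocol version is at least 4" is an assumption drawn from the text's statement that the feature activates at v4.

```rust
#[derive(Debug, PartialEq)]
enum PresignRoute {
    /// Served from the validators' internal presign pool (v4 onward).
    InternalPool,
    /// Falls through to a user-requested MPC session (v1.1.8 behavior).
    UserSession,
}

/// Assumption: the feature flag is on from protocol v4.
fn internal_presign_sessions_enabled(protocol_version: u64) -> bool {
    protocol_version >= 4
}

/// Divert a global presign request to the pool only once the pool can
/// actually serve it; before activation, keep the old serving path.
fn route_global_presign(protocol_version: u64) -> PresignRoute {
    if internal_presign_sessions_enabled(protocol_version) {
        PresignRoute::InternalPool
    } else {
        PresignRoute::UserSession
    }
}
```

With intake and consensus keyed to the same flag, a request arriving at v3 is never parked behind a pool that does not exist yet.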

…ith window-delta tables

The scraped MPC duration metrics are cumulative per process, so a later snapshot blends everything the validator ever ran: a v3-protocol reshare and a v4-protocol reshare land in one row and the ratio table reads 1.00x. Add a window table to each comparison: (sum2-sum1)/(completions2-completions1) between consecutive snapshots isolates just the work done between them (skipped across a swap, where the counter reset makes the delta negative).

Extend the v118 rehearsal past epoch 4 with two new snapshots:

- v4-reshare: the first reshare executed under the v4 reconfiguration math (reconfiguration_message_version = 3, PVSS HPKE); the epoch 2->3 reshare still ran the v3 protocol, so the previous run never measured v4 reconfiguration cleanly;
- local-v4-settled: a full lifecycle after the pools finished their initial fill, pricing v4 DKG / pool presign / sign without the boundary work.

Run is green (1105s): the v4-math reshare window prices at 53.2s/7.4s/30.8s/9.6s per round vs the local binary's v3-math reshare at 9.5s/2.5s/8.6s/2.8s, with continuous internal-presign pool top-ups sharing the cores.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
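The window formula can be sketched as a small helper. `Snapshot` and `window_mean` are hypothetical names; the fields stand for the cumulative duration sum and completion count scraped from one validator.

```rust
#[derive(Debug, Clone, Copy)]
struct Snapshot {
    /// Cumulative seconds spent, as reported by the process so far.
    sum_secs: f64,
    /// Cumulative number of completed rounds.
    completions: u64,
}

/// Mean duration of only the work done between two snapshots:
/// (sum2 - sum1) / (completions2 - completions1).
/// Returns None when nothing completed in the window or when a counter
/// went backwards, which is what a process restart (binary swap) causes.
fn window_mean(prev: Snapshot, next: Snapshot) -> Option<f64> {
    if next.completions <= prev.completions || next.sum_secs < prev.sum_secs {
        return None;
    }
    let completed = (next.completions - prev.completions) as f64;
    Some((next.sum_secs - prev.sum_secs) / completed)
}
```

For cumulative readings of (20.0 s, 2 completions) then (80.0 s, 4 completions), the cumulative mean is 20.0 s per round but the window mean is 30.0 s, which is the signal the blended row was hiding.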

# Conflicts: # crates/ika-core/src/dwallet_mpc/mpc_manager.rs

ycscaly requested a review from Iamknownasfesal as a code owner June 5, 2026 09:14

github-actions Bot added the Area: Rust label Jun 5, 2026

ycscaly mentioned this pull request Jun 7, 2026

Off-chain validator metadata + EndOfPublishV2 #1721

Open

8 tasks

ycscaly and others added 11 commits June 9, 2026 08:13

ycscaly changed the base branch from dev to feat/off-chain-metadata-v2 June 9, 2026 08:18

ycscaly force-pushed the feat/ika-upgrade-test branch from 8672c96 to f858d37 Compare June 9, 2026 08:19

ycscaly and others added 13 commits June 9, 2026 13:11

Merge remote-tracking branch 'origin/dev' into feat/ika-upgrade-test

f721304

# Conflicts: # crates/ika-core/src/authority/authority_per_epoch_store.rs # crates/ika-core/src/dwallet_mpc/dwallet_mpc_service.rs

Merge remote-tracking branch 'origin/feat/off-chain-metadata-v2' into…

7d87f1a

… feat/ika-upgrade-test

docs(ika-upgrade-test): session handoff for workload reshare-decrypt bug

ce81f09

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ycscaly and others added 4 commits June 10, 2026 22:01

Merge origin/dev (PRs #1732, #1733 squash-merged back)

39b4b46

# Conflicts: # crates/ika-core/src/dwallet_mpc/mpc_manager.rs


ycscaly wants to merge 28 commits into feat/off-chain-metadata-v2 from feat/ika-upgrade-test
