test(ika-upgrade-test): out-of-process cross-binary upgrade harness#1727
Open
ycscaly wants to merge 28 commits into
8 tasks
Update: hardening + findings from a real
|
Add the research plan and counter-proposal for an out-of-process harness that runs validators on different compiled binaries (dev vs mainnet-v1.1.8) and tests them across epoch boundaries. Add a CLAUDE.md rule to express plans as ordering/dependencies, never durations. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…harness
New additive crate (no changes to ika-node/ika-swarm). Spawns real
ika-validator child processes against an external sui localnet, driven via
the existing admin RPC and the coordinator contract; epochs advance on a
short genesis epoch_duration_ms (force-close-epoch is a dead admin constant).
- binary.rs: BinarySpec (path/tag/sha/branch) + sha-keyed build cache via
git worktree, honoring each commit's pinned toolchain.
- process.rs: ValidatorProcess (spawn/stop/swap_binary, admin-RPC health).
- sui.rs: external `sui start --with-faucet --force-regenesis` localnet.
- cluster.rs: bring-up reusing init_ika_on_sui + ValidatorConfigBuilder +
notifier fullnode; NodeConfig -> YAML -> child; on-chain
wait_for_epoch / protocol-version reads via IkaClient.
- workload.rs / scenario.rs / bin: typed contracts + imperative DSL + CLI;
submission wiring and scenario runner land next.
Compiles and clippy-clean. Runtime go/no-go (epochs advance on real
binaries) and the workload driver are the next tasks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reach epoch 2
The smoke test brings up an external sui localnet + 4 real ika-validator child processes + a notifier, completes network DKG, and advances to epoch 2 entirely out-of-process. test result: ok (396s on an 8-core host).
Fixes found by driving it green:
- sui.rs: wait for the faucet (:9123), not just the RPC; init_ika_on_sui hits the faucet immediately and it comes up a beat after the RPC.
- binary.rs / prebuilt bins: build ika-validator with --no-default-features to drop `enforce-minimum-cpu` (panics on < 16-core hosts).
- cluster.rs: chdir into the per-run base around init_ika_on_sui so the chain-keyed Pub.localnet.toml dies with the run (stale pubfile rejected the next regenesis chain).
- process.rs: health-check and admin reads use response *text*, not JSON; the ika admin server returns (StatusCode, String) debug text. This was why a healthy validator was never recognized.
- workload.rs: real Curve25519 DKG submission path (compiles); network-key bytes via the full-data fetch (the key stores a TableVec, not bytes).
- scenario.rs: real imperative runner (start/wait/swap/assert).
Run: RUN_UPGRADE_SMOKE=1 IKA_VALIDATOR_BIN=... IKA_NOTIFIER_BIN=... SUI_BIN=... cargo test --release -p ika-upgrade-test --test smoke -- --nocapture
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
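The faucet fix above amounts to waiting for every endpoint the bootstrap touches, not only the first one that answers. A minimal std-only sketch of that readiness check follows; `wait_for_all` is a hypothetical helper name, not the harness's actual function in sui.rs, and a plain TCP connect is assumed to be a good enough readiness signal.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Returns true once every address accepts a TCP connection, or false if
/// the deadline passes first. Hypothetical helper, for illustration only.
pub fn wait_for_all(addrs: &[SocketAddr], deadline: Duration) -> bool {
    let start = Instant::now();
    let mut pending: Vec<SocketAddr> = addrs.to_vec();
    loop {
        // Keep only the endpoints that still refuse (or time out on) a connect.
        pending.retain(|addr| {
            TcpStream::connect_timeout(addr, Duration::from_millis(200)).is_err()
        });
        if pending.is_empty() {
            return true;
        }
        if start.elapsed() >= deadline {
            return false;
        }
        sleep(Duration::from_millis(100));
    }
}
```

For the localnet described here, a caller would pass both the RPC address (port 9000) and the faucet address (port 9123) and only call `init_ika_on_sui` after the function returns true.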
…ad tests
- workload.rs: issue_dkg_and_confirm submits a real Curve25519 user DKG and blocks until the coordinator's user completed_sessions_count rises (honest on-chain completion; no per-dWallet Move decoding, for which no Rust type exists). user_session_counts reads DWalletCoordinatorInnerV1.sessions_manager.user_sessions_keeper.
- tests/workload.rs: single-binary cluster + one DKG completes on-chain.
- tests/cross_binary.rs: v1.1.8 -> dev rolling upgrade; genesis v3, swap all four to dev, assert the protocol vote advances to v4 (gated: both binaries support v3, only all-dev supports v4).
Tests are opt-in (RUN_WORKLOAD_TEST / RUN_CROSS_BINARY); all compile + clippy clean. cross_binary awaits a built v1.1.8 ika-node binary.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
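The gating in the cross_binary bullet ("both binaries support v3, only all-dev supports v4") can be sketched as a small capability tally. This is a simplified model under stated assumptions: each validator advertises an inclusive version range, votes are counted per validator rather than by stake, and `Capability` and `next_protocol_version` are hypothetical names rather than ika's types.

```rust
/// Inclusive range of protocol versions one validator binary advertises.
#[derive(Clone, Copy, Debug)]
pub struct Capability {
    pub min: u64,
    pub max: u64,
}

/// Highest version above `current` that at least `required` validators
/// support; falls back to `current` when no higher version has enough votes.
pub fn next_protocol_version(current: u64, caps: &[Capability], required: usize) -> u64 {
    let highest = caps.iter().map(|c| c.max).max().unwrap_or(current);
    (current + 1..=highest)
        .rev()
        .find(|&v| caps.iter().filter(|c| c.min <= v && v <= c.max).count() >= required)
        .unwrap_or(current)
}
```

With four validators and `required = 4`, a committee of v3-only binaries, or a mix of v3-only and v3..v4 binaries, stays at 3; only when all four advertise v3..v4 does the function return 4.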
Scenario::with_base_dir keeps node logs after a failure (default temp dir is cleaned on drop, which hid the v1.1.8 boot panic). with_epoch_timeout for slower heterogeneous runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cross_binary test passes end-to-end (722s): 4 out-of-process validators boot on a v3-only binary, complete network DKG, then all swap to dev (v3..v4) and the capability vote advances v3 -> v4 at the next epoch. Exercises the protocol-vote arithmetic, mid-epoch reconfiguration across the swap, mixed-committee wire compat, and on-disk compat (restart on new binary, old RocksDB).
Tuning that made it pass: 10-min epochs and swap-all-then-one-transition, to avoid the known sui_executor gas-coin-contention epoch wedge (short epochs + swap churn froze the notifier's advance-epoch executor) and to keep each swap clear of the mid-epoch reconfiguration window. Scenario gains with_epoch_duration_ms / with_epoch_timeout.
The OLD binary is a dev build pinned to MAX_PROTOCOL_VERSION=3 (same crypto as dev, differs only in advertised version). The literal mainnet-v1.1.8 ika-node is crypto-incompatible (inkrypto vs cryptography-private class_groups; v4 key-shape change) and cannot share a committee with dev. This finding is documented in the test and confirms the dual-pin premise of docs/plan-update-crypto-latest.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
workload.rs: issue_dkg returns the txn digest (completion is confirmed via the
coordinator session counter, not a per-dWallet read); protocol-pp fetch retries
on a partial-TableVec decode error. tests/workload.rs (GREEN) proves the
submission path end-to-end: protocol params from the on-chain network key,
centralized Curve25519 party, coordinator txn executes.
KNOWN GAP (documented in the test + results doc): the coordinator ignores the
emitted event ("not a DWalletSessionEvent"), so the session does not complete —
the driver must call register_encryption_key before the DKG (as the TS SDK
does). Presign/Sign build on a completed DKG and are not implemented.
docs/cross-binary-upgrade-testing-results.md summarizes what was built, the
green go/no-go + cross-binary(v3->v4) runs, the v1.1.8 crypto-incompat finding,
the epoch-wedge tuning, and the workload gap.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on-chain
tests/workload.rs passes (~415s): a user dWallet completes register-encryption-key -> DKG(Active) -> presign(verified) -> sign, the sign confirmed on-chain via the coordinator's user completed_sessions_count. Proves the session-lifecycle invariant (sessions started in an epoch complete; no silent drops). The driver orchestrates the canonical `ika` CLI (the tested Rust client) rather than re-deriving the user-side 2PC.
Making it reliably green surfaced real system properties, all handled:
- dedicated, separately-funded user (faucet SUI + IKA transfer): sharing the publisher key with the notifier causes coin-lock contention;
- register-encryption-key before create (encrypted DKG borrows the user key from the coordinator);
- v4 genesis (internal_presign_sessions is a v4 feature);
- 30-min epoch so the lifecycle runs clear of the mid-epoch reconfiguration;
- confirm sign via on-chain completed-count, not the CLI's racy --wait poll.
Adds shared-crypto + fastcrypto deps for the IKA-funding transfer.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…it, buffer-stake override
Hardening surfaced while driving a real mainnet-v1.1.8 -> dev rolling swap:
- process.rs: swap now stops the node with SIGTERM (which ika-node's `wait_termination` handles for an orderly shutdown) and waits for clean exit, with SIGKILL only as a fallback. The previous hard SIGKILL interrupted the node mid-consensus-round and left dwallet-MPC replay state partial on disk, which crashed the next binary on replay (`consensus round mismatch ...`). A binary swap is a planned restart, not a crash, so it must be graceful.
- cluster.rs: document that the epoch counter advancing to N is itself the completion signal for epoch N-1 (reconfiguration into a new epoch is gated on that epoch's network-key MPC finishing), so callers wait for the epoch *after* the work they depend on rather than polling key state.
- scenario.rs / process.rs: add a `set_buffer_stake(bps)` step (POST /set-override-buffer-stake) so a quorum, not unanimity, advances the protocol version. With n=4 the default 50% buffer rounds up to requiring all four votes; a rolling swap can leave one validator's fresh capability uncommitted at the boundary tally.
- cross_binary.rs: wait for epoch 2 before swapping (the genesis network DKG runs during epoch 1, so epoch 2 guarantees it finished under the old binary), drop the buffer stake to a quorum, then wait for epoch 3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
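The buffer-stake arithmetic in the `set_buffer_stake(bps)` bullet can be worked through with a short sketch. It assumes a Sui-style convention that the commit does not spell out: 10,000 units of total voting power, a quorum threshold of 6,667, equal stake per validator, and a buffer computed as a rounded-up fraction of the stake above quorum. The constants and function names are illustrative only.

```rust
/// Assumed Sui-style voting-power convention (not confirmed by the commit).
pub const TOTAL_VOTING_POWER: u64 = 10_000;
pub const QUORUM_THRESHOLD: u64 = 6_667;

/// Stake required to advance the protocol version for a buffer given in bps.
pub fn effective_threshold(buffer_stake_bps: u64) -> u64 {
    let above_quorum = TOTAL_VOTING_POWER - QUORUM_THRESHOLD;
    // Round the buffer up, so any non-zero buffer raises the bar.
    let buffer = (above_quorum * buffer_stake_bps + 9_999) / 10_000;
    QUORUM_THRESHOLD + buffer
}

/// Votes needed from `n` equally weighted validators (n must be non-zero).
pub fn votes_needed(n: u64, buffer_stake_bps: u64) -> u64 {
    assert!(n > 0, "committee must not be empty");
    let per_validator = TOTAL_VOTING_POWER / n;
    let threshold = effective_threshold(buffer_stake_bps);
    // Ceiling division, capped at the committee size.
    ((threshold + per_validator - 1) / per_validator).min(n)
}
```

Under these assumptions a 5,000 bps buffer puts the threshold at 8,334, which three validators at 2,500 each cannot reach, so `votes_needed(4, 5_000)` is 4. Dropping the buffer to 0 leaves the bare quorum of 6,667, and `votes_needed(4, 0)` is 3.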
…t; gate internal-presign read on v4
Wire the cross-binary upgrade path against the off-chain-metadata branch so a mixed-version committee survives a rolling binary swap:
- verify_validator_keys decodes whatever class-groups key shape is on-chain (bare mainnet-v1.1.8 `ClassGroupsEncryptionKeyAndProof` or the post-bump combined `ValidatorEncryptionKeysAndProofs`) via `decode_validator_encryption_keys`, comparing only the class-groups component that identifies the seed. PVSS keys are verified off-chain on the assembly path.
- validator_initialization_config publishes the BARE mainnet-v1.1.8 shape on-chain (the richer bundle travels off-chain via validator P2P), so a v1.1.8 binary can still decode the record during the upgrade window.
- The internal-presign output read is gated on `internal_presign_sessions_enabled()` (a v4 feature) so a pre-v4 node mid rolling-upgrade skips the sparse stream instead of panicking on the dense per-round assertion.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- bin default scenario `rolling_majority_then_minority` mirrors the proven-good config (10-min epochs + 1800s wait timeout + `set_buffer_stake(0)` before the upgrade-crossing wait) so the v3->v4 vote can land under n=4.
- `wait_for_epoch` logs a failed `current_epoch` read instead of silently treating it as epoch 0 until the deadline.
- Document that the workload sign-completion check (coordinator user `completed_sessions_count`) is sound only because the harness drives a single dedicated user with one sign in flight.
- results doc: caveat the cross_binary GREEN row (version-only swap; the real v1.1.8 crypto-boundary swap is not exercised) and note the single-instance / fixed-port (9000/9123) constraint.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
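The `wait_for_epoch` change above (log a failed read rather than treat it as epoch 0) can be illustrated with a generic polling loop. This is a sketch, not the harness's implementation: the real function reads the epoch through `IkaClient`, whereas this version takes a hypothetical closure so it stays self-contained.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `read_epoch` until it reports at least `target`. Failed reads are
/// logged and retried; they are never interpreted as "epoch 0".
pub fn wait_for_epoch<F>(
    mut read_epoch: F,
    target: u64,
    timeout: Duration,
    poll: Duration,
) -> Result<u64, String>
where
    F: FnMut() -> Result<u64, String>,
{
    let start = Instant::now();
    loop {
        match read_epoch() {
            Ok(epoch) if epoch >= target => return Ok(epoch),
            Ok(_) => {} // Not there yet; keep polling.
            Err(e) => eprintln!("current_epoch read failed (will retry): {e}"),
        }
        if start.elapsed() >= timeout {
            return Err(format!("timed out waiting for epoch {target}"));
        }
        sleep(poll);
    }
}
```

Keeping the error branch separate from the "epoch too low" branch means a flaky RPC shows up in the logs immediately instead of looking like a cluster stuck at genesis.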
8672c96 to f858d37
…te + read)
Same fix landed on dev via #1728, applied here on the off-chain-metadata file structure.
The consensus-output replay loop asserts each per-round table's record round equals the driver round (`dwallet_mpc_messages`). Tables added after mainnet-v1.1.8 are sparse when a dev binary replays a v1.1.8-written RocksDB (rolling binary swap), tripping the assertion. Gate write + read of each on the feature that introduced it:
- internal_presign_sessions: dwallet_internal_mpc_outputs, global_presign_requests, idle_status_updates
- noa_checkpoints: verified_system_checkpoint_messages, noa_observations, sui_chain_observation_updates
(`network_key_data_messages` is already removed on this branch by the off-chain work, so it needs no gate here.)
Validated by the cross-binary upgrade harness: mainnet-v1.1.8 -> dev rolling swap reaches protocol v4.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
# crates/ika-core/src/authority/authority_per_epoch_store.rs
# crates/ika-core/src/dwallet_mpc/dwallet_mpc_service.rs
… feat/ika-upgrade-test
…ore driving The workload waited for epoch 1 (genesis, reached immediately) and drove the dWallet lifecycle right away, with a 30-minute epoch. At v4 the genesis network-key DKG is gated on the off-chain mpc_data freeze, whose ready-signal — with no next-epoch committee published yet at genesis — only fires at the 3/4*epoch_duration deadline (~22 min on a 30-min epoch). So the network key wasn't on-chain when the CLI tried to derive protocol parameters, and the driver's ~2-min retry budget gave up long before. Wait for epoch 2 instead of 1: the counter advancing to 2 is itself the completion signal for the genesis network DKG (reconfiguration into epoch 2 reshares that key, which can't happen until the DKG finished), so the key is guaranteed readable — and don't drive the lifecycle before then, when it could only fail. Shorten the epoch to 4 min so the freeze deadline (~3 min) is reachable while still clearing validator bring-up + announcement recording (~90s) and leaving the lifecycle room inside the next epoch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
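The timing argument in this commit is simple arithmetic on the epoch length, sketched below. The 3/4 factor and the roughly 2-minute retry budget come from the commit message; the function names are hypothetical.

```rust
use std::time::Duration;

/// Deadline at which the freeze ready-signal fires when nothing triggers it
/// earlier: three quarters of the epoch, per the commit message.
pub fn freeze_deadline(epoch: Duration) -> Duration {
    epoch * 3 / 4
}

/// True when a driver that retries for `retry_budget` gives up before the
/// freeze deadline, i.e. before the network key can be on-chain.
pub fn gives_up_too_early(epoch: Duration, retry_budget: Duration) -> bool {
    retry_budget < freeze_deadline(epoch)
}
```

For a 30-minute epoch the deadline is 1,350 s (22.5 minutes), far beyond a 120 s retry budget. For a 4-minute epoch it is 180 s, which is still longer than 120 s, so shortening the epoch alone is not sufficient; waiting for epoch 2 before driving the lifecycle is what removes the race.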
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… mid-epoch at the v3->v4 boundary The off-chain network-key blob overlay is keyed by key ID only, so the moment this epoch's mid-epoch reconfiguration finalizes locally, the syncer's merged key data starts carrying the output produced *for the next epoch's committee* — shares encrypted to next-epoch party IDs, which need not align with this epoch's (on-chain committee order is not stable across epochs). In steady-state v4 the cert anchor in `adopt_cert_verified_keys` rejects it (the prior epoch's handoff cert pins the output produced FOR the current epoch), but the first v4 epoch after the v3->v4 upgrade has no prior cert, fell into the cert-less boundary path, and adopted blindly — every validator then failed decryption with ClassGroup(Decryption) using this epoch's identity on next-epoch-dealt shares. Guard the boundary path the same way the cert anchor does: skip adoption when the reconfiguration output's digest matches the one this epoch's own reconfiguration session recorded (epoch-keyed perpetual digest, new point lookup). The next epoch's manager adopts and decrypts it with next-epoch identity at epoch start, as in steady state. Also hoist the last-failed check in the instantiation filter so it applies to every branch: previously the `Some(prev)` branch re-selected the failing bytes every poll tick (they differ from the last *successfully* instantiated ones by definition), re-running a doomed ~18s class-groups decrypt per tick and starving the service loop — checkpoints (including EndOfPublish) never certified, wedging epoch advance behind the decryption failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ejections The stale-gas recovery (drop the cached gas ref, floor the re-fetch at the rejected version) only ran when the rejection arrived inside `tx_response.errors`. But the fullnode also rejects at the JSON-RPC layer (ServerError -32002, "Transaction needs to be rebuilt ... object unavailable for consumption"), which surfaces as `Err` from `execute_transaction_block_with_effects` and bailed out before the recovery code — the cached gas ref survived, so every `retry_with_max_elapsed_time!` attempt rebuilt the byte-identical stale tx and re-rejected, wedging checkpoint delivery to Sui for the full one-hour window (observed: dwallet checkpoints stuck behind a gas coin advanced by the shared publisher address in the test cluster, blocking DKG settlement, mid-epoch reconfiguration, and epoch advance). Factor the recovery into `NotifierSubmitState::handle_possible_stale_gas_rejection` and apply it on both paths. Note `IkaError` derives strum's `AsRefStr`, so `err.as_ref()` yields only the variant name — match the `SuiClientTxFailureGeneric` payload to get the actual message. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…grade boundary Genesis at v3 (MIN) instead of v4 (MAX): a v4 *genesis* network DKG is rejected forever (PVSS keys only arrive through the next-committee-only off-chain assembly), so the supported path — and the one mainnet takes — is genesis v3, then upgrade into v4 via the capability vote. The test now waits for epoch 2 (v3 genesis DKG + reshare done), zeroes the buffer stake so the 4-validator vote tallies at bare quorum, waits for epoch 3, asserts protocol >= v4, and only then drives the DKG -> Presign -> Sign lifecycle. This exercises the cert-less v3->v4 reconfiguration-adoption boundary fixed in the previous commits. Remove HANDOFF.md — the reshare-decrypt bug it described is fixed and the workload test is green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-gas recovery Clippy's `unnecessary_to_owned` suggests `err.as_ref()` for `&err.to_string()` here; it compiles because `IkaError` derives strum's `AsRefStr`, but that returns only the variant name — never the rejection markers — silently disabling the recovery. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
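The pitfall described here is easy to reproduce without strum. The sketch below hand-writes the `AsRef<str>` impl that strum's `AsRefStr` derive generates (variant name only) on a hypothetical `SubmitError` enum standing in for `IkaError`; the rejection markers are taken from the error text quoted in the earlier stale-gas commit and are illustrative.

```rust
use std::fmt;

/// Hypothetical stand-in for the real error type.
#[derive(Debug)]
pub enum SubmitError {
    TxFailureGeneric(String),
    Other(String),
}

/// Hand-written equivalent of what `#[derive(AsRefStr)]` produces:
/// the variant name, never the payload.
impl AsRef<str> for SubmitError {
    fn as_ref(&self) -> &str {
        match self {
            SubmitError::TxFailureGeneric(_) => "TxFailureGeneric",
            SubmitError::Other(_) => "Other",
        }
    }
}

impl fmt::Display for SubmitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SubmitError::TxFailureGeneric(msg) => write!(f, "transaction failed: {msg}"),
            SubmitError::Other(msg) => write!(f, "error: {msg}"),
        }
    }
}

/// Match on the payload, so the check sees the actual rejection message.
pub fn is_stale_gas_rejection(err: &SubmitError) -> bool {
    match err {
        SubmitError::TxFailureGeneric(msg) => {
            msg.contains("needs to be rebuilt") || msg.contains("unavailable for consumption")
        }
        SubmitError::Other(_) => false,
    }
}
```

Searching `err.as_ref()` for a marker can never succeed because the string is just `"TxFailureGeneric"`, whereas `err.to_string()` or a direct match on the payload does contain the message.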
…dators
Internal presign sessions get their sequence number from a single shared counter, assigned in iteration order over (network key id) x (curve) x (signature algorithm), and the sequence number is bound into the session identifier transcript. Both iteration sources were unordered:
- SUPPORTED_CURVES_TO_SIGNATURE_ALGORITHMS_TO_HASH_SCHEMES was a HashMap<u32, HashMap<u32, Vec<u32>>>: iteration order is random per process (RandomState), so each validator walked curves/algorithms in a different order;
- the agreed network key ids were iterated straight off a HashMap.
Each validator therefore derived *different* session identifiers for the same (curve, algorithm) work. Those sessions could never reach quorum, so they never completed, and the instantiated != completed gate then blocked that algorithm's pool top-ups for the entire epoch. Once a user presign request locked onto the starved pool, the EndOfPublish condition was unsatisfiable and the epoch could not advance.
Observed live: in a 4-validator run the validators logged three distinct top-up orders, and exactly the sequence numbers whose (curve, algorithm) assignment happened to agree on 3+ validators completed; the rest hung forever, the ECDSA pool stayed empty all epoch, and the run timed out. A previous green run was a per-process-seed coin flip.
Fix: BTreeMap at both nesting levels of the static, and collect the agreed key ids into a BTreeSet before the instantiation loop.
Pre-existing bug from the internal sessions instantiation logic (#1638), not specific to this branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
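The fix relies on `BTreeMap` and `BTreeSet` iterating in key order regardless of insertion order or process, so every validator assigns the same sequence number to the same (key id, curve, algorithm) triple. A minimal sketch follows; the `u64` key ids, the `CurveTable` alias, and `assign_sequence_numbers` are illustrative names, not the crate's actual types.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// curve -> signature algorithm -> hash schemes, ordered at both levels.
pub type CurveTable = BTreeMap<u32, BTreeMap<u32, Vec<u32>>>;

/// Assigns one sequence number per (key id, curve, algorithm), walking all
/// three levels in sorted order. Returns (key id, curve, algorithm, seq).
pub fn assign_sequence_numbers(
    key_ids: &BTreeSet<u64>,
    table: &CurveTable,
) -> Vec<(u64, u32, u32, u64)> {
    let mut next_seq = 0u64;
    let mut out = Vec::new();
    for &key_id in key_ids {
        for (&curve, algorithms) in table {
            for &algorithm in algorithms.keys() {
                out.push((key_id, curve, algorithm, next_seq));
                next_seq += 1;
            }
        }
    }
    out
}
```

Two processes that build the table and key set by inserting entries in different orders still produce identical output. With `HashMap` and its per-process `RandomState` seed, the same loop could number the triples differently on each validator.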
The reconfiguration overlay (next-epoch network key data computed off-chain during reconfiguration) was stored bare and adopted based on a produced-this-epoch digest guard. Instead, store the target epoch alongside the key data and adopt it only when it matches the epoch actually being entered: epoch-correct by construction, no guard heuristics.
- validator_metadata.rs: overlay entries carry the epoch they were computed for; lookups take the target epoch.
- authority_per_epoch_store.rs / authority_perpetual_tables.rs: persist and reload the epoch alongside the overlay data; drop the digest-guard plumbing.
- mpc_manager.rs / sui_syncer.rs: pass the target epoch through adoption and ignore overlay data for any other epoch.
Validated by a full workload run: genesis at v3, upgrade into v4 at epoch 3, v3 -> v4 reshare decrypts cleanly on all validators, DKG -> Presign -> Sign lifecycle green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
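The "epoch-correct by construction" idea can be shown in a few lines: tag the overlay with its target epoch and make the lookup take the epoch being entered. The struct and function below are hypothetical simplifications of what validator_metadata.rs is described as doing, with raw bytes standing in for the key data.

```rust
/// Next-epoch key data tagged with the epoch it was computed for.
#[derive(Debug, Clone, PartialEq)]
pub struct ReconfigurationOverlay {
    pub target_epoch: u64,
    pub key_data: Vec<u8>,
}

/// Returns the overlay bytes only when they were produced for the epoch
/// being entered; data for any other epoch is ignored.
pub fn overlay_for_epoch(
    overlay: Option<&ReconfigurationOverlay>,
    entering_epoch: u64,
) -> Option<&[u8]> {
    overlay
        .filter(|o| o.target_epoch == entering_epoch)
        .map(|o| o.key_data.as_slice())
}
```

A validator still running epoch 3 that already holds an overlay computed for epoch 4 gets `None` from `overlay_for_epoch(.., 3)`, so it keeps decrypting with its current shares, and the same overlay is adopted once epoch 4 starts.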
Two upgrades to the cross-binary rolling-upgrade test:
1. Committee changes at every epoch boundary, with a different committee
size each epoch (4 -> 3 -> 5 -> 4): a validator removal coincides with
the v3 -> v4 protocol bump, two brand-new validators join via the full
candidate -> stake -> activate flow (their class-groups keys registered
on-chain, so the v4 reshare encrypts to parties that never held the
key), and a final removal reshapes 5 -> 4. Every boundary after the
first is a real reshare to a different party set.
- sui_client.rs: request_add_validator / request_remove_validator /
candidate registration helpers with explicit sender, shared version,
and cap so the test can drive membership without touching the active
wallet address.
- network_config_builder.rs: configurable min_validator_count (the dip
to 3 is below the protocol default of 4).
- scenario.rs / cluster.rs / process.rs: join_validator,
remove_validator, expect_committee_size scenario steps; spawn /
stop / swap of individual validators on different binaries.
2. Rough per-protocol MPC timing report (mpc_timings.rs): scrape the MPC
duration metrics from each validator after the v3 (old binary) and v4
(new binary, churned committee) workload runs, and print a comparison
table at the end of the run. Informational and flagged, not asserted —
wall-clock on a loaded developer machine is too noisy to gate on.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…set post-upgrade The cross-binary churn test wedged at the v3 workload: genesis wrote the full GlobalPresignConfig, routing ECDSA presigns to the global path, which is served exclusively from the validators' internal presign pool — and that pool only fills once internal_presign_sessions activates at protocol v4. At v3 the presign was unservable forever. Genesis now takes GenesisGlobalPresignConfig (Full | Empty). Empty is the mainnet-v1.1.8 on-chain state (the config object must still exist — the coordinator reads it with a bare dynamic-field borrow). The cross-binary scenario uses Empty at genesis and a new SetGlobalPresignConfig step right after the v4 upgrade is confirmed — the same operational ordering a real mainnet rollout must follow: set_global_presign_config only after v4 activates, or ECDSA presigns stall network-wide until it does. Existing genesis-at-v4 tests keep Full (exact current behavior). Also rewords the cross_binary doc comment: the literal v1.1.8 binary failing on harness genesis is a registration-shape artifact (post-#1707 bundle bytes), not a production-direction gap — the new binary reads v1.1.8 keys via the shape-tolerant decode. Verified in the churn run: v3 workload completed (vs infinite wedge), v3→v4 vote passed, post-upgrade config set succeeded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le coin
transfer_one_ika took the publisher's first IKA coin — the genesis supply
coin (ika_supply_id) — and transferred it whole to the workload user. The
first churn run to stake a joiner after a workload exposed it: stake_ika
splits the joining stake from ika_supply_id signed by the publisher, which
no longer owned it ("Transaction was not signed by the correct sender").
A second workload on the same cluster would have failed the same way
("publisher owns no IKA").
Pay a fixed 1000-IKA allowance to the workload user instead; the supply
coin stays with the publisher.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s activates
The global-presign pipeline is gated on the internal_presign_sessions feature flag (protocol v4) on its consensus side, but session intake diverted every global presign request to the pool unconditionally. Below v4 that strands the request: no pool to serve it, no MPC session spawned, and the session is locked into its epoch. all_current_epoch_sessions_completed blocks advance_epoch, so the epoch can never end and v4 can never activate. Mainnet's GlobalPresignConfig is already populated (every production ECDSA presign routes to global), so a single presign request in flight after the upgrade restart would have wedged the network at v3 permanently.
Gate the diversion on the same flag: pre-activation, the request falls through to a user-requested MPC session, which is the v1.1.8 serving behavior, whose input (dwallet-output-less presign computation) and output (RespondDWalletPresign with no dwallet_id, VersionedPresignOutput::V2) paths are intact on this branch.
Caught by the new v118_upgrade rehearsal: genesis a 4-validator committee on the literal mainnet-v1.1.8 ika-node with the verified mainnet-shape populated GlobalPresignConfig, run the mainnet user flow at v3 (DKG with Universal output, global presign as a user session, sign), atomically swap all validators to the local build, and probe the pre-activation window with a workload that must complete its global presign at v3 via the fallback before the boundary. The run then crosses into v4 (the local binaries reshare the 1.1.8-created network key), serves a pool-backed global presign, and completes one more clean reshare.
Also corrects GenesisGlobalPresignConfig and cross_binary docs: Full is the actual verified mainnet on-chain state, Empty is a harness arrangement (and the only targeted-presign coverage), not the mainnet shape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
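The routing rule this commit introduces reduces to one conjunction: a request goes to the internal pool only when it is a global presign and the feature that fills the pool is active. The sketch below uses hypothetical names (`PresignRoute`, `route_presign`) and plain booleans in place of the real request and protocol-config types.

```rust
/// Where a presign request is served. Illustrative names only.
#[derive(Debug, PartialEq, Eq)]
pub enum PresignRoute {
    /// Served from the validators' internal presign pool (v4 and later).
    InternalPool,
    /// Served by spawning a user-requested MPC session (v1.1.8 behavior).
    UserSession,
}

/// Diverts to the pool only when the pool can actually exist.
pub fn route_presign(
    is_global_request: bool,
    internal_presign_sessions_enabled: bool,
) -> PresignRoute {
    if is_global_request && internal_presign_sessions_enabled {
        PresignRoute::InternalPool
    } else {
        PresignRoute::UserSession
    }
}
```

Before the fix, the equivalent of this function ignored the second argument, so a global request at v3 was routed to a pool that would never be filled.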
…ith window-delta tables
The scraped MPC duration metrics are cumulative per process, so a later snapshot blends everything the validator ever ran: a v3-protocol reshare and a v4-protocol reshare land in one row and the ratio table reads 1.00x. Add a window table to each comparison: (sum2-sum1)/(completions2-completions1) between consecutive snapshots isolates just the work done between them (skipped across a swap, where the counter reset makes the delta negative).
Extend the v118 rehearsal past epoch 4 with two new snapshots:
- v4-reshare: the first reshare executed under the v4 reconfiguration math (reconfiguration_message_version = 3, PVSS HPKE). The epoch 2->3 reshare still ran the v3 protocol, so the previous run never measured v4 reconfiguration cleanly;
- local-v4-settled: a full lifecycle after the pools finished their initial fill, pricing v4 DKG / pool presign / sign without the boundary work.
Run is green (1105s): the v4-math reshare window prices at 53.2s/7.4s/30.8s/9.6s per round vs the local binary's v3-math reshare at 9.5s/2.5s/8.6s/2.8s, with continuous internal-presign pool top-ups sharing the cores.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
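The window-delta formula above, (sum2-sum1)/(completions2-completions1), is sketched here with the reset handling the commit describes. `Snapshot` and `window_mean` are hypothetical names; the real code in mpc_timings.rs scrapes these values from validator metrics, which is not reproduced.

```rust
/// One scrape of a cumulative duration metric.
#[derive(Clone, Copy, Debug)]
pub struct Snapshot {
    /// Total seconds spent, summed since process start.
    pub sum_secs: f64,
    /// Number of completed runs since process start.
    pub completions: u64,
}

/// Mean duration of the work done between two snapshots. Returns None when
/// nothing completed in the window or when the counters went backwards
/// (a process restart, e.g. a binary swap, resets them).
pub fn window_mean(prev: Snapshot, next: Snapshot) -> Option<f64> {
    if next.completions <= prev.completions || next.sum_secs < prev.sum_secs {
        return None;
    }
    let delta_sum = next.sum_secs - prev.sum_secs;
    let delta_count = (next.completions - prev.completions) as f64;
    Some(delta_sum / delta_count)
}
```

With illustrative numbers, a snapshot of 10 s over 2 completions followed by 40 s over 5 completions gives a window mean of 10 s per completion, even though the cumulative mean of the second snapshot is 8 s.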
# Conflicts:
# crates/ika-core/src/dwallet_mpc/mpc_manager.rs
Summary
New additive crate `crates/ika-upgrade-test`: an out-of-process test harness that spawns real, separately-compiled `ika-validator` binaries against an external `sui` localnet, swaps binaries on validators across epochs, and drives dWallet workloads. Unlike `ika-test-cluster` (in-process `IkaNode`, one binary), it can host genuinely different binaries in one committee. No changes to `ika-node` / `ika-swarm`.
Implements `docs/cross-binary-upgrade-testing*.md`; see `docs/cross-binary-upgrade-testing-results.md` for the full write-up.
Tests (all opt-in via env flags; need real binaries + a workspace-tag `sui`)
- `tests/smoke.rs` (go/no-go)
- `tests/cross_binary.rs`
- `tests/workload.rs`
The cross-binary run exercises, out of process: protocol-vote arithmetic, mid-epoch reconfiguration MPC across the swap, mixed-committee wire compat, and on-disk compat (restart on a new binary against the old RocksDB).
Crate layout
- `sui.rs`: spawn external `sui start --with-faucet --force-regenesis` (waits for RPC and faucet).
- `cluster.rs`: chain bootstrap via `init_ika_on_sui` + `ValidatorConfigBuilder` + a notifier fullnode; `NodeConfig` → YAML → child; on-chain `wait_for_epoch` / protocol-version via `IkaClient`.
- `process.rs`: `ValidatorProcess`: spawn / stop / `swap_binary`, health via the admin RPC.
- `binary.rs`: `BinarySpec` (path / tag / sha / branch) + a sha-keyed `git worktree` build cache honoring each commit's pinned toolchain.
- `scenario.rs`: imperative DSL (start / wait_for_epoch / stop_and_swap / expect_protocol_version).
- `workload.rs`: drives a user dWallet lifecycle by orchestrating the canonical `ika` CLI; confirms completion on-chain.
Key finding: `mainnet-v1.1.8` ↔ `dev` is not a naive rolling swap
Running the harness with the real `mainnet-v1.1.8` `ika-node` fails at boot for the expected reason, not a harness bug: v1.1.8 links `class_groups` from `dwallet-labs/inkrypto`, dev from `dwallet-labs/cryptography-private`, and v4 publishes the combined `ValidatorEncryptionKeysAndProofs` where v1.1.8 expects the bare `ClassGroupsEncryptionKeyAndProof`. This is the exact incompatibility flagged in `validator_initialization_config.rs` (⚠️ MAINNET WIRE-FORMAT INCOMPATIBILITY ⚠️). The v1.1.8 node loads its config, connects to Sui, reads the contracts, then fails decoding the on-chain validator record (class groups public key … remaining input). So the harness genuinely runs a different binary and fails on the documented wire-format divergence.
To demonstrate a successful heterogeneous upgrade, the green `cross_binary` run uses an OLD binary that is a `dev` build pinned to `MAX_PROTOCOL_VERSION=3` (same crypto, differs only in advertised protocol version). This is disclosed in the test and results doc.
Notes
- `--no-default-features` to drop `enforce-minimum-cpu` (panics on < 16-core hosts).
- `register-encryption-key` precedes `create`; v4 genesis for `internal_presign_sessions`; long epoch to clear the mid-epoch reconfiguration; sign confirmed via the coordinator's on-chain completed-session count.
🤖 Generated with Claude Code