Open
Commits
209 commits
313f15b
Add off-chain validator metadata types + consensus variants
omersadika May 17, 2026
c550c0e
P2P blob endpoint + perpetual mpc_artifact_blobs + startup hydration
omersadika May 17, 2026
928c27d
Producer helpers + record path for validator mpc_data announcements
omersadika May 17, 2026
466f378
Record EpochMpcDataReadySignal + freeze mpc_data on first quorum
omersadika May 17, 2026
9a8b99e
Add SubmitMpcDataAnnouncement RPC + late-binding relay handle
omersadika May 17, 2026
8ca21d2
Joiner mpc_data announcement verification path
omersadika May 17, 2026
d08fab7
Pure handoff attestation build/sign/aggregate helpers
omersadika May 17, 2026
eeed644
Record handoff signatures into the per-epoch store
omersadika May 17, 2026
fecc140
Persist CertifiedHandoffAttestation to perpetual storage
omersadika May 17, 2026
7dec61d
Emit local handoff signature on EndOfPublish
omersadika May 17, 2026
361eef6
Serve handoff certs over Anemo + joiner bootstrap verify
omersadika May 17, 2026
75e882f
Cache DKG/reconfig output digests at Finalize for handoff
omersadika May 17, 2026
63eb76d
NetworkKeyDKGReadySignal + per-key freeze trigger
omersadika May 17, 2026
2c4669d
Effective reconfig input set = frozen ∩ (V_e ∪ V_{e+1})
omersadika May 17, 2026
c74d438
Off-chain DWalletNetworkEncryptionKeyData fetch with fallback
omersadika May 17, 2026
1993cf6
Off-chain Committee class-groups assembly with completion gate
omersadika May 17, 2026
7f67db5
Gate network DKG / reconfig session kickoff on off-chain freeze
omersadika May 17, 2026
96bc1a0
Broadcast mpc_data announcement + ready signals at epoch start
omersadika May 17, 2026
08b9879
Install JoinerPubkeyProvider from next-epoch committee
omersadika May 17, 2026
dd23f8c
Install ConsensusPubkeyProvider from current committee
omersadika May 17, 2026
e7cb91a
Overlay network key data with off-chain blobs in sui_syncer
omersadika May 17, 2026
d1773cb
Off-chain class-groups assembly in sui_syncer::new_committee
omersadika May 17, 2026
b74042d
Install joiner announcement relay on the Anemo server
omersadika May 17, 2026
e53c9f4
Move per-epoch consensus tasks into a new epoch_tasks module
omersadika May 17, 2026
2ab9a68
Decouple handoff from validator metadata
omersadika May 17, 2026
250750e
Rename ika-network::validator_metadata to mpc_artifacts + split
omersadika May 17, 2026
72a169a
Gate off-chain validator metadata behind protocol config flag
omersadika May 17, 2026
03ffb2c
Merge origin/dev into feat/off-chain-metadata-v2
omersadika May 22, 2026
4eb17a3
Expose validator-management bootstrap helpers
omersadika May 23, 2026
f4523c7
Merge remote-tracking branch 'origin/dev' into feat/off-chain-metadat…
omersadika May 24, 2026
27afa64
Add IkaTestCluster joiner helper + test
omersadika May 24, 2026
9648f1c
Add IkaTestCluster remove_validator helper + test
omersadika May 24, 2026
561d3f3
Add user-DKG ceremony + test_sessions_complete_across_epoch_switch
omersadika May 24, 2026
7b79982
Fix cross-cluster contamination in SuiClient shared-arg cache
omersadika May 24, 2026
59a8efd
Merge remote-tracking branch 'origin/dev' into feat/off-chain-metadat…
omersadika May 24, 2026
b8fc060
Enable internal_presign_sessions at v4
omersadika May 24, 2026
a7d1d81
Add multi-epoch user-session stress test
omersadika May 24, 2026
815879c
Fix handoff cert persistence + hydrate digest cache from chain
omersadika May 25, 2026
ab70f8d
Add EndOfPublishV2 consensus message variant
omersadika May 25, 2026
e5fb1c7
Add bundled_handoff_in_end_of_publish protocol flag
omersadika May 25, 2026
a7c33b1
Wire EndOfPublishV2 producer + consumer
omersadika May 25, 2026
0e70f16
Bump per-cycle epoch-advance timeout to 600s in churn test
omersadika May 25, 2026
a9792d0
Gate V2 by off_chain_validator_metadata; fix sync stale-snapshot race
omersadika May 25, 2026
531aa6e
Surface per-item digest diffs in AttestationMismatch log
omersadika May 25, 2026
49decc6
Wire off_chain mode to skip chain blob reads; add v4 cluster test
omersadika May 25, 2026
eb92b86
Mark off_chain blob-read assertion test as #[ignore]
omersadika May 25, 2026
ae3aefe
Investigate off-chain announcement propagation gap; identify P2P fetc…
omersadika May 25, 2026
4f05189
Wire P2P fetch_blob into peer-blob propagation; close off-chain gap
omersadika May 25, 2026
acc80f9
Add multi-network-key DKG cluster test + fix v3-shape mpc_data blob
omersadika May 25, 2026
f965b5f
Persist per-key DKG/reconfig digest map across epochs
omersadika May 25, 2026
8b7dbc1
Cache DKG/reconfig output digests from consensus-voted data
omersadika May 25, 2026
9a8398a
Re-broadcast NetworkKeyData on content change; add multi-key tests
omersadika May 26, 2026
2be3d94
Address PR review punch-list: freeze race, EOPV2 hardening, blob safety
omersadika May 26, 2026
41bc8ba
Exclude-on-bad-mpc-data freeze gate; drop chain fallback under v4
omersadika May 26, 2026
6fed770
Receive-time canonicalize ready signal; decode-validate peer blobs
omersadika May 26, 2026
cec2fc6
Bound pending handoff buffer; re-emit ready signal on growth; doc sweep
omersadika May 26, 2026
94466dc
Surface byzantine padding via canonicalize diagnostics
omersadika May 26, 2026
6de2abb
Pin handoff-aggregator replay invariants for restart safety
omersadika May 26, 2026
4c0a2c5
Document NetworkKeyDKGReadySignal dead-consumer status
omersadika May 26, 2026
d60a501
Address third-pass review: warn placement, FQ paths, replay test
omersadika May 26, 2026
5cd1236
Doc sweep: fix lies, ambiguity, and stale plan-phase tags
omersadika May 27, 2026
aaf9e10
Fix EpochMpcDataReadySignal re-emit silently dropped by consensus dedup
omersadika May 27, 2026
39ecfc8
Reject empty off-chain assembly; use frozen set as post-freeze truth
omersadika May 27, 2026
936d2e8
Gate self-attestation on own-blob health; reject sentinel timestamp_ms=0
omersadika May 27, 2026
faa9bf1
Add cert dup-signer, quorum-boundary, sentinel-timestamp tests
omersadika May 27, 2026
751e431
Extract assembly + self-attest decisions into testable pure helpers
omersadika May 27, 2026
9c864bc
docs: in-progress review of off-chain-metadata-v2 (predates 14 commits)
ycscaly May 28, 2026
e1202e7
docs: annotate review with verdicts vs current branch tip
ycscaly May 28, 2026
be254d5
Add write-through/read-through BlobCache; serve perpetual-only blobs
omersadika May 28, 2026
3c47984
Split announcement into self/relayed kinds; drop BLS for Ed25519
omersadika May 28, 2026
73f4ab8
Add joiner announcement fan-out task with P2P retry
omersadika May 28, 2026
ee385e3
Make the producer's announcement self-heal via confirmation retries
omersadika May 28, 2026
5a490ef
Wire joiner announcement fan-out into node startup
omersadika May 28, 2026
2a0f655
Delay the freeze until next-epoch joiners can be attested (F4-1)
omersadika May 28, 2026
cd42e9c
Fix peer_blob_fetcher to read the bare announcement table value
omersadika May 28, 2026
d02019c
Fix doc inaccuracies introduced by the Ed25519/freeze-delay refactors
omersadika May 28, 2026
95a3f5c
Don't cache empty network-key blobs when off-chain overlay isn't ready
omersadika May 28, 2026
69995f5
Surface F4-1 deadline emits that exclude unvalidated next-epoch members
omersadika May 28, 2026
159c190
Drop dead NetworkKeyDKGReadySignal plumbing
omersadika May 28, 2026
2f7e653
Unify the two pubkey-provider updaters into one generic task
omersadika May 28, 2026
7ecfa69
Extract handoff-cert subsystem into its own module
omersadika May 28, 2026
155ed58
Bind verify_joiner_bootstrap_cert to an expected prior epoch
omersadika May 28, 2026
7a27837
Wire joiner cert-bootstrap consumer into node startup
omersadika May 29, 2026
c309e75
Add explicit F4-1 cluster test: joiner lands in next committee class-…
omersadika May 29, 2026
fd3e0fd
Break the joiner freeze deadlock: gate on chain (not assembled) commi…
omersadika May 29, 2026
cc455e2
Brisk joiner fan-out retry; ignore timing-bound F4-1 cluster test
omersadika May 29, 2026
5a24170
Make off-chain joiner integration work end-to-end (freeze captures jo…
omersadika May 29, 2026
51c35db
Remove the dead V1 HandoffSignature consensus path
omersadika May 29, 2026
4ca60b6
Remove unused off-chain helper/cleanup methods (dead-code audit)
omersadika May 29, 2026
fc9a778
Review fast-follows: bootstrap outcome split + cache_protocol_output doc
omersadika May 29, 2026
a480cf1
Make handoff attestation committee membership deterministic under churn
omersadika May 29, 2026
34f880b
Handoff committee intersection must never withhold the EndOfPublish vote
omersadika May 30, 2026
e857ed5
docs: refresh review verdicts against current tip (34f880b124)
ycscaly May 31, 2026
df27ac1
docs(review): walk Feature 5 — pubkey providers
ycscaly May 31, 2026
e4e87c1
docs(review): walk Feature 6 — sui_syncer off-chain overlay
ycscaly May 31, 2026
feab4e5
docs(review): walk Feature 7 — handoff attestation
ycscaly May 31, 2026
4ff8b5b
docs(review): reframe F5-F7 author notes vs user concerns
ycscaly May 31, 2026
b906e7d
docs(review): walk Feature 8 — EndOfPublishV2
ycscaly May 31, 2026
9efed5b
Make the churn test robust to load: retry validator-mgmt txs + 120s e…
omersadika May 31, 2026
5c049d4
Harden off-chain handoff/reconfig determinism
omersadika May 31, 2026
b7f3760
Joiner fetches its network-key outputs from the verified handoff cert
omersadika May 31, 2026
bc370e8
Instantiate network keys from the local overlay (additive, alongside …
omersadika May 31, 2026
eb3d324
Revert "Instantiate network keys from the local overlay (additive, al…
omersadika May 31, 2026
f29e54b
Epoch-pin the handoff reconfiguration digest to the local-MPC current…
omersadika May 31, 2026
91e4e61
Bound the in-memory MPC blob serve cache (FIFO byte cap)
omersadika May 31, 2026
7f171d1
Retry EndOfPublishV2 until sequenced instead of a one-shot flag
omersadika May 31, 2026
883c609
Anchor every validator on the prior-epoch handoff cert (step 1: cert …
omersadika May 31, 2026
67aa516
Instantiate network keys from cert-verified local outputs (step 2)
omersadika May 31, 2026
344d8a1
Remove the ConsensusNetworkKeyData vote + broadcast (step 3: unificat…
omersadika May 31, 2026
bc3935f
Handle the genesis / initial-DKG case in cert-verified instantiation …
omersadika May 31, 2026
2ff5909
Suppress retry of network-key instantiations that fail to decrypt
omersadika May 31, 2026
5e19242
Rename the off-chain assembly path: class_groups -> mpc_data
omersadika May 31, 2026
3c3d70d
Reword stale consensus-voted comments after vote removal
omersadika May 31, 2026
f1d19e1
Reject the EndOfPublishV2 EOP vote when its bundled handoff sig fails
omersadika May 31, 2026
74acdf4
Fail-closed: halt the node when the bootstrap trust anchor is Rejected
omersadika May 31, 2026
3c79938
Escalate a permanently-wedged off-chain assembly to error! (F6)
omersadika May 31, 2026
740e997
Resolve departed prior-committee signers when verifying handoff certs
omersadika May 31, 2026
fedc0db
Skip pubkey-provider refresh when the chain has advanced past the epoch
omersadika May 31, 2026
7e08a6a
Buffer relayed joiner announcements when the joiner provider lags
omersadika May 31, 2026
a86ffc3
Exit the pubkey updater when its epoch drops; drop base64 dedup
omersadika Jun 1, 2026
91b5892
Don't publish transient incomplete network-key entries on the channel
omersadika Jun 1, 2026
693f2c6
Give the chain-committee channel a crypto-free CommitteeMembership type
omersadika Jun 1, 2026
db792dd
Clarify why joiner bootstrap is one-hop (Sui anchors the prior commit…
omersadika Jun 1, 2026
e7cb4d0
Document the two handoff-cert review findings
omersadika Jun 1, 2026
666b9c2
fix scripts
ycscaly Jun 2, 2026
7c15ecb
Tally the mpc_data freeze from consensus signals, not the local table
omersadika Jun 2, 2026
baec9d6
Write the local-publish ephemeral pubfile into the contracts temp dir
omersadika Jun 2, 2026
6af21a8
Revert the network-key empty-blob channel filter (it wedged epoch adv…
omersadika Jun 2, 2026
02effbd
Key handoff reconfiguration-output digest by the reconfiguration sess…
omersadika Jun 2, 2026
14ba3ea
Give the shared-dWallet test Active-wait the zero-trust 5-min timeout
omersadika Jun 2, 2026
52d797c
Fix two clippy warnings in validator_metadata tests
omersadika Jun 2, 2026
593aef3
Track the notifier gas coin from tx effects to survive fullnode lag
omersadika Jun 2, 2026
c97ea98
Shrink presign pools for the local in-memory swarm
omersadika Jun 3, 2026
81829f2
Chain-read the prior committee for joiner bootstrap
omersadika Jun 3, 2026
34f70b9
Drop the cached notifier gas ref on submission failure
omersadika Jun 3, 2026
7cae3fe
Make the notifier robust to stale-gas rejections
omersadika Jun 3, 2026
4ebc1ff
Give the test-cluster notifier a dedicated funded Sui key
omersadika Jun 3, 2026
338df11
Stop the dWallet MPC service panicking on EpochEnded
omersadika Jun 3, 2026
e5ee86c
Deliver validator mpc_data blobs in-band over consensus
omersadika Jun 4, 2026
8f97183
Pre-derive the joiner's mpc_data blob off the critical path
omersadika Jun 4, 2026
cb29ac3
Right-size the churn test to production-realistic epochs
omersadika Jun 4, 2026
80e2be4
Correct the pinned Sui version in CLAUDE.md
omersadika Jun 4, 2026
f02a295
test(integration): raise dWallet poll timeouts to 600s for slow-netwo…
omersadika Jun 5, 2026
87ee419
refactor(dwallet-mpc): cut per-session/per-message log spam, add sess…
omersadika Jun 6, 2026
f3c2508
fix(reconfiguration): epoch-scale the uncompleted-events re-poll; dia…
omersadika Jun 6, 2026
9f8d3c2
fix(reconfiguration): deliver pre-v4 network-key outputs across the v…
omersadika Jun 8, 2026
caa2cde
fix(sui-executor): gate advance_epoch on session completion to preven…
omersadika Jun 8, 2026
a560181
fix(test): pass the blob arg to new_validator_mpc_data_announcement
omersadika Jun 8, 2026
5b2afbb
feat(reconfiguration): prepare-then-start — block epoch start until f…
omersadika Jun 8, 2026
1e23b4c
test(sdk): raise the default poll timeout to 10m so slow-network poll…
omersadika Jun 9, 2026
cdd1757
test(integration): drop the redundant per-call poll timeout overrides…
omersadika Jun 9, 2026
d828cad
Merge remote-tracking branch 'origin/dev' into feat/off-chain-metadat…
omersadika Jun 9, 2026
d14164d
Update Cargo.lock
omersadika Jun 9, 2026
9c85ce5
feat(reconfiguration): ground the prepare-then-start barrier in the v…
omersadika Jun 9, 2026
ec016c4
feat(reconfiguration): reliably converge the handoff cert and full mp…
omersadika Jun 9, 2026
7a78502
Merge remote-tracking branch 'origin/dev' into feat/off-chain-metadat…
omersadika Jun 10, 2026
6a7c236
ci: run the heavy test suites on the ika-k8s-large self-hosted runner
omersadika Jun 10, 2026
2e306c3
ci: install wasm-pack in the TS integration workflow
omersadika Jun 10, 2026
51e2f92
ci: force IPv4 + retry for apt on the k8s runners
omersadika Jun 10, 2026
84c2230
ci: start a Sui localnet before ika start in the TS workflow
omersadika Jun 10, 2026
814434d
ci: run the full TS suite against one Sui + ika localnet
omersadika Jun 10, 2026
30cbda3
fix(reconfiguration): make the EndOfPublish close version-faithful, v…
omersadika Jun 10, 2026
aa85a20
fix(reconfiguration): harden the off-chain handoff paths against rest…
omersadika Jun 10, 2026
fa5fe7a
ci: give the ika localnet readiness probe a 40-minute budget
omersadika Jun 10, 2026
3b72474
ci: report effective runner CPU/memory at job start
omersadika Jun 10, 2026
38ef225
fix(reconfiguration): cache quorum-agreed network-key outputs as a re…
omersadika Jun 10, 2026
65a0bb1
fix(reconfiguration): decide the mpc_data freeze at the consensus com…
omersadika Jun 10, 2026
6993dbe
ci: pin PROFILE=release for the ika-wasm build in TS integration
omersadika Jun 10, 2026
d92a9d9
ci: sample effective CPU usage during the heavy test phases
omersadika Jun 10, 2026
efb46b6
ci: readiness via positive one-way signals, not a quiet window
omersadika Jun 10, 2026
37b1201
test(integration): make the harness wall-clock budgets environment-aware
omersadika Jun 10, 2026
f6ed33a
ci: default the TS localnet to 15-minute epochs
omersadika Jun 10, 2026
0845167
ci: optional extra RUSTFLAGS input for the cluster workflow
omersadika Jun 10, 2026
37d8dea
ci: build the TS localnet ika binary with target-cpu=native
omersadika Jun 10, 2026
766b22f
ci: parameterize localnet RUST_LOG + always upload localnet logs
omersadika Jun 10, 2026
76d2d26
ci: 7h budget + always-upload logs for the full cluster suite
omersadika Jun 10, 2026
5388d0b
perf(reconfiguration): stop redundant per-tick re-work in the off-cha…
omersadika Jun 10, 2026
b790278
ci: test_filter + rust_log inputs for the integration workflow, alway…
omersadika Jun 11, 2026
e662dee
fix(simtest): run cryptographic computations inline under msim instea…
omersadika Jun 11, 2026
e6bb02a
perf(dwallet-mpc): never block the MPC service loop on network-key in…
omersadika Jun 11, 2026
622c002
ci: run the cluster suite via cargo-nextest at 8-way parallelism
omersadika Jun 11, 2026
b530e5a
debug(dwallet-mpc): per-sub-call timing inside the network-key instan…
omersadika Jun 11, 2026
470440d
debug(dwallet-mpc): emit the instantiation sub-call timings at info l…
omersadika Jun 11, 2026
6b19761
ci: optional hosted-runner selection for the integration workflow
omersadika Jun 11, 2026
cdd77dd
ci: allocator A/B input for the integration workflow
omersadika Jun 11, 2026
4481d44
docs(specs): behavioral specs for the handoff and validator mpc-data …
omersadika Jun 11, 2026
8a9964f
ci: time + optional perf counters around the integration test execution
omersadika Jun 11, 2026
43fb9a0
fix(ci): disable eager library backtrace capture — the actual CI slow…
omersadika Jun 11, 2026
502db1f
fix(test-cluster): cross-process boot lock for parallel cluster tests
omersadika Jun 11, 2026
0685274
fix(dwallet-mpc): consensus-deterministic internal-presign session id…
omersadika Jun 11, 2026
6634936
Merge remote-tracking branch 'origin/dev' into feat/off-chain-metadat…
omersadika Jun 11, 2026
59ea748
fix(dwallet-mpc): adopt network keys once per service iteration, not …
omersadika Jun 11, 2026
45674ca
feat(node): compiled-in jemalloc global allocator, mirroring sui-node
omersadika Jun 11, 2026
a694222
chore: remove process-artifact docs from the branch
omersadika Jun 11, 2026
fe9a24f
chore(ci): strip investigation scaffolding from the workflows, retune…
omersadika Jun 11, 2026
502b8fa
chore(dwallet-mpc): post-investigation cleanup — terminology, stale c…
omersadika Jun 11, 2026
550faba
chore: remove the PR action-plan working document
omersadika Jun 11, 2026
858788b
feat(observability): production-grade logs and metrics for the off-ch…
omersadika Jun 12, 2026
a07b743
fix(test): never-panicking RUST_LOG-honoring tracing init for missing…
omersadika Jun 12, 2026
c9ecaf0
fix(dwallet-mpc): never complete a user session beyond the epoch-clos…
omersadika Jun 12, 2026
074318c
fix(dwallet-mpc): never abandon a computation-results batch on one st…
omersadika Jun 12, 2026
c380be3
test(dwallet-mpc): regression tests for both epoch-close wedge bugs
omersadika Jun 12, 2026
b3be0be
chore(deps): bump cryptography-private to de3cddd; drop the RUST_LIB_…
omersadika Jun 12, 2026
bba1c50
test(sdk): drop the remaining 30-attempt retryUntil overrides
omersadika Jun 12, 2026
8cdea34
ci: split compilation out of the test steps; quiet residual cargo noise
omersadika Jun 12, 2026
c06203b
docs(specs): epoch-close session-lock spec
omersadika Jun 12, 2026
54ff155
fix(test-cluster): cover the joiner spawn with the boot lock; widen p…
omersadika Jun 12, 2026
4abaad5
test(sdk): raise the retryUntil default to 15min; align per-case time…
omersadika Jun 12, 2026
d2bd241
ci(ts-integration): chain-side rejections + per-validator metrics in …
omersadika Jun 12, 2026
12 changes: 10 additions & 2 deletions .github/workflows/ci.yaml
@@ -51,8 +51,16 @@ jobs:
 - uses: Swatinem/rust-cache@v2
 - name: Build
 run: |
-# Install build dependencies
-sudo apt-get update && sudo apt-get install -y cmake clang pkg-config libssl-dev
+# Install build dependencies. Some runners lack IPv6 egress while
+# DNS returns AAAA records, so force IPv4 and retry — apt mirror
+# flakiness otherwise fails the job before the build starts.
+APT="-o Acquire::ForceIPv4=true"
+for attempt in 1 2 3; do
+sudo apt-get $APT update && \
+sudo apt-get $APT install -y cmake clang pkg-config libssl-dev && break
+echo "apt attempt $attempt failed; retrying in 15s" && sleep 15
+done
+command -v cmake >/dev/null || { echo "build dependencies missing after retries"; exit 1; }
 RUSTFLAGS="-D warnings" cargo build --bin ika --target x86_64-unknown-linux-gnu

 fmt:
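
Review note: the IPv4-plus-retry apt loop in the Build step above is repeated in several workflows in this PR. As a sketch only, the loop could be expressed as one small shell function. The name `retry` and its argument order are hypothetical (not part of the PR), and the apt calls are shown as comments so the block runs without root.

```shell
# Hypothetical helper, not part of the PR: run a command up to N times,
# sleeping DELAY seconds between failed attempts.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_attempt=1
  while [ "$retry_attempt" -le "$retry_attempts" ]; do
    # Success on any attempt ends the loop immediately.
    "$@" && return 0
    echo "attempt $retry_attempt/$retry_attempts failed: $*" >&2
    # Only sleep when another attempt will follow.
    [ "$retry_attempt" -lt "$retry_attempts" ] && sleep "$retry_delay"
    retry_attempt=$((retry_attempt + 1))
  done
  return 1
}

# Assumed usage inside a workflow step, mirroring the flags used above:
#   retry 3 15 sudo apt-get -o Acquire::ForceIPv4=true update
#   retry 3 15 sudo apt-get -o Acquire::ForceIPv4=true install -y cmake clang pkg-config libssl-dev
```

One behavioral difference from the loop in the diff: this helper returns non-zero once all attempts fail, whereas the PR's loop falls through and relies on the trailing `command -v cmake` check to fail the step.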
139 changes: 131 additions & 8 deletions .github/workflows/integration-tests-ci.yaml
@@ -1,11 +1,39 @@
 name: Integration Tests CI

+# Manually triggered. Runs the Rust dwallet-MPC integration tests (real
+# class-groups crypto, in-process consensus harness) on the `ika-k8s-large`
+# self-hosted runner. `scope: all` widens to the entire workspace test suite.
+
 on:
 workflow_dispatch:
+inputs:
+scope:
+description: "Which Rust tests to run"
+type: choice
+required: false
+default: "integration"
+options:
+- integration
+- all
+test_threads:
+description: "Concurrent test count (default 4 — concurrent tests share one rayon pool; too many queue-starves the per-advancement wall-clock budgets)"
+type: string
+required: false
+default: "4"
+test_filter:
+description: "Optional test-name filter for the integration scope (e.g. network_dkg::test_network_dkg_full_flow)"
+type: string
+required: false
+default: ""
+rust_log:
+description: "RUST_LOG override for the test run"
+type: string
+required: false
+default: "error"

 concurrency:
-group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}
-cancel-in-progress: true
+group: ${{ github.workflow }}-${{ github.ref }}
+cancel-in-progress: false

 env:
 RUST_BACKTRACE: 1
@@ -18,15 +46,31 @@ env:
 CARGO_NET_RETRY: 10
 RUSTUP_MAX_RETRIES: 10
 RUST_LOG: error
+# Generous safety-net headroom over the harness's per-advancement
+# wall-clock budgets, for contention outliers on shared runners.
+IKA_TEST_MAX_PARTY_ITERATIONS: "6000"
+IKA_TEST_MAX_COMPUTATION_WAIT_ITERATIONS: "18000"

 jobs:
 run-tests:
-name: Run Integration Tests
-runs-on: ubuntu-latest
+name: Run ${{ inputs.scope }} tests
+runs-on: ika-k8s-large
 timeout-minutes: 180
 steps:
 - name: Checkout Repository
 uses: actions/checkout@v6
+
+- name: Runner resources
+run: |
+# Surface what this pod ACTUALLY gets — the scale set advertises up
+# to 80 vCPUs, but a low cgroup quota (requests/limits mismatch) or
+# node oversubscription silently throttles the crypto workloads.
+echo "nproc: $(nproc)"
+echo "cgroup cpu.max: $(cat /sys/fs/cgroup/cpu.max 2>/dev/null || cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us 2>/dev/null || echo n/a)"
+echo "cgroup memory.max: $(cat /sys/fs/cgroup/memory.max 2>/dev/null || cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo n/a)"
+free -g 2>/dev/null || true
+uptime || true
+
 - name: Setup SSH
 uses: ./.github/actions/setup-ssh
 with:
@@ -37,8 +81,87 @@ jobs:
 with:
 toolchain: ${{ env.rust_stable }}
 targets: x86_64-unknown-linux-gnu
-- name: Install Target
-run: rustup target add x86_64-unknown-linux-gnu
+
+- name: Install build dependencies
+run: |
+# Some runner pods lack IPv6 egress while DNS returns AAAA records,
+# so force IPv4 and retry — apt mirror flakiness otherwise fails the
+# whole job before any test runs.
+if command -v sudo >/dev/null; then SUDO=sudo; else SUDO=; fi
+APT="-o Acquire::ForceIPv4=true"
+for attempt in 1 2 3; do
+$SUDO apt-get $APT update && \
+$SUDO apt-get $APT install -y cmake clang pkg-config libssl-dev curl && break
+echo "apt attempt $attempt failed; retrying in 15s" && sleep 15
+done
+command -v cmake >/dev/null || { echo "build dependencies missing after retries"; exit 1; }
+
 - uses: Swatinem/rust-cache@v2
-- name: Run Integration Tests
-run: cargo test -p ika-core --lib dwallet_mpc::integration_tests --release --features test-utils --color=always -- --nocapture
+
+- name: Start CPU sampler
+run: |
+# Every 15s: cgroup cpu.stat (usage_usec delta -> effective CPUs
+# actually consumed; nr_throttled/throttled_usec -> CFS quota
+# stalls) + loadavg. Answers "does this workload USE the vCPUs"
+# rather than inferring it from wall-clock.
+nohup bash -c 'prev=$(grep usage_usec /sys/fs/cgroup/cpu.stat 2>/dev/null | awk "{print \$2}"); while true; do sleep 15; cur=$(grep usage_usec /sys/fs/cgroup/cpu.stat 2>/dev/null | awk "{print \$2}"); echo "$(date -u +%T) effective_cpus=$(( (cur - prev) / 15000000 )).$(( ((cur - prev) / 1500000) % 10 )) $(grep -E "nr_throttled|throttled_usec" /sys/fs/cgroup/cpu.stat 2>/dev/null | tr "\n" " ") load=$(cut -d" " -f1-3 /proc/loadavg)"; prev=$cur; done' > cpu-sampler.log 2>&1 &
+echo $! > cpu-sampler.pid
+
+- name: Build tests
+env:
+SCOPE: ${{ inputs.scope }}
+run: |
+# Compilation in its own step: the Downloaded/Compiling stream
+# dominates the log volume, a separate step collapses in the UI
+# when green, and `time` in the run step covers test execution,
+# not rustc.
+if [ "$SCOPE" = "all" ]; then
+cargo test --release --workspace --features test-utils --color=always --no-run
+else
+cargo test -p ika-core --lib --release --features test-utils --color=always --no-run
+fi
+
+- name: Run tests
+env:
+SCOPE: ${{ inputs.scope }}
+TEST_THREADS: ${{ inputs.test_threads }}
+TEST_FILTER: ${{ inputs.test_filter }}
+RUST_LOG: ${{ inputs.rust_log || 'error' }}
+run: |
+set -o pipefail
+THREADS=""
+if [ -n "$TEST_THREADS" ]; then
+THREADS="--test-threads=$TEST_THREADS"
+fi
+if [ "$SCOPE" = "all" ]; then
+time cargo test --release --workspace --features test-utils --color=always -- \
+$THREADS --nocapture 2>&1 | tee rust-tests.log
+else
+FILTER="dwallet_mpc::integration_tests"
+if [ -n "$TEST_FILTER" ]; then
+FILTER="dwallet_mpc::integration_tests::$TEST_FILTER"
+fi
+time cargo test -p ika-core --lib "$FILTER" --release \
+--features test-utils --color=always -- $THREADS --nocapture 2>&1 | tee rust-tests.log
+fi
+
+- name: Summarize results
+if: always()
+run: |
+grep -E "^test .*(ok|FAILED)|test result" rust-tests.log | tail -60 || true
+
+- name: Upload CPU sampler log
+if: always()
+uses: actions/upload-artifact@v4
+with:
+name: cpu-sampler-${{ github.job }}-${{ github.run_attempt }}
+path: cpu-sampler.log
+retention-days: 7
+
+- name: Upload test log
+if: always()
+uses: actions/upload-artifact@v4
+with:
+name: rust-tests-log-${{ github.run_attempt }}
+path: rust-tests.log
+retention-days: 7
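
Review note: the `effective_cpus` arithmetic in the "Start CPU sampler" one-liner is dense, so here is a worked sketch of the same calculation. The function name `effective_cpus` and its parameters are hypothetical. The inputs are two `usage_usec` readings (microseconds of CPU time) taken one sampling interval apart, and the output is the average number of CPUs consumed, to one decimal place, using integer math only.

```shell
# Hypothetical helper, not part of the PR: convert two usage_usec readings
# into "CPUs consumed", printed with one decimal digit.
effective_cpus() {
  ec_prev=$1
  ec_cur=$2
  ec_interval_s=${3:-15}   # the sampler in the diff uses a 15s interval
  # One full CPU for the whole interval is interval_s * 1,000,000 usec,
  # so dividing by interval_s * 100,000 yields tenths of a CPU.
  ec_tenths=$(( (ec_cur - ec_prev) / (ec_interval_s * 100000) ))
  echo "$((ec_tenths / 10)).$((ec_tenths % 10))"
}

# 52,500,000 usec of CPU time over 15s is 3.5 CPUs on average.
effective_cpus 0 52500000
```

For a 15s interval this matches the two expressions in the diff: `delta / 15000000` gives the integer part and `(delta / 1500000) % 10` gives the first decimal digit. Note that `usage_usec` in `/sys/fs/cgroup/cpu.stat` is a cgroup v2 field, so the sketch assumes a cgroup v2 runner.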
11 changes: 10 additions & 1 deletion .github/workflows/simtest.yaml
@@ -76,7 +76,16 @@ jobs:
 toolchain: ${{ env.rust_stable }}

 - name: Install build dependencies
-run: sudo apt-get update && sudo apt-get install -y cmake clang pkg-config libssl-dev
+run: |
+# IPv4 + retry: some runners lack IPv6 egress while DNS returns
+# AAAA records; apt mirror flakiness otherwise fails the job.
+APT="-o Acquire::ForceIPv4=true"
+for attempt in 1 2 3; do
+sudo apt-get $APT update && \
+sudo apt-get $APT install -y cmake clang pkg-config libssl-dev && break
+echo "apt attempt $attempt failed; retrying in 15s" && sleep 15
+done
+command -v cmake >/dev/null || { echo "build dependencies missing after retries"; exit 1; }

 - uses: Swatinem/rust-cache@v2
 with:
133 changes: 116 additions & 17 deletions .github/workflows/test-cluster.yaml
@@ -1,9 +1,24 @@
 name: Test Cluster

 # Manually triggered. Runs the in-process Sui + ika swarm integration tests
-# from `crates/ika-test-cluster/`. This is the `#[tokio::test]` path: real
-# parallel crypto, fast wall time, no msim determinism. The slower `#[sim_test]`
-# variant lives in `.github/workflows/simtest.yaml`.
+# from `crates/ika-test-cluster/` on the `ika-k8s-large` self-hosted runner.
+# This is the `#[tokio::test]` path: real parallel crypto, fast wall time, no
+# msim determinism. The slower `#[sim_test]` variant lives in
+# `.github/workflows/simtest.yaml`.
+#
+# Runs via cargo-nextest with parallel tests. Two facts make this work:
+# 1. nextest runs each test in its OWN PROCESS, which isolates the
+# `IkaTestClusterBuilder` publish flow's process-global
+# `set_current_dir` (the `Pub.<env>.toml` parking) — under plain
+# `cargo test` threads, parallel tests race on cwd and corrupt each
+# other's contract publishes. (Concurrent boots are serialized by the
+# builder's cross-process boot lock to avoid port-probe races.)
+# 2. The suite is latency-bound (each cluster spends most wall time
+# waiting on consensus rounds and epoch timers), so parallel clusters
+# mostly interleave waiting.
+# MEMORY is the parallelism ceiling, not CPU: each cluster is a full Sui
+# swarm + ika validators (multi-GB); 8-way has OOM-killed the runner pod
+# (96Gi limit) — keep `test_threads` at 4 unless the runner spec grows.
 #
 # See the "## Testing" section in CLAUDE.md for the strategy split between
 # tokio and sim_test.
@@ -17,10 +32,15 @@ on:
 required: false
 default: "ika-test-cluster"
 test_filter:
-description: "Test name filter passed to cargo test"
+description: "Test name filter passed to nextest (empty = full suite)"
+type: string
+required: false
+default: ""
+test_threads:
+description: "Concurrent test count (nextest process-per-test; memory-bound — 8-way OOM-killed the 96Gi runner pod)"
 type: string
 required: false
-default: "cluster_boots_with_four_validators"
+default: "4"

 concurrency:
 group: ${{ github.workflow }}-${{ github.ref }}
@@ -37,20 +57,26 @@ env:

 jobs:
 test-cluster:
-name: cargo test --release
-runs-on: ubuntu-latest
-# The full bootstrap (Sui chain → publish 4 ika packages → initialize →
-# swarm launch) runs in ~80 s locally with parallel crypto on. CI runners
-# are slower; 60 min is generous.
-timeout-minutes: 60
+name: cargo nextest --release
+runs-on: ika-k8s-large
+# The full suite at 4-way runs in ~35 minutes; the ceiling covers a
+# cold build cache plus contention outliers.
+timeout-minutes: 150
 steps:
-- name: Clean runner disk
-run: |
-sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc /opt/hostedtoolcache/CodeQL
-
 - name: Checkout repository
 uses: actions/checkout@v6

+- name: Runner resources
+run: |
+# Surface what this pod ACTUALLY gets — the scale set advertises up
+# to 80 vCPUs, but a low cgroup quota (requests/limits mismatch) or
+# node oversubscription silently throttles the crypto workloads.
+echo "nproc: $(nproc)"
+echo "cgroup cpu.max: $(cat /sys/fs/cgroup/cpu.max 2>/dev/null || cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us 2>/dev/null || echo n/a)"
+echo "cgroup memory.max: $(cat /sys/fs/cgroup/memory.max 2>/dev/null || cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo n/a)"
+free -g 2>/dev/null || true
+uptime || true
+
 - name: Setup SSH
 uses: ./.github/actions/setup-ssh
 with:
@@ -62,7 +88,18 @@ jobs:
 toolchain: ${{ env.rust_stable }}

 - name: Install build dependencies
-run: sudo apt-get update && sudo apt-get install -y cmake clang pkg-config libssl-dev
+run: |
+# Some runner pods lack IPv6 egress while DNS returns AAAA records,
+# so force IPv4 and retry — apt mirror flakiness otherwise fails the
+# whole job before any test runs.
+if command -v sudo >/dev/null; then SUDO=sudo; else SUDO=; fi
+APT="-o Acquire::ForceIPv4=true"
+for attempt in 1 2 3; do
+$SUDO apt-get $APT update && \
+$SUDO apt-get $APT install -y cmake clang pkg-config libssl-dev curl && break
+echo "apt attempt $attempt failed; retrying in 15s" && sleep 15
+done
+command -v cmake >/dev/null || { echo "build dependencies missing after retries"; exit 1; }

 - uses: Swatinem/rust-cache@v2
 with:
@@ -71,6 +108,68 @@ jobs:
 # default release profile.
 prefix-key: "test-cluster"

+- name: Install cargo-nextest
+uses: taiki-e/install-action@v2
+with:
+tool: cargo-nextest
+
+- name: Start CPU sampler
+run: |
+# Every 15s: cgroup cpu.stat (usage_usec delta -> effective CPUs
+# actually consumed; nr_throttled/throttled_usec -> CFS quota
+# stalls) + loadavg. Answers "does this workload USE the vCPUs"
+# rather than inferring it from wall-clock.
+nohup bash -c 'prev=$(grep usage_usec /sys/fs/cgroup/cpu.stat 2>/dev/null | awk "{print \$2}"); while true; do sleep 15; cur=$(grep usage_usec /sys/fs/cgroup/cpu.stat 2>/dev/null | awk "{print \$2}"); echo "$(date -u +%T) effective_cpus=$(( (cur - prev) / 15000000 )).$(( ((cur - prev) / 1500000) % 10 )) $(grep -E "nr_throttled|throttled_usec" /sys/fs/cgroup/cpu.stat 2>/dev/null | tr "\n" " ") load=$(cut -d" " -f1-3 /proc/loadavg)"; prev=$cur; done' > cpu-sampler.log 2>&1 &
+echo $! > cpu-sampler.pid
+
+- name: Build test cluster
+env:
+PACKAGE: ${{ inputs.package }}
+run: |
+# Compilation in its own step: the Downloaded/Compiling stream is
+# the majority of this workflow's log volume (~57% measured), and
+# a separate step collapses in the UI when green, leaving the test
+# step with only nextest progress and failure replays.
+cargo nextest run --no-run --release -p "$PACKAGE"
+
 - name: Run test cluster
+env:
+PACKAGE: ${{ inputs.package }}
+TEST_FILTER: ${{ inputs.test_filter }}
+TEST_THREADS: ${{ inputs.test_threads }}
+run: |
+set -o pipefail
+# nextest: process-per-test (isolates the publish-flow cwd
+# mutation), captured per-test output (failures replay theirs at
+# the end — no more multi-GB interleaved logs), and no fail-fast
+# so one wedged cluster can't hide the rest of the suite's
+# results. Long tests surface via nextest's default SLOW markers.
+# Failure replays stay inline ON PURPOSE: when the runner pod dies
+# (OOM/eviction) the artifact upload never happens and the live
+# log is the only surviving evidence.
+cargo nextest run --release -p "$PACKAGE" $TEST_FILTER \
+--test-threads "$TEST_THREADS" --no-fail-fast --cargo-quiet \
+2>&1 | tee cluster-tests.log
+
+- name: Summarize results
+if: always()
 run: |
-cargo test --release -p ${{ inputs.package }} ${{ inputs.test_filter }} -- --nocapture
+grep -E "PASS |FAIL |SLOW |Summary" cluster-tests.log | tail -40 || true
+
+- name: Upload CPU sampler log
+if: always()
+uses: actions/upload-artifact@v4
+with:
+name: cpu-sampler-${{ github.job }}-${{ github.run_attempt }}
+path: cpu-sampler.log
+retention-days: 7
+
+- name: Upload test log
+# always: a timeout kill registers as 'cancelled', not 'failure',
+# and partial results from a long suite must survive it.
+if: always()
+uses: actions/upload-artifact@v4
+with:
+name: cluster-tests-log-${{ github.run_attempt }}
+path: cluster-tests.log
+retention-days: 7
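
Review note: the "Run test cluster" step leaves `$TEST_FILTER` unquoted so that an empty input expands to no argument at all. A minimal sketch of the same idea with explicit quoting follows. The function name `build_nextest_args` is hypothetical, and it only prints the argument list (one argument per line) rather than invoking cargo; the flags are the ones used in the diff above.

```shell
# Hypothetical helper, not part of the PR: assemble the nextest argument
# list, adding the filter only when it is non-empty.
build_nextest_args() {
  bn_package=$1
  bn_filter=$2
  bn_threads=$3
  # Reuse the positional parameters as a portable argument list.
  set -- nextest run --release -p "$bn_package"
  if [ -n "$bn_filter" ]; then
    set -- "$@" "$bn_filter"
  fi
  set -- "$@" --test-threads "$bn_threads" --no-fail-fast --cargo-quiet
  printf '%s\n' "$@"
}

# Empty filter: no filter argument is emitted, so the full suite would run.
build_nextest_args ika-test-cluster "" 4
```

The trade-off is that this sketch passes the filter as a single argument, whereas the unquoted expansion in the workflow also allows several space-separated filters in one input.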