Skip to content

fix(sui-executor): notifier gas-handling and advance_epoch robustness under load#1734

Open
ycscaly wants to merge 6 commits into
devfrom
fix/notifier-sui-executor-robustness
Open

fix(sui-executor): notifier gas-handling and advance_epoch robustness under load#1734
ycscaly wants to merge 6 commits into
devfrom
fix/notifier-sui-executor-robustness

Conversation

@ycscaly

@ycscaly ycscaly commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Six commits, cherry-picked in their original order from feat/ika-upgrade-test, making the notifier's Sui-submission path robust under load. They chain on sui_executor.rs, so they ship as one PR; each commit message carries its full story. The distinct fixes:

  1. Track the gas coin from tx effects (593aef39): the executor rebuilt every checkpoint/epoch-advance tx with gas fetched from the notifier's fullnode, which lags the validators by hundreds of object versions under checkpoint-heavy load — stale gas version, rejection, retry against the same lagging view, stalled epoch advance. Cache the gas ObjectRef from each tx's effects (the authoritative post-tx version) and build the next tx against it.
  2. Drop the cached gas ref on submission failure (34f70b99): on a rejection the cache wasn't cleared, so retry_with_max_elapsed_time! spun on the identical stale ref for the full retry budget (an hour). Clear it on both failure paths.
  3. Stale-gas rejection robustness (7cae3fee): recover when the notifier's gas coin is advanced by another holder of the key — re-fetch floored at the rejected version instead of spinning.
  4. Gate request_advance_epoch on session completion (caa2cde4): the notifier submitted request_advance_epoch as soon as a quorum's checkpoint arrived; if just-synced state showed sessions still pending, the Move call aborted and the notifier died in a panic loop. Re-check the on-chain completion predicate before submitting. (Touches system_inner_v1.rs for the predicate read.)
  5. Apply stale-gas recovery to RPC-level submission rejections (01600b82): the recovery only triggered on effects-level failures; RPC-level rejections (the common shape for stale gas) bypassed it. Factor the recovery into one path covering both.
  6. Document the err.as_ref() clippy trap (59b0db3e): clippy's unnecessary_to_owned suggestion silently changes which error variant the stale-gas matcher sees; the comment prevents a "cleanup" from reintroducing the spin.

Validated on feat/ika-upgrade-test: the full dwallet-creation integration suite (8/8) passes with sui_tx_err=0, where the pre-fix binary stalled at epoch 5 with 100+ gas-version rejections; the cross-binary churn and literal v1.1.8 upgrade rehearsals are green on top of these.

🤖 Generated with Claude Code

omersadika and others added 6 commits June 11, 2026 12:18
The Sui executor rebuilt every checkpoint / epoch-advance tx with gas
fetched via get_gas_objects() from the notifier's fullnode. Under
checkpoint-heavy load the fullnode lags the validators by hundreds of
object versions, so the tx carried a stale gas-coin version and was
rejected ("transaction needs to be rebuilt because object ... version
..."); the retry kept re-reading the same lagging view until the
epoch-advance stalled.

Cache the gas ObjectRef carried by each tx's effects (the authoritative
post-tx version) in the serialized notifier submit state and build the
next tx against it, falling back to get_gas_objects only for the first
tx. Submission is serial (the lock is held across submit_tx_to_sui), so
the cached ref is always the exact current version when the next tx is
built.

Verified: the full dwallet-creation integration suite (8/8) passes with
sui_tx_err=0 throughout, where the pre-fix binary stalled at epoch 5
with 100+ gas-version rejections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The notifier caches its gas-coin object ref from the previous tx's
effects to skip the lagging fullnode on the steady-state path. But on a
submission failure — a stale gas version rejected before execution, or
missing effects — `submit_tx_to_sui` returned without clearing that
cache. Since `run_epoch_switch` and the checkpoint handlers wrap each
submission in `retry_with_max_elapsed_time!`, the retry re-read the same
stale ref and was rejected again, spinning on the identical stale version
for the full retry budget (an hour) and wedging epoch advance.

Clear `gas_coins` on both failure paths so the retry re-fetches a fresh
ref (from effects once a tx lands, else the fullnode). In a local cluster
run this collapsed the rejected-version gap from ~177 (a pinned stale
ref) to ~3 (residual fullnode lag) and let epoch processing progress.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refine the notifier's gas-coin handling so a stale-version rejection can
recover instead of spinning:

- Only drop the cached gas ref on a genuine stale-gas rejection
  ("unavailable for consumption" / "needs to be rebuilt"), not on every
  pre-execution error — an unrelated failure leaves the (valid) cached
  ref intact rather than forcing a fullnode re-fetch.
- On a stale-gas rejection, record the rejected version as a floor; the
  next `get_gas_objects` re-fetch waits (up to 30s) for the notifier
  fullnode to advance past it before trusting the result, instead of
  re-serving the same stale version it just rejected.
- Clear the floor once a tx lands (the cached ref is authoritative again).

This handles a notifier whose gas coin is advanced by another holder of
its address and whose fullnode lags the validators. (The in-process test
cluster reproduces this by funding the notifier from the shared publisher
coin; production notifiers use a dedicated key.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t panic death-spiral

The notifier submitted `request_advance_epoch` as soon as a quorum's
`received_end_of_publish` snapshot was observed, but a network-key system
session can start in the window after that snapshot (a `respond_*` on a
network-key DKG/reconfiguration session chains a new
`initiate_system_session`). The on-chain
`sessions_manager::advance_epoch` then MoveAborts with
`ENotAllCurrentEpochSessionsCompleted` (code 6); the hour-long retry
burned out and `panic!`d the validator over a transient, self-clearing
condition — dropping the committee below quorum mid-transition and
risking a network-wide wedge.

Re-check the on-chain completion predicate against just-synced state
before submitting: hold the tick (debug-log the breakdown) if user or
system sessions are still draining, mirroring the existing
`epoch_not_locked` gate. The hour-long panic now only guards genuinely
fatal submission errors.

- ika-types: add `SessionsManager::all_current_epoch_sessions_completed`
  (Rust mirror of the Move assertion) + a unit test of the truth table,
  including the "system session started after end-of-publish" transient.
- sui-executor: gate the `advance_epoch` submission on it.

Validated live: under the heavy TS integration suite that previously
crashed the network at epoch 7, the network advanced to epoch 10 with
zero panics and zero thread-stall wedges.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ejections

The stale-gas recovery (drop the cached gas ref, floor the re-fetch at
the rejected version) only ran when the rejection arrived inside
`tx_response.errors`. But the fullnode also rejects at the JSON-RPC
layer (ServerError -32002, "Transaction needs to be rebuilt ... object
unavailable for consumption"), which surfaces as `Err` from
`execute_transaction_block_with_effects` and bailed out before the
recovery code — the cached gas ref survived, so every
`retry_with_max_elapsed_time!` attempt rebuilt the byte-identical stale
tx and re-rejected, wedging checkpoint delivery to Sui for the full
one-hour window (observed: dwallet checkpoints stuck behind a gas coin
advanced by the shared publisher address in the test cluster, blocking
DKG settlement, mid-epoch reconfiguration, and epoch advance).

Factor the recovery into
`NotifierSubmitState::handle_possible_stale_gas_rejection` and apply it
on both paths. Note `IkaError` derives strum's `AsRefStr`, so
`err.as_ref()` yields only the variant name — match the
`SuiClientTxFailureGeneric` payload to get the actual message.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-gas recovery

Clippy's `unnecessary_to_owned` suggests `err.as_ref()` for
`&err.to_string()` here; it compiles because `IkaError` derives strum's
`AsRefStr`, but that returns only the variant name — never the rejection
markers — silently disabling the recovery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants