fix(sui-executor): notifier gas-handling and advance_epoch robustness under load#1734
Open
ycscaly wants to merge 6 commits into
Open
fix(sui-executor): notifier gas-handling and advance_epoch robustness under load#1734ycscaly wants to merge 6 commits into
ycscaly wants to merge 6 commits into
Conversation
The Sui executor rebuilt every checkpoint / epoch-advance tx with gas
fetched via get_gas_objects() from the notifier's fullnode. Under
checkpoint-heavy load the fullnode lags the validators by hundreds of
object versions, so the tx carried a stale gas-coin version and was
rejected ("transaction needs to be rebuilt because object ... version
..."); the retry kept re-reading the same lagging view until the
epoch-advance stalled.
Cache the gas ObjectRef carried by each tx's effects (the authoritative
post-tx version) in the serialized notifier submit state and build the
next tx against it, falling back to get_gas_objects only for the first
tx. Submission is serial (the lock is held across submit_tx_to_sui), so
the cached ref is always the exact current version when the next tx is
built.
Verified: the full dwallet-creation integration suite (8/8) passes with
sui_tx_err=0 throughout, where the pre-fix binary stalled at epoch 5
with 100+ gas-version rejections.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The notifier caches its gas-coin object ref from the previous tx's effects to skip the lagging fullnode on the steady-state path. But on a submission failure — a stale gas version rejected before execution, or missing effects — `submit_tx_to_sui` returned without clearing that cache. Since `run_epoch_switch` and the checkpoint handlers wrap each submission in `retry_with_max_elapsed_time!`, the retry re-read the same stale ref and was rejected again, spinning on the identical stale version for the full retry budget (an hour) and wedging epoch advance. Clear `gas_coins` on both failure paths so the retry re-fetches a fresh ref (from effects once a tx lands, else the fullnode). In a local cluster run this collapsed the rejected-version gap from ~177 (a pinned stale ref) to ~3 (residual fullnode lag) and let epoch processing progress. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refine the notifier's gas-coin handling so a stale-version rejection can
recover instead of spinning:
- Only drop the cached gas ref on a genuine stale-gas rejection
("unavailable for consumption" / "needs to be rebuilt"), not on every
pre-execution error — an unrelated failure leaves the (valid) cached
ref intact rather than forcing a fullnode re-fetch.
- On a stale-gas rejection, record the rejected version as a floor; the
next `get_gas_objects` re-fetch waits (up to 30s) for the notifier
fullnode to advance past it before trusting the result, instead of
re-serving the same stale version it just rejected.
- Clear the floor once a tx lands (the cached ref is authoritative again).
This handles a notifier whose gas coin is advanced by another holder of
its address and whose fullnode lags the validators. (The in-process test
cluster reproduces this by funding the notifier from the shared publisher
coin; production notifiers use a dedicated key.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t panic death-spiral The notifier submitted `request_advance_epoch` as soon as a quorum's `received_end_of_publish` snapshot was observed, but a network-key system session can start in the window after that snapshot (a `respond_*` on a network-key DKG/reconfiguration session chains a new `initiate_system_session`). The on-chain `sessions_manager::advance_epoch` then MoveAborts with `ENotAllCurrentEpochSessionsCompleted` (code 6); the hour-long retry burned out and `panic!`d the validator over a transient, self-clearing condition — dropping the committee below quorum mid-transition and risking a network-wide wedge. Re-check the on-chain completion predicate against just-synced state before submitting: hold the tick (debug-log the breakdown) if user or system sessions are still draining, mirroring the existing `epoch_not_locked` gate. The hour-long panic now only guards genuinely fatal submission errors. - ika-types: add `SessionsManager::all_current_epoch_sessions_completed` (Rust mirror of the Move assertion) + a unit test of the truth table, including the "system session started after end-of-publish" transient. - sui-executor: gate the `advance_epoch` submission on it. Validated live: under the heavy TS integration suite that previously crashed the network at epoch 7, the network advanced to epoch 10 with zero panics and zero thread-stall wedges. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ejections The stale-gas recovery (drop the cached gas ref, floor the re-fetch at the rejected version) only ran when the rejection arrived inside `tx_response.errors`. But the fullnode also rejects at the JSON-RPC layer (ServerError -32002, "Transaction needs to be rebuilt ... object unavailable for consumption"), which surfaces as `Err` from `execute_transaction_block_with_effects` and bailed out before the recovery code — the cached gas ref survived, so every `retry_with_max_elapsed_time!` attempt rebuilt the byte-identical stale tx and re-rejected, wedging checkpoint delivery to Sui for the full one-hour window (observed: dwallet checkpoints stuck behind a gas coin advanced by the shared publisher address in the test cluster, blocking DKG settlement, mid-epoch reconfiguration, and epoch advance). Factor the recovery into `NotifierSubmitState::handle_possible_stale_gas_rejection` and apply it on both paths. Note `IkaError` derives strum's `AsRefStr`, so `err.as_ref()` yields only the variant name — match the `SuiClientTxFailureGeneric` payload to get the actual message. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-gas recovery Clippy's `unnecessary_to_owned` suggests `err.as_ref()` for `&err.to_string()` here; it compiles because `IkaError` derives strum's `AsRefStr`, but that returns only the variant name — never the rejection markers — silently disabling the recovery. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Six commits, cherry-picked in their original order from
feat/ika-upgrade-test, making the notifier's Sui-submission path robust under load. They chain onsui_executor.rs, so they ship as one PR; each commit message carries its full story. The distinct fixes:593aef39): the executor rebuilt every checkpoint/epoch-advance tx with gas fetched from the notifier's fullnode, which lags the validators by hundreds of object versions under checkpoint-heavy load — stale gas version, rejection, retry against the same lagging view, stalled epoch advance. Cache the gasObjectReffrom each tx's effects (the authoritative post-tx version) and build the next tx against it.34f70b99): on a rejection the cache wasn't cleared, soretry_with_max_elapsed_time!spun on the identical stale ref for the full retry budget (an hour). Clear it on both failure paths.7cae3fee): recover when the notifier's gas coin is advanced by another holder of the key — re-fetch floored at the rejected version instead of spinning.request_advance_epochon session completion (caa2cde4): the notifier submittedrequest_advance_epochas soon as a quorum's checkpoint arrived; if just-synced state showed sessions still pending, the Move call aborted and the notifier died in a panic loop. Re-check the on-chain completion predicate before submitting. (Touchessystem_inner_v1.rsfor the predicate read.)01600b82): the recovery only triggered on effects-level failures; RPC-level rejections (the common shape for stale gas) bypassed it. Factor the recovery into one path covering both.err.as_ref()clippy trap (59b0db3e): clippy'sunnecessary_to_ownedsuggestion silently changes which error variant the stale-gas matcher sees; the comment prevents a "cleanup" from reintroducing the spin.Validated on
feat/ika-upgrade-test: the full dwallet-creation integration suite (8/8) passes withsui_tx_err=0, where the pre-fix binary stalled at epoch 5 with 100+ gas-version rejections; the cross-binary churn and literal v1.1.8 upgrade rehearsals are green on top of these.🤖 Generated with Claude Code