fix(turn-orchestrator): serialize concurrent approval wakes#198
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Warning Review limit reached
More reviews will be available in 52 minutes and 17 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR removes lazy settings-based approval verdicts, moves explicit per-session approval checks into the hook, adds optional serialized turn transitions guarded by a per-session lease, and updates the harness and tests to exercise concurrent approval resolution with retry semantics. ChangesApproval Decision Refactoring and Transition Serialization
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
skill-check — worker0 verified, 13 skipped (no docs/).
Note 17 stale rendered artifact(s) detected on main, unrelated to this PR. This PR is fine; the drift was already there. A maintainer should open a chore PR to re-render these.
|
…roval races
Two approval::resolve writes for one session fan out via turn::on_approval
into concurrent turn::function_awaiting_approval wakes. The turn-step queue
has no per-session ordering (Enqueue takes only a queue name), so both wakes
loaded the same parked turn_state, executed every call, and finalized the
batch — running side-effecting functions twice and emitting duplicate
function_execution_end / turn_end frames, which wedged the turn.
Add a per-session lease built on the only atomic primitive the state worker
exposes (state::update increment, a locked read-modify-write): acquire when
the prior holder count is 0, release by resetting it, and let a contender
steal a lease older than the TTL to recover from a crashed holder. Gate the
function_awaiting_approval transition behind it via runTransition({serialize}).
A contender that cannot acquire throws TransientError, so the durable queue
retries it after the holder releases and it then stale-skips.
Also revert the earlier approve-always parked-sibling change, which targeted a
different scenario that did not reproduce the reported bug.
Cover the race with a parallel-approval e2e test; the harness now models
state::update as a faithful atomic increment and the queue's TransientError
retry.
1458240 to
b8f2b28
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
harness/src/turn-orchestrator/function-awaiting-approval/run.ts (1)
63-66: ⚡ Quick winReplace the non-null assertion with an explicit invariant check.
Line 66 assumes every awaiting entry still has a matching prepared call. If that invariant is ever broken, this wake fails with a runtime throw, and the current Biome
noNonNullAssertionwarning remains unresolved.♻️ Proposed change
- const current = work.prepared.find((p) => p.call.id === callId)!; + const current = work.prepared.find((p) => p.call.id === callId); + if (!current) { + throw new Error( + `Invariant violated: missing prepared call ${callId} for session ${rec.session_id}`, + ); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@harness/src/turn-orchestrator/function-awaiting-approval/run.ts` around lines 63 - 66, The code uses a non-null assertion on the result of work.prepared.find((p) => p.call.id === callId) when assigning current; replace this with an explicit invariant check: capture the find result (e.g., const current = ... without !), then if current is undefined either log a clear error including rec.session_id and callId and continue, or throw a descriptive Error to fail fast; update any downstream assumptions to use the checked variable instead of relying on the non-null assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@harness/tests/integration/parallel-approval-harness.ts`:
- Around line 97-109: The retry loop around runTurnStep silently returns
undefined when all 100 attempts throw TransientError; change it so that after
the loop completes it does not fall through but throws a clear error (either
rethrow the last caught error or throw a new Error like "Retries exhausted" that
includes the last error) so the harness surfaces a failed retry cycle; update
the for-loop handling in the function that calls runTurnStep and references
TransientError and flushMicrotasks to capture the last err in the catch and
throw it after the loop instead of returning.
---
Nitpick comments:
In `@harness/src/turn-orchestrator/function-awaiting-approval/run.ts`:
- Around line 63-66: The code uses a non-null assertion on the result of
work.prepared.find((p) => p.call.id === callId) when assigning current; replace
this with an explicit invariant check: capture the find result (e.g., const
current = ... without !), then if current is undefined either log a clear error
including rec.session_id and callId and continue, or throw a descriptive Error
to fail fast; update any downstream assumptions to use the checked variable
instead of relying on the non-null assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f72f48a3-c233-40c7-af45-00135d056315
📒 Files selected for processing (9)
harness/src/approval-gate/settings/verdict.tsharness/src/turn-orchestrator/function-awaiting-approval/ports.tsharness/src/turn-orchestrator/function-awaiting-approval/process.tsharness/src/turn-orchestrator/function-awaiting-approval/run.tsharness/src/turn-orchestrator/hook.tsharness/src/turn-orchestrator/run-transition.tsharness/src/turn-orchestrator/state-runtime/session-lease.tsharness/tests/integration/parallel-approval-harness.tsharness/tests/integration/parallel-approval.e2e.test.ts
💤 Files with no reviewable changes (2)
- harness/src/approval-gate/settings/verdict.ts
- harness/src/turn-orchestrator/function-awaiting-approval/ports.ts
| if ((await bumpHolders(iii, session_id)) === 0) { | ||
| await stateSet(iii, LEASE_AT_SCOPE, session_id, Date.now()); | ||
| return true; | ||
| } | ||
| // Contended — recover a lease abandoned by a crashed holder. | ||
| const acquiredAt = await stateGet(iii, LEASE_AT_SCOPE, session_id); | ||
| if (typeof acquiredAt === 'number' && Date.now() - acquiredAt > LEASE_TTL_MS) { | ||
| await stateSet(iii, LEASE_SCOPE, session_id, 0); | ||
| if ((await bumpHolders(iii, session_id)) === 0) { | ||
| await stateSet(iii, LEASE_AT_SCOPE, session_id, Date.now()); | ||
| return true; |
There was a problem hiding this comment.
Split-brain lease acquisition is still possible in the normal (non-crash) path.
At Line 50, the holder counter is acquired before the new acquire timestamp is published (Line 51). A concurrent contender can then hit the stale-steal branch at Lines 56-59 using an old LEASE_AT_SCOPE value and acquire too, so two workers may proceed concurrently.
This reintroduces duplicate transition execution risk. The lease needs ownership fencing (e.g., token/epoch validation on acquire/steal/release) or native KV lock primitives to avoid this race.
| for (let attempt = 0; attempt < 100; attempt += 1) { | ||
| try { | ||
| await runTurnStep(iii, function_id, session_id); | ||
| return; | ||
| } catch (err) { | ||
| if (err instanceof TransientError) { | ||
| await flushMicrotasks(); | ||
| continue; | ||
| } | ||
| throw err; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Don't silently succeed after retry exhaustion.
If all 100 attempts hit TransientError, this helper currently falls through and returns undefined. That drops the wake instead of surfacing a failed retry cycle, which can hide lease/retry regressions and make the harness pass for the wrong reason.
Proposed fix
async function runTurnStepWithRetry(
iii: ISdk,
function_id: string,
session_id: string,
): Promise<void> {
+ let lastTransient: TransientError | null = null;
for (let attempt = 0; attempt < 100; attempt += 1) {
try {
await runTurnStep(iii, function_id, session_id);
return;
} catch (err) {
if (err instanceof TransientError) {
+ lastTransient = err;
await flushMicrotasks();
continue;
}
throw err;
}
}
+ throw lastTransient ?? new Error(`retry budget exhausted for ${function_id}`);
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@harness/tests/integration/parallel-approval-harness.ts` around lines 97 -
109, The retry loop around runTurnStep silently returns undefined when all 100
attempts throw TransientError; change it so that after the loop completes it
does not fall through but throws a clear error (either rethrow the last caught
error or throw a new Error like "Retries exhausted" that includes the last
error) so the harness surfaces a failed retry cycle; update the for-loop
handling in the function that calls runTurnStep and references TransientError
and flushMicrotasks to capture the last err in the catch and throw it after the
loop instead of returning.
Drop genuinely-unused symbols flagged by biome 2.4.10 (correctness rules, fail the harness lint check): - copy-assets.mjs: unused `stat` import - session/tree/register.ts: unused `activePath as _activePath` import - session/tree/store.test.ts: unused `T` generic type parameter Unblocks CI; these were already red on main, unrelated to the approval-race fix.
|
Actionable comments posted: 0 |
Collapse a hand-written multi-line expect() to biome's canonical layout; this format mismatch was the single error failing the harness lint+test CI.
Fixes MOT-3434.
Summary
Two
approval::resolvewrites for one session fan out via theturn::on_approvalstate trigger into two concurrentturn::function_awaiting_approvalwakes. Theturn-stepqueue has no per-session ordering (TriggerAction.Enqueuetakes only a queue name), andrunTransitiondoes an unguarded load → mutate → overwrite, so both wakes load the same parkedturn_state, execute every call, and finalize the batch — double-running side-effecting functions (e.g.shell::run) and emitting duplicatefunction_execution_end/turn_endframes, which wedges the turn.This adds a per-session lease built on the only atomic state primitive exposed (
state::updateincrement): acquire when the prior holder count is 0, release by resetting it, and TTL-steal to recover a crashed holder. Thefunction_awaiting_approvaltransition is gated behind it viarunTransition({ serialize: true }); a contender that cannot acquire throwsTransientError, so the durable queue retries after the holder releases and then stale-skips.Also reverts the earlier approve-always parked-sibling change, which targeted a scenario that did not reproduce the reported bug.
Test plan
npx vitest run tests/integration/parallel-approval.e2e.test.ts— new parallel test green (deterministic across repeated runs)npx vitest run— full suite greennpx tsc -b --noEmit— cleanFollow-up
The engine has a native TTL + owner lock (
try_acquire_lock/release_lockin the kv builtin) used by the cron worker but not exposed as astate::*function. Exposingstate::acquire_lock/release_lockwould be a cleaner primitive than the increment-based lease.Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores