fix(turn-orchestrator): serialize concurrent approval wakes #198

Merged
ytallo merged 3 commits into main from fix/parallel-approval-race
May 29, 2026

Conversation

@ytallo
Contributor

@ytallo ytallo commented May 28, 2026

Fixes MOT-3434.

Summary

Two approval::resolve writes for one session fan out via the turn::on_approval state trigger into two concurrent turn::function_awaiting_approval wakes. The turn-step queue has no per-session ordering (TriggerAction.Enqueue takes only a queue name), and runTransition does an unguarded load → mutate → overwrite, so both wakes load the same parked turn_state, execute every call, and finalize the batch — double-running side-effecting functions (e.g. shell::run) and emitting duplicate function_execution_end / turn_end frames, which wedges the turn.
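
The unguarded load → mutate → overwrite described above can be illustrated with a small synchronous model. This is only a sketch: `TurnState`, `runWakes`, and the in-memory `store` are hypothetical stand-ins, not the orchestrator's real types, and the interleaving is forced by hand rather than produced by a real queue.

```typescript
// Hypothetical, simplified shape of a parked turn; not the real turn_state.
interface TurnState {
  status: 'parked' | 'done';
  calls: string[];
}

// Runs two wakes against one session and returns every call that was executed.
function runWakes(serialized: boolean): string[] {
  const store: Record<string, TurnState> = {
    s1: { status: 'parked', calls: ['call-1', 'call-2'] },
  };
  const executed: string[] = [];

  // Load returns a private snapshot, as a state read would.
  const load = (id: string): TurnState => ({
    status: store[id].status,
    calls: store[id].calls.slice(),
  });

  // One wake: stale-skip if the turn already moved on, else run and overwrite.
  const step = (snapshot: TurnState): void => {
    if (snapshot.status !== 'parked') return;
    for (const call of snapshot.calls) executed.push(call);
    store['s1'] = { status: 'done', calls: [] };
  };

  if (serialized) {
    // Each wake loads only after the previous one has written.
    step(load('s1'));
    step(load('s1'));
  } else {
    // Both wakes load the same parked snapshot before either writes.
    const a = load('s1');
    const b = load('s1');
    step(a);
    step(b);
  }
  return executed;
}
```

In this model `runWakes(false)` executes each call twice (four entries), while `runWakes(true)` executes each once because the second wake sees `done` and stale-skips.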

This adds a per-session lease built on the only atomic state primitive exposed (state::update increment): acquire when the prior holder count is 0, release by resetting it, and TTL-steal to recover a crashed holder. The function_awaiting_approval transition is gated behind it via runTransition({ serialize: true }); a contender that cannot acquire throws TransientError, so the durable queue retries after the holder releases and then stale-skips.
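
A minimal sketch of the acquire / release / TTL-steal flow, assuming an in-memory counter in place of `state::update` increment. `SessionLease`, `runSerialized`, the TTL value, and the injected `now` clock are illustrative assumptions; `TransientError` is defined locally because its real definition is not shown here.

```typescript
// Local stand-in; the real TransientError lives in the harness.
class TransientError extends Error {}

const LEASE_TTL_MS = 30000; // assumed value, for illustration only

class SessionLease {
  private holders: Record<string, number> = {};
  private acquiredAt: Record<string, number> = {};

  // Stand-in for the atomic increment: returns the count before the bump.
  private bump(sessionId: string): number {
    const prior = this.holders[sessionId] || 0;
    this.holders[sessionId] = prior + 1;
    return prior;
  }

  tryAcquire(sessionId: string, now: number): boolean {
    // Fast path: nobody held the lease before this bump.
    if (this.bump(sessionId) === 0) {
      this.acquiredAt[sessionId] = now;
      return true;
    }
    // Contended: steal only if the recorded holder is older than the TTL.
    const at = this.acquiredAt[sessionId];
    if (typeof at === 'number' && now - at > LEASE_TTL_MS) {
      this.holders[sessionId] = 0;
      if (this.bump(sessionId) === 0) {
        this.acquiredAt[sessionId] = now;
        return true;
      }
    }
    return false;
  }

  // Release by resetting the holder count.
  release(sessionId: string): void {
    this.holders[sessionId] = 0;
  }
}

// Gate a transition behind the lease; contention surfaces as TransientError
// so that a retrying queue can run it again later.
function runSerialized<T>(
  lease: SessionLease,
  sessionId: string,
  now: number,
  transition: () => T,
): T {
  if (!lease.tryAcquire(sessionId, now)) {
    throw new TransientError(`session ${sessionId} is busy`);
  }
  try {
    return transition();
  } finally {
    lease.release(sessionId);
  }
}
```

Passing `now` explicitly keeps the sketch deterministic; a real implementation would presumably read the clock itself.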

Also reverts the earlier approve-always parked-sibling change, which targeted a scenario that did not reproduce the reported bug.

Test plan

  • npx vitest run tests/integration/parallel-approval.e2e.test.ts — new parallel test green (deterministic across repeated runs)
  • npx vitest run — full suite green
  • npx tsc -b --noEmit — clean

Follow-up

The engine has a native TTL + owner lock (try_acquire_lock / release_lock in the kv builtin) used by the cron worker but not exposed as a state::* function. Exposing state::acquire_lock / release_lock would be a cleaner primitive than the increment-based lease.
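
For comparison, an owner-aware lock with a TTL might look roughly like the sketch below. The engine's actual `try_acquire_lock` / `release_lock` signatures are not shown in this PR, so `OwnerLock`, its method names, and its parameters are assumptions chosen only to show why an owner check makes release and expiry safer than a bare counter.

```typescript
// Hypothetical in-memory model; not the kv builtin's real API.
interface LockEntry {
  owner: string;
  expiresAt: number;
}

class OwnerLock {
  private entries: Record<string, LockEntry | undefined> = {};

  // Succeeds if the key is free, expired, or already held by the same owner.
  tryAcquire(key: string, owner: string, ttlMs: number, now: number): boolean {
    const current = this.entries[key];
    if (current !== undefined && current.expiresAt > now && current.owner !== owner) {
      return false;
    }
    this.entries[key] = { owner, expiresAt: now + ttlMs };
    return true;
  }

  // Only the recorded owner may release, so a late holder cannot free
  // a lock that has since passed to someone else.
  release(key: string, owner: string): boolean {
    const current = this.entries[key];
    if (current === undefined || current.owner !== owner) return false;
    delete this.entries[key];
    return true;
  }
}
```

Because ownership and expiry live in one record, there is no separate timestamp that can go stale relative to the holder count.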

Summary by CodeRabbit

  • New Features

    • Added session-level serialization for transition execution to prevent duplicate concurrent runs.
  • Bug Fixes

    • Approval resolution now strictly uses persisted decisions; removed prior fallback derivation from settings.
    • Consultation logic refined to respect per-session allow rules before policy checks.
  • Tests

    • Integration harness retries transient failures; added parallel approval test covering concurrent resolutions.
  • Chores

    • Minor test and script typing/import cleanups.

@vercel

vercel Bot commented May 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| workers | Ready | Preview, Comment | May 28, 2026 8:07pm |

@coderabbitai

coderabbitai Bot commented May 28, 2026

Warning

Review limit reached

@ytallo, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 17 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 54677257-d09b-4013-bc80-e33cc4ef0d55

📥 Commits

Reviewing files that changed from the base of the PR and between 792876c and 13a7c33.

📒 Files selected for processing (1)
  • harness/tests/integration/parallel-approval.e2e.test.ts
📝 Walkthrough

Walkthrough

This PR removes lazy settings-based approval verdicts, moves explicit per-session approval checks into the hook, adds optional serialized turn transitions guarded by a per-session lease, and updates the harness and tests to exercise concurrent approval resolution with retry semantics.

Changes

Approval Decision Refactoring and Transition Serialization

| Layer / File(s) | Summary |
| --- | --- |
| **Session lease mutual exclusion for turn transitions** `harness/src/turn-orchestrator/state-runtime/session-lease.ts` | New session lease module provides per-session mutual exclusion for turn FSM transitions with atomic counter bumps, immediate/stale-lease-stealing acquisition paths, and TTL-based crash-recovery semantics. |
| **Transition serialization with optional per-session leasing** `harness/src/turn-orchestrator/run-transition.ts`, `harness/src/turn-orchestrator/function-awaiting-approval/process.ts` | Adds RunTransitionOptions type with optional serialize flag to runTransition. When enabled, a per-session lease is acquired and released around the transition; contention throws TransientError for durable queue retry. Core transition flow moved into runTransitionInner. function_awaiting_approval marked with serialize: true. |
| **Hook decision logic consolidation from settings verdict** `harness/src/turn-orchestrator/hook.ts` | Moves approval decision checks into consultBefore: mode === 'full' allows immediately; entries in approved_always allow regardless of mode; mode === 'auto' allows only when function is in always_allow; otherwise falls through to policy check_permissions. |
| **Function-awaiting-approval: remove settings port, consolidate decisions** `harness/src/turn-orchestrator/function-awaiting-approval/ports.ts`, `harness/src/turn-orchestrator/function-awaiting-approval/run.ts` | Removes readSettings from AwaitingApprovalPorts and its implementation. Decision resolution now relies solely on readDecision(session_id, function_call_id); entries with no persisted decision are skipped (no settings-derived fallback). |
| **Integration harness: retry loop, atomic state updates, serialization support** `harness/tests/integration/parallel-approval-harness.ts` | Adds runTurnStepWithRetry that retries on TransientError with microtask flushes, marks turn::function_awaiting_approval with serialize:true, refactors state::update into atomic per-(scope,key) read-modify-write supporting increments, and routes actionful turn::... triggers through the retry wrapper. |
| **E2E test: replace approveAlways coverage with concurrent approval resolution** `harness/tests/integration/parallel-approval.e2e.test.ts` | Removes approveAlways-based test and adds a concurrency-focused test that resolves two approvals in parallel, asserting each call executes once and only one turn_end is emitted. |
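
The hook decision order summarized for `hook.ts` above can be sketched as a pure function. The settings shape, the `Verdict` type, and the name `consultBeforeSketch` are assumptions made for illustration; only the ordering of the checks is taken from the summary.

```typescript
// Hypothetical settings shape; the real one is not shown in this walkthrough.
interface ApprovalSettings {
  mode: string; // 'full' and 'auto' are the modes named in the summary
  approved_always: string[];
  always_allow: string[];
}

type Verdict = 'allow' | 'check_policy';

function consultBeforeSketch(settings: ApprovalSettings, functionId: string): Verdict {
  // 1. Full mode allows immediately.
  if (settings.mode === 'full') return 'allow';
  // 2. Per-session "approved always" entries allow regardless of mode.
  if (settings.approved_always.indexOf(functionId) !== -1) return 'allow';
  // 3. Auto mode allows only functions listed in always_allow.
  if (settings.mode === 'auto' && settings.always_allow.indexOf(functionId) !== -1) {
    return 'allow';
  }
  // 4. Everything else falls through to the policy permission check.
  return 'check_policy';
}
```

Here `'check_policy'` stands in for the fall-through to `check_permissions`, whose own result is outside the scope of this sketch.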

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • iii-hq/workers#174: Adds the runTransition abstraction that this PR extends with serialization and per-session lease behavior.
  • iii-hq/workers#195: Introduces approval settings and "approve always" infrastructure that this PR refactors by removing lazy verdict computation.

Suggested reviewers

  • sergiofilhowz

Poem

🐰 Leases set, no races now,
Two approvals race and bow,
Hooks decide with clearer sight,
Retries hum till order's right,
One turn_end, the batch says wow!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.25% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding serialization (via a per-session lease) to prevent concurrent approval wakes from executing duplicate transitions and side effects. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions
Contributor

github-actions Bot commented May 28, 2026

skill-check — worker

0 verified, 13 skipped (no docs/).

Layer Result
structure
vale
ai
render

Note

17 stale rendered artifact(s) detected on main, unrelated to this PR. This PR is fine; the drift was already there. A maintainer should open a chore PR to re-render these.

  • shell/README.md
  • shell/skill.md
  • shell/skills/chmod.md
  • shell/skills/exec.md
  • shell/skills/exec_bg.md
  • shell/skills/grep.md
  • shell/skills/kill.md
  • shell/skills/list.md
  • shell/skills/ls.md
  • shell/skills/mkdir.md
  • shell/skills/mv.md
  • shell/skills/read.md
  • …and 5 more (see the workflow logs)

@ytallo ytallo changed the base branch from feat/function-approval-modes to main May 28, 2026 16:51
…roval races

Two approval::resolve writes for one session fan out via turn::on_approval
into concurrent turn::function_awaiting_approval wakes. The turn-step queue
has no per-session ordering (Enqueue takes only a queue name), so both wakes
loaded the same parked turn_state, executed every call, and finalized the
batch — running side-effecting functions twice and emitting duplicate
function_execution_end / turn_end frames, which wedged the turn.

Add a per-session lease built on the only atomic primitive the state worker
exposes (state::update increment, a locked read-modify-write): acquire when
the prior holder count is 0, release by resetting it, and let a contender
steal a lease older than the TTL to recover from a crashed holder. Gate the
function_awaiting_approval transition behind it via runTransition({serialize}).
A contender that cannot acquire throws TransientError, so the durable queue
retries it after the holder releases and it then stale-skips.

Also revert the earlier approve-always parked-sibling change, which targeted a
different scenario that did not reproduce the reported bug.

Cover the race with a parallel-approval e2e test; the harness now models
state::update as a faithful atomic increment and the queue's TransientError
retry.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
harness/src/turn-orchestrator/function-awaiting-approval/run.ts (1)

63-66: ⚡ Quick win

Replace the non-null assertion with an explicit invariant check.

Line 66 assumes every awaiting entry still has a matching prepared call. If that invariant is ever broken, this wake fails with a runtime throw, and the current Biome noNonNullAssertion warning remains unresolved.

♻️ Proposed change
-    const current = work.prepared.find((p) => p.call.id === callId)!;
+    const current = work.prepared.find((p) => p.call.id === callId);
+    if (!current) {
+      throw new Error(
+        `Invariant violated: missing prepared call ${callId} for session ${rec.session_id}`,
+      );
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@harness/src/turn-orchestrator/function-awaiting-approval/run.ts` around lines
63 - 66, The code uses a non-null assertion on the result of
work.prepared.find((p) => p.call.id === callId) when assigning current; replace
this with an explicit invariant check: capture the find result (e.g., const
current = ... without !), then if current is undefined either log a clear error
including rec.session_id and callId and continue, or throw a descriptive Error
to fail fast; update any downstream assumptions to use the checked variable
instead of relying on the non-null assertion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f72f48a3-c233-40c7-af45-00135d056315

📥 Commits

Reviewing files that changed from the base of the PR and between 7a7485f and b8f2b28.

📒 Files selected for processing (9)
  • harness/src/approval-gate/settings/verdict.ts
  • harness/src/turn-orchestrator/function-awaiting-approval/ports.ts
  • harness/src/turn-orchestrator/function-awaiting-approval/process.ts
  • harness/src/turn-orchestrator/function-awaiting-approval/run.ts
  • harness/src/turn-orchestrator/hook.ts
  • harness/src/turn-orchestrator/run-transition.ts
  • harness/src/turn-orchestrator/state-runtime/session-lease.ts
  • harness/tests/integration/parallel-approval-harness.ts
  • harness/tests/integration/parallel-approval.e2e.test.ts
💤 Files with no reviewable changes (2)
  • harness/src/approval-gate/settings/verdict.ts
  • harness/src/turn-orchestrator/function-awaiting-approval/ports.ts

Comment on lines +50 to +60
if ((await bumpHolders(iii, session_id)) === 0) {
  await stateSet(iii, LEASE_AT_SCOPE, session_id, Date.now());
  return true;
}
// Contended — recover a lease abandoned by a crashed holder.
const acquiredAt = await stateGet(iii, LEASE_AT_SCOPE, session_id);
if (typeof acquiredAt === 'number' && Date.now() - acquiredAt > LEASE_TTL_MS) {
  await stateSet(iii, LEASE_SCOPE, session_id, 0);
  if ((await bumpHolders(iii, session_id)) === 0) {
    await stateSet(iii, LEASE_AT_SCOPE, session_id, Date.now());
    return true;

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Split-brain lease acquisition is still possible in the normal (non-crash) path.

At Line 50, the holder counter is acquired before the new acquire timestamp is published (Line 51). A concurrent contender can then hit the stale-steal branch at Lines 56-59 using an old LEASE_AT_SCOPE value and acquire too, so two workers may proceed concurrently.

This reintroduces duplicate transition execution risk. The lease needs ownership fencing (e.g., token/epoch validation on acquire/steal/release) or native KV lock primitives to avoid this race.
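
One interleaving that would produce this split-brain can be traced with a hand-stepped model. It is a sketch under stated assumptions: the maps are in-memory stand-ins for the two lease scopes, the TTL value is invented, and it assumes release resets the holder count but leaves the previous acquire timestamp in place (the excerpt does not show whether the real release clears it).

```typescript
interface SplitBrainResult {
  workerAHolds: boolean;
  workerBHolds: boolean;
}

function simulateSplitBrain(): SplitBrainResult {
  const ttlMs = 30000; // assumed TTL; the real value is not shown in the excerpt
  const session = 'session-1';
  const holders: Record<string, number> = {};
  const acquiredAt: Record<string, number> = {};

  // Atomic increment stand-in: returns the count before the bump.
  const bumpHolders = (id: string): number => {
    const prior = holders[id] || 0;
    holders[id] = prior + 1;
    return prior;
  };

  // Earlier lease cycle at t=0: acquired, then released by resetting the count.
  // Assumption: the old timestamp survives the release.
  bumpHolders(session);
  acquiredAt[session] = 0;
  holders[session] = 0;

  const now = 60000; // well past the TTL relative to the old timestamp

  // Worker A, step 1: wins the counter but has not yet published its timestamp.
  const workerAHolds = bumpHolders(session) === 0;

  // Worker B runs entirely inside that gap.
  let workerBHolds = bumpHolders(session) === 0; // false, A already bumped
  const seen = acquiredAt[session]; // still the old value from t=0
  if (!workerBHolds && typeof seen === 'number' && now - seen > ttlMs) {
    holders[session] = 0; // steals what looks like an abandoned lease
    workerBHolds = bumpHolders(session) === 0;
    if (workerBHolds) acquiredAt[session] = now;
  }

  // Worker A, step 2: publishes its timestamp, too late to stop B.
  if (workerAHolds) acquiredAt[session] = now;

  return { workerAHolds, workerBHolds };
}
```

In this trace both flags come back `true`: worker B judges staleness from a timestamp that belongs to an earlier holder, which is the gap a token or epoch check on steal would be meant to close.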

Comment on lines +97 to 109
  for (let attempt = 0; attempt < 100; attempt += 1) {
    try {
      await runTurnStep(iii, function_id, session_id);
      return;
    } catch (err) {
      if (err instanceof TransientError) {
        await flushMicrotasks();
        continue;
      }
      throw err;
    }
  }
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't silently succeed after retry exhaustion.

If all 100 attempts hit TransientError, this helper currently falls through and returns undefined. That drops the wake instead of surfacing a failed retry cycle, which can hide lease/retry regressions and make the harness pass for the wrong reason.

Proposed fix
 async function runTurnStepWithRetry(
   iii: ISdk,
   function_id: string,
   session_id: string,
 ): Promise<void> {
+  let lastTransient: TransientError | null = null;
   for (let attempt = 0; attempt < 100; attempt += 1) {
     try {
       await runTurnStep(iii, function_id, session_id);
       return;
     } catch (err) {
       if (err instanceof TransientError) {
+        lastTransient = err;
         await flushMicrotasks();
         continue;
       }
       throw err;
     }
   }
+  throw lastTransient ?? new Error(`retry budget exhausted for ${function_id}`);
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@harness/tests/integration/parallel-approval-harness.ts` around lines 97 -
109, The retry loop around runTurnStep silently returns undefined when all 100
attempts throw TransientError; change it so that after the loop completes it
does not fall through but throws a clear error (either rethrow the last caught
error or throw a new Error like "Retries exhausted" that includes the last
error) so the harness surfaces a failed retry cycle; update the for-loop
handling in the function that calls runTurnStep and references TransientError
and flushMicrotasks to capture the last err in the catch and throw it after the
loop instead of returning.

Drop genuinely-unused symbols flagged by biome 2.4.10 (correctness rules,
fail the harness lint check):
- copy-assets.mjs: unused `stat` import
- session/tree/register.ts: unused `activePath as _activePath` import
- session/tree/store.test.ts: unused `T` generic type parameter

Unblocks CI; these were already red on main, unrelated to the approval-race fix.
@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

Collapse a hand-written multi-line expect() to biome's canonical layout;
this format mismatch was the single error failing the harness lint+test CI.
@ytallo ytallo merged commit 1a51300 into main May 29, 2026
14 of 15 checks passed

2 participants