fix(turn-orchestrator): serialize concurrent approval wakes by ytallo · Pull Request #198 · iii-hq/workers

ytallo · 2026-05-28T16:34:05Z

Fixes MOT-3434.

Summary

Two approval::resolve writes for one session fan out via the turn::on_approval state trigger into two concurrent turn::function_awaiting_approval wakes. The turn-step queue has no per-session ordering (TriggerAction.Enqueue takes only a queue name), and runTransition does an unguarded load → mutate → overwrite, so both wakes load the same parked turn_state, execute every call, and finalize the batch — double-running side-effecting functions (e.g. shell::run) and emitting duplicate function_execution_end / turn_end frames, which wedges the turn.
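
To make the failure mode concrete, here is a minimal, self-contained sketch of the unguarded load → mutate → overwrite pattern with two interleaved wakes. The `TurnState` shape and the `load` / `mutateAndOverwrite` helpers are hypothetical stand-ins for illustration, not the orchestrator's real types or functions.

```typescript
// Hypothetical, simplified model of a parked turn (not the harness's real types).
type TurnState = { status: 'awaiting_approval' | 'done'; pendingCalls: string[] };

const store = new Map<string, TurnState>();
const executed: string[] = [];

// Step 1 of a wake: read the persisted state and keep a private snapshot.
function load(sessionId: string): TurnState {
  const s = store.get(sessionId);
  if (!s) throw new Error(`no turn_state for ${sessionId}`);
  return { status: s.status, pendingCalls: [...s.pendingCalls] };
}

// Steps 2 and 3: act on the snapshot, then blindly overwrite the stored state.
function mutateAndOverwrite(sessionId: string, snapshot: TurnState): void {
  if (snapshot.status !== 'awaiting_approval') return; // stale-skip
  for (const call of snapshot.pendingCalls) executed.push(call); // side effect
  store.set(sessionId, { status: 'done', pendingCalls: [] });
}

store.set('s1', {
  status: 'awaiting_approval',
  pendingCalls: ['shell::run#1', 'shell::run#2'],
});

// Two wakes interleave: both load before either one overwrites.
const wakeA = load('s1');
const wakeB = load('s1');
mutateAndOverwrite('s1', wakeA);
mutateAndOverwrite('s1', wakeB);

console.log(executed.length); // 4: each of the two calls ran twice
```

Both wakes see `awaiting_approval` in their own snapshot, so the stale-skip check never fires and every call runs twice.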

This adds a per-session lease built on the only atomic state primitive exposed (state::update increment): acquire when the prior holder count is 0, release by resetting it, and TTL-steal to recover a crashed holder. The function_awaiting_approval transition is gated behind it via runTransition({ serialize: true }); a contender that cannot acquire throws TransientError, so the durable queue retries after the holder releases and then stale-skips.
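
The lease described above can be sketched roughly as follows, using in-memory maps in place of the state worker. This is an illustrative approximation only: the maps, the synchronous helpers, the explicit `now` parameter, `runSerialized`, and the `LEASE_TTL_MS` value are all assumptions, and the sketch does not try to address ordering between the counter bump and the timestamp write.

```typescript
class TransientError extends Error {}

// In-memory stand-ins for the state scopes (assumption for illustration).
const holders = new Map<string, number>();
const acquiredAt = new Map<string, number>();
const LEASE_TTL_MS = 30_000; // assumed value; the real TTL is not stated here

// Models an atomic increment that returns the count *before* the bump.
function bumpHolders(sessionId: string): number {
  const prior = holders.get(sessionId) ?? 0;
  holders.set(sessionId, prior + 1);
  return prior;
}

function tryAcquire(sessionId: string, now: number): boolean {
  // Uncontended path: the prior holder count was 0, so we own the lease.
  if (bumpHolders(sessionId) === 0) {
    acquiredAt.set(sessionId, now);
    return true;
  }
  // Contended path: steal only if the current lease is older than the TTL.
  const at = acquiredAt.get(sessionId);
  if (at !== undefined && now - at > LEASE_TTL_MS) {
    holders.set(sessionId, 0);
    if (bumpHolders(sessionId) === 0) {
      acquiredAt.set(sessionId, now);
      return true;
    }
  }
  return false;
}

// Release by resetting the holder count.
function release(sessionId: string): void {
  holders.set(sessionId, 0);
}

// Gate a transition behind the lease; a contender signals "retry later".
function runSerialized<T>(sessionId: string, now: number, body: () => T): T {
  if (!tryAcquire(sessionId, now)) {
    throw new TransientError(`session ${sessionId} is busy`);
  }
  try {
    return body();
  } finally {
    release(sessionId);
  }
}

// Usage: the nested call models a second wake arriving while the first holds the lease.
runSerialized('demo', Date.now(), () => {
  try {
    runSerialized('demo', Date.now(), () => undefined);
  } catch (err) {
    console.log((err as Error).message);
  }
});
```

In this model the contender fails fast instead of waiting, which matches the description of leaving the retry to the durable queue.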

Also reverts the earlier approve-always parked-sibling change, which targeted a scenario that did not reproduce the reported bug.

Test plan

npx vitest run tests/integration/parallel-approval.e2e.test.ts — new parallel test green (deterministic across repeated runs)
npx vitest run — full suite green
npx tsc -b --noEmit — clean

Follow-up

The engine has a native TTL + owner lock (try_acquire_lock / release_lock in the kv builtin) used by the cron worker but not exposed as a state::* function. Exposing state::acquire_lock / release_lock would be a cleaner primitive than the increment-based lease.

Summary by CodeRabbit

New Features
- Added session-level serialization for transition execution to prevent duplicate concurrent runs.
Bug Fixes
- Approval resolution now strictly uses persisted decisions; removed prior fallback derivation from settings.
- Consultation logic refined to respect per-session allow rules before policy checks.
Tests
- Integration harness retries transient failures; added parallel approval test covering concurrent resolutions.
Chores
- Minor test and script typing/import cleanups.

vercel · 2026-05-28T16:34:11Z

Project	Deployment	Actions	Updated (UTC)
workers	Ready	Preview, Comment	May 28, 2026 8:07pm

coderabbitai · 2026-05-28T16:34:14Z

📥 Commits

Reviewing files that changed from the base of the PR and between 792876c and 13a7c33.

📒 Files selected for processing (1)

harness/tests/integration/parallel-approval.e2e.test.ts

📝 Walkthrough

Walkthrough

This PR removes lazy settings-based approval verdicts, moves explicit per-session approval checks into the hook, adds optional serialized turn transitions guarded by a per-session lease, and updates the harness and tests to exercise concurrent approval resolution with retry semantics.

Changes

Approval Decision Refactoring and Transition Serialization

Layer / File(s)	Summary
Session lease mutual exclusion for turn transitions `harness/src/turn-orchestrator/state-runtime/session-lease.ts`	New session lease module provides per-session mutual exclusion for turn FSM transitions with atomic counter bumps, immediate/stale-lease-stealing acquisition paths, and TTL-based crash-recovery semantics.
Transition serialization with optional per-session leasing `harness/src/turn-orchestrator/run-transition.ts`, `harness/src/turn-orchestrator/function-awaiting-approval/process.ts`	Adds `RunTransitionOptions` type with optional `serialize` flag to `runTransition`. When enabled, a per-session lease is acquired and released around the transition; contention throws `TransientError` for durable queue retry. Core transition flow moved into `runTransitionInner`. `function_awaiting_approval` marked with `serialize: true`.
Hook decision logic consolidation from settings verdict `harness/src/turn-orchestrator/hook.ts`	Moves approval decision checks into `consultBefore`: `mode === 'full'` allows immediately; entries in `approved_always` allow regardless of mode; `mode === 'auto'` allows only when function is in `always_allow`; otherwise falls through to policy `check_permissions`.
Function-awaiting-approval: remove settings port, consolidate decisions `harness/src/turn-orchestrator/function-awaiting-approval/ports.ts`, `harness/src/turn-orchestrator/function-awaiting-approval/run.ts`	Removes `readSettings` from `AwaitingApprovalPorts` and its implementation. Decision resolution now relies solely on `readDecision(session_id, function_call_id)`; entries with no persisted decision are skipped (no settings-derived fallback).
Integration harness: retry loop, atomic state updates, serialization support `harness/tests/integration/parallel-approval-harness.ts`	Adds `runTurnStepWithRetry` that retries on `TransientError` with microtask flushes, marks `turn::function_awaiting_approval` with `serialize:true`, refactors `state::update` into atomic per-(scope,key) read-modify-write supporting increments, and routes actionful `turn::...` triggers through the retry wrapper.
E2E test: replace approveAlways coverage with concurrent approval resolution `harness/tests/integration/parallel-approval.e2e.test.ts`	Removes `approveAlways`-based test and adds a concurrency-focused test that resolves two approvals in parallel, asserting each call executes once and only one `turn_end` is emitted.
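
The decision order summarized for `hook.ts` in the table above can be read as the following sketch. The `Settings` shape, the `Verdict` type, and the function name `consultBeforeSketch` are assumptions made for illustration; in particular, where `approved_always` and `always_allow` actually live is not shown in this PR page, so they are lumped into one object here.

```typescript
// Hypothetical shapes; the real hook's types are not shown in this PR page.
type Settings = {
  mode: string; // the summary names 'full' and 'auto'; other values are not specified
  always_allow: string[]; // functions allowed when mode is 'auto'
  approved_always: string[]; // per-session "approve always" entries
};

type Verdict = 'allow' | 'check_permissions';

function consultBeforeSketch(functionId: string, s: Settings): Verdict {
  // 1. Full mode allows immediately.
  if (s.mode === 'full') return 'allow';
  // 2. Per-session approved_always entries allow regardless of mode.
  if (s.approved_always.includes(functionId)) return 'allow';
  // 3. Auto mode allows only functions listed in always_allow.
  if (s.mode === 'auto' && s.always_allow.includes(functionId)) return 'allow';
  // 4. Everything else falls through to the policy check.
  return 'check_permissions';
}

const example: Settings = {
  mode: 'auto',
  always_allow: ['fs::read'],
  approved_always: ['shell::run'],
};

console.log(consultBeforeSketch('fs::read', example)); // allow
console.log(consultBeforeSketch('fs::write', example)); // check_permissions
```

The ordering matters: checking `approved_always` before the mode-specific rule is what lets a per-session approval apply in any mode.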

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

iii-hq/workers#174: Adds the runTransition abstraction that this PR extends with serialization and per-session lease behavior.
iii-hq/workers#195: Introduces approval settings and "approve always" infrastructure that this PR refactors by removing lazy verdict computation.

Suggested reviewers

sergiofilhowz

Poem

🐰 Leases set, no races now,
Two approvals race and bow,
Hooks decide with clearer sight,
Retries hum till order's right,
One turn_end, the batch says wow!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 31.25% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding serialization (via a per-session lease) to prevent concurrent approval wakes from executing duplicate transitions and side effects.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


github-actions · 2026-05-28T16:34:22Z

skill-check — worker

0 verified, 13 skipped (no docs/).

Layer	Result
structure	✓
vale	✓
ai	✓
render	✗

Note

17 stale rendered artifact(s) detected on main, unrelated to this PR. This PR is fine; the drift was already there. A maintainer should open a chore PR to re-render these.

shell/README.md
shell/skill.md
shell/skills/chmod.md
shell/skills/exec.md
shell/skills/exec_bg.md
shell/skills/grep.md
shell/skills/kill.md
shell/skills/list.md
shell/skills/ls.md
shell/skills/mkdir.md
shell/skills/mv.md
shell/skills/read.md
…and 5 more (see the workflow logs)

…roval races

Two approval::resolve writes for one session fan out via turn::on_approval into concurrent turn::function_awaiting_approval wakes. The turn-step queue has no per-session ordering (Enqueue takes only a queue name), so both wakes loaded the same parked turn_state, executed every call, and finalized the batch, running side-effecting functions twice and emitting duplicate function_execution_end / turn_end frames, which wedged the turn.

Add a per-session lease built on the only atomic primitive the state worker exposes (state::update increment, a locked read-modify-write): acquire when the prior holder count is 0, release by resetting it, and let a contender steal a lease older than the TTL to recover from a crashed holder. Gate the function_awaiting_approval transition behind it via runTransition({serialize}). A contender that cannot acquire throws TransientError, so the durable queue retries it after the holder releases and it then stale-skips.

Also revert the earlier approve-always parked-sibling change, which targeted a different scenario that did not reproduce the reported bug.

Cover the race with a parallel-approval e2e test; the harness now models state::update as a faithful atomic increment and the queue's TransientError retry.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

harness/src/turn-orchestrator/function-awaiting-approval/run.ts (1)

63-66: ⚡ Quick win

Replace the non-null assertion with an explicit invariant check.

Line 66 assumes every awaiting entry still has a matching prepared call. If that invariant is ever broken, this wake fails with a runtime throw, and the current Biome noNonNullAssertion warning remains unresolved.

♻️ Proposed change

-    const current = work.prepared.find((p) => p.call.id === callId)!;
+    const current = work.prepared.find((p) => p.call.id === callId);
+    if (!current) {
+      throw new Error(
+        `Invariant violated: missing prepared call ${callId} for session ${rec.session_id}`,
+      );
+    }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@harness/src/turn-orchestrator/function-awaiting-approval/run.ts` around lines
63 - 66, The code uses a non-null assertion on the result of
work.prepared.find((p) => p.call.id === callId) when assigning current; replace
this with an explicit invariant check: capture the find result (e.g., const
current = ... without !), then if current is undefined either log a clear error
including rec.session_id and callId and continue, or throw a descriptive Error
to fail fast; update any downstream assumptions to use the checked variable
instead of relying on the non-null assertion.

📥 Commits

Reviewing files that changed from the base of the PR and between 7a7485f and b8f2b28.

📒 Files selected for processing (9)

harness/src/approval-gate/settings/verdict.ts
harness/src/turn-orchestrator/function-awaiting-approval/ports.ts
harness/src/turn-orchestrator/function-awaiting-approval/process.ts
harness/src/turn-orchestrator/function-awaiting-approval/run.ts
harness/src/turn-orchestrator/hook.ts
harness/src/turn-orchestrator/run-transition.ts
harness/src/turn-orchestrator/state-runtime/session-lease.ts
harness/tests/integration/parallel-approval-harness.ts
harness/tests/integration/parallel-approval.e2e.test.ts

💤 Files with no reviewable changes (2)

harness/src/approval-gate/settings/verdict.ts
harness/src/turn-orchestrator/function-awaiting-approval/ports.ts

coderabbitai · 2026-05-28T17:05:48Z

+  if ((await bumpHolders(iii, session_id)) === 0) {
+    await stateSet(iii, LEASE_AT_SCOPE, session_id, Date.now());
+    return true;
+  }
+  // Contended — recover a lease abandoned by a crashed holder.
+  const acquiredAt = await stateGet(iii, LEASE_AT_SCOPE, session_id);
+  if (typeof acquiredAt === 'number' && Date.now() - acquiredAt > LEASE_TTL_MS) {
+    await stateSet(iii, LEASE_SCOPE, session_id, 0);
+    if ((await bumpHolders(iii, session_id)) === 0) {
+      await stateSet(iii, LEASE_AT_SCOPE, session_id, Date.now());
+      return true;


⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Split-brain lease acquisition is still possible in the normal (non-crash) path.

At Line 50, the holder counter is acquired before the new acquire timestamp is published (Line 51). A concurrent contender can then hit the stale-steal branch at Lines 56-59 using an old LEASE_AT_SCOPE value and acquire too, so two workers may proceed concurrently.

This reintroduces duplicate transition execution risk. The lease needs ownership fencing (e.g., token/epoch validation on acquire/steal/release) or native KV lock primitives to avoid this race.
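
One way to picture the ownership fencing suggested here is an owner-token lease, sketched below against an in-memory map. `tryAcquireLock`, `releaseLock`, and their signatures are assumptions for illustration; they are not the engine's `try_acquire_lock` / `release_lock` API, and each function is only atomic in this sketch because it runs single-threaded. A real store would need a native compare-and-set or lock primitive.

```typescript
// Hypothetical lease record: who holds it and when it expires.
type Lease = { owner: string; expiresAt: number };

const leases = new Map<string, Lease>();

// Take the lease only if it is free or expired. Ownership and expiry are
// written together, so there is no window with a counter but no timestamp.
function tryAcquireLock(key: string, owner: string, now: number, ttlMs: number): boolean {
  const current = leases.get(key);
  if (current && current.expiresAt > now) return false; // still held
  leases.set(key, { owner, expiresAt: now + ttlMs });
  return true;
}

// Release is fenced by the owner token: a stale holder whose lease was
// stolen cannot release the new holder's lease.
function releaseLock(key: string, owner: string): boolean {
  const current = leases.get(key);
  if (!current || current.owner !== owner) return false;
  leases.delete(key);
  return true;
}

// Usage: worker A acquires and stalls; worker B takes over after the TTL.
console.log(tryAcquireLock('session-1', 'worker-a', 0, 1000)); // true
console.log(tryAcquireLock('session-1', 'worker-b', 500, 1000)); // false
console.log(tryAcquireLock('session-1', 'worker-b', 1500, 1000)); // true
console.log(releaseLock('session-1', 'worker-a')); // false
```

The last line is the fencing step: worker A's late release is rejected because the lease now belongs to worker B.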

coderabbitai · 2026-05-28T17:05:48Z

+  for (let attempt = 0; attempt < 100; attempt += 1) {
+    try {
+      await runTurnStep(iii, function_id, session_id);
+      return;
+    } catch (err) {
+      if (err instanceof TransientError) {
+        await flushMicrotasks();
+        continue;
+      }
+      throw err;
+    }
  }
 }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't silently succeed after retry exhaustion.

If all 100 attempts hit TransientError, this helper currently falls through and returns undefined. That drops the wake instead of surfacing a failed retry cycle, which can hide lease/retry regressions and make the harness pass for the wrong reason.

Proposed fix

 async function runTurnStepWithRetry(
   iii: ISdk,
   function_id: string,
   session_id: string,
 ): Promise<void> {
+  let lastTransient: TransientError | null = null;
   for (let attempt = 0; attempt < 100; attempt += 1) {
     try {
       await runTurnStep(iii, function_id, session_id);
       return;
     } catch (err) {
       if (err instanceof TransientError) {
+        lastTransient = err;
         await flushMicrotasks();
         continue;
       }
       throw err;
     }
   }
+  throw lastTransient ?? new Error(`retry budget exhausted for ${function_id}`);
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@harness/tests/integration/parallel-approval-harness.ts` around lines 97 - 109, The retry loop around runTurnStep silently returns undefined when all 100 attempts throw TransientError; change it so that after the loop completes it does not fall through but throws a clear error (either rethrow the last caught error or throw a new Error like "Retries exhausted" that includes the last error) so the harness surfaces a failed retry cycle; update the for-loop handling in the function that calls runTurnStep and references TransientError and flushMicrotasks to capture the last err in the catch and throw it after the loop instead of returning.

Drop genuinely-unused symbols flagged by biome 2.4.10 (correctness rules, fail the harness lint check):
- copy-assets.mjs: unused `stat` import
- session/tree/register.ts: unused `activePath as _activePath` import
- session/tree/store.test.ts: unused `T` generic type parameter

Unblocks CI; these were already red on main, unrelated to the approval-race fix.

coderabbitai · 2026-05-28T20:01:50Z

Actionable comments posted: 0

ytallo changed the base branch from feat/function-approval-modes to main May 28, 2026 16:51

ytallo force-pushed the fix/parallel-approval-race branch from 1458240 to b8f2b28 Compare May 28, 2026 16:57

vercel Bot deployed to Preview May 28, 2026 16:57 View deployment

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

vercel Bot deployed to Preview May 28, 2026 19:59 View deployment

style(harness): biome-format parallel-approval e2e test

13a7c33

Collapse a hand-written multi-line expect() to biome's canonical layout; this format mismatch was the single error failing the harness lint+test CI.

vercel Bot deployed to Preview May 28, 2026 20:07 View deployment

andersonleal approved these changes May 29, 2026

View reviewed changes

ytallo merged commit 1a51300 into main May 29, 2026
14 of 15 checks passed

coderabbitai Bot mentioned this pull request Jun 4, 2026

fix: retry compaction-busy turns and close shell fs jail gaps #228

Merged

6 tasks

ytallo merged 3 commits into main from fix/parallel-approval-race
