fix(junior): Recover interrupted continuation work#388

Merged
dcramer merged 4 commits into main from codex/continuation-failure-resilience on May 21, 2026

@dcramer dcramer commented May 21, 2026

Preserve the timeout diagnostics work from #385 and broaden the continuation recovery strategy for interrupted turns. Junior now writes safe in-progress checkpoints at Pi-continuable boundaries, posts consistent Slack continuation notices, and treats interrupted sandbox command streams as recoverable bash results instead of terminal agent failures.

Safe Continuation Boundaries

Running checkpoints are persisted only when Pi can legally continue from the transcript, such as after the user message or after tool results are recorded. This keeps progress available without checkpointing unsafe mid-tool-call state.
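
A minimal sketch of the boundary rule described above, assuming a simplified transcript shape. The `Message` type and `isSafeBoundary` are hypothetical names used for illustration; they are not Junior's or Pi's actual types.

```typescript
// Hypothetical, simplified transcript types; not the real Pi message shape.
type Message =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCallIds: string[] }
  | { role: "toolResult"; toolCallId: string; content: string };

// A transcript is treated as safe to checkpoint when it ends on a user
// message, or on tool results that answer every tool call made by the most
// recent assistant message.
function isSafeBoundary(messages: Message[]): boolean {
  const last = messages[messages.length - 1];
  if (!last) return false;
  if (last.role === "user") return true;
  if (last.role !== "toolResult") return false;

  // Walk back over the trailing tool results to the message that precedes them.
  const answered = new Set<string>();
  let caller: Message | undefined;
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i];
    if (message.role === "toolResult") {
      answered.add(message.toolCallId);
    } else {
      caller = message;
      break;
    }
  }

  // Mid-tool-call state (some calls still unanswered) is not a safe boundary.
  if (!caller || caller.role !== "assistant") return false;
  return caller.toolCallIds.every((id) => answered.has(id));
}
```

Under these assumptions, a transcript that ends on an assistant message with pending tool calls, or with only some of its tool results recorded, is rejected, which matches the intent of not checkpointing mid-tool-call state.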

Sandbox Stream Recovery

A sandbox command stream that ends before a command finishes now returns a failed bash result with guidance to inspect or retry. The agent can continue from that durable tool result rather than failing the whole turn.
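
As a rough illustration of the behavior described above, the sketch below turns a list of collected stream events into a bash result. The `StreamEvent` and `BashResult` shapes, the `toBashResult` name, and the guidance text are all assumptions; the real sandbox API is not shown in this PR description.

```typescript
// Hypothetical event and result shapes, for illustration only.
type StreamEvent =
  | { type: "stdout" | "stderr"; data: string }
  | { type: "exit"; code: number };

interface BashResult {
  ok: boolean;
  exitCode: number | null; // null when the stream ended before an exit event
  stdout: string;
  stderr: string;
  note?: string;
}

function toBashResult(events: StreamEvent[]): BashResult {
  let stdout = "";
  let stderr = "";
  for (const event of events) {
    if (event.type === "exit") {
      return { ok: event.code === 0, exitCode: event.code, stdout, stderr };
    }
    if (event.type === "stdout") stdout += event.data;
    else stderr += event.data;
  }
  // No exit event was seen, so the stream was interrupted. Return a failed
  // result the agent can act on instead of throwing and ending the turn.
  return {
    ok: false,
    exitCode: null,
    stdout,
    stderr,
    note: "Command stream ended before the command finished. Inspect the sandbox state or retry the command.",
  };
}
```

Returning a value rather than throwing is the point of the sketch: the failed result becomes an ordinary tool result in the transcript, so the turn can continue from it.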

Fault-Injection Coverage

The eval harness can inject sandbox bash stream interruptions, and the lifecycle eval proves the real agent recovers by retrying and finishing the request.
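
The shape of such a fault injection might look like the sketch below. Everything here is hypothetical: `withStreamInterruptions` stands in for the harness hook, and `runWithRetry` is a deterministic stand-in for the agent, which in the real lifecycle eval decides for itself whether to retry.

```typescript
// Hypothetical types; the real eval harness and sandbox APIs are not shown here.
type BashOutcome =
  | { ok: true; stdout: string }
  | { ok: false; interrupted: true; note: string };
type RunBash = (command: string) => BashOutcome;

// Wraps a runner so that the first `failures` calls behave like an
// interrupted command stream; later calls pass through unchanged.
function withStreamInterruptions(run: RunBash, failures: number): RunBash {
  let remaining = failures;
  return (command) => {
    if (remaining > 0) {
      remaining--;
      return {
        ok: false,
        interrupted: true,
        note: "stream ended before the command finished",
      };
    }
    return run(command);
  };
}

// Deterministic stand-in for the agent: retry a failed command up to
// `maxAttempts` times in total.
function runWithRetry(
  run: RunBash,
  command: string,
  maxAttempts: number,
): { outcome: BashOutcome; attempts: number } {
  let outcome = run(command);
  let attempts = 1;
  while (!outcome.ok && attempts < maxAttempts) {
    outcome = run(command);
    attempts++;
  }
  return { outcome, attempts };
}
```

With one injected interruption and a budget of three attempts, the second attempt succeeds, which is the recovery path the eval is meant to exercise.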

Supersedes #385

vercel Bot commented May 21, 2026

Deployment status: junior-docs is Ready (updated May 21, 2026 4:11am UTC).

Comment thread packages/junior/src/chat/services/turn-checkpoint.ts Outdated
omribz156 and others added 4 commits May 20, 2026 21:11
Track cumulative duration and token usage across paused turn checkpoints, then use those totals when rendering timeout-resume Slack footers. Timeout continuation notices also now use the normal Slack footer shape so the conversation ID stays traceable.

Co-Authored-By: GPT-5 Codex <[email protected]>

Persist safe running checkpoints at Pi-continuable boundaries so interrupted turns retain a previous resumable state without checkpointing mid-tool-call.

Treat interrupted sandbox command streams as failed bash results that the agent can inspect or retry, and add eval fault injection to prove the recovery path.

Supersedes GH-385
Co-Authored-By: GPT-5 Codex <[email protected]>

Make completed checkpoint persistence best-effort like the running, timeout, and auth checkpoint paths so state adapter failures cannot block an already generated reply.

Co-Authored-By: GPT-5 Codex <[email protected]>

Share checkpoint logging context and remove a one-use resume helper so the continuation changes stay smaller without changing behavior.

Co-Authored-By: GPT-5 Codex <[email protected]>
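
The cumulative duration and token tracking mentioned in the first commit above could be sketched as follows. The `TurnUsage` fields, the function names, and the footer format are assumptions made for illustration and do not reflect Junior's checkpoint schema or its actual Slack footer.

```typescript
// Hypothetical usage shape; field names are assumptions.
interface TurnUsage {
  durationMs: number;
  inputTokens: number;
  outputTokens: number;
}

const ZERO_USAGE: TurnUsage = { durationMs: 0, inputTokens: 0, outputTokens: 0 };

// Adds the usage of the slice that just ended to the totals carried by the
// previous checkpoint (if any), so a resumed turn reports the whole turn.
function accumulateUsage(previous: TurnUsage | undefined, slice: TurnUsage): TurnUsage {
  const base = previous ?? ZERO_USAGE;
  return {
    durationMs: base.durationMs + slice.durationMs,
    inputTokens: base.inputTokens + slice.inputTokens,
    outputTokens: base.outputTokens + slice.outputTokens,
  };
}

// Renders an illustrative footer fragment from the cumulative totals.
function renderUsageFooter(total: TurnUsage): string {
  const seconds = (total.durationMs / 1000).toFixed(1);
  const tokens = total.inputTokens + total.outputTokens;
  return `${seconds}s · ${tokens} tokens`;
}
```

Carrying the running total in each checkpoint means a timeout-resume footer can be rendered from the latest checkpoint alone, without replaying earlier slices.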
@dcramer dcramer force-pushed the codex/continuation-failure-resilience branch from 537aea2 to a9d6711 Compare May 21, 2026 04:11
@dcramer dcramer merged commit 2a19233 into main May 21, 2026
16 checks passed
@dcramer dcramer deleted the codex/continuation-failure-resilience branch May 21, 2026 04:15

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit a9d6711.

const unsubscribe = agent.subscribe((event) => {
  if (event.type === "turn_end" && event.toolResults.length > 0) {
    return persistSafeBoundary([...agent!.state.messages]);
  }

Fire-and-forget checkpoint can overwrite timeout checkpoint

Medium Severity

The subscriber callback returns the persistSafeBoundary promise as fire-and-forget, meaning its async I/O runs concurrently with subsequent agent work. If a turn_end event fires and the timeout triggers shortly after, persistTimeoutCheckpoint writes an awaiting_resume state, but the still-in-flight persistRunningCheckpoint from the subscriber can land later and overwrite it with running, silently preventing the continuation from being scheduled. Neither agent.abort() nor unsubscribe() cancels in-flight async operations from prior callbacks.
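
One possible way to guard against the ordering problem described above is to stamp each write when it is issued and drop any write that lands after a later-issued one. This is only a sketch of that idea: `CheckpointStore`, `issue`, and `land` are hypothetical names, and nothing here is the fix adopted in the PR or proposed by Bugbot.

```typescript
type CheckpointState = "running" | "awaiting_resume" | "completed";

interface Checkpoint {
  state: CheckpointState;
  seq: number; // issue order, assigned before any async I/O starts
}

// Hypothetical in-memory store that models issue time and landing time
// separately, since the race is about writes landing out of order.
class CheckpointStore {
  private current: Checkpoint | undefined;
  private nextSeq = 0;

  // Called when a write is issued; records its position in issue order.
  issue(state: CheckpointState): Checkpoint {
    return { state, seq: this.nextSeq++ };
  }

  // Called when the write lands. A write issued before the one already
  // stored is stale and is dropped rather than applied.
  land(checkpoint: Checkpoint): boolean {
    if (this.current && this.current.seq > checkpoint.seq) return false;
    this.current = checkpoint;
    return true;
  }

  getState(): CheckpointState | undefined {
    return this.current?.state;
  }
}
```

In the scenario from the report, the `running` write is issued first but lands last; with this guard it is rejected and the stored state stays `awaiting_resume`. Awaiting the in-flight write before persisting the timeout checkpoint would be another option.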

