fix(junior): Recover interrupted continuation work #388
Track cumulative duration and token usage across paused turn checkpoints, then use those totals when rendering timeout-resume Slack footers. Timeout continuation notices also now use the normal Slack footer shape so the conversation ID stays traceable. Co-Authored-By: GPT-5 Codex <[email protected]>
Persist safe running checkpoints at Pi-continuable boundaries so interrupted turns retain a previous resumable state without checkpointing mid-tool-call. Treat interrupted sandbox command streams as failed bash results that the agent can inspect or retry, and add eval fault injection to prove the recovery path. Supersedes GH-385 Co-Authored-By: GPT-5 Codex <[email protected]>
Make completed checkpoint persistence best-effort like the running, timeout, and auth checkpoint paths so state adapter failures cannot block an already generated reply. Co-Authored-By: GPT-5 Codex <[email protected]>
Share checkpoint logging context and remove a one-use resume helper so the continuation changes stay smaller without changing behavior. Co-Authored-By: GPT-5 Codex <[email protected]>
537aea2 to a9d6711
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit a9d6711.
const unsubscribe = agent.subscribe((event) => {
  if (event.type === "turn_end" && event.toolResults.length > 0) {
    return persistSafeBoundary([...agent!.state.messages]);
  }
Fire-and-forget checkpoint can overwrite timeout checkpoint
Medium Severity
The subscriber callback returns the persistSafeBoundary promise as fire-and-forget, meaning its async I/O runs concurrently with subsequent agent work. If a turn_end event fires and the timeout triggers shortly after, persistTimeoutCheckpoint writes an awaiting_resume state, but the still-in-flight persistRunningCheckpoint from the subscriber can land later and overwrite it with running, silently preventing the continuation from being scheduled. Neither agent.abort() nor unsubscribe() cancels in-flight async operations from prior callbacks.
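One possible way to make a late `running` write harmless is a status-precedence guard applied when a checkpoint is written. The sketch below is an illustration only: the `Checkpoint` shape, the `slice` counter, and `resolveWrite` are hypothetical names, not taken from this PR, and the guard only helps if the state adapter can apply it atomically (for example as a compare-and-set).

```typescript
// Hypothetical checkpoint shape; the PR's real types are not shown in this thread.
type CheckpointStatus = "running" | "awaiting_resume" | "completed";

interface Checkpoint {
  status: CheckpointStatus;
  // Assumed counter that increases each time the turn is resumed.
  slice: number;
}

// Within one slice, a later lifecycle state must not be downgraded.
const RANK: Record<CheckpointStatus, number> = {
  running: 0,
  awaiting_resume: 1,
  completed: 2,
};

// Decide which checkpoint should be stored when `incoming` arrives
// and `current` is what the store already holds.
function resolveWrite(
  current: Checkpoint | undefined,
  incoming: Checkpoint,
): Checkpoint {
  if (current === undefined) return incoming;
  // A newer slice (a resumed turn) always wins, so resume can go back to "running".
  if (incoming.slice > current.slice) return incoming;
  // A write from an older slice is stale.
  if (incoming.slice < current.slice) return current;
  // Same slice: drop downgrades such as a late "running" after "awaiting_resume".
  return RANK[incoming.status] < RANK[current.status] ? current : incoming;
}
```

With this guard, the in-flight write from the `turn_end` subscriber would be discarded if the timeout checkpoint for the same slice had already landed. Awaiting or serializing the subscriber's write before persisting the timeout checkpoint would be an alternative.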


Preserve the timeout diagnostics work from #385 and broaden the continuation recovery strategy for interrupted turns. Junior now writes safe in-progress checkpoints at Pi-continuable boundaries, posts consistent Slack continuation notices, and treats interrupted sandbox command streams as recoverable bash results instead of terminal agent failures.
Safe Continuation Boundaries
Running checkpoints are persisted only when Pi can legally continue from the transcript, such as after the user message or after tool results are recorded. This keeps progress available without checkpointing unsafe mid-tool-call state.
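The boundary rule above can be sketched as a predicate over the transcript. This is a minimal illustration under assumed types: the `Message` union and `isSafeBoundary` are hypothetical, and Pi's real transcript format and continuation rules may differ. The sketch treats only the two cases named above as safe.

```typescript
// Hypothetical message shape, assumed for illustration only.
type Message =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCalls: { id: string }[] }
  | { role: "toolResult"; toolCallId: string; content: string };

// Returns true when the transcript ends at a point this sketch considers
// continuable: a trailing user message, or a trailing run of tool results
// that answers every tool call of the preceding assistant message.
function isSafeBoundary(messages: readonly Message[]): boolean {
  const answered = new Set<string>();
  // Walk backwards from the end of the transcript.
  for (const m of [...messages].reverse()) {
    if (m.role === "toolResult") {
      answered.add(m.toolCallId);
      continue;
    }
    if (m.role === "user") {
      // Safe only if the user message is the very last entry.
      return answered.size === 0;
    }
    // Assistant message: safe only if results followed it and none are missing.
    return answered.size > 0 && m.toolCalls.every((c) => answered.has(c.id));
  }
  // Empty transcript, or tool results with nothing before them.
  return false;
}
```

A transcript that ends on an assistant message with pending tool calls, or with only some of its results recorded, is rejected, which matches the goal of never checkpointing mid-tool-call state.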
Sandbox Stream Recovery
A sandbox command stream that ends before a command finishes now returns a failed bash result with guidance to inspect or retry. The agent can continue from that durable tool result rather than failing the whole turn.
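As a rough sketch of that conversion, an interrupted stream can be mapped to an ordinary failed tool result instead of a thrown error. The `StreamOutcome` and `BashResult` shapes, the `toBashResult` helper, and the guidance wording are all assumptions for illustration, not the PR's actual types or text.

```typescript
// Hypothetical shapes, assumed for illustration only.
interface StreamOutcome {
  stdout: string;
  stderr: string;
  // null means the stream ended before an exit status was reported.
  exitCode: number | null;
}

interface BashResult {
  ok: boolean;
  exitCode: number | null;
  interrupted: boolean;
  output: string;
}

// Placeholder guidance text; the real message is not shown in this thread.
const INTERRUPTED_NOTE =
  "[command stream ended before the command finished; inspect the sandbox state or retry]";

// Join the non-empty parts of the output with newlines.
function joinOutput(parts: string[]): string {
  return parts.filter((p) => p.length > 0).join("\n");
}

// Map a stream outcome to a tool result the agent can read and act on.
function toBashResult(outcome: StreamOutcome): BashResult {
  if (outcome.exitCode === null) {
    // Interrupted: keep any partial output and append guidance.
    return {
      ok: false,
      exitCode: null,
      interrupted: true,
      output: joinOutput([outcome.stdout, outcome.stderr, INTERRUPTED_NOTE]),
    };
  }
  return {
    ok: outcome.exitCode === 0,
    exitCode: outcome.exitCode,
    interrupted: false,
    output: joinOutput([outcome.stdout, outcome.stderr]),
  };
}
```

Because the interruption becomes a recorded tool result, the transcript still ends at a continuable boundary and the turn does not have to fail.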
Fault-Injection Coverage
The eval harness can inject sandbox bash stream interruptions, and the lifecycle eval proves the real agent recovers by retrying and finishing the request.
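The idea behind the fault injection can be sketched with a wrapper that interrupts the first few calls of a bash runner. Everything here is hypothetical: `BashRunner`, `withInterruptions`, and `runWithRetry` are illustrative names, the runner is synchronous for brevity, and the retry loop merely stands in for the agent deciding to retry. The real eval harness is not shown in this thread.

```typescript
// Hypothetical result and runner types, assumed for illustration only.
interface RunResult {
  ok: boolean;
  interrupted: boolean;
  output: string;
}
type BashRunner = (command: string) => RunResult;

// Wrap a runner so that its first `failFirst` calls report an interrupted stream.
function withInterruptions(inner: BashRunner, failFirst: number): BashRunner {
  let remaining = failFirst;
  return (command) => {
    if (remaining > 0) {
      remaining -= 1;
      return { ok: false, interrupted: true, output: "[stream interrupted]" };
    }
    return inner(command);
  };
}

// Stand-in for the agent's behavior: retry only while the result is an interruption.
function runWithRetry(
  run: BashRunner,
  command: string,
  maxAttempts: number,
): { result: RunResult; attempts: number } {
  let attempts = 1;
  let result = run(command);
  while (result.interrupted && attempts < maxAttempts) {
    attempts += 1;
    result = run(command);
  }
  return { result, attempts };
}
```

An eval built this way can assert both that the interruption was observed and that the request still finished, which is the property the lifecycle eval is described as proving.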
Supersedes #385