fix(sandbox): report stopped when the agent container terminates so executions don't reclaim-loop#498
Closed
cjustice wants to merge 1 commit into
…xecutions don't reclaim-loop

status_by_id derives session status from pod phase only. The phase stays Running while any container runs, and each sandbox pod has a per-sandbox tool-server sidecar that survives the agent container's death, so after the `sandbox` container is OOMKilled (exit 137), status_by_id still returns "running".

The stream consumer then sees the dead container's stdout EOF, checks status, gets "running", and sets the execution to retry_wait expecting a reattach to recover. Every reattach EOFs immediately, so it reclaim-loops (execute_claimed ~every 60s) until the hard-deadline reaper, surfacing no failure.

Inspect the agent (sandbox) container's own state: if it has terminated, return "stopped" even when pod phase is Running, so the EOF path finalizes the execution instead of looping on a dead container.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
We've merged the Rust rewrite in #344, meaning the old API and Slackbot services are deprecated. As such, we're closing this PR. If you think this is a mistake, please reopen the PR and ping me.
Problem
When a sandbox's agent container is OOMKilled (or otherwise terminates) mid-turn, the execution gets stuck in an endless reclaim loop until the hard deadline and the user never gets a response.
Root cause
`KubernetesExecutorBackend.status_by_id` derives session status from the pod phase alone. The phase stays `Running` as long as any container runs, and each sandbox pod has a per-sandbox `tool-server` sidecar that survives the agent container's death. So after the `sandbox` container is OOMKilled (exit 137), `status_by_id` still returns `running`.
The execution stream consumer (`runtime_control`) then sees the dead container's stdout stream EOF, checks `backend.status(...)`, gets `running`, and sets the execution back to `retry_wait` expecting a reattach to recover. But every reattach to the dead container's stdout EOFs immediately, so it loops (`execute_claimed` roughly every 60s) until the hard-deadline reaper terminalizes it. No failure is ever surfaced to the user.
Observed in a deployed GKE cluster: a memory-heavy build OOM-killed the sandbox container; the attach stream closed (`{"status":"Success"}`) at the exact instant of the OOM kill; `podPhase=Running` (sidecar still alive) → reclaim loop for the entire deadline window.
Fix
`status_by_id` now inspects the agent (`sandbox`) container's own state. If that container has terminated, it returns `stopped` even when the pod phase is `Running`, so the stream-EOF path finalizes the execution (`failed`) instead of reclaim-looping on a dead container's stdout.
Testing
`test_status_by_id_reports_stopped_when_agent_container_terminated`: OOMKilled agent container + `Running` pod phase → `stopped`; live agent container → `running`. Passing (`uv run pytest tests/test_sandbox_kubernetes_backend.py`); `ruff check` clean.
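For reviewers who want the rule in isolation, here is a minimal sketch of the status derivation described above. It is not the PR's actual code: `derive_status` is a hypothetical helper, it works on a plain dict shaped like the Kubernetes pod status JSON rather than on the backend's real client objects, and the fallback that maps every non-`Running` phase to `stopped` is a simplifying assumption.

```python
from typing import Any, Mapping

# Agent container name as given in the PR description.
AGENT_CONTAINER = "sandbox"


def derive_status(pod: Mapping[str, Any], agent_container: str = AGENT_CONTAINER) -> str:
    """Hypothetical helper: map a pod dict to "running" or "stopped".

    The agent container's own state is checked before the pod phase,
    because the phase stays Running while any sidecar is still alive.
    """
    status = pod.get("status") or {}
    for container in status.get("containerStatuses") or []:
        if container.get("name") != agent_container:
            continue
        # A populated state.terminated means the container has exited
        # (for example reason=OOMKilled, exitCode=137).
        if (container.get("state") or {}).get("terminated") is not None:
            return "stopped"
    # Assumption: fall back to the pod phase, treating anything other
    # than Running as stopped. The real backend may distinguish more states.
    return "running" if status.get("phase") == "Running" else "stopped"


# Pod whose agent container was OOMKilled while the sidecar keeps running.
oom_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "sandbox",
             "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
            {"name": "tool-server", "state": {"running": {}}},
        ],
    }
}

# Healthy pod: both containers running.
live_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "sandbox", "state": {"running": {}}},
            {"name": "tool-server", "state": {"running": {}}},
        ],
    }
}

print(derive_status(oom_pod))   # stopped
print(derive_status(live_pod))  # running
```

The two sample pods mirror the cases the test covers: a terminated agent container under a `Running` phase, and a live agent container.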