fix(sandbox): report stopped when the agent container terminates so executions don't reclaim-loop#498

Closed
cjustice wants to merge 1 commit into
paradigmxyz:mainfrom
cjustice:fix/sandbox-oom-status

@cjustice cjustice commented Jun 11, 2026

Contributor

Problem

When a sandbox's agent container is OOMKilled (or otherwise terminates) mid-turn, the execution gets stuck in an endless reclaim loop until the hard deadline and the user never gets a response.

Root cause

KubernetesExecutorBackend.status_by_id derives session status from the pod phase alone. The phase stays Running as long as any container runs — and each sandbox pod has a per-sandbox tool-server sidecar that survives the agent container's death. So after the sandbox container is OOMKilled (exit 137), status_by_id still returns running.

The execution stream consumer (runtime_control) then sees the dead container's stdout stream hit EOF, checks backend.status(...), gets running, and sets the execution back to retry_wait, expecting a reattach to recover. But every reattach to the dead container's stdout EOFs immediately, so it loops (execute_claimed runs roughly every 60s) until the hard-deadline reaper terminalizes it. No failure is ever surfaced to the user.
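To make the loop concrete, here is a minimal sketch of the EOF decision described above. It is not the actual runtime_control code: `next_state_after_eof` and `simulate` are hypothetical names, and the only behavior taken from this PR is that a `running` status leads to `retry_wait` while any other status finalizes the execution as failed.

```python
from typing import Callable, List


def next_state_after_eof(backend_status: str) -> str:
    """Hypothetical reduction of the stream-EOF decision (not the real code)."""
    # The backend claims the session is alive, so the consumer assumes the
    # stream merely dropped and schedules a reattach.
    if backend_status == "running":
        return "retry_wait"
    # Any other status means the session is gone: finalize the execution.
    return "failed"


def simulate(status_fn: Callable[[], str], max_reclaims: int = 5) -> List[str]:
    """Replay EOF handling against a container whose stdout EOFs immediately."""
    history: List[str] = []
    for _ in range(max_reclaims):  # stands in for the hard-deadline reaper
        state = next_state_after_eof(status_fn())
        history.append(state)
        if state != "retry_wait":
            break
    return history


# Phase-only status keeps answering "running" because the sidecar is alive.
before_fix = simulate(lambda: "running")
# A status that reflects the dead agent container ends the loop at once.
after_fix = simulate(lambda: "stopped")
print(before_fix)
print(after_fix)
```

With a status that never leaves `running`, the only exit is the reclaim cap, which plays the role of the hard deadline here.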

Observed in a deployed GKE cluster: a memory-heavy build OOM-killed the sandbox container; the attach stream closed ({"status":"Success"}) at the exact instant of the OOM kill; podPhase=Running (sidecar still alive) → reclaim loop for the entire deadline window.

Fix

status_by_id now inspects the agent (sandbox) container's own state. If that container has terminated, it returns stopped even when the pod phase is Running, so the stream-EOF path finalizes the execution (failed) instead of reclaim-looping on a dead container's stdout.
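A rough sketch of the check, assuming the pod is available as a plain dict in the shape of the Kubernetes Pod JSON (`status.phase`, `status.containerStatuses[].state.terminated`). The function name `derive_session_status` and the handling of non-Running phases are assumptions for illustration, not the code in this PR; only the container name `sandbox` and the `running`/`stopped` results come from the description.

```python
from typing import Any, Mapping

AGENT_CONTAINER = "sandbox"  # agent container name, as given in this PR


def derive_session_status(pod: Mapping[str, Any]) -> str:
    """Hypothetical status derivation that looks past the pod phase."""
    status = pod.get("status") or {}
    # Check the agent container first: a terminated agent means the session
    # is over, even if a sidecar keeps the pod phase at Running.
    for cs in status.get("containerStatuses") or []:
        state = cs.get("state") or {}
        if cs.get("name") == AGENT_CONTAINER and state.get("terminated") is not None:
            return "stopped"
    # Otherwise fall back to the pod phase. Treating every non-Running
    # phase as stopped is a simplification made for this sketch.
    return "running" if status.get("phase") == "Running" else "stopped"


oom_pod = {
    "status": {
        "phase": "Running",  # sidecar still alive
        "containerStatuses": [
            {"name": "sandbox",
             "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}}},
            {"name": "tool-server", "state": {"running": {}}},
        ],
    }
}
live_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "sandbox", "state": {"running": {}}},
            {"name": "tool-server", "state": {"running": {}}},
        ],
    }
}
oom_status = derive_session_status(oom_pod)
live_status = derive_session_status(live_pod)
print(oom_status, live_status)
```

The two example pods mirror the cases named in the test below: an OOMKilled agent container under a Running pod phase, and a live agent container.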

Testing

  • Added test_status_by_id_reports_stopped_when_agent_container_terminated — OOMKilled agent container + Running pod phase → stopped; live agent container → running. Passing (uv run pytest tests/test_sandbox_kubernetes_backend.py); ruff check clean.
  • Root cause reproduced live (a real monorepo build OOM-killed the sandbox container at its memory limit); this change finalizes such executions cleanly instead of looping.

…xecutions don't reclaim-loop

status_by_id derives session status from pod phase only. The phase stays
Running while any container runs, and each sandbox pod has a per-sandbox
tool-server sidecar that survives the agent container's death — so after the
`sandbox` container is OOMKilled (exit 137), status_by_id still returns
"running".

The stream consumer then sees the dead container's stdout EOF, checks status,
gets "running", and sets the execution to retry_wait expecting a reattach to
recover. Every reattach EOFs immediately, so it reclaim-loops (execute_claimed
~every 60s) until the hard-deadline reaper, surfacing no failure.

Inspect the agent (sandbox) container's own state: if it has terminated,
return "stopped" even when pod phase is Running, so the EOF path finalizes the
execution instead of looping on a dead container.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Zygimantass

Member

We've merged the Rust rewrite in #344, meaning the old API and Slackbot services are deprecated. As such we're closing this PR - if you think this is a mistake, please reopen the PR and ping me.
