fix(sandbox): report stopped when the agent container terminates so executions don't reclaim-loop by cjustice · Pull Request #498 · paradigmxyz/centaur

cjustice · 2026-06-11T18:43:32Z

Problem

When a sandbox's agent container is OOMKilled (or otherwise terminates) mid-turn, the execution gets stuck in a reclaim loop that runs until the hard deadline, and the user never gets a response.

Root cause

KubernetesExecutorBackend.status_by_id derives session status from the pod phase alone. The phase stays Running as long as any container runs — and each sandbox pod has a per-sandbox tool-server sidecar that survives the agent container's death. So after the sandbox container is OOMKilled (exit 137), status_by_id still returns running.
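
To make the failure mode concrete, here is a minimal sketch of a phase-only status check. It is not the project's code: `status_from_phase` is a hypothetical stand-in, the sidecar name `tool-server` and the mapping of non-Running phases to `stopped` are assumptions, and the pod is a plain dict shaped like the Kubernetes API's pod status JSON.

```python
def status_from_phase(pod: dict) -> str:
    """Hypothetical pre-fix check: look at status.phase and nothing else."""
    phase = pod.get("status", {}).get("phase")
    # Assumption: anything other than Running is treated as stopped here.
    return "running" if phase == "Running" else "stopped"


# Agent container OOMKilled (exit 137) while the sidecar is still alive,
# so the pod-level phase is still Running.
oom_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {
                "name": "sandbox",
                "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
            },
            # Sidecar name is assumed for illustration.
            {"name": "tool-server", "state": {"running": {}}},
        ],
    }
}

print(status_from_phase(oom_pod))  # running
```

The per-container `terminated` state is present in the pod status the whole time; the phase-only check simply never reads it.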

The execution stream consumer (runtime_control) then sees the dead container's stdout stream EOF, checks backend.status(...), gets running, and sets the execution back to retry_wait expecting a reattach to recover. But every reattach to the dead container's stdout EOFs immediately, so it loops — execute_claimed roughly every 60s — until the hard-deadline reaper terminalizes it. No failure is ever surfaced to the user.
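
The loop described above can be sketched as a small simulation. This is an illustration under assumptions, not the `runtime_control` code: `on_stream_eof` and `simulate` are hypothetical names, the decision rule is reduced to the single status check the PR describes, and `max_reclaims` stands in for the hard-deadline reaper.

```python
def on_stream_eof(backend_status: str) -> str:
    """Assumed rule: on stdout EOF, retry if the backend says running, else fail."""
    return "retry_wait" if backend_status == "running" else "failed"


def simulate(backend_status: str, max_reclaims: int) -> list[str]:
    """Replay reclaims against a dead container whose stdout EOFs immediately."""
    history = []
    for _ in range(max_reclaims):
        state = on_stream_eof(backend_status)
        history.append(state)
        if state != "retry_wait":
            break  # terminal state reached, nothing left to reclaim
    return history


# Backend keeps answering "running" for the dead container: every reclaim
# ends in retry_wait until the cap (the hard deadline) is hit.
print(simulate("running", 3))  # ['retry_wait', 'retry_wait', 'retry_wait']

# If the backend answered "stopped", the first EOF would finalize the execution.
print(simulate("stopped", 3))  # ['failed']
```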

Observed in a deployed GKE cluster: a memory-heavy build OOM-killed the sandbox container; the attach stream closed ({"status":"Success"}) at the exact instant of the OOM kill; podPhase=Running (sidecar still alive) → reclaim loop for the entire deadline window.

Fix

status_by_id now inspects the agent (sandbox) container's own state. If that container has terminated, it returns stopped even when the pod phase is Running, so the stream-EOF path finalizes the execution (failed) instead of reclaim-looping on a dead container's stdout.
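
A minimal sketch of the idea, assuming the pod is available as a dict shaped like the Kubernetes API's pod status JSON. The helper names `agent_terminated` and `status_from_pod` are hypothetical and the fallback phase mapping is an assumption; the actual change lives in `KubernetesExecutorBackend.status_by_id` and may differ in detail.

```python
AGENT_CONTAINER = "sandbox"  # agent container name as given in the PR text


def agent_terminated(pod: dict, name: str = AGENT_CONTAINER) -> bool:
    """True if the named container's own state is `terminated`."""
    for cs in pod.get("status", {}).get("containerStatuses") or []:
        if cs.get("name") == name:
            return (cs.get("state") or {}).get("terminated") is not None
    return False  # container status not reported; fall back to the phase


def status_from_pod(pod: dict) -> str:
    """Check the agent container first, then fall back to the pod phase."""
    if agent_terminated(pod):
        return "stopped"  # even if a sidecar keeps the phase at Running
    phase = pod.get("status", {}).get("phase")
    # Assumption: anything other than Running is treated as stopped here.
    return "running" if phase == "Running" else "stopped"
```

With an OOMKilled `sandbox` container and a Running phase this returns `stopped`; with a live `sandbox` container it returns `running`, which matches the two cases the PR's test covers.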

Testing

Added test_status_by_id_reports_stopped_when_agent_container_terminated — OOMKilled agent container + Running pod phase → stopped; live agent container → running. Passing (uv run pytest tests/test_sandbox_kubernetes_backend.py); ruff check clean.
Root cause reproduced live (a real monorepo build OOM-killed the sandbox container at its memory limit); this change finalizes such executions cleanly instead of looping.

…xecutions don't reclaim-loop

status_by_id derives session status from pod phase only. The phase stays Running while any container runs, and each sandbox pod has a per-sandbox tool-server sidecar that survives the agent container's death, so after the `sandbox` container is OOMKilled (exit 137), status_by_id still returns "running".

The stream consumer then sees the dead container's stdout EOF, checks status, gets "running", and sets the execution to retry_wait expecting a reattach to recover. Every reattach EOFs immediately, so it reclaim-loops (execute_claimed ~every 60s) until the hard-deadline reaper, surfacing no failure.

Inspect the agent (sandbox) container's own state: if it has terminated, return "stopped" even when pod phase is Running, so the EOF path finalizes the execution instead of looping on a dead container.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Zygimantass · 2026-06-12T09:44:09Z

We've merged the Rust rewrite in #344, meaning the old API and Slackbot services are deprecated. As such we're closing this PR - if you think this is a mistake, please reopen the PR and ping me.

Zygimantass closed this Jun 12, 2026

cjustice wants to merge 1 commit into paradigmxyz:main from cjustice:fix/sandbox-oom-status
