Description
docker_provider/provider.py uses container.exec_run(..., demux=True) (and the low-level api.exec_start(exec_id, demux=True) in the timeout path) to collect stdout/stderr in memory:
exec_result = container.exec_run(
cmd=command,
environment=sanitized_env,
workdir="/workspace",
demux=True,
)
...
stdout_bytes, stderr_bytes = exec_result.output or (None, None)
stdout = stdout_bytes.decode("utf-8", errors="replace")[:10000] if stdout_bytes else ""
stderr = stderr_bytes.decode("utf-8", errors="replace")[:10000] if stderr_bytes else ""
The truncation to 10000 characters happens after the full container output has been read into a single bytes object. An adversarial container can produce many gigabytes of stdout/stderr (e.g. yes | head -c 10G); docker-py will faithfully buffer all of it before returning, and the agent-sandbox host process pays the memory cost.
REVIEW.md, Python Isolation section ("docker_provider/provider.py:573-578 — exec_run(..., demux=True) truncates after reading 10GB into memory. Fix: stream=True with cap.").
Suggested fix
Switch both call sites to stream=True, demux=True. docker-py returns (stdout_iter, stderr_iter) of chunked bytes; consume each iterator until a configurable byte cap (e.g. 1 MiB) is reached, then close the stream. The cap belongs on SandboxConfig so callers can tune it per-session.
Why this is being filed as an issue rather than a PR
The change is shape-correct but the existing test suite mocks container.exec_run as MagicMock(output=(b"...", b"..."), exit_code=0) in roughly a dozen places — a streaming refactor requires updating each mock to return iterable streams instead of byte tuples, plus the threaded-timeout path's api.exec_start mock pattern. That's a larger surface than a LOW-severity drive-by should land in one PR; doing it well needs a maintainer's review of the public SandboxConfig field for the cap (name, default value, units) and a sweep of existing mock fixtures.
The fix shape itself is uncontentious: stream=True plus cap-and-close. Splitting the test-mock migration from the runtime change is also reasonable if the maintainers prefer two PRs.
Trade-offs
- Default cap value: 1 MiB matches typical CI log buffers; 10 MiB matches
agent_id config flexibility. Either is fine — picking it is a product call.
- Behaviour on cap-hit: option A is silent truncation (matches current
[:10000] behaviour); option B is appending a marker like \n[...output truncated at 1 MiB]\n so downstream consumers can tell. B is what kubectl logs --tail does.
- Streaming and timeout: in the
_run_with_exec_timeout path the thread reads chunks until either cap-hit or timeout. Cap-hit should not trip timed_out; only the wall-clock timer should.
Surfaced during independent audit conducted by @finnoybu (Ken Tannenbaum, AEGIS Initiative); [LOW, Python Isolation].
Description
docker_provider/provider.pyusescontainer.exec_run(..., demux=True)(and the low-levelapi.exec_start(exec_id, demux=True)in the timeout path) to collect stdout/stderr in memory:The truncation to 10000 characters happens after the full container output has been read into a single
bytesobject. An adversarial container can produce many gigabytes of stdout/stderr (e.g.yes | head -c 10G); docker-py will faithfully buffer all of it before returning, and the agent-sandbox host process pays the memory cost.REVIEW.md, Python Isolation section ("
docker_provider/provider.py:573-578—exec_run(..., demux=True)truncates after reading 10GB into memory. Fix:stream=Truewith cap.").Suggested fix
Switch both call sites to
stream=True, demux=True. docker-py returns(stdout_iter, stderr_iter)of chunked bytes; consume each iterator until a configurable byte cap (e.g. 1 MiB) is reached, then close the stream. The cap belongs onSandboxConfigso callers can tune it per-session.Why this is being filed as an issue rather than a PR
The change is shape-correct but the existing test suite mocks
container.exec_runasMagicMock(output=(b"...", b"..."), exit_code=0)in roughly a dozen places — a streaming refactor requires updating each mock to return iterable streams instead of byte tuples, plus the threaded-timeout path'sapi.exec_startmock pattern. That's a larger surface than a LOW-severity drive-by should land in one PR; doing it well needs a maintainer's review of the publicSandboxConfigfield for the cap (name, default value, units) and a sweep of existing mock fixtures.The fix shape itself is uncontentious:
stream=Trueplus cap-and-close. Splitting the test-mock migration from the runtime change is also reasonable if the maintainers prefer two PRs.Trade-offs
agent_idconfig flexibility. Either is fine — picking it is a product call.[:10000]behaviour); option B is appending a marker like\n[...output truncated at 1 MiB]\nso downstream consumers can tell. B is whatkubectl logs --taildoes._run_with_exec_timeoutpath the thread reads chunks until either cap-hit or timeout. Cap-hit should not triptimed_out; only the wall-clock timer should.Surfaced during independent audit conducted by @finnoybu (Ken Tannenbaum, AEGIS Initiative); [LOW, Python Isolation].