Skip to content

docker_provider: exec_run(demux=True) buffers full container output before truncation #2185

@finnoybu

Description

@finnoybu

Description

docker_provider/provider.py uses container.exec_run(..., demux=True) (and the low-level api.exec_start(exec_id, demux=True) in the timeout path) to collect stdout/stderr in memory:

exec_result = container.exec_run(
    cmd=command,
    environment=sanitized_env,
    workdir="/workspace",
    demux=True,
)
...
stdout_bytes, stderr_bytes = exec_result.output or (None, None)
stdout = stdout_bytes.decode("utf-8", errors="replace")[:10000] if stdout_bytes else ""
stderr = stderr_bytes.decode("utf-8", errors="replace")[:10000] if stderr_bytes else ""

The truncation to 10000 characters happens after the full container output has been read into a single bytes object. An adversarial container can produce many gigabytes of stdout/stderr (e.g. yes | head -c 10G); docker-py will faithfully buffer all of it before returning, and the agent-sandbox host process pays the memory cost.

REVIEW.md, Python Isolation section ("docker_provider/provider.py:573-578exec_run(..., demux=True) truncates after reading 10GB into memory. Fix: stream=True with cap.").

Suggested fix

Switch both call sites to stream=True, demux=True. docker-py returns (stdout_iter, stderr_iter) of chunked bytes; consume each iterator until a configurable byte cap (e.g. 1 MiB) is reached, then close the stream. The cap belongs on SandboxConfig so callers can tune it per-session.

Why this is being filed as an issue rather than a PR

The change is shape-correct but the existing test suite mocks container.exec_run as MagicMock(output=(b"...", b"..."), exit_code=0) in roughly a dozen places — a streaming refactor requires updating each mock to return iterable streams instead of byte tuples, plus the threaded-timeout path's api.exec_start mock pattern. That's a larger surface than a LOW-severity drive-by should land in one PR; doing it well needs a maintainer's review of the public SandboxConfig field for the cap (name, default value, units) and a sweep of existing mock fixtures.

The fix shape itself is uncontentious: stream=True plus cap-and-close. Splitting the test-mock migration from the runtime change is also reasonable if the maintainers prefer two PRs.

Trade-offs

  • Default cap value: 1 MiB matches typical CI log buffers; 10 MiB matches agent_id config flexibility. Either is fine — picking it is a product call.
  • Behaviour on cap-hit: option A is silent truncation (matches current [:10000] behaviour); option B is appending a marker like \n[...output truncated at 1 MiB]\n so downstream consumers can tell. B is what kubectl logs --tail does.
  • Streaming and timeout: in the _run_with_exec_timeout path the thread reads chunks until either cap-hit or timeout. Cap-hit should not trip timed_out; only the wall-clock timer should.

Surfaced during independent audit conducted by @finnoybu (Ken Tannenbaum, AEGIS Initiative); [LOW, Python Isolation].

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requesthelp wantedExtra attention is needed

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions