feat(api-rs): serve tools + overlay to agent sandboxes #443
Merged
Zygimantass merged 10 commits into paradigmxyz:api-rs-control-plane on Jun 10, 2026
e2d3db4 to eb0969b
api-rs sandboxes had no tools and no overlay. Give api-rs-spawned agents the same base + overlay tools and overlay system-prompt the chart already wires for the api-rs pod, using upstream's CLI-shim tool model rather than a sidecar.

Upstream direction: tools are shell CLI shims, not an HTTP registry. The agent image's install-tool-shims (services/sandbox/install_tool_shims.py) scans TOOL_DIRS at entrypoint and `uvx`-installs each pyproject [project.scripts] as a CLI; the SYSTEM_PROMPT points agents at those CLIs and `centaur-tools list`. The old `call <tool>` HTTP registry is deprecated to control-plane-only.

Tool secrets are already handled upstream: codex_app_server_env_template pushes the tool placeholder creds onto the agent env, iron-control grants the per-sandbox principal the real secrets, and Postgres rides proxied `*_DSN` env from apply_proxy_env. So the agent needs only the tool SOURCES at the right paths: no sidecar, no HMAC sandbox token, no loopback tool server.

- tools.rs (replaces tool_server.rs): a `tools-bootstrap` init container copies /app/tools out of the shared centaur-api image into an emptyDir mounted at /app/tools in the agent, and an `overlay-bootstrap` init container copies the org overlay tree into overlay-root mounted at overlay.mountPath (the same path the api-rs Deployment uses) and stages the overlay's SYSTEM_PROMPT.md as $HOME/AGENTS_OVERLAY.md, which the sandbox entrypoint appends to the base prompt. TOOL_DIRS is set on the agent env to /app/tools (or /app/tools:<mountPath>/tools with the overlay), identical to the value the api-rs pod computes for its own tool discovery, and is set deterministically in the spec builder rather than via passthrough env.
- lib.rs: build_agent_sandbox layers the tools/overlay env over spec.env, mounts the bootstrapped sources read-only into the agent, and appends the tools-bootstrap + overlay-bootstrap init containers and their volumes. No sidecar container, no token minting.
- args.rs: a minimal ToolsArgs (source image/pull-policy, reusing the KUBERNETES_TOOL_SERVER_IMAGE* env the chart sets from the shared api image) and OverlayArgs (image/pull-policy/source-path/mount-path) wired into AgentSandboxConfig. Explicit clap arg ids avoid id collisions with the other flattened arg structs.
- chart apirs.yaml: render the tools source image (api.image.*, gated on toolServer.enabled) and overlay (overlay.*) onto the api-rs env, replacing the KUBERNETES_TOOL_SERVER_* sidecar block.

Gone vs the sidecar port: tool_server.rs, the sbx1 HMAC token minting and its SANDBOX_SIGNING_KEY requirement, CENTAUR_TOOLS_URL, the sidecar pg-DSN/proxy-env collection, and the hmac/base64/sha2 dependency additions (nothing else in the agent-k8s crate uses them).

Warm-pool sandboxes route through the same build_agent_sandbox path, so they get the tools/overlay init containers and volumes for free.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
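For reviewers who want the TOOL_DIRS rule from the tools.rs bullet in one place, here is a minimal sketch. The function name `tool_dirs` and its signature are hypothetical and not taken from the diff; only the two resulting values (`/app/tools`, and `/app/tools:<mountPath>/tools` with an overlay) come from the commit message.

```rust
/// Base tools path inside the agent container, as stated in the commit message.
const BASE_TOOLS_DIR: &str = "/app/tools";

/// Compute the TOOL_DIRS value for an agent sandbox.
///
/// `overlay_mount_path` is `None` when no overlay is configured.
/// Hypothetical helper: the real spec builder's signature is not shown here.
pub fn tool_dirs(overlay_mount_path: Option<&str>) -> String {
    match overlay_mount_path {
        // With an overlay, its `tools` subdirectory is appended after the base dir.
        Some(mount) => format!(
            "{}:{}/tools",
            BASE_TOOLS_DIR,
            mount.trim_end_matches('/')
        ),
        // Without an overlay, only the base tools directory is scanned.
        None => BASE_TOOLS_DIR.to_string(),
    }
}
```

Computing the value from the overlay configuration, rather than passing an env var through, is what keeps the agent and the api-rs pod on the same string by construction.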
The tools-bootstrap init container mounted the tools emptyDir at /app/tools, the same path it copies FROM. The mount shadows the source image's tools tree, so the script self-copies the empty volume and GNU cp rejects it (exit 1); every sandbox dies with 'reached terminal state before running' and no agent ever starts.

Mount the volume at /tools-bootstrap instead (mirroring how overlay-bootstrap stages to a distinct target) and copy the image's /app/tools into it. The agent container keeps mounting the same volume at /app/tools, so TOOL_DIRS and the shim installer are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
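The shadowing bug above is easy to guard against when the copy script is generated. The sketch below is an illustration only: `copy_script` is a hypothetical name, and the real init script is not shown in this PR text. It assumes paths contain no single quotes.

```rust
/// Build the copy command for a bootstrap init container.
///
/// Returns an error when the staging volume would be mounted at, or above,
/// the source directory, because the mount would then hide the image's files.
/// Hypothetical helper; assumes paths contain no single quotes.
pub fn copy_script(source_dir: &str, staging_mount: &str) -> Result<String, String> {
    let src = source_dir.trim_end_matches('/');
    let dst = staging_mount.trim_end_matches('/');

    // A volume mounted at `dst` shadows everything at or below `dst`.
    let shadowed = src == dst || src.starts_with(&format!("{dst}/"));
    if shadowed {
        return Err(format!(
            "mount at {staging_mount} shadows source {source_dir}"
        ));
    }

    // `src/.` copies the directory contents rather than the directory itself.
    Ok(format!("cp -a '{src}/.' '{dst}/'"))
}
```

With the paths from the fix, `copy_script("/app/tools", "/tools-bootstrap")` succeeds, while the original `/app/tools` to `/app/tools` pairing is rejected before a pod is ever created.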
Gate overlay env, volumes, and mounts independently from the tools source image so overlay-only sandbox configs produce valid pod specs.
Set an fsGroup on sandbox pods that use tools or overlays so non-root bootstrap init containers can populate their emptyDir mounts.
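The two commits above boil down to a small decision table: tools and overlay are gated independently, and fsGroup is set when either is present. A minimal sketch, assuming a hypothetical `PodExtras` struct and a hypothetical volume name `tools` (the names `tools-bootstrap`, `overlay-bootstrap`, and `overlay-root` come from the commit messages):

```rust
/// Pod-level additions for a sandbox. Hypothetical struct for illustration.
#[derive(Debug, Default, PartialEq)]
pub struct PodExtras {
    pub init_containers: Vec<&'static str>,
    pub volumes: Vec<&'static str>,
    /// Kubernetes `securityContext.fsGroup`, set only when a bootstrap volume exists.
    pub fs_group: Option<i64>,
}

/// Gate tools and overlay independently, then set fsGroup if either is enabled
/// so a non-root init container can write into its emptyDir.
pub fn pod_extras(tools_enabled: bool, overlay_enabled: bool, sandbox_gid: i64) -> PodExtras {
    let mut extras = PodExtras::default();
    if tools_enabled {
        extras.init_containers.push("tools-bootstrap");
        extras.volumes.push("tools"); // volume name is an assumption
    }
    if overlay_enabled {
        extras.init_containers.push("overlay-bootstrap");
        extras.volumes.push("overlay-root");
    }
    if tools_enabled || overlay_enabled {
        extras.fs_group = Some(sandbox_gid);
    }
    extras
}
```

An overlay-only configuration yields exactly one init container and one volume, which is the case the first of the two commits repairs.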
The tools-bootstrap init container copied /app/tools from .Values.api.image (centaur-api), but api-rs discovers its tools from /app/tools in its own container (.Values.apiRs.image). Sourcing from a different image risked the agent installing a different tool set than api-rs granted per-sandbox creds for. Source from the same api-rs image the Deployment runs so the two match by construction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replumb the tools-bootstrap init container to git-clone the tools repo at a pinned ref into each sandbox's /app/tools (sparse on the tools subdir; GitHub token via askpass for private repos), instead of copying /app/tools out of the api-rs image. Mirrors the repo-cache architecture (clone a repo into a pre-provisioned directory) without sharing its node-level cache, so adding a tool is a push to the repo rather than an api-rs image rebuild.

api-rs still discovers its own /app/tools to grant proxy creds, so pin toolServer.ref to the tool set the image carries to avoid drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
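As a rough picture of what a pinned, sparse clone involves, the sketch below builds the git invocations as argument vectors. It is an assumption about the shape of the sequence, not the script from the diff: the function name, the exact flags chosen, and the destination path are all illustrative, and token handling via askpass is left out.

```rust
/// Turn string slices into an owned argv.
fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|p| p.to_string()).collect()
}

/// Build a sparse, pinned clone as a list of git invocations.
///
/// Hypothetical helper: the real init script may order or flag these differently.
pub fn sparse_clone_commands(
    repo_url: &str,
    git_ref: &str,
    subdir: &str,
    dest: &str,
) -> Vec<Vec<String>> {
    vec![
        // Blobless, sparse clone without populating the working tree yet.
        argv(&["git", "clone", "--filter=blob:none", "--no-checkout", "--sparse", repo_url, dest]),
        // Restrict the working tree to the tools subdirectory.
        argv(&["git", "-C", dest, "sparse-checkout", "set", subdir]),
        // Fetch only the pinned ref.
        argv(&["git", "-C", dest, "fetch", "--depth", "1", "origin", git_ref]),
        // Check out exactly what was fetched.
        argv(&["git", "-C", dest, "checkout", "--detach", "FETCH_HEAD"]),
    ]
}
```

Building argv lists sidesteps shell quoting entirely; the actual init container runs a `/bin/sh -ec` script, so it has to quote instead.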
The sandbox NetworkPolicy only allows egress to the sandbox's iron-proxy, api-rs, and DNS, so the tools-bootstrap init container's direct git clone to github.com is blocked whenever iron-proxy is enabled.

Route the clone through the proxy like all other sandbox egress: export HTTPS_PROXY (the resolved per-sandbox proxy URL apply_proxy_env already put on the spec) and GIT_SSL_CAINFO, and mount the pod's existing firewall-ca volume into the init container. github.com/api.github.com are already in the baseline proxy allowlist, so no policy or allowlist changes are needed. Without iron-proxy the clone still goes direct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
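The env side of this change can be sketched as below. `clone_env` is a hypothetical name and the CA bundle path is a parameter because the actual mount path of the firewall-ca volume is not given here; `HTTPS_PROXY` and `GIT_SSL_CAINFO` are the variables named in the commit message.

```rust
/// Environment for the tools-bootstrap clone.
///
/// With a per-sandbox proxy, git is pointed at the proxy and told to trust the
/// proxy's CA bundle. Without one, no extra env is set and the clone goes direct.
/// Hypothetical helper for illustration.
pub fn clone_env(proxy_url: Option<&str>, ca_bundle_path: &str) -> Vec<(String, String)> {
    match proxy_url {
        Some(url) => vec![
            ("HTTPS_PROXY".to_string(), url.to_string()),
            ("GIT_SSL_CAINFO".to_string(), ca_bundle_path.to_string()),
        ],
        None => Vec::new(),
    }
}
```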
These are operator config (helm values -> env -> clap), not user input, but interpolating them bare into the /bin/sh -ec script means a stray space or metacharacter breaks in the shell instead of loudly in git. Quote them at every interpolation site.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
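One common way to quote a value for a POSIX shell is to wrap it in single quotes and rewrite each embedded single quote as `'\''`. The sketch below shows that technique; `sh_quote` is a hypothetical name, and the diff may quote differently.

```rust
/// Quote a value for safe interpolation into a POSIX shell script.
///
/// Everything inside single quotes is literal to the shell, so the only
/// character needing special handling is the single quote itself: close the
/// quoted string, emit an escaped quote, and reopen it.
pub fn sh_quote(value: &str) -> String {
    format!("'{}'", value.replace('\'', r"'\''"))
}
```

For example, `sh_quote("a b; rm -rf /")` yields `'a b; rm -rf /'`, which the shell passes to git as a single argument, so a malformed ref fails in git with a clear message instead of being split by the shell.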
The per-sandbox iron-proxy is created in the same reconcile as the Sandbox CR and isn't accepting connections yet when the tools-bootstrap init container first runs. The clone dies with connection-refused, and an init failure is terminal for the Sandbox (no kubelet retry), so every cold spawn failed with 'reached terminal state before running'.

Wrap the clone/sparse-checkout/ref fetch in a bounded retry loop (30 x 2s) so the init container rides out the proxy's startup instead of killing the sandbox.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
(cherry picked from commit b1f274d)
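A bounded retry of this kind can be generated as a small POSIX shell fragment. The following is a sketch under assumptions: `retry_wrapper` is a hypothetical name, the loop shape is one of several that would work, and the wrapped command is expected to be safe to re-run (for instance, by cleaning up a partial clone first).

```rust
/// Wrap a shell command in a bounded retry loop.
///
/// The command runs at most `attempts` times, sleeping `delay_secs` between
/// tries. Because the command sits in the `until` condition, a failure does
/// not trip `sh -e`. Hypothetical helper for illustration.
pub fn retry_wrapper(command: &str, attempts: u32, delay_secs: u32) -> String {
    let mut script = String::new();
    script.push_str("n=0\n");
    script.push_str(&format!("until {command}; do\n"));
    script.push_str("  n=$((n+1))\n");
    script.push_str(&format!("  if [ \"$n\" -ge {attempts} ]; then\n"));
    script.push_str(&format!("    echo \"giving up after {attempts} attempts\" >&2\n"));
    script.push_str("    exit 1\n");
    script.push_str("  fi\n");
    script.push_str(&format!("  sleep {delay_secs}\n"));
    script.push_str("done\n");
    script
}
```

With the values from the commit message, `retry_wrapper(cmd, 30, 2)` gives the proxy roughly a minute to start accepting connections before the init container gives up.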
4ef3522 to f1e41be
…image plumbing

The base branch grew its own overlay mechanism (SandboxSpec.overlay + overlay_json) for workflow-host sandboxes, configured by the same CENTAUR_OVERLAY_* env this branch's OverlayArgs read, so a workflow-host pod with an overlay configured got two init containers and two volumes with identical names, which Kubernetes rejects.

Adopt the upstream plumbing wholesale: the backend default is now an OverlayImage from the same env helper the workflow host uses (the OverlayArgs flags are gone), a spec-level overlay takes precedence over the backend default so only one overlay-bootstrap/overlay-root pair ever exists, and agent sandboxes mount the overlay at /opt/centaur/overlay like workflow hosts do. The AGENTS_OVERLAY.md prompt staging moves into the shared overlay_json path, and the chart's duplicated CENTAUR_OVERLAY_* env block is dropped, since the upstream block already feeds it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
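The precedence rule in this commit reduces to choosing a single overlay per pod. A minimal sketch, where the `OverlayImage` name comes from the commit message but its fields and the function `effective_overlay` are assumptions made for illustration:

```rust
/// Overlay source for a sandbox. The fields are hypothetical; only the type
/// name appears in the commit message.
#[derive(Debug, Clone, PartialEq)]
pub struct OverlayImage {
    pub image: String,
    pub source_path: String,
}

/// Mount point shared by agent and workflow-host sandboxes, per the commit message.
pub const OVERLAY_MOUNT_PATH: &str = "/opt/centaur/overlay";

/// Pick the single overlay a pod will use: a spec-level overlay wins,
/// otherwise the backend default applies, otherwise there is none.
pub fn effective_overlay(
    spec_overlay: Option<OverlayImage>,
    backend_default: Option<OverlayImage>,
) -> Option<OverlayImage> {
    spec_overlay.or(backend_default)
}
```

Because the result is a single `Option`, at most one overlay-bootstrap init container and one overlay-root volume can be emitted, which rules out the duplicate-name rejection described above.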
Merged commit 60f7eac into paradigmxyz:api-rs-control-plane
6 checks passed
Agents spawned by the api-rs control plane come up with none of the deployment tools and no organization overlay. The control plane registers credentials at the egress proxy but never gets the tool sources or overlay tree into the agent container, so the boot-time installer has nothing to turn into commands. This affects every ingress riding api-rs.
This finishes the last mile, following the repo-cache architecture: at sandbox boot, an init container pulls the tools from their source repository at a pinned revision into the directory the boot-time installer scans, and the organization overlay rides the same overlay mechanism workflow-host sandboxes already use — same image, same configuration, same mount point — extended to also stage the overlay's system prompt for the agent. One code path shapes the overlay for every kind of sandbox, so the two can't collide or drift. Adding or changing a tool becomes a push to the repository — no image rebuild, live in the next sandbox.
The pull rides the same per-sandbox egress proxy as everything else (the locked-down egress policies allow no direct outbound, and the source host is already on the baseline allowlist), so no network policy changes are needed. A token can be supplied for private sources. Secrets stay untouched: placeholders in the sandbox, real values injected at the proxy. The tools clone is opt-in and additive through the existing values; the overlay needs no new configuration at all — deployments that already configure an overlay image get it in agent sandboxes automatically.
```mermaid
flowchart TB
    GH[("tools repo<br/>(pinned revision)")]
    OI[("overlay image")]
    subgraph CP["control plane"]
        TD["tool discovery<br/>(own tools copy)"] --> ICTL["credential grants"]
    end
    subgraph POD["agent sandbox pod"]
        TB["tools bootstrap<br/>(init)"] --> TV[("tools dir")]
        OB["overlay bootstrap<br/>(init, shared with<br/>workflow hosts)"] --> OV[("overlay dir")]
        OB --> SP[("prompt overlay")]
        TV --> SI["boot-time installer"]
        OV --> SI
        SI -->|"command shims"| AG["agent"]
        SP -->|"appended to base prompt"| AG
    end
    PX["per-sandbox egress proxy"]
    TB -->|"git clone (via proxy)"| PX
    PX -->|"allowlisted"| GH
    OI -->|"image copy"| OB
    AG -->|"tool calls, placeholder creds"| PX
    PX -->|"real creds injected"| EXT["external APIs"]
    ICTL -.->|"renders proxy config"| PX
```

One caveat worth calling out: the control plane still reads its own copy of the tools to decide which credentials to grant, so the pinned revision should track the set that copy carries until discovery moves to the same source.
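To make the drift caveat concrete, a check along these lines could flag tools that the pinned revision ships but the control plane's copy does not know about, and which would therefore come up without credential grants. This is purely illustrative: no such check is described in the PR, and `tools_without_grants` is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// Return the tools present in the sandbox's cloned set but absent from the
/// control plane's own copy, sorted by name.
/// Hypothetical helper; not part of the change described here.
pub fn tools_without_grants(control_plane: &[&str], sandbox: &[&str]) -> Vec<String> {
    let known: BTreeSet<&str> = control_plane.iter().copied().collect();
    let cloned: BTreeSet<&str> = sandbox.iter().copied().collect();
    // Set difference: in the sandbox, unknown to the control plane.
    cloned.difference(&known).map(|name| name.to_string()).collect()
}
```

An empty result means every tool the agent installs is one the control plane could have granted credentials for.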
🤖 Generated with Claude Code