feat: python no more #344

Open
Zygimantass wants to merge 76 commits into main from api-rs-control-plane

Conversation

@Zygimantass

Member

No description provided.

@gakonst gakonst left a comment

Member

LGTM I think

Comment on lines +15 to +24
pub struct SandboxIoParts {
    pub stdin: SandboxWrite,
    pub stdout: SandboxRead,
    pub stderr: SandboxRead,
    pub guard: SandboxIoGuard,
}

pub struct SandboxIoGuard {
    _inner: Box<dyn Send>,
}

Member

?

},
};

use async_trait::async_trait;

Member

isn't async_trait no longer necessary, since async fn in traits is in stable Rust now?

sandboxes: Mutex<HashMap<SandboxId, Arc<Mutex<LocalSandbox>>>>,
}

struct LocalSandbox {

Member

should be a separate local.rs file

Comment on lines +213 to +214
pub fn empty_object() -> Value {
    Value::Object(serde_json::Map::new())

Member

inline this

Comment on lines +127 to +202
let execution = self
    .store
    .create_execution(thread_key, default_metadata(input.metadata))
    .await?;
let execution = self
    .store
    .mark_execution_running(&execution.execution_id)
    .await?;
let sandbox_id = self
    .ensure_session_sandbox(
        thread_key,
        session.sandbox_id.as_deref(),
        &execution.execution_id,
    )
    .await?;

self.store
    .append_event(
        thread_key,
        Some(&execution.execution_id),
        "session.execution_started",
        json!({
            "execution_id": execution.execution_id,
            "thread_key": thread_key.as_str(),
            "input_line_count": input.input_lines.len(),
        }),
    )
    .await?;

let write_result = match self.ensure_session_pipe(thread_key, &sandbox_id).await {
    Ok(pipe) => write_input_lines(&pipe, &input.input_lines).await,
    Err(error) => Err(error),
};

match write_result {
    Ok(()) => {}
    Err(error) => {
        let error_message = error.to_string();
        let _ = self
            .store
            .append_event(
                thread_key,
                Some(&execution.execution_id),
                "session.execution_failed",
                json!({
                    "execution_id": execution.execution_id,
                    "thread_key": thread_key.as_str(),
                    "error": error_message,
                }),
            )
            .await;
        let _ = self
            .store
            .fail_execution(&execution.execution_id, &error_message)
            .await;
        return Err(error);
    }
}

self.store
    .append_event(
        thread_key,
        Some(&execution.execution_id),
        "session.execution_completed",
        json!({
            "execution_id": execution.execution_id,
            "thread_key": thread_key.as_str(),
            "completion_reason": "input_accepted",
        }),
    )
    .await?;

Ok(self
    .store
    .complete_execution(&execution.execution_id)
    .await?)

Member

should these be atomic commits?


tokio::spawn(async move {
    let result =
        run_stdout_pump(store.clone(), thread_key.clone(), &pump_key, stdout, guard).await;

Member

wtf who has ever used the word stdout pump???

* refactor(api-rs): use owned sandbox io streams

* refactor(api-rs): move session streaming into runtime

* refactor(api-rs): remove mock session runtime branch

* refactor(api-rs): move sandbox workload modes into runtime

* refactor(api-rs): clean up session API client and e2e tests

* refactor(api-rs): use library sse event types

* refactor(api-rs): address owned io review comments
Zygimantass and others added 24 commits June 10, 2026 16:32
Co-authored-by: Centaur AI <ai@centaur.local>
Co-authored-by: Centaur AI <ai@centaur.local>
…#404)

feat(api-rs): broker credentials in iron-control for codex access-token auth

Manage iron-control broker credentials — managed OAuth refresh tokens that
iron-control mints and delivers inline to proxies via a `token_broker` source —
and drop the iron-proxy broker sidecar. Adds the centaur-perms `broker`
subcommands, `brokered_token` tool-secret parsing/translation, and the
iron-control client/models for broker credentials.

Supporting infra so it runs end-to-end:
- chart: iron-control Solid Queue worker (bin/jobs) that runs the broker OAuth
  refresh loop, plus its egress NetworkPolicy and the postgres ingress allowlist
  entry; image pullPolicy defaulted to Always for the mutable :latest tag.
- chart: create-db init waits for postgres and provisions all four logical
  databases (primary + Solid Cache/Queue/Cable) idempotently.
- slackbotv2: tolerate the postgres startup race with an owned pool + connect
  retry so a transient cold-start failure doesn't wedge the bot.

This allows clients to define a persona as part of the request.

fix(slackbotv2): pin slack stream continuation fix

* feat(api-rs): add aws_auth credential type linked to iron-control

Reimplements the CloudWatch tool's AWS SigV4 re-signing support in the
Rust api-rs / iron-control control plane instead of the Python api
service (superseding #449).

The tool signs requests with placeholder credentials; iron-proxy's
aws_auth transform re-signs with the real read-only IAM keys resolved
from iron-control, so credentials never enter the sandbox.

- centaur-iron-control: AwsAuthSecretInput model, aws_auth_secrets
  endpoint + upsert client, GrantSecret/Grant/SECRET_TYPES wiring
  (aas_ prefix), fragment translator marks aws_auth unsupported (it is
  a tool secret registered via the centaur-perms CLI, like hmac_sign)
- centaur-perms: parse type = "aws_auth" from a tool's pyproject and
  translate it to an AwsAuthSecretInput granted to the tool's role
- keep the cloudwatch tool and the iron-proxy SigV4 header allowlist;
  drop the Python services/api changes (api-rs replaces them). AWS_REGION
  is non-secret and reaches the sandbox via passthrough_env.

* feat(api-rs): translate aws_auth in the iron-proxy fragment path

Infra/harness fragments can now declare an aws_auth transform and have
it registered as an iron-control aws_auth secret, instead of erroring as
unsupported. Mirrors gcp_auth: access_key_id/secret_access_key (and
optional session_token) are placeholder refs resolved via the source
policy, allowed_regions/allowed_services scope signing, rules use the
shared request-rule shape, and the foreign_id keys on the access-key
placeholder (`{role}-aws-{slug}`).

* feat(api-rs): support aws_auth in tool discovery

tool_discovery rejected the aws_auth secret type and dropped the whole
tool, so the cloudwatch tool got no proxy fragment and no sandbox AWS
credentials. Parse aws_auth into an aws_auth transform (matching the
iron-control translator), seed the sandbox AWS SDK placeholder creds via
placeholder_env, and add GITHUB_TOKEN to the infra-env bootstrap for the
repo-cache.
The sync used a PID-suffixed temp dir, but Kubernetes collapses the
doubled dollar sign in a container command to a single one during its own
$(VAR) expansion, so the suffix became a constant literal and the mv
failed ("subdirectory of itself"). Use a deterministic temp name and
sweep any stale temp dirs before cloning; sync is sequential per pod so a
fixed name is safe. Self-heals existing corrupt caches on next run.
* fix: grant infra role to warm pool bootstrap

* fix: keep warm pool bootstrap roleless

* refactor(api-rs): move warm pool policy into sandbox manager

* feat(api-rs): add absurd workflow runtime poc

* chore(api-rs): refresh absurd workflow staging image

* fix(api-rs): label workflow host sandboxes

* fix(api-rs): serialize workflow timestamps as rfc3339

* fix(api-rs): mount overlay workflows in workflow sandbox

* fix(api-rs): propagate sandbox image pull secrets

* fix(api-rs): extend workflow agent turns

* fix(api-rs): keep workflow host claims alive

* fix(api-rs): heartbeat all workflow tasks

* fix(workflows): call tools through sandbox shims

* fix(workflows): resolve sandbox tool shim outside login shells

* fix(workflows): bootstrap tool shims in workflow hosts

* fix(workflows): mount tools in workflow host sandboxes

* fix(workflows): grant workflow host tool secrets

* ci: speed up branch image builds

* ci: add Dockerfile package caches

* fix(api-rs): repair workflow schedule test fixture

* refactor(api-rs): rename workflows crate

---------

Co-authored-by: Centaur AI <ai@centaur.local>
Co-authored-by: Centaur AI <ai@centaur.local>
fix(api-rs): raise stdout line cap and disable service links

Co-authored-by: Centaur AI <ai@centaur.local>
@Zygimantass Zygimantass force-pushed the api-rs-control-plane branch from e2d3db4 to eb0969b Compare June 10, 2026 14:35
@github-actions

Cloudflare Workers docs preview

https://pr-344-centaur-docs.porto.workers.dev

Zygimantass and others added 4 commits June 10, 2026 16:53

The sandbox entrypoint's install-tool-shims printed its success notice to
stdout, which is the same pipe the session stdout pump streams to clients.
slackbotv2 treated any JSON-unparseable output line as a terminal codex
line, so on every fresh (non-warm) sandbox the very first bootstrap line
ended the render stream before the agent produced output, finalizing the
Slack reply as 'Execution completed, but no final text was captured.'
while the real answer streamed afterwards and was dropped.

- install_tool_shims.py: write the notice to stderr
- slackbotv2: non-JSON output lines are noise, not terminal
- regression test: bootstrap line before codex output still delivers the answer

Amp-Thread-ID: https://ampcode.com/threads/T-019eb1eb-27c4-7169-9489-41d85f8e0614

Co-authored-by: Centaur AI <ai@centaur.local>
Co-authored-by: Amp <amp@ampcode.com>
* fix(api-rs): only drive session executions claimed by this request

mark_execution_running treated an already-running row as a successful
claim, so a concurrent request with the same idempotency key could fall
into the fallback fetch, see status=running, and send the same input to
the sandbox a second time. It now returns ClaimExecutionResult with a
claimed flag that is true only when this call did the queued->running
transition; the runtime returns the existing execution without driving
it when the claim was lost.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>
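
The claim semantics described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `InMemoryStore`, `enqueue`, and the `ExecutionStatus` variants are hypothetical stand-ins for the real store, and only the names `mark_execution_running`, `ClaimExecutionResult`, and the `claimed` flag come from the commit message.

```rust
use std::collections::HashMap;

/// Execution lifecycle states (illustrative names, not taken from the PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionStatus {
    Queued,
    Running,
}

/// Outcome of a claim attempt: the row's current status, plus whether
/// this particular call performed the queued -> running transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ClaimExecutionResult {
    pub status: ExecutionStatus,
    pub claimed: bool,
}

/// Hypothetical in-memory store standing in for the real database.
#[derive(Default)]
pub struct InMemoryStore {
    executions: HashMap<String, ExecutionStatus>,
}

impl InMemoryStore {
    pub fn enqueue(&mut self, id: &str) {
        self.executions.insert(id.to_string(), ExecutionStatus::Queued);
    }

    /// Returns `None` for unknown ids. Only the caller that observes
    /// `Queued` gets `claimed: true`; an already-running row is reported
    /// as-is with `claimed: false`, so the caller must not drive it again.
    pub fn mark_execution_running(&mut self, id: &str) -> Option<ClaimExecutionResult> {
        let status = self.executions.get_mut(id)?;
        let claimed = *status == ExecutionStatus::Queued;
        if claimed {
            *status = ExecutionStatus::Running;
        }
        Some(ClaimExecutionResult { status: *status, claimed })
    }
}
```

A caller would write input to the sandbox only when `claimed` is true, and otherwise return the existing execution unchanged. In a real database the check and the transition would presumably need to happen in one conditional update rather than a read followed by a write.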

* fix(api-rs): reject HTTP tool secrets without hosts

An HTTP secret parsed with an empty hosts list (empty tool-level
default, hosts = [], or malformed hosts falling back to an empty
default) translated to an empty iron-control rules array, leaving the
credential host-unlimited. Both manifest parsers (centaur-perms and the
api-server's tool discovery mirror) now fail closed, matching the
brokered_token parser. Affected tools are warn-skipped at discovery.
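
A rough sketch of the fail-closed rule, under stated assumptions: `HttpSecret`, `ManifestError`, and `parse_http_secret` are hypothetical names, and the real parsers read TOML manifests rather than taking slices. The point is only that an empty effective host list is an error, never a host-unlimited credential.

```rust
/// Hypothetical parsed form of an HTTP tool secret.
#[derive(Debug, PartialEq, Eq)]
pub struct HttpSecret {
    pub name: String,
    pub hosts: Vec<String>,
}

/// Hypothetical error type for manifest parsing.
#[derive(Debug, PartialEq, Eq)]
pub enum ManifestError {
    MissingHosts { secret: String },
}

/// Resolves the secret's hosts (explicit list, else the tool-level default)
/// and rejects the secret when nothing usable remains.
pub fn parse_http_secret(
    name: &str,
    hosts: Option<&[&str]>,
    tool_default_hosts: &[&str],
) -> Result<HttpSecret, ManifestError> {
    let hosts: Vec<String> = hosts
        .unwrap_or(tool_default_hosts)
        .iter()
        .map(|host| host.trim())
        .filter(|host| !host.is_empty())
        .map(str::to_string)
        .collect();

    // Fail closed: an empty list would translate to an empty rules array.
    if hosts.is_empty() {
        return Err(ManifestError::MissingHosts { secret: name.to_string() });
    }

    Ok(HttpSecret { name: name.to_string(), hosts })
}
```

At discovery time the caller would log the error and skip the affected tool instead of registering the secret.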

* fix(api-rs): guard absurd.await_event against task/run mismatch

await_event trusted the caller-provided (task_id, run_id) pair: a
mismatched call could attach one task's run to a wait/checkpoint on
another task and put the wrong task to sleep. Reject mismatches like
get_task_checkpoint_states already does. Shipped as migration 0009
(create or replace) because 0007 is already applied in live
environments and sqlx validates migration checksums.

* fix(api-rs): only claim running warm sandboxes

A warm sandbox observed as Created is not ready for byte I/O and means
the runtime regressed after the replenisher saw it running (backends
wait for readiness before returning from create). Claiming it made the
session fail at open_io; mark it failed and try the next one instead.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>
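
The warm-pool claim rule above could look something like this. It is a sketch with hypothetical types (`WarmSandbox`, `SandboxState`, `claim_warm_sandbox`); the real pool lives behind the sandbox manager and its state names may differ.

```rust
/// Illustrative sandbox states; only `Created` and "running" are named in the commit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SandboxState {
    Created,
    Running,
    Claimed,
    Failed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WarmSandbox {
    pub id: String,
    pub state: SandboxState,
}

/// Walks the pool in order and claims the first sandbox that is actually
/// running. A candidate observed as `Created` is marked failed and skipped,
/// because it is not ready for byte I/O.
pub fn claim_warm_sandbox(pool: &mut [WarmSandbox]) -> Option<String> {
    for sandbox in pool.iter_mut() {
        match sandbox.state {
            SandboxState::Running => {
                sandbox.state = SandboxState::Claimed;
                return Some(sandbox.id.clone());
            }
            SandboxState::Created => sandbox.state = SandboxState::Failed,
            SandboxState::Claimed | SandboxState::Failed => {}
        }
    }
    None
}
```

Returning `None` lets the caller fall back to a cold spawn when no running warm sandbox is left.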

* fix(api-rs): make grant_inputs_to_role idempotent

grant_inputs_to_role documented idempotency but always POSTed a new
grant after upserting each secret, so re-running centaur-perms grants
or startup role registration produced duplicate grants or conflicts.
It now lists the role's existing grants once and reuses the grant for
an already-granted secret.
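
The list-once-then-reuse approach can be illustrated as below. `FakeGrantsApi` is a hypothetical in-memory stand-in for the iron-control client, and the signature of `grant_inputs_to_role` here is simplified for the sketch; only the function name and the behaviour come from the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the iron-control grants endpoints.
#[derive(Default)]
pub struct FakeGrantsApi {
    grants: HashMap<(String, String), u64>, // (role, secret id) -> grant id
    next_id: u64,
    pub list_calls: usize,
    pub post_calls: usize,
}

impl FakeGrantsApi {
    /// Lists the role's grants as secret id -> grant id.
    pub fn list_grants(&mut self, role: &str) -> HashMap<String, u64> {
        self.list_calls += 1;
        self.grants
            .iter()
            .filter(|((grant_role, _), _)| grant_role.as_str() == role)
            .map(|((_, secret_id), grant_id)| (secret_id.clone(), *grant_id))
            .collect()
    }

    /// Creates a new grant unconditionally (the non-idempotent primitive).
    pub fn create_grant(&mut self, role: &str, secret_id: &str) -> u64 {
        self.post_calls += 1;
        self.next_id += 1;
        self.grants
            .insert((role.to_string(), secret_id.to_string()), self.next_id);
        self.next_id
    }
}

/// Lists existing grants once, then only POSTs for secrets not yet granted.
pub fn grant_inputs_to_role(api: &mut FakeGrantsApi, role: &str, secret_ids: &[&str]) -> Vec<u64> {
    let mut existing = api.list_grants(role);
    let mut grant_ids = Vec::with_capacity(secret_ids.len());
    for secret_id in secret_ids.iter().copied() {
        let grant_id = match existing.get(secret_id).copied() {
            Some(id) => id, // already granted: reuse
            None => {
                let id = api.create_grant(role, secret_id);
                existing.insert(secret_id.to_string(), id);
                id
            }
        };
        grant_ids.push(grant_id);
    }
    grant_ids
}
```

Re-running with the same inputs then yields the same grant ids and no additional POSTs.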

* fix(rendering): flush buffered answer in codexAppServerToRendererEvents

The array helper never called mapper.flush(), so finite sources that end
without an explicit terminal event lost buffered answer text and never
received renderer.done. Flush like codexAppServerToChatSdkStream does;
flush is a no-op when a terminal event already completed the stream.

* fix(slackbotv2): cap inline attachments at 100 MiB

serializeAttachment buffered every Slack attachment in memory and
base64-inlined it with no size limit, letting one large upload blow
request limits or OOM the process. Skip the download when Slack's size
metadata exceeds 100 MiB and re-check the actual byte count after
fetching; oversized attachments degrade through the existing fetchError
channel.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>

* fix(slackbotv2): isolate render obligation recovery failures

One thread's corrupt state, lease error, or failed render propagated
out of the recovery scan, so the remaining indexed obligations were
never attempted until the next restart. Isolate each thread, log the
failure, and count it as deferred so the capped-backoff retry loop
revisits it.
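
The per-thread isolation could be sketched like this. slackbotv2 is not part of the Rust crates, so this is a Rust rendition of the idea rather than the actual change; `recover_all` and `RecoveryReport` are hypothetical names.

```rust
/// Hypothetical summary of one recovery scan.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct RecoveryReport {
    pub recovered: usize,
    pub deferred: usize,
}

/// Attempts every indexed obligation. A failure in one thread is logged and
/// counted as deferred instead of aborting the scan, so a retry loop can
/// revisit it later.
pub fn recover_all<F>(thread_ids: &[&str], mut recover_one: F) -> RecoveryReport
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut report = RecoveryReport::default();
    for thread_id in thread_ids.iter().copied() {
        match recover_one(thread_id) {
            Ok(()) => report.recovered += 1,
            Err(error) => {
                eprintln!("render recovery failed for {thread_id}: {error}");
                report.deferred += 1;
            }
        }
    }
    report
}
```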

* ci: skip registry cache export for fork PRs

cache-to type=registry pushes a cache manifest to GHCR even when image
push is disabled, and fork PRs run with a read-only GITHUB_TOKEN, so
their builds failed at cache export. Gate the registry cache-to on the
same not-a-fork predicate as push.

---------

Co-authored-by: Amp <amp@ampcode.com>
fix: update Slack stream pagination and Postgres pool limits
* feat(api-rs): serve tools + gerard overlay to agent sandboxes

api-rs sandboxes had no tools and no overlay. Give api-rs-spawned agents the
same base + overlay tools and overlay system-prompt the chart already wires for
the api-rs pod, using upstream's CLI-shim tool model rather than a sidecar.

Upstream direction: tools are shell CLI shims, not an HTTP registry. The agent
image's install-tool-shims (services/sandbox/install_tool_shims.py) scans
TOOL_DIRS at entrypoint and `uvx`-installs each pyproject [project.scripts] as a
CLI; the SYSTEM_PROMPT points agents at those CLIs and `centaur-tools list`. The
old `call <tool>` HTTP registry is deprecated to control-plane-only. Tool
secrets are already handled upstream: codex_app_server_env_template pushes the
tool placeholder creds onto the agent env, iron-control grants the per-sandbox
principal the real secrets, and Postgres rides proxied `*_DSN` env from
apply_proxy_env. So the agent needs only the tool SOURCES at the right paths —
no sidecar, no HMAC sandbox token, no loopback tool server.

- tools.rs (replaces tool_server.rs): a `tools-bootstrap` init container copies
  /app/tools out of the shared centaur-api image into an emptyDir mounted at
  /app/tools in the agent, and an `overlay-bootstrap` init container copies the
  org overlay tree into overlay-root mounted at overlay.mountPath (the same path
  the api-rs Deployment uses) and stages the overlay's SYSTEM_PROMPT.md as
  $HOME/AGENTS_OVERLAY.md, which the sandbox entrypoint appends to the base
  prompt. TOOL_DIRS is set on the agent env to /app/tools (or
  /app/tools:<mountPath>/tools with the overlay) — identical to the value the
  api-rs pod computes for its own tool discovery, set deterministically in the
  spec builder rather than via passthrough env.
- lib.rs: build_agent_sandbox layers the tools/overlay env over spec.env, mounts
  the bootstrapped sources read-only into the agent, and appends the
  tools-bootstrap + overlay-bootstrap init containers and their volumes. No
  sidecar container, no token minting.
- args.rs: a minimal ToolsArgs (source image/pull-policy, reusing the
  KUBERNETES_TOOL_SERVER_IMAGE* env the chart sets from the shared api image) and
  OverlayArgs (image/pull-policy/source-path/mount-path) wired into
  AgentSandboxConfig. Explicit clap arg ids avoid id collisions with the other
  flattened arg structs.
- chart apirs.yaml: render the tools source image (api.image.*, gated on
  toolServer.enabled) and overlay (overlay.*) onto the api-rs env, replacing the
  KUBERNETES_TOOL_SERVER_* sidecar block.

Gone vs the sidecar port: tool_server.rs, the sbx1 HMAC token minting and its
SANDBOX_SIGNING_KEY requirement, CENTAUR_TOOLS_URL, the sidecar pg-DSN/proxy-env
collection, and the hmac/base64/sha2 dependency additions (nothing else in the
agent-k8s crate uses them).

Warm-pool sandboxes route through the same build_agent_sandbox path, so they get
the tools/overlay init containers and volumes for free.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(api-rs): stage tools-bootstrap copy outside /app/tools

The tools-bootstrap init container mounted the tools emptyDir at
/app/tools — the same path it copies FROM. The mount shadows the source
image's tools tree, so the script self-copies the empty volume and GNU
cp rejects it (exit 1); every sandbox dies with 'reached terminal state
before running' and no agent ever starts.

Mount the volume at /tools-bootstrap instead (mirroring how
overlay-bootstrap stages to a distinct target) and copy the image's
/app/tools into it. The agent container keeps mounting the same volume
at /app/tools, so TOOL_DIRS and the shim installer are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: wire sandbox overlays without tools

Gate overlay env, volumes, and mounts independently from the tools source image so overlay-only sandbox configs produce valid pod specs.

* fix: make sandbox bootstrap volumes writable

Set an fsGroup on sandbox pods that use tools or overlays so non-root bootstrap init containers can populate their emptyDir mounts.

* fix(api-rs): source sandbox tools image from the api-rs image

The tools-bootstrap init container copied /app/tools from .Values.api.image
(centaur-api), but api-rs discovers its tools from /app/tools in its own
container (.Values.apiRs.image). Sourcing from a different image risked the
agent installing a different tool set than api-rs granted per-sandbox creds
for. Source from the same api-rs image the Deployment runs so the two match
by construction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(api-rs): clone sandbox tools from a repo instead of baking them in

Replumb the tools-bootstrap init container to git-clone the tools repo at a
pinned ref into each sandbox's /app/tools (sparse on the tools subdir; GitHub
token via askpass for private repos), instead of copying /app/tools out of the
api-rs image. Mirrors the repo-cache architecture — clone a repo into a
pre-provisioned directory — without sharing its node-level cache, so adding a
tool is a push to the repo rather than an api-rs image rebuild.

api-rs still discovers its own /app/tools to grant proxy creds, so pin
toolServer.ref to the tool set the image carries to avoid drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(api-rs): route the tools clone through the per-sandbox iron-proxy

The sandbox NetworkPolicy only allows egress to the sandbox's iron-proxy,
api-rs, and DNS, so the tools-bootstrap init container's direct git clone to
github.com is blocked whenever iron-proxy is enabled. Route the clone through
the proxy like all other sandbox egress: export HTTPS_PROXY (the resolved
per-sandbox proxy URL apply_proxy_env already put on the spec) and
GIT_SSL_CAINFO, and mount the pod's existing firewall-ca volume into the init
container. github.com/api.github.com are already in the baseline proxy
allowlist, so no policy or allowlist changes are needed. Without iron-proxy
the clone still goes direct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(api-rs): quote repo/ref/subdir in the tools-bootstrap script

These are operator config (helm values -> env -> clap), not user input, but
interpolating them bare into the /bin/sh -ec script means a stray space or
metacharacter breaks in the shell instead of loudly in git. Quote them at
every interpolation site.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
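
One common way to quote values for a `/bin/sh` script is POSIX single-quoting, sketched below. Whether the PR uses exactly this scheme is an assumption; `sh_quote` and `clone_command` are hypothetical helpers, and the command is a simplified stand-in for the real bootstrap script.

```rust
/// Wraps `value` in single quotes for POSIX sh. Inside single quotes nothing
/// is special except `'` itself, which is emitted as `'\''` (close the quote,
/// an escaped quote, reopen the quote).
pub fn sh_quote(value: &str) -> String {
    format!("'{}'", value.replace('\'', r"'\''"))
}

/// Builds an illustrative clone command with every interpolated value quoted.
pub fn clone_command(repo: &str, dest: &str) -> String {
    format!("git clone -- {} {}", sh_quote(repo), sh_quote(dest))
}
```

With this, a stray space or metacharacter in an operator-supplied value reaches git as a single literal argument, so the failure surfaces in git rather than in the shell.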

* fix(api-rs): retry the tools clone through the proxy's startup window

The per-sandbox iron-proxy is created in the same reconcile as the Sandbox CR
and isn't accepting connections yet when the tools-bootstrap init container
first runs — the clone dies with connection-refused, and an init failure is
terminal for the Sandbox (no kubelet retry), so every cold spawn failed with
'reached terminal state before running'. Wrap the clone/sparse-checkout/ref
fetch in a bounded retry loop (30 x 2s) so the init container rides out the
proxy's startup instead of killing the sandbox.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
(cherry picked from commit b1f274d)
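
The bounded retry described above runs as a shell loop inside the init container; the same shape, written in Rust purely for illustration, might look like this. `retry_with_delay` is a hypothetical helper, and the commit's numbers would correspond to `max_attempts = 30` and a 2 second delay.

```rust
use std::{thread, time::Duration};

/// Calls `op` until it succeeds or `max_attempts` is reached, sleeping for
/// `delay` between attempts. `op` receives the 1-based attempt number and is
/// always called at least once. The last error is returned on exhaustion.
pub fn retry_with_delay<T, E>(
    max_attempts: u32,
    delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(error) if attempt >= max_attempts => return Err(error),
            Err(_) => {
                thread::sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

Keeping the loop bounded matters here because an init failure is terminal for the Sandbox: the retry has to ride out the proxy's startup window without hanging forever if the proxy never comes up.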

* refactor(api-rs): converge sandbox overlay on the spec-level overlay-image plumbing

The base branch grew its own overlay mechanism (SandboxSpec.overlay +
overlay_json) for workflow-host sandboxes, configured by the same
CENTAUR_OVERLAY_* env this branch's OverlayArgs read — so a workflow-host
pod with an overlay configured got two init containers and two volumes
with identical names, which Kubernetes rejects.

Adopt the upstream plumbing wholesale: the backend default is now an
OverlayImage from the same env helper the workflow host uses (the
OverlayArgs flags are gone), a spec-level overlay takes precedence over
the backend default so only one overlay-bootstrap/overlay-root pair ever
exists, and agent sandboxes mount the overlay at /opt/centaur/overlay
like workflow hosts do. The AGENTS_OVERLAY.md prompt staging moves into
the shared overlay_json path, and the chart's duplicated CENTAUR_OVERLAY_*
env block is dropped — the upstream block already feeds it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
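
The precedence rule from the commit above reduces to a single `Option::or`. The sketch below assumes a simplified `OverlayImage` with one field; `effective_overlay` and `overlay_init_containers` are hypothetical names, and the real type presumably carries pull policy and paths as well.

```rust
/// Simplified stand-in for the overlay image configuration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OverlayImage {
    pub image: String,
}

/// A spec-level overlay wins; the backend default is only a fallback.
pub fn effective_overlay(
    spec_overlay: Option<OverlayImage>,
    backend_default: Option<OverlayImage>,
) -> Option<OverlayImage> {
    spec_overlay.or(backend_default)
}

/// Because only one overlay survives, at most one overlay-bootstrap init
/// container is ever emitted, which avoids the duplicate-name rejection.
pub fn overlay_init_containers(
    spec_overlay: Option<OverlayImage>,
    backend_default: Option<OverlayImage>,
) -> Vec<String> {
    effective_overlay(spec_overlay, backend_default)
        .into_iter()
        .map(|overlay| format!("overlay-bootstrap ({})", overlay.image))
        .collect()
}
```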

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

5 participants