[codex] stack Chat SDK Slackbot on api-rs control plane by gakonst · Pull Request #346 · paradigmxyz/centaur

gakonst · 2026-06-01T19:53:41Z

Summary

* stack the Chat SDK Slackbot v2 work from "[codex] add Chat SDK slackbot v2" (#322) and "[codex] improve slackbot v2 stream tasks" (#327) on top of "feat: python no more" (#344)
* add the canonical @centaur/rendering pipeline from "feat: add canonical renderer pipeline" (#329) and make slackbotv2 consume that renderer directly
* remove the duplicate Chat SDK stream mapper from @centaur/harness-events; that package now stays focused on shared event shapes
* keep the V1 Slackbot untouched in the PR delta

Validation

pnpm install --frozen-lockfile
pnpm --filter @centaur/harness-events test
pnpm --filter @centaur/harness-events typecheck
pnpm --filter @centaur/rendering test
pnpm --filter @centaur/rendering typecheck
pnpm --filter slackbotv2 test
pnpm --filter slackbotv2 check:types

Local deploy notes

* built and rolled out centaur-slackbotv2-pr344-chat-sdk:18f27b5a to centaur-slackbotv2-pr327
* API-rs remains on centaur-api-rs-pr344-chat-sdk:9939d6ab-linux; local API health is passing
* local API-rs is running with RUN_MIGRATIONS=false against the reused PR327 DB because that DB has the older migration checksum
* Funnel is on: https://gak.tail388b2e.ts.net -> 127.0.0.1:3002

* feat: add chat sdk slackbot v2 * feat: add slackbotv2 runtime entrypoint * refactor: inline harness event app server types * fix: create session before slackbotv2 append * fix: improve slackbot v2 stream tasks * feat: add canonical renderer pipeline Add @centaur/rendering as the boundary from Codex app-server events to generic renderer events and Chat SDK-shaped output. Wire Slackbot Codex sessions through the renderer package while keeping Slack delivery in a Slackbot adapter. Keep services/api-rs caller-neutral by removing Slack examples from the session control-plane docs and tests. * refactor: route chat sdk rendering through slackbotv2 * fix: improve slackbotv2 stream latency * fix: stream slackbotv2 responses (#347) * fix: address slackbotv2 review comments --------- Co-authored-by: Zygimantas <5236121+Zygimantass@users.noreply.github.com> Co-authored-by: Goksu Toprak <goksu@tempo.xyz> Co-authored-by: Goksu Toprak <19259594+goksu@users.noreply.github.com>

* feat(api-rs): add sandbox core * feat(api-rs): add local sandbox backend * feat(api-rs): add agent sandbox backend * refactor(api-rs): parse sandbox test config with clap * feat(api-rs): add session store * chore(api-rs): raise session db pool limit * chore(api-rs): remove unused execution claim helper * feat(api-rs): restrict session harness types * refactor(api-rs): derive session harness serde * refactor(api-rs): derive session enum strings * feat(api-rs): add session HTTP API * refactor(api-rs): parse server config with clap * fix(api-rs): render execution status with display * feat(api-rs): add session CLI * feat(api-rs): wire codex sandbox e2e path * feat(api-rs): accept stdin session events in cli * feat(api-rs): add session cli tui * feat(api-rs): add sandbox core (#315) * refactor: simplify sandbox interface * refactor(api-rs): use owned sandbox io streams (#334) * refactor(api-rs): use owned sandbox io streams * refactor(api-rs): move session streaming into runtime * refactor(api-rs): remove mock session runtime branch * refactor(api-rs): move sandbox workload modes into runtime * refactor(api-rs): clean up session API client and e2e tests * refactor(api-rs): use library sse event types * refactor(api-rs): address owned io review comments * refactor(api-rs): wake session streams with postgres notify (#345) * [codex] stack Chat SDK Slackbot on api-rs control plane (#346) * feat: add chat sdk slackbot v2 * feat: add slackbotv2 runtime entrypoint * refactor: inline harness event app server types * fix: create session before slackbotv2 append * fix: improve slackbot v2 stream tasks * feat: add canonical renderer pipeline Add @centaur/rendering as the boundary from Codex app-server events to generic renderer events and Chat SDK-shaped output. Wire Slackbot Codex sessions through the renderer package while keeping Slack delivery in a Slackbot adapter. Keep services/api-rs caller-neutral by removing Slack examples from the session control-plane docs and tests. 
* refactor: route chat sdk rendering through slackbotv2 * fix: improve slackbotv2 stream latency * fix: stream slackbotv2 responses (#347) * fix: address slackbotv2 review comments --------- Co-authored-by: Zygimantas <5236121+Zygimantass@users.noreply.github.com> Co-authored-by: Goksu Toprak <goksu@tempo.xyz> Co-authored-by: Goksu Toprak <19259594+goksu@users.noreply.github.com> * fix: make sandbox tool CLIs importable (#351) * fix: preserve final Slack answers after plans (#352) * fix: use upstream Chat SDK Slack streaming (#360) Streaming is patched and we no longer need our shim. * docs: add local centaur dev skill * [codex] add focused api-rs iron proxy integration (#365) * feat: add api-rs iron proxy integration * fix: align api-rs iron proxy wiring * docs: fix run-centaur-dev api-rs env * refactor: address api-rs review feedback * fix: tighten api-rs dev sandbox setup --------- Co-authored-by: Zygimantas <5236121+Zygimantass@users.noreply.github.com> * chore: trim sandbox image size (#371) * feat: load iron proxy fragments from tool pyprojects (#372) * feat: load iron proxy fragments from tool pyprojects * docs: note pyproject fragment conversion todo * fix: improve sandbox tool smoke behavior (#373) * fix: improve sandbox tool smoke behavior * chore: remove unused harness env marker * [codex] pass sandbox tool paths and package tool CLIs (#354) fix: rebase tool cli packaging on api-rs control plane Co-authored-by: Zygimantas <5236121+Zygimantass@users.noreply.github.com> * feat(api-rs): wire api-rs and slackbotv2 into helm chart (#374) Add Dockerfiles for the Rust control plane (centaur-api-server) and the Chat SDK Slackbot v2, and deploy both via the chart alongside the Python stack. api-rs runs the agent-k8s + codex sandbox backend with per-sandbox iron-proxy; slackbotv2 replaces the legacy slackbot and targets api-rs. Enables the agent-sandbox subchart, reuses the in-cluster postgres, and repoints ingress/networkpolicy from slackbot to slackbotv2. 
* fix(rendering): preserve full task output (#379) * fix(slackbotv2): harden api-rs handoff and stream recovery * fix(session): make Slack retries idempotent for append and execute (#388) fix(session): make slack handoff idempotent * chore(sandbox): add thin test image (#389) * fix: render terminal slack session completions * feat(api-rs): integrate iron-control for per-principal proxy credential grants (#378) * feat(api-rs): integrate iron-control for per-principal proxy credential grants Add a centaur-iron-control crate (typed admin client + fragment->resource translation + principal derivation) and wire it into the session and sandbox paths. Sessions upsert an iron-control principal (Slack user for DMs, channel otherwise) and assign infra + tool/harness roles; per-sandbox iron-proxies are registered to that principal and sync their config over IRON_CONTROL_URL with an iprx_ token instead of a rendered static config. Gated on iron-control being configured, so the legacy rendered-proxy path stays intact as a fallback. Adds the Helm wiring (api-rs env + egress NetworkPolicy) behind ironControl.enabled. * chore(iron-proxy): bump base image to 0.42.0-rc.5 * fix(chart): allow api-rs ingress to iron-control network policy api-rs registers principals/roles/grants with iron-control on startup, but iron-control's NetworkPolicy ingress only permitted the legacy Python api component. Under the default-deny policy the CNI dropped api-rs's inbound connection, surfacing as a reqwest 'stream closed: EOF' transport error during startup role registration. Add api-rs to the ingress allowlist, gated on apiRs.enabled. 
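The principal rule from the iron-control integration commit above (Slack user for DMs, channel otherwise) can be sketched roughly as follows. This is an illustration only: the `SlackOrigin` type, the `principal_foreign_id` function, and the `slack-user-` / `slack-channel-` key prefixes are assumptions made for the example, not the actual `derive_principal` mapping in centaur-iron-control.

```rust
/// Where a Slack message came from (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SlackOrigin {
    /// A direct message: the session principal is the Slack user.
    DirectMessage { user_id: String },
    /// Any other conversation: the session principal is the channel.
    Channel { channel_id: String },
}

/// Returns an illustrative foreign_id for the session principal.
/// The key prefixes are assumptions; the real key format is not shown in this log.
pub fn principal_foreign_id(origin: &SlackOrigin) -> String {
    match origin {
        SlackOrigin::DirectMessage { user_id } => format!("slack-user-{user_id}"),
        SlackOrigin::Channel { channel_id } => format!("slack-channel-{channel_id}"),
    }
}
```

Keeping the rule in one function is what lets other callers (for example an operator CLI) produce the same foreign_id that the session path writes.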
* feat(api-rs): make iron-control mandatory and add sandbox drain Remove the legacy non-synced iron-proxy fallback: a per-sandbox proxy always syncs its config from iron-control, so iron-proxy now requires iron-control to be configured (hard error at startup otherwise) and a session-registration failure fails session creation instead of silently booting a sandbox with a non-functional proxy. Persist the iron-control principal OID on the session row (new migration 0003) instead of an in-memory map, so a resumed session can recreate its sandbox after an api-rs restart without re-deriving the principal. Add SessionRuntime::drain and POST /api/sandboxes/drain to stop every non-terminal sandbox, reporting which stopped and which failed. * feat(iron-proxy): run managed proxies with no local config A managed (control-plane-synced) iron-proxy rejects a local config that sets management.listen, since the control plane owns it. Stop providing any rendered proxy.yaml to per-sandbox proxies: the entrypoint runs iron-proxy with no -config when IRON_CONTROL_URL is set, and the local settings the control plane does not own (tunnel/dns listen, TLS/CA paths, log level) are passed as IRON_* env vars instead. Removes the per-sandbox ConfigMap and the now-dead fragment-derived config/env (postgres listeners are already unsupported on the iron-control path). * fix(chart): allow per-sandbox iron-proxy ingress to iron-control In managed mode each per-sandbox iron-proxy syncs its effective config directly from IRON_CONTROL_URL (the iron-control Service) over /proxy/sync. Proxy pods are labeled centaur.ai/iron-proxy=true, not component=api-rs, so iron-control's NetworkPolicy was dropping their sync connection. Add the iron-proxy pod selector to the ingress allowlist. * fix(iron-proxy): use IRON_CONTROL_PLANE_URL for the managed proxy The iron-proxy binary reads its control-plane base URL from IRON_CONTROL_PLANE_URL, not IRON_CONTROL_URL (the latter is api-rs's own admin-client var). 
Setting the wrong name made the proxy fall back to its built-in default endpoint instead of the in-cluster iron-control. Inject IRON_CONTROL_PLANE_URL on the proxy pod and key the entrypoint's managed-mode detection off it too. * feat(api-rs): single source of truth for harness auth mode The codex/claude auth mode was set independently on two sides — the proxy/iron-control fragment (api-rs CODEX_AUTH_MODE) and the sandbox agent (via sandbox.extraEnv, which only fed the legacy api). They could drift, e.g. agent on chatgpt.com (access_token) while the proxy registered the api.openai.com api_key secret, so no credential was injected and requests 401'd. Hoist the auth mode to first-class chart values sandbox.codexAuthMode / sandbox.claudeCodeAuthMode. api-rs reads them to register the matching iron-proxy credential AND now propagates them into each sandbox spec, so the agent's auth.json and the injected credential always agree. The legacy api sources the same values into its sandbox env. justfile sets the real vars from .env (was sandbox.extraEnv). * chore(justfile): route .env auth mode to sandbox.codexAuthMode Set the hoisted first-class chart values (sandbox.codexAuthMode / sandbox.claudeCodeAuthMode) from the .env vars instead of sandbox.extraEnv, so the dev deploy drives the single source of truth. * fix(api-rs): inject harness credential placeholder into the sandbox In api_key mode codex only runs `codex login --with-api-key` if an API key is present in the agent env; otherwise it keeps the dummy ChatGPT auth.json and talks to chatgpt.com. The managed-mode refactor stopped injecting the harness fragment's placeholder, so codex had no key. Inject placeholder_env from the resolved harness fragment (e.g. api_key mode → OPENAI_API_KEY=OPENAI_API_KEY) so codex authenticates against api.openai.com and iron-proxy swaps in the real credential. Empty for access_token, which uses inject rather than replace. 
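As a minimal sketch of the placeholder rule in the commit above: api_key mode injects `OPENAI_API_KEY=OPENAI_API_KEY` into the sandbox so iron-proxy can replace it, while access_token mode injects nothing. The `CodexAuthMode` enum and `codex_placeholder_env` function are hypothetical names for this example; the real value comes from `placeholder_env` on the resolved harness fragment.

```rust
/// Codex auth modes named in the commit messages (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CodexAuthMode {
    ApiKey,
    AccessToken,
}

impl CodexAuthMode {
    /// Parses the mode string; unknown values yield None rather than a silent default.
    pub fn parse(value: &str) -> Option<Self> {
        match value {
            "api_key" => Some(Self::ApiKey),
            "access_token" => Some(Self::AccessToken),
            _ => None,
        }
    }
}

/// Placeholder env vars to inject into the sandbox for a given mode.
/// api_key: the placeholder value equals the variable name and is replaced by the proxy.
/// access_token: empty, because that mode uses inject rather than replace.
pub fn codex_placeholder_env(mode: CodexAuthMode) -> Vec<(&'static str, &'static str)> {
    match mode {
        CodexAuthMode::ApiKey => vec![("OPENAI_API_KEY", "OPENAI_API_KEY")],
        CodexAuthMode::AccessToken => Vec::new(),
    }
}
```

Deriving both the proxy credential and the sandbox env from one mode value is the point of the single-source-of-truth change: the two sides cannot drift if they read the same enum.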
* fix(api-rs): inject full fragment placeholder set into the sandbox Extend sandbox placeholder injection from the harness fragment to the full proxy fragment set (infra + harness + tools), matching the iron-control roles. Env-based consumers now send the proxy_value iron-proxy replaces: codex's OPENAI_API_KEY, git's GITHUB_TOKEN, and the rest of the infra/tool credentials. Managed mode had dropped these, breaking git auth and any env-based secret use in the sandbox. * feat(api-rs): register a single shared tools role, not one per tool Bootstrap registered the infra role plus a separate role per harness and per tool fragment. Collapse all harness + tool fragments into one shared "tools" role (RoleSpec::tools) holding every secret they declare, alongside the unchanged "infra" role. SessionRegistrar already grants every registered role to each session principal, so a channel's default principal gets both infra and tools. * fix(api-rs): rebind iron-proxy to its principal on resume via annotation Reconcile the base's pause-deletes / resume-recreates iron-proxy lifecycle with the iron-control model, where resolve_iron_proxy needs the session's principal but resume() has only the sandbox id. Stamp the principal OID as a sandbox annotation (centaur.ai/iron-control-principal) at create and read it back on resume. The annotation lives on the Sandbox CRD, which survives pause and api-rs restarts, so resume rebinds to the same identity without any in-memory or DB lookup. * chore(iron-proxy): bump base image to 0.42.0-rc.6 * feat(api-rs): integrate pg_dsn secrets end to end Register each fragment postgres listener as an iron-control pg_dsn secret (PUT /api/v1/pg_dsn_secrets, granted via pg_dsn_secret_id) instead of rejecting it: the upstream DSN resolves like any secret source and an optional role becomes the proxy's SET ROLE. The shared tools/infra role holds them alongside the other secrets. 
On the per-sandbox proxy pod, deliver the local listen/client knobs the managed proxy reads by foreign_id — IRON_PROXY_PG_<FID>_{LISTEN,CLIENT_USER, CLIENT_PASSWORD} (password value direct, no indirection) — and inject the proxied DSN into the sandbox env, exposing each listener port on the proxy Service + allowing it sandbox->proxy. foreign_id is derived from a shared pg_foreign_id() so registration and the proxy env agree. * refactor(api-rs): bootstrap infra only; stop registering tool secrets api-rs no longer discovers tool YAML fragments and registers a secret-per- tool at boot. It registers exactly one role, infra, holding the shared infra secrets plus the harness auth (codex/claude) — harness auth is infra, selected by auth mode and baked into the binary (harness_auth_fragment) rather than discovered from disk. The tools role and per-tool roles are gone. Tool secrets + their grants become operator-managed in the control plane; their sandbox env will be derived from the principal's effective grants (follow-up: GET /api/v1/principals/:id/effective_config). Removes the tool/harness fragment discovery args (TOOL_DIRS, KUBERNETES_IRON_PROXY_FRAGMENT_*), the harness YAML files, and the api-rs image's tool/harness COPY. Infra placeholders + harness auth still come from the known infra set, so the agent always boots. Dead fragment code (render.rs, disk discovery) and the effective_config read are follow-ups. * feat(api-rs): derive tool-secret sandbox env from effective_config Operator-managed tool secrets now reach the sandbox via the principal's effective config instead of local fragments. Add the iron-control read GET /api/v1/principals/:id/effective_config (EffectiveConfig: replace proxy_values + postgres {foreign_id, database}). At sandbox create/resume, api-rs fetches it for the bound principal and derives: replace-secret placeholders (sandbox env, set_missing so known infra placeholders win) and Postgres listeners. 
The control plane owns each pg upstream dsn/role/database; api-rs assigns the local coordination — listen port (sequential from 6432, sorted by foreign_id), client user, a generated password, and the sandbox env var name (<NORM_FID>_DSN) — and wires IRON_PROXY_PG_<FID>_{LISTEN,CLIENT_USER,CLIENT_PASSWORD} + the DSN + port exposure. Replaces the fragment-based pg derivation. * fix(api-rs): match effective_config to landed API; defer postgres Validated against iron-control's docs/API.md + PgDsnSecret#to_proxy_dsn: the effective_config response is {data:...}-enveloped (decode_data already unwraps it) and secrets carry replace.proxy_value — both as implemented. But postgres entries are {id, foreign_id, dsn, role} with no database (the dsn is an unresolved source, so iron-control can't surface a dbname). Drop the database field from EffectivePgDsn and defer pg-listener derivation: the sandbox connect-db source is unsettled. Replace-secret placeholders still flow from effective_config; the pg wiring stays dormant (empty). * refactor(iron-proxy): remove dead fragment-discovery code Harness fragments are now baked into harness_auth_fragment and tool fragments are operator-managed, so the disk-discovery path is dead. Remove discover_fragment_files, discover_harness_fragment_files, harness_fragment_from_dirs, harness_broker_fragments(_from_dirs), default_harness_fragment_dirs, HarnessFragmentFile, parse_harness_fragment_file, the visit_* walkers, strip_auth_suffix, and their consts + exports + test. Kept: load_fragment_file/load_fragment_str (infra + baked harness), infra_fragment, placeholder_env, and the pyproject parser (still reachable via load_fragment_file). render.rs + pg_dsn registration untouched. * refactor(iron-proxy): remove dead proxy-rendering machinery Managed mode never renders a proxy.yaml, so the rendering path is dead. 
Remove render.rs, ports.rs (listen_ports_from_yaml / pg_dsn_envs), the ProxyConfig/ProxySection structs, PgDsnEnv + PostgresListener::pg_dsn_env, the model's source-resolution methods (Transform/Secret/Postgres* resolve_sources, is_managed, explicit_id, fill_missing_source, MANAGED_TRANSFORMS), load_default_proxy_base_config + the base-config path, and the unused values.rs helpers (listen_port, non_empty). Drop the render-dependent tests; the token-broker render + source resolution (broker.rs) is untouched. Kept: load_fragment_file/str, infra_fragment, harness_auth_fragment, placeholder_env, the data model (used by the iron-control registry), and the pyproject parser (still reachable via load_fragment_file). * refactor(iron-proxy): remove dead pyproject tool-secret parser The pyproject parser was only reachable via load_fragment_file's pyproject.toml branch, which is never taken now (only infra.yaml is loaded; tool secrets are operator-managed). Remove load_pyproject_fragment_file and its toml helpers, the pyproject branch, the ParsePyproject + now-unused ParseBase error variants, and the toml + sha2 dependencies. load_fragment_file now loads YAML only; everything else (infra, harness, placeholder_env, token-broker) is untouched. * refactor(iron-proxy): embed infra.yaml via include_str!, drop runtime file loader infra.yaml is the only remaining fragment file and was read at runtime by walking up the filesystem (repo_relative_path) + a Dockerfile COPY. Embed it at compile time with include_str! and parse via load_fragment_str (same as the baked harness fragments). Removes load_fragment_file, repo_relative_path, read_file, the ReadFile/ReadDir error variants, and the image's infra.yaml COPY — the binary now carries no runtime config-file dependency. infra.yaml stays in the repo as the embed source; load_fragment_str + the Value-based model are kept (concise YAML beats hand-built struct literals). 
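The listener coordination described in the "derive tool-secret sandbox env from effective_config" commit above (ports sequential from 6432, sorted by foreign_id, with a `<NORM_FID>_DSN` sandbox variable and `IRON_PROXY_PG_<FID>_LISTEN` on the proxy) might look like the sketch below. `PgListenerPlan`, `plan_pg_listeners`, and the normalization rule (uppercase, non-alphanumerics to `_`, applied to both variable names) are assumptions; the log does not spell out how `<NORM_FID>` is computed, and later commits defer and then replace this wiring.

```rust
/// Local coordination for one Postgres listener (hypothetical struct for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub struct PgListenerPlan {
    pub foreign_id: String,
    pub listen_port: u16,
    pub dsn_env_var: String,
    pub listen_env_var: String,
}

/// First listen port, as stated in the commit message.
const FIRST_PG_LISTEN_PORT: u16 = 6432;

/// Assumed normalization: ASCII-uppercase, every non-alphanumeric becomes '_'.
fn normalize_foreign_id(foreign_id: &str) -> String {
    foreign_id
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c.to_ascii_uppercase() } else { '_' })
        .collect()
}

/// Assigns ports sequentially from 6432 in foreign_id order, so the result
/// does not depend on the order the control plane returned the entries in.
/// Not handled here: port-range overflow for very large inputs.
pub fn plan_pg_listeners(foreign_ids: &[&str]) -> Vec<PgListenerPlan> {
    let mut sorted: Vec<&str> = foreign_ids.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    sorted
        .into_iter()
        .enumerate()
        .map(|(index, foreign_id)| {
            let norm = normalize_foreign_id(foreign_id);
            PgListenerPlan {
                foreign_id: foreign_id.to_string(),
                listen_port: FIRST_PG_LISTEN_PORT + index as u16,
                dsn_env_var: format!("{norm}_DSN"),
                listen_env_var: format!("IRON_PROXY_PG_{norm}_LISTEN"),
            }
        })
        .collect()
}
```

Sorting before assigning ports keeps the sandbox env and the proxy env in agreement as long as both sides start from the same set of foreign_ids.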
* fix(iron-proxy): move embedded infra.yaml into the crate The api-rs Docker build context is services/api-rs/ only, so include_str! of services/iron-proxy/infra.yaml (a sibling) failed to resolve. Move the embed source to crates/centaur-iron-proxy/src/infra.yaml and include it by local path so it's always in the build context. * feat(centaur-grants): add CLI to grant principals tool roles/secrets Add a centaur-grants CLI that lets operators add/revoke/list grants for Slack principals (users/channels) against iron-control. The primary unit is a tool's tool-{slug} role: the CLI resolves a tool by name across overlay-ordered tool dirs, parses its pyproject [tool.centaur] secrets, registers them as iron-control resources, grants them to the role, and assigns the role to the principal. A lower-level direct secret-grant path is also supported. Reuses centaur-iron-control's canonical mappings (derive_principal, RoleSpec::tool, source_from_placeholder) so the principal/role foreign_ids match exactly what api-rs writes. Extends that crate with the read/delete endpoints the CLI needs (get_principal, list_principal_roles, unassign_role, delete_grant) and factors grant_inputs_to_role out of register_role for reuse. * feat(centaur-grants): add `principals` subcommand for discovery List principals registered in iron-control so operators can find the exact foreign_id to grant against (e.g. whether a deployment folds the Slack team id into the key) without dropping to curl. Supports --label key=value filters, --managed (managed-by=centaur), and a --filter substring match. Backed by a new paginating list_principals client method on IronControlClient. * feat(centaur-grants): use grantee-scoped grant listing iron-control now exposes GET /principals/:id/grants and GET /roles/:id/grants. Add list_principal_grants/list_role_grants client methods (sharing a paginate helper) and enrich the Grant model with its grantee/secret references plus a secret_id() accessor. 
Use them to: revoke --secret <OID> by finding the principal's matching grant (no longer requires the grant OID), and show actual grants per role and direct grants in `list`. * refactor(centaur-perms): rename from centaur-grants, resource-first CLI Rename the crate/binary to centaur-perms and restructure the CLI as <noun> <verb>: principals list | show <p> | grant <p> | revoke <p> roles list | show <r> | grant <r> | revoke <r> principals grant/revoke take any mix of --tool, --role, and --secret targets; roles grant/revoke assign/remove secrets on a role (the previously missing role<-secret path). Adds get_role and list_roles client methods. * feat(centaur-perms): grant tool-config secrets to a role Add --tool (and --secret-name) to `roles grant`, so a tool's pyproject [tool.centaur] secrets can be registered and granted onto an explicit role: centaur-perms roles grant <role> --tool <name> [--secret-name NAME]... The secret resources keep their canonical tool-<slug>-<secret> foreign ids (keyed on the tool's own role), so the same secret object is shared regardless of which role it's granted to. The role is always required. * feat(centaur-perms): principals show accepts foreign_id or OID Look the principal up by whatever is passed — Slack thread key (derived), foreign_id, or prn_ OID, all accepted by GET /principals/:id — and display the principal's actual foreign_id from the response instead of echoing the input. Adds principal::resolve_lookup for the read-only path. * refactor(centaur-perms): drop redundant resolve_lookup helper resolve_lookup(p, u) returned the same string as resolve_principal(p, u, ns).foreign_id, so principals show now uses resolve_principal like the other read handlers. Keeps the display fix (show the response's actual foreign_id, not the echoed input). * fix(centaur-perms): resolve foreign_ids via namespaced lookup endpoint iron-control's bare GET /{principals,roles}/:id only matches OIDs, so a foreign_id 404s. 
get_principal/get_role now take a namespace and route foreign_ids through GET /{collection}/lookup/:namespace/:foreign_id, keeping the bare /:id route for prn_/role_ OIDs. Namespace is threaded through the CLI read paths. * fix(iron-control): route all foreign_id lookups via the lookup endpoints Per the API docs, the bare GET /{collection}/:id routes are OID-only; foreign_ids must use GET /{collection}/lookup/:namespace/:foreign_id (and GET /principals/lookup/:namespace/:foreign_id/effective_config). - Generalize the path helper to resource_path(.., suffix) and use it for get_principal, get_role, and effective_config (now namespace-aware, routing foreign_ids through the lookup variant). - Thread the iron-control namespace into effective_config callers: add a namespace field to IronControlSettings (set from IRON_CONTROL_NAMESPACE) and pass it from the agent-k8s backend; the CLI passes --namespace. - Correct the sub-resource list methods' docs (roles/grants under a principal or role are OID-only; callers pass the resolved OID). * feat(api-rs): wire pg_dsn secrets, default gcp_auth scope, name grants in show - pg_dsn: add database to PgDsnSecretInput/EffectivePgDsn; build per-principal Postgres listeners from effective_config in the sandbox agent (was deferred); make pg_dsn a first-class secret in the centaur-perms CLI (parse + translate), with foreign_id round-tripping to the sandbox DSN env var. - gcp_auth: default scope-less secrets to cloud-platform, matching the Python proxy_config (iron-control requires a non-empty scopes). - perms show: principals show and roles show now print each grant's type and name (resolved by OID), not just the bare secret OID. * feat(api-rs): wire hmac_sign secrets through iron-control and centaur-perms Make hmac_sign a first-class secret type end-to-end, mirroring pg_dsn: - iron-control: add HmacSecretInput/HmacSecretHeader, upsert_hmac_secret, GrantSecret::Hmac + Grant.hmac_secret_id, and the SecretInput::Hmac grant arm. 
- centaur-perms: parse hmac_sign from pyproject.toml (required secret credential, validated algorithm/encoding/timestamp enums) and translate it to a HmacSecretInput with a {role}-hmac-{slug} foreign_id; accept hms_ OIDs. Tool hmac secrets are operator-managed via the CLI; the infra/harness fragment translator still rejects hmac_sign since neither signs requests. * chore(iron-proxy): bump image to 0.42.0-rc.7 * feat(centaur-perms): add secrets list and show commands List sweeps all five secret types (static, oauth_token, gcp_auth, pg_dsn, hmac) in a namespace into one table, reusing the --label/--filter/--managed filters. Show fetches one secret's full configuration by OID or foreign_id, routing OIDs by prefix and resolving foreign_ids via per-type lookup. iron-control gains list_secrets / get_secret_detail and a SECRET_TYPES catalog mirroring Grant::secret_target. * refactor(api-rs): consolidate duplicated perms/iron-control helpers Dedupe helpers that had been copied across crates and tidy a few idioms surfaced in review: - Promote iron-control's `unique_foreign_id` to public and drop the byte-identical copy in centaur-perms. - Share `managed_labels()` from iron-control's util module; remove the three inline copies (registry, principal, perms translate/role_identity). - Add `GrantSecret::from_oid`, routing OIDs via the SECRET_TYPES prefix table. This also fixes `--secret <pgs_…>` being silently rejected by the old hand-rolled prefix match. - Extract `parse_field_source` shared by the oauth/hmac field parsers, and `grant_secrets`/`revoke_secrets` helpers for the repeated grant loops. - Extract a `token_broker_source` branch helper and a URL `authority` parser; reuse `string_value`/`non_empty` in iron-proxy/api-server. - Fix two stale doc comments in args.rs (only the infra role is registered; tool secrets are operator-managed) and collapse two nested ifs. 
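The lookup routing from the "route all foreign_id lookups via the lookup endpoints" commit above can be sketched as a single path helper: OIDs go to the bare `/:id` route, and anything else goes through `/lookup/:namespace/:foreign_id`, with an optional suffix such as `effective_config`. The signature below is hypothetical (the real `resource_path` arguments are elided in the log), the `/api/v1` base is assumed to apply to both routes, and the namespace value in the usage note is made up.

```rust
/// Builds the request path for a principal or role.
/// `oid_prefix` is the OID marker for the collection ("prn_" or "role_" in the log).
/// Not handled here: percent-encoding, or a foreign_id that happens to start
/// with the OID prefix.
pub fn resource_path(
    collection: &str,
    oid_prefix: &str,
    namespace: &str,
    id: &str,
    suffix: Option<&str>,
) -> String {
    let base = if id.starts_with(oid_prefix) {
        // OIDs are matched by the bare /:id route.
        format!("/api/v1/{collection}/{id}")
    } else {
        // foreign_ids 404 on the bare route, so use the namespaced lookup.
        format!("/api/v1/{collection}/lookup/{namespace}/{id}")
    };
    match suffix {
        Some(suffix) => format!("{base}/{suffix}"),
        None => base,
    }
}
```

For example, `resource_path("principals", "prn_", "centaur", "slack-C123", Some("effective_config"))` would produce the lookup variant of the effective_config route, while a `prn_`-prefixed id would keep the bare route.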
* docs(api-rs): document centaur-perms operator CLI * fix(api-rs): renumber duplicate 0003 migration to 0004 (#405) The iron-control principal migration reused version 0003, which was already taken by the session handoff idempotency migration (#388). Two files sharing a version prefix collide in sqlx, causing VersionMismatch(3) at startup. Renumber the iron-control migration to 0004 so it applies as a fresh, idempotent migration. * feat(api-rs): make warm pools deployable (#400) * feat(api-rs): make warm pools deployable * fix: handle missing Slack root thread history * fix: allow repo-cache ref pinning * fix: expose sandbox tools as cli shims * fix: grant infra role to warm pool bootstrap * fix: clear warm pool rows on drain * fix: keep warm pool bootstrap roleless * fix(api-rs): reconcile warm pools with idempotent executions * refactor(api-rs): move warm pool policy into sandbox manager --------- Co-authored-by: Centaur AI <ai@centaur.local> * fix: remove slackbotv2 synthetic starting task (#406) * fix(api-rs): preserve session migration order (#407) * fix(api-rs): select idempotency_key in active/latest execution queries (#409) active_execution_for_thread and latest_execution_for_thread map results to SessionExecutionRow, which has an idempotency_key field (FromRow matches by name), but their SELECT lists omitted the column. sqlx then fails with "no column found for name: idempotency_key". The session stdout pump calls active_execution_for_thread for every session, so the pump died on the first read and never emitted output events — the Slack bot opened the event stream and sat at "thinking" forever without posting. The write/RETURNING queries already selected the column; only these two reads were missed when idempotency_key was introduced. 
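The duplicate-0003 collision fixed in #405 above is the kind of problem a small pre-merge check could catch. The sketch below groups migration file names by their numeric version prefix and reports any version used more than once. It assumes the `<version>_<description>.sql` naming that sqlx migrations use; the function name and the file names in the usage note are hypothetical.

```rust
use std::collections::BTreeMap;

/// Returns every migration version that is claimed by more than one file,
/// together with the offending file names, in ascending version order.
/// Files without a numeric `<version>_` prefix are ignored.
pub fn duplicate_migration_versions(file_names: &[&str]) -> Vec<(i64, Vec<String>)> {
    let mut by_version: BTreeMap<i64, Vec<String>> = BTreeMap::new();
    for name in file_names {
        let Some((prefix, _description)) = name.split_once('_') else {
            continue;
        };
        let Ok(version) = prefix.parse::<i64>() else {
            continue;
        };
        by_version.entry(version).or_default().push(name.to_string());
    }
    by_version
        .into_iter()
        .filter(|(_, names)| names.len() > 1)
        .collect()
}
```

Run against a directory listing such as `["0003_session_idempotency.sql", "0003_iron_control_principal.sql"]`, it would flag version 3 with both files, which is the situation that surfaced as `VersionMismatch(3)` at startup.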
* fix(api-rs): return idempotency key on terminal updates (#410) * fix(slackbotv2): render api-rs terminal result text (#413) * fix(api-rs): include terminal result text in completions (#412) * fix(slackbotv2): defer Slack stream until visible output (#415) fix(slackbotv2): defer slack stream until visible output * fix(slackbotv2): bound Slack task stream payloads (#416) * fix(slackbotv2): omit task output from Slack streams (#418) Co-authored-by: Centaur AI <ai@centaur.local> * fix: preserve final answer for textless turn completion (#421) Co-authored-by: Centaur AI <ai@centaur.local> * fix(slackbotv2): scope session streams to execution (#422) * fix(slackbotv2): recover from oversized Slack renders * feat(api-rs): manage broker credentials in iron-control, drop sidecar (#404) feat(api-rs): broker credentials in iron-control for codex access-token auth Manage iron-control broker credentials — managed OAuth refresh tokens that iron-control mints and delivers inline to proxies via a `token_broker` source — and drop the iron-proxy broker sidecar. Adds the centaur-perms `broker` subcommands, `brokered_token` tool-secret parsing/translation, and the iron-control client/models for broker credentials. Supporting infra so it runs end-to-end: - chart: iron-control Solid Queue worker (bin/jobs) that runs the broker OAuth refresh loop, plus its egress NetworkPolicy and the postgres ingress allowlist entry; image pullPolicy defaulted to Always for the mutable :latest tag. - chart: create-db init waits for postgres and provisions all four logical databases (primary + Solid Cache/Queue/Cable) idempotently. - slackbotv2: tolerate the postgres startup race with an owned pool + connect retry so a transient cold-start failure doesn't wedge the bot. 
* fix: fail sessions on oversized sandbox output * fix(slackbotv2): honor plain text render requests * fix(slackbotv2): show command details without output * feat: update iron-proxy to 0.42.0-rc.8 with single multiplexed pg listener * feat(api-rs): persist session personas (#429) This allows clients to define persona as a part of the request. * feat(api-rs): add telemetry observability (#446) * fix(slackbotv2): pin Slack stream continuation fix (#453) fix(slackbotv2): pin slack stream continuation fix * feat(api-rs): CloudWatch tool aws_auth via iron-control (#451) * feat(api-rs): add aws_auth credential type linked to iron-control Reimplements the CloudWatch tool's AWS SigV4 re-signing support in the Rust api-rs / iron-control control plane instead of the Python api service (superseding #449). The tool signs requests with placeholder credentials; iron-proxy's aws_auth transform re-signs with the real read-only IAM keys resolved from iron-control, so credentials never enter the sandbox. - centaur-iron-control: AwsAuthSecretInput model, aws_auth_secrets endpoint + upsert client, GrantSecret/Grant/SECRET_TYPES wiring (aas_ prefix), fragment translator marks aws_auth unsupported (it is a tool secret registered via the centaur-perms CLI, like hmac_sign) - centaur-perms: parse type = "aws_auth" from a tool's pyproject and translate it to an AwsAuthSecretInput granted to the tool's role - keep the cloudwatch tool and the iron-proxy SigV4 header allowlist; drop the Python services/api changes (api-rs replaces them). AWS_REGION is non-secret and reaches the sandbox via passthrough_env. * feat(api-rs): translate aws_auth in the iron-proxy fragment path Infra/harness fragments can now declare an aws_auth transform and have it registered as an iron-control aws_auth secret, instead of erroring as unsupported. 
Mirrors gcp_auth: access_key_id/secret_access_key (and optional session_token) are placeholder refs resolved via the source policy, allowed_regions/allowed_services scope signing, rules use the shared request-rule shape, and the foreign_id keys on the access-key placeholder (`{role}-aws-{slug}`). * feat(api-rs): support aws_auth in tool discovery tool_discovery rejected the aws_auth secret type and dropped the whole tool, so the cloudwatch tool got no proxy fragment and no sandbox AWS credentials. Parse aws_auth into an aws_auth transform (matching the iron-control translator), seed the sandbox AWS SDK placeholder creds via placeholder_env, and add GITHUB_TOKEN to the infra-env bootstrap for the repo-cache. * fix(chart): repo-cache temp dir broken by k8s $$ collapse The sync used a PID-suffixed temp dir, but Kubernetes collapses the doubled dollar sign in a container command to a single one during its own $(VAR) expansion, so the suffix became a constant literal and the mv failed ("subdirectory of itself"). Use a deterministic temp name and sweep any stale temp dirs before cloning; sync is sequential per pod so a fixed name is safe. Self-heals existing corrupt caches on next run. * Revert "fix(chart): repo-cache temp dir broken by k8s $$ collapse" This reverts commit 2913516d2e23421888f9ba6875bc0ebfea7c1f48. * fix(chart): repo-cache temp dir broken by k8s $$ collapse The sync used a PID-suffixed temp dir, but Kubernetes collapses the doubled dollar sign in a container command to a single one during its own $(VAR) expansion, so the suffix became a constant literal and the mv failed ("subdirectory of itself"). Use a deterministic temp name and sweep any stale temp dirs before cloning; sync is sequential per pod so a fixed name is safe. Self-heals existing corrupt caches on next run. 
* fix(chart): keep repo-cache target local
* fix(slackbotv2): continue large task streams (#458)
* feat(api-rs): add Absurd workflow runtime (#465)
* fix: grant infra role to warm pool bootstrap
* fix: keep warm pool bootstrap roleless
* refactor(api-rs): move warm pool policy into sandbox manager
* feat(api-rs): add absurd workflow runtime poc
* chore(api-rs): refresh absurd workflow staging image
* fix(api-rs): label workflow host sandboxes
* fix(api-rs): serialize workflow timestamps as rfc3339
* fix(api-rs): mount overlay workflows in workflow sandbox
* fix(api-rs): propagate sandbox image pull secrets
* fix(api-rs): extend workflow agent turns
* fix(api-rs): keep workflow host claims alive
* fix(api-rs): heartbeat all workflow tasks
* fix(workflows): call tools through sandbox shims
* fix(workflows): resolve sandbox tool shim outside login shells
* fix(workflows): bootstrap tool shims in workflow hosts
* fix(workflows): mount tools in workflow host sandboxes
* fix(workflows): grant workflow host tool secrets
* ci: speed up branch image builds
* ci: add Dockerfile package caches
* fix(api-rs): repair workflow schedule test fixture
* refactor(api-rs): rename workflows crate
  ---------
  Co-authored-by: Centaur AI <ai@centaur.local>
* fix(slackbotv2): suppress too-long fallback reposts (#466)
* fix(slackbotv2): avoid paging open task cards (#467)
* fix(slackbotv2): page Slack plans by visible tasks (#469)
* fix(slackbotv2): accept Slack events route (#471)
  Co-authored-by: Centaur AI <ai@centaur.local>
* fix(api-rs): raise stdout cap and disable service links (#473)
  fix(api-rs): raise stdout line cap and disable service links
  Co-authored-by: Centaur AI <ai@centaur.local>
* fix: keep sandbox bootstrap noise out of the harness stdout stream (#474)
  The sandbox entrypoint's install-tool-shims printed its success notice to stdout, which is the same pipe the session stdout pump streams to clients.
slackbotv2 treated any JSON-unparseable output line as a terminal codex line, so on every fresh (non-warm) sandbox the very first bootstrap line ended the render stream before the agent produced output, finalizing the Slack reply as 'Execution completed, but no final text was captured.' while the real answer streamed afterwards and was dropped. - install_tool_shims.py: write the notice to stderr - slackbotv2: non-JSON output lines are noise, not terminal - regression test: bootstrap line before codex output still delivers the answer Amp-Thread-ID: https://ampcode.com/threads/T-019eb1eb-27c4-7169-9489-41d85f8e0614 Co-authored-by: Centaur AI <ai@centaur.local> Co-authored-by: Amp <amp@ampcode.com> * fix: P0 review fixes for the Rust control plane (#344 review) (#472) * fix(api-rs): only drive session executions claimed by this request mark_execution_running treated an already-running row as a successful claim, so a concurrent request with the same idempotency key could fall into the fallback fetch, see status=running, and send the same input to the sandbox a second time. It now returns ClaimExecutionResult with a claimed flag that is true only when this call did the queued->running transition; the runtime returns the existing execution without driving it when the claim was lost. Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85 Co-authored-by: Amp <amp@ampcode.com> * fix(api-rs): reject HTTP tool secrets without hosts An HTTP secret parsed with an empty hosts list (empty tool-level default, hosts = [], or malformed hosts falling back to an empty default) translated to an empty iron-control rules array, leaving the credential host-unlimited. Both manifest parsers (centaur-perms and the api-server's tool discovery mirror) now fail closed, matching the brokered_token parser. Affected tools are warn-skipped at discovery. 
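The fail-closed rule for HTTP tool secrets described just above can be illustrated with a small sketch. It is written in TypeScript for brevity and is not the Rust code in centaur-perms or tool discovery; `HttpSecretManifest`, `parseHosts`, and the result shape are hypothetical names chosen for the example.

```typescript
// Hypothetical manifest shape; the real parsers are Rust (centaur-perms, tool discovery).
interface HttpSecretManifest {
  name: string;
  hosts?: unknown; // raw manifest value: may be missing, empty, or malformed
}

type ParseResult =
  | { ok: true; hosts: string[] }
  | { ok: false; reason: string };

// Fail closed: an empty or malformed host list is an error, never "no restriction".
function parseHosts(secret: HttpSecretManifest, toolDefaultHosts: string[]): ParseResult {
  let hosts: string[];
  if (secret.hosts === undefined) {
    // No per-secret list: fall back to the tool-level default (which may itself be empty).
    hosts = toolDefaultHosts;
  } else if (Array.isArray(secret.hosts) && secret.hosts.every((h) => typeof h === "string")) {
    hosts = secret.hosts as string[];
  } else {
    return { ok: false, reason: `secret ${secret.name}: hosts must be a list of strings` };
  }
  const cleaned = hosts.map((h) => h.trim()).filter((h) => h.length > 0);
  if (cleaned.length === 0) {
    // An empty rules array would leave the credential host-unlimited, so reject it.
    return { ok: false, reason: `secret ${secret.name}: at least one host is required` };
  }
  return { ok: true, hosts: cleaned };
}
```

A caller in this sketch would log the `reason` and skip the affected tool, which corresponds to the warn-skip behavior at discovery.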
* fix(api-rs): guard absurd.await_event against task/run mismatch await_event trusted the caller-provided (task_id, run_id) pair: a mismatched call could attach one task's run to a wait/checkpoint on another task and put the wrong task to sleep. Reject mismatches like get_task_checkpoint_states already does. Shipped as migration 0009 (create or replace) because 0007 is already applied in live environments and sqlx validates migration checksums. * fix(api-rs): only claim running warm sandboxes A warm sandbox observed as Created is not ready for byte I/O and means the runtime regressed after the replenisher saw it running (backends wait for readiness before returning from create). Claiming it made the session fail at open_io; mark it failed and try the next one instead. Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85 Co-authored-by: Amp <amp@ampcode.com> * fix(api-rs): make grant_inputs_to_role idempotent grant_inputs_to_role documented idempotency but always POSTed a new grant after upserting each secret, so re-running centaur-perms grants or startup role registration produced duplicate grants or conflicts. It now lists the role's existing grants once and reuses the grant for an already-granted secret. * fix(rendering): flush buffered answer in codexAppServerToRendererEvents The array helper never called mapper.flush(), so finite sources that end without an explicit terminal event lost buffered answer text and never received renderer.done. Flush like codexAppServerToChatSdkStream does; flush is a no-op when a terminal event already completed the stream. * fix(slackbotv2): cap inline attachments at 100 MiB serializeAttachment buffered every Slack attachment in memory and base64-inlined it with no size limit, letting one large upload blow request limits or OOM the process. 
Skip the download when Slack's size metadata exceeds 100 MiB and re-check the actual byte count after fetching; oversized attachments degrade through the existing fetchError channel. Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85 Co-authored-by: Amp <amp@ampcode.com> * fix(slackbotv2): isolate render obligation recovery failures One thread's corrupt state, lease error, or failed render propagated out of the recovery scan, so the remaining indexed obligations were never attempted until the next restart. Isolate each thread, log the failure, and count it as deferred so the capped-backoff retry loop revisits it. * ci: skip registry cache export for fork PRs cache-to type=registry pushes a cache manifest to GHCR even when image push is disabled, and fork PRs run with a read-only GITHUB_TOKEN, so their builds failed at cache export. Gate the registry cache-to on the same not-a-fork predicate as push. --------- Co-authored-by: Amp <amp@ampcode.com> * fix: align Slack pagination with chat sdk and raise pg limits (#475) fix: update Slack stream pagination and Postgres pool limits * feat(api-rs): serve tools + overlay to agent sandboxes (#443) * feat(api-rs): serve tools + gerard overlay to agent sandboxes api-rs sandboxes had no tools and no overlay. Give api-rs-spawned agents the same base + overlay tools and overlay system-prompt the chart already wires for the api-rs pod, using upstream's CLI-shim tool model rather than a sidecar. Upstream direction: tools are shell CLI shims, not an HTTP registry. The agent image's install-tool-shims (services/sandbox/install_tool_shims.py) scans TOOL_DIRS at entrypoint and `uvx`-installs each pyproject [project.scripts] as a CLI; the SYSTEM_PROMPT points agents at those CLIs and `centaur-tools list`. The old `call <tool>` HTTP registry is deprecated to control-plane-only. 
Tool secrets are already handled upstream: codex_app_server_env_template pushes the tool placeholder creds onto the agent env, iron-control grants the per-sandbox principal the real secrets, and Postgres rides proxied `*_DSN` env from apply_proxy_env. So the agent needs only the tool SOURCES at the right paths — no sidecar, no HMAC sandbox token, no loopback tool server. - tools.rs (replaces tool_server.rs): a `tools-bootstrap` init container copies /app/tools out of the shared centaur-api image into an emptyDir mounted at /app/tools in the agent, and an `overlay-bootstrap` init container copies the org overlay tree into overlay-root mounted at overlay.mountPath (the same path the api-rs Deployment uses) and stages the overlay's SYSTEM_PROMPT.md as $HOME/AGENTS_OVERLAY.md, which the sandbox entrypoint appends to the base prompt. TOOL_DIRS is set on the agent env to /app/tools (or /app/tools:<mountPath>/tools with the overlay) — identical to the value the api-rs pod computes for its own tool discovery, set deterministically in the spec builder rather than via passthrough env. - lib.rs: build_agent_sandbox layers the tools/overlay env over spec.env, mounts the bootstrapped sources read-only into the agent, and appends the tools-bootstrap + overlay-bootstrap init containers and their volumes. No sidecar container, no token minting. - args.rs: a minimal ToolsArgs (source image/pull-policy, reusing the KUBERNETES_TOOL_SERVER_IMAGE* env the chart sets from the shared api image) and OverlayArgs (image/pull-policy/source-path/mount-path) wired into AgentSandboxConfig. Explicit clap arg ids avoid id collisions with the other flattened arg structs. - chart apirs.yaml: render the tools source image (api.image.*, gated on toolServer.enabled) and overlay (overlay.*) onto the api-rs env, replacing the KUBERNETES_TOOL_SERVER_* sidecar block. 
Gone vs the sidecar port: tool_server.rs, the sbx1 HMAC token minting and its SANDBOX_SIGNING_KEY requirement, CENTAUR_TOOLS_URL, the sidecar pg-DSN/proxy-env collection, and the hmac/base64/sha2 dependency additions (nothing else in the agent-k8s crate uses them). Warm-pool sandboxes route through the same build_agent_sandbox path, so they get the tools/overlay init containers and volumes for free. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(api-rs): stage tools-bootstrap copy outside /app/tools The tools-bootstrap init container mounted the tools emptyDir at /app/tools — the same path it copies FROM. The mount shadows the source image's tools tree, so the script self-copies the empty volume and GNU cp rejects it (exit 1); every sandbox dies with 'reached terminal state before running' and no agent ever starts. Mount the volume at /tools-bootstrap instead (mirroring how overlay-bootstrap stages to a distinct target) and copy the image's /app/tools into it. The agent container keeps mounting the same volume at /app/tools, so TOOL_DIRS and the shim installer are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix: wire sandbox overlays without tools Gate overlay env, volumes, and mounts independently from the tools source image so overlay-only sandbox configs produce valid pod specs. * fix: make sandbox bootstrap volumes writable Set an fsGroup on sandbox pods that use tools or overlays so non-root bootstrap init containers can populate their emptyDir mounts. * fix(api-rs): source sandbox tools image from the api-rs image The tools-bootstrap init container copied /app/tools from .Values.api.image (centaur-api), but api-rs discovers its tools from /app/tools in its own container (.Values.apiRs.image). Sourcing from a different image risked the agent installing a different tool set than api-rs granted per-sandbox creds for. 
Source from the same api-rs image the Deployment runs so the two match by construction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(api-rs): clone sandbox tools from a repo instead of baking them in Replumb the tools-bootstrap init container to git-clone the tools repo at a pinned ref into each sandbox's /app/tools (sparse on the tools subdir; GitHub token via askpass for private repos), instead of copying /app/tools out of the api-rs image. Mirrors the repo-cache architecture — clone a repo into a pre-provisioned directory — without sharing its node-level cache, so adding a tool is a push to the repo rather than an api-rs image rebuild. api-rs still discovers its own /app/tools to grant proxy creds, so pin toolServer.ref to the tool set the image carries to avoid drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(api-rs): route the tools clone through the per-sandbox iron-proxy The sandbox NetworkPolicy only allows egress to the sandbox's iron-proxy, api-rs, and DNS, so the tools-bootstrap init container's direct git clone to github.com is blocked whenever iron-proxy is enabled. Route the clone through the proxy like all other sandbox egress: export HTTPS_PROXY (the resolved per-sandbox proxy URL apply_proxy_env already put on the spec) and GIT_SSL_CAINFO, and mount the pod's existing firewall-ca volume into the init container. github.com/api.github.com are already in the baseline proxy allowlist, so no policy or allowlist changes are needed. Without iron-proxy the clone still goes direct. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(api-rs): quote repo/ref/subdir in the tools-bootstrap script These are operator config (helm values -> env -> clap), not user input, but interpolating them bare into the /bin/sh -ec script means a stray space or metacharacter breaks in the shell instead of loudly in git. Quote them at every interpolation site. 
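One common way to quote operator-supplied values before interpolating them into a `/bin/sh -ec` script is POSIX single-quote escaping. The sketch below is TypeScript for illustration only (the actual spec builder is Rust); `shellQuote` and `buildCloneScript` are hypothetical names, and the script is a simplified stand-in that omits the sparse checkout, askpass, and proxy handling of the real tools-bootstrap script.

```typescript
// POSIX single-quote escaping: wrap the value in '...' and rewrite each embedded ' as '\''.
// Inside single quotes the shell treats every other character literally.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Hypothetical, simplified clone script: every interpolation site is quoted,
// so a stray space or metacharacter reaches git as a single argument.
function buildCloneScript(repo: string, ref: string, dest: string): string {
  return [
    `git clone --no-checkout -- ${shellQuote(repo)} ${shellQuote(dest)}`,
    `git -C ${shellQuote(dest)} checkout ${shellQuote(ref)}`,
  ].join("\n");
}
```

With this approach a malformed ref such as `main; rm -rf /` is passed to git verbatim and fails there, rather than being interpreted by the shell.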
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(api-rs): retry the tools clone through the proxy's startup window The per-sandbox iron-proxy is created in the same reconcile as the Sandbox CR and isn't accepting connections yet when the tools-bootstrap init container first runs — the clone dies with connection-refused, and an init failure is terminal for the Sandbox (no kubelet retry), so every cold spawn failed with 'reached terminal state before running'. Wrap the clone/sparse-checkout/ref fetch in a bounded retry loop (30 x 2s) so the init container rides out the proxy's startup instead of killing the sandbox. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> (cherry picked from commit b1f274db53bca25494d6f5a12d6c3b79d72b6409) * refactor(api-rs): converge sandbox overlay on the spec-level overlay-image plumbing The base branch grew its own overlay mechanism (SandboxSpec.overlay + overlay_json) for workflow-host sandboxes, configured by the same CENTAUR_OVERLAY_* env this branch's OverlayArgs read — so a workflow-host pod with an overlay configured got two init containers and two volumes with identical names, which Kubernetes rejects. Adopt the upstream plumbing wholesale: the backend default is now an OverlayImage from the same env helper the workflow host uses (the OverlayArgs flags are gone), a spec-level overlay takes precedence over the backend default so only one overlay-bootstrap/overlay-root pair ever exists, and agent sandboxes mount the overlay at /opt/centaur/overlay like workflow hosts do. The AGENTS_OVERLAY.md prompt staging moves into the shared overlay_json path, and the chart's duplicated CENTAUR_OVERLAY_* env block is dropped — the upstream block already feeds it. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Add Discord tool * Use Discord self-token client * fix(api-rs): Rust CI gate, constant-time webhook auth, error-chain hardening (#489) fix(api-rs): add Rust CI gate, constant-time webhook auth, and error-chain hardening Findings from a workspace-wide review (clippy + manual audit), in three parts. CI + lint policy: - New rust-api CI job: cargo fmt --check, clippy --workspace --all-targets -D warnings, cargo test --workspace. The Rust workspace previously had no CI coverage at all. - [workspace.lints] in the root manifest (dbg_macro/todo/unimplemented), plumbed into every crate via [lints] workspace = true. - Fixed all 16 pre-existing clippy warnings. The too_many_arguments clusters got real fixes: session-runtime background tasks now share a RuntimeContext struct, and run_agent_session_turn takes an AgentTurnRequest struct instead of 12 positional args. Webhook auth + 5xx hygiene: - HMAC verification now decodes the presented signature and uses Mac::verify_slice (constant-time) instead of encoding the digest and string-comparing. Uppercase hex signatures now verify too. - Bearer token comparison goes through subtle::ConstantTimeEq. - Missing/invalid webhook secret is now a 500 (server misconfig), not a 400 blamed on the caller, matching the Python API behavior. - 5xx responses and SSE stream errors log the full error chain server-side and return an opaque message instead of leaking sqlx/kube/runtime internals to clients. Error-chain preservation: - SandboxError::Io/Backend now carry an optional #[source] cause (with io/io_source/backend/backend_source constructors). Display output is unchanged so existing log lines keep their content, but the structured chain survives for chain-walking consumers. 
- absurd::Error gained TaskFailed(Box<dyn Error + Send + Sync>); workflow failures are no longer flattened into InvalidOptions strings, and serialize_error persists the full source chain in failure payloads. - WorkflowRuntimeError gained Internal (HTTP 500) and Upstream (HTTP 502) variants; ~20 server-side failure sites (Python host spawn/protocol, queue dispatch, Slack API, agent-turn outcomes) were reclassified off BadRequest so they stop surfacing as 400s and become visible to 5xx alerting. Verified with cargo fmt --check, cargo clippy --workspace --all-targets -D warnings, and cargo test --workspace (190 passed, 0 failed). New tests cover uppercase-hex/base64/invalid HMAC signatures and the missing-secret-is-internal-error contract. Amp-Thread-ID: https://ampcode.com/threads/T-019eb66a-5494-7747-a029-f5b2d8a9a5b2 Co-authored-by: Amp <amp@ampcode.com> * feat(slackbotv2): conflate render streams for slow Slack consumers (#484) A busy execution can emit tens of thousands of session events while Slack rendering pays one rate-limited API call per chunk, so a large turn could take far longer to render than to run. On 2026-06-11 a 10,453-event turn (thread slack:C0A87C21805:1781171006.081809) was still rendering 36 minutes after the harness finished when a deploy killed the pod, leaving the thread wedged behind a stuck activeExecution flag. Wrap the chunk stream in a conflator that drains the source eagerly while the consumer is busy: markdown deltas concatenate (append-only content), task updates merge per field keyed by card id with the newest value winning (updates omit details/output to mean unchanged, so absent fields inherit the pending value), and plan updates keep the latest title. Each consumer pull yields one pending item, so the Slack call count is bounded by distinct cards plus markdown volume instead of source event count. When the consumer keeps up, pending never accumulates and behavior matches the unwrapped stream. 
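The merge rules of the conflator described above can be sketched as a small pending queue. This is an illustration under assumptions, not the slackbotv2 implementation: the chunk shapes, the `PendingChunks` class, and the choice to merge markdown only into the most recent pending entry are hypothetical, and the eager asynchronous draining of the source is left out.

```typescript
// Hypothetical chunk shapes; the real stream types in slackbotv2 may differ.
type MarkdownChunk = { kind: "markdown"; text: string };
type TaskChunk = { kind: "task"; id: string; status?: string; details?: string; output?: string };
type PlanChunk = { kind: "plan"; title: string };
type Chunk = MarkdownChunk | TaskChunk | PlanChunk;

class PendingChunks {
  private queue: Chunk[] = [];

  // Called for every source event while the consumer is busy.
  push(chunk: Chunk): void {
    if (chunk.kind === "markdown") {
      const last = this.queue[this.queue.length - 1];
      if (last !== undefined && last.kind === "markdown") {
        last.text += chunk.text; // append-only content: deltas concatenate
        return;
      }
    } else if (chunk.kind === "task") {
      const id = chunk.id;
      const existing = this.queue.find((c): c is TaskChunk => c.kind === "task" && c.id === id);
      if (existing !== undefined) {
        // Newest value wins per field; a field absent from the update means "unchanged".
        if (chunk.status !== undefined) existing.status = chunk.status;
        if (chunk.details !== undefined) existing.details = chunk.details;
        if (chunk.output !== undefined) existing.output = chunk.output;
        return;
      }
    } else {
      const existing = this.queue.find((c): c is PlanChunk => c.kind === "plan");
      if (existing !== undefined) {
        existing.title = chunk.title; // keep only the latest title
        return;
      }
    }
    this.queue.push({ ...chunk }); // copy so later merges never mutate the caller's object
  }

  // Called once per consumer pull: yields one pending item, or undefined when drained.
  pull(): Chunk | undefined {
    return this.queue.shift();
  }
}
```

In this sketch, many updates to one task card collapse into a single pending entry, so the number of pulls depends on the number of distinct cards and markdown runs rather than on the number of source events.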
The emulate regression test shows the effect directly: 400 output deltas to one card collapse from 402 chunk sends to a handful. Existing tests that asserted specific intermediate card states now assert terminal state plus aggregate content, since intermediates are timing-dependent under conflation; the open-segment test synchronizes on the in_progress state reaching Slack before completing the card, preserving its original intent. Amp-Thread-ID: https://ampcode.com/threads/T-019eb342-f946-750e-89d1-a8f028e80d0e Co-authored-by: Amp <amp@ampcode.com> * chore(sandbox): bump codex 0.130.0 -> 0.139.0 (#490) Fixes ~3% of executions stalling for exactly 300s (stream_idle_timeout) when a turn is sent over a cached Responses websocket: codex 0.130.0 never transmitted the Responses-Lite mode marker, so a cached socket whose negotiated mode mismatched the request would silently never respond. Fixed upstream in openai/codex#26542 (954e2878, 2026-06-05) by sending the lite marker per-request via client_metadata; first release containing it is 0.138.0, we bump to latest 0.139.0. Evidence: 17/562 executions in the last 25h show the signature userMessage item.completed -> 302-305s gap -> first model output, with no correlation to cluster restarts, proxy reloads, or connection idle time. App-server JSON-RPC protocol changes between 0.130 and 0.139 are additive only for the methods our wrapper uses. 
Amp-Thread-ID: https://ampcode.com/threads/T-019eb6e8-b3b6-71c7-8c41-7f4b99048861 Co-authored-by: Amp <amp@ampcode.com> * Remove overlay images and refresh repo-cache tools/workflows * docs: add api-rs migration checklist (#493) Amp-Thread-ID: https://ampcode.com/threads/T-019eb71c-209d-727d-bebb-cc47f9ba7ac7 Co-authored-by: Amp <amp@ampcode.com> * Add repo-cache extra tool sources * fix(sandbox): disable Codex multi-agent tools (#499) Amp-Thread-ID: https://ampcode.com/threads/T-019eb6f6-7c8f-72cf-8744-96b41852a123 Co-authored-by: Centaur AI <ai@centaur.local> Co-authored-by: Amp <amp@ampcode.com> * feat: adopt orphaned executions and unwedge render recovery (#486) * feat(api-rs): adopt executions orphaned by control plane restarts Execution rows never time out on their own: the only writer of a terminal status is the api-rs process watching the sandbox, so a kill mid-turn (for example a deploy) leaves the row 'running' forever. That wedges the thread - the one-active-execution index blocks new executes - and any event stream consumer waits for a terminal event that will never be written. In production one such zombie had been 'running' for 17 hours while its sandbox was still alive and the finished answer ('Done: pushed commit 5ca8c99 to PR #432...') sat unread in the pod logs. At startup the runtime now adopts every orphaned execution instead of leaving it stuck: 1. If the sandbox already finished the turn while nobody was attached, recover the terminal outcome (and final answer) from the backend's recorded output - k8s attach streams only deliver from attach time forward, but the kubelet's pod logs retain what was missed. New SandboxBackend::read_output_since with a pod-logs implementation for the k8s backend; other backends default to Unsupported. 2. If the turn is still in flight, re-attach the stdout pump and re-arm the remaining max-duration budget from execution metadata. 3. 
If the sandbox is gone (or the orphan never received input), record the failure honestly so the thread unwedges. Covered by four Postgres-gated integration tests (mirroring the SESSION_RUNTIME_TEST_DATABASE_URL pattern), including the production zombie shape: running row + alive sandbox + answer only in recorded logs -> execution completes with the answer, no live attach needed. Amp-Thread-ID: https://ampcode.com/threads/T-019eb342-f946-750e-89d1-a8f028e80d0e Co-authored-by: Amp <amp@ampcode.com> * fix(slackbotv2): keep render recovery moving past hung and corrupt obligations The startup recovery scan walks obligations serially and fully awaits each re-render, so a single obligation whose event stream never yields a chunk (for example a zombie execution with no terminal event) blocked every obligation queued behind it forever. In production the scan sat hung on one thread for hours with ~180 undelivered answers starved behind it. Three changes: - Per-thread deadline (renderRecoveryThreadTimeoutMs, default 2m): on timeout the scan defers the thread and moves on, leaving the attempt running detached with its lease held so a later pass cannot start a duplicate render alongside it. - Lease-skipped threads now count as deferred, so the retry loop keeps running until obligations are actually resolved instead of exiting while a crashed pass's lease blocks them. - A failure budget (5 non-retryable failures) abandons obligations that can never render - such as a corrupt thread id without a thread ts, which previously poisoned the retry loop on every pass - and clears their state so the thread unwedges. Emulate tests cover a renderable obligation queued behind a hung zombie (delivered despite the hang, zombie left pending) and abandonment of the corrupt-thread obligation after repeated failures. 
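The bookkeeping behind the render recovery changes above (deferred threads keep the retry loop alive, and a failure budget abandons obligations that can never render) can be sketched as follows. This is a minimal illustration, not the slackbotv2 code: `RecoveryLedger`, the outcome names, and the assumption that the fifth non-retryable failure triggers abandonment are hypothetical.

```typescript
// Hypothetical outcome of one recovery attempt for a thread.
type AttemptOutcome = "rendered" | "timed_out" | "lease_held" | "failed_non_retryable";
type Decision = "resolved" | "deferred" | "abandoned";

// Budget taken from the commit message; abandoning on the 5th failure is an assumption.
const FAILURE_BUDGET = 5;

class RecoveryLedger {
  private failures = new Map<string, number>();

  record(threadId: string, outcome: AttemptOutcome): Decision {
    switch (outcome) {
      case "rendered":
        this.failures.delete(threadId);
        return "resolved";
      case "timed_out": // per-thread deadline hit: move on, revisit on a later pass
      case "lease_held": // another attempt holds the lease: count as deferred, not done
        return "deferred";
      case "failed_non_retryable": {
        const count = (this.failures.get(threadId) ?? 0) + 1;
        if (count >= FAILURE_BUDGET) {
          this.failures.delete(threadId); // clear state so the thread unwedges
          return "abandoned";
        }
        this.failures.set(threadId, count);
        return "deferred";
      }
    }
  }
}

// The retry loop keeps running while any obligation is still deferred.
function shouldRetry(decisions: Decision[]): boolean {
  return decisions.some((d) => d === "deferred");
}
```

The point of the sketch is that a scan pass ends with a decision per thread, and only `deferred` decisions justify another pass, so neither a hung thread nor a permanently failing one can stall the rest.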
Amp-Thread-ID: https://ampcode.com/threads/T-019eb342-f946-750e-89d1-a8f028e80d0e Co-authored-by: Amp <amp@ampcode.com> --------- Co-authored-by: Amp <amp@ampcode.com> * fix(api-rs): raise session body limits (#501) * fix(api-rs): wire harness OTLP export so Laminar traces carry cost (#492) * fix(api-rs): wire harness OTLP export so Laminar traces carry cost Since the cutover to the Rust control plane, Laminar traces contain no cost — and no spans at all. Cost only ever came from codex's session_task.turn spans (token usage normalized into gen_ai.usage.* by codex-app-wrapper's OTLP prefix proxy and priced by Laminar), and the Rust control plane never wired any of that path up: - sandbox pods got no OTEL_* env (nothing consumed sandbox.extraEnv), so the wrapper never had an OTLP endpoint to export to; - session stdin lines carried only thread_key, never trace_id or traceparent, so the wrapper never wrote codex's [otel] config; - the per-sandbox egress NetworkPolicy only allows iron-proxy, api-rs, and DNS, so direct exports would time out (and plain-HTTP via iron-proxy is rejected with 405); - the chart's api-rs egress policy has no OTLP rule, so api-rs's own spans die with BatchSpanProcessor 'network error' export failures. Fix, end to end: - chart: render sandbox.extraEnv into SESSION_SANDBOX_EXTRA_ENV on the api-rs deployment (same contract as the Python control plane's KUBERNETES_SANDBOX_EXTRA_ENV) and add networkPolicy.otlpEgress values that open api-rs egress to an in-cluster collector namespace. - args: parse SESSION_SANDBOX_EXTRA_ENV (JSON name/value list) into the codex sandbox env template (operator wins), and derive the sandbox OTLP egress NetworkPolicy target from the sandbox's own OTLP endpoint when it is an in-cluster service DNS name. 
- agent-k8s: add the namespace-scoped OTLP egress rule to the per-sandbox egress policy (resume-safe, backend-level config) and auto-merge the OTLP endpoint host into the sandbox NO_PROXY so the wrapper's export bypasses iron-proxy. - session-runtime: enrich every sandbox stdin line with a deterministic per-thread trace_id (UUIDv5 of the thread key; derive-don't-store replacement for the Python thread_traces row) and the execution span's W3C traceparent (always sampled), on both the execute and steering paths, so codex joins the execution trace and the wrapper configures codex's OTLP export on the first turn. - telemetry: traceparent_for_span helper. Amp-Thread-ID: https://ampcode.com/threads/T-019eb733-e7f6-76b9-81ee-49254bc72181 Co-authored-by: Amp <amp@ampcode.com> * fix(api-rs): repair clippy gate broken on api-rs-control-plane The overlay-removal commit (41d36d64) pushed directly to the branch broke 'cargo clippy --workspace --all-targets -- -D warnings': - tools.rs module doc: list continuation without indentation (doc_lazy_continuation) — separate the trailing paragraph; - workflow_host_spec: collapsible nested if left behind by the removed overlay branch — use let-chains. Amp-Thread-ID: https://ampcode.com/threads/T-019eb733-e7f6-76b9-81ee-49254bc72181 Co-authored-by: Amp <amp@ampcode.com> * fix(api-rs): always forward OTLP env (incl. ingest auth header) into codex sandboxes Production investigation showed Laminar's /v1/traces requires a project API key bearer: every OTLP POST since the cutover is a 401 (the pre-cutover deployments carried OTEL_EXPORTER_OTLP_HEADERS as an unrecorded hot-patch that the cutover sync wiped, and sandboxes inherited it via the Python control plane's hardcoded passthrough set). 
Mirror that passthrough in the Rust control plane: the api-rs process's OTEL_EXPORTER_OTLP_{ENDPOINT,TRACES_ENDPOINT,HEADERS} and OTEL_RESOURCE_ATTRIBUTES are always forwarded into codex sandbox env, so the ingest key flows secret -> api-rs envFrom -> sandbox -> wrapper -> codex's [otel] exporter config without ever entering Helm values. Operator passthrough/extra env still override. api-rs's own exporter needs no code change: opentelemetry-otlp reads OTEL_EXPORTER_OTLP_HEADERS from the process env. Amp-Thread-ID: https://ampcode.com/threads/T-019eb733-e7f6-76b9-81ee-49254bc72181 Co-authored-by: Amp <amp@ampcode.com> --------- Co-authored-by: Amp <amp@ampcode.com> * fix(api-rs): materialize Codex attachments (#502) fix(api-rs): materialize codex attachments * build(sandbox): bake pnpm 10.9.0 into the agent image (#504) build(sandbox): bake pnpm into the agent image Agents working in pnpm-based repos can't install dependencies or run tests — 'pnpm: command not found' — so review/CI-fix workflows either skip verification or fail outright. Install pnpm in the global npm CLI layer alongside the other pinned CLIs, with the same --version build check. The baked version is only a bootstrap: when a repo's packageManager field declares a different version, pnpm downloads and runs the declared one (v11 pmOnFail=download default, manage-package-manager-versions in v10) over the same registry egress a dependency install already needs. Co-authored-by: Claude Fable 5 <noreply@anthropic.com> * fix(slackbotv2): survive Slack stream expiry and guarantee final-answer delivery (#505) Production evidence (prd-centaur-na, 2026-06-11): five user-reported 'Something went wrong' threads where every execution completed successfully and the full answer was durably stored, but Slack delivery broke: - Slack hard-expires a streaming message ~300s after chat.startStream (measured 303-304s). 
Renders that lag past that fail every append/stop with message_not_in_streaming_state; the bot logged render_failed and the answer was never posted (3 of 5 bugs). - Plan-card segments accumulate task chunk content that never counted toward segment rotation, so big turns hit msg_too_long mid-stream or on stop; the bot suppressed that error as 'rendered' and silently dropped the answer (2 of 5 bugs). 54 occurrences in the prior 24h vs 459 successful renders. Slack streaming is now best-effort; the durable session result is the delivery guarantee: Adapter patch (@chat-adapter/slack): - rotate every live segment before the ~300s streaming lifetime (SLACK_STREAM_SEGMENT_MAX_AGE_MS, default 240s) - budget structured task content per segment so plan cards can no longer push a message past Slack's size cap (SLACK_STREAM_SEGMENT_TASK_CHAR_BUDGET, default 16k) - annotate thrown stream errors with slackAnswerLost so the bot knows whether the final answer became visible before the failure - rethrow delivery-impacting Slack errors instead of degrading to text-only mid-stream slackbotv2: - on any render failure where the answer is not confirmed visible, replay the session event stream from the execution's start position (events are durable and replayable) and post the terminal result text exactly once; skip the repost when the adapter confirmed the answer was delivered (preserves the no-duplicate intent of #466) - check the adapter annotation before retryability so Slack network errors are not misclassified as retryable session errors - recovery now replays from the obligation's start position instead of lastEventId, which could skip the terminal event after a failed render - drop the msg_too_long suppression (#466's vacuous regression test is replaced by real fault-injection tests; the emulator now models how real Slack breaks never-stopped streams) Liveness (4 restarts/8h from 1s probe timeouts): - default the console logger to info so the adapter's raw-webhook-body debug logs 
(50KB+ JSON.stringify per event) stop stalling the event loop (SLACKBOTV2_LOG_LEVEL to override) - give the chart's health probes timeoutSeconds: 5 Amp-Thread-ID: https://ampcode.com/threads/T-019eb80a-fcb6-75b4-bd3d-dac822f0b700 Co-authored-by: Amp <amp@ampcode.com> * Add Slack feedback modal storage (#500) * Add Slack feedback modal storage * fix: make Slack feedback flow deliverable end to end - rename user_feedback migration 0009 -> 0010: version 9 already exists on this branch (absurd_await_event_task_guard), so sqlx failed with a _sqlx_migrations primary-key violation on fresh databases and a checksum VersionMismatch on existing ones - dispatch Slack interactive payloads (block_actions, message_action, view_submission) from the shared event handler: app manifests point interactivity at /api/webhooks/slack, where these payloads were previously swallowed by the event path before reaching the actions handler; dispatch happens before dedup, which assumes event envelopes - cap the feedback input max_length at 3000: Block Kit rejects values above 3000, so views.open failed outright with the previous 20000 - gate message_action and view_submission on the feedback callback_ids so future sh…

gakonst and others added 8 commits June 1, 2026 22:49

feat: add chat sdk slackbot v2

01d9298

feat: add slackbotv2 runtime entrypoint

9008b7d

refactor: inline harness event app server types

5ab0fc8

fix: create session before slackbotv2 append

90e8a5d

fix: improve slackbot v2 stream tasks

f303b2d

refactor: route chat sdk rendering through slackbotv2

c0f86a5

fix: improve slackbotv2 stream latency

dadf984

gakonst force-pushed the codex/chat-sdk-on-pr344 branch from dc664b0 to dadf984 Compare June 1, 2026 20:49

goksu mentioned this pull request Jun 1, 2026

fix: stream slackbotv2 responses #347

Merged