[codex] stack Chat SDK Slackbot on api-rs control plane#346

Merged
Zygimantass merged 10 commits into api-rs-control-plane from codex/chat-sdk-on-pr344 on Jun 1, 2026

Conversation

gakonst (Member) commented Jun 1, 2026

Summary

Validation

  • pnpm install --frozen-lockfile
  • pnpm --filter @centaur/harness-events test
  • pnpm --filter @centaur/harness-events typecheck
  • pnpm --filter @centaur/rendering test
  • pnpm --filter @centaur/rendering typecheck
  • pnpm --filter slackbotv2 test
  • pnpm --filter slackbotv2 check:types

Local deploy notes

  • built and rolled out centaur-slackbotv2-pr344-chat-sdk:18f27b5a to centaur-slackbotv2-pr327
  • API-rs remains on centaur-api-rs-pr344-chat-sdk:9939d6ab-linux; local API health is passing
  • local API-rs is running with RUN_MIGRATIONS=false against the reused PR327 DB because that DB has the older migration checksum
  • Funnel is on: https://gak.tail388b2e.ts.net -> 127.0.0.1:3002

gakonst and others added 8 commits June 1, 2026 22:49
Add @centaur/rendering as the boundary from Codex app-server events to generic renderer events and Chat SDK-shaped output.

Wire Slackbot Codex sessions through the renderer package while keeping Slack delivery in a Slackbot adapter.

Keep services/api-rs caller-neutral by removing Slack examples from the session control-plane docs and tests.
gakonst force-pushed the codex/chat-sdk-on-pr344 branch from dc664b0 to dadf984 on June 1, 2026 20:49
Comment threads on services/slackbotv2/src/index.ts: 6 (5 outdated)
Zygimantass marked this pull request as ready for review June 1, 2026 22:00
Zygimantass merged commit fc11a0d into api-rs-control-plane Jun 1, 2026
4 checks passed
Zygimantass deleted the codex/chat-sdk-on-pr344 branch June 1, 2026 22:00
Zygimantass added a commit that referenced this pull request Jun 10, 2026
* feat: add chat sdk slackbot v2

* feat: add slackbotv2 runtime entrypoint

* refactor: inline harness event app server types

* fix: create session before slackbotv2 append

* fix: improve slackbot v2 stream tasks

* feat: add canonical renderer pipeline

Add @centaur/rendering as the boundary from Codex app-server events to generic renderer events and Chat SDK-shaped output.

Wire Slackbot Codex sessions through the renderer package while keeping Slack delivery in a Slackbot adapter.

Keep services/api-rs caller-neutral by removing Slack examples from the session control-plane docs and tests.

* refactor: route chat sdk rendering through slackbotv2

* fix: improve slackbotv2 stream latency

* fix: stream slackbotv2 responses (#347)

* fix: address slackbotv2 review comments

---------

Co-authored-by: Zygimantas <5236121+Zygimantass@users.noreply.github.com>
Co-authored-by: Goksu Toprak <goksu@tempo.xyz>
Co-authored-by: Goksu Toprak <19259594+goksu@users.noreply.github.com>
Zygimantass added a commit that referenced this pull request Jun 11, 2026
* feat(api-rs): add sandbox core

* feat(api-rs): add local sandbox backend

* feat(api-rs): add agent sandbox backend

* refactor(api-rs): parse sandbox test config with clap

* feat(api-rs): add session store

* chore(api-rs): raise session db pool limit

* chore(api-rs): remove unused execution claim helper

* feat(api-rs): restrict session harness types

* refactor(api-rs): derive session harness serde

* refactor(api-rs): derive session enum strings

* feat(api-rs): add session HTTP API

* refactor(api-rs): parse server config with clap

* fix(api-rs): render execution status with display

* feat(api-rs): add session CLI

* feat(api-rs): wire codex sandbox e2e path

* feat(api-rs): accept stdin session events in cli

* feat(api-rs): add session cli tui

* feat(api-rs): add sandbox core (#315)

* refactor: simplify sandbox interface

* refactor(api-rs): use owned sandbox io streams (#334)

* refactor(api-rs): use owned sandbox io streams

* refactor(api-rs): move session streaming into runtime

* refactor(api-rs): remove mock session runtime branch

* refactor(api-rs): move sandbox workload modes into runtime

* refactor(api-rs): clean up session API client and e2e tests

* refactor(api-rs): use library sse event types

* refactor(api-rs): address owned io review comments

* refactor(api-rs): wake session streams with postgres notify (#345)

* [codex] stack Chat SDK Slackbot on api-rs control plane (#346)

* feat: add chat sdk slackbot v2

* feat: add slackbotv2 runtime entrypoint

* refactor: inline harness event app server types

* fix: create session before slackbotv2 append

* fix: improve slackbot v2 stream tasks

* feat: add canonical renderer pipeline

Add @centaur/rendering as the boundary from Codex app-server events to generic renderer events and Chat SDK-shaped output.

Wire Slackbot Codex sessions through the renderer package while keeping Slack delivery in a Slackbot adapter.

Keep services/api-rs caller-neutral by removing Slack examples from the session control-plane docs and tests.

* refactor: route chat sdk rendering through slackbotv2

* fix: improve slackbotv2 stream latency

* fix: stream slackbotv2 responses (#347)

* fix: address slackbotv2 review comments

---------

Co-authored-by: Zygimantas <5236121+Zygimantass@users.noreply.github.com>
Co-authored-by: Goksu Toprak <goksu@tempo.xyz>
Co-authored-by: Goksu Toprak <19259594+goksu@users.noreply.github.com>

* fix: make sandbox tool CLIs importable (#351)

* fix: preserve final Slack answers after plans (#352)

* fix: use upstream Chat SDK Slack streaming (#360)

Streaming is patched upstream, so we no longer need our shim.

* docs: add local centaur dev skill

* [codex] add focused api-rs iron proxy integration (#365)

* feat: add api-rs iron proxy integration

* fix: align api-rs iron proxy wiring

* docs: fix run-centaur-dev api-rs env

* refactor: address api-rs review feedback

* fix: tighten api-rs dev sandbox setup

---------

Co-authored-by: Zygimantas <5236121+Zygimantass@users.noreply.github.com>

* chore: trim sandbox image size (#371)

* feat: load iron proxy fragments from tool pyprojects (#372)

* feat: load iron proxy fragments from tool pyprojects

* docs: note pyproject fragment conversion todo

* fix: improve sandbox tool smoke behavior (#373)

* fix: improve sandbox tool smoke behavior

* chore: remove unused harness env marker

* [codex] pass sandbox tool paths and package tool CLIs (#354)

fix: rebase tool cli packaging on api-rs control plane

Co-authored-by: Zygimantas <5236121+Zygimantass@users.noreply.github.com>

* feat(api-rs): wire api-rs and slackbotv2 into helm chart (#374)

Add Dockerfiles for the Rust control plane (centaur-api-server) and the
Chat SDK Slackbot v2, and deploy both via the chart alongside the Python
stack. api-rs runs the agent-k8s + codex sandbox backend with per-sandbox
iron-proxy; slackbotv2 replaces the legacy slackbot and targets api-rs.
Enables the agent-sandbox subchart, reuses the in-cluster postgres, and
repoints ingress/networkpolicy from slackbot to slackbotv2.

* fix(rendering): preserve full task output (#379)

* fix(slackbotv2): harden api-rs handoff and stream recovery

* fix(session): make Slack retries idempotent for append and execute (#388)

fix(session): make slack handoff idempotent

* chore(sandbox): add thin test image (#389)

* fix: render terminal slack session completions

* feat(api-rs): integrate iron-control for per-principal proxy credential grants (#378)

* feat(api-rs): integrate iron-control for per-principal proxy credential grants

Add a centaur-iron-control crate (typed admin client + fragment->resource
translation + principal derivation) and wire it into the session and sandbox
paths. Sessions upsert an iron-control principal (Slack user for DMs, channel
otherwise) and assign infra + tool/harness roles; per-sandbox iron-proxies are
registered to that principal and sync their config over IRON_CONTROL_URL with
an iprx_ token instead of a rendered static config. Gated on iron-control being
configured, so the legacy rendered-proxy path stays intact as a fallback. Adds
the Helm wiring (api-rs env + egress NetworkPolicy) behind ironControl.enabled.

* chore(iron-proxy): bump base image to 0.42.0-rc.5

* fix(chart): allow api-rs ingress to iron-control network policy

api-rs registers principals/roles/grants with iron-control on startup,
but iron-control's NetworkPolicy ingress only permitted the legacy Python
api component. Under the default-deny policy the CNI dropped api-rs's
inbound connection, surfacing as a reqwest 'stream closed: EOF' transport
error during startup role registration. Add api-rs to the ingress
allowlist, gated on apiRs.enabled.

* feat(api-rs): make iron-control mandatory and add sandbox drain

Remove the legacy non-synced iron-proxy fallback: a per-sandbox proxy
always syncs its config from iron-control, so iron-proxy now requires
iron-control to be configured (hard error at startup otherwise) and a
session-registration failure fails session creation instead of silently
booting a sandbox with a non-functional proxy.

Persist the iron-control principal OID on the session row (new migration
0003) instead of an in-memory map, so a resumed session can recreate its
sandbox after an api-rs restart without re-deriving the principal.

Add SessionRuntime::drain and POST /api/sandboxes/drain to stop every
non-terminal sandbox, reporting which stopped and which failed.
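
A minimal sketch of the drain behavior described above: skip terminal sandboxes, attempt to stop the rest, and report successes and failures separately rather than aborting on the first error. The `SandboxState` variants, the `DrainReport` shape, and the `stop` callback are all hypothetical; the actual `SessionRuntime::drain` signature is not shown in this log.

```rust
use std::collections::BTreeMap;

/// Hypothetical sandbox states; only non-terminal sandboxes are drained.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SandboxState {
    Running,
    Paused,
    Stopped, // terminal
    Failed,  // terminal
}

impl SandboxState {
    fn is_terminal(self) -> bool {
        matches!(self, SandboxState::Stopped | SandboxState::Failed)
    }
}

/// Hypothetical report: which sandboxes stopped and which failed to stop.
#[derive(Debug, Default, PartialEq, Eq)]
struct DrainReport {
    stopped: Vec<String>,
    failed: Vec<(String, String)>, // (sandbox id, error message)
}

/// Stop every non-terminal sandbox and collect per-sandbox outcomes.
fn drain<F>(sandboxes: &BTreeMap<String, SandboxState>, mut stop: F) -> DrainReport
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut report = DrainReport::default();
    for (id, state) in sandboxes {
        if state.is_terminal() {
            continue; // already stopped or failed, nothing to drain
        }
        match stop(id) {
            Ok(()) => report.stopped.push(id.clone()),
            Err(err) => report.failed.push((id.clone(), err)),
        }
    }
    report
}
```

Collecting both lists lets an operator retry only the failed sandboxes instead of re-running the whole drain.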

* feat(iron-proxy): run managed proxies with no local config

A managed (control-plane-synced) iron-proxy rejects a local config that
sets management.listen, since the control plane owns it. Stop providing
any rendered proxy.yaml to per-sandbox proxies: the entrypoint runs
iron-proxy with no -config when IRON_CONTROL_URL is set, and the local
settings the control plane does not own (tunnel/dns listen, TLS/CA paths,
log level) are passed as IRON_* env vars instead.

Removes the per-sandbox ConfigMap and the now-dead fragment-derived
config/env (postgres listeners are already unsupported on the iron-control
path).

* fix(chart): allow per-sandbox iron-proxy ingress to iron-control

In managed mode each per-sandbox iron-proxy syncs its effective config
directly from IRON_CONTROL_URL (the iron-control Service) over /proxy/sync.
Proxy pods are labeled centaur.ai/iron-proxy=true, not component=api-rs, so
iron-control's NetworkPolicy was dropping their sync connection. Add the
iron-proxy pod selector to the ingress allowlist.

* fix(iron-proxy): use IRON_CONTROL_PLANE_URL for the managed proxy

The iron-proxy binary reads its control-plane base URL from
IRON_CONTROL_PLANE_URL, not IRON_CONTROL_URL (the latter is api-rs's own
admin-client var). Setting the wrong name made the proxy fall back to its
built-in default endpoint instead of the in-cluster iron-control. Inject
IRON_CONTROL_PLANE_URL on the proxy pod and key the entrypoint's
managed-mode detection off it too.
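
As an illustration of the fix, managed-mode detection keyed on the proxy's own variable might look like the sketch below. The function name is hypothetical, and treating a blank value as unset is an assumption rather than something the commit states.

```rust
use std::collections::HashMap;

/// Env var the iron-proxy binary reads for its control-plane base URL
/// (per the commit message above).
const PROXY_CONTROL_PLANE_VAR: &str = "IRON_CONTROL_PLANE_URL";

/// Returns the control-plane URL when the proxy should run in managed mode.
/// Assumption: an unset or blank variable means "not managed".
fn managed_control_plane_url(env: &HashMap<String, String>) -> Option<&str> {
    env.get(PROXY_CONTROL_PLANE_VAR)
        .map(|value| value.trim())
        .filter(|value| !value.is_empty())
}
```

With this shape, setting only `IRON_CONTROL_URL` (the api-rs admin-client variable) does not switch the proxy into managed mode, which makes the original misconfiguration visible instead of silently falling back to a default endpoint.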

* feat(api-rs): single source of truth for harness auth mode

The codex/claude auth mode was set independently on two sides — the
proxy/iron-control fragment (api-rs CODEX_AUTH_MODE) and the sandbox agent
(via sandbox.extraEnv, which only fed the legacy api). They could drift,
e.g. agent on chatgpt.com (access_token) while the proxy registered the
api.openai.com api_key secret, so no credential was injected and requests
401'd.

Hoist the auth mode to first-class chart values sandbox.codexAuthMode /
sandbox.claudeCodeAuthMode. api-rs reads them to register the matching
iron-proxy credential AND now propagates them into each sandbox spec, so
the agent's auth.json and the injected credential always agree. The legacy
api sources the same values into its sandbox env. justfile sets the real
vars from .env (was sandbox.extraEnv).
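
A rough sketch of the "single source of truth" idea: both the proxy credential and the sandbox env are derived from one parsed mode value, so they cannot drift. The type and function names are hypothetical, and reusing `CODEX_AUTH_MODE` as the env var propagated into the sandbox is an assumption. Only the mode-to-host pairing (api_key with api.openai.com, access_token with chatgpt.com) comes from the commit message.

```rust
/// The two codex auth modes named in the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CodexAuthMode {
    ApiKey,
    AccessToken,
}

impl CodexAuthMode {
    /// Parse the chart value; unknown values are rejected rather than defaulted.
    fn parse(value: &str) -> Option<Self> {
        match value {
            "api_key" => Some(Self::ApiKey),
            "access_token" => Some(Self::AccessToken),
            _ => None,
        }
    }

    fn as_str(self) -> &'static str {
        match self {
            Self::ApiKey => "api_key",
            Self::AccessToken => "access_token",
        }
    }

    /// Upstream host whose credential the proxy must register for this mode.
    fn credential_host(self) -> &'static str {
        match self {
            Self::ApiKey => "api.openai.com",
            Self::AccessToken => "chatgpt.com",
        }
    }
}

/// Hypothetical pair derived from one mode value.
#[derive(Debug, PartialEq, Eq)]
struct HarnessAuth {
    proxy_credential_host: &'static str,
    sandbox_env: (String, String),
}

fn derive_harness_auth(mode: CodexAuthMode) -> HarnessAuth {
    HarnessAuth {
        proxy_credential_host: mode.credential_host(),
        // Assumed env var name for the value propagated into the sandbox spec.
        sandbox_env: ("CODEX_AUTH_MODE".to_string(), mode.as_str().to_string()),
    }
}
```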

* chore(justfile): route .env auth mode to sandbox.codexAuthMode

Set the hoisted first-class chart values (sandbox.codexAuthMode /
sandbox.claudeCodeAuthMode) from the .env vars instead of
sandbox.extraEnv, so the dev deploy drives the single source of truth.

* fix(api-rs): inject harness credential placeholder into the sandbox

In api_key mode codex only runs `codex login --with-api-key` if an API
key is present in the agent env; otherwise it keeps the dummy ChatGPT
auth.json and talks to chatgpt.com. The managed-mode refactor stopped
injecting the harness fragment's placeholder, so codex had no key. Inject
placeholder_env from the resolved harness fragment (e.g. api_key mode →
OPENAI_API_KEY=OPENAI_API_KEY) so codex authenticates against api.openai.com
and iron-proxy swaps in the real credential. Empty for access_token, which
uses inject rather than replace.

* fix(api-rs): inject full fragment placeholder set into the sandbox

Extend sandbox placeholder injection from the harness fragment to the full
proxy fragment set (infra + harness + tools), matching the iron-control
roles. Env-based consumers now send the proxy_value iron-proxy replaces:
codex's OPENAI_API_KEY, git's GITHUB_TOKEN, and the rest of the infra/tool
credentials. Managed mode had dropped these, breaking git auth and any
env-based secret use in the sandbox.

* feat(api-rs): register a single shared tools role, not one per tool

Bootstrap registered the infra role plus a separate role per harness and
per tool fragment. Collapse all harness + tool fragments into one shared
"tools" role (RoleSpec::tools) holding every secret they declare, alongside
the unchanged "infra" role. SessionRegistrar already grants every
registered role to each session principal, so a channel's default principal
gets both infra and tools.

* fix(api-rs): rebind iron-proxy to its principal on resume via annotation

Reconcile the base's pause-deletes / resume-recreates iron-proxy lifecycle
with the iron-control model, where resolve_iron_proxy needs the session's
principal but resume() has only the sandbox id. Stamp the principal OID as
a sandbox annotation (centaur.ai/iron-control-principal) at create and read
it back on resume. The annotation lives on the Sandbox CRD, which survives
pause and api-rs restarts, so resume rebinds to the same identity without
any in-memory or DB lookup.
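
The stamp-and-read-back round trip can be sketched with a plain map standing in for the Sandbox CRD's annotations. The helper names are hypothetical and the Kubernetes client calls are omitted; only the annotation key is taken from the commit message.

```rust
use std::collections::BTreeMap;

/// Annotation key named in the commit message.
const PRINCIPAL_ANNOTATION: &str = "centaur.ai/iron-control-principal";

/// Stamp the principal OID onto the sandbox's annotations at create time.
fn stamp_principal(annotations: &mut BTreeMap<String, String>, principal_oid: &str) {
    annotations.insert(PRINCIPAL_ANNOTATION.to_string(), principal_oid.to_string());
}

/// Read the principal OID back on resume. `None` means the sandbox carries no
/// stamped principal, which the caller would have to treat as an error.
fn principal_for_resume(annotations: &BTreeMap<String, String>) -> Option<&str> {
    annotations.get(PRINCIPAL_ANNOTATION).map(String::as_str)
}
```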

* chore(iron-proxy): bump base image to 0.42.0-rc.6

* feat(api-rs): integrate pg_dsn secrets end to end

Register each fragment postgres listener as an iron-control pg_dsn secret
(PUT /api/v1/pg_dsn_secrets, granted via pg_dsn_secret_id) instead of
rejecting it: the upstream DSN resolves like any secret source and an
optional role becomes the proxy's SET ROLE. The shared tools/infra role
holds them alongside the other secrets.

On the per-sandbox proxy pod, deliver the local listen/client knobs the
managed proxy reads by foreign_id — IRON_PROXY_PG_<FID>_{LISTEN,CLIENT_USER,
CLIENT_PASSWORD} (password value direct, no indirection) — and inject the
proxied DSN into the sandbox env, exposing each listener port on the proxy
Service + allowing it sandbox->proxy. foreign_id is derived from a shared
pg_foreign_id() so registration and the proxy env agree.

* refactor(api-rs): bootstrap infra only; stop registering tool secrets

api-rs no longer discovers tool YAML fragments and registers a secret-per-
tool at boot. It registers exactly one role, infra, holding the shared infra
secrets plus the harness auth (codex/claude) — harness auth is infra,
selected by auth mode and baked into the binary (harness_auth_fragment)
rather than discovered from disk. The tools role and per-tool roles are gone.

Tool secrets + their grants become operator-managed in the control plane;
their sandbox env will be derived from the principal's effective grants
(follow-up: GET /api/v1/principals/:id/effective_config).

Removes the tool/harness fragment discovery args (TOOL_DIRS,
KUBERNETES_IRON_PROXY_FRAGMENT_*), the harness YAML files, and the api-rs
image's tool/harness COPY. Infra placeholders + harness auth still come from
the known infra set, so the agent always boots. Dead fragment code
(render.rs, disk discovery) and the effective_config read are follow-ups.

* feat(api-rs): derive tool-secret sandbox env from effective_config

Operator-managed tool secrets now reach the sandbox via the principal's
effective config instead of local fragments. Add the iron-control read
GET /api/v1/principals/:id/effective_config (EffectiveConfig: replace
proxy_values + postgres {foreign_id, database}).

At sandbox create/resume, api-rs fetches it for the bound principal and
derives: replace-secret placeholders (sandbox env, set_missing so known
infra placeholders win) and Postgres listeners. The control plane owns each
pg upstream dsn/role/database; api-rs assigns the local coordination —
listen port (sequential from 6432, sorted by foreign_id), client user, a
generated password, and the sandbox env var name (<NORM_FID>_DSN) — and
wires IRON_PROXY_PG_<FID>_{LISTEN,CLIENT_USER,CLIENT_PASSWORD} + the DSN +
port exposure. Replaces the fragment-based pg derivation.
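
The local coordination rules above can be sketched as follows: ports assigned sequentially from 6432 in foreign_id order, a `<NORM_FID>_DSN` env var name, and a set-if-missing insert so known infra placeholders win. The function names and the normalization rule (uppercase, non-alphanumerics replaced with `_`) are assumptions; the commit does not spell out how `NORM_FID` is computed.

```rust
use std::collections::BTreeMap;

/// First local listen port, per the commit message.
const FIRST_PG_LISTEN_PORT: u16 = 6432;

/// Assign listen ports sequentially from 6432, ordered by foreign_id, so the
/// assignment is deterministic for a given set of listeners.
fn assign_listen_ports(foreign_ids: &[&str]) -> Vec<(String, u16)> {
    let mut sorted: Vec<&str> = foreign_ids.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    sorted
        .into_iter()
        .enumerate()
        .map(|(index, fid)| (fid.to_string(), FIRST_PG_LISTEN_PORT + index as u16))
        .collect()
}

/// Assumed normalization: uppercase ASCII, anything else becomes '_'.
fn normalize_fid(fid: &str) -> String {
    fid.chars()
        .map(|c| if c.is_ascii_alphanumeric() { c.to_ascii_uppercase() } else { '_' })
        .collect()
}

/// Sandbox env var name for a listener's proxied DSN: `<NORM_FID>_DSN`.
fn dsn_env_name(fid: &str) -> String {
    format!("{}_DSN", normalize_fid(fid))
}

/// Insert only when the key is absent, so earlier (infra) entries win.
fn set_missing(env: &mut BTreeMap<String, String>, key: &str, value: &str) {
    env.entry(key.to_string()).or_insert_with(|| value.to_string());
}
```

Sorting before numbering means create and resume derive the same port for the same listener, as long as the set of foreign_ids has not changed.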

* fix(api-rs): match effective_config to landed API; defer postgres

Validated against iron-control's docs/API.md + PgDsnSecret#to_proxy_dsn:
the effective_config response is {data:...}-enveloped (decode_data already
unwraps it) and secrets carry replace.proxy_value — both as implemented.

But postgres entries are {id, foreign_id, dsn, role} with no database (the
dsn is an unresolved source, so iron-control can't surface a dbname). Drop
the database field from EffectivePgDsn and defer pg-listener derivation: the
sandbox connect-db source is unsettled. Replace-secret placeholders still
flow from effective_config; the pg wiring stays dormant (empty).

* refactor(iron-proxy): remove dead fragment-discovery code

Harness fragments are now baked into harness_auth_fragment and tool
fragments are operator-managed, so the disk-discovery path is dead. Remove
discover_fragment_files, discover_harness_fragment_files,
harness_fragment_from_dirs, harness_broker_fragments(_from_dirs),
default_harness_fragment_dirs, HarnessFragmentFile, parse_harness_fragment_file,
the visit_* walkers, strip_auth_suffix, and their consts + exports + test.

Kept: load_fragment_file/load_fragment_str (infra + baked harness),
infra_fragment, placeholder_env, and the pyproject parser (still reachable
via load_fragment_file). render.rs + pg_dsn registration untouched.

* refactor(iron-proxy): remove dead proxy-rendering machinery

Managed mode never renders a proxy.yaml, so the rendering path is dead.
Remove render.rs, ports.rs (listen_ports_from_yaml / pg_dsn_envs), the
ProxyConfig/ProxySection structs, PgDsnEnv + PostgresListener::pg_dsn_env,
the model's source-resolution methods (Transform/Secret/Postgres*
resolve_sources, is_managed, explicit_id, fill_missing_source,
MANAGED_TRANSFORMS), load_default_proxy_base_config + the base-config path,
and the unused values.rs helpers (listen_port, non_empty). Drop the
render-dependent tests; the token-broker render + source resolution
(broker.rs) is untouched.

Kept: load_fragment_file/str, infra_fragment, harness_auth_fragment,
placeholder_env, the data model (used by the iron-control registry), and
the pyproject parser (still reachable via load_fragment_file).

* refactor(iron-proxy): remove dead pyproject tool-secret parser

The pyproject parser was only reachable via load_fragment_file's
pyproject.toml branch, which is never taken now (only infra.yaml is
loaded; tool secrets are operator-managed). Remove load_pyproject_fragment_file
and its toml helpers, the pyproject branch, the ParsePyproject + now-unused
ParseBase error variants, and the toml + sha2 dependencies.

load_fragment_file now loads YAML only; everything else (infra, harness,
placeholder_env, token-broker) is untouched.

* refactor(iron-proxy): embed infra.yaml via include_str!, drop runtime file loader

infra.yaml is the only remaining fragment file and was read at runtime by
walking up the filesystem (repo_relative_path) + a Dockerfile COPY. Embed it
at compile time with include_str! and parse via load_fragment_str (same as
the baked harness fragments). Removes load_fragment_file, repo_relative_path,
read_file, the ReadFile/ReadDir error variants, and the image's infra.yaml
COPY — the binary now carries no runtime config-file dependency.

infra.yaml stays in the repo as the embed source; load_fragment_str + the
Value-based model are kept (concise YAML beats hand-built struct literals).

* fix(iron-proxy): move embedded infra.yaml into the crate

The api-rs Docker build context is services/api-rs/ only, so include_str!
of services/iron-proxy/infra.yaml (a sibling) failed to resolve. Move the
embed source to crates/centaur-iron-proxy/src/infra.yaml and include it by
local path so it's always in the build context.

* feat(centaur-grants): add CLI to grant principals tool roles/secrets

Add a centaur-grants CLI that lets operators add/revoke/list grants for
Slack principals (users/channels) against iron-control. The primary unit
is a tool's tool-{slug} role: the CLI resolves a tool by name across
overlay-ordered tool dirs, parses its pyproject [tool.centaur] secrets,
registers them as iron-control resources, grants them to the role, and
assigns the role to the principal. A lower-level direct secret-grant path
is also supported.

Reuses centaur-iron-control's canonical mappings (derive_principal,
RoleSpec::tool, source_from_placeholder) so the principal/role foreign_ids
match exactly what api-rs writes. Extends that crate with the read/delete
endpoints the CLI needs (get_principal, list_principal_roles,
unassign_role, delete_grant) and factors grant_inputs_to_role out of
register_role for reuse.

* feat(centaur-grants): add `principals` subcommand for discovery

List principals registered in iron-control so operators can find the
exact foreign_id to grant against (e.g. whether a deployment folds the
Slack team id into the key) without dropping to curl. Supports --label
key=value filters, --managed (managed-by=centaur), and a --filter
substring match. Backed by a new paginating list_principals client
method on IronControlClient.

* feat(centaur-grants): use grantee-scoped grant listing

iron-control now exposes GET /principals/:id/grants and
GET /roles/:id/grants. Add list_principal_grants/list_role_grants client
methods (sharing a paginate helper) and enrich the Grant model with its
grantee/secret references plus a secret_id() accessor.

Use them to: revoke --secret <OID> by finding the principal's matching
grant (no longer requires the grant OID), and show actual grants per role
and direct grants in `list`.

* refactor(centaur-perms): rename from centaur-grants, resource-first CLI

Rename the crate/binary to centaur-perms and restructure the CLI as
<noun> <verb>:

  principals list | show <p> | grant <p> | revoke <p>
  roles      list | show <r> | grant <r> | revoke <r>

principals grant/revoke take any mix of --tool, --role, and --secret
targets; roles grant/revoke assign/remove secrets on a role (the
previously missing role<-secret path). Adds get_role and list_roles
client methods.

* feat(centaur-perms): grant tool-config secrets to a role

Add --tool (and --secret-name) to `roles grant`, so a tool's pyproject
[tool.centaur] secrets can be registered and granted onto an explicit
role:

  centaur-perms roles grant <role> --tool <name> [--secret-name NAME]...

The secret resources keep their canonical tool-<slug>-<secret> foreign
ids (keyed on the tool's own role), so the same secret object is shared
regardless of which role it's granted to. The role is always required.

* feat(centaur-perms): principals show accepts foreign_id or OID

Look the principal up by whatever is passed — Slack thread key (derived),
foreign_id, or prn_ OID, all accepted by GET /principals/:id — and display
the principal's actual foreign_id from the response instead of echoing the
input. Adds principal::resolve_lookup for the read-only path.

* refactor(centaur-perms): drop redundant resolve_lookup helper

resolve_lookup(p, u) returned the same string as
resolve_principal(p, u, ns).foreign_id, so principals show now uses
resolve_principal like the other read handlers. Keeps the display fix
(show the response's actual foreign_id, not the echoed input).

* fix(centaur-perms): resolve foreign_ids via namespaced lookup endpoint

iron-control's bare GET /{principals,roles}/:id only matches OIDs, so a
foreign_id 404s. get_principal/get_role now take a namespace and route
foreign_ids through GET /{collection}/lookup/:namespace/:foreign_id,
keeping the bare /:id route for prn_/role_ OIDs. Namespace is threaded
through the CLI read paths.
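
The routing rule can be sketched as a small path builder. The function name `lookup_path` is hypothetical (it is not the project's helper), and percent-encoding of path segments is omitted for brevity.

```rust
/// OID prefixes named in the commit message for principals and roles.
const OID_PREFIXES: [&str; 2] = ["prn_", "role_"];

/// Build the GET path for a principal or role.
/// OIDs use the bare `/:id` route; anything else is treated as a foreign_id
/// and routed through the namespaced lookup endpoint.
fn lookup_path(collection: &str, namespace: &str, id: &str) -> String {
    if OID_PREFIXES.iter().any(|prefix| id.starts_with(*prefix)) {
        format!("/{}/{}", collection, id)
    } else {
        format!("/{}/lookup/{}/{}", collection, namespace, id)
    }
}
```

For example, `lookup_path("roles", "centaur", "infra")` yields `/roles/lookup/centaur/infra`, while `lookup_path("principals", "centaur", "prn_abc")` yields `/principals/prn_abc`. The namespace value `centaur` here is only a placeholder.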

* fix(iron-control): route all foreign_id lookups via the lookup endpoints

Per the API docs, the bare GET /{collection}/:id routes are OID-only;
foreign_ids must use GET /{collection}/lookup/:namespace/:foreign_id (and
GET /principals/lookup/:namespace/:foreign_id/effective_config).

- Generalize the path helper to resource_path(.., suffix) and use it for
  get_principal, get_role, and effective_config (now namespace-aware,
  routing foreign_ids through the lookup variant).
- Thread the iron-control namespace into effective_config callers: add a
  namespace field to IronControlSettings (set from IRON_CONTROL_NAMESPACE)
  and pass it from the agent-k8s backend; the CLI passes --namespace.
- Correct the sub-resource list methods' docs (roles/grants under a
  principal or role are OID-only; callers pass the resolved OID).

* feat(api-rs): wire pg_dsn secrets, default gcp_auth scope, name grants in show

- pg_dsn: add database to PgDsnSecretInput/EffectivePgDsn; build per-principal
  Postgres listeners from effective_config in the sandbox agent (was deferred);
  make pg_dsn a first-class secret in the centaur-perms CLI (parse + translate),
  with foreign_id round-tripping to the sandbox DSN env var.
- gcp_auth: default scope-less secrets to cloud-platform, matching the Python
  proxy_config (iron-control requires a non-empty scopes list).
- perms show: principals show and roles show now print each grant's type and
  name (resolved by OID), not just the bare secret OID.
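
The gcp_auth defaulting rule might be sketched as below. The function name is hypothetical, and using the full scope URL (Google's standard form for the cloud-platform scope) is an assumption; the commit only says "cloud-platform" and does not state which form the project stores.

```rust
/// Google's cloud-platform OAuth scope in its full URL form (assumed form).
const DEFAULT_GCP_SCOPE: &str = "https://www.googleapis.com/auth/cloud-platform";

/// Return the declared scopes, or the default when none were declared,
/// so the list sent to iron-control is never empty.
fn effective_scopes(declared: Vec<String>) -> Vec<String> {
    if declared.is_empty() {
        vec![DEFAULT_GCP_SCOPE.to_string()]
    } else {
        declared
    }
}
```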

* feat(api-rs): wire hmac_sign secrets through iron-control and centaur-perms

Make hmac_sign a first-class secret type end-to-end, mirroring pg_dsn:
- iron-control: add HmacSecretInput/HmacSecretHeader, upsert_hmac_secret,
  GrantSecret::Hmac + Grant.hmac_secret_id, and the SecretInput::Hmac grant arm.
- centaur-perms: parse hmac_sign from pyproject.toml (required secret
  credential, validated algorithm/encoding/timestamp enums) and translate it
  to an HmacSecretInput with a {role}-hmac-{slug} foreign_id; accept hms_ OIDs.

Tool hmac secrets are operator-managed via the CLI; the infra/harness fragment
translator still rejects hmac_sign since neither signs requests.

* chore(iron-proxy): bump image to 0.42.0-rc.7

* feat(centaur-perms): add secrets list and show commands

List sweeps all five secret types (static, oauth_token, gcp_auth, pg_dsn,
hmac) in a namespace into one table, reusing the --label/--filter/--managed
filters. Show fetches one secret's full configuration by OID or foreign_id,
routing OIDs by prefix and resolving foreign_ids via per-type lookup.

iron-control gains list_secrets / get_secret_detail and a SECRET_TYPES
catalog mirroring Grant::secret_target.

* refactor(api-rs): consolidate duplicated perms/iron-control helpers

Dedupe helpers that had been copied across crates and tidy a few idioms
surfaced in review:

- Promote iron-control's `unique_foreign_id` to public and drop the
  byte-identical copy in centaur-perms.
- Share `managed_labels()` from iron-control's util module; remove the
  three inline copies (registry, principal, perms translate/role_identity).
- Add `GrantSecret::from_oid`, routing OIDs via the SECRET_TYPES prefix
  table. This also fixes `--secret <pgs_…>` being silently rejected by the
  old hand-rolled prefix match.
- Extract `parse_field_source` shared by the oauth/hmac field parsers, and
  `grant_secrets`/`revoke_secrets` helpers for the repeated grant loops.
- Extract a `token_broker_source` branch helper and a URL `authority`
  parser; reuse `string_value`/`non_empty` in iron-proxy/api-server.
- Fix two stale doc comments in args.rs (only the infra role is registered;
  tool secrets are operator-managed) and collapse two nested ifs.
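
A prefix-table lookup in the spirit of `GrantSecret::from_oid` could look like the sketch below. It is not the project's implementation: the enum and function names are hypothetical, and the table lists only the prefixes that appear somewhere in this log (`pgs_`, `hms_`, `aas_`). Prefixes for the other secret types are not stated, so they are left out rather than guessed.

```rust
/// Secret kinds whose OID prefixes are mentioned in this log.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SecretKind {
    PgDsn,
    Hmac,
    AwsAuth,
}

/// One table drives routing, so adding a type means adding one row.
const SECRET_PREFIXES: [(&str, SecretKind); 3] = [
    ("pgs_", SecretKind::PgDsn),
    ("hms_", SecretKind::Hmac),
    ("aas_", SecretKind::AwsAuth),
];

/// Route an OID to its secret kind by prefix; unknown prefixes return `None`
/// instead of being silently misclassified.
fn kind_from_oid(oid: &str) -> Option<SecretKind> {
    SECRET_PREFIXES
        .iter()
        .find(|(prefix, _)| oid.starts_with(*prefix))
        .map(|(_, kind)| *kind)
}
```

A single table avoids the failure mode described above, where a hand-rolled match forgot one prefix and rejected a valid OID.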

* docs(api-rs): document centaur-perms operator CLI

* fix(api-rs): renumber duplicate 0003 migration to 0004 (#405)

The iron-control principal migration reused version 0003, which was
already taken by the session handoff idempotency migration (#388).
Two files sharing a version prefix collide in sqlx, causing
VersionMismatch(3) at startup. Renumber the iron-control migration
to 0004 so it applies as a fresh, idempotent migration.
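
A collision like this can be caught before startup with a check over the migration file names. The sketch below is a hypothetical pre-flight helper, not part of sqlx, and the file names in the usage note are made up for illustration.

```rust
use std::collections::BTreeMap;

/// Parse the numeric version prefix of a migration file name such as
/// "0003_example.sql". Names without a numeric prefix are ignored.
fn migration_version(file_name: &str) -> Option<i64> {
    let (prefix, _) = file_name.split_once('_')?;
    prefix.parse().ok()
}

/// Return every version claimed by more than one file, with the files
/// that claim it.
fn duplicate_versions(files: &[&str]) -> BTreeMap<i64, Vec<String>> {
    let mut by_version: BTreeMap<i64, Vec<String>> = BTreeMap::new();
    for file in files {
        if let Some(version) = migration_version(file) {
            by_version.entry(version).or_default().push(file.to_string());
        }
    }
    by_version.retain(|_, names| names.len() > 1);
    by_version
}
```

Given `["0001_init.sql", "0003_handoff.sql", "0003_principal.sql"]`, the result has a single entry for version 3 listing both files, which is the condition the renumbering to 0004 removes.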

* feat(api-rs): make warm pools deployable (#400)

* feat(api-rs): make warm pools deployable

* fix: handle missing Slack root thread history

* fix: allow repo-cache ref pinning

* fix: expose sandbox tools as cli shims

* fix: grant infra role to warm pool bootstrap

* fix: clear warm pool rows on drain

* fix: keep warm pool bootstrap roleless

* fix(api-rs): reconcile warm pools with idempotent executions

* refactor(api-rs): move warm pool policy into sandbox manager

---------

Co-authored-by: Centaur AI <ai@centaur.local>

* fix: remove slackbotv2 synthetic starting task (#406)

* fix(api-rs): preserve session migration order (#407)

* fix(api-rs): select idempotency_key in active/latest execution queries (#409)

active_execution_for_thread and latest_execution_for_thread map results to
SessionExecutionRow, which has an idempotency_key field (FromRow matches by
name), but their SELECT lists omitted the column. sqlx then fails with
"no column found for name: idempotency_key".

The session stdout pump calls active_execution_for_thread for every session,
so the pump died on the first read and never emitted output events — the Slack
bot opened the event stream and sat at "thinking" forever without posting.
The write/RETURNING queries already selected the column; only these two reads
were missed when idempotency_key was introduced.
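
Because the mapping is by column name, a missing SELECT column only fails at runtime. A simple guard, sketched below, compares a query's column list against the fields the row struct expects. This is a hypothetical test helper: only `idempotency_key` is a real field name from the commit, the other names are placeholders, and the comma split is deliberately naive (no aliases or expressions).

```rust
/// Fields the row struct expects. Only `idempotency_key` is named in the
/// commit message; `id` and `status` are placeholders.
const ROW_FIELDS: [&str; 3] = ["id", "status", "idempotency_key"];

/// Return the expected fields that are absent from a SELECT column list.
fn missing_columns<'a>(select_list: &str, fields: &[&'a str]) -> Vec<&'a str> {
    let selected: Vec<&str> = select_list.split(',').map(|col| col.trim()).collect();
    fields
        .iter()
        .copied()
        .filter(|field| !selected.contains(field))
        .collect()
}
```

Running this over the two read queries would have reported `idempotency_key` as missing, where the write queries with RETURNING would have passed.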

* fix(api-rs): return idempotency key on terminal updates (#410)

* fix(slackbotv2): render api-rs terminal result text (#413)

* fix(api-rs): include terminal result text in completions (#412)

* fix(slackbotv2): defer Slack stream until visible output (#415)

fix(slackbotv2): defer slack stream until visible output

* fix(slackbotv2): bound Slack task stream payloads (#416)

* fix(slackbotv2): omit task output from Slack streams (#418)

Co-authored-by: Centaur AI <ai@centaur.local>

* fix: preserve final answer for textless turn completion (#421)

Co-authored-by: Centaur AI <ai@centaur.local>

* fix(slackbotv2): scope session streams to execution (#422)

* fix(slackbotv2): recover from oversized Slack renders

* feat(api-rs): manage broker credentials in iron-control, drop sidecar (#404)

feat(api-rs): broker credentials in iron-control for codex access-token auth

Manage iron-control broker credentials — managed OAuth refresh tokens that
iron-control mints and delivers inline to proxies via a `token_broker` source —
and drop the iron-proxy broker sidecar. Adds the centaur-perms `broker`
subcommands, `brokered_token` tool-secret parsing/translation, and the
iron-control client/models for broker credentials.

Supporting infra so it runs end-to-end:
- chart: iron-control Solid Queue worker (bin/jobs) that runs the broker OAuth
  refresh loop, plus its egress NetworkPolicy and the postgres ingress allowlist
  entry; image pullPolicy defaulted to Always for the mutable :latest tag.
- chart: create-db init waits for postgres and provisions all four logical
  databases (primary + Solid Cache/Queue/Cable) idempotently.
- slackbotv2: tolerate the postgres startup race with an owned pool + connect
  retry so a transient cold-start failure doesn't wedge the bot.

* fix: fail sessions on oversized sandbox output

* fix(slackbotv2): honor plain text render requests

* fix(slackbotv2): show command details without output

* feat: update iron-proxy to 0.42.0-rc.8 with single multiplexed pg listener

* feat(api-rs): persist session personas (#429)

This allows clients to define a persona as part of the request.

* feat(api-rs): add telemetry observability (#446)

* fix(slackbotv2): pin Slack stream continuation fix (#453)

fix(slackbotv2): pin slack stream continuation fix

* feat(api-rs): CloudWatch tool aws_auth via iron-control (#451)

* feat(api-rs): add aws_auth credential type linked to iron-control

Reimplements the CloudWatch tool's AWS SigV4 re-signing support in the
Rust api-rs / iron-control control plane instead of the Python api
service (superseding #449).

The tool signs requests with placeholder credentials; iron-proxy's
aws_auth transform re-signs with the real read-only IAM keys resolved
from iron-control, so credentials never enter the sandbox.

- centaur-iron-control: AwsAuthSecretInput model, aws_auth_secrets
  endpoint + upsert client, GrantSecret/Grant/SECRET_TYPES wiring
  (aas_ prefix), fragment translator marks aws_auth unsupported (it is
  a tool secret registered via the centaur-perms CLI, like hmac_sign)
- centaur-perms: parse type = "aws_auth" from a tool's pyproject and
  translate it to an AwsAuthSecretInput granted to the tool's role
- keep the cloudwatch tool and the iron-proxy SigV4 header allowlist;
  drop the Python services/api changes (api-rs replaces them). AWS_REGION
  is non-secret and reaches the sandbox via passthrough_env.

* feat(api-rs): translate aws_auth in the iron-proxy fragment path

Infra/harness fragments can now declare an aws_auth transform and have
it registered as an iron-control aws_auth secret, instead of erroring as
unsupported. Mirrors gcp_auth: access_key_id/secret_access_key (and
optional session_token) are placeholder refs resolved via the source
policy, allowed_regions/allowed_services scope signing, rules use the
shared request-rule shape, and the foreign_id keys on the access-key
placeholder (`{role}-aws-{slug}`).

* feat(api-rs): support aws_auth in tool discovery

tool_discovery rejected the aws_auth secret type and dropped the whole
tool, so the cloudwatch tool got no proxy fragment and no sandbox AWS
credentials. Parse aws_auth into an aws_auth transform (matching the
iron-control translator), seed the sandbox AWS SDK placeholder creds via
placeholder_env, and add GITHUB_TOKEN to the infra-env bootstrap for the
repo-cache.

* fix(chart): repo-cache temp dir broken by k8s $$ collapse

The sync used a PID-suffixed temp dir, but Kubernetes collapses the
doubled dollar sign in a container command to a single one during its own
$(VAR) expansion, so the suffix became a constant literal and the mv
failed ("subdirectory of itself"). Use a deterministic temp name and
sweep any stale temp dirs before cloning; sync is sequential per pod so a
fixed name is safe. Self-heals existing corrupt caches on next run.
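
A rough illustration of the collapse, assuming Kubernetes' documented
`$(VAR)` expansion rules, where `$$` escapes to a single `$`.
`expandContainerCommand` is a hypothetical re-implementation written for this
note (not kubelet code), and the `mv` command is a made-up stand-in for the
chart's real sync script:

```typescript
// Hypothetical sketch of Kubernetes-style $(VAR) expansion on a container command.
// Rules modelled: "$$" -> "$", "$(NAME)" -> value if NAME is defined, else unchanged.
function expandContainerCommand(input: string, vars: Record<string, string>): string {
  let out = "";
  let i = 0;
  while (i < input.length) {
    const ch = input.charAt(i);
    if (ch === "$" && i + 1 < input.length) {
      const next = input.charAt(i + 1);
      if (next === "$") {
        // Escaped dollar: the doubled sign collapses to one.
        out += "$";
        i += 2;
        continue;
      }
      if (next === "(") {
        const close = input.indexOf(")", i + 2);
        if (close !== -1) {
          const varName = input.slice(i + 2, close);
          const value = vars[varName];
          // Unknown variables are left in place as "$(NAME)".
          out += value !== undefined ? value : input.slice(i, close + 1);
          i = close + 1;
          continue;
        }
      }
    }
    out += ch;
    i += 1;
  }
  return out;
}

// The shell never sees "$$" (its PID), only a bare "$".
const expandedCommand = expandContainerCommand("mv /cache/.tmp.$$ /cache/repo", {});
console.log(expandedCommand);
```

Under these rules the shell receives a literal `$` rather than `$$`, which is
consistent with the suffix turning into a constant.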

* Revert "fix(chart): repo-cache temp dir broken by k8s $$ collapse"

This reverts commit 2913516d2e23421888f9ba6875bc0ebfea7c1f48.

* fix(chart): repo-cache temp dir broken by k8s $$ collapse

The sync used a PID-suffixed temp dir, but Kubernetes collapses the
doubled dollar sign in a container command to a single one during its own
$(VAR) expansion, so the suffix became a constant literal and the mv
failed ("subdirectory of itself"). Use a deterministic temp name and
sweep any stale temp dirs before cloning; sync is sequential per pod so a
fixed name is safe. Self-heals existing corrupt caches on next run.

* fix(chart): keep repo-cache target local

* fix(slackbotv2): continue large task streams (#458)

* feat(api-rs): add Absurd workflow runtime (#465)

* fix: grant infra role to warm pool bootstrap

* fix: keep warm pool bootstrap roleless

* refactor(api-rs): move warm pool policy into sandbox manager

* feat(api-rs): add absurd workflow runtime poc

* chore(api-rs): refresh absurd workflow staging image

* fix(api-rs): label workflow host sandboxes

* fix(api-rs): serialize workflow timestamps as rfc3339

* fix(api-rs): mount overlay workflows in workflow sandbox

* fix(api-rs): propagate sandbox image pull secrets

* fix(api-rs): extend workflow agent turns

* fix(api-rs): keep workflow host claims alive

* fix(api-rs): heartbeat all workflow tasks

* fix(workflows): call tools through sandbox shims

* fix(workflows): resolve sandbox tool shim outside login shells

* fix(workflows): bootstrap tool shims in workflow hosts

* fix(workflows): mount tools in workflow host sandboxes

* fix(workflows): grant workflow host tool secrets

* ci: speed up branch image builds

* ci: add Dockerfile package caches

* fix(api-rs): repair workflow schedule test fixture

* refactor(api-rs): rename workflows crate

---------

Co-authored-by: Centaur AI <ai@centaur.local>

* fix(slackbotv2): suppress too-long fallback reposts (#466)

* fix(slackbotv2): avoid paging open task cards (#467)

* fix(slackbotv2): page Slack plans by visible tasks (#469)

* fix(slackbotv2): accept Slack events route (#471)

Co-authored-by: Centaur AI <ai@centaur.local>

* fix(api-rs): raise stdout cap and disable service links (#473)

fix(api-rs): raise stdout line cap and disable service links

Co-authored-by: Centaur AI <ai@centaur.local>

* fix: keep sandbox bootstrap noise out of the harness stdout stream (#474)

The sandbox entrypoint's install-tool-shims printed its success notice to
stdout, which is the same pipe the session stdout pump streams to clients.
slackbotv2 treated any JSON-unparseable output line as a terminal codex
line, so on every fresh (non-warm) sandbox the very first bootstrap line
ended the render stream before the agent produced output, finalizing the
Slack reply as 'Execution completed, but no final text was captured.'
while the real answer streamed afterwards and was dropped.

- install_tool_shims.py: write the notice to stderr
- slackbotv2: non-JSON output lines are noise, not terminal
- regression test: bootstrap line before codex output still delivers the answer
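
A minimal sketch of the "noise, not terminal" rule, assuming harness events
arrive as one JSON object per line. `classifyOutputLine` and `collectEvents`
are hypothetical names for illustration; the actual slackbotv2 parser is not
shown in this log:

```typescript
// Hypothetical classifier: anything that is not a JSON object is skipped as noise
// instead of ending the render stream.
type LineKind = "event" | "noise";

function classifyOutputLine(line: string): LineKind {
  const trimmed = line.trim();
  if (trimmed.length === 0) return "noise";
  try {
    const parsed: unknown = JSON.parse(trimmed);
    // Assumption of this sketch: only JSON objects count as harness events.
    const isObject = typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
    return isObject ? "event" : "noise";
  } catch {
    // Unparseable output (for example a bootstrap notice) is ignored, not terminal.
    return "noise";
  }
}

function collectEvents(lines: string[]): string[] {
  return lines.filter((line) => classifyOutputLine(line) === "event");
}

const keptLines = collectEvents([
  "installed tool shims", // made-up bootstrap line
  '{"type":"answer","text":"hi"}',
]);
console.log(keptLines.length);
```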

Amp-Thread-ID: https://ampcode.com/threads/T-019eb1eb-27c4-7169-9489-41d85f8e0614

Co-authored-by: Centaur AI <ai@centaur.local>
Co-authored-by: Amp <amp@ampcode.com>

* fix: P0 review fixes for the Rust control plane (#344 review) (#472)

* fix(api-rs): only drive session executions claimed by this request

mark_execution_running treated an already-running row as a successful
claim, so a concurrent request with the same idempotency key could fall
into the fallback fetch, see status=running, and send the same input to
the sandbox a second time. It now returns ClaimExecutionResult with a
claimed flag that is true only when this call did the queued->running
transition; the runtime returns the existing execution without driving
it when the claim was lost.
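
The claim semantics can be sketched as follows. This is an in-memory
TypeScript stand-in, not the Rust/Postgres implementation; the `Map`,
`handleRequest`, and the `drive` callback are assumptions for illustration,
while `ClaimExecutionResult` and the `claimed` flag mirror the names above:

```typescript
type ExecStatus = "queued" | "running" | "completed";
interface Execution { id: string; status: ExecStatus }
interface ClaimExecutionResult { execution: Execution; claimed: boolean }

// Stand-in for the executions table (assumption; the real store is a database).
const executions = new Map<string, Execution>();

function markExecutionRunning(id: string): ClaimExecutionResult | undefined {
  const row = executions.get(id);
  if (row === undefined) return undefined;
  if (row.status === "queued") {
    // Only the caller that performs queued -> running wins the claim.
    row.status = "running";
    return { execution: row, claimed: true };
  }
  // Already running (or finished): report the row, but not as a claim.
  return { execution: row, claimed: false };
}

function handleRequest(id: string, drive: (e: Execution) => void): Execution | undefined {
  const result = markExecutionRunning(id);
  if (result === undefined) return undefined;
  // A lost claim returns the existing execution without sending input again.
  if (result.claimed) drive(result.execution);
  return result.execution;
}
```

With two requests carrying the same idempotency key, only the first call to
`handleRequest` invokes `drive`; the second just gets the running row back.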

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>

* fix(api-rs): reject HTTP tool secrets without hosts

An HTTP secret parsed with an empty hosts list (empty tool-level
default, hosts = [], or malformed hosts falling back to an empty
default) translated to an empty iron-control rules array, leaving the
credential host-unlimited. Both manifest parsers (centaur-perms and the
api-server's tool discovery mirror) now fail closed, matching the
brokered_token parser. Affected tools are warn-skipped at discovery.
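
A sketch of the fail-closed rule under stated assumptions: `parseHttpSecret`,
its parameters, and the result shape are hypothetical, and the fallback logic
is only a guess at how the three empty-hosts cases arise:

```typescript
interface HttpSecretInput { secretName: string; hosts: string[] }
type ParseOutcome =
  | { ok: true; secret: HttpSecretInput }
  | { ok: false; reason: string };

function parseHttpSecret(
  secretName: string,
  hosts: unknown,
  toolDefaultHosts: string[],
): ParseOutcome {
  // Absent or malformed hosts fall back to the tool-level default;
  // an explicit (possibly empty) string list is taken as given.
  const wellFormed = Array.isArray(hosts) && hosts.every((h) => typeof h === "string");
  const candidate: string[] = wellFormed ? (hosts as string[]) : toolDefaultHosts;
  if (candidate.length === 0) {
    // Fail closed: an empty list would otherwise mean "no host rules at all".
    return { ok: false, reason: `secret ${secretName} has no hosts` };
  }
  return { ok: true, secret: { secretName, hosts: candidate } };
}

console.log(parseHttpSecret("api_key", [], ["api.example.com"]).ok);
```

A caller at discovery time would presumably log the `reason` and skip the
tool, matching the warn-skip behavior described above.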

* fix(api-rs): guard absurd.await_event against task/run mismatch

await_event trusted the caller-provided (task_id, run_id) pair: a
mismatched call could attach one task's run to a wait/checkpoint on
another task and put the wrong task to sleep. Reject mismatches, as
get_task_checkpoint_states already does. Shipped as migration 0009
(create or replace) because 0007 is already applied in live
environments and sqlx validates migration checksums.

* fix(api-rs): only claim running warm sandboxes

A warm sandbox observed as Created is not ready for byte I/O and means
the runtime regressed after the replenisher saw it running (backends
wait for readiness before returning from create). Claiming it made the
session fail at open_io; mark it failed and try the next one instead.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>

* fix(api-rs): make grant_inputs_to_role idempotent

grant_inputs_to_role documented idempotency but always POSTed a new
grant after upserting each secret, so re-running centaur-perms grants
or startup role registration produced duplicate grants or conflicts.
It now lists the role's existing grants once and reuses the grant for
an already-granted secret.
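
The list-once-then-reuse shape might look like the following. `FakeControlClient`
is a made-up in-memory client standing in for iron-control, and the method
names are assumptions rather than the real client API:

```typescript
interface RoleGrant { grantId: string; secretId: string }

// Hypothetical in-memory client; counts calls so idempotency is observable.
class FakeControlClient {
  private grants: RoleGrant[] = [];
  listCalls = 0;
  postCalls = 0;

  listGrants(): RoleGrant[] {
    this.listCalls += 1;
    return [...this.grants];
  }

  createGrant(secretId: string): RoleGrant {
    this.postCalls += 1;
    const grant: RoleGrant = { grantId: `g${this.grants.length + 1}`, secretId };
    this.grants.push(grant);
    return grant;
  }
}

function grantInputsToRole(client: FakeControlClient, secretIds: string[]): RoleGrant[] {
  // List the role's existing grants once, then index them by secret.
  const existing = new Map<string, RoleGrant>();
  for (const grant of client.listGrants()) existing.set(grant.secretId, grant);

  return secretIds.map((secretId) => {
    const found = existing.get(secretId);
    if (found !== undefined) return found; // reuse instead of POSTing a duplicate
    const created = client.createGrant(secretId);
    existing.set(secretId, created);
    return created;
  });
}
```

Running it twice with the same secrets issues one list call per run and no
additional POSTs on the second run.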

* fix(rendering): flush buffered answer in codexAppServerToRendererEvents

The array helper never called mapper.flush(), so finite sources that end
without an explicit terminal event lost buffered answer text and never
received renderer.done. Flush like codexAppServerToChatSdkStream does;
flush is a no-op when a terminal event already completed the stream.
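
A simplified sketch of why the final flush matters. `BufferingMapper`,
`SourceEvent`, and the `renderer.answer` event name are hypothetical; only
`renderer.done` and the flush-is-a-no-op-after-terminal behavior come from the
description above:

```typescript
type SourceEvent = { kind: "delta"; text: string } | { kind: "terminal" };
type RendererEvent =
  | { kind: "renderer.answer"; text: string }
  | { kind: "renderer.done" };

class BufferingMapper {
  private buffer = "";
  private finished = false;

  push(ev: SourceEvent): RendererEvent[] {
    if (this.finished) return [];
    if (ev.kind === "delta") {
      this.buffer += ev.text; // answer text is held until the stream ends
      return [];
    }
    return this.finish();
  }

  // No-op when a terminal event already completed the stream.
  flush(): RendererEvent[] {
    return this.finished ? [] : this.finish();
  }

  private finish(): RendererEvent[] {
    this.finished = true;
    const out: RendererEvent[] = [];
    if (this.buffer.length > 0) out.push({ kind: "renderer.answer", text: this.buffer });
    out.push({ kind: "renderer.done" });
    this.buffer = "";
    return out;
  }
}

function toRendererEvents(source: SourceEvent[]): RendererEvent[] {
  const mapper = new BufferingMapper();
  const out: RendererEvent[] = [];
  for (const ev of source) out.push(...mapper.push(ev));
  out.push(...mapper.flush()); // without this, a source with no terminal loses its answer
  return out;
}
```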

* fix(slackbotv2): cap inline attachments at 100 MiB

serializeAttachment buffered every Slack attachment in memory and
base64-inlined it with no size limit, letting one large upload blow
request limits or OOM the process. Skip the download when Slack's size
metadata exceeds 100 MiB and re-check the actual byte count after
fetching; oversized attachments degrade through the existing fetchError
channel.
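
The two-stage check can be sketched as below. `serializeWithCap` and its
synchronous `download` callback are hypothetical simplifications (the real
code is presumably async and base64-encodes the bytes, which is omitted
here); the 100 MiB figure and the `fetchError` channel come from the text:

```typescript
const MAX_INLINE_ATTACHMENT_BYTES = 100 * 1024 * 1024; // 100 MiB

type AttachmentOutcome =
  | { ok: true; bytes: Uint8Array }
  | { ok: false; fetchError: string };

function serializeWithCap(
  declaredSize: number | undefined,
  download: () => Uint8Array,
  limit: number = MAX_INLINE_ATTACHMENT_BYTES,
): AttachmentOutcome {
  // Stage 1: trust the size metadata enough to skip the download entirely.
  if (declaredSize !== undefined && declaredSize > limit) {
    return { ok: false, fetchError: `attachment too large: ${declaredSize} bytes` };
  }
  // Stage 2: metadata can be missing or wrong, so re-check the real byte count.
  const bytes = download();
  if (bytes.byteLength > limit) {
    return { ok: false, fetchError: `attachment too large: ${bytes.byteLength} bytes` };
  }
  return { ok: true, bytes };
}
```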

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>

* fix(slackbotv2): isolate render obligation recovery failures

One thread's corrupt state, lease error, or failed render propagated
out of the recovery scan, so the remaining indexed obligations were
never attempted until the next restart. Isolate each thread, log the
failure, and count it as deferred so the capped-backoff retry loop
revisits it.
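
A minimal sketch of the per-thread isolation, written synchronously for
brevity (the real scan is presumably async). `runRecoveryScan`,
`renderThread`, and `RecoverySummary` are hypothetical names:

```typescript
interface RecoverySummary { rendered: number; deferred: number }

function runRecoveryScan(
  threadIds: string[],
  renderThread: (threadId: string) => void,
  log: (message: string) => void,
): RecoverySummary {
  const summary: RecoverySummary = { rendered: 0, deferred: 0 };
  for (const threadId of threadIds) {
    try {
      renderThread(threadId);
      summary.rendered += 1;
    } catch (err) {
      // One bad thread is logged and deferred; the scan keeps going.
      const detail = err instanceof Error ? err.message : String(err);
      log(`render recovery failed for ${threadId}: ${detail}`);
      summary.deferred += 1;
    }
  }
  return summary;
}
```

A non-zero `deferred` count is what would keep a retry loop scheduled for
another pass.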

* ci: skip registry cache export for fork PRs

cache-to type=registry pushes a cache manifest to GHCR even when image
push is disabled, and fork PRs run with a read-only GITHUB_TOKEN, so
their builds failed at cache export. Gate the registry cache-to on the
same not-a-fork predicate as push.

---------

Co-authored-by: Amp <amp@ampcode.com>

* fix: align Slack pagination with chat sdk and raise pg limits (#475)

fix: update Slack stream pagination and Postgres pool limits

* feat(api-rs): serve tools + overlay to agent sandboxes (#443)

* feat(api-rs): serve tools + gerard overlay to agent sandboxes

api-rs sandboxes had no tools and no overlay. Give api-rs-spawned agents the
same base + overlay tools and overlay system-prompt the chart already wires for
the api-rs pod, using upstream's CLI-shim tool model rather than a sidecar.

Upstream direction: tools are shell CLI shims, not an HTTP registry. The agent
image's install-tool-shims (services/sandbox/install_tool_shims.py) scans
TOOL_DIRS at entrypoint and `uvx`-installs each pyproject [project.scripts] as a
CLI; the SYSTEM_PROMPT points agents at those CLIs and `centaur-tools list`. The
old `call <tool>` HTTP registry is deprecated to control-plane-only. Tool
secrets are already handled upstream: codex_app_server_env_template pushes the
tool placeholder creds onto the agent env, iron-control grants the per-sandbox
principal the real secrets, and Postgres rides proxied `*_DSN` env from
apply_proxy_env. So the agent needs only the tool SOURCES at the right paths —
no sidecar, no HMAC sandbox token, no loopback tool server.

- tools.rs (replaces tool_server.rs): a `tools-bootstrap` init container copies
  /app/tools out of the shared centaur-api image into an emptyDir mounted at
  /app/tools in the agent, and an `overlay-bootstrap` init container copies the
  org overlay tree into overlay-root mounted at overlay.mountPath (the same path
  the api-rs Deployment uses) and stages the overlay's SYSTEM_PROMPT.md as
  $HOME/AGENTS_OVERLAY.md, which the sandbox entrypoint appends to the base
  prompt. TOOL_DIRS is set on the agent env to /app/tools (or
  /app/tools:<mountPath>/tools with the overlay) — identical to the value the
  api-rs pod computes for its own tool discovery, set deterministically in the
  spec builder rather than via passthrough env.
- lib.rs: build_agent_sandbox layers the tools/overlay env over spec.env, mounts
  the bootstrapped sources read-only into the agent, and appends the
  tools-bootstrap + overlay-bootstrap init containers and their volumes. No
  sidecar container, no token minting.
- args.rs: a minimal ToolsArgs (source image/pull-policy, reusing the
  KUBERNETES_TOOL_SERVER_IMAGE* env the chart sets from the shared api image) and
  OverlayArgs (image/pull-policy/source-path/mount-path) wired into
  AgentSandboxConfig. Explicit clap arg ids avoid id collisions with the other
  flattened arg structs.
- chart apirs.yaml: render the tools source image (api.image.*, gated on
  toolServer.enabled) and overlay (overlay.*) onto the api-rs env, replacing the
  KUBERNETES_TOOL_SERVER_* sidecar block.

Gone vs the sidecar port: tool_server.rs, the sbx1 HMAC token minting and its
SANDBOX_SIGNING_KEY requirement, CENTAUR_TOOLS_URL, the sidecar pg-DSN/proxy-env
collection, and the hmac/base64/sha2 dependency additions (nothing else in the
agent-k8s crate uses them).

Warm-pool sandboxes route through the same build_agent_sandbox path, so they get
the tools/overlay init containers and volumes for free.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(api-rs): stage tools-bootstrap copy outside /app/tools

The tools-bootstrap init container mounted the tools emptyDir at
/app/tools — the same path it copies FROM. The mount shadows the source
image's tools tree, so the script self-copies the empty volume and GNU
cp rejects it (exit 1); every sandbox dies with 'reached terminal state
before running' and no agent ever starts.

Mount the volume at /tools-bootstrap instead (mirroring how
overlay-bootstrap stages to a distinct target) and copy the image's
/app/tools into it. The agent container keeps mounting the same volume
at /app/tools, so TOOL_DIRS and the shim installer are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: wire sandbox overlays without tools

Gate overlay env, volumes, and mounts independently from the tools source image
so overlay-only sandbox configs produce valid pod specs.

* fix: make sandbox bootstrap volumes writable

Set an fsGroup on sandbox pods that use tools or overlays so non-root bootstrap
init containers can populate their emptyDir mounts.

* fix(api-rs): source sandbox tools image from the api-rs image

The tools-bootstrap init container copied /app/tools from .Values.api.image
(centaur-api), but api-rs discovers its tools from /app/tools in its own
container (.Values.apiRs.image). Sourcing from a different image risked the
agent installing a different tool set than api-rs granted per-sandbox creds
for. Source from the same api-rs image the Deployment runs so the two match
by construction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(api-rs): clone sandbox tools from a repo instead of baking them in

Replumb the tools-bootstrap init container to git-clone the tools repo at a
pinned ref into each sandbox's /app/tools (sparse on the tools subdir; GitHub
token via askpass for private repos), instead of copying /app/tools out of the
api-rs image. Mirrors the repo-cache architecture — clone a repo into a
pre-provisioned directory — without sharing its node-level cache, so adding a
tool is a push to the repo rather than an api-rs image rebuild.

api-rs still discovers its own /app/tools to grant proxy creds, so pin
toolServer.ref to the tool set the image carries to avoid drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(api-rs): route the tools clone through the per-sandbox iron-proxy

The sandbox NetworkPolicy only allows egress to the sandbox's iron-proxy,
api-rs, and DNS, so the tools-bootstrap init container's direct git clone to
github.com is blocked whenever iron-proxy is enabled. Route the clone through
the proxy like all other sandbox egress: export HTTPS_PROXY (the resolved
per-sandbox proxy URL apply_proxy_env already put on the spec) and
GIT_SSL_CAINFO, and mount the pod's existing firewall-ca volume into the init
container. github.com/api.github.com are already in the baseline proxy
allowlist, so no policy or allowlist changes are needed. Without iron-proxy
the clone still goes direct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(api-rs): quote repo/ref/subdir in the tools-bootstrap script

These are operator config (helm values -> env -> clap), not user input, but
interpolating them bare into the /bin/sh -ec script means a stray space or
metacharacter is misparsed by the shell instead of failing loudly in git.
Quote them at every interpolation site.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(api-rs): retry the tools clone through the proxy's startup window

The per-sandbox iron-proxy is created in the same reconcile as the Sandbox CR
and isn't accepting connections yet when the tools-bootstrap init container
first runs — the clone dies with connection-refused, and an init failure is
terminal for the Sandbox (no kubelet retry), so every cold spawn failed with
'reached terminal state before running'. Wrap the clone/sparse-checkout/ref
fetch in a bounded retry loop (30 x 2s) so the init container rides out the
proxy's startup instead of killing the sandbox.
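
The actual loop lives in the init container's shell script, which is not
shown here. As a language-neutral illustration of the bounded 30 x 2s shape,
here is a TypeScript sketch with an injected `attempt` and `sleep` (both
hypothetical) so it can run without real delays:

```typescript
// Returns the number of attempts used; throws once the budget is exhausted.
function retryBounded(
  attempt: () => boolean,
  sleep: (ms: number) => void,
  maxAttempts: number = 30,
  delayMs: number = 2000,
): number {
  for (let i = 1; i <= maxAttempts; i++) {
    if (attempt()) return i;
    // No point sleeping after the final failed attempt.
    if (i < maxAttempts) sleep(delayMs);
  }
  throw new Error(`gave up after ${maxAttempts} attempts`);
}

// Example: the proxy refuses the first three connections, then accepts.
let refusalsLeft = 3;
let sleptMs = 0;
const attemptsUsed = retryBounded(
  () => refusalsLeft-- <= 0,
  (ms) => { sleptMs += ms; },
);
console.log(attemptsUsed, sleptMs);
```

With these defaults the worst case waits 29 x 2s = 58s before failing, which
bounds how long a sandbox can sit in init.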

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
(cherry picked from commit b1f274db53bca25494d6f5a12d6c3b79d72b6409)

* refactor(api-rs): converge sandbox overlay on the spec-level overlay-image plumbing

The base branch grew its own overlay mechanism (SandboxSpec.overlay +
overlay_json) for workflow-host sandboxes, configured by the same
CENTAUR_OVERLAY_* env this branch's OverlayArgs read — so a workflow-host
pod with an overlay configured got two init containers and two volumes
with identical names, which Kubernetes rejects.

Adopt the upstream plumbing wholesale: the backend default is now an
OverlayImage from the same env helper the workflow host uses (the
OverlayArgs flags are gone), a spec-level overlay takes precedence over
the backend default so only one overlay-bootstrap/overlay-root pair ever
exists, and agent sandboxes mount the overlay at /opt/centaur/overlay
like workflow hosts do. The AGENTS_OVERLAY.md prompt staging moves into
the shared overlay_json path, and the chart's duplicated CENTAUR_OVERLAY_*
env block is dropped — the upstream block already feeds it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add Discord tool

* Use Discord self-token client

* fix(api-rs): Rust CI gate, constant-time webhook auth, error-chain hardening (#489)

fix(api-rs): add Rust CI gate, constant-time webhook auth, and error-chain hardening

Findings from a workspace-wide review (clippy + manual audit), in three parts.

CI + lint policy:
- New rust-api CI job: cargo fmt --check, clippy --workspace --all-targets
  -D warnings, cargo test --workspace. The Rust workspace previously had no
  CI coverage at all.
- [workspace.lints] in the root manifest (dbg_macro/todo/unimplemented),
  plumbed into every crate via [lints] workspace = true.
- Fixed all 16 pre-existing clippy warnings. The too_many_arguments
  clusters got real fixes: session-runtime background tasks now share a
  RuntimeContext struct, and run_agent_session_turn takes an
  AgentTurnRequest struct instead of 12 positional args.

Webhook auth + 5xx hygiene:
- HMAC verification now decodes the presented signature and uses
  Mac::verify_slice (constant-time) instead of encoding the digest and
  string-comparing. Uppercase hex signatures now verify too.
- Bearer token comparison goes through subtle::ConstantTimeEq.
- Missing/invalid webhook secret is now a 500 (server misconfig), not a
  400 blamed on the caller, matching the Python API behavior.
- 5xx responses and SSE stream errors log the full error chain server-side
  and return an opaque message instead of leaking sqlx/kube/runtime
  internals to clients.
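
To illustrate the decode-then-compare approach (the Rust code uses
Mac::verify_slice; this is only an analogous TypeScript sketch), the helpers
below decode the presented hex signature to bytes and compare without an
early exit. `decodeHex`, `constantTimeEqual`, and `verifySignature` are
hypothetical names, and the expected digest is assumed to be computed
elsewhere:

```typescript
function decodeHex(hex: string): Uint8Array | undefined {
  // Accept both cases; reject odd lengths and non-hex characters.
  if (hex.length % 2 !== 0 || !/^[0-9a-fA-F]*$/.test(hex)) return undefined;
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

function constantTimeEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  // Accumulate differences instead of returning at the first mismatch.
  for (let i = 0; i < a.length; i++) diff |= (a[i] ?? 0) ^ (b[i] ?? 0);
  return diff === 0;
}

function verifySignature(expectedDigest: Uint8Array, presentedHex: string): boolean {
  const presented = decodeHex(presentedHex);
  return presented !== undefined && constantTimeEqual(expectedDigest, presented);
}

const exampleDigest = new Uint8Array([0xde, 0xad, 0xbe, 0xef]); // made-up digest
console.log(verifySignature(exampleDigest, "DEADBEEF"));
```

Comparing decoded bytes is also why uppercase hex verifies: case only exists
in the text encoding. In real Node.js code, `crypto.timingSafeEqual` would be
a better choice than a hand-rolled loop, since JavaScript makes no timing
guarantees.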

Error-chain preservation:
- SandboxError::Io/Backend now carry an optional #[source] cause (with
  io/io_source/backend/backend_source constructors). Display output is
  unchanged so existing log lines keep their content, but the structured
  chain survives for chain-walking consumers.
- absurd::Error gained TaskFailed(Box<dyn Error + Send + Sync>); workflow
  failures are no longer flattened into InvalidOptions strings, and
  serialize_error persists the full source chain in failure payloads.
- WorkflowRuntimeError gained Internal (HTTP 500) and Upstream (HTTP 502)
  variants; ~20 server-side failure sites (Python host spawn/protocol,
  queue dispatch, Slack API, agent-turn outcomes) were reclassified off
  BadRequest so they stop surfacing as 400s and become visible to 5xx
  alerting.

Verified with cargo fmt --check, cargo clippy --workspace --all-targets
-D warnings, and cargo test --workspace (190 passed, 0 failed). New tests
cover uppercase-hex/base64/invalid HMAC signatures and the
missing-secret-is-internal-error contract.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb66a-5494-7747-a029-f5b2d8a9a5b2

Co-authored-by: Amp <amp@ampcode.com>

* feat(slackbotv2): conflate render streams for slow Slack consumers (#484)

A busy execution can emit tens of thousands of session events while Slack
rendering pays one rate-limited API call per chunk, so a large turn could
take far longer to render than to run. On 2026-06-11 a 10,453-event turn
(thread slack:C0A87C21805:1781171006.081809) was still rendering 36 minutes
after the harness finished when a deploy killed the pod, leaving the thread
wedged behind a stuck activeExecution flag.

Wrap the chunk stream in a conflator that drains the source eagerly while
the consumer is busy: markdown deltas concatenate (append-only content),
task updates merge per field keyed by card id with the newest value winning
(updates omit details/output to mean unchanged, so absent fields inherit
the pending value), and plan updates keep the latest title. Each consumer
pull yields one pending item, so the Slack call count is bounded by
distinct cards plus markdown volume instead of source event count. When
the consumer keeps up, pending never accumulates and behavior matches the
unwrapped stream.
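
The merge rules can be sketched as a pending buffer. This leaves out the
eager async draining entirely; `PendingChunks` and the `Chunk` shape are
hypothetical, and merging markdown only into the most recent pending item is
an assumption of this sketch:

```typescript
type Chunk =
  | { kind: "markdown"; text: string }
  | { kind: "task"; cardId: string; title?: string; details?: string; output?: string }
  | { kind: "plan"; title: string };

class PendingChunks {
  private items: Chunk[] = [];

  add(chunk: Chunk): void {
    if (chunk.kind === "markdown") {
      const last = this.items[this.items.length - 1];
      if (last !== undefined && last.kind === "markdown") {
        last.text += chunk.text; // append-only content concatenates
        return;
      }
      this.items.push({ ...chunk });
      return;
    }
    if (chunk.kind === "task") {
      for (const item of this.items) {
        if (item.kind === "task" && item.cardId === chunk.cardId) {
          // Newest value wins per field; absent fields inherit the pending value.
          if (chunk.title !== undefined) item.title = chunk.title;
          if (chunk.details !== undefined) item.details = chunk.details;
          if (chunk.output !== undefined) item.output = chunk.output;
          return;
        }
      }
      this.items.push({ ...chunk });
      return;
    }
    for (const item of this.items) {
      if (item.kind === "plan") {
        item.title = chunk.title; // keep only the latest plan title
        return;
      }
    }
    this.items.push({ ...chunk });
  }

  // Each consumer pull yields one pending item.
  take(): Chunk | undefined {
    return this.items.shift();
  }
}
```

If 400 markdown deltas arrive while the consumer is busy, the next `take()`
returns a single chunk carrying all of the text.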

The emulate regression test shows the effect directly: 400 output deltas
to one card collapse from 402 chunk sends to a handful. Existing tests
that asserted specific intermediate card states now assert terminal state
plus aggregate content, since intermediates are timing-dependent under
conflation; the open-segment test synchronizes on the in_progress state
reaching Slack before completing the card, preserving its original intent.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb342-f946-750e-89d1-a8f028e80d0e

Co-authored-by: Amp <amp@ampcode.com>

* chore(sandbox): bump codex 0.130.0 -> 0.139.0 (#490)

Fixes ~3% of executions stalling for exactly 300s (stream_idle_timeout)
when a turn is sent over a cached Responses websocket: codex 0.130.0
never transmitted the Responses-Lite mode marker, so a cached socket
whose negotiated mode mismatched the request would silently never
respond. Fixed upstream in openai/codex#26542 (954e2878, 2026-06-05)
by sending the lite marker per-request via client_metadata; the first
release containing it is 0.138.0, and we bump to the latest, 0.139.0.

Evidence: 17/562 executions in the last 25h show the signature
userMessage item.completed -> 302-305s gap -> first model output,
with no correlation to cluster restarts, proxy reloads, or
connection idle time. App-server JSON-RPC protocol changes between
0.130 and 0.139 are additive only for the methods our wrapper uses.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb6e8-b3b6-71c7-8c41-7f4b99048861

Co-authored-by: Amp <amp@ampcode.com>

* Remove overlay images and refresh repo-cache tools/workflows

* docs: add api-rs migration checklist (#493)

Amp-Thread-ID: https://ampcode.com/threads/T-019eb71c-209d-727d-bebb-cc47f9ba7ac7

Co-authored-by: Amp <amp@ampcode.com>

* Add repo-cache extra tool sources

* fix(sandbox): disable Codex multi-agent tools (#499)

Amp-Thread-ID: https://ampcode.com/threads/T-019eb6f6-7c8f-72cf-8744-96b41852a123

Co-authored-by: Centaur AI <ai@centaur.local>
Co-authored-by: Amp <amp@ampcode.com>

* feat: adopt orphaned executions and unwedge render recovery (#486)

* feat(api-rs): adopt executions orphaned by control plane restarts

Execution rows never time out on their own: the only writer of a terminal
status is the api-rs process watching the sandbox, so a kill mid-turn (for
example a deploy) leaves the row 'running' forever. That wedges the thread
- the one-active-execution index blocks new executes - and any event
stream consumer waits for a terminal event that will never be written. In
production one such zombie had been 'running' for 17 hours while its
sandbox was still alive and the finished answer ('Done: pushed commit
5ca8c99 to PR #432...') sat unread in the pod logs.

At startup the runtime now adopts every orphaned execution instead of
leaving it stuck:

1. If the sandbox already finished the turn while nobody was attached,
   recover the terminal outcome (and final answer) from the backend's
   recorded output - k8s attach streams only deliver from attach time
   forward, but the kubelet's pod logs retain what was missed. New
   SandboxBackend::read_output_since with a pod-logs implementation for
   the k8s backend; other backends default to Unsupported.
2. If the turn is still in flight, re-attach the stdout pump and re-arm
   the remaining max-duration budget from execution metadata.
3. If the sandbox is gone (or the orphan never received input), record
   the failure honestly so the thread unwedges.

Covered by four Postgres-gated integration tests (mirroring the
SESSION_RUNTIME_TEST_DATABASE_URL pattern), including the production
zombie shape: running row + alive sandbox + answer only in recorded logs
-> execution completes with the answer, no live attach needed.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb342-f946-750e-89d1-a8f028e80d0e
Co-authored-by: Amp <amp@ampcode.com>

* fix(slackbotv2): keep render recovery moving past hung and corrupt obligations

The startup recovery scan walks obligations serially and fully awaits each
re-render, so a single obligation whose event stream never yields a chunk
(for example a zombie execution with no terminal event) blocked every
obligation queued behind it forever. In production the scan sat hung on
one thread for hours with ~180 undelivered answers starved behind it.

Three changes:
- Per-thread deadline (renderRecoveryThreadTimeoutMs, default 2m): on
  timeout the scan defers the thread and moves on, leaving the attempt
  running detached with its lease held so a later pass cannot start a
  duplicate render alongside it.
- Lease-skipped threads now count as deferred, so the retry loop keeps
  running until obligations are actually resolved instead of exiting
  while a crashed pass's lease blocks them.
- A failure budget (5 non-retryable failures) abandons obligations that
  can never render - such as a corrupt thread id without a thread ts,
  which previously poisoned the retry loop on every pass - and clears
  their state so the thread unwedges.
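
A sketch of the failure budget only (the deadline and lease handling are not
modelled). `ObligationTracker` is a hypothetical name, and treating the fifth
failure as the one that triggers abandonment is an assumption about how the
budget of 5 is counted:

```typescript
const FAILURE_BUDGET = 5;

class ObligationTracker {
  private failures = new Map<string, number>();

  // Returns "abandon" once a thread has used up its budget, else "defer".
  recordNonRetryableFailure(threadId: string): "defer" | "abandon" {
    const count = (this.failures.get(threadId) ?? 0) + 1;
    if (count >= FAILURE_BUDGET) {
      // Clear the state so the thread unwedges instead of poisoning every pass.
      this.failures.delete(threadId);
      return "abandon";
    }
    this.failures.set(threadId, count);
    return "defer";
  }
}

const tracker = new ObligationTracker();
const outcomes: string[] = [];
for (let pass = 0; pass < 5; pass++) {
  outcomes.push(tracker.recordNonRetryableFailure("thread-without-ts")); // made-up id
}
console.log(outcomes.join(","));
```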

Emulate tests cover a renderable obligation queued behind a hung zombie
(delivered despite the hang, zombie left pending) and abandonment of the
corrupt-thread obligation after repeated failures.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb342-f946-750e-89d1-a8f028e80d0e
Co-authored-by: Amp <amp@ampcode.com>

---------

Co-authored-by: Amp <amp@ampcode.com>

* fix(api-rs): raise session body limits (#501)

* fix(api-rs): wire harness OTLP export so Laminar traces carry cost (#492)

* fix(api-rs): wire harness OTLP export so Laminar traces carry cost

Since the cutover to the Rust control plane, Laminar traces contain no
cost — and no spans at all. Cost only ever came from codex's
session_task.turn spans (token usage normalized into gen_ai.usage.* by
codex-app-wrapper's OTLP prefix proxy and priced by Laminar), and the
Rust control plane never wired any of that path up:

- sandbox pods got no OTEL_* env (nothing consumed sandbox.extraEnv),
  so the wrapper never had an OTLP endpoint to export to;
- session stdin lines carried only thread_key, never trace_id or
  traceparent, so the wrapper never wrote codex's [otel] config;
- the per-sandbox egress NetworkPolicy only allows iron-proxy, api-rs,
  and DNS, so direct exports would time out (and plain-HTTP via
  iron-proxy is rejected with 405);
- the chart's api-rs egress policy has no OTLP rule, so api-rs's own
  spans die with BatchSpanProcessor 'network error' export failures.

Fix, end to end:

- chart: render sandbox.extraEnv into SESSION_SANDBOX_EXTRA_ENV on the
  api-rs deployment (same contract as the Python control plane's
  KUBERNETES_SANDBOX_EXTRA_ENV) and add networkPolicy.otlpEgress values
  that open api-rs egress to an in-cluster collector namespace.
- args: parse SESSION_SANDBOX_EXTRA_ENV (JSON name/value list) into the
  codex sandbox env template (operator wins), and derive the sandbox
  OTLP egress NetworkPolicy target from the sandbox's own OTLP endpoint
  when it is an in-cluster service DNS name.
- agent-k8s: add the namespace-scoped OTLP egress rule to the
  per-sandbox egress policy (resume-safe, backend-level config) and
  auto-merge the OTLP endpoint host into the sandbox NO_PROXY so the
  wrapper's export bypasses iron-proxy.
- session-runtime: enrich every sandbox stdin line with a deterministic
  per-thread trace_id (UUIDv5 of the thread key; derive-don't-store
  replacement for the Python thread_traces row) and the execution
  span's W3C traceparent (always sampled), on both the execute and
  steering paths, so codex joins the execution trace and the wrapper
  configures codex's OTLP export on the first turn.
- telemetry: traceparent_for_span helper.
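
For reference, a W3C traceparent header has the form
`version-traceid-spanid-flags`. The sketch below formats one in TypeScript;
`formatTraceparent` and `traceIdFromUuid` are hypothetical helpers (the real
helper is the Rust traceparent_for_span), and using a UUID's 32 hex digits
directly as the trace id is an assumption about the derivation:

```typescript
// Assumption: a 128-bit UUID maps to a trace id by dropping its dashes.
function traceIdFromUuid(uuid: string): string {
  return uuid.replace(/-/g, "").toLowerCase();
}

function formatTraceparent(traceIdHex: string, spanIdHex: string, sampled: boolean): string {
  // W3C Trace Context: 32 lowercase hex chars, not all zeros.
  if (!/^[0-9a-f]{32}$/.test(traceIdHex) || /^0{32}$/.test(traceIdHex)) {
    throw new Error("invalid trace id");
  }
  // 16 lowercase hex chars, not all zeros.
  if (!/^[0-9a-f]{16}$/.test(spanIdHex) || /^0{16}$/.test(spanIdHex)) {
    throw new Error("invalid span id");
  }
  // Version 00; flags 01 means sampled.
  return `00-${traceIdHex}-${spanIdHex}-${sampled ? "01" : "00"}`;
}

const exampleHeader = formatTraceparent(
  traceIdFromUuid("4BF92F35-77B3-4DA6-A3CE-929D0E0E4736"), // made-up UUID
  "00f067aa0ba902b7",
  true, // "always sampled", per the description above
);
console.log(exampleHeader);
```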

Amp-Thread-ID: https://ampcode.com/threads/T-019eb733-e7f6-76b9-81ee-49254bc72181
Co-authored-by: Amp <amp@ampcode.com>

* fix(api-rs): repair clippy gate broken on api-rs-control-plane

The overlay-removal commit (41d36d64) pushed directly to the branch
broke 'cargo clippy --workspace --all-targets -- -D warnings':

- tools.rs module doc: list continuation without indentation
  (doc_lazy_continuation) — separate the trailing paragraph;
- workflow_host_spec: collapsible nested if left behind by the removed
  overlay branch — use let-chains.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb733-e7f6-76b9-81ee-49254bc72181
Co-authored-by: Amp <amp@ampcode.com>

* fix(api-rs): always forward OTLP env (incl. ingest auth header) into codex sandboxes

Production investigation showed Laminar's /v1/traces requires a project
API key bearer: every OTLP POST since the cutover is a 401 (the
pre-cutover deployments carried OTEL_EXPORTER_OTLP_HEADERS as an
unrecorded hot-patch that the cutover sync wiped, and sandboxes
inherited it via the Python control plane's hardcoded passthrough set).

Mirror that passthrough in the Rust control plane: the api-rs process's
OTEL_EXPORTER_OTLP_{ENDPOINT,TRACES_ENDPOINT,HEADERS} and
OTEL_RESOURCE_ATTRIBUTES are always forwarded into codex sandbox env,
so the ingest key flows secret -> api-rs envFrom -> sandbox -> wrapper
-> codex's [otel] exporter config without ever entering Helm values.
Operator passthrough/extra env still override. api-rs's own exporter
needs no code change: opentelemetry-otlp reads OTEL_EXPORTER_OTLP_HEADERS
from the process env.
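
The precedence described above can be sketched as follows. This is an
illustrative TypeScript model, not the Rust code in api-rs: the
function name and the choice to skip unset or empty values are
assumptions; only the four variable names and the "operator env still
overrides" rule come from the commit message.

```typescript
// Keys the commit says are always forwarded into codex sandbox env.
const FORWARDED_OTEL_KEYS = [
  "OTEL_EXPORTER_OTLP_ENDPOINT",
  "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
  "OTEL_EXPORTER_OTLP_HEADERS",
  "OTEL_RESOURCE_ATTRIBUTES",
] as const;

/** Hypothetical helper: builds the OTLP part of a sandbox environment. */
function buildSandboxOtelEnv(
  processEnv: Record<string, string | undefined>,
  operatorEnv: Record<string, string>,
): Record<string, string> {
  const forwarded: Record<string, string> = {};
  for (const key of FORWARDED_OTEL_KEYS) {
    const value = processEnv[key];
    // Assumption: unset or empty values are not forwarded.
    if (value !== undefined && value !== "") {
      forwarded[key] = value;
    }
  }
  // Operator passthrough/extra env is spread last so it wins on conflict.
  return { ...forwarded, ...operatorEnv };
}
```

Spreading the operator map last keeps the override rule in one place,
and unrelated process variables never leak into the sandbox.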

Amp-Thread-ID: https://ampcode.com/threads/T-019eb733-e7f6-76b9-81ee-49254bc72181
Co-authored-by: Amp <amp@ampcode.com>

---------

Co-authored-by: Amp <amp@ampcode.com>

* fix(api-rs): materialize Codex attachments (#502)

fix(api-rs): materialize codex attachments

* build(sandbox): bake pnpm 10.9.0 into the agent image (#504)

build(sandbox): bake pnpm into the agent image

Agents working in pnpm-based repos can't install dependencies or run
tests — 'pnpm: command not found' — so review/CI-fix workflows either
skip verification or fail outright. Install pnpm in the global npm CLI
layer alongside the other pinned CLIs, with the same --version build
check.

The baked version is only a bootstrap: when a repo's packageManager
field declares a different version, pnpm downloads and runs the declared
one (v11 pmOnFail=download default, manage-package-manager-versions in
v10) over the same registry egress a dependency install already needs.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* fix(slackbotv2): survive Slack stream expiry and guarantee final-answer delivery (#505)

Production evidence (prd-centaur-na, 2026-06-11): five user-reported
'Something went wrong' threads where every execution completed successfully
and the full answer was durably stored, but Slack delivery broke:

- Slack hard-expires a streaming message ~300s after chat.startStream
  (measured 303-304s). Renders that lag past that fail every append/stop
  with message_not_in_streaming_state; the bot logged render_failed and
  the answer was never posted (3 of 5 bugs).
- Plan-card segments accumulate task chunk content that never counted
  toward segment rotation, so big turns hit msg_too_long mid-stream or on
  stop; the bot suppressed that error as 'rendered' and silently dropped
  the answer (2 of 5 bugs). 54 occurrences in the prior 24h vs 459
  successful renders.

Slack streaming is now best-effort; the durable session result is the
delivery guarantee:

Adapter patch (@chat-adapter/slack):
- rotate every live segment before the ~300s streaming lifetime
  (SLACK_STREAM_SEGMENT_MAX_AGE_MS, default 240s)
- budget structured task content per segment so plan cards can no longer
  push a message past Slack's size cap
  (SLACK_STREAM_SEGMENT_TASK_CHAR_BUDGET, default 16k)
- annotate thrown stream errors with slackAnswerLost so the bot knows
  whether the final answer became visible before the failure
- rethrow delivery-impacting Slack errors instead of degrading to
  text-only mid-stream
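
A rough sketch of the two rotation triggers listed above, under stated
assumptions: the types and function names are hypothetical, "16k" is
read as 16000 characters, and the real patch to @chat-adapter/slack may
structure this differently. Only the two environment variable names
and the 240s default come from the commit message.

```typescript
interface SegmentState {
  startedAtMs: number; // when chat.startStream succeeded for this segment
  taskChars: number; // structured task (plan card) chars sent so far
}

interface RotationLimits {
  maxAgeMs: number;
  taskCharBudget: number;
}

// Assumption: "16k" means 16000 characters.
const DEFAULT_LIMITS: RotationLimits = { maxAgeMs: 240_000, taskCharBudget: 16_000 };

/** Reads overrides from an env-like map, falling back to the defaults. */
function limitsFromEnv(env: Record<string, string | undefined>): RotationLimits {
  const positive = (raw: string | undefined, fallback: number): number => {
    const parsed = Number(raw);
    return raw !== undefined && Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
  };
  return {
    maxAgeMs: positive(env.SLACK_STREAM_SEGMENT_MAX_AGE_MS, DEFAULT_LIMITS.maxAgeMs),
    taskCharBudget: positive(
      env.SLACK_STREAM_SEGMENT_TASK_CHAR_BUDGET,
      DEFAULT_LIMITS.taskCharBudget,
    ),
  };
}

/** True when the live segment should be stopped and a new one started. */
function shouldRotate(
  segment: SegmentState,
  nowMs: number,
  incomingTaskChars: number,
  limits: RotationLimits = DEFAULT_LIMITS,
): boolean {
  // Rotate well before Slack's ~300s hard expiry of a streaming message.
  const tooOld = nowMs - segment.startedAtMs >= limits.maxAgeMs;
  // Count task content too, so plan cards cannot exceed the size cap.
  const overBudget = segment.taskChars + incomingTaskChars > limits.taskCharBudget;
  return tooOld || overBudget;
}
```

The 60s gap between the default age limit and the measured expiry
leaves room for a slow append before the stream is closed by Slack.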

slackbotv2:
- on any render failure where the answer is not confirmed visible, replay
  the session event stream from the execution's start position (events are
  durable and replayable) and post the terminal result text exactly once;
  skip the repost when the adapter confirmed the answer was delivered
  (preserves the no-duplicate intent of #466)
- check the adapter annotation before retryability so Slack network
  errors are not misclassified as retryable session errors
- recovery now replays from the obligation's start position instead of
  lastEventId, which could skip the terminal event after a failed render
- drop the msg_too_long suppression (#466's vacuous regression test is
  replaced by real fault-injection tests; the emulator now models how
  real Slack breaks never-stopped streams)
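
The recovery rule above, as a simplified model. Everything here is
hypothetical apart from the slackAnswerLost name: the event shape, the
numeric positions, the in-memory "already reposted" set, and the
reading that an explicit slackAnswerLost of false means the adapter
confirmed delivery. The actual slackbotv2 code is not shown in the
commit message.

```typescript
type SessionEvent =
  | { id: number; kind: "delta"; text: string }
  | { id: number; kind: "result"; text: string };

interface RenderFailure {
  // Assumed semantics: false = answer confirmed visible; else not confirmed.
  slackAnswerLost?: boolean;
}

/** Finds the terminal result text at or after the given replay position. */
function terminalResultText(events: SessionEvent[], fromPosition: number): string | undefined {
  for (const event of events) {
    if (event.id >= fromPosition && event.kind === "result") {
      return event.text;
    }
  }
  return undefined;
}

/**
 * Returns the text to repost, or undefined when nothing should be posted.
 * `reposted` records execution ids so a second failure cannot double-post.
 */
function planRepost(
  failure: RenderFailure,
  executionId: string,
  events: SessionEvent[],
  startPosition: number,
  reposted: Set<string>,
): string | undefined {
  if (failure.slackAnswerLost === false) return undefined; // already visible
  if (reposted.has(executionId)) return undefined; // exactly-once guard
  // Replay from the execution's start, not from the last rendered event,
  // so a terminal event consumed by the failed render is still found.
  const text = terminalResultText(events, startPosition);
  if (text !== undefined) reposted.add(executionId);
  return text;
}
```

Replaying from a position past the terminal event finds nothing, which
is the failure mode the switch away from lastEventId avoids.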

Liveness (4 restarts/8h from 1s probe timeouts):
- default the console logger to info so the adapter's raw-webhook-body
  debug logs (50KB+ JSON.stringify per event) stop stalling the event loop
  (SLACKBOTV2_LOG_LEVEL to override)
- give the chart's health probes timeoutSeconds: 5

Amp-Thread-ID: https://ampcode.com/threads/T-019eb80a-fcb6-75b4-bd3d-dac822f0b700

Co-authored-by: Amp <amp@ampcode.com>

* Add Slack feedback modal storage (#500)

* Add Slack feedback modal storage

* fix: make Slack feedback flow deliverable end to end

- rename user_feedback migration 0009 -> 0010: version 9 already exists on
  this branch (absurd_await_event_task_guard), so sqlx failed with a
  _sqlx_migrations primary-key violation on fresh databases and a checksum
  VersionMismatch on existing ones
- dispatch Slack interactive payloads (block_actions, message_action,
  view_submission) from the shared event handler: app manifests point
  interactivity at /api/webhooks/slack, where these payloads were previously
  swallowed by the event path before reaching the actions handler; dispatch
  happens before dedup, which assumes event envelopes
- cap the feedback input max_length at 3000: Block Kit rejects values above
  3000, so views.open failed outright with the previous 20000
- gate message_action and view_submission on the feedback callback_ids so
  future sh…