fix: P0 review fixes for the Rust control plane (#344 review) by Zygimantass · Pull Request #472 · paradigmxyz/centaur

Zygimantass · 2026-06-10T14:12:54Z

Fixes the low-effort/high-reward (P0) findings from the code review of #344, one commit per finding.

Correctness

Execution claim race — mark_execution_running treated an already-running row as a successful claim, so a concurrent request with the same idempotency key could send the same input to the sandbox twice. It now returns a claimed flag set only by the actual queued→running transition, and the runtime only drives the execution when it owns the claim.
absurd.await_event task/run mismatch — the function never verified p_run_id belongs to p_task_id, so a mismatched call could attach one task's run to another task's wait/checkpoint and put the wrong task to sleep. Added the same guard get_task_checkpoint_states already had. Shipped as migration 0009 (create or replace) rather than editing 0007, since 0007 is applied in live environments and sqlx validates migration checksums.
Warm pool handed out not-ready sandboxes — Created sandboxes (not ready for byte I/O) were claimable and failed later at open_io. Since backends wait for readiness before the replenisher records a sandbox, Created at claim time means the runtime regressed: it is now marked failed and the next one is tried.
Grant idempotency — grant_inputs_to_role documented idempotency but always POSTed a new grant; re-running centaur-perms … grant or startup role registration duplicated grants or conflicted. It now reuses the role's existing grant per secret.

Security

Empty hosts ⇒ unscoped secret — an HTTP tool secret parsed with an empty hosts list emitted an empty iron-control rules array, leaving the credential host-unlimited. Both manifest parsers (centaur-perms and the api-server tool-discovery mirror) now fail closed, matching the existing brokered_token behavior.
- ⚠️ tools/infra/grafana/pyproject.toml declares hosts = [] and is the one in-repo tool relying on the old behavior: it is now warn-skipped at discovery and errors in centaur-perms until real hosts are declared. Someone who knows the deployment's Grafana/VictoriaMetrics hostnames should add them.

Bot / rendering

Lost final answers — codexAppServerToRendererEvents never called mapper.flush(), so finite sources ending without a terminal event lost buffered answer text and never got renderer.done. Now flushes like the streaming variant (regression test included, verified red-without-fix).
Unbounded attachment buffering — Slack attachments were downloaded into memory and base64-inlined with no limit. Now capped at 100 MiB, checked against Slack's size metadata before downloading and re-checked on actual bytes; oversized attachments degrade via the existing fetchError channel.
Recovery scan aborted on first failure — one corrupt obligation or render error stopped startup recovery for all remaining threads until the next restart. Each thread is now isolated, logged (slackbotv2_render_recovery_thread_failed), and deferred to the capped-backoff retry loop.

CI

Fork PR builds failed at cache export — cache-to: type=registry pushes a cache manifest to GHCR even with push: false, and fork PRs run with a read-only token. The registry cache export is now gated on the same not-a-fork predicate as push.

Verification

cargo check / cargo test on all touched crates (135 tests), cargo fmt --check, no new clippy warnings
bun test + tsc --noEmit for packages/rendering (18) and services/slackbotv2 (22)
Migrations 0001→0009 applied to a disposable Postgres 16; functional smoke test of the new guard: mismatched (task, run) raises does not belong, matched pair passes the guard and fails on the pre-existing must be running check
actionlint on the workflow change

mark_execution_running treated an already-running row as a successful claim, so a concurrent request with the same idempotency key could fall into the fallback fetch, see status=running, and send the same input to the sandbox a second time. It now returns ClaimExecutionResult with a claimed flag that is true only when this call did the queued->running transition; the runtime returns the existing execution without driving it when the claim was lost. Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85 Co-authored-by: Amp <amp@ampcode.com>

An HTTP secret parsed with an empty hosts list (empty tool-level default, hosts = [], or malformed hosts falling back to an empty default) translated to an empty iron-control rules array, leaving the credential host-unlimited. Both manifest parsers (centaur-perms and the api-server's tool discovery mirror) now fail closed, matching the brokered_token parser. Affected tools are warn-skipped at discovery.

await_event trusted the caller-provided (task_id, run_id) pair: a mismatched call could attach one task's run to a wait/checkpoint on another task and put the wrong task to sleep. Reject mismatches like get_task_checkpoint_states already does. Shipped as migration 0009 (create or replace) because 0007 is already applied in live environments and sqlx validates migration checksums.

A warm sandbox observed as Created is not ready for byte I/O and means the runtime regressed after the replenisher saw it running (backends wait for readiness before returning from create). Claiming it made the session fail at open_io; mark it failed and try the next one instead. Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85 Co-authored-by: Amp <amp@ampcode.com>

grant_inputs_to_role documented idempotency but always POSTed a new grant after upserting each secret, so re-running centaur-perms grants or startup role registration produced duplicate grants or conflicts. It now lists the role's existing grants once and reuses the grant for an already-granted secret.

The array helper never called mapper.flush(), so finite sources that end without an explicit terminal event lost buffered answer text and never received renderer.done. Flush like codexAppServerToChatSdkStream does; flush is a no-op when a terminal event already completed the stream.

serializeAttachment buffered every Slack attachment in memory and base64-inlined it with no size limit, letting one large upload blow request limits or OOM the process. Skip the download when Slack's size metadata exceeds 100 MiB and re-check the actual byte count after fetching; oversized attachments degrade through the existing fetchError channel. Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85 Co-authored-by: Amp <amp@ampcode.com>

One thread's corrupt state, lease error, or failed render propagated out of the recovery scan, so the remaining indexed obligations were never attempted until the next restart. Isolate each thread, log the failure, and count it as deferred so the capped-backoff retry loop revisits it.

cache-to type=registry pushes a cache manifest to GHCR even when image push is disabled, and fork PRs run with a read-only GITHUB_TOKEN, so their builds failed at cache export. Gate the registry cache-to on the same not-a-fork predicate as push.

Zygimantass force-pushed the api-rs-control-plane branch from e2d3db4 to eb0969b Compare June 10, 2026 14:35

Zygimantass and others added 9 commits June 10, 2026 16:53

Zygimantass force-pushed the pr-344-review-fixes branch from 9f7495e to 997469b Compare June 10, 2026 14:54

Zygimantass merged commit d695348 into api-rs-control-plane Jun 10, 2026
6 checks passed

Zygimantass deleted the pr-344-review-fixes branch June 10, 2026 14:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: P0 review fixes for the Rust control plane (#344 review)#472

fix: P0 review fixes for the Rust control plane (#344 review)#472
Zygimantass merged 9 commits into
api-rs-control-planefrom
pr-344-review-fixes

Zygimantass commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Zygimantass commented Jun 10, 2026

Correctness

Security

Bot / rendering

CI

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant