Skip to content

fix: P0 review fixes for the Rust control plane (#344 review)#472

Merged
Zygimantass merged 9 commits into
api-rs-control-planefrom
pr-344-review-fixes
Jun 10, 2026
Merged

fix: P0 review fixes for the Rust control plane (#344 review)#472
Zygimantass merged 9 commits into
api-rs-control-planefrom
pr-344-review-fixes

Conversation

@Zygimantass

Copy link
Copy Markdown
Member

Fixes the low-effort/high-reward (P0) findings from the code review of #344, one commit per finding.

Correctness

  • Execution claim racemark_execution_running treated an already-running row as a successful claim, so a concurrent request with the same idempotency key could send the same input to the sandbox twice. It now returns a claimed flag set only by the actual queued→running transition, and the runtime only drives the execution when it owns the claim.
  • absurd.await_event task/run mismatch — the function never verified p_run_id belongs to p_task_id, so a mismatched call could attach one task's run to another task's wait/checkpoint and put the wrong task to sleep. Added the same guard get_task_checkpoint_states already had. Shipped as migration 0009 (create or replace) rather than editing 0007, since 0007 is applied in live environments and sqlx validates migration checksums.
  • Warm pool handed out not-ready sandboxesCreated sandboxes (not ready for byte I/O) were claimable and failed later at open_io. Since backends wait for readiness before the replenisher records a sandbox, Created at claim time means the runtime regressed: it is now marked failed and the next one is tried.
  • Grant idempotencygrant_inputs_to_role documented idempotency but always POSTed a new grant; re-running centaur-perms … grant or startup role registration duplicated grants or conflicted. It now reuses the role's existing grant per secret.

Security

  • Empty hosts ⇒ unscoped secret — an HTTP tool secret parsed with an empty hosts list emitted an empty iron-control rules array, leaving the credential host-unlimited. Both manifest parsers (centaur-perms and the api-server tool-discovery mirror) now fail closed, matching the existing brokered_token behavior.
    • ⚠️ tools/infra/grafana/pyproject.toml declares hosts = [] and is the one in-repo tool relying on the old behavior: it is now warn-skipped at discovery and errors in centaur-perms until real hosts are declared. Someone who knows the deployment's Grafana/VictoriaMetrics hostnames should add them.

Bot / rendering

  • Lost final answerscodexAppServerToRendererEvents never called mapper.flush(), so finite sources ending without a terminal event lost buffered answer text and never got renderer.done. Now flushes like the streaming variant (regression test included, verified red-without-fix).
  • Unbounded attachment buffering — Slack attachments were downloaded into memory and base64-inlined with no limit. Now capped at 100 MiB, checked against Slack's size metadata before downloading and re-checked on actual bytes; oversized attachments degrade via the existing fetchError channel.
  • Recovery scan aborted on first failure — one corrupt obligation or render error stopped startup recovery for all remaining threads until the next restart. Each thread is now isolated, logged (slackbotv2_render_recovery_thread_failed), and deferred to the capped-backoff retry loop.

CI

  • Fork PR builds failed at cache exportcache-to: type=registry pushes a cache manifest to GHCR even with push: false, and fork PRs run with a read-only token. The registry cache export is now gated on the same not-a-fork predicate as push.

Verification

  • cargo check / cargo test on all touched crates (135 tests), cargo fmt --check, no new clippy warnings
  • bun test + tsc --noEmit for packages/rendering (18) and services/slackbotv2 (22)
  • Migrations 0001→0009 applied to a disposable Postgres 16; functional smoke test of the new guard: mismatched (task, run) raises does not belong, matched pair passes the guard and fails on the pre-existing must be running check
  • actionlint on the workflow change

@Zygimantass Zygimantass force-pushed the api-rs-control-plane branch from e2d3db4 to eb0969b Compare June 10, 2026 14:35
Zygimantass and others added 9 commits June 10, 2026 16:53
mark_execution_running treated an already-running row as a successful
claim, so a concurrent request with the same idempotency key could fall
into the fallback fetch, see status=running, and send the same input to
the sandbox a second time. It now returns ClaimExecutionResult with a
claimed flag that is true only when this call did the queued->running
transition; the runtime returns the existing execution without driving
it when the claim was lost.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>
An HTTP secret parsed with an empty hosts list (empty tool-level
default, hosts = [], or malformed hosts falling back to an empty
default) translated to an empty iron-control rules array, leaving the
credential host-unlimited. Both manifest parsers (centaur-perms and the
api-server's tool discovery mirror) now fail closed, matching the
brokered_token parser. Affected tools are warn-skipped at discovery.
await_event trusted the caller-provided (task_id, run_id) pair: a
mismatched call could attach one task's run to a wait/checkpoint on
another task and put the wrong task to sleep. Reject mismatches like
get_task_checkpoint_states already does. Shipped as migration 0009
(create or replace) because 0007 is already applied in live
environments and sqlx validates migration checksums.
A warm sandbox observed as Created is not ready for byte I/O and means
the runtime regressed after the replenisher saw it running (backends
wait for readiness before returning from create). Claiming it made the
session fail at open_io; mark it failed and try the next one instead.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>
grant_inputs_to_role documented idempotency but always POSTed a new
grant after upserting each secret, so re-running centaur-perms grants
or startup role registration produced duplicate grants or conflicts.
It now lists the role's existing grants once and reuses the grant for
an already-granted secret.
The array helper never called mapper.flush(), so finite sources that end
without an explicit terminal event lost buffered answer text and never
received renderer.done. Flush like codexAppServerToChatSdkStream does;
flush is a no-op when a terminal event already completed the stream.
serializeAttachment buffered every Slack attachment in memory and
base64-inlined it with no size limit, letting one large upload blow
request limits or OOM the process. Skip the download when Slack's size
metadata exceeds 100 MiB and re-check the actual byte count after
fetching; oversized attachments degrade through the existing fetchError
channel.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>
One thread's corrupt state, lease error, or failed render propagated
out of the recovery scan, so the remaining indexed obligations were
never attempted until the next restart. Isolate each thread, log the
failure, and count it as deferred so the capped-backoff retry loop
revisits it.
cache-to type=registry pushes a cache manifest to GHCR even when image
push is disabled, and fork PRs run with a read-only GITHUB_TOKEN, so
their builds failed at cache export. Gate the registry cache-to on the
same not-a-fork predicate as push.
@Zygimantass Zygimantass force-pushed the pr-344-review-fixes branch from 9f7495e to 997469b Compare June 10, 2026 14:54
@Zygimantass Zygimantass merged commit d695348 into api-rs-control-plane Jun 10, 2026
6 checks passed
@Zygimantass Zygimantass deleted the pr-344-review-fixes branch June 10, 2026 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant