fix: P0 review fixes for the Rust control plane (#344 review) #472
Merged
mark_execution_running treated an already-running row as a successful claim, so a concurrent request with the same idempotency key could fall into the fallback fetch, see status=running, and send the same input to the sandbox a second time. It now returns ClaimExecutionResult with a claimed flag that is true only when this call did the queued->running transition; the runtime returns the existing execution without driving it when the claim was lost.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>
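The claim semantics described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `ExecutionStore`, `Status`, and `enqueue` are hypothetical in-memory stand-ins, and the real transition presumably happens in the database. Only `mark_execution_running`, `ClaimExecutionResult`, and the `claimed` flag come from the commit message.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Queued,
    Running,
}

/// Result of a claim attempt. `claimed` is true only when this call
/// performed the queued -> running transition itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ClaimExecutionResult {
    pub status: Status,
    pub claimed: bool,
}

/// Hypothetical in-memory stand-in for the executions table.
#[derive(Default)]
pub struct ExecutionStore {
    rows: HashMap<String, Status>,
}

impl ExecutionStore {
    /// Inserts a new execution in the queued state.
    pub fn enqueue(&mut self, id: &str) {
        self.rows.insert(id.to_string(), Status::Queued);
    }

    /// Moves the row to running and reports whether this call won the claim.
    /// Returns `None` when the row does not exist.
    pub fn mark_execution_running(&mut self, id: &str) -> Option<ClaimExecutionResult> {
        let status = self.rows.get_mut(id)?;
        // An already-running row is NOT a successful claim.
        let claimed = *status == Status::Queued;
        *status = Status::Running;
        Some(ClaimExecutionResult { status: *status, claimed })
    }
}
```

A caller that gets `claimed == false` would return the existing execution without sending input to the sandbox again.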
An HTTP secret parsed with an empty hosts list (empty tool-level default, hosts = [], or malformed hosts falling back to an empty default) translated to an empty iron-control rules array, leaving the credential host-unlimited. Both manifest parsers (centaur-perms and the api-server's tool discovery mirror) now fail closed, matching the brokered_token parser. Affected tools are warn-skipped at discovery.
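A minimal sketch of the fail-closed rule, assuming a simplified parser shape. `HttpSecret`, `ManifestError`, and `parse_http_secret` are hypothetical names chosen for illustration; the actual parsers in `centaur-perms` and the api-server are not shown in this PR text. Treating an explicit empty list as an error even when a tool-level default exists is also an assumption of this sketch.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct HttpSecret {
    pub name: String,
    pub hosts: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ManifestError {
    /// The secret would otherwise translate to an empty rules array.
    EmptyHosts { secret: String },
}

/// `declared` is the per-secret hosts list (if any); `tool_default` is the
/// tool-level default used when the secret declares no list of its own.
pub fn parse_http_secret(
    name: &str,
    declared: Option<&[&str]>,
    tool_default: &[&str],
) -> Result<HttpSecret, ManifestError> {
    let hosts: Vec<String> = declared
        .unwrap_or(tool_default)
        .iter()
        .map(|h| h.trim().to_string())
        .filter(|h| !h.is_empty())
        .collect();

    // Fail closed: no hosts means no credential, never "all hosts".
    if hosts.is_empty() {
        return Err(ManifestError::EmptyHosts { secret: name.to_string() });
    }
    Ok(HttpSecret { name: name.to_string(), hosts })
}
```

A discovery loop built on this would log a warning and skip the tool on `Err`, rather than registering the secret without host restrictions.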
await_event trusted the caller-provided (task_id, run_id) pair: a mismatched call could attach one task's run to a wait/checkpoint on another task and put the wrong task to sleep. Reject mismatches like get_task_checkpoint_states already does. Shipped as migration 0009 (create or replace) because 0007 is already applied in live environments and sqlx validates migration checksums.
A warm sandbox observed as Created is not ready for byte I/O and means the runtime regressed after the replenisher saw it running (backends wait for readiness before returning from create). Claiming it made the session fail at open_io; mark it failed and try the next one instead.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>
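The claim loop described in this commit might look something like the sketch below. The pool representation (`VecDeque<WarmSandbox>`), the `SandboxState` enum, and the function name `claim_warm_sandbox` are assumptions made for illustration; the real warm pool is presumably backed by persistent state rather than an in-memory queue.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SandboxState {
    Created,
    Running,
    Failed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WarmSandbox {
    pub id: String,
    pub state: SandboxState,
}

/// Pops warm sandboxes until one is `Running`.
/// Returns the claimed sandbox (if any) and the ones marked failed on the way.
pub fn claim_warm_sandbox(
    pool: &mut VecDeque<WarmSandbox>,
) -> (Option<WarmSandbox>, Vec<WarmSandbox>) {
    let mut failed = Vec::new();
    while let Some(mut sandbox) = pool.pop_front() {
        match sandbox.state {
            SandboxState::Running => return (Some(sandbox), failed),
            // `Created` at claim time means the runtime regressed after the
            // replenisher recorded it: mark it failed and try the next one.
            SandboxState::Created | SandboxState::Failed => {
                sandbox.state = SandboxState::Failed;
                failed.push(sandbox);
            }
        }
    }
    (None, failed)
}
```

Returning the failed sandboxes alongside the claim lets the caller record or clean them up, instead of discovering the problem later at `open_io`.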
grant_inputs_to_role documented idempotency but always POSTed a new grant after upserting each secret, so re-running centaur-perms grants or startup role registration produced duplicate grants or conflicts. It now lists the role's existing grants once and reuses the grant for an already-granted secret.
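A rough sketch of the list-once-then-reuse pattern, assuming a toy in-memory API. `GrantApi`, `list_grants`, and `post_grant` are hypothetical stand-ins for whatever HTTP client the real code uses, and the secret upsert step mentioned above is omitted; only the name `grant_inputs_to_role` comes from the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the grants API.
#[derive(Default)]
pub struct GrantApi {
    next_id: u64,
    grants: Vec<(u64, String, String)>, // (grant_id, role, secret)
    pub list_calls: u32,
    pub post_calls: u32,
}

impl GrantApi {
    /// Returns secret -> grant_id for every grant the role already holds.
    fn list_grants(&mut self, role: &str) -> HashMap<String, u64> {
        self.list_calls += 1;
        self.grants
            .iter()
            .filter(|(_, r, _)| r.as_str() == role)
            .map(|(id, _, secret)| (secret.clone(), *id))
            .collect()
    }

    /// Creates a new grant and returns its id.
    fn post_grant(&mut self, role: &str, secret: &str) -> u64 {
        self.post_calls += 1;
        self.next_id += 1;
        self.grants.push((self.next_id, role.to_string(), secret.to_string()));
        self.next_id
    }
}

/// Lists the role's grants once, then POSTs only for secrets not yet granted.
pub fn grant_inputs_to_role(api: &mut GrantApi, role: &str, secrets: &[&str]) -> Vec<u64> {
    let mut existing = api.list_grants(role);
    let mut ids = Vec::with_capacity(secrets.len());
    for &secret in secrets {
        let id = match existing.get(secret).copied() {
            Some(id) => id, // already granted: reuse
            None => {
                let id = api.post_grant(role, secret);
                existing.insert(secret.to_string(), id);
                id
            }
        };
        ids.push(id);
    }
    ids
}
```

Running it twice with the same inputs returns the same grant ids and issues no additional POSTs on the second run.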
The array helper never called mapper.flush(), so finite sources that end without an explicit terminal event lost buffered answer text and never received renderer.done. Flush like codexAppServerToChatSdkStream does; flush is a no-op when a terminal event already completed the stream.
serializeAttachment buffered every Slack attachment in memory and base64-inlined it with no size limit, letting one large upload blow request limits or OOM the process. Skip the download when Slack's size metadata exceeds 100 MiB and re-check the actual byte count after fetching; oversized attachments degrade through the existing fetchError channel.

Amp-Thread-ID: https://ampcode.com/threads/T-019eb167-76de-7515-84f7-4265ce53ba85
Co-authored-by: Amp <amp@ampcode.com>
One thread's corrupt state, lease error, or failed render propagated out of the recovery scan, so the remaining indexed obligations were never attempted until the next restart. Isolate each thread, log the failure, and count it as deferred so the capped-backoff retry loop revisits it.
cache-to type=registry pushes a cache manifest to GHCR even when image push is disabled, and fork PRs run with a read-only GITHUB_TOKEN, so their builds failed at cache export. Gate the registry cache-to on the same not-a-fork predicate as push.
Fixes the low-effort/high-reward (P0) findings from the code review of #344, one commit per finding.
Correctness

- `mark_execution_running` treated an already-`running` row as a successful claim, so a concurrent request with the same idempotency key could send the same input to the sandbox twice. It now returns a `claimed` flag set only by the actual `queued`→`running` transition, and the runtime only drives the execution when it owns the claim.
- `absurd.await_event` task/run mismatch: the function never verified that `p_run_id` belongs to `p_task_id`, so a mismatched call could attach one task's run to another task's wait/checkpoint and put the wrong task to sleep. Added the same guard `get_task_checkpoint_states` already had. Shipped as migration 0009 (`create or replace`) rather than editing 0007, since 0007 is applied in live environments and sqlx validates migration checksums.
- `Created` sandboxes (not ready for byte I/O) were claimable and failed later at `open_io`. Since backends wait for readiness before the replenisher records a sandbox, `Created` at claim time means the runtime regressed: the sandbox is now marked failed and the next one is tried.
- `grant_inputs_to_role` documented idempotency but always POSTed a new grant; re-running `centaur-perms … grant` or startup role registration duplicated grants or conflicted. It now reuses the role's existing grant per secret.

Security

- An HTTP secret with an empty `hosts` list emitted an empty iron-control `rules` array, leaving the credential host-unlimited. Both manifest parsers (`centaur-perms` and the api-server tool-discovery mirror) now fail closed, matching the existing `brokered_token` behavior.
- `tools/infra/grafana/pyproject.toml` declares `hosts = []` and is the one in-repo tool relying on the old behavior: it is now warn-skipped at discovery and errors in `centaur-perms` until real hosts are declared. Someone who knows the deployment's Grafana/VictoriaMetrics hostnames should add them.

Bot / rendering

- `codexAppServerToRendererEvents` never called `mapper.flush()`, so finite sources ending without a terminal event lost buffered answer text and never got `renderer.done`. It now flushes like the streaming variant (regression test included, verified red-without-fix).
- `serializeAttachment` buffered every Slack attachment in memory and base64-inlined it with no size limit. Attachments are now capped at 100 MiB: checked against Slack's `size` metadata before downloading and re-checked on actual bytes; oversized attachments degrade via the existing `fetchError` channel.
- One thread's corrupt state, lease error, or failed render aborted the whole recovery scan. Each thread is now isolated, logged (`slackbotv2_render_recovery_thread_failed`), and deferred to the capped-backoff retry loop.

CI

- `cache-to: type=registry` pushes a cache manifest to GHCR even with `push: false`, and fork PRs run with a read-only token. The registry cache export is now gated on the same not-a-fork predicate as `push`.

Verification

- `cargo check` / `cargo test` on all touched crates (135 tests), `cargo fmt --check`, no new clippy warnings
- `bun test` + `tsc --noEmit` for `packages/rendering` (18) and `services/slackbotv2` (22)
- `absurd.await_event` guard: a mismatched pair is rejected with `does not belong`; a matched pair passes the guard and fails on the pre-existing `must be running` check
- `actionlint` on the workflow change