fix: gate broker credentials to enabled harnesses only #398

Open
lyoungblood wants to merge 1 commit into paradigmxyz:main from moonwell-fi:gate-broker-creds-by-enabled-harness

@lyoungblood

Problem

ToolManager.collect_secrets() unconditionally appended a BrokeredTokenSecret for every (engine, auth_mode) in _HARNESS_SECRETS (anthropic-claude, openai-codex, …), so the iron-token-broker ConfigMap rendered by broker_config.render_broker_yaml() always carried credentials for harnesses a deployment may never run.

iron-token-broker (ironsh/iron-token-broker:0.0.1-rc.2) corrupts its shared 1Password Go-SDK client the first time it touches a credential whose referenced vault items are missing: the first item op succeeds, then every subsequent op fails with

an internal error occurred ... invalid client id

which breaks the rotation write path (Items.Get/Items.Put) for all credentials — including the one actually in use. Confirmed empirically: removing the phantom openai-codex credential made the broker healthy; re-adding it with missing vault items re-broke it.

The current workaround on the Moonwell deployment is to bootstrap placeholder OPENAI_CODEX_CLIENT_ID / OPENAI_CODEX_BLOB items in the ai-agents vault so resolution succeeds and codex is harmlessly marked unauthenticated.

Fix

Gate the harness loop in collect_secrets() on a new harness_config.enabled_harnesses() allowlist:

  • CENTAUR_DEFAULT_HARNESS is always enabled.
  • CENTAUR_ENABLED_HARNESSES (comma/whitespace-separated, same aliases as the default) adds any other harnesses the deployment can spawn.
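
The allowlist rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the alias table, the fallback default of claude-code, and the mapping from harness aliases to engine names are all assumptions made for the example.

```python
import os
import re

# Hypothetical alias table. The real aliases live in harness_config and are
# not shown in this PR; these entries are assumptions for illustration only.
_ALIASES = {
    "claude-code": "anthropic-claude",
    "claude": "anthropic-claude",
    "codex": "openai-codex",
}


def _normalize(name):
    """Lower-case a harness name and resolve it through the alias table."""
    key = name.strip().lower()
    return _ALIASES.get(key, key)


def enabled_harnesses(env=None):
    """Return the default harness plus any extras from CENTAUR_ENABLED_HARNESSES."""
    env = os.environ if env is None else env
    # Assumed fallback: claude-code when CENTAUR_DEFAULT_HARNESS is unset.
    default = _normalize(env.get("CENTAUR_DEFAULT_HARNESS", "claude-code"))
    raw = env.get("CENTAUR_ENABLED_HARNESSES", "")
    # Split on any run of commas and/or whitespace; drop empty tokens.
    extras = [tok for tok in re.split(r"[,\s]+", raw) if tok]
    return frozenset({default, *(_normalize(tok) for tok in extras)})
```

With an empty environment this sketch yields only the default engine; listing `codex` in CENTAUR_ENABLED_HARNESSES adds the second engine.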

The shared API-side iron-proxy and iron-token-broker now manage credentials only for harnesses the deployment can actually reach, so a credential with missing vault items for an unused harness is never emitted. Both auth-mode variants of each enabled engine are still emitted (the broker manages a credential regardless of which mode a sandbox currently uses).
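
A rough sketch of the gated loop, under stated assumptions: the `BrokeredTokenSecret` stand-in, the shape of `_HARNESS_SECRETS`, and the function name `collect_harness_secrets` are hypothetical simplifications, since the real definitions in tool_manager are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrokeredTokenSecret:
    # Hypothetical stand-in; the real class carries more fields.
    engine: str
    auth_mode: str


# Assumed shape: one entry per (engine, auth_mode) pair.
_HARNESS_SECRETS = [
    ("anthropic-claude", "access_token"),
    ("anthropic-claude", "api_key"),
    ("openai-codex", "access_token"),
    ("openai-codex", "api_key"),
]


def collect_harness_secrets(enabled):
    """Emit secrets only for enabled engines, keeping both auth-mode variants."""
    return [
        BrokeredTokenSecret(engine, auth_mode)
        for engine, auth_mode in _HARNESS_SECRETS
        if engine in enabled  # the gate: skip engines the deployment cannot reach
    ]
```

Filtering on the engine alone, rather than on the (engine, auth_mode) pair, is what keeps both variants of an enabled engine in the output.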

After this lands, the Moonwell deployment can run with claude-code only (the default) and delete the placeholder vault items.

Migration note ⚠️

This changes the set emitted by collect_secrets(). Previously every harness credential was managed; now only CENTAUR_DEFAULT_HARNESS is, unless CENTAUR_ENABLED_HARNESSES lists more. A deployment that can spawn more than one harness must enumerate them in CENTAUR_ENABLED_HARNESSES (api.enabledHarnesses) — otherwise a sandbox spawned on a non-enabled harness in access_token mode has no broker credential. (api_key mode is unaffected; it doesn't use the broker.)
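
One way to audit a deployment before upgrading is a small preflight check like the sketch below. The helper `unmanaged_harnesses` is hypothetical and not part of this PR; it compares names literally and does not resolve aliases, so pass names in the same form the environment variables use.

```python
import re


def unmanaged_harnesses(spawnable, env):
    """Return harnesses the deployment can spawn that would get no broker credential."""
    default = env.get("CENTAUR_DEFAULT_HARNESS", "")
    # Same comma/whitespace splitting as described for CENTAUR_ENABLED_HARNESSES.
    listed = re.split(r"[,\s]+", env.get("CENTAUR_ENABLED_HARNESSES", ""))
    enabled = {name for name in [default, *listed] if name}
    return sorted(set(spawnable) - enabled)
```

A non-empty result names the harnesses that would need to be added to CENTAUR_ENABLED_HARNESSES (api.enabledHarnesses) if they run in access_token mode.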

Changes

  • harness_config: add enabled_harnesses().
  • tool_manager: gate collect_secrets() on the enabled set.
  • chart: api.enabledHarnesses -> CENTAUR_ENABLED_HARNESSES (rendered only when non-empty); documented in configuration.md.
  • tests: default-harness-only deployments omit codex creds; explicitly enabled harnesses still emit them; broker config omits unenabled brokered creds; enabled_harnesses() parsing.

Testing

uv run pytest tests/test_harness_config.py tests/test_broker_config.py \
  tests/test_tool_manager.py::TestHarnessSecretSelection -q
# 31 passed

helm template renders CENTAUR_ENABLED_HARNESSES only when api.enabledHarnesses is non-empty. (One pre-existing, unrelated failure — test_tool_rest_router_lists_describes_and_invokes_tools, a 401 auth-env artifact present on main too.)

Upstream

The robust fix is broker-side: iron-token-broker should tolerate a credential whose vault items are missing without poisoning the shared SDK client used by every other credential. Worth reporting to ironsh/iron-token-broker; this PR is the deployment-side mitigation.

🤖 Generated with Claude Code

ToolManager.collect_secrets() unconditionally emitted a BrokeredTokenSecret
for every (engine, auth_mode) in _HARNESS_SECRETS, so the iron-token-broker
ConfigMap always carried both anthropic-claude and openai-codex regardless of
which harnesses a deployment actually runs.

iron-token-broker (0.0.1-rc.2) corrupts its shared 1Password Go-SDK client the
first time it touches a credential whose vault items are missing: the initial
item op succeeds, then every subsequent op fails with "an internal error
occurred ... invalid client id", breaking token rotation for ALL credentials
— including the one actually in use. A deployment that only runs claude-code
therefore had to bootstrap placeholder OPENAI_CODEX_* vault items just to keep
the broker healthy.

Gate the harness loop in collect_secrets() on enabled_harnesses(): the
CENTAUR_DEFAULT_HARNESS plus any engines listed in the new
CENTAUR_ENABLED_HARNESSES. The broker (and the shared API-side iron-proxy) now
only manage credentials for harnesses the deployment can actually reach, so
missing-vault-item poisoning can't happen for a harness you don't use. After
this lands, the placeholder vault items can be deleted.

- harness_config: add enabled_harnesses()
- tool_manager: gate collect_secrets() on the enabled set (both auth-mode
  variants of each enabled engine are still emitted)
- chart: api.enabledHarnesses -> CENTAUR_ENABLED_HARNESSES (rendered only when
  set); documented in configuration.md
- tests: default-harness-only deployments omit codex creds; explicitly enabled
  harnesses still emit them; broker config omits unenabled brokered creds

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@lyoungblood
Author

Reported the underlying broker bug upstream: ironsh/iron-proxy#176 (iron-token-broker's shared 1Password SDK client is poisoned by a credential with missing vault items). This PR is the deployment-side mitigation.
