Summary
A single transient 1Password write failure during token rotation permanently bricks a brokered credential (observed with Claude Code OAuth). The broker consumes the old refresh token at the IdP before it has durably persisted the rotated replacement, and it never recovers: every subsequent refresh fails with invalid_grant and the credential is marked dead, requiring a manual human re-auth.
Environment
- Broker: ironsh/iron-token-broker:0.0.1-rc.2
- Store: 1Password service account (store.type: 1password), single long-lived broker process (Deployment, replicas=1).
- Credential: rotating OAuth refresh token (Anthropic Claude Code; token_endpoint: https://console.anthropic.com/v1/oauth/token).
Observed (broker logs)
WARN store Put failed, refresh did not persist credential_id=anthropic-claude
store=op://.../CLAUDE_CODE_BLOB/credential
error=1password Items.Get (pre-put) "...": an internal error occurred, ... : invalid client id
ERROR credential marked dead; human re-auth required credential_id=anthropic-claude reason=invalid_grant
Reproduces after the broker has been up for several hours: the first rotation write after a long idle period fails. Reads (Secrets.Resolve, exercised constantly by the proxy) never fail; only the rotation write path (Items.Get/Items.Put) does.
Root cause (two compounding issues)
- Bundled onepassword-sdk-go is < v0.1.7. v0.1.7 fixed "Using an SDK client in a long-running process no longer causes 401 server responses." The broker writes to 1Password only during a rotation (every several hours), so its SDK write client sits idle long enough to go stale and returns the 401-class "internal error ... invalid client id". Frequent Secrets.Resolve reads keep working, which is why this only ever bites the write path.
- Rotation is not crash-safe. The broker exchanges (and thereby consumes) the old refresh token at the IdP, then tries to persist the new one. If the Items.Put fails, the freshly minted refresh token is lost while the old one is already dead → the next refresh returns invalid_grant → the credential is marked dead. A single transient store-write error is therefore unrecoverable and silently takes down all turns using that credential until a human re-auths.
Requested fixes
- Bump onepassword-sdk-go to >= v0.1.7. Consider re-initializing the SDK client periodically (or per write) as defense-in-depth against stale long-lived clients.
- Make rotation crash-safe: keep the newly-minted refresh token in memory and retry persistence with backoff on a store-write failure instead of discarding it; only mark a credential dead on an actual IdP invalid_grant, not on a store write error.
- Retry the transient 1Password "internal error" class on the store path before giving up.
Current workaround
Restarting the broker every ~2h (k8s CronJob) keeps its SDK client young, so rotations always run with a fresh client. We also alert on the "credential marked dead" log line. Neither should be necessary once the first two requested fixes land.