iron-token-broker: stale long-lived 1Password SDK client + non-atomic rotation bricks OAuth credential on one write failure #178

@lyoungblood

Summary

A single transient 1Password write failure during token rotation permanently bricks a brokered credential (observed with Claude Code OAuth). The broker consumes the old refresh token at the IdP before it has durably persisted the rotated replacement, and it never recovers: every subsequent refresh fails with invalid_grant and the credential is marked dead, requiring manual human re-auth.

Environment

  • ironsh/iron-token-broker:0.0.1-rc.2
  • Store: 1Password service account (store.type: 1password), single long-lived broker process (Deployment, replicas=1).
  • Credential: rotating OAuth refresh token (Anthropic Claude Code; token_endpoint: https://console.anthropic.com/v1/oauth/token).

Observed (broker logs)

WARN  store Put failed, refresh did not persist  credential_id=anthropic-claude
      store=op://.../CLAUDE_CODE_BLOB/credential
      error=1password Items.Get (pre-put) "...": an internal error occurred, ... : invalid client id
ERROR credential marked dead; human re-auth required  credential_id=anthropic-claude  reason=invalid_grant

Reproduces after the broker has been up for several hours: the first rotation write after a long idle period fails. Reads (Secrets.Resolve, exercised constantly by the proxy) never fail; only the rotation write path (Items.Get/Items.Put) does.

Root cause (two compounding issues)

  1. Bundled onepassword-sdk-go is < v0.1.7. v0.1.7 fixed "Using an SDK client in a long-running process no longer causes 401 server responses." The broker writes to 1Password only during a rotation (every several hours), so its SDK write client sits idle long enough to go stale and returns the 401-class internal error ... invalid client id. Frequent Secrets.Resolve reads keep working, which is why this only ever bites the write.

  2. Rotation is not crash-safe. The broker exchanges (and thereby consumes) the old refresh token at the IdP, then tries to persist the new one. If the Items.Put fails, the freshly minted refresh token is lost while the old one is already dead, so the next refresh returns invalid_grant and the credential is marked dead. A single transient store-write error is therefore unrecoverable and silently takes down all turns using that credential until a human re-authenticates.

Requested fixes

  1. Bump onepassword-sdk-go to >= v0.1.7. Consider re-initializing the SDK client periodically (or per write) as defense-in-depth against stale long-lived clients.
  2. Make rotation crash-safe: keep the newly-minted refresh token in memory and retry persistence with backoff on a store-write failure instead of discarding it; only mark a credential dead on an actual IdP invalid_grant, not on a store write error.
  3. Retry the transient 1Password "internal error" class on the store path before giving up.

Current workaround

Restarting the broker every ~2h (k8s CronJob) keeps its SDK client young so rotations always run with a fresh client, combined with an alert on credential marked dead. This shouldn't be necessary once (1)/(2) land.
