iron-token-broker: stale long-lived 1Password SDK client + non-atomic rotation bricks OAuth credential on one write failure

## Summary

A single transient 1Password **write** failure during token rotation permanently bricks a brokered credential (observed with Claude Code OAuth). The broker consumes the old refresh token at the IdP *before* it has durably persisted the rotated replacement, and it never recovers: every subsequent refresh fails with `invalid_grant`, and the credential is marked dead until a human re-authenticates manually.

## Environment

- `ironsh/iron-token-broker:0.0.1-rc.2`
- Store: 1Password service account (`store.type: 1password`), single long-lived broker process (Deployment, replicas=1).
- Credential: rotating OAuth refresh token (Anthropic Claude Code; `token_endpoint: https://console.anthropic.com/v1/oauth/token`).

## Observed (broker logs)

```
WARN  store Put failed, refresh did not persist  credential_id=anthropic-claude
      store=op://.../CLAUDE_CODE_BLOB/credential
      error=1password Items.Get (pre-put) "...": an internal error occurred, ... : invalid client id
ERROR credential marked dead; human re-auth required  credential_id=anthropic-claude  reason=invalid_grant
```

Reproduces after the broker has been up for several hours: the *first* rotation write after a long idle period fails. Reads (`Secrets.Resolve`, exercised constantly by the proxy) never fail; only the rotation **write** path (`Items.Get`/`Items.Put`) does.

## Root cause (two compounding issues)

1. **Bundled `onepassword-sdk-go` is < v0.1.7.** v0.1.7 fixed *"Using an SDK client in a long-running process no longer causes 401 server responses."* The broker writes to 1Password only during a rotation (every several hours), so its SDK **write** client sits idle long enough to go stale and returns the 401-class `internal error ... invalid client id`. Frequent `Secrets.Resolve` reads keep working, which is why this only ever bites the write.

2. **Rotation is not crash-safe.** The broker exchanges (and thereby consumes) the old refresh token at the IdP, then tries to persist the new one. If the store write fails (`Items.Put`, or the pre-put `Items.Get` as in the log above), the freshly-minted refresh token is lost while the old one is already dead → the next refresh fails with `invalid_grant` → the credential is marked dead. A single transient store-write error is therefore unrecoverable and silently takes down all turns using that credential until a human re-auths.

## Requested fixes

1. Bump `onepassword-sdk-go` to **>= v0.1.7**. Consider re-initializing the SDK client periodically (or per write) as defense-in-depth against stale long-lived clients.
2. Make rotation crash-safe: keep the newly-minted refresh token in memory and **retry persistence with backoff** on a store-write failure instead of discarding it; only mark a credential dead on an actual IdP `invalid_grant`, **not** on a store write error.
3. Retry the transient 1Password "internal error" class on the store path before giving up.

## Current workaround

Restarting the broker every ~2h (k8s CronJob) keeps its SDK client young, so rotations always run with a fresh client. We also alert on the `credential marked dead` log line. The restart shouldn't be necessary once fixes (1) and (2) land.

