fix: re-use cached Okta tokens across sessions instead of forcing device flow on every launch#43
Conversation
…stamp
The OktaAuthManager.is_valid_token() check used `time.time() - self.token_timestamp`
to decide if a cached keyring token was still valid. `token_timestamp` was a plain
dataclass field initialized to 0 and only updated when a token was minted in the
current process, so on every cold start `token_age` evaluated to ~1.7 trillion
seconds and the cached token was always treated as expired — defeating the keyring
cache entirely and forcing the Okta device authorization flow on every Claude Code
launch.
This change makes is_valid_token() read the JWT `exp` claim directly via
`jwt.decode(token, options={"verify_signature": False})`, with a 60-second safety
margin to absorb clock skew. Opaque tokens, malformed JWTs, and JWTs missing an
`exp` claim fall through to the existing refresh/reauth path so behavior is
unchanged for non-JWT auth servers.
Why JWT exp over persisting the timestamp:
- The token already carries the answer; a duplicate keyring entry would just be
another thing to keep in sync (TOCTOU smell).
- Persisting the timestamp would require picking a token-lifetime constant, which
was the original bug — Okta default is 3600s but custom auth servers vary.
- PyJWT is already a dependency.
Other changes in this commit:
- Removed the `token_timestamp` field and all four assignment sites (in
_browserless_authenticate, _poll_for_token, refresh_access_token, clear_tokens).
is_valid_token was the only reader, so the field is now dead state.
- Added a small public `has_token()` method so the lifespan handler can confirm a
token exists in the keyring without importing keyring/SERVICE_NAME directly.
- Removed the `expiry_duration` parameter from is_valid_token — the JWT exp claim
is authoritative; no callers passed it explicitly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… shutdown The okta_authorisation_flow lifespan handler had two bugs that defeated the keyring cache and forced a fresh OAuth device flow on every Claude Code launch: 1. authenticate() was called unconditionally on startup, with no check for an existing valid token. Even if a freshly cached token was sitting in the OS keyring from the previous session, the lifespan ignored it. 2. clear_tokens() was called in the finally block on shutdown, deleting both api_token and refresh_token from the keyring on every server exit. So even if bug 1 were fixed, the cache would have been empty by the time the next session started. This change: - Replaces the unconditional authenticate() call with an `is_valid_token()` gate. is_valid_token() encapsulates the full ladder (cached → refresh → reauth), so a single call is enough. - Adds a fail-fast check for the (rare) case where is_valid_token() returns False and the keyring is still empty afterwards. Without this guard, the device-flow path could yield a broken manager to tools because authenticate() in that path logs "Authentication failed" but doesn't sys.exit. Now both device flow and browserless behave consistently — sys.exit(1) on auth failure (the browserless side already does this at auth_manager.py:297). - Removes the clear_tokens() teardown call. The keyring is OS-managed, the OktaAuthManager dataclass holds no resources to release, and persistence across sessions is the desired behavior. clear_tokens() stays available on the public API for explicit logout. Tradeoffs considered and rejected: - Env-var toggle to keep the old clear_tokens-on-shutdown behavior: that behavior is the bug, not a feature; a toggle would just re-create it. - Importing keyring/SERVICE_NAME directly into server.py for the post-auth check: would leak storage details across the encapsulation boundary. The small public has_token() helper added to OktaAuthManager (committed in the previous commit) keeps server.py free of keyring knowledge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two new test files for the keyring-cache + lifespan-handler fix: tests/test_auth_manager.py (15 tests, organized into TestIsValidToken, TestTokenIsUnexpired, TestClearTokens): - Cold start with valid cached JWT skips auth entirely - Cold start with expired JWT runs refresh; refresh failure runs reauth - Cold start with no cached token runs reauth - Opaque tokens, malformed JWTs, and JWTs missing the exp claim all fall through to refresh/reauth (treated as expired, never raise) - Browserless flow with valid cached JWT skips auth; expired JWT triggers re-auth (not refresh — browserless has no refresh token) - 60-second clock-skew safety margin: 30s remaining = expired, 90s = valid - clear_tokens success and PasswordDeleteError paths - _token_is_unexpired direct edge case (empty string) tests/test_server_lifespan.py (4 tests in TestOktaAuthorisationFlow): - Lifespan handler skips authenticate when is_valid_token returns True - Lifespan handler does NOT call clear_tokens on teardown (prevents the original bug from regressing) - Fail-fast: sys.exit(1) when refresh+reauth still leave the keyring empty - Yields context after successful refresh / re-auth All mocks are at the import-site boundary (e.g. keyring is patched at okta_mcp_server.utils.auth.auth_manager.keyring). No real network, keyring, webbrowser, or time.sleep calls happen during test runs. JWTs are generated with HS256 since the implementation decodes with verify_signature=False — algorithm and signing key are irrelevant. Env vars (OKTA_ORG_URL, OKTA_CLIENT_ID) are set via autouse fixtures local to each new test file rather than in conftest.py, to avoid leaking into existing tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous lifespan handler logged "Re-using cached Okta token from keyring;
skipping interactive auth" whenever `is_valid_token()` returned True. But
`is_valid_token()` is side-effectful — on a cache miss it runs refresh and/or
re-authentication internally before returning, so a True return only signals
"a valid token now exists in the keyring," not "the cached token was already
valid." After a refresh or fresh device flow, the lifespan was logging a
message claiming cache reuse when in fact a network round trip (refresh) or
full interactive device flow had just completed.
The mirror-image issue: the `else` branch's "Okta authentication completed
(refresh or new auth)" log was unreachable in practice, since `is_valid_token`
returning False means the post-call keyring check is None — exactly the
condition that triggered the `sys.exit(1)` guard above the log.
Fix: split the two responsibilities. A new pure helper
`OktaAuthManager.is_cached_token_valid()` checks the keyring + JWT exp without
any side effects. The lifespan handler branches on it first:
- True → log "Re-using cached Okta token..." (now genuinely accurate)
- False → call the side-effectful `is_valid_token()` to drive refresh/re-auth
- if that also returns False → log error + sys.exit(1)
- otherwise log "Okta authentication completed..." (now reachable)
`has_token()` was added in the previous commit specifically for the lifespan
handler's old fail-fast check; with `is_valid_token()` now driven only on cache
miss, its boolean return doubles as the post-condition check, so `has_token()`
is removed. The behavior on auth failure (exit 1) is preserved.
Empirically verified end-to-end in a local Claude Code reconnect test: cache
hit logs "Re-using cached..." correctly; access-token expiry triggers refresh
which logs "Token refreshed successfully" → "Okta authentication completed
(refresh or new auth)"; and a deleted keyring triggers device flow that ends
in the same "completed" log.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Operator note: for the refresh-token path of this fix to work end-to-end, the Okta OAuth client itself must have The README's app-setup section doesn't currently call this out nor did the blog post. |
…eshed value get_okta_client() in client.py read api_token from the keyring BEFORE calling is_valid_token(). When is_valid_token() refreshes the token internally (via refresh_access_token()), the keyring is updated with a fresh token, but the api_token variable in get_okta_client still holds the pre-refresh value. The old re-read at line 22 only fires when is_valid_token() returns False, which after the JWT-exp fix only happens on hard auth failure. Symptom: ~1 hour after a successful device flow, when the access token expires mid-session, the next MCP tool call receives a 401/400 from Okta because client.py was passing the just-expired token to the Okta SDK while the freshly refreshed token sat unused in the keyring. Why now: this bug was always latent. Before the JWT-exp fix earlier in this branch, is_valid_token() always returned False on cold start due to the broken in-memory token_timestamp, so the if-branch always fired and client.py always re-read the keyring. After the JWT-exp fix, is_valid_token() returns True on cache hit / successful refresh, exposing the staleness. Fix: move keyring.get_password() to AFTER the is_valid_token() / authenticate() ladder so api_token always reflects the current keyring state. The pre-call read at the old line 18 was always pointless. Test: tests/test_client.py adds two tests for get_okta_client: - Regression test that fails on the buggy code: keyring returns a stale token before is_valid_token mutates it to a fresh value; the captured Okta client config must show the fresh token. - Happy-path sanity test: when is_valid_token returns True without mutation, the cached token is used and authenticate is not called. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Update: while testing this branch end-to-end, an additional bug surfaced ~1 hour after the original device flow — MCP tool calls began returning 401/400 errors from Okta. Investigation found a latent staleness bug in The original The original code's redundant re-read (after the now-removed Commit 07ae539 fixes it by moving the |
|
@tomekchime Thank you for your contribution! If you haven’t already, please sign the CLA using the link provided below for more details. Additionally, could you rebase the branch and test it again? Let me know once everything is complete. |
|
@tomekchime |
Summary
Fixes #42.
The okta-mcp-server triggers the OAuth device authorization flow on every server start (every new MCP-client session) instead of re-using the cached token from the OS keyring. This PR fixes the three bugs that combine to defeat the keyring cache, so subsequent launches use the cached access token (or refresh it via the cached refresh token) without user interaction.
Problem
See #42 for details.
Changes
ad41aef Replace
is_valid_token()'s broken in-memory timestamp check with a JWTexpclaim parser (60-second clock-skew safety margin). New_token_is_unexpired()helper handles opaque tokens, malformed JWTs, and JWTs missingexpby treating them as expired and falling through to the existing refresh / re-auth ladder. Remove thetoken_timestampfield and its four assignment sites; remove theexpiry_durationparameter fromis_valid_token(). Add ahas_token()helper for the lifespan handler.7c2e6a5 Rewrite the
okta_authorisation_flowlifespan handler: gateauthenticate()behindis_valid_token()instead of running it unconditionally; fail-fast (sys.exit(1)) when refresh + re-auth both leave the keyring empty (consistent with how the browserless flow already handles auth failure atauth_manager.py:297); remove theclear_tokens()call from the shutdown teardown so cached tokens survive across sessions.0389686 Add 19 tests across
tests/test_auth_manager.pyandtests/test_server_lifespan.pycovering cache hit, refresh path, re-auth path, opaque/malformed JWTs, browserless flow, 60-second safety-margin boundary,clear_tokenssuccess andPasswordDeleteErrorpaths, lifespan skip-on-cache-hit, lifespan no-clear-on-teardown, andsys.exit(1)on refresh+re-auth failure.5befe68 Replace the temporary
has_token()with a more useful pureis_cached_token_valid()helper and use it in the lifespan handler to distinguish a true cache hit from a successful refresh/re-auth, so theRe-using cached Okta token from keyringlog line is only emitted when the cache was genuinely re-used. Add 4 new tests foris_cached_token_valid(); update lifespan tests for the new branching.07ae539 Fix a latent bug in
get_okta_client()(utils/client.py) that this PR's earlier commits expose. The function readapi_tokenfrom the keyring BEFORE callingis_valid_token(), so whenis_valid_token()refreshes the token internally and returns True, the capturedapi_tokenvariable held the stale pre-refresh value while the fresh token sat unused in the keyring. Symptom: ~1 hour after a successful device flow, when the access token expires mid-session, the next MCP tool call returned a 401/400 from Okta. Fix: move thekeyring.get_password()call to AFTER theis_valid_token()/authenticate()ladder. Add 2 tests intests/test_client.py: a regression test that fails on the pre-fix behavior, and a happy-path sanity test.All mocking is at the import-site boundary (e.g.
okta_mcp_server.utils.auth.auth_manager.keyring). No real network, keyring, webbrowser, ortime.sleepcalls during test runs. JWTs are generated with HS256 since the implementation decodes withverify_signature=False— the algorithm and signing key are irrelevant. Env vars (OKTA_ORG_URL,OKTA_CLIENT_ID) are set via autouse fixtures local to the new files so existing tests that rely onFakeOktaAuthManagerare unaffected (tests/conftest.pyis unchanged).Verification
End-to-end tested locally against a real Okta org in a Claude Code session:
is_cached_token_valid()returns True → no auth flowexpsays expired →refresh_access_tokensucceeds → no device flow, no browseris_valid_token()refreshes internally,client.pyreads the freshly minted token from the keyring, MCP tool call succeeds against Okta (regression caught by 07ae539)Test plan
OKTA_ORG_URLandOKTA_CLIENT_IDRe-using cached Okta token from keyringToken refreshed successfully(provided the OAuth client hasrefresh_tokengrant) or device flow (otherwise)uv run pytest tests/ -v— 143 tests pass on the branch; existing 118 tests unchangeduv run ruff check . && uv run ruff format --check .— no new lint or format issues introduced (4 pre-existing copyright-header E501s are unchanged)