Skip to content

fix: re-use cached Okta tokens across sessions instead of forcing device flow on every launch#43

Open
tomekchime wants to merge 5 commits into
okta:mainfrom
tomekchime:fix/cache-okta-tokens-across-sessions
Open

fix: re-use cached Okta tokens across sessions instead of forcing device flow on every launch#43
tomekchime wants to merge 5 commits into
okta:mainfrom
tomekchime:fix/cache-okta-tokens-across-sessions

Conversation

@tomekchime
Copy link
Copy Markdown

@tomekchime tomekchime commented Apr 30, 2026

Summary

Fixes #42.

The okta-mcp-server triggers the OAuth device authorization flow on every server start (every new MCP-client session) instead of re-using the cached token from the OS keyring. This PR fixes the three bugs that combine to defeat the keyring cache, so subsequent launches use the cached access token (or refresh it via the cached refresh token) without user interaction.

Problem

See #42 for details.

Changes

ad41aef Replace is_valid_token()'s broken in-memory timestamp check with a JWT exp claim parser (60-second clock-skew safety margin). New _token_is_unexpired() helper handles opaque tokens, malformed JWTs, and JWTs missing exp by treating them as expired and falling through to the existing refresh / re-auth ladder. Remove the token_timestamp field and its four assignment sites; remove the expiry_duration parameter from is_valid_token(). Add a has_token() helper for the lifespan handler.

7c2e6a5 Rewrite the okta_authorisation_flow lifespan handler: gate authenticate() behind is_valid_token() instead of running it unconditionally; fail-fast (sys.exit(1)) when refresh + re-auth both leave the keyring empty (consistent with how the browserless flow already handles auth failure at auth_manager.py:297); remove the clear_tokens() call from the shutdown teardown so cached tokens survive across sessions.

0389686 Add 19 tests across tests/test_auth_manager.py and tests/test_server_lifespan.py covering cache hit, refresh path, re-auth path, opaque/malformed JWTs, browserless flow, 60-second safety-margin boundary, clear_tokens success and PasswordDeleteError paths, lifespan skip-on-cache-hit, lifespan no-clear-on-teardown, and sys.exit(1) on refresh+re-auth failure.

5befe68 Replace the temporary has_token() with a more useful pure is_cached_token_valid() helper and use it in the lifespan handler to distinguish a true cache hit from a successful refresh/re-auth, so the Re-using cached Okta token from keyring log line is only emitted when the cache was genuinely re-used. Add 4 new tests for is_cached_token_valid(); update lifespan tests for the new branching.

07ae539 Fix a latent bug in get_okta_client() (utils/client.py) that this PR's earlier commits expose. The function read api_token from the keyring BEFORE calling is_valid_token(), so when is_valid_token() refreshes the token internally and returns True, the captured api_token variable held the stale pre-refresh value while the fresh token sat unused in the keyring. Symptom: ~1 hour after a successful device flow, when the access token expires mid-session, the next MCP tool call returned a 401/400 from Okta. Fix: move the keyring.get_password() call to AFTER the is_valid_token() / authenticate() ladder. Add 2 tests in tests/test_client.py: a regression test that fails on the pre-fix behavior, and a happy-path sanity test.

All mocking is at the import-site boundary (e.g. okta_mcp_server.utils.auth.auth_manager.keyring). No real network, keyring, webbrowser, or time.sleep calls during test runs. JWTs are generated with HS256 since the implementation decodes with verify_signature=False — the algorithm and signing key are irrelevant. Env vars (OKTA_ORG_URL, OKTA_CLIENT_ID) are set via autouse fixtures local to the new files so existing tests that rely on FakeOktaAuthManager are unaffected (tests/conftest.py is unchanged).

Verification

End-to-end tested locally against a real Okta org in a Claude Code session:

Scenario Connect time Result
First launch after deploy (keyring empty) ~8 sec Refresh fails (no refresh token yet), device flow runs, tokens stored
Second launch (cache hit on valid JWT) ~1.1 sec is_cached_token_valid() returns True → no auth flow
Third launch ~1 hour later (access token expired) ~2.2 sec JWT exp says expired → refresh_access_token succeeds → no device flow, no browser
Mid-session access-token expiry (long-running server, ~1h after startup) < 2 sec is_valid_token() refreshes internally, client.py reads the freshly minted token from the keyring, MCP tool call succeeds against Okta (regression caught by 07ae539)

Test plan

  • Reviewer pulls the branch, sets OKTA_ORG_URL and OKTA_CLIENT_ID
  • First launch: device flow runs (expected — keyring empty)
  • Second launch within ~1 hour: no device flow, no browser, fast connect, log shows Re-using cached Okta token from keyring
  • Wait until access token expires; reconnect: log shows Token refreshed successfully (provided the OAuth client has refresh_token grant) or device flow (otherwise)
  • Leave a session running for >1 hour, then make a tool call: tool call succeeds (validates the 07ae539 fix end-to-end)
  • uv run pytest tests/ -v — 143 tests pass on the branch; existing 118 tests unchanged
  • uv run ruff check . && uv run ruff format --check . — no new lint or format issues introduced (4 pre-existing copyright-header E501s are unchanged)

tomekchime and others added 4 commits April 30, 2026 15:00
…stamp

The OktaAuthManager.is_valid_token() check used `time.time() - self.token_timestamp`
to decide if a cached keyring token was still valid. `token_timestamp` was a plain
dataclass field initialized to 0 and only updated when a token was minted in the
current process, so on every cold start `token_age` evaluated to ~1.7 trillion
seconds and the cached token was always treated as expired — defeating the keyring
cache entirely and forcing the Okta device authorization flow on every Claude Code
launch.

This change makes is_valid_token() read the JWT `exp` claim directly via
`jwt.decode(token, options={"verify_signature": False})`, with a 60-second safety
margin to absorb clock skew. Opaque tokens, malformed JWTs, and JWTs missing an
`exp` claim fall through to the existing refresh/reauth path so behavior is
unchanged for non-JWT auth servers.

Why JWT exp over persisting the timestamp:
- The token already carries the answer; a duplicate keyring entry would just be
  another thing to keep in sync (TOCTOU smell).
- Persisting the timestamp would require picking a token-lifetime constant, which
  was the original bug — Okta default is 3600s but custom auth servers vary.
- PyJWT is already a dependency.

Other changes in this commit:
- Removed the `token_timestamp` field and all four assignment sites (in
  _browserless_authenticate, _poll_for_token, refresh_access_token, clear_tokens).
  is_valid_token was the only reader, so the field is now dead state.
- Added a small public `has_token()` method so the lifespan handler can confirm a
  token exists in the keyring without importing keyring/SERVICE_NAME directly.
- Removed the `expiry_duration` parameter from is_valid_token — the JWT exp claim
  is authoritative; no callers passed it explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… shutdown

The okta_authorisation_flow lifespan handler had two bugs that defeated the
keyring cache and forced a fresh OAuth device flow on every Claude Code launch:

1. authenticate() was called unconditionally on startup, with no check for an
   existing valid token. Even if a freshly cached token was sitting in the OS
   keyring from the previous session, the lifespan ignored it.
2. clear_tokens() was called in the finally block on shutdown, deleting both
   api_token and refresh_token from the keyring on every server exit. So even
   if bug 1 were fixed, the cache would have been empty by the time the next
   session started.

This change:
- Replaces the unconditional authenticate() call with an `is_valid_token()`
  gate. is_valid_token() encapsulates the full ladder (cached → refresh →
  reauth), so a single call is enough.
- Adds a fail-fast check for the (rare) case where is_valid_token() returns
  False and the keyring is still empty afterwards. Without this guard, the
  device-flow path could yield a broken manager to tools because authenticate()
  in that path logs "Authentication failed" but doesn't sys.exit. Now both
  device flow and browserless behave consistently — sys.exit(1) on auth
  failure (the browserless side already does this at auth_manager.py:297).
- Removes the clear_tokens() teardown call. The keyring is OS-managed, the
  OktaAuthManager dataclass holds no resources to release, and persistence
  across sessions is the desired behavior. clear_tokens() stays available
  on the public API for explicit logout.

Tradeoffs considered and rejected:
- Env-var toggle to keep the old clear_tokens-on-shutdown behavior: that
  behavior is the bug, not a feature; a toggle would just re-create it.
- Importing keyring/SERVICE_NAME directly into server.py for the post-auth
  check: would leak storage details across the encapsulation boundary. The
  small public has_token() helper added to OktaAuthManager (committed in
  the previous commit) keeps server.py free of keyring knowledge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two new test files for the keyring-cache + lifespan-handler fix:

tests/test_auth_manager.py (15 tests, organized into TestIsValidToken,
TestTokenIsUnexpired, TestClearTokens):
- Cold start with valid cached JWT skips auth entirely
- Cold start with expired JWT runs refresh; refresh failure runs reauth
- Cold start with no cached token runs reauth
- Opaque tokens, malformed JWTs, and JWTs missing the exp claim all fall
  through to refresh/reauth (treated as expired, never raise)
- Browserless flow with valid cached JWT skips auth; expired JWT triggers
  re-auth (not refresh — browserless has no refresh token)
- 60-second clock-skew safety margin: 30s remaining = expired, 90s = valid
- clear_tokens success and PasswordDeleteError paths
- _token_is_unexpired direct edge case (empty string)

tests/test_server_lifespan.py (4 tests in TestOktaAuthorisationFlow):
- Lifespan handler skips authenticate when is_valid_token returns True
- Lifespan handler does NOT call clear_tokens on teardown (prevents the
  original bug from regressing)
- Fail-fast: sys.exit(1) when refresh+reauth still leave the keyring empty
- Yields context after successful refresh / re-auth

All mocks are at the import-site boundary (e.g. keyring is patched at
okta_mcp_server.utils.auth.auth_manager.keyring). No real network, keyring,
webbrowser, or time.sleep calls happen during test runs. JWTs are generated
with HS256 since the implementation decodes with verify_signature=False —
algorithm and signing key are irrelevant.

Env vars (OKTA_ORG_URL, OKTA_CLIENT_ID) are set via autouse fixtures local
to each new test file rather than in conftest.py, to avoid leaking into
existing tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous lifespan handler logged "Re-using cached Okta token from keyring;
skipping interactive auth" whenever `is_valid_token()` returned True. But
`is_valid_token()` is side-effectful — on a cache miss it runs refresh and/or
re-authentication internally before returning, so a True return only signals
"a valid token now exists in the keyring," not "the cached token was already
valid." After a refresh or fresh device flow, the lifespan was logging a
message claiming cache reuse when in fact a network round trip (refresh) or
full interactive device flow had just completed.

The mirror-image issue: the `else` branch's "Okta authentication completed
(refresh or new auth)" log was unreachable in practice, since `is_valid_token`
returning False means the post-call keyring check is None — exactly the
condition that triggered the `sys.exit(1)` guard above the log.

Fix: split the two responsibilities. A new pure helper
`OktaAuthManager.is_cached_token_valid()` checks the keyring + JWT exp without
any side effects. The lifespan handler branches on it first:
  - True → log "Re-using cached Okta token..." (now genuinely accurate)
  - False → call the side-effectful `is_valid_token()` to drive refresh/re-auth
    - if that also returns False → log error + sys.exit(1)
    - otherwise log "Okta authentication completed..." (now reachable)

`has_token()` was added in the previous commit specifically for the lifespan
handler's old fail-fast check; with `is_valid_token()` now driven only on cache
miss, its boolean return doubles as the post-condition check, so `has_token()`
is removed. The behavior on auth failure (exit 1) is preserved.

Empirically verified end-to-end in a local Claude Code reconnect test: cache
hit logs "Re-using cached..." correctly; access-token expiry triggers refresh
which logs "Token refreshed successfully" → "Okta authentication completed
(refresh or new auth)"; and a deleted keyring triggers device flow that ends
in the same "completed" log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tomekchime
Copy link
Copy Markdown
Author

tomekchime commented Apr 30, 2026

Operator note: for the refresh-token path of this fix to work end-to-end, the Okta OAuth client itself must have refresh_token listed in its grant_types. Without it, Okta will not return a refresh token regardless of the offline_access scope being requested, and after the access token expires the server will fall back to running device flow again — defeating the cache from the user's perspective.

The README's app-setup section doesn't currently call this out nor did the blog post.

…eshed value

get_okta_client() in client.py read api_token from the keyring BEFORE calling
is_valid_token(). When is_valid_token() refreshes the token internally (via
refresh_access_token()), the keyring is updated with a fresh token, but the
api_token variable in get_okta_client still holds the pre-refresh value. The
old re-read at line 22 only fires when is_valid_token() returns False, which
after the JWT-exp fix only happens on hard auth failure.

Symptom: ~1 hour after a successful device flow, when the access token
expires mid-session, the next MCP tool call receives a 401/400 from Okta
because client.py was passing the just-expired token to the Okta SDK while
the freshly refreshed token sat unused in the keyring.

Why now: this bug was always latent. Before the JWT-exp fix earlier in this
branch, is_valid_token() always returned False on cold start due to the
broken in-memory token_timestamp, so the if-branch always fired and
client.py always re-read the keyring. After the JWT-exp fix, is_valid_token()
returns True on cache hit / successful refresh, exposing the staleness.

Fix: move keyring.get_password() to AFTER the is_valid_token() / authenticate()
ladder so api_token always reflects the current keyring state. The pre-call
read at the old line 18 was always pointless.

Test: tests/test_client.py adds two tests for get_okta_client:
- Regression test that fails on the buggy code: keyring returns a stale token
  before is_valid_token mutates it to a fresh value; the captured Okta client
  config must show the fresh token.
- Happy-path sanity test: when is_valid_token returns True without mutation,
  the cached token is used and authenticate is not called.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tomekchime
Copy link
Copy Markdown
Author

Update: while testing this branch end-to-end, an additional bug surfaced ~1 hour after the original device flow — MCP tool calls began returning 401/400 errors from Okta. Investigation found a latent staleness bug in utils/client.py:get_okta_client() that this PR's earlier commits expose:

The original get_okta_client() reads api_token from the keyring at the top of the function, then calls is_valid_token(). But is_valid_token() is side-effectful and refreshes the token internally on expiry, writing a fresh token to the keyring. The captured api_token variable still holds the pre-refresh value, and that's what gets passed to the Okta SDK.

The original code's redundant re-read (after the now-removed if not is_valid_token(): authenticate(); api_token = keyring.get_password(...) branch) was the masking factor — it only fired when is_valid_token() returned False, which after this PR's JWT-exp fix only happens on hard auth failure. So before this PR the if-branch always fired (because the broken in-memory token_timestamp always made is_valid_token() return False on cold start), accidentally papering over the staleness; after this PR, is_valid_token() correctly returns True on a successful internal refresh, exposing the bug.

Commit 07ae539 fixes it by moving the keyring.get_password() call to AFTER the validity check. New tests in tests/test_client.py include a regression test that fails on the pre-fix code (the captured Okta SDK config holds the stale token) and a happy-path sanity test for the cached-valid case.

@aniket-okta
Copy link
Copy Markdown
Contributor

@tomekchime Thank you for your contribution! If you haven’t already, please sign the CLA using the link provided below for more details. Additionally, could you rebase the branch and test it again? Let me know once everything is complete.
https://developer.okta.com/cla/

@BinoyOza-okta
Copy link
Copy Markdown

@tomekchime
Following up on the thread. Did you get a chance to complete the CLA process, rebase the branch, and test the changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OAuth device authorization flow runs on every server start, ignoring cached tokens in OS keyring

3 participants