fix: re-use cached Okta tokens across sessions instead of forcing device flow on every launch by tomekchime · Pull Request #43 · okta/okta-mcp-server

tomekchime · 2026-04-30T22:30:43Z

Summary

Fixes #42.

The okta-mcp-server triggers the OAuth device authorization flow on every server start (every new MCP-client session) instead of re-using the cached token from the OS keyring. This PR fixes the three bugs that combine to defeat the keyring cache, so subsequent launches use the cached access token (or refresh it via the cached refresh token) without user interaction.

Problem

See #42 for details.

Changes

ad41aef Replace is_valid_token()'s broken in-memory timestamp check with a JWT exp claim parser (60-second clock-skew safety margin). New _token_is_unexpired() helper handles opaque tokens, malformed JWTs, and JWTs missing exp by treating them as expired and falling through to the existing refresh / re-auth ladder. Remove the token_timestamp field and its four assignment sites; remove the expiry_duration parameter from is_valid_token(). Add a has_token() helper for the lifespan handler.

7c2e6a5 Rewrite the okta_authorisation_flow lifespan handler: gate authenticate() behind is_valid_token() instead of running it unconditionally; fail-fast (sys.exit(1)) when refresh + re-auth both leave the keyring empty (consistent with how the browserless flow already handles auth failure at auth_manager.py:297); remove the clear_tokens() call from the shutdown teardown so cached tokens survive across sessions.

0389686 Add 19 tests across tests/test_auth_manager.py and tests/test_server_lifespan.py covering cache hit, refresh path, re-auth path, opaque/malformed JWTs, browserless flow, 60-second safety-margin boundary, clear_tokens success and PasswordDeleteError paths, lifespan skip-on-cache-hit, lifespan no-clear-on-teardown, and sys.exit(1) on refresh+re-auth failure.

5befe68 Replace the temporary has_token() with a more useful pure is_cached_token_valid() helper and use it in the lifespan handler to distinguish a true cache hit from a successful refresh/re-auth, so the Re-using cached Okta token from keyring log line is only emitted when the cache was genuinely re-used. Add 4 new tests for is_cached_token_valid(); update lifespan tests for the new branching.

07ae539 Fix a latent bug in get_okta_client() (utils/client.py) that this PR's earlier commits expose. The function read api_token from the keyring BEFORE calling is_valid_token(), so when is_valid_token() refreshes the token internally and returns True, the captured api_token variable held the stale pre-refresh value while the fresh token sat unused in the keyring. Symptom: ~1 hour after a successful device flow, when the access token expires mid-session, the next MCP tool call returned a 401/400 from Okta. Fix: move the keyring.get_password() call to AFTER the is_valid_token() / authenticate() ladder. Add 2 tests in tests/test_client.py: a regression test that fails on the pre-fix behavior, and a happy-path sanity test.

All mocking is at the import-site boundary (e.g. okta_mcp_server.utils.auth.auth_manager.keyring). No real network, keyring, webbrowser, or time.sleep calls during test runs. JWTs are generated with HS256 since the implementation decodes with verify_signature=False — the algorithm and signing key are irrelevant. Env vars (OKTA_ORG_URL, OKTA_CLIENT_ID) are set via autouse fixtures local to the new files so existing tests that rely on FakeOktaAuthManager are unaffected (tests/conftest.py is unchanged).

Verification

End-to-end tested locally against a real Okta org in a Claude Code session:

Scenario	Connect time	Result
First launch after deploy (keyring empty)	~8 sec	Refresh fails (no refresh token yet), device flow runs, tokens stored
Second launch (cache hit on valid JWT)	~1.1 sec	`is_cached_token_valid()` returns True → no auth flow
Third launch ~1 hour later (access token expired)	~2.2 sec	JWT `exp` says expired → `refresh_access_token` succeeds → no device flow, no browser
Mid-session access-token expiry (long-running server, ~1h after startup)	< 2 sec	`is_valid_token()` refreshes internally, `client.py` reads the freshly minted token from the keyring, MCP tool call succeeds against Okta (regression caught by `07ae539`)

Test plan

Reviewer pulls the branch, sets OKTA_ORG_URL and OKTA_CLIENT_ID
First launch: device flow runs (expected — keyring empty)
Second launch within ~1 hour: no device flow, no browser, fast connect, log shows Re-using cached Okta token from keyring
Wait until access token expires; reconnect: log shows Token refreshed successfully (provided the OAuth client has refresh_token grant) or device flow (otherwise)
Leave a session running for >1 hour, then make a tool call: tool call succeeds (validates the 07ae539 fix end-to-end)
uv run pytest tests/ -v — 143 tests pass on the branch; existing 118 tests unchanged
uv run ruff check . && uv run ruff format --check . — no new lint or format issues introduced (4 pre-existing copyright-header E501s are unchanged)

…stamp The OktaAuthManager.is_valid_token() check used `time.time() - self.token_timestamp` to decide if a cached keyring token was still valid. `token_timestamp` was a plain dataclass field initialized to 0 and only updated when a token was minted in the current process, so on every cold start `token_age` evaluated to ~1.7 trillion seconds and the cached token was always treated as expired — defeating the keyring cache entirely and forcing the Okta device authorization flow on every Claude Code launch. This change makes is_valid_token() read the JWT `exp` claim directly via `jwt.decode(token, options={"verify_signature": False})`, with a 60-second safety margin to absorb clock skew. Opaque tokens, malformed JWTs, and JWTs missing an `exp` claim fall through to the existing refresh/reauth path so behavior is unchanged for non-JWT auth servers. Why JWT exp over persisting the timestamp: - The token already carries the answer; a duplicate keyring entry would just be another thing to keep in sync (TOCTOU smell). - Persisting the timestamp would require picking a token-lifetime constant, which was the original bug — Okta default is 3600s but custom auth servers vary. - PyJWT is already a dependency. Other changes in this commit: - Removed the `token_timestamp` field and all four assignment sites (in _browserless_authenticate, _poll_for_token, refresh_access_token, clear_tokens). is_valid_token was the only reader, so the field is now dead state. - Added a small public `has_token()` method so the lifespan handler can confirm a token exists in the keyring without importing keyring/SERVICE_NAME directly. - Removed the `expiry_duration` parameter from is_valid_token — the JWT exp claim is authoritative; no callers passed it explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… shutdown The okta_authorisation_flow lifespan handler had two bugs that defeated the keyring cache and forced a fresh OAuth device flow on every Claude Code launch: 1. authenticate() was called unconditionally on startup, with no check for an existing valid token. Even if a freshly cached token was sitting in the OS keyring from the previous session, the lifespan ignored it. 2. clear_tokens() was called in the finally block on shutdown, deleting both api_token and refresh_token from the keyring on every server exit. So even if bug 1 were fixed, the cache would have been empty by the time the next session started. This change: - Replaces the unconditional authenticate() call with an `is_valid_token()` gate. is_valid_token() encapsulates the full ladder (cached → refresh → reauth), so a single call is enough. - Adds a fail-fast check for the (rare) case where is_valid_token() returns False and the keyring is still empty afterwards. Without this guard, the device-flow path could yield a broken manager to tools because authenticate() in that path logs "Authentication failed" but doesn't sys.exit. Now both device flow and browserless behave consistently — sys.exit(1) on auth failure (the browserless side already does this at auth_manager.py:297). - Removes the clear_tokens() teardown call. The keyring is OS-managed, the OktaAuthManager dataclass holds no resources to release, and persistence across sessions is the desired behavior. clear_tokens() stays available on the public API for explicit logout. Tradeoffs considered and rejected: - Env-var toggle to keep the old clear_tokens-on-shutdown behavior: that behavior is the bug, not a feature; a toggle would just re-create it. - Importing keyring/SERVICE_NAME directly into server.py for the post-auth check: would leak storage details across the encapsulation boundary. The small public has_token() helper added to OktaAuthManager (committed in the previous commit) keeps server.py free of keyring knowledge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two new test files for the keyring-cache + lifespan-handler fix: tests/test_auth_manager.py (15 tests, organized into TestIsValidToken, TestTokenIsUnexpired, TestClearTokens): - Cold start with valid cached JWT skips auth entirely - Cold start with expired JWT runs refresh; refresh failure runs reauth - Cold start with no cached token runs reauth - Opaque tokens, malformed JWTs, and JWTs missing the exp claim all fall through to refresh/reauth (treated as expired, never raise) - Browserless flow with valid cached JWT skips auth; expired JWT triggers re-auth (not refresh — browserless has no refresh token) - 60-second clock-skew safety margin: 30s remaining = expired, 90s = valid - clear_tokens success and PasswordDeleteError paths - _token_is_unexpired direct edge case (empty string) tests/test_server_lifespan.py (4 tests in TestOktaAuthorisationFlow): - Lifespan handler skips authenticate when is_valid_token returns True - Lifespan handler does NOT call clear_tokens on teardown (prevents the original bug from regressing) - Fail-fast: sys.exit(1) when refresh+reauth still leave the keyring empty - Yields context after successful refresh / re-auth All mocks are at the import-site boundary (e.g. keyring is patched at okta_mcp_server.utils.auth.auth_manager.keyring). No real network, keyring, webbrowser, or time.sleep calls happen during test runs. JWTs are generated with HS256 since the implementation decodes with verify_signature=False — algorithm and signing key are irrelevant. Env vars (OKTA_ORG_URL, OKTA_CLIENT_ID) are set via autouse fixtures local to each new test file rather than in conftest.py, to avoid leaking into existing tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous lifespan handler logged "Re-using cached Okta token from keyring; skipping interactive auth" whenever `is_valid_token()` returned True. But `is_valid_token()` is side-effectful — on a cache miss it runs refresh and/or re-authentication internally before returning, so a True return only signals "a valid token now exists in the keyring," not "the cached token was already valid." After a refresh or fresh device flow, the lifespan was logging a message claiming cache reuse when in fact a network round trip (refresh) or full interactive device flow had just completed. The mirror-image issue: the `else` branch's "Okta authentication completed (refresh or new auth)" log was unreachable in practice, since `is_valid_token` returning False means the post-call keyring check is None — exactly the condition that triggered the `sys.exit(1)` guard above the log. Fix: split the two responsibilities. A new pure helper `OktaAuthManager.is_cached_token_valid()` checks the keyring + JWT exp without any side effects. The lifespan handler branches on it first: - True → log "Re-using cached Okta token..." (now genuinely accurate) - False → call the side-effectful `is_valid_token()` to drive refresh/re-auth - if that also returns False → log error + sys.exit(1) - otherwise log "Okta authentication completed..." (now reachable) `has_token()` was added in the previous commit specifically for the lifespan handler's old fail-fast check; with `is_valid_token()` now driven only on cache miss, its boolean return doubles as the post-condition check, so `has_token()` is removed. The behavior on auth failure (exit 1) is preserved. Empirically verified end-to-end in a local Claude Code reconnect test: cache hit logs "Re-using cached..." correctly; access-token expiry triggers refresh which logs "Token refreshed successfully" → "Okta authentication completed (refresh or new auth)"; and a deleted keyring triggers device flow that ends in the same "completed" log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tomekchime · 2026-04-30T22:33:59Z

Operator note: for the refresh-token path of this fix to work end-to-end, the Okta OAuth client itself must have refresh_token listed in its grant_types. Without it, Okta will not return a refresh token regardless of the offline_access scope being requested, and after the access token expires the server will fall back to running device flow again — defeating the cache from the user's perspective.

The README's app-setup section doesn't currently call this out nor did the blog post.

…eshed value get_okta_client() in client.py read api_token from the keyring BEFORE calling is_valid_token(). When is_valid_token() refreshes the token internally (via refresh_access_token()), the keyring is updated with a fresh token, but the api_token variable in get_okta_client still holds the pre-refresh value. The old re-read at line 22 only fires when is_valid_token() returns False, which after the JWT-exp fix only happens on hard auth failure. Symptom: ~1 hour after a successful device flow, when the access token expires mid-session, the next MCP tool call receives a 401/400 from Okta because client.py was passing the just-expired token to the Okta SDK while the freshly refreshed token sat unused in the keyring. Why now: this bug was always latent. Before the JWT-exp fix earlier in this branch, is_valid_token() always returned False on cold start due to the broken in-memory token_timestamp, so the if-branch always fired and client.py always re-read the keyring. After the JWT-exp fix, is_valid_token() returns True on cache hit / successful refresh, exposing the staleness. Fix: move keyring.get_password() to AFTER the is_valid_token() / authenticate() ladder so api_token always reflects the current keyring state. The pre-call read at the old line 18 was always pointless. Test: tests/test_client.py adds two tests for get_okta_client: - Regression test that fails on the buggy code: keyring returns a stale token before is_valid_token mutates it to a fresh value; the captured Okta client config must show the fresh token. - Happy-path sanity test: when is_valid_token returns True without mutation, the cached token is used and authenticate is not called. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tomekchime · 2026-05-01T15:24:58Z

Update: while testing this branch end-to-end, an additional bug surfaced ~1 hour after the original device flow — MCP tool calls began returning 401/400 errors from Okta. Investigation found a latent staleness bug in utils/client.py:get_okta_client() that this PR's earlier commits expose:

The original get_okta_client() reads api_token from the keyring at the top of the function, then calls is_valid_token(). But is_valid_token() is side-effectful and refreshes the token internally on expiry, writing a fresh token to the keyring. The captured api_token variable still holds the pre-refresh value, and that's what gets passed to the Okta SDK.

The original code's redundant re-read (after the now-removed if not is_valid_token(): authenticate(); api_token = keyring.get_password(...) branch) was the masking factor — it only fired when is_valid_token() returned False, which after this PR's JWT-exp fix only happens on hard auth failure. So before this PR the if-branch always fired (because the broken in-memory token_timestamp always made is_valid_token() return False on cold start), accidentally papering over the staleness; after this PR, is_valid_token() correctly returns True on a successful internal refresh, exposing the bug.

Commit 07ae539 fixes it by moving the keyring.get_password() call to AFTER the validity check. New tests in tests/test_client.py include a regression test that fails on the pre-fix code (the captured Okta SDK config holds the stale token) and a happy-path sanity test for the cached-valid case.

aniket-okta · 2026-05-12T04:55:13Z

@tomekchime Thank you for your contribution! If you haven’t already, please sign the CLA using the link provided below for more details. Additionally, could you rebase the branch and test it again? Let me know once everything is complete.
https://developer.okta.com/cla/

BinoyOza-okta · 2026-05-19T05:28:02Z

@tomekchime
Following up on the thread. Did you get a chance to complete the CLA process, rebase the branch, and test the changes?

tomekchime and others added 4 commits April 30, 2026 15:00

manmohan-shaw-okta requested a review from aniket-okta May 7, 2026 04:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: re-use cached Okta tokens across sessions instead of forcing device flow on every launch#43

fix: re-use cached Okta tokens across sessions instead of forcing device flow on every launch#43
tomekchime wants to merge 5 commits into
okta:mainfrom
tomekchime:fix/cache-okta-tokens-across-sessions

tomekchime commented Apr 30, 2026 •

edited

Loading

Uh oh!

tomekchime commented Apr 30, 2026 •

edited

Loading

Uh oh!

tomekchime commented May 1, 2026

Uh oh!

aniket-okta commented May 12, 2026

Uh oh!

BinoyOza-okta commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tomekchime commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Changes

Verification

Test plan

Uh oh!

tomekchime commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tomekchime commented May 1, 2026

Uh oh!

aniket-okta commented May 12, 2026

Uh oh!

BinoyOza-okta commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tomekchime commented Apr 30, 2026 •

edited

Loading

tomekchime commented Apr 30, 2026 •

edited

Loading