fix(health-monitor): guard Keychain read + fix silent-fail task prompt #2548
Closed
alexli-77 wants to merge 21 commits into
Wire @chat-adapter/discord through the Chat SDK bridge so Discord messages flow into the standard channel pipeline. Reads DISCORD_BOT_TOKEN, DISCORD_PUBLIC_KEY, and DISCORD_APPLICATION_ID from .env at adapter construction.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
….0 → 4.27.0
Aligns with upstream channels branch recommendation. The chat dep had to be bumped together because the new adapter requires the new chat.processOptionsLoad type.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chore: sync with upstream + Discord adapter bump
Detects "can't run" level failures that the existing stuck-container detection misses: sessions that produce a processing_ack=completed but zero messages_out, the signature of a silent 401 auth failure.
- src/modules/health-monitor/setup.ts: idempotent DB bootstrap (agent group, messaging group for Discord keepalive channel, wiring, named destination)
- src/modules/health-monitor/checks.ts: checkSilentFail() (ack with no output in 2h window, container stopped) + checkTokenExpiry()
- src/modules/health-monitor/alert.ts: direct Discord REST alert to keepalive channel + task injection into health-monitor session
- src/modules/health-monitor/index.ts: 5-min timer, 1h dedup per issue key, startHealthMonitor() (must run after initDb)
- src/index.ts: MODULE-HOOK to start health-monitor after DB init
- src/modules/index.ts: import health-monitor module
Also adds pre-spawn OAuth token refresh from macOS Keychain in buildMounts() (container-runner.ts): it reads the 'Claude Code-credentials' keychain entry before every container spawn so tokens are always fresh. Wrapped in try/catch, no-op on non-macOS.
Upstream issues: nanocoai#730 (token expiry, macOS details added to comment), nanocoai#2492 (health-monitor feature proposal).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
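The silent-fail rule and the 1h dedup described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in checks.ts or index.ts: the `SessionRecord` shape, its field names, and the `AlertDeduper` class are hypothetical, and only the conditions themselves (ack completed, zero outbound messages, container stopped, 2h window, 1h dedup per issue key) come from the commit message.

```typescript
// Hypothetical record shape; the real DB schema is not shown in this PR.
interface SessionRecord {
  sessionId: string;
  processingAck: "completed" | "pending" | "failed";
  messagesOut: number;
  ackedAt: number; // epoch milliseconds
  containerRunning: boolean;
}

const SILENT_FAIL_WINDOW_MS = 2 * 60 * 60 * 1000; // 2h window from the commit message
const DEDUP_MS = 60 * 60 * 1000; // 1h dedup per issue key

// A session counts as a silent fail when it was acked as completed inside the
// window, produced no outbound messages, and its container is no longer running.
function checkSilentFail(sessions: SessionRecord[], now: number): SessionRecord[] {
  return sessions.filter(
    (s) =>
      s.processingAck === "completed" &&
      s.messagesOut === 0 &&
      !s.containerRunning &&
      now - s.ackedAt <= SILENT_FAIL_WINDOW_MS,
  );
}

// Suppresses repeat alerts for the same issue key within the dedup interval.
class AlertDeduper {
  private lastSent = new Map<string, number>();

  shouldAlert(issueKey: string, now: number): boolean {
    const prev = this.lastSent.get(issueKey);
    if (prev !== undefined && now - prev < DEDUP_MS) return false;
    this.lastSent.set(issueKey, now);
    return true;
  }
}
```

Keeping the check a pure function of the records and the current time makes the 2h and 1h boundaries easy to test without a timer or a database.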
feat(health-monitor): host-side silent-fail detection and operator alerting
When the health-monitor detects a token expiring within 60 min, it now attempts an automatic refresh via the Anthropic token endpoint before sending any alert:
- POST https://platform.claude.com/v1/oauth/token with the stored refresh_token (RFC 6749 form-encoded)
- On success: writes the new access_token + refresh_token to claude.json and updates macOS Keychain so the next pre-spawn read is also fresh; posts an "auto-refreshed" confirmation to the keepalive channel
- On failure: posts the original warning with the failure reason and instructions to run `claude login` manually
This means token expiry is now fully silent in the normal case; the only time the user gets an alert is when the refresh_token itself has expired (i.e., the user hasn't opened Claude Code in weeks).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
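The decision flow in this commit (refresh when expiry is within 60 min, then report success or failure) might look something like the sketch below. All names here (`expiresSoon`, `minutesLeft`, `alertMessage`, `RefreshResult`) and the message wording are hypothetical. The network call is deliberately left out, and clamping the minute count at zero is only one assumed way to avoid showing negative values.

```typescript
const REFRESH_THRESHOLD_MS = 60 * 60 * 1000; // "expiring within 60 min"

// Hypothetical result type for whatever performs the refresh request.
type RefreshResult = { ok: true } | { ok: false; reason: string };

// True when the token expires within the threshold (or has already expired).
function expiresSoon(expiresAtMs: number, nowMs: number): boolean {
  return expiresAtMs - nowMs <= REFRESH_THRESHOLD_MS;
}

// Whole minutes until expiry, clamped at zero so an expired token never
// displays a negative number.
function minutesLeft(expiresAtMs: number, nowMs: number): number {
  return Math.max(0, Math.floor((expiresAtMs - nowMs) / 60_000));
}

// Builds the keepalive-channel message for either refresh outcome.
function alertMessage(result: RefreshResult, minutes: number): string {
  if (result.ok) return "OAuth token auto-refreshed.";
  return `OAuth token expires in ${minutes} min and auto-refresh failed (${result.reason}). Run 'claude login' manually.`;
}
```

A caller would check `expiresSoon` first, attempt the refresh only when it returns true, and post `alertMessage` with the outcome.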
…ets Cloudflare 403
feat(health-monitor): auto-refresh OAuth token on expiry
…re; fix negative minutes display
… container-runner
- Extract core refresh logic to src/oauth-token-refresh.ts so both the health-monitor and the container spawner can use the same code
- health-monitor/token-refresh.ts is now a thin wrapper around the shared util
- In spawnContainer(), refresh the token before buildMounts() if it's expiring within 60 min; this fixes the shutdown case where a token expires while the host is off and a task fires immediately on boot before the health-monitor's 5-min check has a chance to run
- Also fixes the remaining main-branch copy of token-refresh.ts which still had the old form-encoded body (Cloudflare 403); shared util uses JSON
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…h util; pre-spawn refresh
fix(health-monitor): remove redundant injectTask; shared OAuth refresh util; pre-spawn refresh
- Migration 016: token_status table (one row per agent group, upserted
on each sweep — checked_at, expires_at, minutes_left, status, refreshed_at)
- src/db/token-status.ts: upsertTokenStatus() + getAllTokenStatuses()
- src/modules/health-monitor/token-sweep.ts: sweepAllTokens() — iterates
ALL groups, calls refreshOauthTokenIfNeeded() for each, writes results
to token_status. Returns array of results for alerting.
- Replaces per-alert token refresh in index.ts with sweepAllTokens().
Previously only groups that hit the 60-min alert threshold got checked;
now every group is checked and proactively refreshed every 5 minutes.
Fixes the root cause of the ag-1778266708996-ipsjnc (Terminal Agent)
stale-token issue: all groups are now covered regardless of which one
triggered the alert.
Query status any time:
pnpm exec tsx scripts/q.ts data/v2.db \
"SELECT agent_group_id, datetime(checked_at/1000,'unixepoch','localtime') \
as checked, minutes_left, status FROM token_status"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
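As a rough illustration of the sweep described above, the sketch below visits every group, calls an injected refresh function, and upserts one row per group. It is not the code in token-sweep.ts: an in-memory `Map` stands in for the token_status table, the function is synchronous for brevity, and the `SweepOutcome` type and the status values "ok", "refreshed" and "failed" are assumptions. Only the column names come from the commit message.

```typescript
// Mirrors the token_status columns named in the commit message.
interface TokenStatusRow {
  agentGroupId: string;
  checkedAt: number;
  expiresAt: number;
  minutesLeft: number;
  status: "ok" | "refreshed" | "failed"; // hypothetical status values
  refreshedAt: number | null;
}

// Hypothetical outcome reported by the per-group refresh call.
type SweepOutcome = { expiresAt: number; refreshed: boolean; error?: string };

// Checks every group, not only those near the alert threshold, and upserts
// exactly one row per group keyed by its id.
function sweepAllTokens(
  groupIds: string[],
  now: number,
  refreshIfNeeded: (groupId: string) => SweepOutcome,
  table: Map<string, TokenStatusRow>,
): TokenStatusRow[] {
  const results: TokenStatusRow[] = [];
  for (const id of groupIds) {
    const outcome = refreshIfNeeded(id);
    const prev = table.get(id);
    const row: TokenStatusRow = {
      agentGroupId: id,
      checkedAt: now,
      expiresAt: outcome.expiresAt,
      minutesLeft: Math.floor((outcome.expiresAt - now) / 60_000),
      status: outcome.error ? "failed" : outcome.refreshed ? "refreshed" : "ok",
      // Keep the previous refresh time when this sweep did not refresh.
      refreshedAt: outcome.refreshed ? now : (prev?.refreshedAt ?? null),
    };
    table.set(id, row); // upsert: insert or overwrite the group's single row
    results.push(row);
  }
  return results;
}
```

Returning the rows as well as storing them lets the caller decide which results warrant an alert.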
feat(health-monitor): token status table + sweep all groups every 5 min
…oding
HEALTH_MONITOR_DISCORD_GUILD_ID and HEALTH_MONITOR_KEEPALIVE_CHANNEL_ID are now read from the .env file. If missing, Discord wiring is skipped with a warning rather than failing silently with wrong-server IDs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
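The "skip with a warning when missing" behaviour could be sketched as below. This is an illustration only: `readDiscordWiring` and `DiscordWiring` are hypothetical names, and the function assumes the .env file has already been parsed into a plain key/value record. Only the two variable names come from the commit message.

```typescript
interface DiscordWiring {
  guildId: string;
  keepaliveChannelId: string;
}

// Returns the wiring config, or null (after a warning) when either ID is
// missing or blank, so the caller can skip Discord setup entirely.
function readDiscordWiring(
  env: Record<string, string | undefined>,
  warn: (message: string) => void = console.warn,
): DiscordWiring | null {
  const guildId = env["HEALTH_MONITOR_DISCORD_GUILD_ID"]?.trim();
  const keepaliveChannelId = env["HEALTH_MONITOR_KEEPALIVE_CHANNEL_ID"]?.trim();
  if (!guildId || !keepaliveChannelId) {
    warn(
      "health-monitor: HEALTH_MONITOR_DISCORD_GUILD_ID or " +
        "HEALTH_MONITOR_KEEPALIVE_CHANNEL_ID is not set; skipping Discord wiring",
    );
    return null;
  }
  return { guildId, keepaliveChannelId };
}
```

Returning null instead of a fallback value avoids wiring the monitor to a hardcoded server that may not belong to the operator.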
refactor(health-monitor): read Discord IDs from .env instead of hardcoding
… refresh
All agent groups share one macOS Keychain entry. When any group successfully refreshes, the refresh_token rotates and Keychain is updated, but other groups' claude.json files still carry the stale refresh_token. The next sweep would then attempt a refresh with the old RT and get rejected by Anthropic. syncFromKeychain() is now called before refreshOauthTokenIfNeeded() for each group, ensuring the latest refresh_token from Keychain is in place before the API call.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(health-monitor): sync claude.json from Keychain before each token refresh
…r on refresh
Three changes:
1. readKeychainOauth() reads the Keychain once before the group loop instead of once per group; all groups share the same entry.
2. syncOauthToFile() only writes if the Keychain token is newer (expiresAt comparison), preventing the stale snapshot from overwriting a just-refreshed file mid-sweep.
3. After a successful refresh, restartAgentGroupContainers() stops any running container so the next spawn reads the new token from the mounted claude.json. A Discord alert is posted on restart.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
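The "only write if the Keychain token is newer" rule from change 2 reduces to a comparison of expiry timestamps. The sketch below is a hypothetical illustration rather than the real syncOauthToFile(): the `OauthSnapshot` type and the `shouldSyncFromKeychain` helper are assumed names, and file and Keychain I/O are left out.

```typescript
// Hypothetical shape of the credentials held on disk and in the Keychain.
interface OauthSnapshot {
  accessToken: string;
  refreshToken: string;
  expiresAt: number; // epoch milliseconds
}

// True only when the Keychain copy is strictly newer than the on-disk copy.
// Equal expiry means the file is already current, so nothing is written.
function shouldSyncFromKeychain(
  onDisk: OauthSnapshot | null,
  fromKeychain: OauthSnapshot | null,
): boolean {
  if (fromKeychain === null) return false; // nothing to sync from
  if (onDisk === null) return true; // no file yet, so take the Keychain copy
  return fromKeychain.expiresAt > onDisk.expiresAt;
}
```

Using a strict comparison means a Keychain snapshot read before the group loop can never overwrite a file that was refreshed later in the same sweep.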
OWL_RADAR_CHANNEL_ID is now required in .env. OWL_RADAR_MANIFEST_URL and OWL_RADAR_PAGES_URL are optional with sensible defaults for the existing fork.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…token; fix silent-fail task prompt
- container-runner: only overwrite claude.json from Keychain when the Keychain token is strictly newer, preventing the Keychain read from rolling back a token that refreshOauthTokenIfNeeded just refreshed via the OAuth endpoint
- health-monitor: rewrite injectTask prompt to reference mounted log/data paths instead of raw macOS security commands (which fail inside a Linux container and triggered the agent's security refusal)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator
@alexli-77 Appreciate the PR but lots of unrelated changes here. Please reopen with one fix or feature per PR and clean focused change set.
Summary
- `buildMounts` now only overwrites `claude.json` when the Keychain token is strictly newer than what's already on disk. This prevents the pre-spawn Keychain read from rolling back a token that `refreshOauthTokenIfNeeded` refreshed via the OAuth endpoint moments earlier.
- Rewrites the `injectTask` prompt to reference mounted paths (`/workspace/extra/nanoclaw-logs/`, `/workspace/extra/nanoclaw-data/`) instead of raw macOS `security find-generic-password` commands, which (a) don't exist inside a Linux container and (b) were triggering the agent's security refusal.
Test plan
- … `refreshOauthTokenIfNeeded` refreshes it and the subsequent Keychain read does not revert it
- … `security find-generic-password` and the health-monitor agent processes it without refusing
🤖 Generated with Claude Code