fix(health-monitor): guard Keychain read + fix silent-fail task prompt #2548

Closed
alexli-77 wants to merge 21 commits into nanocoai:main from alexli-77:fix/health-monitor-keychain-and-task-prompt

Conversation

@alexli-77

Summary

  • container-runner: Keychain read in buildMounts now only overwrites claude.json when the Keychain token is strictly newer than what's already on disk. Prevents the post-spawn Keychain read from rolling back a token that refreshOauthTokenIfNeeded just refreshed via the OAuth endpoint moments earlier.
  • health-monitor: Rewrote the injectTask prompt to reference mounted paths (/workspace/extra/nanoclaw-logs/, /workspace/extra/nanoclaw-data/) instead of raw macOS security find-generic-password commands, which (a) don't exist inside a Linux container and (b) were triggering the agent's security refusal.
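
The "strictly newer" guard in the first bullet can be sketched as a small comparison function. This is only an illustration: the `OauthCredentials` shape and the name `shouldOverwriteFromKeychain` are assumptions, not the identifiers used in container-runner.

```typescript
// Hypothetical credential shape; the real claude.json / Keychain payload may differ.
interface OauthCredentials {
  accessToken: string;
  refreshToken: string;
  expiresAt: number; // epoch milliseconds
}

/**
 * Returns true only when the Keychain copy expires strictly later than the
 * copy already on disk. Equal timestamps leave the on-disk copy untouched,
 * so a token refreshed moments earlier is never rolled back.
 */
function shouldOverwriteFromKeychain(
  keychain: OauthCredentials | null,
  onDisk: OauthCredentials | null,
): boolean {
  if (keychain === null) return false; // nothing to copy
  if (onDisk === null) return true; // nothing on disk to protect
  return keychain.expiresAt > onDisk.expiresAt;
}
```

Using a strict `>` rather than `>=` means a Keychain entry that merely matches the file never triggers a write.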

Test plan

  • Spawn a container for an agent whose token is near-expiry — verify refreshOauthTokenIfNeeded refreshes it and the subsequent Keychain read does not revert it
  • Trigger a silent-fail detection manually — verify the injected task no longer contains security find-generic-password and the health-monitor agent processes it without refusing

🤖 Generated with Claude Code

alexli-77 and others added 21 commits May 8, 2026 23:54
Wire @chat-adapter/discord through the Chat SDK bridge so Discord
messages flow into the standard channel pipeline. Reads
DISCORD_BOT_TOKEN, DISCORD_PUBLIC_KEY, and DISCORD_APPLICATION_ID
from .env at adapter construction.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
….0 → 4.27.0

Aligns with upstream channels branch recommendation. The chat dep had to be
bumped together because the new adapter requires the new chat.processOptionsLoad
type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chore: sync with upstream + Discord adapter bump
Detects "can't run" level failures that the existing stuck-container
detection misses: sessions that produce a processing_ack=completed but
zero messages_out — the signature of a silent 401 auth failure.
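
As a rough illustration of that signature, the predicate below combines the conditions this commit lists for `checkSilentFail()` (completed ack, zero outbound messages, 2h window, container stopped). The `SessionSummary` type and its field names are hypothetical; the actual check presumably queries the database.

```typescript
// Hypothetical per-session summary; field names are illustrative only.
interface SessionSummary {
  sessionId: string;
  processingAck: "pending" | "completed" | "failed";
  messagesOut: number;
  ackAt: number; // epoch milliseconds of the ack
  containerRunning: boolean;
}

const SILENT_FAIL_WINDOW_MS = 2 * 60 * 60 * 1000; // 2h window from the commit message

/** True when a session acked as completed but produced no output. */
function isSilentFail(s: SessionSummary, now: number): boolean {
  const inWindow = s.ackAt <= now && now - s.ackAt <= SILENT_FAIL_WINDOW_MS;
  return (
    s.processingAck === "completed" &&
    s.messagesOut === 0 &&
    inWindow &&
    !s.containerRunning
  );
}
```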

- src/modules/health-monitor/setup.ts: idempotent DB bootstrap (agent
  group, messaging group for Discord keepalive channel, wiring,
  named destination)
- src/modules/health-monitor/checks.ts: checkSilentFail() (ack with
  no output in 2h window, container stopped) + checkTokenExpiry()
- src/modules/health-monitor/alert.ts: direct Discord REST alert to
  keepalive channel + task injection into health-monitor session
- src/modules/health-monitor/index.ts: 5-min timer, 1h dedup per
  issue key, startHealthMonitor() (must run after initDb)
- src/index.ts: MODULE-HOOK to start health-monitor after DB init
- src/modules/index.ts: import health-monitor module
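
The "1h dedup per issue key" mentioned for index.ts could look roughly like the following in-memory sketch. `AlertDeduper` is a hypothetical name, and whether the real module keeps this state in memory or in the database is not stated in the commit.

```typescript
const DEDUP_WINDOW_MS = 60 * 60 * 1000; // 1h, per the commit message

/** Remembers when each issue key last alerted and suppresses repeats. */
class AlertDeduper {
  private readonly lastSent = new Map<string, number>();

  /** Returns true if an alert for this key should go out now, and records it. */
  shouldSend(issueKey: string, now: number): boolean {
    const prev = this.lastSent.get(issueKey);
    if (prev !== undefined && now - prev < DEDUP_WINDOW_MS) {
      return false; // same issue alerted less than an hour ago
    }
    this.lastSent.set(issueKey, now);
    return true;
  }
}
```

With a 5-minute timer, this keeps a persistent issue from posting more than one alert per hour.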

Also adds pre-spawn OAuth token refresh from macOS Keychain in
buildMounts() (container-runner.ts) — reads 'Claude Code-credentials'
keychain entry before every container spawn so tokens are always
fresh. Wrapped in try/catch, no-op on non-macOS.

Upstream issues: nanocoai#730 (token expiry, macOS details added to comment),
nanocoai#2492 (health-monitor feature proposal).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(health-monitor): host-side silent-fail detection and operator alerting
When the health-monitor detects a token expiring within 60 min, it now
attempts an automatic refresh via the Anthropic token endpoint before
sending any alert:
- POST https://platform.claude.com/v1/oauth/token with the stored
  refresh_token (RFC 6749 form-encoded)
- On success: writes the new access_token + refresh_token to claude.json
  and updates macOS Keychain so the next pre-spawn read is also fresh;
  posts an "auto-refreshed" confirmation to the keepalive channel
- On failure: posts the original warning with the failure reason and
  instructions to run `claude login` manually
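
A minimal sketch of the two pure pieces of this flow: the 60-minute threshold check and the RFC 6749 form-encoded refresh body. The function names are hypothetical, and the real endpoint may require fields beyond the two shown here (a client identifier, for example), so treat the body as an assumption. A later commit in this PR reports that the form-encoded request was rejected with a Cloudflare 403 and that the shared util sends JSON instead.

```typescript
const REFRESH_THRESHOLD_MS = 60 * 60 * 1000; // refresh when under 60 min remain

/** True when the token expires within the threshold (or has already expired). */
function isExpiringSoon(expiresAt: number, now: number): boolean {
  return expiresAt - now <= REFRESH_THRESHOLD_MS;
}

/**
 * Builds an application/x-www-form-urlencoded body for the refresh_token
 * grant described in RFC 6749 section 6. Only the two standard fields are
 * included; anything else the server expects is out of scope for this sketch.
 */
function buildRefreshBody(refreshToken: string): string {
  const fields: Array<[string, string]> = [
    ["grant_type", "refresh_token"],
    ["refresh_token", refreshToken],
  ];
  return fields
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
}
```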

This means token expiry is now fully silent in the normal case — the
only time the user gets an alert is when the refresh_token itself has
expired (i.e., the user hasn't opened Claude Code in weeks).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(health-monitor): auto-refresh OAuth token on expiry
… container-runner

- Extract core refresh logic to src/oauth-token-refresh.ts so both the
  health-monitor and the container spawner can use the same code
- health-monitor/token-refresh.ts is now a thin wrapper around the shared util
- In spawnContainer(), refresh the token before buildMounts() if it's
  expiring within 60 min — fixes the shutdown case where a token expires
  while the host is off and a task fires immediately on boot before the
  health-monitor's 5-min check has a chance to run
- Also fixes the remaining main-branch copy of token-refresh.ts which still
  had the old form-encoded body (Cloudflare 403); shared util uses JSON

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…h util; pre-spawn refresh

fix(health-monitor): remove redundant injectTask; shared OAuth refresh util; pre-spawn refresh
- Migration 016: token_status table (one row per agent group, upserted
  on each sweep — checked_at, expires_at, minutes_left, status, refreshed_at)
- src/db/token-status.ts: upsertTokenStatus() + getAllTokenStatuses()
- src/modules/health-monitor/token-sweep.ts: sweepAllTokens() — iterates
  ALL groups, calls refreshOauthTokenIfNeeded() for each, writes results
  to token_status. Returns array of results for alerting.
- Replaces per-alert token refresh in index.ts with sweepAllTokens().
  Previously only groups that hit the 60-min alert threshold got checked;
  now every group is checked and proactively refreshed every 5 minutes.
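
The sweep described above might be shaped like the sketch below. It is synchronous and uses an in-memory `Map` in place of the `token_status` table so that it stays self-contained; the real `sweepAllTokens()` is presumably asynchronous. The `RefreshOutcome` type, the status values, and the parameter list are assumptions.

```typescript
type TokenStatus = "ok" | "refreshed" | "failed"; // hypothetical status values

interface TokenStatusRow {
  agentGroupId: string;
  checkedAt: number;
  expiresAt: number;
  minutesLeft: number;
  status: TokenStatus;
  refreshedAt: number | null;
}

// Hypothetical result of one per-group refresh attempt.
interface RefreshOutcome {
  refreshed: boolean;
  expiresAt: number;
  error?: string;
}

/** Checks every group (not just alerting ones) and upserts one row per group. */
function sweepAllTokens(
  groupIds: string[],
  refresh: (agentGroupId: string) => RefreshOutcome,
  store: Map<string, TokenStatusRow>,
  now: number,
): TokenStatusRow[] {
  const rows: TokenStatusRow[] = [];
  for (const agentGroupId of groupIds) {
    const outcome = refresh(agentGroupId);
    const status: TokenStatus =
      outcome.error !== undefined ? "failed" : outcome.refreshed ? "refreshed" : "ok";
    const row: TokenStatusRow = {
      agentGroupId,
      checkedAt: now,
      expiresAt: outcome.expiresAt,
      minutesLeft: Math.floor((outcome.expiresAt - now) / 60000),
      status,
      refreshedAt: outcome.refreshed ? now : null,
    };
    store.set(agentGroupId, row); // upsert: keyed by group, so re-sweeps overwrite
    rows.push(row);
  }
  return rows;
}
```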

Fixes the root cause of the ag-1778266708996-ipsjnc (Terminal Agent)
stale-token issue: all groups are now covered regardless of which one
triggered the alert.

Query status any time:
  pnpm exec tsx scripts/q.ts data/v2.db \
    "SELECT agent_group_id, datetime(checked_at/1000,'unixepoch','localtime') \
     as checked, minutes_left, status FROM token_status"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(health-monitor): token status table + sweep all groups every 5 min
…oding

HEALTH_MONITOR_DISCORD_GUILD_ID and HEALTH_MONITOR_KEEPALIVE_CHANNEL_ID
are now read from the .env file. If missing, Discord wiring is skipped
with a warning rather than failing silently with wrong-server IDs.
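
One way to express "skip with a warning when missing" is sketched below. The function name, the return shape, and the injected `env` and `warn` parameters are assumptions made to keep the example self-contained; only the two variable names come from the commit.

```typescript
interface DiscordWiring {
  guildId: string;
  keepaliveChannelId: string;
}

/**
 * Reads the two Discord IDs from an env-like object. Returns null (and emits
 * a single warning) when either is missing or blank, so the caller can skip
 * Discord wiring instead of falling back to hardcoded IDs.
 */
function readDiscordWiring(
  env: Record<string, string | undefined>,
  warn: (message: string) => void,
): DiscordWiring | null {
  const guildId = env["HEALTH_MONITOR_DISCORD_GUILD_ID"]?.trim();
  const keepaliveChannelId = env["HEALTH_MONITOR_KEEPALIVE_CHANNEL_ID"]?.trim();
  if (!guildId || !keepaliveChannelId) {
    warn(
      "health-monitor: HEALTH_MONITOR_DISCORD_GUILD_ID or " +
        "HEALTH_MONITOR_KEEPALIVE_CHANNEL_ID is not set; skipping Discord wiring",
    );
    return null;
  }
  return { guildId, keepaliveChannelId };
}
```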

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
refactor(health-monitor): read Discord IDs from .env instead of hardcoding
… refresh

All agent groups share one macOS Keychain entry. When any group
successfully refreshes, the refresh_token rotates and Keychain is updated,
but other groups' claude.json files still carry the stale refresh_token.
The next sweep would then attempt a refresh with the old RT and get rejected
by Anthropic.

syncFromKeychain() is now called before refreshOauthTokenIfNeeded() for
each group, ensuring the latest refresh_token from Keychain is in place
before the API call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(health-monitor): sync claude.json from Keychain before each token refresh
…r on refresh

Three changes:
1. readKeychainOauth() reads the Keychain once before the group loop instead
   of once per group — all groups share the same entry.
2. syncOauthToFile() only writes if the Keychain token is newer (expiresAt
   comparison), preventing the stale snapshot from overwriting a
   just-refreshed file mid-sweep.
3. After a successful refresh, restartAgentGroupContainers() stops any
   running container so the next spawn reads the new token from the mounted
   claude.json. A Discord alert is posted on restart.
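
Changes 1 and 2 can be sketched together: read the Keychain once, then write it to a group's file only when it is newer. The `Map` stands in for the per-group claude.json files, and `syncGroupsFromSnapshot` and `Creds` are hypothetical names; the container restart from change 3 is left out.

```typescript
// Hypothetical minimal credential shape for this sketch.
interface Creds {
  refreshToken: string;
  expiresAt: number; // epoch milliseconds
}

/**
 * Takes one Keychain snapshot before the loop, then overwrites a group's
 * stored credentials only if the snapshot expires strictly later.
 * Returns the IDs of the groups that were updated.
 */
function syncGroupsFromSnapshot(
  readKeychain: () => Creds | null,
  files: Map<string, Creds>,
): string[] {
  const snapshot = readKeychain(); // single read shared by all groups
  const updated: string[] = [];
  if (snapshot === null) return updated;
  files.forEach((onDisk, groupId) => {
    if (snapshot.expiresAt > onDisk.expiresAt) {
      files.set(groupId, { ...snapshot });
      updated.push(groupId);
    }
  });
  return updated;
}
```

A group whose file was refreshed mid-sweep has a later `expiresAt` than the snapshot, so the stale snapshot leaves it alone.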

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
OWL_RADAR_CHANNEL_ID is now required in .env.
OWL_RADAR_MANIFEST_URL and OWL_RADAR_PAGES_URL are optional with
sensible defaults for the existing fork.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…token; fix silent-fail task prompt

- container-runner: only overwrite claude.json from Keychain when the
  Keychain token is strictly newer, preventing the Keychain read from
  rolling back a token that refreshOauthTokenIfNeeded just refreshed
  via the OAuth endpoint
- health-monitor: rewrite injectTask prompt to reference mounted log/data
  paths instead of raw macOS security commands (which fail inside a Linux
  container and triggered the agent's security refusal)
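
For illustration, a prompt builder along the lines of the second bullet might look like this. The wording of the prompt and the name `buildSilentFailPrompt` are invented for the sketch; only the two mounted paths come from the PR description.

```typescript
// Mounted paths named in the PR summary.
const LOGS_MOUNT = "/workspace/extra/nanoclaw-logs/";
const DATA_MOUNT = "/workspace/extra/nanoclaw-data/";

/**
 * Builds a task prompt that points the agent at files it can actually read
 * inside the Linux container, with no host-only macOS commands.
 */
function buildSilentFailPrompt(sessionId: string): string {
  return [
    `A silent failure was detected for session ${sessionId}: the run was acknowledged as completed but produced no outbound messages.`,
    `Inspect the host logs mounted at ${LOGS_MOUNT} for authentication errors.`,
    `Check the data mounted at ${DATA_MOUNT} for the session's recorded state.`,
    "Report the likely cause and a suggested fix.",
  ].join("\n");
}
```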

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gavrielc
Collaborator

@alexli-77 Appreciate the PR, but there are lots of unrelated changes here. Please reopen with one fix or feature per PR and a clean, focused change set.

@gavrielc gavrielc closed this May 23, 2026