
fix(scripts): authenticate notification daemon requests#653

Closed
nnnet wants to merge 51 commits into builderz-labs:main from nnnet:fix/notification-daemon-auth

nnnet commented May 6, 2026

Summary

The notification daemon's curl invocations against /api/notifications/deliver and /api/notifications (stats) are issued without any Authorization header. Both endpoints use requireRole(request, 'operator') (src/lib/auth.ts), which accepts the global API key via Authorization: Bearer … or x-api-key. Without it, every poll returns 401, and the script reports Delivery failed: HTTP 401 with no hint about why.

Fix

  • Read MC_API_KEY (with API_KEY fallback so a single .env value works for both MC and the daemon) from env.
  • Validate it up-front; print an actionable error and exit 1 if unset, instead of silently 401-ing in a loop.
  • Pass it as Authorization: Bearer … on both the POST deliver call and the GET stats call.
  • Sync help-text default URL with the script's actual default (3005 → 3000).

No new dependencies. No behavior changes outside the auth path.
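
The key lookup and fail-fast check from the Fix list can be sketched as below. The daemon itself is a shell script, so this Python version only illustrates the logic. The helper names (`resolve_api_key`, `auth_headers`, `main`), the error wording, and treating an empty value as unset are all assumptions, not taken from the script.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env: Mapping[str, str]) -> Optional[str]:
    """Return MC_API_KEY, falling back to API_KEY; None if neither has a value."""
    for name in ("MC_API_KEY", "API_KEY"):
        value = env.get(name, "").strip()
        if value:  # assumption: an empty value counts as unset
            return value
    return None


def auth_headers(api_key: str) -> dict:
    """Header shape named in the summary: the global key as a Bearer token."""
    return {"Authorization": f"Bearer {api_key}"}


def main(env: Mapping[str, str] = os.environ) -> int:
    key = resolve_api_key(env)
    if key is None:
        # Fail fast with an actionable message instead of polling into 401s.
        print("error: set MC_API_KEY (or API_KEY) before starting the daemon")
        return 1
    # Print header names only, never the key itself.
    print("would send headers:", sorted(auth_headers(key)))
    return 0
```

Calling `main({})` returns 1 after printing the error, which mirrors the first item of the test plan.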

Test plan

  • unset MC_API_KEY API_KEY; ./scripts/notification-daemon.sh → exits 1 with the actionable error message.
  • MC_API_KEY=<wrong> ./scripts/notification-daemon.sh --dry-run → 401 logged with response body (previously swallowed by 2>/dev/null).
  • MC_API_KEY=<right> ./scripts/notification-daemon.sh --dry-run → 200 with dry-run report.
  • bash -n scripts/notification-daemon.sh clean.

nnnet added 30 commits April 29, 2026 21:31
Make MC runnable as a single self-contained container that can drive the
operator's host-installed Claude Code / Codex / OpenCode CLIs without
forcing them to re-authenticate inside the container, and without OpenClaw
gateway (which is macOS-only and not available on Linux hosts).

Changes:

- Dockerfile
  * Bake claude / codex CLIs into the image as a fallback so the Settings
    panel reports "Installed" even before the host bind-mounts attach. The
    host's /home/<user>/.local/bin gets first place in PATH, so an
    authenticated host install transparently shadows the baked one.
  * Add tmux + jq to the runtime image: required by /chat (PTY terminals)
    and by various agent runtime probes.
  * Reuse the slim image's existing uid 1000 user, renaming it to
    `nextjs`. This means bind-mounted host files (typical Linux uid 1000)
    are read/written without chown.

- docker-compose.yml
  * `user: "1000:1000"` so files written into bind-mounted host dirs
    stay owned by the host user.
  * Project the host user's $HOME into the container (`.local/bin`,
    `.bun`, `.claude`, `.claude.json`, `.local/share/claude`) so the
    container sees the same authenticated CLIs the operator uses on
    the host.
  * Mount /mnt and ${HOME} verbatim so file paths the user sees on
    the host work identically inside the container.
  * Bump memory limit to 2G (Next.js 16 + node-pty + the task-dispatch
    loop OOM at the upstream 512M when /chat opens a terminal).
  * Wire ANTHROPIC_API_KEY and OPENAI_API_KEY from .env so the direct-API
    dispatch path works without a gateway.
  * Make NEXT_PUBLIC_CHAT_POLL_INTERVAL_MS a build-time variable. Default
    1000ms here gives near-live transcript updates while claude writes
    the jsonl line-by-line.
  * Add MC_HOST_SESSION_MODE env (see follow-up commit).

- Makefile (new)
  Replaces the previous mc-up.sh / mc-reset-db.sh helpers with a single
  entrypoint for the operator: `up`, `down`, `restart`, `recreate`,
  `build`, `rebuild` (no-cache), `ps`, `logs`, `shell`, `status`,
  `wait-ready`, `reset-db`, `nuke`. URL is fixed to 127.0.0.1:7012.

- .gitignore
  Ignore WALKTHROUGH.md (operator-local notes file kept on disk, never
  committed).
The strict CSP set in src/proxy.ts uses `script-src 'nonce-X' 'strict-dynamic'`
which blocks every inline <script> that does not carry the matching nonce.

next-themes injects a tiny inline script at render time to set the
light/dark class before first paint (anti-FOUC). Without an explicit
`nonce` prop it ships that script with no nonce attribute, so the
browser blocks it. Strict-dynamic then refuses to load any further
script (because the boot script is the trust root), and React never
hydrates — the chat input field has no working Send button, fetch from
client code throws TypeError, etc.

Fix: pass the per-request nonce (already available in layout via the
x-nonce request header that the proxy/middleware sets) into
ThemeProvider's `nonce` prop. next-themes >=0.4 propagates this to
its inline script tag.

Also leaves two `// console.log('[DEBUG csp] ...')` lines (commented out)
in proxy.ts and layout.tsx — useful for the next person who has to debug
the nonce flow end-to-end.
When MC runs in Docker and the operator already has a live `claude` CLI
on the host attached to a project session, /chat → "Send" used to fail
with "spawn claude ENOENT" or "No conversation found" depending on
which root cause was hit first. This commit makes the endpoint actually
usable for that shared-session case.

What was broken:

1. `claude --resume <id>` only finds the transcript if the process cwd
   matches the project path encoded in `~/.claude/projects/<encoded>/`.
   MC's runCommand defaulted to /app inside the container, so resume
   silently picked up no conversation.

2. Decoding the directory name back to a path is unreliable: claude
   collapses both `/` and `_` to `-`, so `/foo/bar_baz` and
   `/foo/bar/baz` both round-trip to `-foo-bar-baz`. Naively decoding
   gave wrong cwd → cd failed → claude never ran.

3. Direct `spawn('claude', ...)` from the Next.js standalone server
   process produced ENOENT even though the binary was reachable from
   `which claude` and `node -e 'spawn(...)'` in the same container.
   Likely a Next 16 standalone runtime / process.env quirk; the exact
   root cause was not pinned down.

4. There was no policy for the operator-visible race when MC --resumes
   into a session that already has a live host CLI writing to the
   same jsonl.
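
Item 2 is easy to reproduce. The sketch below uses a hypothetical `encode_project_dir` that applies only the two substitutions named above; the real encoding may handle more characters.

```python
def encode_project_dir(path: str) -> str:
    """Hypothetical stand-in: collapse both '/' and '_' to '-'."""
    return path.replace("/", "-").replace("_", "-")


first = encode_project_dir("/foo/bar_baz")
second = encode_project_dir("/foo/bar/baz")

# Both paths collapse to the same directory name, so decoding cannot be exact.
print(first, second, first == second)
```

Because two distinct paths share one encoded name, any decoder has to guess, which is why the directory name cannot be trusted as the source of the cwd.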

Fix:

- resolveClaudeSessionCwd() now reads the actual `cwd` field from the
  first JSON line of the session jsonl. That field is authoritative
  and survives any encoding ambiguity.

- resolveExecutable() walks PATH with fs.access X_OK to pin the
  absolute path of `claude` before spawn, eliminating bare-name
  resolution as a variable.

- Spawn goes through `sh -c "cd <cwd> && exec <bin> --print --resume
  <id>"` instead of direct execvp. This sidesteps the ENOENT
  observed under Next 16 standalone, and shQuote() keeps the cd /
  bin paths safe.

- New env MC_HOST_SESSION_MODE controls how MC handles a session
  that may have a live host CLI:
    coexist (default) — both write to the same jsonl. Each side
      picks up the other's writes on its next prompt.
    block-active     — return 409 if jsonl mtime < 60s ago, so the
      operator only --resumes idle sessions.
    nudge            — coexist + best-effort utimes() after the
      reply, so a tail-watching host CLI sees a fresh mtime.

- TODO marker for the proper fix to "long wait on Send": replace
  this blocking `claude --print` call with an SSE endpoint backed
  by `--output-format stream-json`, so the chat UI can render
  tokens incrementally.
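
A minimal sketch of the first three Fix items, written in Python for illustration (the actual change lives in the TypeScript server code). The function names are hypothetical, the sample jsonl record is invented, and `shlex.quote` stands in for shQuote().

```python
import json
import shlex
import tempfile
from pathlib import Path
from typing import List, Optional


def resolve_session_cwd(jsonl_path: Path) -> Optional[str]:
    """Return the `cwd` field of the first JSON line, or None if unavailable."""
    try:
        with jsonl_path.open("r", encoding="utf-8") as fh:
            record = json.loads(fh.readline())
    except (OSError, json.JSONDecodeError):
        return None
    cwd = record.get("cwd") if isinstance(record, dict) else None
    return cwd if isinstance(cwd, str) and cwd else None


def build_resume_command(cwd: str, binary: str, session_id: str) -> List[str]:
    """Build the `sh -c "cd <cwd> && exec <bin> --print --resume <id>"` argv."""
    script = (
        f"cd {shlex.quote(cwd)} && "
        f"exec {shlex.quote(binary)} --print --resume {shlex.quote(session_id)}"
    )
    return ["sh", "-c", script]


# Demo with an invented two-line transcript.
with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp) / "session.jsonl"
    session.write_text(
        '{"cwd": "/foo/bar_baz", "type": "user"}\n{"type": "assistant"}\n',
        encoding="utf-8",
    )
    demo_cwd = resolve_session_cwd(session)

demo_cmd = build_resume_command(demo_cwd, "/usr/local/bin/claude", "abc-123")
print(demo_cmd)
```

In Python, `shutil.which("claude")` performs a comparable PATH walk with an executable-bit check, so it could play the role of resolveExecutable() when supplying the `binary` argument.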
Two related issues caused the /chat session list to drift from reality.

1. Upstream considered any jsonl touched in the last 90 minutes as
   "active". For an operator with several claude sessions across
   projects this surfaces almost everything as live, including ones
   they stopped using well over an hour ago. Tightened to 15 minutes
   in both layers (the scanner-side `ACTIVE_THRESHOLD_MS` and the
   API-side derived recovery window `LOCAL_SESSION_ACTIVE_WINDOW_MS`)
   so they stay coherent.

2. Once a row was inserted into `claude_sessions`, it lived forever
   even after the underlying jsonl was deleted. The API still surfaced
   those orphans because the derived-active recovery in /api/sessions
   would happily flag them as "active" off the stale `last_message_at`
   column. After a scan cycle, also DELETE rows whose session_id is
   not present in the freshly-scanned set. The result message now
   reports `removed N orphan(s)` when this happens.
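
Both changes can be sketched against an in-memory SQLite database. The table and column names come from the text above; the two-column schema, the function names, and the inclusive threshold comparison are assumptions.

```python
import sqlite3

ACTIVE_THRESHOLD_MS = 15 * 60 * 1000  # 15 minutes, down from 90


def is_active(last_message_at_ms: int, now_ms: int) -> bool:
    """A session counts as active if it was touched within the threshold."""
    return now_ms - last_message_at_ms <= ACTIVE_THRESHOLD_MS


def prune_orphans(conn: sqlite3.Connection, scanned_ids) -> int:
    """Delete rows whose session_id is absent from the freshly scanned set."""
    scanned = set(scanned_ids)
    existing = {row[0] for row in conn.execute("SELECT session_id FROM claude_sessions")}
    orphans = sorted(existing - scanned)
    conn.executemany(
        "DELETE FROM claude_sessions WHERE session_id = ?",
        [(sid,) for sid in orphans],
    )
    conn.commit()
    return len(orphans)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claude_sessions (session_id TEXT PRIMARY KEY, last_message_at INTEGER)"
)
conn.executemany(
    "INSERT INTO claude_sessions VALUES (?, ?)", [("a", 0), ("b", 0), ("c", 0)]
)

# Session "b" no longer has a jsonl on disk, so it is not in the scanned set.
removed = prune_orphans(conn, ["a", "c"])
remaining = [row[0] for row in conn.execute("SELECT session_id FROM claude_sessions ORDER BY session_id")]
print(f"removed {removed} orphan(s)")
```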
Three small UX fixes in the chat workspace:

- Stop excluding `session:*` conversations from the polling fallback.
  Without this, `/chat` would freeze until the operator hit F5 every
  time SSE dropped. The polling cadence is parameterized via
  NEXT_PUBLIC_CHAT_POLL_INTERVAL_MS (default 1500ms in code, 1000ms in
  the docker-compose file) so it can be tuned without touching the
  component.

- Drop the standalone "lastReply" panel that used to show the most
  recent assistant message above the input row. It duplicated the
  transcript and, on long replies, pushed the input field below the
  viewport. The reply now lands in the transcript via
  onRefreshTranscript() — same SessionMessage rendering, same
  formatting, same scroll behaviour as the rest of the conversation.
  This keeps the prompt input row anchored to the bottom regardless
  of reply size, matching how the host claude CLI itself displays
  things.

- Leave the [DEBUG chat] console.log lines commented in place — these
  were how we traced the CSP/spawn pipeline end-to-end and the next
  person debugging a regression here will want them.
Self-contained Docker stack + /chat fixes for shared host claude sessions
When the OpenClaw gateway isn't available (Linux: it's macOS-only), the
direct-API dispatch path was Anthropic-only — agents configured for OpenAI
or a local model couldn't run. This adds two more direct providers without
touching the existing Anthropic path.

Routing is by `dispatchModel` prefix on the agent config:

  `claude-*`, `anthropic/*`                            → ANTHROPIC_API_KEY
  `gpt-*`, `o1-*`, `o3-*`, `openai/*`                  → OPENAI_API_KEY
  `local/*`, `ollama/*`, `lmstudio/*`, `litellm/*`     → LOCAL_LLM_ENDPOINT
                                                         (+ optional
                                                          LOCAL_LLM_API_KEY)

The "local" path is intentionally generic — it speaks the OpenAI
`/v1/chat/completions` REST shape, which is what LMStudio, Ollama, vLLM,
and liteLLM proxies all expose. Operators who run several local backends
behind a single liteLLM endpoint can point LOCAL_LLM_ENDPOINT at it and
fan everything out from one place; the rest of MC stays unchanged.
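
The routing table above can be expressed as a small prefix matcher. This is a sketch of the described behaviour, not the `callDirectly()` code: the provider labels and `pick_provider` name are hypothetical, and returning None for an unmatched model is an assumption since the commit does not say what happens in that case.

```python
from fnmatch import fnmatchcase
from typing import Optional, Tuple

# (patterns, provider label, env var that must be set), in table order.
ROUTES = [
    (("claude-*", "anthropic/*"), "anthropic", "ANTHROPIC_API_KEY"),
    (("gpt-*", "o1-*", "o3-*", "openai/*"), "openai", "OPENAI_API_KEY"),
    (("local/*", "ollama/*", "lmstudio/*", "litellm/*"), "local", "LOCAL_LLM_ENDPOINT"),
]


def pick_provider(dispatch_model: str) -> Optional[Tuple[str, str]]:
    """Map a dispatchModel value to (provider, required env var)."""
    for patterns, provider, env_name in ROUTES:
        if any(fnmatchcase(dispatch_model, pattern) for pattern in patterns):
            return provider, env_name
    return None  # assumption: unmatched models are left to the caller


for model in ("claude-opus", "gpt-4o-mini", "lmstudio/qwen", "mystery-model"):
    print(model, "->", pick_provider(model))
```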

Both Aegis review and the main task dispatch now go through `callDirectly()`
which picks the provider. The Aegis review path also now passes through
the agent's `agent_config` so per-agent `dispatchModel` overrides reach
the reviewer (previously hardcoded to `null`, which forced Anthropic).

Token usage from OpenAI-compatible responses (`usage.prompt_tokens` /
`usage.completion_tokens`) is recorded in the same `token_usage` table the
Anthropic path uses, with the model id verbatim — so cost reports cover
all three providers without further plumbing.

`docker-compose.yml` adds `LOCAL_LLM_ENDPOINT` (default
`http://host.docker.internal:1234/v1`, LMStudio's stock listener on the
docker host) and `LOCAL_LLM_API_KEY`.
…host session modes

Cover the operator-path features from the docker-stack PR and this branch:

- New "Self-contained Operator Setup" section explaining host $HOME
  bind-mounts, baked CLI fallback, the uid-1000 image constraint and
  the workaround for non-1000 hosts, the 2G memory floor, and why
  Makefile defaults to MC_PORT=7012.

- "Direct API dispatch" subsection with the dispatchModel → provider
  routing table (Anthropic / OpenAI / OpenAI-compatible local via
  LMStudio, Ollama, or liteLLM proxy).

- "Shared host Claude Code session" subsection describing the
  MC_HOST_SESSION_MODE env (coexist / block-active / nudge).

- Environment Variables table extended with MC_PORT, ANTHROPIC_API_KEY,
  OPENAI_API_KEY, LOCAL_LLM_ENDPOINT, LOCAL_LLM_API_KEY,
  MC_HOST_SESSION_MODE, NEXT_PUBLIC_CHAT_POLL_INTERVAL_MS.

No existing content removed or contradicted; all additions go before
"Production Hardening" so the canonical hardened-compose flow is
unchanged.
Step-by-step demo showing how to wire a 4-agent team with three
different providers — architect on Claude Opus, implementor on
gpt-4o-mini, linter on a local model via LMStudio, Aegis reviewer
on Claude Sonnet — and run a single master task end to end through
the dispatch / review loop.

The walkthrough is intentionally exhaustive on per-field values
(Display Name, Role, Soul, Settings → Agent Runtimes, dispatchModel,
temperature, sandbox/network options) so an operator can copy-paste
through it without guessing what the form expects.

Includes:
- .env preparation for all three providers
- Workspace + project setup
- Per-agent screens with full Soul prompt text
- Master task description (login → JWT migration) the architect
  decomposes
- Acceptance checklist (where to look, what to expect)
- Troubleshooting for the common LMStudio / OPENAI_API_KEY / Aegis
  failure modes
- Variants for Ollama, liteLLM proxy, Anthropic-only setups

Lives under examples/ so it stays out of the production build but is
discoverable next to the source.
Add `make dev` workflow that bind-mounts src/, public/, messages/, and configs
into a Node 22 container, so day-to-day .ts/.tsx edits hit Turbopack
hot-reload without rebuilding the image. Image is rebuilt only when
package.json / pnpm-lock.yaml / Dockerfile.dev change.

The dev compose shares the same `mission-control_mc-data` volume as
production so admin user / workspaces / projects / agents created via
`make up` are visible in `make dev` and vice versa (still single-writer
SQLite — only run one stack at a time).

Targets: dev, dev-down, dev-build, dev-rebuild, dev-logs, dev-shell, dev-ps.
The dispatch path silently fell back to `runOpenClaw` even when no gateway
was actually installed, producing `spawn openclaw ENOENT` on every tick
until the task was failed after 5 retries. This change adds the missing
gateway-availability evidence checks and a CLI fallback for the Claude
provider so MC functions standalone on Linux/Docker hosts where
OpenClaw is not installed.

Changes in src/lib/task-dispatch.ts:
- isGatewayAvailable() now requires physical evidence: either a real
  openclaw.json on disk, or a registered gateway row whose status is in
  the healthy set (online/healthy/ready). Previously a truthy default
  config path was sufficient, and an onboarding-seeded gateway row with
  status='unknown' falsely satisfied the check.
- Added isClaudeCliAvailable() + callClaudeViaCli() — when the host
  Claude Code CLI is bind-mounted into the container (~/.local/bin/claude
  with ~/.claude.json), the anthropic provider routes through it via
  spawn('claude', ['--print', '--output-format', 'json', '--model', X]).
  This uses the operator's existing login/plan with no API key.
- callDirectly() prefers CLI for anthropic; falls back to direct API
  only if CLI is absent.
- isDirectDispatchAvailable('anthropic') is true when EITHER an API key
  OR the CLI is present.
- requeueStaleTasks() now skips the offline-check entirely when MC is in
  direct-API mode, since direct-API agents have no heartbeat by design
  and would otherwise be failed after 5 stale-cycles before any dispatch
  could run.
- scoreAgentForTask() lifts the offline/error/sleeping rejection in
  direct-API mode for the same reason.
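
The evidence rules above reduce to two small predicates. A minimal sketch follows, in Python rather than the TypeScript of task-dispatch.ts; the function signatures are hypothetical and the inputs (a config path and a list of gateway row statuses) are a simplification of whatever the real code reads.

```python
from pathlib import Path
from typing import Iterable, Optional

HEALTHY_STATUSES = {"online", "healthy", "ready"}


def is_gateway_available(config_path: Optional[Path], gateway_statuses: Iterable[str]) -> bool:
    """True only with physical evidence: a real config file or a healthy row."""
    if config_path is not None and config_path.is_file():
        return True
    # A seeded row with status 'unknown' no longer satisfies the check.
    return any(status in HEALTHY_STATUSES for status in gateway_statuses)


def is_anthropic_dispatch_available(api_key: Optional[str], cli_present: bool) -> bool:
    """Direct Anthropic dispatch works with EITHER an API key OR the CLI."""
    return bool(api_key) or cli_present


print(is_gateway_available(None, ["unknown"]))  # seeded row only
print(is_gateway_available(None, ["online"]))   # registered and healthy
```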

Changes in src/app/api/agents/[id]/route.ts:
- Save flow no longer attempts to write gateway_config when openclaw.json
  doesn't exist — guards prevent the ENOENT error that caused agent
  edits to be silently reverted on Linux without OpenClaw.

Changes in src/lib/claude-sessions.ts:
- Oversized session jsonl files are logged once per process per filePath
  at INFO instead of WARN-spammed on every 30s sync tick.

Changes in src/components/panels/agent-detail-tabs.tsx:
- Defensive coercion in 4 places against nested model.primary objects
  ({primary: {primary: "..."}}), which crashed Config tab with React
  error #31 when MC and gateway disagreed on the model schema.
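
The defensive coercion can be illustrated as follows. This is a Python sketch of the idea only; the helper name, the depth cap, and the empty-string fallback are assumptions rather than the component's actual behaviour.

```python
from typing import Any


def coerce_primary_model(model: Any, max_depth: int = 8) -> str:
    """Unwrap nested {'primary': {...}} objects until a string is reached."""
    depth = 0
    while isinstance(model, dict) and depth < max_depth:
        model = model.get("primary")
        depth += 1
    # Anything that is still not a string is rendered as empty text
    # instead of being handed to the UI as an object.
    return model if isinstance(model, str) else ""


print(coerce_primary_model({"primary": {"primary": "claude-sonnet"}}))
```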
…through

Rewrite examples/MULTI-PROVIDER-DEMO.md from a high-level outline into a
13-section step-by-step guide a non-MC operator can follow front-to-back
without needing to read the source. Each UI step lists the exact label,
the value to enter, and the expected result.

Sections:
- 0  prepare .env (API keys, host session mode)
- 1  create workspace (8 fields, OpenClaw template note)
- 2  add project (with expandable card)
- 3-6  create 4 agents via the 3-step wizard with full text:
       Architect (Claude Opus), Aegis (Sonnet), Dev (OpenAI), Linter (Local LLM)
- 7   master task with full description copy
- 7.A explain Owner / "Awaiting Owner" gate
- 8   pipeline execution + how to inspect results

Adds a hard-won note at the top: in direct-API mode the agents stay in
'offline' status by design (no heartbeat), but tasks still dispatch via
the direct provider — this is normal and not a failure mode.
Add a parallel docker-compose stack that brings up the OpenClaw gateway
daemon (github.com/openclaw/openclaw) on host ports 18789/18790 — the
same defaults MC has long expected via OPENCLAW_GATEWAY_HOST/PORT in
docker-compose-dev.yml. The integration is strictly additive:

- docker-compose-openclaw.yml is independent of docker-compose.yml and
  docker-compose-dev.yml. MC reaches openclaw via host.docker.internal,
  no shared network or compose merging required.
- When openclaw is up and registered in MC's gateways table with
  status='online', isGatewayAvailable() returns true and dispatch
  automatically routes through runOpenClaw — agents get persistent
  PTY-backed sessions with full tool-use.
- When openclaw is down (or never started), MC silently falls back to
  the direct-API/CLI path introduced in the previous commit. Operators
  can adopt or remove the gateway without changing any MC code.

Makefile targets:
  openclaw-clone     git-clones github.com/openclaw/openclaw to ./openclaw-src
  openclaw-build     docker builds the gateway image (5-10 min, one-time)
  openclaw-up        starts the gateway daemon
  openclaw-down      stops it (MC keeps running on direct-API fallback)
  openclaw-restart / -logs / -ps / -status
  openclaw-onboard   runs the upstream interactive provider/skills wizard
  openclaw-shell     drops into the CLI sidecar
  openclaw-doctor    runs 'openclaw doctor' for diagnostics
  openclaw-token     prints the auto-generated gateway token (for MC .env)

Files:
  docker-compose-openclaw.yml  — gateway + cli sidecar, /healthz, persistent
                                 .openclaw-data volume, host.docker.internal
                                 alias
  .env.openclaw.example        — minimal env template (token + provider keys)
  examples/OPENCLAW-INTEGRATION.md  — 9-step walkthrough: clone, build,
                                 onboard, register gateway in MC UI,
                                 verify dispatch path, rollback procedure
  .gitignore                   — ignore /openclaw-src/, /.openclaw-data/,
                                 /.env.openclaw, /.idea/, /.vscode/
Add openclaw npm package to the dev image so src/lib/command.ts:runOpenClaw
finds a binary in PATH when MC is configured to dispatch through a
sibling openclaw-gateway container. Also propagate OPENCLAW_GATEWAY_URL,
OPENCLAW_GATEWAY_TOKEN, and OPENCLAW_ALLOW_INSECURE_PRIVATE_WS into the
dev container so the CLI knows where the gateway lives.

Status of the experiment (docs/openclaw-experiment-notes.md to follow):
- openclaw gateway daemon comes up via docker-compose-openclaw.yml
- /healthz is reachable from MC, gateway-row in MC db transitions to
  status='online', isGatewayAvailable() returns true
- HTTP /v1/chat/completions with Bearer-token auth WORKS — no pairing
  required, full operator scope. gpt-5-mini round-trip confirmed.
- The CLI/WebSocket path used by runOpenClaw requires per-client device
  pairing approval (operator.admin scope), which is non-trivial in a
  cross-container topology. Next step is to add an HTTP fallback to
  task-dispatch.ts so MC routes via /v1/chat/completions when the gateway
  is online and a token is available, before falling back to the existing
  runOpenClaw CLI path or the direct-API/CLI path.
End-to-end success: MC dispatches LOGIN-001 through openclaw gateway with
ZERO MC code changes. Path:
  MC task assigned → dispatchAssignedTasks → runOpenClaw → openclaw CLI
  → ws://host.docker.internal:18789 → gateway → openai/gpt-5-mini
  → 4322 chars resolution → MC tasks.resolution → status='review'
Total wall-clock: ~70s.

This commit consolidates the third-variant integration approach (shared
docker bridge, separate compose files, persistent pair state):

scripts/openclaw-auto-pair.py
  Patches openclaw's pending → paired pairing files transactionally on
  the host filesystem. Pairing tokens are documented as 32-byte base64url
  random secrets in openclaw-src/src/infra/pairing-token.ts (no crypto
  signing), so producing them client-side is safe and produces the same
  end-state as the official `openclaw devices approve` flow which is
  unavailable in our cross-container topology (operator.admin scope is
  not held by any auto-paired loopback CLI).

  Idempotent: if MC's deviceId is already in paired.json with a matching
  token in MC's device-auth.json, the script exits 0 without changes.
  DeviceId+publicKey match check ensures we only approve the specific
  pending request that matches MC's identity, not arbitrary pending
  entries.
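
  The matching and idempotency checks described above might look like this. It is a sketch under stated assumptions: the dictionary shapes and the `token` field name are hypothetical (only `deviceId` and `publicKey` are named in this message), and the real script also handles file I/O and the transactional write, which are omitted here.

```python
import secrets
from typing import Dict, List, Optional


def new_pairing_token() -> str:
    """32 random bytes, base64url-encoded without padding."""
    return secrets.token_urlsafe(32)


def find_matching_pending(pending: List[dict], device_id: str, public_key: str) -> Optional[dict]:
    """Return only the pending request that matches MC's own identity."""
    for request in pending:
        if request.get("deviceId") == device_id and request.get("publicKey") == public_key:
            return request
    return None


def is_already_paired(paired: Dict[str, dict], device_auth: dict, device_id: str) -> bool:
    """Idempotency check: paired entry exists and its token matches MC's copy."""
    entry = paired.get(device_id)
    if entry is None or not entry.get("token"):
        return False
    return entry["token"] == device_auth.get("token")


pending = [
    {"deviceId": "other", "publicKey": "pk-other"},
    {"deviceId": "mc", "publicKey": "pk-mc"},
]
match = find_matching_pending(pending, "mc", "pk-mc")
print(match["deviceId"] if match else None)
```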

docker-compose-dev.yml
  Bind-mount ./.mc-openclaw/:/home/nextjs/.openclaw — gives MC's openclaw
  CLI a stable identity (private key, deviceId, paired token) that
  survives container recreate / dev-rebuild. Without this, every restart
  generates a new deviceId and re-pairing is needed.

Makefile (openclaw-pair-mc target)
  One-shot wrapper: trigger pairing request from MC, run auto-pair.py,
  bind MC agent display names ("Architect (Claude Opus)") to openclaw
  agent ids declared in openclaw.json ("architect"), verify with a
  health call. Agent binding is config-only (writes agents.config JSON,
  no code change).

Makefile (openclaw-unpair-mc target)
  Tear-down counterpart, run with CONFIRM=yes: clears MC-side identity files
  and removes MC's paired entry from the gateway side. Useful when
  experimenting or rotating identities.

examples/OPENCLAW-INTEGRATION.md
  Adds Step 7.5 (Auto-pair) with explanation of why pairing is needed
  and how the script bypasses the standard interactive admin-approval
  flow safely.

.gitignore
  Adds /.mc-openclaw/ to the openclaw-related local-only paths.

What is NOT changed:
  - src/lib/task-dispatch.ts: untouched. The existing runOpenClaw path
    works as designed once a paired CLI is in place.
  - src/lib/command.ts: untouched.
  - src/lib/openclaw-gateway.ts: untouched.
  - Any MC API route, scheduler, or UI component: untouched.

Verified test sequence (LOGIN-001 dispatch through gateway):
  $ make openclaw-up && make dev
  $ make openclaw-pair-mc          # one-shot, idempotent
  $ # ...drag LOGIN-001 in UI or reset to status='assigned'
  $ # within 60s task transitions through in_progress to review
  $ # tasks.resolution contains the gpt-5-mini agent's response
…builds for upgrades

Refactor both stacks so openclaw runs from a bind-mounted ./openclaw-src/
clone instead of a baked image. Updating openclaw is now `git pull` plus
one builder run, no docker image rebuild required.

Architecture before:
  - openclaw-gateway: built from openclaw-src via Dockerfile (baked dist
    + node_modules into image, ~3.5 GB image). Update = full image rebuild.
  - MC dev: `npm install -g openclaw` baked at image build time. Pinned to
    whatever was on npm at build time. Update = full image rebuild.
  - Result: MC frontend showed "update available v2026.4.27 (installed
    v2026.4.26)" because the gateway was rebuilt from a fresh clone but MC
    CLI was still on an older npm publish.

Architecture after:
  - Stock node:24-bookworm image for both gateway and CLI sidecar — no
    custom build needed.
  - ./openclaw-src/ bind-mounted at /app inside both. The gateway runs
    `node /app/dist/index.js gateway`. dist + node_modules live on the
    host under openclaw-src/ (~3.1 GB, .gitignored).
  - New `openclaw-builder` compose service (profile=build) compiles dist
    + installs node_modules into the bind-mount via a one-shot container
    that has bun + pnpm. First run ~5 min, incremental rebuilds faster.
  - MC dev: dropped `npm install -g openclaw` from Dockerfile.dev. A shim
    at /usr/local/bin/openclaw runs `node /opt/openclaw-src/dist/index.js`
    where /opt/openclaw-src is a read-only bind-mount of the same host
    clone. Both gateway and MC CLI always run the exact same dist.
  - Both containers run as uid 1000:1000 so files written into bind-mounts
    are readable on the host without sudo.
  - Plugin runtime stage moved from a Docker named volume (root-owned by
    default, blocked uid 1000) into ./.openclaw-data/plugin-runtime-deps/
    so it inherits the same uid 1000 ownership as the rest of state.

Update workflow:
  $ make openclaw-update
  → git pulls openclaw-src to latest, runs openclaw-builder to recompile
    dist into the bind-mount, restarts the gateway. Takes ~30s-2min for
    incremental updates. No docker rebuild, no image churn.

Verified after refactor:
  - MC CLI and gateway both report v2026.4.27 (the same dist).
  - LOGIN-001 dispatched end-to-end via openclaw gateway in 61s,
    outcome=success, resolution=3538 chars from openai/gpt-5-mini, then
    transitioned into quality_review (Aegis cycle picked it up).
  - make openclaw-pair-mc remained idempotent across the recreate
    (paired identity persisted in ./.mc-openclaw/).
Two operator-visible regressions after the openclaw integration:

1. WebSocket spam in /logs:
     "Handshake failed on root path. Retrying WebSocket via /gateway-ws."
     "Max reconnection attempts reached."
   The MC frontend reads gateway.host from the gateways table to build
   the WebSocket URL. Our row stored "host.docker.internal" (correct for
   the MC backend's HTTP probe) but a browser running on the host can't
   resolve that name — it's only injected into containers via Docker's
   `extra_hosts: host-gateway` mapping. The browser was hitting
   ws://host.docker.internal:18789, getting ENOTFOUND/refused, and
   exhausting the reconnect budget every few seconds.
   Fix: set NEXT_PUBLIC_GATEWAY_URL=ws://127.0.0.1:18789 in
   docker-compose-dev.yml. The /api/gateways/connect route already
   honours this env var as a browser-facing override (see
   src/app/api/gateways/connect/route.ts:156). MC backend continues to
   probe the gateway through host.docker.internal:18789 unchanged.

2. Orchestration → Command tab dropdown showing agents but not letting
   any be selected:
     <option ... disabled={!a.session_key}>
   (src/components/panels/orchestration-bar.tsx:275)
   Our agents were created in MC's setup wizard with session_key=null,
   and the Command UI disables options whose session_key is null. The /agents
   page list looked correct but every option was greyed out.
   Fix: make the openclaw-pair-mc target also write session_key="mc-<id>"
   for each MC agent it knows about (architect/aegis/dev/linter). The
   value lines up with the openclaw agent ids declared in openclaw.json,
   so the dropdown becomes selectable and downstream openclaw routing has
   a valid identifier to work with.

Also: tighten the venv probe in the openclaw-pair-mc target so it falls
back to system python3 cleanly when ./.venv is absent, instead of
printing a stderr warning.
… shim

MC's source uses an older openclaw CLI shape (`gateway sessions_send
--session X --message Y`) for the wake-agent and agent-message endpoints.
That shape is gone in openclaw 2026.4.x — `sessions_send` is no longer a
gateway subcommand, only an internal RPC name. The current public RPC for
sending into a session is `chat.send`, which additionally requires an
`idempotencyKey` and a `deliver` flag.

Symptom: clicking "Wake up" on a Linter (offline) agent in /agents
returned `Failed to wake agent`. The MC frontend Orchestration → Command
tab Send button would also fail silently with the same root cause.

Fix without modifying MC code: route every `openclaw` invocation in the
MC dev container through `scripts/openclaw-cli-shim.py`, which detects
known retired shapes and rewrites them on the fly:

  legacy: gateway sessions_send --session X --message Y
  modern: gateway call chat.send --params '{
            "sessionKey":"X","message":"Y",
            "idempotencyKey":"mc-shim-<pid>-<ts>","deliver":false
          }' --json

Also handled: sessions_history (-> sessions.history with key) and
sessions_list (-> sessions.list).

The shim is bind-mounted from ./scripts/openclaw-cli-shim.py into the
container at /usr/local/lib/openclaw-cli-shim.py via docker-compose-dev.yml,
so edits to the rewriter don't need an image rebuild — same live-update
ergonomics we now have for openclaw-src/dist itself.

Verified:
  $ curl -X POST .../api/agents/Linter%20%28Local%20LLM%29/wake -d '{"message":"Wake up Linter"}'
  {"success":true,"session_key":"mc-linter","stdout":"{\"runId\":\"mc-shim-...\",\"status\":\"started\"}"}

Pass-through of unknown shapes (`gateway call ...`, `agent ...`,
`devices ...`, etc.) is unchanged — the shim just `os.execvp`s straight
to the real `node /opt/openclaw-src/dist/index.js` for those.
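
A minimal sketch of the `sessions_send` rewrite, assuming the argv shapes quoted above. The function names and the fall-through on missing flags are hypothetical; the real shim also covers sessions_history and sessions_list and then execs the real CLI, which is left out here.

```python
import json
import os
import time
from typing import List, Optional


def _flag_value(argv: List[str], name: str) -> Optional[str]:
    """Return the token following `name`, or None if absent."""
    if name in argv:
        index = argv.index(name)
        if index + 1 < len(argv):
            return argv[index + 1]
    return None


def rewrite_argv(argv: List[str], pid: Optional[int] = None, ts: Optional[int] = None) -> List[str]:
    """Rewrite the retired sessions_send shape; pass everything else through."""
    if argv[:2] != ["gateway", "sessions_send"]:
        return list(argv)
    session = _flag_value(argv, "--session")
    message = _flag_value(argv, "--message")
    if session is None or message is None:
        return list(argv)  # assumption: malformed calls are left untouched
    pid = os.getpid() if pid is None else pid
    ts = int(time.time()) if ts is None else ts
    params = {
        "sessionKey": session,
        "message": message,
        "idempotencyKey": f"mc-shim-{pid}-{ts}",
        "deliver": False,
    }
    return ["gateway", "call", "chat.send", "--params", json.dumps(params), "--json"]


legacy = ["gateway", "sessions_send", "--session", "mc-linter", "--message", "Wake up Linter"]
modern = rewrite_argv(legacy, pid=1, ts=2)
print(modern)
```

Building the params with `json.dumps` rather than string formatting keeps messages containing quotes or newlines valid JSON.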
…Send

Two operator-visible issues seen on /agents and the Orchestration → Command
tab after the openclaw integration:

1. /agents → Command → Send returned "Validation failed" for any message.
   Root cause is a pre-existing MC bug, not openclaw-related: the form in
   src/components/panels/orchestration-bar.tsx:94 sends
       { to, content: message, from }
   while POST /api/agents/message validates against
       { to, message, from }      // src/lib/validation.ts:137
   So zod rejects every payload as "Validation failed" before the request
   ever reaches the agent. Strictly the smallest possible MC code edit:
   rename the request field `content` -> `message` to match the API
   contract. The Wake button on /agents already worked because it goes
   through a separate endpoint that reads `body.message` itself.

2. /logs continued to spam:
       "WebSocket error occurred"
       "Max reconnection attempts reached. Please reconnect manually."
   even after fixing NEXT_PUBLIC_GATEWAY_URL. Gateway logs show why:
       [ws] closed before connect ... peer=192.168.48.1 remote=192.168.48.1
   The browser opens a WS to ws://127.0.0.1:18789, but from the gateway's
   side that connection arrives from the docker bridge IP, NOT loopback,
   so openclaw treats the browser as an unpaired device and aborts the
   handshake with code 1006. There's no clean way to pair every browser
   session non-interactively.
   Fix: set NEXT_PUBLIC_GATEWAY_OPTIONAL=true in docker-compose-dev.yml.
   The MC websocket client already has special-case behaviour for that
   flag (src/lib/websocket.ts:771): it gives up reconnecting silently and
   falls back to HTTP polling for live updates. Backend dispatch through
   openclaw is unaffected — the gateway is still the dispatch path; only
   the browser's live event stream is off.
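
The field mismatch in item 1 can be reproduced with a toy validator. This is a Python stand-in for the zod schema, which is not quoted here; treating all three fields as required non-empty strings is an assumption.

```python
from typing import List

REQUIRED_FIELDS = ("to", "message", "from")


def missing_fields(payload: dict) -> List[str]:
    """Return the required fields that are absent or not non-empty strings."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not isinstance(payload.get(name), str) or not payload[name]
    ]


before = {"to": "Linter", "content": "hello", "from": "operator"}  # old form payload
after = {"to": "Linter", "message": "hello", "from": "operator"}   # after the rename

print(missing_fields(before))
print(missing_fields(after))
```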
Add a local-only auto-approval worker for pending Control UI device requests so Connect succeeds after restart and new request IDs. Tune security scan messaging for Docker localhost topology and HTTPS-only flags to reduce false local warnings.
nnnet added 21 commits May 4, 2026 11:23
Make up/restart/down/status now run against MC_MODE and include OpenClaw when OPENCLAW_ENABLED=1, with compatibility aliases preserved. Update env examples, deployment docs, and integration plan notes to document the minimal operator workflow.
Project Telegram dmPolicy/allowlists/owner allowlists from .env/.env.openclaw into gateway and MC CLI state idempotently, while preserving legacy TELEGRAM_NUMERIC_USER_ID behavior. Add an env toggle to hide non-actionable doctor security info lines without masking real warnings/errors.
Drop MC_OPENCLAW_DOCTOR_HIDE_INFO plumbing so Mission Control always returns full OpenClaw doctor output, including informational security lines.
…ompose

Two compose bugs blocking sandbox skill execution:

1. Path-equivalence for state dir. Gateway runs in a container but asks
   the host docker daemon to bind-mount ${state}/sandboxes/agent-X to
   /workspace in each sandbox. With ./.openclaw-data:/home/node/.openclaw,
   the source path the gateway passed was /home/node/... — nonexistent
   on the host, so docker silently mounted an empty dir and skills /
   AGENTS.md were invisible inside /workspace. Now state is also mounted
   at the same absolute path the host has; legacy /home/node/.openclaw
   alias preserved.

2. Drop ${VAR:-} declarations for TELEGRAM_*, OPENCLAW_GATEWAY_TOKEN,
   OPENCLAW_TOOLS_PROFILE in environment blocks. Compose merges
   environment ON TOP OF env_file, so empty fallbacks were blanking
   real values coming from .env.openclaw whenever the top-level .env
   (which compose interpolation reads) didn't define them.
The proxy is a separate, independent microservice — its source lives in
its own repo (github.com/nnnet/gpu-coordinator-proxy) and is checked out
here as ./gpu-coordinator-proxy-src/, mirroring the openclaw-src/ pattern
we already use. Both clones are gitignored; only the compose entry that
references them lives in this tree.

The service fronts LMStudio (:1235) and Ollama (:11435) on separate ports
with a shared VRAM lock so a 20B-class model in one runtime doesn't
collide with a 20B in the other on a single GPU. Behaviour is fully
configurable via env (defaults are safest):

  GPU_AUTO_FREE_ENABLED   default 1 (master switch)
  GPU_FREE_STRATEGY       default spare-target (keep target warm)
                          alternative wipe-all (cold reload, gated)
  GPU_WIPE_ALL_ALLOWED    default 0 (safety gate for wipe-all)
  GPU_FREE_SETTLE_MS      default 800 (driver reclaim pause)

If the sibling clone is missing the service block can be commented out
without affecting anything else in the MC stack.
…anner

Mission Control's `mission-control-dev` container couldn't reach the host
docker daemon, so the in-MC `openclaw doctor` (called by
src/app/api/openclaw/doctor/route.ts via runOpenClaw) failed its
isDockerAvailable() check and emitted "Sandbox mode is enabled but Docker
is not available" — surfacing as a permanent warning banner in the UI even
though docker on the host was healthy and the gateway sandbox flow worked.

Container-side docker access:
- Add docker.io to Dockerfile.dev so `docker` is on PATH inside the dev
  container; bind /var/run/docker.sock in docker-compose-dev.yml.
- Make uid 1000 reach the socket via `group_add` driven by a new
  DOCKER_SOCKET_GID env, auto-detected by the Makefile via
  `stat -c %g /var/run/docker.sock` (falls back to 994 for stock
  Debian/Ubuntu hosts; manually overridable on Fedora/Arch/colima/Rancher
  Desktop where the gid differs).
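
The gid auto-detection described above can be sketched roughly as follows. This is an illustrative shell version, not the Makefile recipe from the PR: the function name `detect_socket_gid` is hypothetical, and only `stat -c %g /var/run/docker.sock`, the 994 fallback, and the `DOCKER_SOCKET_GID` variable come from the commit message.

```shell
# Hypothetical helper: print the group id that owns a socket path,
# falling back to a default when the path cannot be inspected.
detect_socket_gid() {
  local sock="${1:-/var/run/docker.sock}"
  local fallback="${2:-994}"
  local gid
  # `stat -c %g` is the GNU coreutils form quoted in the commit message;
  # BSD/macOS stat does not accept it, so the fallback applies there.
  if gid="$(stat -c %g "$sock" 2>/dev/null)" && [ -n "$gid" ]; then
    printf '%s\n' "$gid"
  else
    printf '%s\n' "$fallback"
  fi
}

# A value already present in the environment wins over auto-detection,
# which is what makes the manual override possible.
DOCKER_SOCKET_GID="${DOCKER_SOCKET_GID:-$(detect_socket_gid)}"
export DOCKER_SOCKET_GID
```

Running with `DOCKER_SOCKET_GID=<gid>` set in the environment skips detection entirely, which matches the manual-override case mentioned for hosts where the gid differs.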

Mask host's stale ~/.openclaw from the container view:
- openclaw doctor's findOtherStateDirs() scans /home/*/.openclaw and trips
  "Multiple state directories detected" whenever the dev container's broad
  ${HOME}:${HOME}:rw bind exposes the host user's pre-fork .openclaw dir.
- Bind-mount an empty stub file (.docker-mask/openclaw-stub.empty) over
  ${HOME}/.openclaw so existsDir() returns false (the path is now a regular
  file, not a directory). Tmpfs is unsuitable here — it would still appear
  as a directory.
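
Why a regular file works as a mask can be shown with a plain directory test. The sketch below assumes existsDir() behaves like the shell's `-d` check (true only for directories); the temporary paths and the `is_dir` helper are stand-ins, not names from the PR.

```shell
# Temporary stand-ins for the host state dir and the stub file.
workdir="$(mktemp -d)"
mkdir "$workdir/real-state-dir"       # what a directory scan would flag
: > "$workdir/openclaw-stub.empty"    # empty regular file used as the mask

# Assumed equivalent of existsDir(): true only when the path is a directory.
is_dir() { [ -d "$1" ]; }

is_dir "$workdir/real-state-dir" && echo "directory: flagged"
is_dir "$workdir/openclaw-stub.empty" || echo "regular file: ignored"
```

A tmpfs mount at the same path would still pass the `-d` test, which is the reason the commit message gives for rejecting it.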

Doctor-banner suppression in the openclaw CLI shim:
- openclaw doctor unconditionally prints `Run "openclaw doctor --fix" to
  apply changes.` whenever shouldRepair is false, regardless of whether
  there is anything actually fixable (see openclaw-src
  flows/doctor-health-contributions.ts:580-582). The word "fix" trips MC's
  parseOpenClawDoctorOutput mentionsWarnings regex.
- The Plugins panel always prints "Errors: 0" — the substring "error"
  trips MC's level=error escalation regex.
- Both treated as upstream quirks. The shim (which already exists for
  legacy CLI compat) now intercepts plain `doctor` invocations, drops the
  footer line, and rewrites "Errors: 0" → "Errs: 0" only when the count
  is zero. Real plugin error counts pass through untouched so a genuine
  banner still surfaces if something breaks.
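
The two rewrites can be approximated with a small stream filter. This is a sketch in shell using `sed`, not the shim's actual code (the shim in this PR is a Python script); the function name `filter_doctor_output` is hypothetical, and the exact footer wording is taken from the commit message above.

```shell
# Hypothetical filter standing in for the shim's doctor post-processing.
filter_doctor_output() {
  sed \
    -e '/Run "openclaw doctor --fix" to apply changes\./d' \
    -e 's/Errors: 0$/Errs: 0/' \
    -e 's/Errors: 0\([^0-9]\)/Errs: 0\1/g'
}

# Example: the footer is dropped and only the zero count is rewritten.
printf 'Plugins\nErrors: 0\nRun "openclaw doctor --fix" to apply changes.\n' \
  | filter_doctor_output
```

The two substitution rules only match a literal `0` that is not followed by another digit, so a line such as `Errors: 2` or `Errors: 10` passes through unchanged and a genuine error count still reaches the banner logic.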

Brew-enabled gateway + sandbox images (carried over from prior pending
work, finally committed):
- Dockerfile.openclaw.dockercli: gateway image now ships docker CLI +
  Linuxbrew so skills.install RPC can `brew install <formula>` for skills
  declaring brew deps.
- Dockerfile.openclaw.sandbox: overlays brew on the upstream
  openclaw-sandbox:bookworm-slim base (kept separate from openclaw-src to
  preserve read-only update flow via `make openclaw-update`).

Makefile UX simplification (589 lines → 124):
- Replace the legacy multi-mode lifecycle wrapper with a minimal docker
  compose driver: `make`, `make build [SVC...]`, `make up [SVC...]`,
  `make down`, `make logs`, `make ps`, `make clean`. Positional service
  args via $(filter-out $(firstword $(MAKECMDGOALS)),$(MAKECMDGOALS)).
- MODE=dev|prod toggles which compose file pair is used.
- The pre-rewrite Makefile is preserved as Makefile.legacy for anyone who
  still relies on the older recipe names.

Drop tracked .beads/dolt-monitor.pid.lock — it's a runtime lock file and
is already covered by .gitignore.
The deliver_notifications() and get_delivery_stats() helpers in
scripts/notification-daemon.sh issue HTTP requests against
/api/notifications/deliver and /api/notifications without any
Authorization header. Both endpoints use requireRole(request, 'operator')
(src/lib/auth.ts), which accepts the global API key via the
`Authorization: Bearer ...` or `x-api-key` header. Without it every
poll silently returns 401 and the daemon reports "delivery failed"
without explaining why.

Add an MC_API_KEY env var (with API_KEY as a fallback so a single
.env value works for both MC and this daemon), validate it before
the run/stats path, and pass it as a Bearer header on both the POST
deliver call and the GET stats call. Also fix the help text default
URL (3005 -> 3000) to match the script's actual MISSION_CONTROL_URL
default.

Behavior change: the daemon now prints an actionable error and exits
1 if MC_API_KEY is not set, instead of running batches that silently
401. No new dependencies; no other behavior changes.
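
A minimal sketch of the auth path described in this commit message, under stated assumptions: the helper names `resolve_api_key` and `deliver` are hypothetical (the script's own helpers are deliver_notifications() and get_delivery_stats()), the `localhost` host in the default URL is a guess (only port 3000 is stated), the error wording is illustrative, and the request body is omitted because the source does not show it.

```shell
# Prefer MC_API_KEY, fall back to API_KEY; fail loudly when neither is set.
resolve_api_key() {
  local key="${MC_API_KEY:-${API_KEY:-}}"
  if [ -z "$key" ]; then
    echo "ERROR: MC_API_KEY (or API_KEY) is not set; requests would return 401." >&2
    return 1
  fi
  printf '%s\n' "$key"
}

# Illustrative POST to the deliver endpoint with a Bearer header.
# -w appends the HTTP status code on its own line so a caller can log 401s.
deliver() {
  local api_key
  api_key="$(resolve_api_key)" || return 1
  curl -sS -X POST \
    -H "Authorization: Bearer ${api_key}" \
    -H "Content-Type: application/json" \
    -w '\n%{http_code}' \
    "${MISSION_CONTROL_URL:-http://localhost:3000}/api/notifications/deliver"
}
```

Validating the key before the first request turns a silent 401 loop into a single startup failure, which is the behavior change the commit message calls out.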
@nnnet nnnet requested a review from 0xNyk as a code owner May 6, 2026 08:32

@0xbrainkid 0xbrainkid left a comment

Thanks for fixing the notification daemon auth path; that specific shell change looks directionally right, and the targeted doctor test passes: pnpm vitest run src/lib/__tests__/openclaw-doctor.test.ts.

Blocking before merge: this PR contains a large amount of unrelated, machine-specific runtime/config material that should not be committed as part of a notification-daemon auth fix. In particular .opencode/opencode.json enables many third-party plugins and includes local absolute paths such as /home/uadmin/.local/bin/codebase-memory-mcp and /mnt/9/gt/.../node_modules/@playwright/mcp/cli.js. Committing that repo-level config will be broken for other contributors and may cause nondeterministic external plugin execution for anyone using opencode in this repo.

Please reduce this PR to the auth fix and directly related docs/tests, or move the opencode/beads/docker/demo additions into separate intentionally reviewed PRs with portable, opt-in config. Once the unrelated machine-local config is removed, I can re-review the notification-daemon change.

Note: full typecheck could not be used as a clean gate in this checkout because the worktree dependency set is missing unrelated deps/types (next-intl, xterm, node-pty, radix/cva), causing broad failures outside this PR.

@0xNyk
Member

0xNyk commented May 7, 2026

Thanks — the scripts/notification-daemon.sh auth fix itself is exactly right (read MC_API_KEY / API_KEY from env, validate up front, pass as Authorization: Bearer, sync the help-text default URL). I'd merge that as a single-file PR tomorrow.

But the title says "authenticate notification daemon requests" and the diff has 57 files / 6,450 added lines covering:

  • .beads/ directory (51 lines, hooks, README, config)
  • .opencode/ directory (237-line opencode config + READMEs)
  • .docker-mask/, .dockerignore, .env.example, .env.openclaw.example, .gitignore
  • .vibe/development-plan-experiment-openclaw-integration.md (555 lines)
  • AGENTS.md (150 lines), Makefile (124 lines), Makefile.legacy (589 lines)
  • 4 Dockerfiles, 3 docker-compose files, nginx config
  • docs/deployment.md (215 lines), telegram-onboarding doc, ops-cheatsheet.md
  • examples/MULTI-PROVIDER-DEMO.md (747 lines), examples/OPENCLAW-INTEGRATION.md (238 lines)
  • scripts/openclaw-auto-approve-control-ui.mjs, scripts/openclaw-auto-pair.py, scripts/openclaw-cli-shim.py
  • src/app/api/agents/[id]/route.ts, src/app/api/sessions/continue/route.ts (+222), src/app/api/sessions/route.ts, src/app/api/settings/route.ts, src/app/api/spawn/route.ts
  • src/components/chat/chat-workspace.tsx, src/components/panels/agent-detail-tabs.tsx, src/components/panels/orchestration-bar.tsx
  • src/lib/claude-sessions.ts, src/lib/openclaw-doctor.ts, src/lib/security-scan.ts, src/lib/task-dispatch.ts (+280)
  • src/proxy.ts

This PR is structurally three or four other PRs stacked together (and most of those overlap with #647, #648, #649 which I just reviewed/merged separately).

Could you split this into:

  1. fix(scripts): authenticate notification daemon requests — just scripts/notification-daemon.sh and any minimal supporting changes (.env.example if you want to add MC_API_KEY). I'd merge that immediately.
  2. The .beads/ / .opencode/ / .vibe/ / AGENTS.md additions — these look like personal/local tooling configs that probably shouldn't ship at all (and are already in your other compose PRs' .gitignore).
  3. Whatever's left over after merging #647 (fix(chat,csp): nonce hydration + chat session continuity with host Claude CLI), #648 (feat(dispatch): direct multi-provider dispatch (Anthropic / OpenAI / local OpenAI-compatible)), and #649 (feat(openclaw): additive Docker integration with env-driven hardening + doctor cleanup). Please rebase first to see what actually remains as new work.

Closing pending the focused split.

@0xNyk 0xNyk closed this May 7, 2026