fix(scripts): authenticate notification daemon requests#653
Make MC runnable as a single self-contained container that can drive the
operator's host-installed Claude Code / Codex / OpenCode CLIs without
forcing them to re-authenticate inside the container, and without the
OpenClaw gateway (which is macOS-only and not available on Linux hosts).
Changes:
- Dockerfile
* Bake claude / codex CLIs into the image as a fallback so the Settings
panel reports "Installed" even before the host bind-mounts attach. The
host's /home/<user>/.local/bin gets first place in PATH, so an
authenticated host install transparently shadows the baked one.
* Add tmux + jq to the runtime image: required by /chat (PTY terminals)
and by various agent runtime probes.
* Reuse the slim image's existing uid 1000 user, renaming it to
`nextjs`. This means bind-mounted host files (typical Linux uid 1000)
are read/written without chown.
- docker-compose.yml
* `user: "1000:1000"` so files written into bind-mounted host dirs
stay owned by the host user.
* Project the host user's $HOME into the container (`.local/bin`,
`.bun`, `.claude`, `.claude.json`, `.local/share/claude`) so the
container sees the same authenticated CLIs the operator uses on
the host.
* Mount /mnt and ${HOME} verbatim so file paths the user sees on
the host work identically inside the container.
* Bump memory limit to 2G (Next.js 16 + node-pty + the task-dispatch
loop OOM at the upstream 512M when /chat opens a terminal).
* Wire ANTHROPIC_API_KEY and OPENAI_API_KEY from .env so the direct-API
dispatch path works without a gateway.
* Make NEXT_PUBLIC_CHAT_POLL_INTERVAL_MS a build-time variable. Default
1000ms here gives near-live transcript updates while claude writes
the jsonl line-by-line.
* Add MC_HOST_SESSION_MODE env (see follow-up commit).
- Makefile (new)
Replaces the previous mc-up.sh / mc-reset-db.sh helpers with a single
entrypoint for the operator: `up`, `down`, `restart`, `recreate`,
`build`, `rebuild` (no-cache), `ps`, `logs`, `shell`, `status`,
`wait-ready`, `reset-db`, `nuke`. URL is fixed to 127.0.0.1:7012.
- .gitignore
Ignore WALKTHROUGH.md (operator-local notes file kept on disk, never
committed).
The strict CSP set in src/proxy.ts uses `script-src 'nonce-X' 'strict-dynamic'`
which blocks every inline <script> that does not carry the matching nonce.
next-themes injects a tiny inline script at render time to set the
light/dark class before first paint (anti-FOUC). Without an explicit
`nonce` prop it ships that script with no nonce attribute, so the
browser blocks it. Strict-dynamic then refuses to load any further
script (because the boot script is the trust root), and React never
hydrates — the chat input field has no working Send button, fetch from
client code throws TypeError, etc.
Fix: pass the per-request nonce (already available in layout via the
x-nonce request header that the proxy/middleware sets) into
ThemeProvider's `nonce` prop. next-themes >=0.4 propagates this to
its inline script tag.
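A minimal sketch of the nonce check described above. This is a simplified
model of the browser's decision, not the code in src/proxy.ts; the function
names `buildScriptSrc` and `inlineScriptAllowed` are hypothetical.

```typescript
// Build the script-src directive for one request (shape taken from the text above).
function buildScriptSrc(nonce: string): string {
  return `script-src 'nonce-${nonce}' 'strict-dynamic'`;
}

// Simplified model: an inline <script> runs only if its nonce attribute
// matches the nonce carried by the policy. A script without a nonce is blocked.
function inlineScriptAllowed(policy: string, scriptNonce: string | undefined): boolean {
  const match = /'nonce-([^']+)'/.exec(policy);
  return match !== null && scriptNonce !== undefined && match[1] === scriptNonce;
}

const demoPolicy = buildScriptSrc("abc123");
// Theme script rendered without a nonce prop: blocked.
console.log(inlineScriptAllowed(demoPolicy, undefined));
// Theme script rendered with the per-request nonce: allowed.
console.log(inlineScriptAllowed(demoPolicy, "abc123"));
```

This is why forwarding the x-nonce value into ThemeProvider is enough: the
inline script then carries the same nonce the policy names.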
Also leaves two `// console.log('[DEBUG csp] ...')` lines (commented out)
in proxy.ts and layout.tsx — useful for the next person who has to debug
the nonce flow end-to-end.
When MC runs in Docker and the operator already has a live `claude` CLI
on the host attached to a project session, /chat → "Send" used to fail
with "spawn claude ENOENT" or "No conversation found" depending on
which root cause was hit first. This commit makes the endpoint actually
usable for that shared-session case.
What was broken:
1. `claude --resume <id>` only finds the transcript if the process cwd
matches the project path encoded in `~/.claude/projects/<encoded>/`.
MC's runCommand defaulted to /app inside the container, so resume
silently picked up no conversation.
2. Decoding the directory name back to a path is unreliable: claude
collapses both `/` and `_` to `-`, so `/foo/bar_baz` and
`/foo/bar/baz` both round-trip to `-foo-bar-baz`. Naively decoding
gave wrong cwd → cd failed → claude never ran.
3. Direct `spawn('claude', ...)` from the Next.js standalone server
process produced ENOENT even though the binary was reachable from
`which claude` and `node -e 'spawn(...)'` in the same container.
Likely a Next 16 standalone runtime / process.env quirk; the exact
root cause was not pinned down.
4. There was no policy for the operator-visible race when MC --resumes
into a session that already has a live host CLI writing to the
same jsonl.
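To make item 2 concrete, here is a small sketch of the collision. The
`encodeProjectPath` helper is a hypothetical re-implementation of the
encoding as described above (both `/` and `_` become `-`), not the claude
CLI's actual code.

```typescript
// Hypothetical model of the directory-name encoding: "/" and "_" both collapse to "-".
function encodeProjectPath(projectPath: string): string {
  return projectPath.replace(/[\/_]/g, "-");
}

const encodedUnderscore = encodeProjectPath("/foo/bar_baz");
const encodedSlash = encodeProjectPath("/foo/bar/baz");

// Two different project paths produce the same directory name, so the
// encoding cannot be inverted without extra information.
console.log(encodedUnderscore, encodedSlash, encodedUnderscore === encodedSlash);
```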
Fix:
- resolveClaudeSessionCwd() now reads the actual `cwd` field from the
first JSON line of the session jsonl. That field is authoritative
and survives any encoding ambiguity.
- resolveExecutable() walks PATH with fs.access X_OK to pin the
absolute path of `claude` before spawn, eliminating bare-name
resolution as a variable.
- Spawn goes through `sh -c "cd <cwd> && exec <bin> --print --resume
<id>"` instead of direct execvp. This sidesteps the ENOENT
observed under Next 16 standalone, and shQuote() keeps the cd /
bin paths safe.
- New env MC_HOST_SESSION_MODE controls how MC handles a session
that may have a live host CLI:
    coexist (default)  both write to the same jsonl. Each side
                       picks up the other's writes on its next prompt.
    block-active       return 409 if the jsonl was modified less than
                       60s ago, so the operator only --resumes idle
                       sessions.
    nudge              coexist + best-effort utimes() after the
                       reply, so a tail-watching host CLI sees a fresh
                       mtime.
- TODO marker for the proper fix to "long wait on Send": replace
this blocking `claude --print` call with an SSE endpoint backed
by `--output-format stream-json`, so the chat UI can render
tokens incrementally.
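The pieces of the fix can be sketched as pure functions. This is an
illustration under assumptions, not the code in this commit: the helper
names `cwdFromTranscript`, `buildResumeCommand` and `shouldBlockResume` are
hypothetical, and `shQuote` here is one common single-quote escaping scheme
that may differ from the real helper.

```typescript
type HostSessionMode = "coexist" | "block-active" | "nudge";

// Read the `cwd` field from the first JSON line of a session transcript.
function cwdFromTranscript(jsonl: string): string | null {
  const firstLine = (jsonl.split("\n", 1)[0] ?? "").trim();
  if (!firstLine) return null;
  try {
    const parsed = JSON.parse(firstLine) as { cwd?: unknown };
    return typeof parsed.cwd === "string" ? parsed.cwd : null;
  } catch {
    return null; // first line is not valid JSON
  }
}

// POSIX single-quote escaping: close the quote, emit \', reopen the quote.
function shQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Command string handed to `sh -c`, mirroring the shape quoted above.
function buildResumeCommand(cwd: string, bin: string, sessionId: string): string {
  return `cd ${shQuote(cwd)} && exec ${shQuote(bin)} --print --resume ${shQuote(sessionId)}`;
}

// block-active refuses (HTTP 409 in the endpoint) when the transcript was
// modified within the window; the other modes never block.
function shouldBlockResume(
  mode: HostSessionMode,
  mtimeMs: number,
  nowMs: number,
  windowMs: number = 60000,
): boolean {
  return mode === "block-active" && nowMs - mtimeMs < windowMs;
}

const demoCwd = cwdFromTranscript('{"cwd":"/foo/bar_baz","type":"summary"}\n{"type":"user"}');
if (demoCwd !== null) {
  console.log(buildResumeCommand(demoCwd, "/usr/local/bin/claude", "session-123"));
}
```

Quoting every interpolated value keeps a cwd containing spaces or quotes
from being split or reinterpreted by the shell.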
Two related issues caused the /chat session list to drift from reality.
1. Upstream considered any jsonl touched in the last 90 minutes as
"active". For an operator with several claude sessions across projects
this surfaces almost everything as live, including ones they finished
with hours of think-time ago. Tightened to 15 minutes in both layers
(the scanner-side `ACTIVE_THRESHOLD_MS` and the API-side derived
recovery window `LOCAL_SESSION_ACTIVE_WINDOW_MS`) so they stay coherent.
2. Once a row was inserted into `claude_sessions`, it lived forever even
after the underlying jsonl was deleted. The API still surfaced those
orphans because the derived-active recovery in /api/sessions would
happily flag them as "active" off the stale `last_message_at` column.
After a scan cycle, also DELETE rows whose session_id is not present in
the freshly-scanned set. The result message now reports
`removed N orphan(s)` when this happens.
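A rough sketch of both rules, assuming the stored and scanned session ids
are available as plain arrays. `findOrphanSessionIds` and `isSessionActive`
are hypothetical names, and the strict `<` comparison is an assumption.

```typescript
// 15-minute activity window, per the description above.
const ACTIVE_THRESHOLD_MS = 15 * 60 * 1000;

// A session counts as active if its jsonl was touched within the window.
function isSessionActive(lastTouchedMs: number, nowMs: number): boolean {
  return nowMs - lastTouchedMs < ACTIVE_THRESHOLD_MS;
}

// Ids present in the table but missing from the freshly-scanned set
// are orphans and are candidates for DELETE.
function findOrphanSessionIds(storedIds: string[], scannedIds: string[]): string[] {
  const scanned = new Set(scannedIds);
  return storedIds.filter((id) => !scanned.has(id));
}

const orphanIds = findOrphanSessionIds(["s1", "s2", "s3"], ["s1", "s3"]);
console.log(`removed ${orphanIds.length} orphan(s)`);
```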
Three small UX fixes in the chat workspace:
- Stop excluding `session:*` conversations from the polling fallback.
Without this, `/chat` would freeze until the operator hit F5 every time
SSE dropped. The polling cadence is parameterized via
NEXT_PUBLIC_CHAT_POLL_INTERVAL_MS (default 1500ms in code, 1000ms in the
docker-compose file) so it can be tuned without touching the component.
- Drop the standalone "lastReply" panel that used to show the most recent
assistant message above the input row. It duplicated the transcript and,
on long replies, pushed the input field below the viewport. The reply now
lands in the transcript via onRefreshTranscript(): same SessionMessage
rendering, same formatting, same scroll behaviour as the rest of the
conversation. This keeps the prompt input row anchored to the bottom
regardless of reply size, matching how the host claude CLI itself
displays things.
- Leave the [DEBUG chat] console.log lines commented in place. These were
how we traced the CSP/spawn pipeline end-to-end and the next person
debugging a regression here will want them.
Self-contained Docker stack + /chat fixes for shared host claude sessions
When the OpenClaw gateway isn't available (Linux: it's macOS-only), the
direct-API dispatch path was Anthropic-only — agents configured for OpenAI
or a local model couldn't run. This adds two more direct providers without
touching the existing Anthropic path.
Routing is by `dispatchModel` prefix on the agent config:
    `claude-*`, `anthropic/*`                         → ANTHROPIC_API_KEY
    `gpt-*`, `o1-*`, `o3-*`, `openai/*`               → OPENAI_API_KEY
    `local/*`, `ollama/*`, `lmstudio/*`, `litellm/*`  → LOCAL_LLM_ENDPOINT
                                                        (+ optional LOCAL_LLM_API_KEY)
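The routing table can be sketched as a prefix lookup. This is an
illustration only: `providerForModel` and `PROVIDER_PREFIXES` are
hypothetical names, and the real `callDirectly()` may resolve providers
differently.

```typescript
type Provider = "anthropic" | "openai" | "local";

// Prefixes copied from the routing table above.
const PROVIDER_PREFIXES: Array<[Provider, string[]]> = [
  ["anthropic", ["claude-", "anthropic/"]],
  ["openai", ["gpt-", "o1-", "o3-", "openai/"]],
  ["local", ["local/", "ollama/", "lmstudio/", "litellm/"]],
];

// Return the provider whose prefix matches, or null when nothing matches.
function providerForModel(dispatchModel: string): Provider | null {
  for (const [provider, prefixes] of PROVIDER_PREFIXES) {
    if (prefixes.some((prefix) => dispatchModel.startsWith(prefix))) {
      return provider;
    }
  }
  return null;
}

console.log(providerForModel("gpt-4o-mini"));
console.log(providerForModel("lmstudio/some-local-model"));
```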
The "local" path is intentionally generic — it speaks the OpenAI
`/v1/chat/completions` REST shape, which is what LMStudio, Ollama, vLLM,
and liteLLM proxies all expose. Operators who run several local backends
behind a single liteLLM endpoint can point LOCAL_LLM_ENDPOINT at it and
fan everything out from one place; the rest of MC stays unchanged.
Both Aegis review and the main task dispatch now go through `callDirectly()`
which picks the provider. The Aegis review path also now passes through
the agent's `agent_config` so per-agent `dispatchModel` overrides reach
the reviewer (previously hardcoded to `null`, which forced Anthropic).
Token usage from OpenAI-compatible responses (`usage.prompt_tokens` /
`usage.completion_tokens`) is recorded in the same `token_usage` table the
Anthropic path uses, with the model id verbatim — so cost reports cover
all three providers without further plumbing.
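A small sketch of the usage mapping, assuming missing counters default to
zero. The `TokenUsageRecord` field names are hypothetical and do not claim
to match the actual `token_usage` columns.

```typescript
// Usage block as returned by OpenAI-compatible /v1/chat/completions responses.
interface OpenAiCompatibleUsage {
  prompt_tokens?: number;
  completion_tokens?: number;
}

// Hypothetical in-memory shape of one usage record.
interface TokenUsageRecord {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Keep the model id verbatim; treat absent counters as 0 (an assumption).
function toTokenUsageRecord(model: string, usage?: OpenAiCompatibleUsage): TokenUsageRecord {
  return {
    model,
    inputTokens: usage?.prompt_tokens ?? 0,
    outputTokens: usage?.completion_tokens ?? 0,
  };
}

console.log(toTokenUsageRecord("lmstudio/some-local-model", { prompt_tokens: 120, completion_tokens: 45 }));
```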
`docker-compose.yml` adds `LOCAL_LLM_ENDPOINT` (default
`http://host.docker.internal:1234/v1`, LMStudio's stock listener on the
docker host) and `LOCAL_LLM_API_KEY`.
…host session modes
Cover the operator-path features from the docker-stack PR and this branch:
- New "Self-contained Operator Setup" section explaining host $HOME
bind-mounts, baked CLI fallback, the uid-1000 image constraint and the
workaround for non-1000 hosts, the 2G memory floor, and why Makefile
defaults to MC_PORT=7012.
- "Direct API dispatch" subsection with the dispatchModel → provider
routing table (Anthropic / OpenAI / OpenAI-compatible local via LMStudio,
Ollama, or liteLLM proxy).
- "Shared host Claude Code session" subsection describing the
MC_HOST_SESSION_MODE env (coexist / block-active / nudge).
- Environment Variables table extended with MC_PORT, ANTHROPIC_API_KEY,
OPENAI_API_KEY, LOCAL_LLM_ENDPOINT, LOCAL_LLM_API_KEY,
MC_HOST_SESSION_MODE, NEXT_PUBLIC_CHAT_POLL_INTERVAL_MS.
No existing content removed or contradicted; all additions go before
"Production Hardening" so the canonical hardened-compose flow is unchanged.
Step-by-step demo showing how to wire a 4-agent team with three different
providers (architect on Claude Opus, implementor on gpt-4o-mini, linter on
a local model via LMStudio, Aegis reviewer on Claude Sonnet) and run a
single master task end to end through the dispatch / review loop.
The walkthrough is intentionally exhaustive on per-field values (Display
Name, Role, Soul, Settings → Agent Runtimes, dispatchModel, temperature,
sandbox/network options) so an operator can copy-paste through it without
guessing what the form expects.
Includes:
- .env preparation for all three providers
- Workspace + project setup
- Per-agent screens with full Soul prompt text
- Master task description (login → JWT migration) the architect decomposes
- Acceptance checklist (where to look, what to expect)
- Troubleshooting for the common LMStudio / OPENAI_API_KEY / Aegis failure
modes
- Variants for Ollama, liteLLM proxy, Anthropic-only setups
Lives under examples/ so it stays out of the production build but is
discoverable next to the source.
Add `make dev` workflow that bind-mounts src/, public/, messages/, and
configs into a Node 22 container, so day-to-day .ts/.tsx edits hit
Turbopack hot-reload without rebuilding the image. Image is rebuilt only
when package.json / pnpm-lock.yaml / Dockerfile.dev change.
The dev compose shares the same `mission-control_mc-data` volume as
production so admin user / workspaces / projects / agents created via
`make up` are visible in `make dev` and vice versa (still single-writer
SQLite, so only run one stack at a time).
Targets: dev, dev-down, dev-build, dev-rebuild, dev-logs, dev-shell, dev-ps.
The dispatch path silently fell back to `runOpenClaw` even when no gateway
was actually installed, producing `spawn openclaw ENOENT` on every tick
until the task was failed after 5 retries. This change adds the missing
gateway-availability evidence checks and a CLI fallback for the Claude
provider so MC functions standalone on Linux/Docker hosts where
OpenClaw is not installed.
Changes in src/lib/task-dispatch.ts:
- isGatewayAvailable() now requires physical evidence: either a real
openclaw.json on disk, or a registered gateway row whose status is in
the healthy set (online/healthy/ready). Previously a truthy default
config path was sufficient, and an onboarding-seeded gateway row with
status='unknown' falsely satisfied the check.
- Added isClaudeCliAvailable() + callClaudeViaCli() — when the host
Claude Code CLI is bind-mounted into the container (~/.local/bin/claude
with ~/.claude.json), the anthropic provider routes through it via
spawn('claude', ['--print', '--output-format', 'json', '--model', X]).
This uses the operator's existing login/plan with no API key.
- callDirectly() prefers CLI for anthropic; falls back to direct API
only if CLI is absent.
- isDirectDispatchAvailable('anthropic') is true when EITHER an API key
OR the CLI is present.
- requeueStaleTasks() now skips the offline-check entirely when MC is in
direct-API mode, since direct-API agents have no heartbeat by design
and would otherwise be failed after 5 stale-cycles before any dispatch
could run.
- scoreAgentForTask() lifts the offline/error/sleeping rejection in
direct-API mode for the same reason.
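The availability rules above reduce to two small predicates. This sketch
takes the evidence as plain inputs instead of touching the filesystem or
the database; `gatewayAvailable` and `pickAnthropicTransport` are
hypothetical names, not the functions in task-dispatch.ts.

```typescript
// Statuses counted as healthy, per the description above.
const HEALTHY_GATEWAY_STATUSES = new Set(["online", "healthy", "ready"]);

interface GatewayEvidence {
  configFileExists: boolean; // a real openclaw.json was found on disk
  gatewayStatuses: string[]; // status of each registered gateway row
}

// A default config path or a row with status 'unknown' is not evidence.
function gatewayAvailable(evidence: GatewayEvidence): boolean {
  return (
    evidence.configFileExists ||
    evidence.gatewayStatuses.some((s) => HEALTHY_GATEWAY_STATUSES.has(s))
  );
}

// Prefer the CLI for anthropic; use the direct API only if the CLI is absent.
function pickAnthropicTransport(cliPresent: boolean, apiKeyPresent: boolean): "cli" | "api" | null {
  if (cliPresent) return "cli";
  if (apiKeyPresent) return "api";
  return null;
}

// An onboarding-seeded row with status 'unknown' no longer satisfies the check.
console.log(gatewayAvailable({ configFileExists: false, gatewayStatuses: ["unknown"] }));
console.log(pickAnthropicTransport(true, true));
```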
Changes in src/app/api/agents/[id]/route.ts:
- Save flow no longer attempts to write gateway_config when openclaw.json
doesn't exist — guards prevent the ENOENT error that caused agent
edits to be silently reverted on Linux without OpenClaw.
Changes in src/lib/claude-sessions.ts:
- Oversized session jsonl files are logged once per process per filePath
at INFO instead of WARN-spammed on every 30s sync tick.
Changes in src/components/panels/agent-detail-tabs.tsx:
- Defensive coercion in 4 places against nested model.primary objects
({primary: {primary: "..."}}), which crashed Config tab with React
error #31 when MC and gateway disagreed on the model schema.
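One way to write such a coercion, as a sketch. `coerceModelName` is a
hypothetical helper and the depth limit is an assumption; the actual guards
in agent-detail-tabs.tsx may look different.

```typescript
// Unwrap nested {primary: ...} objects until a string is found, so a
// component never tries to render an object as a React child.
function coerceModelName(value: unknown, maxDepth: number = 5): string {
  let current: unknown = value;
  for (let depth = 0; depth < maxDepth; depth++) {
    if (typeof current === "string") return current;
    if (current !== null && typeof current === "object" && "primary" in current) {
      current = (current as { primary: unknown }).primary;
    } else {
      break;
    }
  }
  // Anything that is still not a string falls back to an empty label.
  return typeof current === "string" ? current : "";
}

console.log(coerceModelName({ primary: { primary: "claude-sonnet" } }));
console.log(coerceModelName("gpt-4o-mini"));
```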
…through
Rewrite examples/MULTI-PROVIDER-DEMO.md from a high-level outline into a
13-section step-by-step guide a non-MC operator can follow front-to-back
without needing to read the source. Each UI step lists the exact label,
the value to enter, and the expected result.
Sections:
- 0 prepare .env (API keys, host session mode)
- 1 create workspace (8 fields, OpenClaw template note)
- 2 add project (with expandable card)
- 3-6 create 4 agents via the 3-step wizard with full text:
Architect (Claude Opus), Aegis (Sonnet), Dev (OpenAI), Linter (Local LLM)
- 7 master task with full description copy
- 7.A explain Owner / "Awaiting Owner" gate
- 8 pipeline execution + how to inspect results
Adds a hard-won note at the top: in direct-API mode the agents stay in
'offline' status by design (no heartbeat), but tasks still dispatch via
the direct provider — this is normal and not a failure mode.
Add a parallel docker-compose stack that brings up the OpenClaw gateway
daemon (github.com/openclaw/openclaw) on host ports 18789/18790 — the
same defaults MC has long expected via OPENCLAW_GATEWAY_HOST/PORT in
docker-compose-dev.yml. The integration is strictly additive:
- docker-compose-openclaw.yml is independent of docker-compose.yml and
docker-compose-dev.yml. MC reaches openclaw via host.docker.internal,
no shared network or compose merging required.
- When openclaw is up and registered in MC's gateways table with
status='online', isGatewayAvailable() returns true and dispatch
automatically routes through runOpenClaw — agents get persistent
PTY-backed sessions with full tool-use.
- When openclaw is down (or never started), MC silently falls back to
the direct-API/CLI path introduced in the previous commit. Operators
can adopt or remove the gateway without changing any MC code.
Makefile targets:
    openclaw-clone     git-clones github.com/openclaw/openclaw to ./openclaw-src
    openclaw-build     docker builds the gateway image (5-10 min, one-time)
    openclaw-up        starts the gateway daemon
    openclaw-down      stops it (MC keeps running on direct-API fallback)
    openclaw-restart / -logs / -ps / -status
    openclaw-onboard   runs the upstream interactive provider/skills wizard
    openclaw-shell     drops into the CLI sidecar
    openclaw-doctor    runs 'openclaw doctor' for diagnostics
    openclaw-token     prints the auto-generated gateway token (for MC .env)
Files:
    docker-compose-openclaw.yml        gateway + cli sidecar, /healthz, persistent
                                       .openclaw-data volume, host.docker.internal
                                       alias
    .env.openclaw.example              minimal env template (token + provider keys)
    examples/OPENCLAW-INTEGRATION.md   9-step walkthrough: clone, build,
                                       onboard, register gateway in MC UI,
                                       verify dispatch path, rollback procedure
    .gitignore                         ignore /openclaw-src/, /.openclaw-data/,
                                       /.env.openclaw, /.idea/, /.vscode/
…s, openclaw additive integration
Add openclaw npm package to the dev image so
src/lib/command.ts:runOpenClaw finds a binary in PATH when MC is configured
to dispatch through a sibling openclaw-gateway container. Also propagate
OPENCLAW_GATEWAY_URL, OPENCLAW_GATEWAY_TOKEN, and
OPENCLAW_ALLOW_INSECURE_PRIVATE_WS into the dev container so the CLI knows
where the gateway lives.
Status of the experiment (see docs/openclaw-experiment-notes.md to follow):
- openclaw gateway daemon comes up via docker-compose-openclaw.yml
- /healthz is reachable from MC, gateway-row in MC db transitions to
status='online', isGatewayAvailable() returns true
- HTTP /v1/chat/completions with Bearer-token auth WORKS: no pairing
required, full operator scope. gpt-5-mini round-trip confirmed.
- The CLI/WebSocket path used by runOpenClaw requires per-client device
pairing approval (operator.admin scope), which is non-trivial in a
cross-container topology.
Next step is to add an HTTP fallback to task-dispatch.ts so MC routes via
/v1/chat/completions when the gateway is online and a token is available,
before falling back to the existing runOpenClaw CLI path or the
direct-API/CLI path.
End-to-end success: MC dispatches LOGIN-001 through openclaw gateway with
ZERO MC code changes. Path:
MC task assigned → dispatchAssignedTasks → runOpenClaw → openclaw CLI
→ ws://host.docker.internal:18789 → gateway → openai/gpt-5-mini
→ 4322 chars resolution → MC tasks.resolution → status='review'
Total wall-clock: ~70s.
This commit consolidates the third-variant integration approach (shared
docker bridge, separate compose files, persistent pair state):
scripts/openclaw-auto-pair.py
Patches openclaw's pending → paired pairing files transactionally on
the host filesystem. Pairing tokens are documented as 32-byte base64url
random secrets in openclaw-src/src/infra/pairing-token.ts (no crypto
signing), so producing them client-side is safe and produces the same
end-state as the official `openclaw devices approve` flow which is
unavailable in our cross-container topology (operator.admin scope is
not held by any auto-paired loopback CLI).
Idempotent: if MC's deviceId is already in paired.json with a matching
token in MC's device-auth.json, the script exits 0 without changes.
DeviceId+publicKey match check ensures we only approve the specific
pending request that matches MC's identity, not arbitrary pending
entries.
docker-compose-dev.yml
Bind-mount ./.mc-openclaw/:/home/nextjs/.openclaw — gives MC's openclaw
CLI a stable identity (private key, deviceId, paired token) that
survives container recreate / dev-rebuild. Without this, every restart
generates a new deviceId and re-pairing is needed.
Makefile (openclaw-pair-mc target)
One-shot wrapper: trigger pairing request from MC, run auto-pair.py,
bind MC agent display names ("Architect (Claude Opus)") to openclaw
agent ids declared in openclaw.json ("architect"), verify with a
health call. Agent binding is config-only (writes agents.config JSON,
no code change).
Makefile (openclaw-unpair-mc target)
Tear-down counterpart (requires CONFIRM=yes): clears MC-side identity files
and removes MC's paired entry from the gateway side. Useful when
experimenting or rotating identities.
examples/OPENCLAW-INTEGRATION.md
Adds Step 7.5 (Auto-pair) with explanation of why pairing is needed
and how the script bypasses the standard interactive admin-approval
flow safely.
.gitignore
Adds /.mc-openclaw/ to the openclaw-related local-only paths.
What is NOT changed:
- src/lib/task-dispatch.ts: untouched. The existing runOpenClaw path
works as designed once a paired CLI is in place.
- src/lib/command.ts: untouched.
- src/lib/openclaw-gateway.ts: untouched.
- Any MC API route, scheduler, or UI component: untouched.
Verified test sequence (LOGIN-001 dispatch through gateway):
$ make openclaw-up && make dev
$ make openclaw-pair-mc # one-shot, idempotent
$ # ...drag LOGIN-001 in UI or reset to status='assigned'
$ # within 60s task transitions through in_progress to review
$ # tasks.resolution contains the gpt-5-mini agent's response
…builds for upgrades
Refactor both stacks so openclaw runs from a bind-mounted ./openclaw-src/
clone instead of a baked image. Updating openclaw is now `git pull` plus
one builder run, no docker image rebuild required.
Architecture before:
- openclaw-gateway: built from openclaw-src via Dockerfile (baked dist
+ node_modules into image, ~3.5 GB image). Update = full image rebuild.
- MC dev: `npm install -g openclaw` baked at image build time. Pinned to
whatever was on npm at build time. Update = full image rebuild.
- Result: MC frontend showed "update available v2026.4.27 (installed
v2026.4.26)" because the gateway was rebuilt from a fresh clone but MC
CLI was still on an older npm publish.
Architecture after:
- Stock node:24-bookworm image for both gateway and CLI sidecar — no
custom build needed.
- ./openclaw-src/ bind-mounted at /app inside both. The gateway runs
`node /app/dist/index.js gateway`. dist + node_modules live on the
host under openclaw-src/ (~3.1 GB, .gitignored).
- New `openclaw-builder` compose service (profile=build) compiles dist
+ installs node_modules into the bind-mount via a one-shot container
that has bun + pnpm. First run ~5 min, incremental rebuilds faster.
- MC dev: dropped `npm install -g openclaw` from Dockerfile.dev. A shim
at /usr/local/bin/openclaw runs `node /opt/openclaw-src/dist/index.js`
where /opt/openclaw-src is a read-only bind-mount of the same host
clone. Both gateway and MC CLI always run the exact same dist.
- Both containers run as uid 1000:1000 so files written into bind-mounts
are readable on the host without sudo.
- Plugin runtime stage moved from a Docker named volume (root-owned by
default, blocked uid 1000) into ./.openclaw-data/plugin-runtime-deps/
so it inherits the same uid 1000 ownership as the rest of state.
Update workflow:
$ make openclaw-update
→ git pulls openclaw-src to latest, runs openclaw-builder to recompile
dist into the bind-mount, restarts the gateway. Takes ~30s-2min for
incremental updates. No docker rebuild, no image churn.
Verified after refactor:
- MC CLI and gateway both report v2026.4.27 (the same dist).
- LOGIN-001 dispatched end-to-end via openclaw gateway in 61s,
outcome=success, resolution=3538 chars from openai/gpt-5-mini, then
transitioned into quality_review (Aegis cycle picked it up).
- make openclaw-pair-mc remained idempotent across the recreate
(paired identity persisted in ./.mc-openclaw/).
Two operator-visible regressions after the openclaw integration:
1. WebSocket spam in /logs:
"Handshake failed on root path. Retrying WebSocket via /gateway-ws."
"Max reconnection attempts reached."
The MC frontend reads gateway.host from the gateways table to build
the WebSocket URL. Our row stored "host.docker.internal" (correct for
the MC backend's HTTP probe) but a browser running on the host can't
resolve that name — it's only injected into containers via Docker's
`extra_hosts: host-gateway` mapping. The browser was hitting
ws://host.docker.internal:18789, getting ENOTFOUND/refused, and
exhausting the reconnect budget every few seconds.
Fix: set NEXT_PUBLIC_GATEWAY_URL=ws://127.0.0.1:18789 in
docker-compose-dev.yml. The /api/gateways/connect route already
honours this env var as a browser-facing override (see
src/app/api/gateways/connect/route.ts:156). MC backend continues to
probe the gateway through host.docker.internal:18789 unchanged.
2. Orchestration → Command tab dropdown showing agents but not letting
any be selected:
<option ... disabled={!a.session_key}>
(src/components/panels/orchestration-bar.tsx:275)
Our agents were created in MC's setup wizard with session_key=null,
and the Command UI disables every option whose session_key is empty. The
/agents page list looked correct but every option was greyed out.
Fix: make the openclaw-pair-mc target also write session_key="mc-<id>"
for each MC agent it knows about (architect/aegis/dev/linter). The
value lines up with the openclaw agent ids declared in openclaw.json,
so the dropdown becomes selectable and downstream openclaw routing has
a valid identifier to work with.
Also: tighten the venv probe in the openclaw-pair-mc target so it falls
back to system python3 cleanly when ./.venv is absent, instead of
printing a stderr warning.
… shim
MC's source uses an older openclaw CLI shape (`gateway sessions_send
--session X --message Y`) for the wake-agent and agent-message endpoints.
That shape is gone in openclaw 2026.4.x — `sessions_send` is no longer a
gateway subcommand, only an internal RPC name. The current public RPC for
sending into a session is `chat.send`, which additionally requires an
`idempotencyKey` and a `deliver` flag.
Symptom: clicking "Wake up" on a Linter (offline) agent in /agents
returned `Failed to wake agent`. The MC frontend Orchestration → Command
tab Send button would also fail silently with the same root cause.
Fix without modifying MC code: route every `openclaw` invocation in the
MC dev container through `scripts/openclaw-cli-shim.py`, which detects
known retired shapes and rewrites them on the fly:
legacy: gateway sessions_send --session X --message Y
modern: gateway call chat.send --params '{
"sessionKey":"X","message":"Y",
"idempotencyKey":"mc-shim-<pid>-<ts>","deliver":false
}' --json
Also handled: sessions_history (-> sessions.history with key) and
sessions_list (-> sessions.list).
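The `sessions_send` rewrite can be sketched as an argv transformation. The
real shim is a Python script; this TypeScript version only illustrates the
mapping shown above, and `rewriteLegacyArgs` is a hypothetical name. The
pid and timestamp are passed in so the result is deterministic.

```typescript
// Rewrite the retired `gateway sessions_send` shape into `gateway call chat.send`.
// Any other invocation is returned untouched (pass-through).
function rewriteLegacyArgs(argv: string[], pid: number, timestamp: number): string[] {
  if (argv[0] !== "gateway" || argv[1] !== "sessions_send") return argv;

  // Return the value following a flag, or undefined if the flag is absent.
  const flagValue = (flag: string): string | undefined => {
    const index = argv.indexOf(flag);
    return index >= 0 ? argv[index + 1] : undefined;
  };

  const sessionKey = flagValue("--session");
  const message = flagValue("--message");
  if (sessionKey === undefined || message === undefined) return argv;

  const params = {
    sessionKey,
    message,
    idempotencyKey: `mc-shim-${pid}-${timestamp}`,
    deliver: false,
  };
  return ["gateway", "call", "chat.send", "--params", JSON.stringify(params), "--json"];
}

console.log(
  rewriteLegacyArgs(["gateway", "sessions_send", "--session", "X", "--message", "Y"], 4242, 1700000000),
);
```

Passing the params as a single JSON argument avoids any shell quoting of
the message text, since the argv array is handed to exec as-is.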
The shim is bind-mounted from ./scripts/openclaw-cli-shim.py into the
container at /usr/local/lib/openclaw-cli-shim.py via docker-compose-dev.yml,
so edits to the rewriter don't need an image rebuild — same live-update
ergonomics we now have for openclaw-src/dist itself.
Verified:
$ curl -X POST .../api/agents/Linter%20%28Local%20LLM%29/wake -d '{"message":"Wake up Linter"}'
{"success":true,"session_key":"mc-linter","stdout":"{\"runId\":\"mc-shim-...\",\"status\":\"started\"}"}
Pass-through of unknown shapes (`gateway call ...`, `agent ...`,
`devices ...`, etc.) is unchanged — the shim just `os.execvp`s straight
to the real `node /opt/openclaw-src/dist/index.js` for those.
…Send
Two operator-visible issues seen on /agents and the Orchestration → Command
tab after the openclaw integration:
1. /agents → Command → Send returned "Validation failed" for any message.
Root cause is a pre-existing MC bug, not openclaw-related: the form in
src/components/panels/orchestration-bar.tsx:94 sends
{ to, content: message, from }
while POST /api/agents/message validates against
{ to, message, from } // src/lib/validation.ts:137
So zod rejects every payload as "Validation failed" before the request
ever reaches the agent. Strictly the smallest possible MC code edit:
rename the request field `content` -> `message` to match the API
contract. The Wake button on /agents already worked because it goes
through a separate endpoint that reads `body.message` itself.
2. /logs continued to spam:
"WebSocket error occurred"
"Max reconnection attempts reached. Please reconnect manually."
even after fixing NEXT_PUBLIC_GATEWAY_URL. Gateway logs show why:
[ws] closed before connect ... peer=192.168.48.1 remote=192.168.48.1
The browser opens a WS to ws://127.0.0.1:18789, but from the gateway's
side that connection arrives from the docker bridge IP, NOT loopback,
so openclaw treats the browser as an unpaired device and aborts the
handshake with code 1006. There's no clean way to pair every browser
session non-interactively.
Fix: set NEXT_PUBLIC_GATEWAY_OPTIONAL=true in docker-compose-dev.yml.
The MC websocket client already has special-case behaviour for that
flag (src/lib/websocket.ts:771): it gives up reconnecting silently and
falls back to HTTP polling for live updates. Backend dispatch through
openclaw is unaffected — the gateway is still the dispatch path; only
the browser's live event stream is off.
Add a local-only auto-approval worker for pending Control UI device requests so Connect succeeds after restart and new request IDs. Tune security scan messaging for Docker localhost topology and HTTPS-only flags to reduce false local warnings.
Make up/restart/down/status now run against MC_MODE and include OpenClaw when OPENCLAW_ENABLED=1, with compatibility aliases preserved. Update env examples, deployment docs, and integration plan notes to document the minimal operator workflow.
Project Telegram dmPolicy/allowlists/owner allowlists from .env/.env.openclaw into gateway and MC CLI state idempotently, while preserving legacy TELEGRAM_NUMERIC_USER_ID behavior. Add an env toggle to hide non-actionable doctor security info lines without masking real warnings/errors.
Drop MC_OPENCLAW_DOCTOR_HIDE_INFO plumbing so Mission Control always returns full OpenClaw doctor output, including informational security lines.
…ompose
Two compose bugs blocking sandbox skill execution:
1. Path-equivalence for state dir. Gateway runs in a container but asks
the host docker daemon to bind-mount ${state}/sandboxes/agent-X to
/workspace in each sandbox. With ./.openclaw-data:/home/node/.openclaw,
the source path the gateway passed was /home/node/... — nonexistent
on the host, so docker silently mounted an empty dir and skills /
AGENTS.md were invisible inside /workspace. Now state is also mounted
at the same absolute path the host has; legacy /home/node/.openclaw
alias preserved.
2. Drop ${VAR:-} declarations for TELEGRAM_*, OPENCLAW_GATEWAY_TOKEN,
OPENCLAW_TOOLS_PROFILE in environment blocks. Compose merges
environment ON TOP OF env_file, so empty fallbacks were blanking
real values coming from .env.openclaw whenever the top-level .env
(which compose interpolation reads) didn't define them.
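Bug 2 can be reproduced with a small model of the merge order. This is a
simplification of Compose's behaviour (only the `${VAR:-default}` form is
handled), and `interpolate` / `effectiveEnv` are hypothetical helpers
written for this illustration.

```typescript
type EnvMap = Record<string, string>;

// Expand "${VAR:-default}" against the interpolation environment
// (top-level .env / shell). Unset or empty values take the default.
function interpolate(template: string, interpolationEnv: EnvMap): string {
  return template.replace(/\$\{(\w+):-([^}]*)\}/g, (_match: string, key: string, fallback: string) => {
    const value = interpolationEnv[key];
    return value !== undefined && value !== "" ? value : fallback;
  });
}

// `environment` entries are applied on top of `env_file` entries.
function effectiveEnv(envFile: EnvMap, environment: EnvMap, interpolationEnv: EnvMap): EnvMap {
  const resolved: EnvMap = {};
  for (const [key, value] of Object.entries(environment)) {
    resolved[key] = interpolate(value, interpolationEnv);
  }
  return { ...envFile, ...resolved };
}

const envFromFile: EnvMap = { OPENCLAW_GATEWAY_TOKEN: "token-from-env-openclaw" };

// Before: the empty fallback blanks the env_file value.
const envBefore = effectiveEnv(envFromFile, { OPENCLAW_GATEWAY_TOKEN: "${OPENCLAW_GATEWAY_TOKEN:-}" }, {});
// After: without the declaration, the env_file value survives.
const envAfter = effectiveEnv(envFromFile, {}, {});
console.log(JSON.stringify(envBefore), JSON.stringify(envAfter));
```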
The proxy is a separate, independent microservice — its source lives in
its own repo (github.com/nnnet/gpu-coordinator-proxy) and is checked out
here as ./gpu-coordinator-proxy-src/, mirroring the openclaw-src/ pattern
we already use. Both clones are gitignored; only the compose entry that
references them lives in this tree.
The service fronts LMStudio (:1235) and Ollama (:11435) on separate ports
with a shared VRAM lock so a 20B-class model in one runtime doesn't
collide with a 20B in the other on a single GPU. Behaviour is fully
configurable via env (defaults are safest):
    GPU_AUTO_FREE_ENABLED   default 1             (master switch)
    GPU_FREE_STRATEGY       default spare-target  (keep target warm)
                            alternative wipe-all  (cold reload, gated)
    GPU_WIPE_ALL_ALLOWED    default 0             (safety gate for wipe-all)
    GPU_FREE_SETTLE_MS      default 800           (driver reclaim pause)
If the sibling clone is missing the service block can be commented out
without affecting anything else in the MC stack.
…anner
Mission Control's `mission-control-dev` container couldn't reach the host
docker daemon, so the in-MC `openclaw doctor` (called by
src/app/api/openclaw/doctor/route.ts via runOpenClaw) failed its
isDockerAvailable() check and emitted "Sandbox mode is enabled but Docker
is not available" — surfacing as a permanent warning banner in the UI even
though docker on the host was healthy and the gateway sandbox flow worked.
Container-side docker access:
- Add docker.io to Dockerfile.dev so `docker` is on PATH inside the dev
container; bind /var/run/docker.sock in docker-compose-dev.yml.
- Make uid 1000 reach the socket via `group_add` driven by a new
DOCKER_SOCKET_GID env, auto-detected by the Makefile via
`stat -c %g /var/run/docker.sock` (falls back to 994 for stock
Debian/Ubuntu hosts; manually overridable on Fedora/Arch/colima/Rancher
Desktop where the gid differs).
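The gid detection can be sketched in shell as follows. The Makefile's exact recipe is not reproduced here; `detect_docker_socket_gid` is a hypothetical helper that wraps the `stat -c %g` call named above (GNU coreutils syntax) and applies the 994 fallback, while an already-exported DOCKER_SOCKET_GID is left untouched as the manual override.

```shell
#!/usr/bin/env bash
# Print the group id that owns the docker socket, or a fallback when the
# socket is missing or stat fails.
#   $1 = socket path (default /var/run/docker.sock)
#   $2 = fallback gid (default 994, as described above)
detect_docker_socket_gid() {
  local sock="${1:-/var/run/docker.sock}"
  local fallback="${2:-994}"
  local gid
  if gid="$(stat -c %g "$sock" 2>/dev/null)" && [ -n "$gid" ]; then
    echo "$gid"
  else
    echo "$fallback"
  fi
}

# A value already present in the environment acts as the manual override.
DOCKER_SOCKET_GID="${DOCKER_SOCKET_GID:-$(detect_docker_socket_gid)}"
export DOCKER_SOCKET_GID
echo "DOCKER_SOCKET_GID=${DOCKER_SOCKET_GID}"
```

Compose can then read the exported value for `group_add`, so uid 1000 gains the socket's group without changing the socket's permissions on the host.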
Mask host's stale ~/.openclaw from the container view:
- openclaw doctor's findOtherStateDirs() scans /home/*/.openclaw and trips
"Multiple state directories detected" whenever the dev container's broad
${HOME}:${HOME}:rw bind exposes the host user's pre-fork .openclaw dir.
- Bind-mount an empty stub file (.docker-mask/openclaw-stub.empty) over
${HOME}/.openclaw so existsDir() returns false (the path is now a regular
file, not a directory). Tmpfs is unsuitable here — it would still appear
as a directory.
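Why a stub file works where tmpfs does not comes down to the directory test. As a shell analogue (the real check is openclaw's existsDir(); `exists_dir` below is a hypothetical stand-in for it), a path occupied by a regular file fails the `-d` test even though the path exists:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a "does this directory exist" check.
exists_dir() { [ -d "$1" ]; }

workdir="$(mktemp -d)"
mkdir "$workdir/real-state-dir"
: > "$workdir/openclaw-stub.empty"   # empty regular file, like the stub

if exists_dir "$workdir/real-state-dir"; then
  dir_result="directory"
else
  dir_result="not a directory"
fi

if exists_dir "$workdir/openclaw-stub.empty"; then
  stub_result="directory"
else
  stub_result="not a directory"
fi

echo "real-state-dir:      $dir_result"
echo "openclaw-stub.empty: $stub_result"
rm -rf "$workdir"
```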
Doctor-banner suppression in the openclaw CLI shim:
- openclaw doctor unconditionally prints `Run "openclaw doctor --fix" to
apply changes.` whenever shouldRepair is false, regardless of whether
there is anything actually fixable (see openclaw-src
flows/doctor-health-contributions.ts:580-582). The word "fix" trips MC's
parseOpenClawDoctorOutput mentionsWarnings regex.
- The Plugins panel always prints "Errors: 0" — the substring "error"
trips MC's level=error escalation regex.
- Both treated as upstream quirks. The shim (which already exists for
legacy CLI compat) now intercepts plain `doctor` invocations, drops the
footer line, and rewrites "Errors: 0" → "Errs: 0" only when the count
is zero. Real plugin error counts pass through untouched so a genuine
banner still surfaces if something breaks.
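The shim's source is not shown here. A minimal sed-based sketch of the two rewrites described above could look like this; `filter_doctor_output` is a hypothetical name, and the sample input lines are illustrative rather than captured doctor output.

```shell
#!/usr/bin/env bash
# Filter doctor output read from stdin:
#   1. drop the unconditional "--fix" footer line
#   2. rewrite "Errors: 0" to "Errs: 0" only when the count is exactly zero
filter_doctor_output() {
  sed -E \
    -e '/Run "openclaw doctor --fix" to apply changes\./d' \
    -e 's/Errors: 0([^0-9]|$)/Errs: 0\1/g'
}

printf '%s\n' \
  'Plugins' \
  'Errors: 0' \
  'Run "openclaw doctor --fix" to apply changes.' \
  | filter_doctor_output
```

The `([^0-9]|$)` guard keeps a non-zero count that merely starts with the digit 0 from matching, and counts such as `Errors: 3` or `Errors: 10` never match the pattern, so genuine failures still reach MC's escalation regex.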
Brew-enabled gateway + sandbox images (carried over from prior pending
work, finally committed):
- Dockerfile.openclaw.dockercli: gateway image now ships docker CLI +
Linuxbrew so skills.install RPC can `brew install <formula>` for skills
declaring brew deps.
- Dockerfile.openclaw.sandbox: overlays brew on the upstream
openclaw-sandbox:bookworm-slim base (kept separate from openclaw-src to
preserve read-only update flow via `make openclaw-update`).
Makefile UX simplification (589 lines → 124):
- Replace the legacy multi-mode lifecycle wrapper with a minimal docker
compose driver: `make`, `make build [SVC...]`, `make up [SVC...]`,
`make down`, `make logs`, `make ps`, `make clean`. Positional service
args via $(filter-out $(firstword $(MAKECMDGOALS)),$(MAKECMDGOALS)).
- MODE=dev|prod toggles which compose file pair is used.
- The pre-rewrite Makefile is preserved as Makefile.legacy for anyone who
still relies on the older recipe names.
Drop tracked .beads/dolt-monitor.pid.lock — it's a runtime lock file and
is already covered by .gitignore.
The deliver_notifications() and get_delivery_stats() helpers in
scripts/notification-daemon.sh issue HTTP requests against
/api/notifications/deliver and /api/notifications without any
Authorization header. Both endpoints use requireRole(request, 'operator')
(src/lib/auth.ts), which accepts the global API key via the
`Authorization: Bearer ...` or `x-api-key` header. Without it every poll
silently returns 401 and the daemon reports "delivery failed" without
explaining why.
Add an MC_API_KEY env var (with API_KEY as a fallback so a single .env
value works for both MC and this daemon), validate it before the
run/stats path, and pass it as a Bearer header on both the POST deliver
call and the GET stats call. Also fix the help text default URL
(3005 -> 3000) to match the script's actual MISSION_CONTROL_URL default.
Behavior change: the daemon now prints an actionable error and exits 1 if
MC_API_KEY is not set, instead of running batches that silently 401.
No new dependencies; no other behavior changes.
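A minimal sketch of the key resolution and the authenticated request, under stated assumptions: `resolve_api_key` and `deliver_notifications_sketch` are hypothetical names, the `http://localhost` host in the default URL is assumed (only port 3000 is confirmed above), the error wording is illustrative, and the request body is omitted because the daemon's payload is not shown. The sketch returns 1 where the daemon exits 1 so that it can be sourced safely.

```shell
#!/usr/bin/env bash
# Assumed default: only the port (3000) is stated in the commit message.
MISSION_CONTROL_URL="${MISSION_CONTROL_URL:-http://localhost:3000}"

# Resolve the key: MC_API_KEY first, then API_KEY as a fallback.
resolve_api_key() {
  local key="${MC_API_KEY:-${API_KEY:-}}"
  if [ -z "$key" ]; then
    echo "ERROR: MC_API_KEY (or API_KEY) is not set; requests would return 401." >&2
    return 1
  fi
  printf '%s' "$key"
}

# POST to the deliver endpoint with a Bearer header.
# Defined for illustration; nothing in this sketch calls it.
deliver_notifications_sketch() {
  local key
  key="$(resolve_api_key)" || return 1
  curl -sS -X POST \
    -H "Authorization: Bearer ${key}" \
    -H "Content-Type: application/json" \
    "${MISSION_CONTROL_URL}/api/notifications/deliver"
}
```

Resolving the key once, before any request is made, is what turns a loop of silent 401s into a single actionable failure at startup.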
0xbrainkid
left a comment
Thanks for fixing the notification daemon auth path; that specific shell change looks directionally right, and the targeted doctor test passes: pnpm vitest run src/lib/__tests__/openclaw-doctor.test.ts.
Blocking before merge: this PR contains a large amount of unrelated, machine-specific runtime/config material that should not be committed as part of a notification-daemon auth fix. In particular .opencode/opencode.json enables many third-party plugins and includes local absolute paths such as /home/uadmin/.local/bin/codebase-memory-mcp and /mnt/9/gt/.../node_modules/@playwright/mcp/cli.js. Committing that repo-level config will be broken for other contributors and may cause nondeterministic external plugin execution for anyone using opencode in this repo.
Please reduce this PR to the auth fix and directly related docs/tests, or move the opencode/beads/docker/demo additions into separate intentionally reviewed PRs with portable, opt-in config. Once the unrelated machine-local config is removed, I can re-review the notification-daemon change.
Note: full typecheck could not be used as a clean gate in this checkout because the worktree dependency set is missing unrelated deps/types (next-intl, xterm, node-pty, radix/cva), causing broad failures outside this PR.
Thanks — the But the title says "authenticate notification daemon requests" and the diff has 57 files / 6,450 added lines covering:
This PR is structurally three or four other PRs stacked together (and most of those overlap with #647, #648, #649 which I just reviewed/merged separately). Could you split this into:
Closing pending the focused split.
Summary
The notification daemon's curl invocations against `/api/notifications/deliver` and `/api/notifications` (stats) are issued without any `Authorization` header. Both endpoints use `requireRole(request, 'operator')` (`src/lib/auth.ts`), which accepts the global API key via `Authorization: Bearer …` or `x-api-key`. Without it every poll silently returns 401 and the script reports `Delivery failed: HTTP 401` with no hint about why.
Fix
- Read `MC_API_KEY` (with `API_KEY` fallback so a single `.env` value works for both MC and the daemon) from env.
- Print an actionable error and exit `1` if unset, instead of silently 401-ing in a loop.
- Send `Authorization: Bearer …` on both the POST `deliver` call and the GET stats call.
- Correct the help text default URL (`3005` → `3000`).
No new dependencies. No behavior changes outside the auth path.
Test plan
- `unset MC_API_KEY API_KEY; ./scripts/notification-daemon.sh` → exits 1 with the actionable error message.
- `MC_API_KEY=<wrong> ./scripts/notification-daemon.sh --dry-run` → 401 logged with response body (was silent in `2>/dev/null`).
- `MC_API_KEY=<right> ./scripts/notification-daemon.sh --dry-run` → 200 with dry-run report.
- `bash -n scripts/notification-daemon.sh` clean.