feat: direct tools — realtime agent owns read/write/edit/bash (RC)#459
Merged
Mobile previously parsed tool.progress events but dropped them on the
floor — users running a long tool only saw a spinner and elapsed time.
Now the chat screen accumulates streamed progress and the tool call row
renders it.
User-visible changes:
- While a tool is in-progress, the latest step/summary label appears as
a small italic caption next to the spinner.
- When the tool streams textDelta chunks, the collapsed row shows a
"▸ streaming…" affordance; tapping the row reveals a live tail of the
accumulated stream (auto-scrolls to the latest output).
- After completion, the streamed history stays available in the
expanded view so the user can scroll back.
- On error, the row auto-expands (existing behavior) and the streamed
output is shown above the final error — useful for debugging tools
that print stderr before crashing.
Implementation:
- ToolCallItem gains optional progressText (accumulated textDelta) and
progressStep (latest step or summary label) — session-scope only, no
persistence.
- useRealtime's onToolProgress callback now receives the full
{textDelta, step, summary} shape instead of just summary.
- ChatScreen wires onToolProgress to mutate the toolCalls Map: append
textDelta to progressText, overwrite progressStep with the latest
label.
Tool-agnostic: no rendering branches on tool name.
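The accumulation rule in the implementation notes above can be sketched as a small reducer. This is a minimal sketch, not the code from the PR: the `callId` and `name` fields and the `applyToolProgress` helper are hypothetical, and returning a fresh Map (rather than mutating in place) is an assumption made to suit React state updates.

```typescript
// Hypothetical shapes: only progressText, progressStep, textDelta, step and
// summary are named in the commit message; everything else is assumed.
interface ToolCallItem {
  callId: string;
  name: string;
  progressText?: string; // accumulated textDelta chunks
  progressStep?: string; // latest step or summary label
}

interface ToolProgress {
  textDelta?: string;
  step?: string;
  summary?: string;
}

function applyToolProgress(
  toolCalls: Map<string, ToolCallItem>,
  callId: string,
  progress: ToolProgress,
): Map<string, ToolCallItem> {
  const current = toolCalls.get(callId);
  if (!current) return toolCalls; // progress for an unknown call is ignored

  const label = progress.step ?? progress.summary;
  const next: ToolCallItem = {
    ...current,
    // Append streamed text; leave it untouched when no delta arrived.
    progressText:
      progress.textDelta !== undefined
        ? (current.progressText ?? "") + progress.textDelta
        : current.progressText,
    // Overwrite the caption only when this event carries a label.
    progressStep: label ?? current.progressStep,
  };
  return new Map(toolCalls).set(callId, next);
}
```

Because text and label are updated independently, a step caption that arrives before the first `textDelta` (or after it) does not erase the other field.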
Rework the inline tool-call row so it reads well for the existing
ask_brain flow and for any future tool added to the realtime session.
Changes are tool-agnostic; no special cases per tool name.
User-visible changes:
- Parameters render as a structured key/value list instead of one raw
JSON dump. String values with newlines become small code blocks; long
inline values get a per-row "more/less" toggle; non-object args fall
back to pretty JSON.
- A "blocking" vs "streaming" pill appears next to the tool name while
the call is in progress, driven by a new streaming flag on
ToolCallEntry (set the first time a textDelta arrives).
- During a streaming response the body is wrapped in a max-height
scroll container that auto-scrolls to the tail, so the latest output
stays visible instead of being buried below the fold.
- Errored calls open the response section and the "What went wrong"
upstream panel by default.
- The step caption and streamed text are guaranteed to coexist so
whichever event arrives first no longer hides the other.
Tools now declare a latencyClass — "fast" | "medium" | "slow" | "streaming" — instead of a single blocking boolean. "fast" and "medium" map to the legacy blocking path; "slow" and "streaming" to the placeholder + injectContext path. Existing tools mapped: echo_tool=fast, web_search=medium, ask_brain=slow. Pure refactor, no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
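The mapping from latency class to dispatch path described above can be written down in a few lines. This is a sketch under assumptions: the `DispatchPath` labels and the `dispatchPathFor` helper are hypothetical names, and only the four class names and the three tool assignments come from the commit message.

```typescript
type LatencyClass = "fast" | "medium" | "slow" | "streaming";

// Hypothetical labels for the two paths the commit message describes:
// "blocking" = legacy in-turn result, "async" = placeholder + injectContext.
type DispatchPath = "blocking" | "async";

function dispatchPathFor(latencyClass: LatencyClass): DispatchPath {
  switch (latencyClass) {
    case "fast":
    case "medium":
      return "blocking";
    case "slow":
    case "streaming":
      return "async";
  }
}

// Assignments stated in the commit message.
const toolLatency: Record<string, LatencyClass> = {
  echo_tool: "fast",
  web_search: "medium",
  ask_brain: "slow",
};
```

Since `fast`/`medium` and `slow`/`streaming` collapse onto the two pre-existing paths, the change stays a pure refactor while leaving room for adapters to treat the four classes differently later.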
Opt-in flag on SessionConfigEvent that gates the direct-tools experiment (read/write/edit/bash exposed to the realtime model, ask_brain removed from the tool list). Type-only at this stage — no behavior change until the tools and dispatch wiring land in later commits. Also lands the experiment spec alongside the implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds relay-server/src/workspace.ts with:
- ensureWorkspace(): creates ~/.voiceclaw/workspace/, memory/ subdir,
and a default AGENTS.md if missing.
- resolveInsideWorkspace(): canonicalizes a path against the workspace
root, realpathing the parent (writes) or the full path (reads) to
catch symlink escapes.
- verifyWrittenPathInside(): post-write TOCTOU verifier.
- loadRecentMemory(): returns existing memory/YYYY-MM-DD.md files for
today + previous N days, oldest-first.
- checkBashCommand(): day-one denylist (rm -r, sudo/doas,
curl|wget→sh, credential dirs, mkfs/dd/fdisk/mount).
VOICECLAW_WORKSPACE overrides the workspace root for tests. No tools or
session wiring yet; those land in later commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
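A denylist of the kind checkBashCommand() describes could look roughly like the sketch below. The regular expressions, the result shape, and the specific credential directories (`.ssh`, `.aws`, `.gnupg`) are all assumptions for illustration; the commit message lists only the categories, not the patterns.

```typescript
interface BashCheck {
  ok: boolean;
  reason?: string;
}

// Illustrative patterns only; a regex denylist is a first line of defence,
// not a sandbox, and these will both over- and under-match in places.
const DENY_RULES: Array<{ pattern: RegExp; reason: string }> = [
  { pattern: /\brm\s+(?:-\S+\s+)*-[a-zA-Z]*[rR]/, reason: "recursive rm" },
  { pattern: /\b(?:sudo|doas)\b/, reason: "privilege escalation" },
  { pattern: /\b(?:curl|wget)\b[^|]*\|\s*(?:ba|z)?sh\b/, reason: "pipe to shell" },
  // Assumed credential directories; the real list is not in the source.
  { pattern: /(?:~|\$HOME)\/\.(?:ssh|aws|gnupg)\b/, reason: "credential directory" },
  { pattern: /\b(?:mkfs|dd|fdisk|mount)\b/, reason: "disk or mount command" },
];

function checkBashCommand(command: string): BashCheck {
  for (const rule of DENY_RULES) {
    if (rule.pattern.test(command)) {
      return { ok: false, reason: rule.reason };
    }
  }
  return { ok: true };
}
```

Running the check before spawning means a rejected command never reaches the shell, and the reason string gives the model something concrete to report back to the user.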
Adds the read tool — line-numbered file reads, 1-indexed offset/limit, default 2000-line limit, per-line truncation at 2000 chars, total output capped at 100KB. Allowed anywhere on the machine (read-only). Registered with latencyClass=fast and gated on experimentalDirectTools; session dispatch routes it through the blocking path so the model receives the real file contents inside the turn. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
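The limits above (1-indexed offset, 2000-line default, 2000-char line truncation, 100KB cap) can be illustrated with a pure formatting function. This is a sketch, not the tool's code: `formatRead`, the tab-separated line-number format, and the choice to stop silently at the byte cap are assumptions.

```typescript
interface ReadOptions {
  offset?: number; // 1-indexed first line to return
  limit?: number; // maximum number of lines
}

const DEFAULT_LIMIT = 2000;
const MAX_LINE_CHARS = 2000;
const MAX_OUTPUT_BYTES = 100 * 1024;

function formatRead(content: string, opts: ReadOptions = {}): string {
  const offset = Math.max(1, opts.offset ?? 1);
  const limit = Math.max(0, opts.limit ?? DEFAULT_LIMIT);
  const lines = content.split("\n");
  const encoder = new TextEncoder();
  const out: string[] = [];
  let bytes = 0;

  for (let i = offset - 1; i < lines.length && out.length < limit; i++) {
    const raw = lines[i] ?? "";
    // Per-line truncation keeps one minified line from eating the budget.
    const text = raw.length > MAX_LINE_CHARS ? raw.slice(0, MAX_LINE_CHARS) : raw;
    const row = `${i + 1}\t${text}`; // assumed "number<TAB>text" layout
    bytes += encoder.encode(row).length + 1; // +1 for the joining newline
    if (bytes > MAX_OUTPUT_BYTES) break; // total output cap
    out.push(row);
  }
  return out.join("\n");
}
```

Line numbers in the output stay absolute (they reflect the position in the file, not in the returned window), which lets the model ask for the next window by offset.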
Adds the write tool — workspace-scoped file writes with parent dir creation. Relative paths anchor to the workspace root; absolute paths must already point inside the workspace. Each write realpaths the parent before opening and the final file after writing so a freshly-installed escaping symlink can't shuttle the payload elsewhere (if post-write verification fails, the file is unlinked). Registered with latencyClass=fast and gated on experimentalDirectTools. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
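The pre-write and post-write realpath checks can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `writeInsideWorkspace`, `isInside`, `nearestExistingDir`, and `makeTempWorkspace` are hypothetical helpers, and the sketch realpaths the nearest existing ancestor so that it never creates directories through an escaping symlink.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// True when target is root itself or a descendant of it (lexical check on
// already-canonical paths).
function isInside(root: string, target: string): boolean {
  const rel = path.relative(root, target);
  if (rel === "") return true;
  return rel !== ".." && !rel.startsWith(".." + path.sep) && !path.isAbsolute(rel);
}

// Walk up until a directory that exists is found.
function nearestExistingDir(dir: string): string {
  let current = dir;
  while (!fs.existsSync(current)) current = path.dirname(current);
  return current;
}

function writeInsideWorkspace(root: string, requested: string, data: string): string {
  const realRoot = fs.realpathSync(root);
  const target = path.resolve(realRoot, requested); // relative paths anchor to the root
  const parent = path.dirname(target);

  // Pre-write: the deepest existing ancestor must canonicalize inside the root.
  const anchor = fs.realpathSync(nearestExistingDir(parent));
  if (!isInside(realRoot, anchor)) throw new Error(`path escapes workspace: ${requested}`);
  fs.mkdirSync(parent, { recursive: true });

  fs.writeFileSync(target, data);

  // Post-write: re-verify, and unlink if the file landed elsewhere.
  const written = fs.realpathSync(target);
  if (!isInside(realRoot, written)) {
    fs.unlinkSync(target);
    throw new Error(`written path escaped workspace: ${requested}`);
  }
  return written;
}

// Helper for trying the sketch out in a throwaway directory.
function makeTempWorkspace(): string {
  return fs.mkdtempSync(path.join(os.tmpdir(), "ws-"));
}
```

A call such as `writeInsideWorkspace(makeTempWorkspace(), "notes/today.md", "hi")` creates `notes/` and returns the canonical path, while `"../escape.txt"` or a path through a symlink that points outside the root throws before anything is written.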
Adds the edit tool: exact-string find/replace on workspace files,
mirroring pi-mono / Claude Code Edit semantics:
- old_string must be present verbatim (no fuzzy matching at day one).
- old_string must be unique unless replace_all=true; otherwise the call
errors with the occurrence count so the model can add context.
- Empty old_string and old_string==new_string are explicit errors.
- File contents (including trailing-newline state) are preserved
naturally by literal string replacement.
- Path scope reuses the workspace realpath check; on a verification
failure post-write, the original contents are restored.
Registered with latencyClass=fast and gated on experimentalDirectTools.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
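The string-level rules above fit in a small pure function. This is a sketch of the semantics only, with hypothetical names (`applyEdit`, `countOccurrences`, camelCase arguments) and invented error wording; file I/O and the workspace path check are left out.

```typescript
interface EditArgs {
  oldString: string;
  newString: string;
  replaceAll?: boolean;
}

// Counts non-overlapping verbatim occurrences.
function countOccurrences(haystack: string, needle: string): number {
  let count = 0;
  let from = 0;
  for (;;) {
    const at = haystack.indexOf(needle, from);
    if (at === -1) return count;
    count++;
    from = at + needle.length;
  }
}

function applyEdit(content: string, args: EditArgs): string {
  const { oldString, newString, replaceAll = false } = args;
  if (oldString === "") throw new Error("old_string must not be empty");
  if (oldString === newString) throw new Error("old_string and new_string are identical");

  const count = countOccurrences(content, oldString);
  if (count === 0) throw new Error("old_string not found");
  if (count > 1 && !replaceAll) {
    // Reporting the count lets the caller add surrounding context and retry.
    throw new Error(`old_string occurs ${count} times; add context or set replace_all`);
  }

  // split/join is a literal replacement: unlike String.prototype.replace it
  // does not interpret "$&"-style patterns inside newString.
  return content.split(oldString).join(newString);
}
```

Because the replacement is literal, bytes outside the matched span (including a trailing newline or its absence) pass through unchanged.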
Adds the bash tool: runs a shell command, streams stdout/stderr back as
tool.progress.textDelta, returns exit code + tail in the final result.
Day-one safety:
- Pre-spawn denylist (rm -r, sudo/doas, pipe-to-shell, credential dirs,
mkfs/dd/fdisk/mount). Lifted into workspace.checkBashCommand.
- Workspace root as default cwd.
- Per-stream tail cap (16 KB) so a `gh pr diff` on a huge PR can't
flood the model's context. Older bytes are dropped with a "truncated"
flag in the result.
- Timeout: 30s default, 120s hard cap. SIGTERM then SIGKILL.
- External abort signal (session shutdown) kills the child.
Registered with latencyClass=streaming; dispatch goes through the async
path: immediate placeholder, stream progress, injectContext on
completion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
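Two of the safety limits above, the per-stream tail cap and the timeout clamp, are simple enough to sketch in isolation. `TailBuffer` and `clampTimeout` are hypothetical names, and treating a missing or non-positive timeout as the default is an assumption; process spawning and signal handling are not shown.

```typescript
const TAIL_CAP_BYTES = 16 * 1024;
const DEFAULT_TIMEOUT_MS = 30 * 1000;
const MAX_TIMEOUT_MS = 120 * 1000;

// Keeps only the most recent `cap` bytes of a stream.
class TailBuffer {
  private bytes: Uint8Array = new Uint8Array(0);
  private readonly cap: number;
  truncated = false;

  constructor(cap: number = TAIL_CAP_BYTES) {
    this.cap = cap;
  }

  append(chunk: Uint8Array): void {
    const merged = new Uint8Array(this.bytes.length + chunk.length);
    merged.set(this.bytes, 0);
    merged.set(chunk, this.bytes.length);
    if (merged.length > this.cap) {
      this.truncated = true; // older bytes are being dropped
      this.bytes = merged.slice(merged.length - this.cap);
    } else {
      this.bytes = merged;
    }
  }

  // Note: a cut can land mid-character; TextDecoder substitutes U+FFFD there.
  toString(): string {
    return new TextDecoder().decode(this.bytes);
  }
}

function clampTimeout(requestedMs?: number): number {
  if (requestedMs === undefined || !Number.isFinite(requestedMs) || requestedMs <= 0) {
    return DEFAULT_TIMEOUT_MS;
  }
  return Math.min(requestedMs, MAX_TIMEOUT_MS);
}
```

Keeping the tail rather than the head fits how the output is used: the last lines of a failing command (the error and exit context) are usually the ones the model needs.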
When experimentalDirectTools is true, buildInstructions now appends:
- A direct-tools preamble naming the four tools, the streaming
semantics for bash, and the delegate-to-imperative-agent guidance for
multi-step work.
- A "Workspace context (preloaded)" section containing AGENTS.md plus
today's + last 7 days' memory files. Memory absence renders an inline
stub so the model knows the file doesn't exist yet.
Session.handleSessionConfig calls ensureWorkspace() before
buildInstructions when the flag is on, so the first session on a fresh
machine still gets a default AGENTS.md and empty memory dir.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
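The rolling memory window and the inline stub for missing files can be sketched like this. The function names, the stub wording, the section heading format, and the use of UTC dates are assumptions; the source states only the `memory/YYYY-MM-DD.md` naming, the "today + last 7 days" window, and that absence renders a stub.

```typescript
// Candidate memory file names for today and the previous N days, oldest first.
// Assumption: dates are taken in UTC; the real code may use local time.
function memoryWindow(today: Date, previousDays = 7): string[] {
  const names: string[] = [];
  for (let back = previousDays; back >= 0; back--) {
    const day = new Date(
      Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate() - back),
    );
    names.push(`memory/${day.toISOString().slice(0, 10)}.md`);
  }
  return names;
}

// Renders each file, or a stub when `read` reports the file as absent.
function renderMemory(
  names: string[],
  read: (name: string) => string | undefined,
): string {
  return names
    .map((name) => {
      const body = read(name);
      // Hypothetical stub text: tells the model the file has not been created.
      return body === undefined ? `## ${name}\n(no file yet)` : `## ${name}\n${body}`;
    })
    .join("\n\n");
}
```

Passing `read` in as a function keeps the date arithmetic and rendering testable without touching the filesystem.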
Flag-on exposes exactly the five experiment tools — read, write, edit, bash, web_search (when a Tavily key is set) — alongside the test echo_tool. The ask_brain tool is dropped because in direct-tools mode the realtime model handles memory and files itself; pointing the model at ask_brain would actively mislead it. Instructions match: BRAIN_INTRO / BRAIN_ASYNC_RULES / BRAIN_MEMORY_RULES are skipped when the flag is on (the agent identity is still loaded). Flag-off remains byte-identical to today's behavior. Adds an integration test that asserts both the flag-on tool list and the flag-off equivalence to baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds OpenAI Realtime as a third voice provider option in onboarding step 04, between Gemini Live and Grok Voice. The 'openai' provider id already existed in the ProviderId union and validateProviderKey() already handled it via GET https://api.openai.com/v1/models, so only the UI card was missing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With EXPERIMENTAL_DIRECT_TOOLS on, the realtime model gets read/write/edit/bash/web_search directly, so the standalone brain picker has no meaning. Filter the brain step out of the wizard sequence and propagate the visible step number + total to each step so the counter and eyebrow renumber accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…test
The smoke test was returning "Empty response from Gemini." because
gemini-2.5-flash reasons by default and consumed the 64-token output
budget on hidden thinking tokens before emitting any visible text. Set
thinkingConfig.thinkingBudget=0, bump maxOutputTokens to 256, and
include the candidate finishReason or promptFeedback.blockReason in the
empty-response error so future failures point at the real cause.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
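The fix described above has two parts: a generation config that disables thinking and raises the output budget, and an empty-response error that names the cause. A minimal sketch follows; the `GeminiResponse` type covers only the fields mentioned, and `extractText` plus the exact error wording are assumptions rather than the code in the PR.

```typescript
// Subset of the generateContent response that the smoke test inspects.
interface GeminiResponse {
  candidates?: Array<{
    finishReason?: string;
    content?: { parts?: Array<{ text?: string }> };
  }>;
  promptFeedback?: { blockReason?: string };
}

// Values stated in the commit message.
const generationConfig = {
  maxOutputTokens: 256,
  thinkingConfig: { thinkingBudget: 0 },
};

function extractText(resp: GeminiResponse): string {
  const candidate = resp.candidates?.[0];
  const text = (candidate?.content?.parts ?? [])
    .map((part) => part.text ?? "")
    .join("")
    .trim();
  if (text) return text;

  // Surface why nothing came back instead of a bare "empty response".
  const cause = candidate?.finishReason ?? resp.promptFeedback?.blockReason ?? "unknown";
  throw new Error(`Empty response from Gemini (${cause}).`);
}
```

With the cause included, a budget exhaustion (a `MAX_TOKENS` finish reason) and a blocked prompt produce distinguishable errors.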
Direct tools (read/write/edit/bash/web_search) are now always-on. ask_brain removed from tool list. Settings Connection panel hidden. Workspace + memory preload run unconditionally. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Lets the user enter or update Gemini / OpenAI / xAI realtime keys directly from Settings without re-running onboarding. Each row validates against the upstream provider and persists via the existing Keychain-backed provider-keys vault. Surfaces "configured" state from listConfigured() but never reads the stored secret back into the UI. Leaves a disabled "Managed key — coming soon" row as a placement marker for the future hosted-key option. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Settings used to ask the user two separate questions on two separate cards: which model to use, and which keys to paste. Pasting keys for providers you'll never run was clutter; the connection between a model and its required key was implicit. The new Voice Model card answers one question — "pick a voice model, VoiceClaw shows whether it can run." Each model row carries provider + status; missing-key rows expose Add key (configured rows expose Manage) which expands a single inline key editor in place under the row. Saving an OpenAI key updates both GPT rows immediately. Selecting a keyless model is allowed and auto-opens that provider's editor with "<Provider> key required to use <Model>." The "Managed by VoiceClaw" option moves into a quieter "Key source" toggle at the bottom of the card. File: desktop/src/renderer/src/pages/SettingsPage.tsx Demo: pick GPT Realtime 2 → OpenAI editor pops open under the row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Desktop wrote IDENTITY.md into userData/openclaw/workspace, the relay
read it from BRAIN_WORKSPACE only when brainAgent was enabled, and the
direct-tools workspace lived at ~/.voiceclaw/workspace: three paths, one
of which never coincided with the others, so the user's chosen agent
name never reached the realtime model in direct-tools mode.
Collapse onto a single workspace root. The relay now seeds IDENTITY.md,
SOUL.md, FACTS.md, and AGENTS.md on first run, loads identity + soul
unconditionally (no brainAgent gate, no BRAIN_WORKSPACE), and preloads
FACTS.md into the system prompt alongside the rolling memory window.
AGENTS.md now teaches the model the two-layer memory model: durable
facts go in FACTS.md (always loaded), temporal notes in
memory/YYYY-MM-DD.md (7-day window).
Desktop writes to homedir()/.voiceclaw/workspace and drops a default
SOUL.md next to IDENTITY.md when none exists. VOICECLAW_WORKSPACE is
forwarded to the bundled relay so packaged builds stay in sync.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server-side tool calls (read/write/edit/bash/web_search) were intercepted and dispatched without forwarding the tool.call start event to the client, so the desktop/mobile transcript never created the tool-call row — only the later tool_call.completed arrived, for a callId the UI never saw. Forward the event before handling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Track scroll position; only re-anchor when the user is already at the bottom (40px threshold). Surface a small "Jump to latest" button when new content arrives while scrolled up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
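The re-anchoring rule above reduces to one comparison. In this sketch the `ScrollMetrics` shape and both function names are hypothetical, and the metrics are assumed to be sampled before the new content changes the scroll height.

```typescript
const BOTTOM_THRESHOLD_PX = 40;

interface ScrollMetrics {
  scrollTop: number; // pixels scrolled from the top
  clientHeight: number; // visible viewport height
  scrollHeight: number; // total content height
}

function isAtBottom(m: ScrollMetrics, threshold: number = BOTTOM_THRESHOLD_PX): boolean {
  const distanceFromBottom = m.scrollHeight - (m.scrollTop + m.clientHeight);
  return distanceFromBottom <= threshold;
}

// Decide what to do when new content arrives, using metrics sampled
// before the content was appended.
function onNewContent(m: ScrollMetrics): { reanchor: boolean; showJumpToLatest: boolean } {
  const pinned = isAtBottom(m);
  return { reanchor: pinned, showJumpToLatest: !pinned };
}
```

The threshold gives a little slack so that a user who is a few pixels short of the end still counts as following the conversation.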
Tool-call rows are now click-to-toggle: in-progress stays expanded so streaming output is visible, and completed rows auto-collapse to a one-line summary (name + status + clock time + duration). Header shows absolute HH:MM:SS next to the duration. Burst separators now show absolute clock time instead of "just now" / "N min ago" so the user can see exactly when each group landed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…elled
- Add EXPO_PUBLIC_REALTIME_SERVER_URL env override, defaulting to
ws://100.82.61.115:8080/ws so a phone on the tailnet reaches the
desktop relay (and the direct tools that run on it) without per-device
setup. Users can still type ws://localhost:8080/ws into Settings for
same-Mac dev.
- Parse the relay's tool.cancelled event in useRealtime, expose an
onToolCancelled(callIds) callback, and flip matching ToolCallRow items
from in-progress to cancelled so the spinner clears when Gemini drops a
tool mid-call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
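The cancellation handling in the second bullet amounts to a status flip on matching rows. This is a sketch with assumed names: the `ToolCallStatus` values other than in-progress and cancelled, the `ToolCallRow` shape, and `markCancelled` are hypothetical.

```typescript
// Hypothetical status set; the source names only "in-progress" and "cancelled".
type ToolCallStatus = "in-progress" | "completed" | "error" | "cancelled";

interface ToolCallRow {
  callId: string;
  status: ToolCallStatus;
}

// Flip only rows that are still in progress and named in the event,
// so a late tool.cancelled cannot overwrite a finished call.
function markCancelled(rows: ToolCallRow[], callIds: string[]): ToolCallRow[] {
  const cancelled = new Set(callIds);
  return rows.map((row): ToolCallRow =>
    row.status === "in-progress" && cancelled.has(row.callId)
      ? { ...row, status: "cancelled" }
      : row,
  );
}
```

Guarding on the current status is a design choice in this sketch, not something the commit message states.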
Walks through deploying this branch to a paired iPhone — the connected device, the xcode-select / DEVELOPER_DIR situation, the CocoaPods on Ruby 4 failure mode that blocked the local prebuild attempt, and the EAS/TestFlight fallback path with the credentials it needs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the first entry in `relay-server/workspace-defaults/skills/` — a step-by-step playbook for reviewing a job posting, drafting a tailored cover letter + tailoring notes, staging artifacts under `jobs/<slug>/`, and submitting via the browser. The voice agent reads it when the user asks for help on a job application. `ensureWorkspace()` now creates `skills/` and `jobs/` under the user's workspace and copies any packaged `*.md` from `workspace-defaults/skills/` that the user hasn't already customized. Seeding follows the same "never overwrite" semantics as AGENTS.md / IDENTITY.md. DEFAULT_AGENTS_MD gains a Skills section pointing at the new playbook so the agent learns the skill exists at session start without an extra tool call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xAI's beta dialect does not reliably emit response.done after a
placeholder tool result (e.g. {"status":"running"}) is returned and the
real result is later injected via conversation.item.create. The deferred
response.create that injectContext queues then strands, and the model
never speaks the injected result — the conversation stalls after the
placeholder bridge ("let me check…").
Add response.audio.done / response.output_audio.done as a backup flush
trigger, gated to the xAI sessionFormat so OpenAI GA continues to rely on
the canonical response.done lifecycle. Mirrors the OpenAI adapter's
existing defer-and-flush pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
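The defer-and-flush pattern with the xAI-only backup trigger can be sketched as below. The `SessionFormat` values, `shouldFlushDeferred`, and the `DeferredResponses` class are hypothetical; only the three event names and the gating on the xAI format come from the commit message.

```typescript
// Hypothetical format identifiers for the two dialects discussed above.
type SessionFormat = "openai-ga" | "xai";

function shouldFlushDeferred(eventType: string, format: SessionFormat): boolean {
  if (eventType === "response.done") return true; // canonical lifecycle
  if (format !== "xai") return false; // backup triggers are xAI-only
  return eventType === "response.audio.done" || eventType === "response.output_audio.done";
}

// Holds queued response.create sends until a flush trigger arrives.
class DeferredResponses {
  private pending: Array<() => void> = [];

  defer(send: () => void): void {
    this.pending.push(send);
  }

  // Returns how many deferred sends were flushed by this event.
  onEvent(eventType: string, format: SessionFormat): number {
    if (!shouldFlushDeferred(eventType, format)) return 0;
    const batch = this.pending;
    this.pending = []; // clear first so a later trigger cannot double-send
    for (const send of batch) send();
    return batch.length;
  }
}
```

Clearing the queue before sending means that when xAI does emit `response.done` after the audio event, the second trigger finds nothing left to flush.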
Mobile no longer has a brain-gateway URL; the user pairs via desktop QR. Distinguish 401 (pair required, cancels reconnect) from generic connection failures, and de-dupe both so auto-reconnect bursts don't spam the chat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pair-needed guidance
…g card
Chat tab: clear spinner + call state on 401 and generic terminal
failures; turn the call button into a destructive retry icon and show
an inline "Try again" banner. Reset dedupe guards and terminal-error
state on manual retry / new conversation.
Settings: new "Desktop pairing" card with Test (5s WS session.auth
probe → paired / unauthorized / unreachable), token tail readout (last
8 chars), and Forget pairing action.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the pre-pairing-era VoiceClaw Desktop card (URL/API key inputs,
Setup instructions, Connected/Test pill) — the Desktop pairing card now
owns connectivity. Keep model/voice/volume controls and rename the card
to Voice Mode.
Mirror the chat's session.auth handshake in the pairing Test: read the
stored URL and token fresh from SQLite, send {type,apiKey,deviceName},
resolve only on session.auth.ok, treat error 401 / close 1008 / 4401 as
pair-needed, and stretch the timeout to 7s so a healthy bridge round
trip does not false-negative. Surface the URL the probe used inline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
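The outcome rules for the pairing Test can be expressed as a pure classifier over what the probe observes. The `ProbeSignal` shape, the outcome labels, and `classifyProbe` are assumptions for illustration; the status codes, the `session.auth.ok` message type, and the 7s timeout are from the commit message.

```typescript
const PROBE_TIMEOUT_MS = 7000;

type ProbeOutcome = "paired" | "pair-needed" | "unreachable";

// Hypothetical normalized view of what the WebSocket probe saw.
type ProbeSignal =
  | { kind: "message"; type: string; code?: number }
  | { kind: "close"; code: number }
  | { kind: "timeout" };

// Returns undefined while the probe should keep waiting.
function classifyProbe(signal: ProbeSignal): ProbeOutcome | undefined {
  switch (signal.kind) {
    case "message":
      if (signal.type === "session.auth.ok") return "paired";
      if (signal.type === "error" && signal.code === 401) return "pair-needed";
      return undefined; // unrelated message: not a verdict
    case "close":
      return signal.code === 1008 || signal.code === 4401 ? "pair-needed" : "unreachable";
    case "timeout":
      return "unreachable";
  }
}
```

Resolving "paired" only on `session.auth.ok`, rather than on a successful socket open, is what keeps the Test from reporting success against a relay that accepts the connection but rejects the token.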
…ring test handshake
… file
Standalone `yarn dev` relays had no way to call the desktop's device-token
bridge: the URL + nonce were only injected by buildRelayEnv() on the
desktop-managed spawn path. When a developer ran `yarn dev` against an
already-running desktop, every paired-device session.auth rejected because
checkDeviceToken silently returned ok:false (no bridge env). Mobile saw
"Connection failed — try re-pairing" and the relay log showed
"session.auth failed" with no reason attached.
Desktop now writes <userData>/device-token-bridge.json on bridge start
(0600, removed on shutdown), and the relay reads env first, then falls
back to that discovery file. Both desktop-managed and standalone yarn-dev
relays can validate vcd_ tokens against the live desktop bridge with zero
env wiring.
Rejection logs now carry a token preview and a structured reason
(no-credential | master-key-mismatch-no-bridge | master-key-mismatch-token-unknown)
plus a human-readable description, so future "every auth 401s" incidents
are diagnosable from a single log line. Relay startup also prints the
bridge source ("env" / "discovery") or a loud warning when neither is
present.
Verified on the wire via scripts/repro-pairing-401.mjs:
Scenario A (master key) before=ok → after=ok
Scenario B (paired device token) before=401 → after=ok
Scenario C (garbage) before=401 → after=401
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…n bridge via discovery file
A standalone `yarn dev` relay can carry stale VOICECLAW_DEVICE_TOKEN_CHECK_URL and _NONCE in the developer's shell or .env from a previous launch. The old getBridgeConfig() returned the env source unconditionally when both were set, shadowing the live discovery file the running desktop had just written. Every paired-device session.auth POSTed the plaintext token to the dead URL, the bridge call timed out, checkDeviceToken returned ok:false, and mobile saw 401 "unauthorized" — even though the desktop was up and the token was valid. The discovery file is the live source of truth: the desktop writes it on bridge start and removes it on shutdown, and it contains the pid of the writer. Use it whenever its pid is alive (process.kill(pid, 0) succeeds or raises EPERM, both signaling a live process). Fall through to env vars only when no fresh discovery file exists, so the symmetric case — a stale discovery file left by a crashed desktop + a live env-injected URL — still works. scripts/repro-pairing-401-e2e.mjs exercises the full surface end-to-end against a real relay process: four launch configurations × three auth scenarios (master key, paired device token, garbage). The counterexample the codex reviewer flagged (group-3: stale env + live discovery) goes from 401 to ok; the symmetric counterexample (group-4: stale discovery + live env) stays ok via the env fallback. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
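The precedence rule described above (live discovery file first, env as fallback) can be sketched as follows. The `DiscoveryFile` and `BridgeConfig` field names and `pickBridgeConfig` are assumptions; the liveness probe uses Node's `process.kill(pid, 0)`, which sends no signal and only tests whether the process exists.

```typescript
// A pid is treated as alive when the probe succeeds or fails with EPERM
// (the process exists but belongs to another user).
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as { code?: string }).code === "EPERM";
  }
}

// Hypothetical shapes; the source says only that the file carries the
// writer's pid alongside the bridge URL and nonce.
interface DiscoveryFile {
  url: string;
  nonce: string;
  pid: number;
}

interface BridgeConfig {
  url: string;
  nonce: string;
  source: "env" | "discovery";
}

function pickBridgeConfig(
  discovery: DiscoveryFile | undefined,
  env: { url?: string; nonce?: string },
  alive: (pid: number) => boolean = isPidAlive,
): BridgeConfig | undefined {
  // A discovery file written by a live desktop is the source of truth.
  if (discovery && alive(discovery.pid)) {
    return { url: discovery.url, nonce: discovery.nonce, source: "discovery" };
  }
  // Otherwise fall back to env (covers a stale file from a crashed desktop).
  if (env.url && env.nonce) {
    return { url: env.url, nonce: env.nonce, source: "env" };
  }
  return undefined;
}
```

Injecting the `alive` check keeps the precedence logic testable without depending on real process ids.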
… in relay bridge lookup Counterexample to the previous fix: a `yarn dev` relay carrying stale VOICECLAW_DEVICE_TOKEN_CHECK_URL / _NONCE in its shell / .env would ignore the discovery file and POST every paired-device token to a dead URL, returning 401 to the mobile client. Discovery file now wins when its writer pid is alive. Env is the fallback path only — for desktop-spawned bundled relays whose env is known fresh, and for the symmetric "stale discovery file from a crashed desktop + live env" case. scripts/repro-pairing-401-e2e.mjs verifies all 12 scenarios (4 launch configurations × 3 auth scenarios) end-to-end against a real relay process. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops the prefix/suffix preview in favor of an 8-char sha256 fingerprint so failed-auth log lines stay correlatable without exposing any bytes of the rejected credential. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The sha256 fingerprint is a log correlator, not a password digest; it is never stored or used for verification. CodeQL's password-hash query can only see "apiKey -> sha256" so it flags the call as an unsalted password hash. Suppress with justification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n log fingerprint
Any reference to apiKey from a log call site keeps tripping CodeQL — clear-text-logging on the raw preview, insufficient-password-hash on the sha256 correlator. Suppression comments aren't honored by GitHub code-scanning. The reason code + describeRejectReason already give operators the actionable signal; drop the credential reference entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replace the `ask_brain` delegation pattern with direct tools on the realtime session. The voice model now owns `read`, `write`, `edit`, `bash`, and `web_search` directly: no brain gateway hop, no OpenClaw/Hermes dependency for the default path.
- Direct tools (`relay-server/src/tools/direct/`): read (fs), write (fs, workspace-scoped), edit (exact-string replace, workspace-scoped), bash (child_process.spawn, streamed stdout via tool.progress, denylist, timeout).
- `~/.voiceclaw/workspace/` with file-based memory (`memory/YYYY-MM-DD.md`) and `AGENTS.md` protocol. Preloaded into system prompt at session start (today + last 7 days).
- `latencyClass` replaces `blocking: boolean` on `RelayToolDefinition`: `fast | medium | slow | streaming` drives adapter behavior and instruction language.
- Bash denylist (`rm -rf`, `sudo`, pipe-to-shell), write/edit path-scoped to workspace root with realpath + symlink-escape check, bash output cap (16KB tail), timeout (30s default, 120s hard cap).
- `tool.progress` events end-to-end: progressText/progressStep on ToolCallItem, step caption, streaming affordance, tail-scrolling.
Spec
See `pi-mono-experiment-spec.md` in the repo root.
Test plan
- `cd relay-server && yarn test`: 221 tests pass, brain-e2e unchanged
- `cd desktop && yarn typecheck`: clean
- `yarn dev`, open desktop, start a call
- `read` tool call in transcript
- `bash` tool call with streamed output
- `~/.voiceclaw/workspace/memory/`
🤖 Generated with Claude Code