feat: direct tools — realtime agent owns read/write/edit/bash (RC) #459

Merged: yagudaev merged 85 commits into main from experiment/direct-tools-pi-mono on May 31, 2026
Conversation

@yagudaev

Owner

Summary

Replace the ask_brain delegation pattern with direct tools on the realtime session. The voice model now owns read, write, edit, bash, and web_search directly — no brain gateway hop, no OpenClaw/Hermes dependency for the default path.

  • 4 new tools (relay-server/src/tools/direct/) — read (fs), write (fs, workspace-scoped), edit (exact-string replace, workspace-scoped), bash (child_process.spawn, streamed stdout via tool.progress, denylist, timeout)
  • Workspace at ~/.voiceclaw/workspace/ with file-based memory (memory/YYYY-MM-DD.md) and AGENTS.md protocol. Preloaded into system prompt at session start (today + last 7 days).
  • latencyClass replaces blocking: boolean on RelayToolDefinition. Its value (fast | medium | slow | streaming) drives adapter behavior and instruction language.
  • Safety: bash denylist (rm -rf, sudo, pipe-to-shell), write/edit path-scoped to workspace root with realpath + symlink-escape check, bash output cap (16KB tail), timeout (30s default, 120s hard cap).
  • Desktop transcript improvements — structured key/value args (not raw JSON), blocking/streaming badge, auto-scrolling streaming tail, errors expanded by default, step + stream coexistence.
  • Mobile transcript improvements — wired the dropped tool.progress events end-to-end: progressText/progressStep on ToolCallItem, step caption, streaming affordance, tail-scrolling.
  • Onboarding: OpenAI Realtime added as 3rd provider, Brain step skipped (harness IS the brain), smoke test fixed (Gemini thinking budget), Settings Connection panel hidden.

Spec

See pi-mono-experiment-spec.md in the repo root.

Test plan

  • cd relay-server && yarn test — 221 tests pass, brain-e2e unchanged
  • cd desktop && yarn typecheck — clean
  • Start yarn dev, open desktop, start a call
  • "Read package.json and tell me the dev script" → visible read tool call in transcript
  • "Run ls" → visible bash tool call with streamed output
  • "Remember that we decided to ship X" → writes to ~/.voiceclaw/workspace/memory/
  • "What did we talk about yesterday?" → answered from preloaded memory, no tool call
  • Path traversal attempt → structured error, no execution
  • Denylisted bash command → structured error, no execution

🤖 Generated with Claude Code

yagudaev and others added 15 commits May 22, 2026 13:40
Mobile previously parsed tool.progress events but dropped them on the
floor — users running a long tool only saw a spinner and elapsed time.
Now the chat screen accumulates streamed progress and the tool call row
renders it.

User-visible changes:
- While a tool is in-progress, the latest step/summary label appears as
  a small italic caption next to the spinner.
- When the tool streams textDelta chunks, the collapsed row shows a
  "▸ streaming…" affordance; tapping the row reveals a live tail of the
  accumulated stream (auto-scrolls to the latest output).
- After completion, the streamed history stays available in the
  expanded view so the user can scroll back.
- On error, the row auto-expands (existing behavior) and the streamed
  output is shown above the final error — useful for debugging tools
  that print stderr before crashing.

Implementation:
- ToolCallItem gains optional progressText (accumulated textDelta) and
  progressStep (latest step or summary label) — session-scope only, no
  persistence.
- useRealtime's onToolProgress callback now receives the full
  {textDelta, step, summary} shape instead of just summary.
- ChatScreen wires onToolProgress to mutate the toolCalls Map: append
  textDelta to progressText, overwrite progressStep with the latest
  label.

Tool-agnostic: no rendering branches on tool name.
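
The progress wiring described above can be sketched as a small reducer. This is an illustration only: the `ToolCallItem` fields other than `progressText` and `progressStep`, the `applyToolProgress` name, and the choice to return a new Map rather than mutate in place are assumptions, not the code in this PR.

```typescript
// Hypothetical shapes; only progressText / progressStep come from the commit message.
interface ToolCallItem {
  callId: string;
  name: string;
  progressText?: string; // accumulated textDelta chunks
  progressStep?: string; // latest step or summary label
}

interface ToolProgress {
  textDelta?: string;
  step?: string;
  summary?: string;
}

// Append textDelta to progressText and overwrite progressStep with the latest label.
// Returns a new Map so a React-style state setter would see a changed reference.
function applyToolProgress(
  toolCalls: Map<string, ToolCallItem>,
  callId: string,
  progress: ToolProgress,
): Map<string, ToolCallItem> {
  const existing = toolCalls.get(callId);
  if (!existing) return toolCalls; // progress for an unknown call is ignored

  const next = new Map(toolCalls);
  next.set(callId, {
    ...existing,
    progressText: progress.textDelta
      ? (existing.progressText ?? "") + progress.textDelta
      : existing.progressText,
    progressStep: progress.step ?? progress.summary ?? existing.progressStep,
  });
  return next;
}

// Usage: two streamed chunks and one step label for the same call.
let calls = new Map<string, ToolCallItem>([["c1", { callId: "c1", name: "bash" }]]);
calls = applyToolProgress(calls, "c1", { textDelta: "line 1\n" });
calls = applyToolProgress(calls, "c1", { textDelta: "line 2\n", step: "running" });
```

Nothing in the reducer branches on the tool name, which matches the tool-agnostic goal stated above.
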
Rework the inline tool-call row so it reads well for the existing
ask_brain flow and for any future tool added to the realtime session.
Changes are tool-agnostic; no special cases per tool name.

User-visible changes:
- Parameters render as a structured key/value list instead of one raw
  JSON dump. String values with newlines become small code blocks; long
  inline values get a per-row "more/less" toggle; non-object args fall
  back to pretty JSON.
- A "blocking" vs "streaming" pill appears next to the tool name while
  the call is in progress, driven by a new streaming flag on
  ToolCallEntry (set the first time a textDelta arrives).
- During a streaming response the body is wrapped in a max-height
  scroll container that auto-scrolls to the tail, so the latest output
  stays visible instead of being buried below the fold.
- Errored calls open the response section and the "What went wrong"
  upstream panel by default.
- The step caption and streamed text are guaranteed to coexist so
  whichever event arrives first no longer hides the other.
Tools now declare a latencyClass — "fast" | "medium" | "slow" |
"streaming" — instead of a single blocking boolean. "fast" and "medium"
map to the legacy blocking path; "slow" and "streaming" to the
placeholder + injectContext path. Existing tools mapped: echo_tool=fast,
web_search=medium, ask_brain=slow. Pure refactor, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
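
As a rough sketch of the mapping this commit describes, the type and dispatch choice might look like the following. The `DispatchPath` labels and the `dispatchPathFor` helper are hypothetical names used for illustration; only the four class names and the three tool assignments come from the commit message.

```typescript
type LatencyClass = "fast" | "medium" | "slow" | "streaming";

// Hypothetical labels for the two paths named in the commit message.
type DispatchPath = "blocking" | "placeholder-inject";

// fast/medium keep the legacy blocking path; slow/streaming return a
// placeholder immediately and inject the real result later.
function dispatchPathFor(latencyClass: LatencyClass): DispatchPath {
  switch (latencyClass) {
    case "fast":
    case "medium":
      return "blocking";
    case "slow":
    case "streaming":
      return "placeholder-inject";
  }
}

// Existing tools as mapped by the commit.
const toolLatency: Record<"echo_tool" | "web_search" | "ask_brain", LatencyClass> = {
  echo_tool: "fast",
  web_search: "medium",
  ask_brain: "slow",
};
```
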
Opt-in flag on SessionConfigEvent that gates the direct-tools experiment
(read/write/edit/bash exposed to the realtime model, ask_brain removed
from the tool list). Type-only at this stage — no behavior change until
the tools and dispatch wiring land in later commits.

Also lands the experiment spec alongside the implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds relay-server/src/workspace.ts with:
- ensureWorkspace(): creates ~/.voiceclaw/workspace/, memory/ subdir,
  and a default AGENTS.md if missing.
- resolveInsideWorkspace(): canonicalizes a path against the workspace
  root, realpathing the parent (writes) or the full path (reads) to
  catch symlink escapes.
- verifyWrittenPathInside(): post-write TOCTOU verifier.
- loadRecentMemory(): returns existing memory/YYYY-MM-DD.md files for
  today + previous N days, oldest-first.
- checkBashCommand(): day-one denylist (rm -r, sudo/doas,
  curl|wget→sh, credential dirs, mkfs/dd/fdisk/mount).

VOICECLAW_WORKSPACE overrides the workspace root for tests. No tools
or session wiring yet — those land in later commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
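
To make the "today + previous N days, oldest-first" window concrete, here is a minimal sketch that only computes the candidate file names. The `recentMemoryFileNames` name is hypothetical, the sketch uses UTC dates as a simplifying assumption (the real loader may use local time), and filtering down to files that actually exist is left to the caller.

```typescript
// Build memory/YYYY-MM-DD.md names for today plus the previous N days, oldest first.
function recentMemoryFileNames(today: Date, previousDays: number): string[] {
  const names: string[] = [];
  for (let back = previousDays; back >= 0; back--) {
    // Date.UTC normalizes out-of-range day values, so month boundaries roll over.
    const day = new Date(
      Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate() - back),
    );
    names.push(`memory/${day.toISOString().slice(0, 10)}.md`);
  }
  return names;
}

// Usage: a two-day lookback from June 1, 2026 (UTC).
const windowNames = recentMemoryFileNames(new Date(Date.UTC(2026, 5, 1)), 2);
```
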
Adds the read tool — line-numbered file reads, 1-indexed offset/limit,
default 2000-line limit, per-line truncation at 2000 chars, total
output capped at 100KB. Allowed anywhere on the machine (read-only).

Registered with latencyClass=fast and gated on experimentalDirectTools;
session dispatch routes it through the blocking path so the model
receives the real file contents inside the turn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
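
The read limits described in this commit can be illustrated with a pure formatting function. This is a sketch under assumptions: the `formatRead` name, the tab separator, and the ellipsis marker are invented here, and the 100KB cap is approximated by character count rather than encoded bytes.

```typescript
interface ReadOptions {
  offset?: number; // 1-indexed first line to return
  limit?: number; // maximum number of lines
}

const DEFAULT_LINE_LIMIT = 2000;
const MAX_LINE_CHARS = 2000;
const MAX_OUTPUT_CHARS = 100 * 1024; // stands in for the 100KB output cap

function formatRead(contents: string, options: ReadOptions = {}): string {
  const offset = Math.max(1, options.offset ?? 1);
  const limit = Math.max(0, options.limit ?? DEFAULT_LINE_LIMIT);
  const lines = contents.split("\n");
  const rows: string[] = [];
  let used = 0;

  for (let i = offset - 1; i < lines.length && rows.length < limit; i++) {
    const raw = lines[i] ?? "";
    // Truncate very long lines so one line cannot dominate the output.
    const text = raw.length > MAX_LINE_CHARS ? raw.slice(0, MAX_LINE_CHARS) + "…" : raw;
    const row = `${i + 1}\t${text}`;
    used += row.length + 1; // +1 for the joining newline
    if (used > MAX_OUTPUT_CHARS) break; // stop before exceeding the total cap
    rows.push(row);
  }
  return rows.join("\n");
}

// Usage: read only the second line of a three-line file.
const secondLine = formatRead("alpha\nbeta\ngamma", { offset: 2, limit: 1 });
```
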
Adds the write tool — workspace-scoped file writes with parent dir
creation. Relative paths anchor to the workspace root; absolute paths
must already point inside the workspace. Each write realpaths the
parent before opening and the final file after writing so a
freshly-installed escaping symlink can't shuttle the payload elsewhere
(if post-write verification fails, the file is unlinked).

Registered with latencyClass=fast and gated on experimentalDirectTools.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
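
The pre-write half of that check might look roughly like this. It is a sketch, not the PR's `resolveInsideWorkspace`: the signature is assumed, the parent directory is assumed to exist already, and the post-write verification and unlink step are omitted.

```typescript
import { mkdirSync, mkdtempSync, realpathSync, symlinkSync } from "node:fs";
import { tmpdir } from "node:os";
import * as path from "node:path";

// Resolve `requested` against the workspace root and reject anything whose
// real parent directory lies outside the root (for example via a symlink).
function resolveInsideWorkspace(root: string, requested: string): string {
  const realRoot = realpathSync(root);
  const candidate = path.resolve(realRoot, requested);
  // The target file may not exist yet, so canonicalize its parent instead.
  const realParent = realpathSync(path.dirname(candidate));
  const resolved = path.join(realParent, path.basename(candidate));
  const rel = path.relative(realRoot, resolved);
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`path escapes workspace: ${requested}`);
  }
  return resolved;
}

// Usage: a throwaway workspace with one real subdirectory and one escaping symlink.
const root = mkdtempSync(path.join(tmpdir(), "workspace-"));
mkdirSync(path.join(root, "memory"));
symlinkSync(tmpdir(), path.join(root, "escape"));

const safe = resolveInsideWorkspace(root, "memory/2026-05-31.md");
```

Calling `resolveInsideWorkspace(root, "escape/payload.txt")` or `resolveInsideWorkspace(root, "../outside.txt")` throws, because the canonical parent is no longer under the canonical root.
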
Adds the edit tool — exact-string find/replace on workspace files,
mirroring pi-mono / Claude Code Edit semantics:

- old_string must be present verbatim (no fuzzy matching at day one).
- old_string must be unique unless replace_all=true; otherwise the
  call errors with the occurrence count so the model can add context.
- Empty old_string and old_string==new_string are explicit errors.
- File contents (including trailing-newline state) are preserved
  naturally by literal string replacement.
- Path scope reuses the workspace realpath check; on a verification
  failure post-write, the original contents are restored.

Registered with latencyClass=fast and gated on experimentalDirectTools.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
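
The string-level rules listed in this commit fit in a few lines. The sketch below covers only the replacement logic; the `applyEdit` name and the error wording are assumptions, and path scoping and restore-on-failure are left out.

```typescript
// Exact-string find/replace with the uniqueness rule described above.
function applyEdit(
  contents: string,
  oldString: string,
  newString: string,
  replaceAll = false,
): string {
  if (oldString === "") throw new Error("old_string must not be empty");
  if (oldString === newString) throw new Error("old_string and new_string are identical");

  // split() treats oldString literally, so regex and "$" patterns are never interpreted.
  const pieces = contents.split(oldString);
  const count = pieces.length - 1;
  if (count === 0) throw new Error("old_string not found");
  if (count > 1 && !replaceAll) {
    throw new Error(`old_string occurs ${count} times; add context or set replace_all`);
  }
  return pieces.join(newString);
}

// Usage: a unique match, then an explicit replace_all.
const single = applyEdit("port = 8080\n", "8080", "9090");
const all = applyEdit("a b a", "a", "x", true);
```

Because the replacement is a literal split and join, everything outside the matched text, including a trailing newline, passes through unchanged.
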
Adds the bash tool — runs a shell command, streams stdout/stderr back
as tool.progress.textDelta, returns exit code + tail in the final
result.

Day-one safety:
- Pre-spawn denylist (rm -r, sudo/doas, pipe-to-shell, credential
  dirs, mkfs/dd/fdisk/mount). Lifted into workspace.checkBashCommand.
- Workspace root as default cwd.
- Per-stream tail cap (16 KB) so a `gh pr diff` on a huge PR can't
  flood the model's context. Older bytes are dropped with a
  "truncated" flag in the result.
- Timeout: 30s default, 120s hard cap. SIGTERM then SIGKILL.
- External abort signal (session shutdown) kills the child.

Registered with latencyClass=streaming; dispatch goes through the
async path: immediate placeholder, stream progress, injectContext on
completion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
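
Two of the safety rules above, the per-stream tail cap and the timeout clamp, can be sketched independently of process spawning. The `TailBuffer` and `clampTimeoutMs` names are hypothetical, and the cap is measured in characters here as an approximation of the 16 KB limit.

```typescript
const TAIL_CAP_CHARS = 16 * 1024;
const DEFAULT_TIMEOUT_MS = 30_000;
const HARD_TIMEOUT_MS = 120_000;

// Keeps only the most recent `cap` characters of a stream and records
// whether older output was dropped.
class TailBuffer {
  private text = "";
  private readonly cap: number;
  truncated = false;

  constructor(cap: number = TAIL_CAP_CHARS) {
    this.cap = cap;
  }

  push(chunk: string): void {
    this.text += chunk;
    if (this.text.length > this.cap) {
      this.text = this.text.slice(-this.cap);
      this.truncated = true;
    }
  }

  get value(): string {
    return this.text;
  }
}

// Missing or non-positive requests fall back to the default; large ones hit the hard cap.
function clampTimeoutMs(requestedMs?: number): number {
  if (requestedMs === undefined || !(requestedMs > 0)) return DEFAULT_TIMEOUT_MS;
  return Math.min(requestedMs, HARD_TIMEOUT_MS);
}

// Usage: a tiny cap makes the truncation visible.
const tail = new TailBuffer(4);
tail.push("abc");
tail.push("def");
```
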
When experimentalDirectTools is true, buildInstructions now appends:
- A direct-tools preamble naming the four tools, the streaming
  semantics for bash, and the delegate-to-imperative-agent guidance
  for multi-step work.
- A "Workspace context (preloaded)" section containing AGENTS.md plus
  today's + last 7 days' memory files. Memory absence renders an
  inline stub so the model knows the file doesn't exist yet.

Session.handleSessionConfig calls ensureWorkspace() before
buildInstructions when the flag is on, so the first session on a
fresh machine still gets a default AGENTS.md and empty memory dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flag-on exposes exactly the five experiment tools — read, write,
edit, bash, web_search (when a Tavily key is set) — alongside the
test echo_tool. The ask_brain tool is dropped because in direct-tools
mode the realtime model handles memory and files itself; pointing
the model at ask_brain would actively mislead it.

Instructions match: BRAIN_INTRO / BRAIN_ASYNC_RULES /
BRAIN_MEMORY_RULES are skipped when the flag is on (the agent
identity is still loaded). Flag-off remains byte-identical to
today's behavior.

Adds an integration test that asserts both the flag-on tool list and
the flag-off equivalence to baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds OpenAI Realtime as a third voice provider option in onboarding
step 04, between Gemini Live and Grok Voice. The 'openai' provider
id already existed in the ProviderId union and validateProviderKey()
already handled it via GET https://api.openai.com/v1/models, so only
the UI card was missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With EXPERIMENTAL_DIRECT_TOOLS on, the realtime model gets read/write/edit/
bash/web_search directly, so the standalone brain picker has no meaning.
Filter the brain step out of the wizard sequence and propagate the visible
step number + total to each step so the counter and eyebrow renumber
accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…test

The smoke test was returning "Empty response from Gemini." because
gemini-2.5-flash reasons by default and consumed the 64-token output
budget on hidden thinking tokens before emitting any visible text. Set
thinkingConfig.thinkingBudget=0, bump maxOutputTokens to 256, and
include the candidate finishReason or promptFeedback.blockReason in the
empty-response error so future failures point at the real cause.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
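
For reference, the fix described in this commit corresponds to a request body and error path along these lines. The prompt text, the `GeminiResponse` subset, and the `extractSmokeTestText` helper are assumptions for illustration; only the two config values and the two reason fields come from the commit message.

```typescript
// Subset of the generateContent response that the smoke test needs.
interface GeminiResponse {
  candidates?: Array<{
    finishReason?: string;
    content?: { parts?: Array<{ text?: string }> };
  }>;
  promptFeedback?: { blockReason?: string };
}

// Disable hidden thinking and leave room for a short visible answer.
const smokeTestRequestBody = {
  contents: [{ role: "user", parts: [{ text: "Reply with the single word OK." }] }],
  generationConfig: {
    maxOutputTokens: 256,
    thinkingConfig: { thinkingBudget: 0 },
  },
};

// Return the visible text, or throw an error that names the likely cause.
function extractSmokeTestText(response: GeminiResponse): string {
  const candidate = response.candidates?.[0];
  const text = (candidate?.content?.parts ?? [])
    .map((part) => part.text ?? "")
    .join("")
    .trim();
  if (text) return text;

  const reason =
    candidate?.finishReason ?? response.promptFeedback?.blockReason ?? "unknown";
  throw new Error(`Empty response from Gemini (reason: ${reason}).`);
}
```
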
Direct tools (read/write/edit/bash/web_search) are now always-on.
ask_brain removed from tool list. Settings Connection panel hidden.
Workspace + memory preload run unconditionally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@vercel

vercel Bot commented May 27, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| voiceclaw | Ready | Preview, Comment | May 31, 2026 5:42pm |

yagudaev and others added 4 commits May 28, 2026 01:37
Lets the user enter or update Gemini / OpenAI / xAI realtime keys
directly from Settings without re-running onboarding. Each row
validates against the upstream provider and persists via the
existing Keychain-backed provider-keys vault.

Surfaces "configured" state from listConfigured() but never reads
the stored secret back into the UI. Leaves a disabled "Managed
key — coming soon" row as a placement marker for the future
hosted-key option.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Settings used to ask the user two separate questions on two separate
cards: which model to use, and which keys to paste. Pasting keys for
providers you'll never run was clutter; the connection between a model
and its required key was implicit.

The new Voice Model card answers one question — "pick a voice model,
VoiceClaw shows whether it can run." Each model row carries provider +
status; missing-key rows expose Add key (configured rows expose Manage)
which expands a single inline key editor in place under the row.
Saving an OpenAI key updates both GPT rows immediately. Selecting a
keyless model is allowed and auto-opens that provider's editor with
"<Provider> key required to use <Model>." The "Managed by VoiceClaw"
option moves into a quieter "Key source" toggle at the bottom of the
card.

File: desktop/src/renderer/src/pages/SettingsPage.tsx
Demo: pick GPT Realtime 2 → OpenAI editor pops open under the row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Desktop wrote IDENTITY.md into userData/openclaw/workspace, the relay read
it from BRAIN_WORKSPACE only when brainAgent was enabled, and the
direct-tools workspace lived at ~/.voiceclaw/workspace. With three separate
paths that did not line up, the user's chosen agent name never reached the
realtime model in direct-tools mode.

Collapse onto a single workspace root. The relay now seeds IDENTITY.md,
SOUL.md, FACTS.md, and AGENTS.md on first run, loads identity + soul
unconditionally (no brainAgent gate, no BRAIN_WORKSPACE), and preloads
FACTS.md into the system prompt alongside the rolling memory window.
AGENTS.md now teaches the model the two-layer memory model: durable facts
go in FACTS.md (always loaded), temporal notes in memory/YYYY-MM-DD.md
(7-day window).

Desktop writes to homedir()/.voiceclaw/workspace and drops a default
SOUL.md next to IDENTITY.md when none exists. VOICECLAW_WORKSPACE is
forwarded to the bundled relay so packaged builds stay in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server-side tool calls (read/write/edit/bash/web_search) were
intercepted and dispatched without forwarding the tool.call start
event to the client, so the desktop/mobile transcript never created
the tool-call row — only the later tool_call.completed arrived, for a
callId the UI never saw. Forward the event before handling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
yagudaev and others added 8 commits May 28, 2026 12:56
Track scroll position; only re-anchor when the user is already at the
bottom (40px threshold). Surface a small "Jump to latest" button when
new content arrives while scrolled up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
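
The anchoring rule in this commit reduces to one comparison on the scroll container's metrics. A minimal sketch, assuming DOM-style `scrollTop` / `clientHeight` / `scrollHeight` values; the function names and the action labels are hypothetical.

```typescript
interface ScrollMetrics {
  scrollTop: number;
  clientHeight: number;
  scrollHeight: number;
}

const BOTTOM_THRESHOLD_PX = 40;

// True when the viewport is within the threshold of the end of the content.
function isAtBottom(metrics: ScrollMetrics): boolean {
  const distance = metrics.scrollHeight - metrics.scrollTop - metrics.clientHeight;
  return distance <= BOTTOM_THRESHOLD_PX;
}

// Decide what to do when new content arrives, based on the position before it landed.
function onNewContent(before: ScrollMetrics): "re-anchor" | "show-jump-button" {
  return isAtBottom(before) ? "re-anchor" : "show-jump-button";
}

// Usage: 30px from the end re-anchors; 300px from the end offers "Jump to latest".
const nearEnd = onNewContent({ scrollTop: 570, clientHeight: 400, scrollHeight: 1000 });
const scrolledUp = onNewContent({ scrollTop: 300, clientHeight: 400, scrollHeight: 1000 });
```
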
Tool-call rows are now click-to-toggle: in-progress stays expanded so
streaming output is visible, and completed rows auto-collapse to a
one-line summary (name + status + clock time + duration). Header shows
absolute HH:MM:SS next to the duration.

Burst separators now show absolute clock time instead of "just now" /
"N min ago" so the user can see exactly when each group landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…elled

- Add EXPO_PUBLIC_REALTIME_SERVER_URL env override, defaulting to
  ws://100.82.61.115:8080/ws so a phone on the tailnet reaches the
  desktop relay (and the direct tools that run on it) without
  per-device setup. Users can still type ws://localhost:8080/ws into
  Settings for same-Mac dev.
- Parse the relay's tool.cancelled event in useRealtime, expose an
  onToolCancelled(callIds) callback, and flip matching ToolCallRow
  items from in-progress to cancelled so the spinner clears when
  Gemini drops a tool mid-call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks through deploying this branch to a paired iPhone — the connected
device, the xcode-select / DEVELOPER_DIR situation, the CocoaPods on
Ruby 4 failure mode that blocked the local prebuild attempt, and the
EAS/TestFlight fallback path with the credentials it needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the first entry in `relay-server/workspace-defaults/skills/` — a
step-by-step playbook for reviewing a job posting, drafting a tailored
cover letter + tailoring notes, staging artifacts under `jobs/<slug>/`,
and submitting via the browser. The voice agent reads it when the user
asks for help on a job application.

`ensureWorkspace()` now creates `skills/` and `jobs/` under the user's
workspace and copies any packaged `*.md` from `workspace-defaults/skills/`
that the user hasn't already customized. Seeding follows the same
"never overwrite" semantics as AGENTS.md / IDENTITY.md.

DEFAULT_AGENTS_MD gains a Skills section pointing at the new playbook so
the agent learns the skill exists at session start without an extra tool
call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xAI's beta dialect does not reliably emit response.done after a
placeholder tool result (e.g. {"status":"running"}) is returned and the
real result is later injected via conversation.item.create. The deferred
response.create that injectContext queues then strands, and the model
never speaks the injected result — the conversation stalls after the
placeholder bridge ("let me check…").

Add response.audio.done / response.output_audio.done as a backup flush
trigger, gated to the xAI sessionFormat so OpenAI GA continues to rely on
the canonical response.done lifecycle. Mirrors the OpenAI adapter's
existing defer-and-flush pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yagudaev and others added 2 commits May 30, 2026 23:22
Mobile no longer has a brain-gateway URL; the user pairs via desktop QR.
Distinguish 401 (pair required, cancels reconnect) from generic connection
failures, and de-dupe both so auto-reconnect bursts don't spam the chat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yagudaev and others added 2 commits May 30, 2026 23:29
…g card

Chat tab: clear spinner + call state on 401 and generic terminal failures;
turn the call button into a destructive retry icon and show an inline
"Try again" banner. Reset dedupe guards and terminal-error state on
manual retry / new conversation.

Settings: new "Desktop pairing" card with Test (5s WS session.auth probe →
paired / unauthorized / unreachable), token tail readout (last 8 chars),
and Forget pairing action.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yagudaev and others added 2 commits May 31, 2026 00:24
Remove the pre-pairing-era VoiceClaw Desktop card (URL/API key inputs,
Setup instructions, Connected/Test pill) — the Desktop pairing card now
owns connectivity. Keep model/voice/volume controls and rename the card
to Voice Mode.

Mirror the chat's session.auth handshake in the pairing Test: read the
stored URL and token fresh from SQLite, send {type,apiKey,deviceName},
resolve only on session.auth.ok, treat error 401 / close 1008 / 4401 as
pair-needed, and stretch the timeout to 7s so a healthy bridge round
trip does not false-negative. Surface the URL the probe used inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
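
The outcome rules for the pairing Test can be written as a small classifier over probe events. The `ProbeEvent` shape and the `classifyProbeEvent` name are assumptions made for this sketch; the status codes, the `session.auth.ok` message type, and the three outcomes come from the commit messages.

```typescript
type ProbeEvent =
  | { kind: "message"; type: string; code?: number }
  | { kind: "close"; code: number }
  | { kind: "timeout" }; // fired after the 7s budget elapses

type ProbeResult = "paired" | "unauthorized" | "unreachable";

// Returns a final result, or undefined when the probe should keep waiting.
function classifyProbeEvent(event: ProbeEvent): ProbeResult | undefined {
  switch (event.kind) {
    case "message":
      if (event.type === "session.auth.ok") return "paired";
      if (event.type === "error" && event.code === 401) return "unauthorized";
      return undefined; // unrelated message: not a verdict yet
    case "close":
      // 1008 (policy violation) and 4401 are treated as "pairing needed".
      return event.code === 1008 || event.code === 4401 ? "unauthorized" : "unreachable";
    case "timeout":
      return "unreachable";
  }
}
```

Resolving only on `session.auth.ok` means an open socket alone never counts as paired.
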
yagudaev and others added 2 commits May 31, 2026 01:36
… file

Standalone `yarn dev` relays had no way to call the desktop's device-token
bridge: the URL + nonce were only injected by buildRelayEnv() on the
desktop-managed spawn path. When a developer ran `yarn dev` against an
already-running desktop, every paired-device session.auth rejected because
checkDeviceToken silently returned ok:false (no bridge env). Mobile saw
"Connection failed — try re-pairing" and the relay log showed
"session.auth failed" with no reason attached.

Desktop now writes <userData>/device-token-bridge.json on bridge start
(0600, removed on shutdown), and the relay reads env first, then falls
back to that discovery file. Both desktop-managed and standalone yarn-dev
relays can validate vcd_ tokens against the live desktop bridge with zero
env wiring.

Rejection logs now carry a token preview and a structured reason
(no-credential | master-key-mismatch-no-bridge | master-key-mismatch-token-unknown)
plus a human-readable description, so future "every auth 401s" incidents
are diagnosable from a single log line. Relay startup also prints the
bridge source ("env" / "discovery") or a loud warning when neither is
present.

Verified on the wire via scripts/repro-pairing-401.mjs:
  Scenario A (master key)          before=ok  → after=ok
  Scenario B (paired device token) before=401 → after=ok
  Scenario C (garbage)             before=401 → after=401

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
yagudaev and others added 2 commits May 31, 2026 01:52
A standalone `yarn dev` relay can carry stale VOICECLAW_DEVICE_TOKEN_CHECK_URL
and _NONCE in the developer's shell or .env from a previous launch. The old
getBridgeConfig() returned the env source unconditionally when both were set,
shadowing the live discovery file the running desktop had just written. Every
paired-device session.auth POSTed the plaintext token to the dead URL, the
bridge call timed out, checkDeviceToken returned ok:false, and mobile saw 401
"unauthorized" — even though the desktop was up and the token was valid.

The discovery file is the live source of truth: the desktop writes it on
bridge start and removes it on shutdown, and it contains the pid of the
writer. Use it whenever its pid is alive (process.kill(pid, 0) succeeds or
raises EPERM, both signaling a live process). Fall through to env vars only
when no fresh discovery file exists, so the symmetric case — a stale
discovery file left by a crashed desktop + a live env-injected URL — still
works.

scripts/repro-pairing-401-e2e.mjs exercises the full surface end-to-end
against a real relay process: four launch configurations × three auth
scenarios (master key, paired device token, garbage). The counterexample
the codex reviewer flagged (group-3: stale env + live discovery) goes
from 401 to ok; the symmetric counterexample (group-4: stale discovery +
live env) stays ok via the env fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
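
The precedence rule from this commit, discovery file first when its writer pid is alive and env as the fallback, can be sketched as follows. The type names and `pickBridge` are hypothetical, and the kill function is passed in so the logic can be exercised without real processes; in a relay it would be something like `(pid, signal) => process.kill(pid, signal)`.

```typescript
// Signal 0 performs the existence/permission check without delivering a signal.
type KillFn = (pid: number, signal: 0) => void;

// A pid is considered alive when the probe succeeds or fails with EPERM
// (the process exists but belongs to another user).
function isPidAlive(pid: number, kill: KillFn): boolean {
  try {
    kill(pid, 0);
    return true;
  } catch (err) {
    return (err as { code?: string } | null)?.code === "EPERM";
  }
}

interface BridgeConfig {
  url: string;
  nonce: string;
}

interface DiscoveryFile extends BridgeConfig {
  pid: number; // pid of the desktop process that wrote the file
}

// Live discovery file wins; env vars are used only when it is absent or stale.
function pickBridge(
  discovery: DiscoveryFile | undefined,
  env: BridgeConfig | undefined,
  kill: KillFn,
): { source: "discovery" | "env"; config: BridgeConfig } | undefined {
  if (discovery && isPidAlive(discovery.pid, kill)) {
    return { source: "discovery", config: { url: discovery.url, nonce: discovery.nonce } };
  }
  if (env) return { source: "env", config: env };
  return undefined;
}
```
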
… in relay bridge lookup

Counterexample to the previous fix: a `yarn dev` relay carrying stale
VOICECLAW_DEVICE_TOKEN_CHECK_URL / _NONCE in its shell / .env would
ignore the discovery file and POST every paired-device token to a dead
URL, returning 401 to the mobile client.

Discovery file now wins when its writer pid is alive. Env is the
fallback path only — for desktop-spawned bundled relays whose env is
known fresh, and for the symmetric "stale discovery file from a
crashed desktop + live env" case.

scripts/repro-pairing-401-e2e.mjs verifies all 12 scenarios (4 launch
configurations × 3 auth scenarios) end-to-end against a real relay
process.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
yagudaev and others added 2 commits May 31, 2026 10:29
Drops the prefix/suffix preview in favor of an 8-char sha256 fingerprint
so failed-auth log lines stay correlatable without exposing any bytes of
the rejected credential.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread relay-server/src/session.ts Fixed
yagudaev and others added 4 commits May 31, 2026 10:36
The sha256 fingerprint is a log correlator, not a password digest; it is
never stored or used for verification. CodeQL's password-hash query can
only see "apiKey -> sha256" so it flags the call as an unsalted password
hash. Suppress with justification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Any reference to apiKey from a log call site keeps tripping CodeQL —
clear-text-logging on the raw preview, insufficient-password-hash on
the sha256 correlator. Suppression comments aren't honored by GitHub
code-scanning. The reason code + describeRejectReason already give
operators the actionable signal; drop the credential reference entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yagudaev yagudaev merged commit 3c987a8 into main May 31, 2026
9 checks passed
@yagudaev yagudaev deleted the experiment/direct-tools-pi-mono branch May 31, 2026 17:57

2 participants