feat(desktop): macOS Accessibility API text capture as second channel to Gemini (NAN-707) #367

Draft
yagudaev wants to merge 1 commit into main from michael/nan-707-ax-text-capture

yagudaev commented May 1, 2026

Owner

Why

Gemini hallucinates text from screen captures regularly — code, terminal output, dense tables, and small UI labels are the worst offenders even at our 1536px / 0.85q JPEG. Adding a parallel channel that sends the actual on-screen text via macOS Accessibility API alongside each image gives the model both the picture and the ground-truth text temporally aligned, so it stops guessing on the parts that vision can't read reliably.

Closes NAN-707.

Architecture (decisions table)

| Decision | Choice | Why |
| --- | --- | --- |
| Capture mechanism | Swift sidecar binary, JSON-line stdio | AX is C-only (ApplicationServices.framework); a sidecar is what serious mac-AX tools (Raycast/Rewind) do, and it is easier to iterate on and notarize than a node-gyp native module. |
| Scope per capture | Frontmost window only | Bounds payload size; matches user attention. |
| Tree shape | Flat list of {role, text, frame, app} | Easier for the model to consume; tree depth rarely adds signal for text. |
| Cadence | Captured fresh per image frame (1 FPS) | Keeps image + text temporally aligned without separate sync logic. |
| Delivery to Gemini | Sibling realtimeInput.text immediately after each realtimeInput.video | Gemini Live has no combined video+text part, but adjacent sends in the same WS tick are treated as one moment. |
| Tracing | ax_text field on each videoFrames[] entry in the per-turn timings.json | Lets us A/B image-only vs image+text accuracy on text-heavy interfaces. |
| Permission UX | Pre-flight probe on screen-share start; confirm dialog opens System Settings deeplink if denied | Mirrors existing screen-recording flow; capture continues vision-only either way (graceful degradation). |
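
For readers unfamiliar with JSON-line stdio, here is a rough sketch of how the main process might frame the sidecar's stdout. Only the `{role, text, frame, app}` keys come from the table above; the field shapes inside `frame`, and the names `AxElement`, `JsonLineDecoder`, and `isAxElement`, are hypothetical and may not match the code in this PR.

```typescript
// Hypothetical element shape: the keys follow the decisions table,
// the types of each field are an assumption.
interface AxElement {
  role: string,
  text: string,
  frame: { x: number, y: number, width: number, height: number },
  app: string,
}

type DecodedLine =
  | { ok: true, value: unknown }
  | { ok: false, raw: string }

// Accumulates stdout chunks and returns one result per complete line.
class JsonLineDecoder {
  private buffer = ""

  push(chunk: string): DecodedLine[] {
    this.buffer += chunk
    const lines = this.buffer.split("\n")
    // The last piece is "" (chunk ended on a newline) or a partial line to keep.
    this.buffer = lines.pop() ?? ""
    const out: DecodedLine[] = []
    for (const line of lines) {
      const trimmed = line.trim()
      if (trimmed === "") continue
      try {
        out.push({ ok: true, value: JSON.parse(trimmed) })
      } catch {
        // Bad JSON is reported to the caller instead of crashing the reader.
        out.push({ ok: false, raw: trimmed })
      }
    }
    return out
  }
}

// Loose runtime check before trusting a parsed value as an element.
function isAxElement(v: unknown): v is AxElement {
  if (typeof v !== "object" || v === null) return false
  const o = v as Record<string, unknown>
  return typeof o.role === "string" && typeof o.text === "string" &&
    typeof o.app === "string" && typeof o.frame === "object" && o.frame !== null
}
```

A chunk that ends mid-line is held back until the rest arrives, so a message split across two `data` events still parses as one object.
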
Test plan
  • Swift sidecar builds as universal binary (arm64 + x86_64)
  • 4 sidecar protocol integration tests (ping / permission / capture / bad JSON) pass against the real binary
  • 5 main-process formatter tests + 4 renderer formatter tests
  • 4 relay session tests for frame.append → sendFrame + sendAxText routing, including legacy-adapter compatibility
  • 3 MediaCapture tests for ax_text persistence and 8KB truncation
  • All 113 desktop tests + 79 relay-server tests + typecheck both packages
  • Docs page renders into the Starlight build (33 pages built, new sidebar entry visible)
  • Manual: run a session against text-heavy UIs (VS Code, dense web table, terminal) — verify tracer rows now contain ax_text and Gemini's transcripts show fewer hallucinations
  • Manual: revoke Accessibility permission, start a share, verify the confirm dialog appears and capture continues vision-only
  • Manual: yarn dist:mac smoke test that the sidecar lands in Contents/Resources/bin/ax-capture and is signed + notarized
Implementation notes
  • Wire format: FrameAppendEvent gets an optional axText?: string. Old desktop builds simply omit it; the field is additive.
  • Adapter interface: ProviderAdapter.sendAxText is optional. OpenAI and xAI adapters don't implement it, so only Gemini sees the channel. That's correct, since Gemini is the only adapter accepting video.
  • Reconnect queue: AX text is classified as "video" in Gemini's send-upstream queue so it shares oldest-drop discipline with its paired image. At 1 FPS, drift between image and AX text after a rotation is bounded to one second.
  • Watchdog: like sendFrame, sendAxText does NOT pet the watchdog. The "are you still there?" prompt should still fire correctly during silent screen sharing.
  • Permission inheritance: the sidecar is bundled inside Contents/Resources/bin/ax-capture and gets signed with the app's Developer ID via electron-builder's afterSign hook. macOS attributes the AX request to the parent app bundle, so the user adds VoiceClaw to Privacy & Security → Accessibility, not the sidecar binary directly. This needs verification on a real dist:mac build. If macOS shows the sidecar as a separate entry, we'll either need to relaunch it via the parent's process tree differently or add explicit per-binary entitlements.
  • Truncation: client-side cap is 8 KB at format time (formatAxText in main, formatAxTextRenderer in renderer); relay re-caps defensively to 8 KB on persist. UTF-8 byte-counted, not chars.
  • Build pipeline: desktop/scripts/build-services.mjs now runs build-ax-capture.mjs after the relay/openclaw bundles, producing the universal binary into desktop/resources/bin/ax-capture. electron-builder.yml's existing extraResources: from: resources/ picks it up automatically.
  • Gitignore: built binary + Swift .build/ and .swiftpm/ directories added.
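
To make the "UTF-8 byte-counted, not chars" point in the truncation note concrete, a minimal sketch of a byte-capped truncation follows. The name `truncateUtf8` and the constant `AX_TEXT_MAX_BYTES` are hypothetical; the real `formatAxText` may be implemented differently.

```typescript
const AX_TEXT_MAX_BYTES = 8 * 1024 // the 8 KB cap from the note above

// Returns the longest prefix of `text` whose UTF-8 encoding fits in `maxBytes`,
// without cutting a multi-byte character in half.
function truncateUtf8(text: string, maxBytes: number = AX_TEXT_MAX_BYTES): string {
  let usedBytes = 0
  let end = 0 // index into `text`, counted in UTF-16 code units
  for (const ch of text) { // iterates by code point, so surrogate pairs stay whole
    const cp = ch.codePointAt(0) ?? 0
    const size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4
    if (usedBytes + size > maxBytes) break
    usedBytes += size
    end += ch.length
  }
  return text.slice(0, end)
}
```

Counting bytes rather than `text.length` matters for non-ASCII content: an emoji is two UTF-16 code units but four UTF-8 bytes, so a character-based cap could overshoot the limit.
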

🤖 Generated with Claude Code

vercel Bot commented May 1, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| voiceclaw | Ready | Preview, Comment | May 5, 2026 5:17am |

yagudaev commented May 1, 2026

Owner Author

Review summary

Codex skipped — codex CLI flagged the worktree as untrusted and refused without --skip-git-repo-check. Skill bug, not a code issue. Falling back to Gemini-only.

Gemini findings

Critical (fixed in 79026a4):

  • Force-casts (as!) on AX results in three Swift sites would crash the sidecar if a buggy app returned a non-AXUIElement / non-AXValue. Replaced with CFGetTypeID guards that bail with ax_failed / skip the frame.

Should-fix (fixed in 79026a4):

  • Sidecar now lazy-starts on first capture/permission call, not at app launch.
  • stdin.write backpressure honored — drop request when the pipe buffer is full instead of buffering unbounded data in Node.
  • Build script now verifies lipo / chmod exit codes and asserts the output binary exists.
  • Swift Package.swift minimum lowered from .v12 to .v11 to match Electron 41's floor.
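
The backpressure item above can be sketched roughly as follows. This is an illustration, not the PR's code: `WritableLike` is a hand-written slice of Node's `Writable` (`write()` returning `false` once the buffer passes its high-water mark, and a `'drain'` event afterwards), and `DroppingLineWriter` is a hypothetical name.

```typescript
// The two Writable members this sketch relies on.
interface WritableLike {
  write(chunk: string): boolean
  once(event: "drain", listener: () => void): unknown
}

class DroppingLineWriter {
  private blocked = false
  private readonly stream: WritableLike
  dropped = 0

  constructor(stream: WritableLike) {
    this.stream = stream
  }

  // Returns true if the request was handed to the pipe, false if it was dropped.
  send(request: object): boolean {
    if (this.blocked) {
      // Pipe buffer is full: drop instead of letting Node queue without bound.
      this.dropped++
      return false
    }
    const accepted = this.stream.write(JSON.stringify(request) + "\n")
    if (!accepted) {
      // This chunk was still queued, but nothing more goes out until 'drain'.
      this.blocked = true
      this.stream.once("drain", () => {
        this.blocked = false
      })
    }
    return true
  }
}
```

Dropping suits a 1 FPS capture loop because a stale capture request has little value once the next frame is due; the caller can treat a dropped request as "no AX text for this frame".
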

Should-fix (deferred with rationale):

  • Duplicated formatter (main + renderer): The two copies are 18 lines of pure string manipulation, both unit-tested separately. Consolidating would require a third bundle target or a preload-mediated path; not worth the bundling complexity for the deduplication. Added cross-reference comments in both copies.

Nits (skipped):

  • Redundant as [Any] casts in Swift readableText loop — keeping for clarity.
  • Unused _reason arg in failAllPending — kept for symmetry with the resolve-callback pattern; trivial.
  • window.confirm for permission dialog — acceptable for v1; can upgrade to a custom modal once we have a banner system.

All 113 desktop + 79 relay-server tests still pass after the fixes; both packages typecheck.

yagudaev added a commit that referenced this pull request May 1, 2026
Critical:
- AX capture is now gated to display sources only. When sharing a single
  window, AX text comes from the *frontmost* window which can diverge from
  the captured pixels (window sources track the share even when not
  focused). Window-source shares ship vision-only until we wire CGWindowID
  alignment in a follow-up.

Should-fix:
- ScreenCapture sets a stopped flag and rechecks it after the AX await, so
  a frame can no longer land after stop().
- Relay sanitizes axText at the trust boundary: rejects non-strings,
  re-caps at 8KB UTF-8. The renderer cap is not a trust boundary.
- Sidecar restart counter only resets after 30s of stable runtime. A crash
  loop can no longer forever-retry at the smallest backoff.
- shuttingDown flag prevents the exit handler from resurrecting the sidecar
  after app quit kills it.
- Distribution builds (yarn dist:mac sets AX_REQUIRE_UNIVERSAL=1) now hard-
  fail when the x86_64 toolchain is missing instead of silently shipping
  arm64-only. Dev builds still degrade gracefully.
- Stripped NAN-707 references from source comments — they belong in the
  PR body, not the codebase.

Nits:
- Type-literal semicolons swapped for commas in three call sites.
- Dropped unused REQUEST_TIMEOUT_REASON constant and _reason arg.
- Sidecar fallback-window failure now reports werr.rawValue (the actual
  failed call) instead of err.rawValue.

Tests:
- 2 new relay sanitizer tests (non-string rejection, 8KB cap).
- All 113 desktop + 81 relay-server tests pass; both packages typecheck.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
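
The restart-counter rule from the commit above (reset only after 30 s of stable runtime) could look roughly like this. `RestartBackoff`, `BASE_DELAY_MS`, and `MAX_DELAY_MS` are hypothetical, and the delay values are assumptions; only the 30 s stable-runtime threshold comes from the commit message.

```typescript
const RESTART_STABLE_RUNTIME_MS = 30000 // from the commit message
const BASE_DELAY_MS = 500               // assumption for this sketch
const MAX_DELAY_MS = 30000              // assumption for this sketch

class RestartBackoff {
  private attempts = 0
  private startedAt = 0

  // Call when the sidecar process has been spawned.
  markStarted(now: number): void {
    this.startedAt = now
  }

  // Call from the exit handler; returns how long to wait before respawning.
  nextDelay(now: number): number {
    const ranFor = now - this.startedAt
    // Only a run that stayed up long enough earns a reset. Resetting on every
    // spawn would let a crash loop retry at the smallest delay forever.
    if (ranFor >= RESTART_STABLE_RUNTIME_MS) this.attempts = 0
    const delay = Math.min(BASE_DELAY_MS * 2 ** this.attempts, MAX_DELAY_MS)
    this.attempts++
    return delay
  }
}
```

Passing `now` in as a parameter keeps the class free of timers, so the backoff schedule can be unit-tested without waiting on real time.
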

yagudaev commented May 1, 2026

Owner Author

Codex review on this branch (latest commit ca9b7f2)

GitHub doesn't let me post a request-changes review on my own PR, so this is a comment instead. Reviewed by Codex CLI (codex-cli 0.128.0), not by the original author model.

Critical (resolved)

  • AX capture leaked text from non-shared windows — capture read NSWorkspace.shared.frontmostApplication, so sharing window A while focused on window B sent B's text alongside A's screenshot.
    • Fix: AX is now gated to display-source shares only (sourceId.startsWith('screen:')). For window shares, screen capture continues vision-only until we wire CGWindowID-aware AX traversal.
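
A minimal sketch of that gate, assuming the `sourceId` values follow Electron's `desktopCapturer` convention of `screen:…` for displays and `window:…` for single windows. The function name `axModeForSource` and the `AxMode` type are hypothetical.

```typescript
type AxMode = "capture" | "vision-only"

// Display shares may include AX text; everything else stays vision-only,
// because the frontmost window's text could belong to a window that is
// not part of the shared pixels.
function axModeForSource(sourceId: string): AxMode {
  return sourceId.startsWith("screen:") ? "capture" : "vision-only"
}
```

Defaulting unknown prefixes to vision-only keeps the failure mode on the safe side: a missed AX capture costs accuracy, while a wrong one leaks text.
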

Should-fix (resolved)

  • captureFrame race after stop() — the AX await widened the window where a frame could land after the user ended sharing. Fix: stopped flag set in stop() and re-checked after the await before calling onFrame.
  • Server-side axText validation missing — relay forwarded event.axText to Gemini without checking type or size. Fix: sanitizeAxText in session.ts rejects non-strings and re-caps at 8 KB UTF-8. Two new tests cover both cases.
  • Restart-backoff counter reset on every spawn — a crash-looping sidecar would forever-retry at the smallest delay. Fix: counter only resets after RESTART_STABLE_RUNTIME_MS (30 s) of stable runtime.
  • Quit + scheduleRestart race: before-quit killed the sidecar; the exit handler then scheduled a restart. Fix: shuttingDown flag short-circuits both ensureSidecar() / startSidecar() and scheduleRestart().
  • Universal-binary silent fallback — x86_64 build failures dropped to arm64-only without error, even on dist:mac. Fix: AX_REQUIRE_UNIVERSAL=1 set in dist:mac; build aborts on x86_64 failure under that mode. Dev builds still degrade gracefully.
  • NAN-707 referenced in source comments — per CLAUDE.md, ticket references belong in PR bodies, not the codebase. Fix: removed from relay-server/src/types.ts and relay-server/src/session.ts.
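
The captureFrame race fix in the list above follows a common pattern: set a flag in `stop()` and check it again after every `await`. A small sketch under assumed names (`FrameLoop`, `Frame`, and the constructor arguments are hypothetical, not the PR's ScreenCapture class):

```typescript
interface Frame {
  jpeg: string,
  axText: string | undefined,
}

class FrameLoop {
  private stopped = false
  private readonly captureAx: () => Promise<string | undefined>
  private readonly onFrame: (frame: Frame) => void

  constructor(
    captureAx: () => Promise<string | undefined>,
    onFrame: (frame: Frame) => void,
  ) {
    this.captureAx = captureAx
    this.onFrame = onFrame
  }

  async captureFrame(jpeg: string): Promise<void> {
    if (this.stopped) return
    const axText = await this.captureAx()
    // stop() may have run while the AX call was in flight, so check again.
    if (this.stopped) return
    this.onFrame({ jpeg, axText })
  }

  stop(): void {
    this.stopped = true
  }
}
```

The first check alone is not enough: the `await` yields to the event loop, which is exactly the window in which the user can end the share.
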

Nits (resolved)

  • Type-literal semicolons swapped for commas (style: no semicolons).
  • Dropped unused REQUEST_TIMEOUT_REASON constant and _reason arg.
  • Sidecar fallback-window failure now reports werr.rawValue (the actual failed AX call) instead of err.rawValue.

Test results after fixes

  • 113/113 desktop tests
  • 81/81 relay-server tests (+2 new sanitizer tests)
  • Both packages typecheck clean.

… to Gemini (NAN-707)

Adds a second input channel that captures the actual text of the frontmost
macOS window via the Accessibility API and records it alongside each 1 FPS
image frame. Goal: reduce hallucinated text on code, terminals, and dense
UIs where vision OCR struggles.

Architecture:
- Swift sidecar (desktop/native/ax-capture) speaks JSON-line stdio to the
  Electron main process. Universal arm64+x86_64, AXIsProcessTrusted-gated,
  returns a flat list of {role, text, frame, app}.
- Main process module manages sidecar lifecycle (lazy-start, restart with
  stable-runtime-gated backoff, shutting-down flag, stdin backpressure).
- Renderer screen-capture loop calls AX in parallel with each JPEG and
  passes axText through. Gated to display-source shares only — for window
  shares the AX frontmost can diverge from the captured pixels.
- Stop() sets a flag rechecked after the AX await so a frame can't land
  after stop.
- Wire format: optional `axText` on FrameAppendEvent, sanitized at the
  relay trust boundary (type + 8KB cap).
- Per-frame ax_text written to per-turn timings.json for tracer A/B.
- Permission UX: pre-flight probe on screen-share start; confirm dialog
  opens System Settings → Privacy → Accessibility deeplink. Capture
  continues vision-only if denied.
- Build: `yarn build:ax-capture` invoked by `yarn dev` and dist:mac;
  idempotent (mtime-skip), AX_FORCE_REBUILD=1 forces.
  AX_REQUIRE_UNIVERSAL=1 set on dist:mac so x86_64 failures are fatal.

Upstream forwarding to Gemini is OFF in this PR. Sending axText as a
sibling realtimeInput.text per frame triggered Gemini Live to treat each
frame as a fresh user turn, generating a tool-call storm + 1007 + TPM
quota burn. Tracer logging is preserved so the A/B value remains; a
turn-bounded re-enable strategy is the natural follow-up.

Tests:
- 5 main-process formatter tests
- 4 sidecar protocol integration tests against the real binary
- 4 renderer formatter tests
- 4 relay frame.append → adapter wiring tests (incl. trust-boundary cases)
- 3 MediaCapture ax_text persistence tests
- All desktop + relay-server tests pass; both packages typecheck.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>