feat(desktop): macOS Accessibility API text capture as second channel to Gemini (NAN-707) #367
Draft
yagudaev wants to merge 1 commit into
Review summary
Codex skipped, Gemini findings
Critical (fixed in 79026a4):
Should-fix (fixed in 79026a4):
Should-fix (deferred with rationale):
Nits (skipped):
All 113 desktop + 79 relay-server tests still pass after the fixes; both packages typecheck.
yagudaev added a commit that referenced this pull request on May 1, 2026
Critical:
- AX capture is now gated to display sources only. When sharing a single window, AX text comes from the *frontmost* window which can diverge from the captured pixels (window sources track the share even when not focused). Window-source shares ship vision-only until we wire CGWindowID alignment in a follow-up.
Should-fix:
- ScreenCapture sets a stopped flag and rechecks it after the AX await, so a frame can no longer land after stop().
- Relay sanitizes axText at the trust boundary: rejects non-strings, re-caps at 8KB UTF-8. The renderer cap is not a trust boundary.
- Sidecar restart counter only resets after 30s of stable runtime. A crash loop can no longer forever-retry at the smallest backoff.
- shuttingDown flag prevents the exit handler from resurrecting the sidecar after app quit kills it.
- Distribution builds (yarn dist:mac sets AX_REQUIRE_UNIVERSAL=1) now hard-fail when the x86_64 toolchain is missing instead of silently shipping arm64-only. Dev builds still degrade gracefully.
- Stripped NAN-707 references from source comments; they belong in the PR body, not the codebase.
Nits:
- Type-literal semicolons swapped for commas in three call sites.
- Dropped unused REQUEST_TIMEOUT_REASON constant and _reason arg.
- Sidecar fallback-window failure now reports werr.rawValue (the actual failed call) instead of err.rawValue.
Tests:
- 2 new relay sanitizer tests (non-string rejection, 8KB cap).
- All 113 desktop + 81 relay-server tests pass; both packages typecheck.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
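The trust-boundary sanitization described in the commit above (reject non-strings, re-cap at 8KB counted in UTF-8 bytes) could look roughly like the sketch below. The function name `sanitizeAxText` and the constant `MAX_AX_TEXT_BYTES` are hypothetical and not taken from the PR. The choice to back the cut up to a code-point boundary is also an assumption about how truncation should behave.

```typescript
// Hypothetical names; the PR only states "rejects non-strings, re-caps at 8KB UTF-8".
const MAX_AX_TEXT_BYTES = 8 * 1024

function sanitizeAxText(value: unknown): string | undefined {
  // Anything that is not a string is dropped at the boundary.
  if (typeof value !== "string") return undefined

  const bytes = new TextEncoder().encode(value)
  if (bytes.length <= MAX_AX_TEXT_BYTES) return value

  // Assumption: truncate on a code-point boundary. UTF-8 continuation bytes
  // look like 10xxxxxx, so back up while the first excluded byte is one.
  let end = MAX_AX_TEXT_BYTES
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--

  return new TextDecoder().decode(bytes.subarray(0, end))
}
```

Counting bytes rather than `value.length` matters because a string of non-ASCII characters can be well under 8192 UTF-16 code units and still exceed 8 KB once encoded.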
Codex review on this branch (latest commit ca9b7f2)
GitHub doesn't let me post a request-changes review on my own PR, so this is a comment instead. Reviewed by Codex CLI (codex-cli 0.128.0), not by the original author model.
Critical (resolved)
Should-fix (resolved)
Nits (resolved)
Test results after fixes
… to Gemini (NAN-707)
Adds a second input channel that captures the actual text of the frontmost
macOS window via the Accessibility API and records it alongside each 1 FPS
image frame. Goal: reduce hallucinated text on code, terminals, and dense
UIs where vision OCR struggles.
Architecture:
- Swift sidecar (desktop/native/ax-capture) speaks JSON-line stdio to the
Electron main process. Universal arm64+x86_64, AXIsProcessTrusted-gated,
returns a flat list of {role, text, frame, app}.
- Main process module manages sidecar lifecycle (lazy-start, restart with
stable-runtime-gated backoff, shutting-down flag, stdin backpressure).
- Renderer screen-capture loop calls AX in parallel with each JPEG and
passes axText through. Gated to display-source shares only — for window
shares the AX frontmost can diverge from the captured pixels.
- stop() sets a flag that is rechecked after the AX await, so a frame can't
land after stop().
- Wire format: optional `axText` on FrameAppendEvent, sanitized at the
relay trust boundary (type + 8KB cap).
- Per-frame ax_text written to per-turn timings.json for tracer A/B.
- Permission UX: pre-flight probe on screen-share start; confirm dialog
opens System Settings → Privacy → Accessibility deeplink. Capture
continues vision-only if denied.
- Build: `yarn build:ax-capture` invoked by `yarn dev` and dist:mac;
idempotent (mtime-skip), AX_FORCE_REBUILD=1 forces.
AX_REQUIRE_UNIVERSAL=1 set on dist:mac so x86_64 failures are fatal.
Upstream forwarding to Gemini is OFF in this PR. Sending axText as a
sibling realtimeInput.text per frame triggered Gemini Live to treat each
frame as a fresh user turn, generating a tool-call storm + 1007 + TPM
quota burn. Tracer logging is preserved so the A/B value remains; a
turn-bounded re-enable strategy is the natural follow-up.
Tests:
- 5 main-process formatter tests
- 4 sidecar protocol integration tests against the real binary
- 4 renderer formatter tests
- 4 relay frame.append → adapter wiring tests (incl. trust-boundary cases)
- 3 MediaCapture ax_text persistence tests
- All desktop + relay-server tests pass; both packages typecheck.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
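The sidecar lifecycle rules in this commit message (stable-runtime-gated backoff plus a shutting-down flag) can be illustrated with a small policy object. This is a minimal sketch, not the PR's implementation: the class name `RestartPolicy`, the exponential schedule, and the 500 ms base and 30 s ceiling are all assumptions. Only the 30-second stable-runtime threshold comes from the text.

```typescript
const STABLE_RUNTIME_MS = 30_000 // from the commit message
const BASE_DELAY_MS = 500        // assumption
const MAX_DELAY_MS = 30_000      // assumption

// Hypothetical helper: decides whether and when to restart the sidecar.
class RestartPolicy {
  private attempts = 0
  private shuttingDown = false

  // Called on app quit so the exit handler cannot resurrect the sidecar.
  markShuttingDown(): void {
    this.shuttingDown = true
  }

  // Returns the delay before the next restart, or null for "do not restart".
  onExit(runtimeMs: number): number | null {
    if (this.shuttingDown) return null

    // The counter resets only if the process stayed up long enough, so a
    // crash loop keeps climbing the backoff ladder instead of restarting it.
    if (runtimeMs >= STABLE_RUNTIME_MS) this.attempts = 0

    const delay = Math.min(BASE_DELAY_MS * 2 ** this.attempts, MAX_DELAY_MS)
    this.attempts++
    return delay
  }
}
```

Keeping the decision free of timers and child-process handles makes it testable without spawning the real binary; the caller would pass the measured runtime and schedule the restart itself.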
bf27ac5 to 424bcca
Why
Gemini regularly hallucinates text from screen captures: code, terminal output, dense tables, and small UI labels are the worst offenders, even at our 1536px / 0.85q JPEG. Adding a parallel channel that sends the actual on-screen text via the macOS Accessibility API alongside each image gives the model both the picture and the ground-truth text, temporally aligned, so it stops guessing on the parts that vision can't read reliably.
Closes NAN-707.
Architecture (decisions table)
ApplicationServices.framework); sidecar is what serious mac-AX tools (Raycast/Rewind) do; it is easier to iterate on and notarize than a node-gyp native module.
`{role, text, frame, app}`
`realtimeInput.text` immediately after each `realtimeInput.video`
`ax_text` field on each `videoFrames[]` entry in the per-turn `timings.json`
Test plan
- `frame.append` → `sendFrame` + `sendAxText` routing, including legacy-adapter compatibility
- `ax_text` persistence and 8KB truncation
- `ax_text` and Gemini's transcripts show fewer hallucinations
- `yarn dist:mac` smoke test that the sidecar lands in `Contents/Resources/bin/ax-capture` and is signed + notarized
Implementation notes
- `FrameAppendEvent` gets an optional `axText?: string`. Old desktop builds simply omit it; the field is additive.
- `ProviderAdapter.sendAxText` is optional. OpenAI and xAI adapters don't implement it; only Gemini sees the channel, and that's correct since it's the only adapter accepting video.
- `"video"` in Gemini's send-upstream queue so it shares oldest-drop discipline with its paired image. At 1 FPS, drift between image and AX text after a rotation is bounded to one second.
- `sendFrame`, `sendAxText` does NOT pet the watchdog. The "are you still there?" prompt should still fire correctly during silent screen sharing.
- `Contents/Resources/bin/ax-capture` and gets signed with the app's Developer ID via electron-builder's afterSign hook. macOS attributes the AX request to the parent app bundle, so the user adds VoiceClaw to `Privacy & Security → Accessibility`, not the sidecar binary directly. This needs verification on a real `dist:mac` build. If macOS shows the sidecar as a separate entry, we'll either need to relaunch it via the parent's process tree differently or add explicit per-binary entitlements.
- `formatAxText` in main, `formatAxTextRenderer` in renderer); relay re-caps defensively to 8 KB on persist. UTF-8 byte-counted, not chars.
- `desktop/scripts/build-services.mjs` now runs `build-ax-capture.mjs` after the relay/openclaw bundles, producing the universal binary into `desktop/resources/bin/ax-capture`. `electron-builder.yml`'s existing `extraResources: from: resources/` picks it up automatically.
- `.build/` and `.swiftpm/` directories added.
🤖 Generated with Claude Code
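The additive wire format and the optional adapter method from the implementation notes can be sketched as follows. `FrameAppendEvent`, `ProviderAdapter`, `sendFrame`, `sendAxText`, and `axText` are names from the PR. The `data` field, the event's exact shape, and the `routeFrameAppend` helper are assumptions made for illustration. The sketch shows the routing as the notes describe it; the later commit message states that upstream forwarding to Gemini is switched off in this PR.

```typescript
// Shapes are assumptions apart from the names quoted in the PR.
interface FrameAppendEvent {
  type: "frame.append"
  data: string      // hypothetical field: the encoded JPEG frame
  axText?: string   // additive: old desktop builds simply omit it
}

interface ProviderAdapter {
  sendFrame(data: string): void
  sendAxText?(text: string): void // optional: adapters without it are skipped
}

// Hypothetical helper: image first, then its paired AX text if both exist.
function routeFrameAppend(adapter: ProviderAdapter, event: FrameAppendEvent): void {
  adapter.sendFrame(event.data)
  if (event.axText !== undefined) {
    // Optional call: does nothing for adapters that lack sendAxText.
    adapter.sendAxText?.(event.axText)
  }
}
```

Because both the field and the method are optional, neither old clients nor adapters without the method need any changes.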