An autonomous web agent Chrome extension that uses the Set-of-Mark visual prompting technique and multimodal LLMs to navigate the web, analyze pages via screenshots, and execute actions via hardware-level simulation through the Chrome DevTools Protocol.
- Overview
- High-Level Architecture
- Extension Components
- The Agent Graph
- LLM Integration
- Agent Tools
- Set-of-Mark Annotation
- Hardware Input Simulation
- Persistence & State
- Screenshot Capture
- File Handling
- Safety & Loop Detection
- Directory Structure
- Sandbox Environment
- Development
Opticlick is a Manifest V3 Chrome Extension that acts as a fully autonomous web agent. Given a natural-language task, the agent:
- Annotates the live page with numbered bounding boxes (Set-of-Mark)
- Takes a screenshot of the annotated page
- Sends the screenshot + task context to an LLM
- Parses the LLM's structured tool-call response
- Executes the chosen action via CDP hardware simulation
- Repeats until the task is complete
The agent supports Gemini cloud models (including extended thinking) and locally-running Ollama models.
flowchart TB
subgraph Extension ["Chrome Extension"]
direction TB
SP["Side Panel (React UI)"]
BG["Background Service Worker (Orchestrator)"]
CS["Content Script (All Frames)"]
DB[("IndexedDB (VFS, Memory, Chats)")]
end
subgraph WebTab ["Active Web Tab"]
WT["Active Page (DOM)"]
end
subgraph Models ["LLM Provider APIs"]
direction LR
Gemini["Gemini Cloud Models"]
Ollama["Ollama Local Daemon"]
end
SP <-->|"Bidirectional Messages"| BG
BG -->|"Tab Injection / Messaging"| CS
CS -->|"Set-of-Mark Overlay"| WT
BG -->|"CDP Hardware Events"| WT
BG <-->|"IndexedDB Reads/Writes"| DB
BG -->|"Secure Requests"| Models
Entry: src/entrypoints/background.ts
The MV3 service worker is the orchestration hub. It:
- Listens for START_AGENT and STOP_AGENT messages from the side panel
- Intercepts chrome.downloads events during active sessions, routing files into VFS instead of the Downloads folder
- Manages the side panel lifecycle (chrome.sidePanel.open)
- Delegates agent execution to runAgentLoop() in src/entrypoints/background/loop.ts
The loop sets up the full session context before handing off to the LangGraph state machine:
runAgentLoop(tabId, userPrompt, sessionId?, attachments?, modelId?)
├─ Create / resume session in IndexedDB
├─ Seed VFS with user-attached files
├─ Load persisted todo / memory / scratchpad
├─ Create LLM model instance
├─ Navigate away from restricted pages (chrome://, etc.)
├─ Inject content script + block user input
├─ Attach Chrome Debugger (CDP)
├─ Install file-chooser intercept guard
├─ Build LangGraph and stream to completion
└─ Finally: unblock input, detach debugger, clear temp VFS files
State that must survive service-worker restarts (MV3 workers are ephemeral) is persisted either in chrome.storage.session (transient agent status, log entries) or IndexedDB (conversation history, VFS, memory).
To adhere to the Single Responsibility (SRP), Open/Closed (OCP), and Interface Segregation (ISP) principles, the background orchestrator has been redesigned using segregated Action Registries and specialized contexts:
- Segregated Contexts & Registries: Instead of a monolithic context and registry, the orchestrator divides actions into UI Actions and Side Effects, using uiActionRegistry (handling UIActionContext) and sideEffectRegistry (handling SideEffectContext). This ensures that actions only depend on the specific context fields they require.
- Parser Map: In src/utils/tools/index.ts, the large switch-case in parseToolCall is replaced by a lookup map of dedicated parser functions.
- Registry Execution: Graph nodes uiAction and sideEffects dynamically query their respective registries (uiActionRegistry and sideEffectRegistry) to execute handlers, decoupling orchestration flow from concrete action implementation details.
classDiagram
class UIActionContext {
+number tabId
+number sessionId
+number step
+string userPrompt
+string toolCallId
+string toolName
+CoordinateEntry[] coordinateMap
+ActionRecord[] actionHistory
+tabIdRef
}
class SideEffectContext {
+number sessionId
+number tabId
+string base64Image
+number step
+CoordinateEntry[] coordinateMap
+string userPrompt
+string toolCallId
+string toolName
+AgentState state
}
class UIActionRegistry {
-Map handlers
+register(handler)
+get(type)
}
class SideEffectRegistry {
-Map handlers
+register(handler)
+get(type)
}
uiActionRegistry ..|> UIActionRegistry
sideEffectRegistry ..|> SideEffectRegistry
uiActionNode --> uiActionRegistry : queries
sideEffectsNode --> sideEffectRegistry : queries
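As a rough illustration of the registry idea described above, the sketch below shows a generic registry with the register/get pair from the class diagram and a node-style dispatcher that only talks to the registry. The handler shape (a type key plus an execute method), the trimmed context fields, and the dispatchUIAction helper are assumptions made for this example, not the project's actual interfaces; real handlers would most likely be asynchronous.

```typescript
// Hypothetical, trimmed-down context: only the fields this sketch needs.
interface UIActionContext {
  tabId: number;
  step: number;
  toolName: string;
}

// Assumed handler shape: a `type` key plus an `execute` method.
interface ActionHandler<Ctx> {
  type: string;
  execute(args: Record<string, unknown>, ctx: Ctx): string;
}

// Generic registry exposing the register/get pair from the class diagram.
class ActionRegistry<Ctx> {
  private handlers = new Map<string, ActionHandler<Ctx>>();

  register(handler: ActionHandler<Ctx>): void {
    this.handlers.set(handler.type, handler);
  }

  get(type: string): ActionHandler<Ctx> | undefined {
    return this.handlers.get(type);
  }
}

const uiActionRegistry = new ActionRegistry<UIActionContext>();

// Adding a new action means registering a handler, not editing the dispatcher.
uiActionRegistry.register({
  type: "click",
  execute: (args, ctx) => `click #${String(args.targetId)} on tab ${ctx.tabId}`,
});

// A graph node only knows the registry, never the concrete handlers.
function dispatchUIAction(
  type: string,
  args: Record<string, unknown>,
  ctx: UIActionContext,
): string {
  const handler = uiActionRegistry.get(type);
  if (!handler) return `unknown action: ${type}`;
  return handler.execute(args, ctx);
}
```

A second registry typed with SideEffectContext would follow the same pattern, so each handler sees only the context fields it needs.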
Entry: src/entrypoints/content.ts
Injected into every frame (all_frames: true) on every URL. Handles messages from the background:
| Message | Handler |
|---|---|
| DRAW_MARKS | Annotate interactables, return coordinate map |
| DESTROY_MARKS | Remove canvas overlay |
| BLOCK_INPUT | Install capturing event listeners to prevent user clicks |
| UNBLOCK_INPUT | Remove input blockers |
| GET_ELEMENT_DOM | Return outerHTML of element at given coordinates |
| UPLOAD_FILE | Inject file into <input type="file"> via CDP |
| PING | Confirm content script is alive |
The annotation and visibility logic lives in src/entrypoints/content/:
- overlay.ts — Discovers elements, renders canvas, returns coordinate map
- interactables.ts — Classifies elements as interactive (tags, ARIA roles, tabindex, cursor, event listeners)
- visibility.ts — Computes visible rects and checks for occlusion
- blocker.ts — Installs/removes capturing event listeners
- theme.ts — Detects dark/light mode for annotation colors
Entry: src/entrypoints/sidepanel/App.tsx
A React application rendered in Chrome's native side panel. Provides:
- API key setup — First-run Gemini key entry
- Model selection — Dropdown populated with Gemini models + auto-detected Ollama models
- Chat interface — Task prompt input with file attachment support
- Live agent stream — Real-time logs, thinking tokens, step progress
- Session history — Past sessions with conversation replay
The side panel communicates bidirectionally with the background via chrome.runtime.sendMessage / chrome.runtime.onMessage.
The agent loop is implemented as a LangGraph state machine defined in src/entrypoints/background/agent-graph.ts.
flowchart TD
START([Start]) --> stepSetup
stepSetup["stepSetup"] -->|Stopped| END([END])
stepSetup -->|Normal| drawAnnotations["drawAnnotations"]
drawAnnotations -->|Retry| stepSetup
drawAnnotations -->|Normal| captureAndDestroy["captureAndDestroy"]
captureAndDestroy -->|Retry| stepSetup
captureAndDestroy -->|Normal| reason["reason (LLM Call)"]
reason -->|LLM Fail| stepSetup
reason -->|Normal| sideEffects["sideEffects (Registry Dispatch)"]
sideEffects -->|ask_user| awaitUser["awaitUser"]
sideEffects -->|finish & no UI action| complete["complete"]
sideEffects -->|UI action present| uiAction["uiAction (Registry Dispatch)"]
sideEffects -->|No action / Side-effects only| stepSetup
uiAction -->|Done / Stopped| complete
uiAction -->|Continue| stepSetup
awaitUser -->|Stopped| END
awaitUser -->|Continue| stepSetup
complete --> END
| Node | File | Responsibility |
|---|---|---|
| stepSetup | nodes/setup.ts | Check stop flag, increment step counter, re-attach debugger, wait for DOM idle |
| drawAnnotations | nodes/setup.ts | Send DRAW_MARKS to content script, retry with backoff if zero elements found, return coordinate map |
| captureAndDestroy | nodes/observe.ts | Capture annotated screenshot via CDP, save to VFS as step_N.png, destroy overlay |
| reason | nodes/observe.ts | Assemble LLM context (system prompt + history + screenshot), call model, persist turns to IndexedDB |
| sideEffects | nodes/side-effects.ts | Execute all non-UI actions in order via the sideEffectRegistry dispatcher (VFS ops, todo updates, memory, scratchpad, DOM inspection, wait, ask_user, finish) |
| uiAction | nodes/ui-action.ts | Dispatch the single UI action via the uiActionRegistry dispatcher (click / type / navigate / scroll / press_key / drag_and_drop); update tabIdRef if a new tab opened |
| awaitUser | nodes/control.ts | Suspend execution; the loop resumes when the user replies |
| complete | nodes/control.ts | Log completion, clear session VFS (preserving todo/scratchpad), broadcast finish to side panel |
After sideEffects, the router checks AgentState to choose the next node:
- ask_user tool called → awaitUser
- finish tool called → complete
- UI action present → uiAction → back to stepSetup
- No UI action → back to stepSetup (sideEffects-only turn)
The loop continues until complete is reached, the stop flag is set (chrome.storage.session), or the step counter exceeds MAX_STEPS (500).
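The routing rules and stop conditions above can be sketched as two small pure functions. The RouterState fields are a hypothetical slice of AgentState invented for this example; only the node names and MAX_STEPS come from the text.

```typescript
// Node names taken from the graph above.
type NextNode = "awaitUser" | "complete" | "uiAction" | "stepSetup";

// Hypothetical slice of AgentState; the real state has more fields.
interface RouterState {
  askedUser: boolean;   // ask_user was called this turn
  finished: boolean;    // finish was called this turn
  hasUIAction: boolean; // a UI action is waiting to be dispatched
}

// Decide where to go after the sideEffects node has run.
function routeAfterSideEffects(state: RouterState): NextNode {
  if (state.askedUser) return "awaitUser";
  if (state.finished && !state.hasUIAction) return "complete";
  if (state.hasUIAction) return "uiAction";
  return "stepSetup"; // side-effects-only turn
}

const MAX_STEPS = 500;

// The loop keeps going until the stop flag is set or the step budget is exceeded.
function shouldContinue(step: number, stopRequested: boolean): boolean {
  return !stopRequested && step <= MAX_STEPS;
}
```

Keeping the router free of side effects makes each transition easy to unit-test without a browser.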
src/utils/llm.ts provides a unified model factory:
| Model | Class | Notes |
|---|---|---|
| gemini-3.1-flash-lite-preview (default) | ChatGoogleGenerativeAI | Cloud, requires API key |
| gemma-4-31b-it | ChatGoogleGenerativeAI | Cloud, requires API key |
| ollama:<name> | ChatOllama | Local, http://localhost:11434, no key needed |
Gemini models are configured with thinkingConfig: { thinkingLevel: 'HIGH' } to enable extended reasoning. All models use temperature: 0.1 to keep outputs close to deterministic.
Model selection and API keys are persisted in chrome.storage.local. On extension load, the side panel queries Ollama at http://localhost:11434/api/tags (3 s timeout) to auto-populate local models.
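A minimal sketch of the Ollama auto-detection step, assuming the standard /api/tags response shape (a models array whose entries carry a name). The function names and the choice to return an empty list on any failure are assumptions for illustration.

```typescript
// Fields of the /api/tags response that this sketch relies on.
interface OllamaTagsResponse {
  models?: { name: string }[];
}

// Pure helper: turn the response into "ollama:<name>" model IDs.
function toOllamaModelIds(body: OllamaTagsResponse): string[] {
  return (body.models ?? []).map((m) => `ollama:${m.name}`);
}

// Query the local daemon and give up after `timeoutMs` (3 s in the text above).
async function detectOllamaModels(
  baseUrl = "http://localhost:11434",
  timeoutMs = 3000,
): Promise<string[]> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(`${baseUrl}/api/tags`, { signal: controller.signal });
    if (!res.ok) return [];
    return toOllamaModelIds((await res.json()) as OllamaTagsResponse);
  } catch {
    return []; // daemon not running, request timed out, or request blocked
  } finally {
    clearTimeout(timer);
  }
}
```

Treating every failure as "no local models" keeps the model dropdown usable when Ollama is not installed.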
Each LLM call is built by src/utils/prompt.ts:
SystemMessage(SYSTEM_INSTRUCTIONS) ← ~260-line cognitive framework
+ buildHistory(indexedDB turns) ← Full conversation so far
+ HumanMessage:
Task: {userPrompt} ← Original user request
[CONTEXT: started on <url>] ← URL anchor for navigation recovery
VFS: {file listings} ← Available files
Todo: {status icon per task} ← Current plan
Memory: {grouped by category} ← Cross-session facts
Scratchpad: {working notes} ← In-session state
CoordinateMap: {id → tag/text/rect} ← Interactable elements on page
Screenshot (base64 inline image) ← Annotated page view
History is reconstructed from IndexedDB conversation turns into LangChain message types (HumanMessage, AIMessage, ToolMessage) with proper tool_call_id chaining so the LLM can track which tool call produced which result.
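The tool_call_id chaining can be illustrated with plain objects standing in for LangChain's HumanMessage, AIMessage, and ToolMessage. The stored role values, the rebuildHistory name, and the rule of dropping tool results that do not match a known call are assumptions of this sketch.

```typescript
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

// Stored turn, mirroring the fields listed for the conversations store.
interface StoredTurn {
  role: "user" | "assistant" | "tool"; // assumed role values
  content: string;
  toolCalls?: ToolCall[];
  toolCallId?: string;
  toolName?: string;
}

// Plain stand-ins for the LangChain message classes.
type ChatMessage =
  | { type: "human"; content: string }
  | { type: "ai"; content: string; tool_calls: ToolCall[] }
  | { type: "tool"; content: string; tool_call_id: string; name: string };

function rebuildHistory(turns: StoredTurn[]): ChatMessage[] {
  const messages: ChatMessage[] = [];
  const openCallIds = new Set<string>();

  for (const turn of turns) {
    if (turn.role === "user") {
      messages.push({ type: "human", content: turn.content });
    } else if (turn.role === "assistant") {
      const calls = turn.toolCalls ?? [];
      calls.forEach((c) => openCallIds.add(c.id));
      messages.push({ type: "ai", content: turn.content, tool_calls: calls });
    } else if (turn.toolCallId && openCallIds.has(turn.toolCallId)) {
      // Keep a tool result only if it answers a call the model actually made.
      openCallIds.delete(turn.toolCallId);
      messages.push({
        type: "tool",
        content: turn.content,
        tool_call_id: turn.toolCallId,
        name: turn.toolName ?? "unknown",
      });
    }
  }
  return messages;
}
```

Chat model APIs commonly reject a tool result that does not follow a matching tool call, which is why the pairing matters.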
src/utils/llm-stream.ts streams the model response:
- Accumulates thinking/reasoning tokens and broadcasts AGENT_THINKING_DELTA messages to the side panel in real time
- Parses the tool_calls array from the stream into typed AgentAction objects via parseToolCall()
- Returns { reasoning, thinking, actions, done, rawToolCalls } to the reason node
The raw LangChain tool call objects are stored alongside the AI turn in IndexedDB so that buildHistory() can reconstruct valid ToolMessage pairs in subsequent turns.
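To make the parser-map idea concrete, here is a small sketch in which each tool name maps to a dedicated parser function. The AgentAction variants, the argument names (targetId, ms, summary), and the clamping of wait durations are assumptions; the real parseToolCall may validate differently.

```typescript
// Hypothetical typed actions; the real AgentAction union is larger.
type AgentAction =
  | { kind: "click"; targetId: number }
  | { kind: "wait"; ms: number }
  | { kind: "finish"; summary: string };

type RawToolCall = { name: string; args: Record<string, unknown> };
type Parser = (args: Record<string, unknown>) => AgentAction | null;

// One small parser per tool name replaces a long switch statement.
const parsers = new Map<string, Parser>();

parsers.set("click", (args) => {
  const id = args.targetId;
  return typeof id === "number" ? { kind: "click", targetId: id } : null;
});

parsers.set("wait", (args) => {
  const ms = args.ms;
  if (typeof ms !== "number") return null;
  // Assumption: out-of-range values are clamped to 100..10000 ms.
  return { kind: "wait", ms: Math.min(10000, Math.max(100, ms)) };
});

parsers.set("finish", (args) => ({
  kind: "finish",
  summary: typeof args.summary === "string" ? args.summary : "",
}));

// Stand-in for parseToolCall: look up the parser, return null for unknown tools.
function parseToolCallSketch(call: RawToolCall): AgentAction | null {
  const parser = parsers.get(call.name);
  return parser ? parser(call.args) : null;
}
```

Supporting a new tool then only requires another parsers.set call.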
Tools are defined per-category in src/utils/tools/ as LangChain tool objects with Zod schemas, and aggregated in src/utils/tools/index.ts.
| Tool | Description |
|---|---|
| click | Hardware click on an annotated element by ID. Supports modifier keys and uploadFileId for file injection |
| type | Type text into the focused element. clearField: true selects all before typing |
| navigate | Load a full URL in the current tab |
| scroll | Wheel-scroll the page or a specific element in a direction |
| press_key | Dispatch a raw key event (Enter, Escape, Tab, ArrowDown, etc.) |

| Tool | Description |
|---|---|
| fetch_dom | Return up to 40 KB of outerHTML for an element by ID; used when the screenshot lacks detail |

| Tool | Description |
|---|---|
| vfs_save_screenshot | Save the current step's screenshot to VFS under a given filename |
| vfs_write | Create or overwrite a VFS file with given content and MIME type |
| vfs_delete | Remove a VFS file by UUID |
| vfs_download | Fetch a remote URL directly into VFS, bypassing the OS download dialog |

| Tool | Description |
|---|---|
| memory_upsert | Save or merge a fact into long-term IndexedDB memory (key, values[], category) |
| memory_delete | Remove a memory entry by key |

| Tool | Description |
|---|---|
| note_write | Write or update a keyed note in the in-session scratchpad |
| note_delete | Remove a scratchpad note by key |

| Tool | Description |
|---|---|
| todo_create | Create the full task plan (mandatory on turn 1) |
| todo_update | Apply partial status/notes updates to existing items |
| todo_add | Append new tasks discovered mid-execution |

| Tool | Description |
|---|---|
| wait | Pause for 100–10,000 ms |
| ask_user | Pause and display a clarification question; resume on user reply |
| finish | Declare task complete; summary is shown to the user |
src/entrypoints/content/interactables.ts classifies elements as interactive if they match any of:
- Semantic HTML tags: a, button, input, select, textarea, label, summary, details
- ARIA roles: button, link, menuitem, tab, checkbox, radio, combobox, listbox, option, switch, treeitem
- Non-negative tabindex
- Computed style cursor: pointer
- Direct onclick attribute
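The classification rules above amount to a disjunction of simple checks. The sketch below applies them to a plain ElementInfo record (a hypothetical stand-in for a DOM element, so the example runs without a browser); in a content script the same fields would come from tagName, getAttribute, and getComputedStyle.

```typescript
// Plain description of an element so the sketch runs without a DOM.
interface ElementInfo {
  tag: string;          // lower-case tag name
  role?: string;        // ARIA role attribute, if any
  tabindex?: number;    // parsed tabindex, if present
  cursor?: string;      // computed cursor style
  hasOnclick?: boolean; // element carries an onclick attribute
}

const INTERACTIVE_TAGS = new Set([
  "a", "button", "input", "select", "textarea", "label", "summary", "details",
]);

const INTERACTIVE_ROLES = new Set([
  "button", "link", "menuitem", "tab", "checkbox", "radio",
  "combobox", "listbox", "option", "switch", "treeitem",
]);

// An element is interactive if it matches any one of the listed rules.
function isInteractive(el: ElementInfo): boolean {
  return (
    INTERACTIVE_TAGS.has(el.tag) ||
    (el.role !== undefined && INTERACTIVE_ROLES.has(el.role)) ||
    (el.tabindex !== undefined && el.tabindex >= 0) ||
    el.cursor === "pointer" ||
    el.hasOnclick === true
  );
}
```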
src/entrypoints/content/overlay.ts walks the full DOM with TreeWalker and recursively pierces open Shadow DOMs to discover components inside web components and custom elements.
Once elements are collected:
- Each element's bounding box is computed and clipped to the visible viewport via getVisibleRect()
- Occluded elements (covered by overlays, modals, or higher z-index siblings) are filtered out using document.elementFromPoint()
- A single fixed-position <canvas> (z-index: max) is created, which avoids mutating the DOM with thousands of divs
- Each visible element gets a numbered bounding box (blue rectangle) and a badge with its numeric ID
- The coordinate map CoordinateEntry[] is returned to the background for inclusion in the LLM prompt
The LLM sees both the annotated screenshot (visual) and the coordinate map (structured metadata) and responds with the numeric ID of the element to interact with.
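The viewport-clipping step can be sketched as a rectangle intersection. The function names below are hypothetical and the real getVisibleRect() may also account for scroll containers; the center point is where an occlusion probe with document.elementFromPoint() or a later click would typically land.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Intersect an element rect with the viewport; null when nothing is visible.
function clipToViewport(rect: Rect, viewportWidth: number, viewportHeight: number): Rect | null {
  const left = Math.max(rect.x, 0);
  const top = Math.max(rect.y, 0);
  const right = Math.min(rect.x + rect.width, viewportWidth);
  const bottom = Math.min(rect.y + rect.height, viewportHeight);
  if (right <= left || bottom <= top) return null;
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// Center of a rect, usable as the probe point for occlusion checks.
function centerOf(rect: Rect): { x: number; y: number } {
  return { x: rect.x + rect.width / 2, y: rect.y + rect.height / 2 };
}
```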
src/utils/cdp/input.ts dispatches true hardware-level events via Chrome DevTools Protocol — never synthetic DOM events — which is essential for modern SPAs (React/Vue/Angular) that check isTrusted.
Input.dispatchMouseEvent (mouseMoved → center of element)
Input.dispatchMouseEvent (mousePressed → button: left)
Input.dispatchMouseEvent (mouseReleased)
Critical: CDP input events expect coordinates in CSS pixels, whereas screenshot coordinates are in device pixels. Before dispatching CDP commands, device-pixel coordinates are divided by window.devicePixelRatio to correct for high-DPI / Retina displays.
Modifier keys (ctrl, meta, shift, alt) are passed through the CDP modifiers bitmask, enabling Ctrl+Click to open links in a new tab.
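The click sequence and the modifiers bitmask can be sketched as a pure function that builds the three Input.dispatchMouseEvent parameter objects. The bit values (Alt=1, Ctrl=2, Meta=4, Shift=8) are those defined by the CDP Input domain; the helper names and the choice of clickCount values are assumptions of this example.

```typescript
type Modifier = "alt" | "ctrl" | "meta" | "shift";

// Bit values defined by the CDP Input domain.
const MODIFIER_BITS: Record<Modifier, number> = { alt: 1, ctrl: 2, meta: 4, shift: 8 };

function modifierMask(mods: Modifier[]): number {
  return mods.reduce((mask, m) => mask | MODIFIER_BITS[m], 0);
}

interface MouseEventParams {
  type: "mouseMoved" | "mousePressed" | "mouseReleased";
  x: number;
  y: number;
  button: "none" | "left";
  clickCount: number;
  modifiers: number;
}

// Build the move, press, release sequence for one click at (x, y) in CSS pixels.
function buildClickSequence(x: number, y: number, mods: Modifier[] = []): MouseEventParams[] {
  const modifiers = modifierMask(mods);
  return [
    { type: "mouseMoved", x, y, button: "none", clickCount: 0, modifiers },
    { type: "mousePressed", x, y, button: "left", clickCount: 1, modifiers },
    { type: "mouseReleased", x, y, button: "left", clickCount: 1, modifiers },
  ];
}
```

In the extension, each entry would be passed in order to chrome.debugger.sendCommand({ tabId }, "Input.dispatchMouseEvent", params).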
Text is typed character-by-character via Runtime.evaluate using Input.insertText (or Input.dispatchKeyEvent for special characters). clearField: true first dispatches Ctrl+A to select all existing content before typing.
Scrolling is dispatched as a mouse wheel event (Input.dispatchMouseEvent with type mouseWheel) carrying delta vectors, optionally targeted to a specific element's center coordinates.
Opened via src/utils/db/core.ts with DB_VERSION = 4:
| Object Store | Key | Content |
|---|---|---|
| sessions | id (UUID) | Session metadata: title, URL, model, timestamps |
| conversations | id (UUID) | Turns: role, content, toolCalls, toolCallId, toolName, sessionId |
| VFS_STORE | id (UUID) | Files: name, mimeType, base64 data, sessionId, timestamps |
| memory | id (UUID) | Memory entries: key, values[], category, sourceUrl, timestamps |
src/utils/db/vfs.ts — An IndexedDB-backed virtual filesystem scoped to each session.
Files are identified by UUID and looked up by name within a session. Key reserved filenames:
| File | Purpose |
|---|---|
| step_N.png | Annotated screenshot for step N |
| __todo.json | Persisted task list (excluded from cleanup) |
| __scratchpad.json | Session working notes (excluded from cleanup) |
The VFS provides the agent with a persistent workspace for: user-attached files, downloaded resources, extracted data, and intermediate outputs — all accessible across service-worker restarts.
Download interception in background.ts hooks chrome.downloads.onCreated: when a download is triggered during an active session, the download is cancelled and the file content is fetched and stored in VFS instead.
src/utils/db/memory.ts — Cross-session persistence in the memory object store.
interface MemoryEntry {
key: string; // Namespaced, e.g. "github/username" or "amazon/default_address"
values: string[]; // Array for multi-account support
category: string; // "account" | "preference" | "fact" | "other"
sourceUrl?: string;
}

memory_upsert merges new values into the existing array (deduplicated), so the agent naturally accumulates multiple accounts or addresses under one key.
All entries are injected into every LLM prompt via formatMemoryForPrompt() in src/utils/memory.ts as a ── Long-term Memory ── context block grouped by category.
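A minimal sketch of the merge-and-deduplicate behaviour and of grouping entries by category for the prompt. It uses an in-memory Map instead of IndexedDB, and the function names and the exact text layout under the header are assumptions; only the MemoryEntry fields and the header string come from the text.

```typescript
interface MemoryEntry {
  key: string;
  values: string[];
  category: string;
  sourceUrl?: string;
}

// Merge incoming values into an existing entry, keeping first-seen order
// and dropping duplicates. An in-memory Map stands in for the object store.
function upsertMemory(store: Map<string, MemoryEntry>, incoming: MemoryEntry): MemoryEntry {
  const existing = store.get(incoming.key);
  const previous = existing ? existing.values : [];
  const values = Array.from(new Set([...previous, ...incoming.values]));
  const merged: MemoryEntry = { ...existing, ...incoming, values };
  store.set(incoming.key, merged);
  return merged;
}

// Group entries by category into a text block for the prompt (layout assumed).
function formatMemory(entries: MemoryEntry[]): string {
  const byCategory = new Map<string, string[]>();
  for (const entry of entries) {
    const lines = byCategory.get(entry.category) ?? [];
    lines.push(`${entry.key}: ${entry.values.join(", ")}`);
    byCategory.set(entry.category, lines);
  }
  const out: string[] = ["── Long-term Memory ──"];
  byCategory.forEach((lines, category) => {
    out.push(`[${category}]`, ...lines);
  });
  return out.join("\n");
}
```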
Security constraint: The system prompt and tool schema explicitly prohibit storing passwords, tokens, API keys, full card numbers, or SSNs.
src/utils/scratchpad.ts — Short-term working memory for accumulating intermediate findings (extracted prices, issue lists, form values, API responses) during a single task.
Backed by __scratchpad.json in VFS so it survives service-worker restarts. Cleared automatically when the session completes.
Injected into every LLM prompt as a ── Scratchpad ── context block.
src/utils/todo.ts — A structured task decomposition persisted as __todo.json in VFS.
interface TodoItem {
id: string; // Kebab-case identifier
title: string;
status: 'pending' | 'in_progress' | 'done' | 'skipped';
notes?: string;
}

The agent must call todo_create on turn 1 with the full decomposed plan, then call todo_update every turn to mark progress. This gives the LLM a persistent view of what remains, preventing goal drift across many steps.
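As an illustration of how partial updates from todo_update could be applied, the sketch below patches items by id and leaves untouched fields alone. The TodoPatch shape and the rule of ignoring unknown ids are assumptions for this example.

```typescript
type TodoStatus = "pending" | "in_progress" | "done" | "skipped";

interface TodoItem {
  id: string;
  title: string;
  status: TodoStatus;
  notes?: string;
}

// Hypothetical patch shape: an id plus whichever fields changed.
interface TodoPatch {
  id: string;
  status?: TodoStatus;
  notes?: string;
}

// Apply partial updates by id; unknown ids are ignored, other fields are kept.
function applyTodoUpdates(items: TodoItem[], patches: TodoPatch[]): TodoItem[] {
  const byId = new Map<string, TodoPatch>();
  patches.forEach((p) => byId.set(p.id, p));

  return items.map((item) => {
    const patch = byId.get(item.id);
    if (!patch) return item;
    return {
      ...item,
      status: patch.status ?? item.status,
      notes: patch.notes ?? item.notes,
    };
  });
}
```

Returning a new array rather than mutating in place makes it straightforward to persist the result back to __todo.json.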
src/utils/screenshot.ts uses a two-strategy approach:
Strategy 1: CDP compositor (no flicker)
chrome.debugger → Page.captureScreenshot({ fromSurface: true })
└─ Accept if image size >= 6 KB (valid frame)
Strategy 2: Fallback (may briefly activate tab)
chrome.tabs.update({ active: true })
chrome.tabs.captureVisibleTab()
Restore previously-active tab
Retry up to 3× with backoff: 300 ms → 800 ms → 1500 ms
Using fromSurface: true reads from the GPU compositor buffer, producing a screenshot without flickering the visible tab — critical for non-disruptive background operation.
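The two-strategy capture can be sketched with the actual capture calls injected as functions, so the control flow is visible without Chrome APIs. The names are hypothetical, and applying the retries to the fallback strategy only is one reading of the outline above, not a confirmed detail.

```typescript
const MIN_VALID_BYTES = 6 * 1024;    // "image size >= 6 KB"
const BACKOFF_MS = [300, 800, 1500]; // delays before each retry

// Decoded byte length of a padded base64 string (no data: prefix).
function base64ByteLength(b64: string): number {
  const padding = b64.slice(-2) === "==" ? 2 : b64.slice(-1) === "=" ? 1 : 0;
  return Math.floor((b64.length * 3) / 4) - padding;
}

function isValidFrame(b64: string): boolean {
  return base64ByteLength(b64) >= MIN_VALID_BYTES;
}

type Capture = () => Promise<string>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Try the compositor capture first, then fall back with retries and backoff.
async function captureWithFallback(viaCdp: Capture, viaTab: Capture): Promise<string> {
  try {
    const frame = await viaCdp();
    if (isValidFrame(frame)) return frame;
  } catch {
    // fall through to the tab-capture strategy
  }

  let lastError: unknown = new Error("capture failed");
  for (let attempt = 0; attempt <= BACKOFF_MS.length; attempt++) {
    if (attempt > 0) await sleep(BACKOFF_MS[attempt - 1]);
    try {
      return await viaTab();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Here viaCdp would wrap Page.captureScreenshot and viaTab would wrap the chrome.tabs.captureVisibleTab() path, including restoring the previously active tab.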
Files attached in the side panel arrive in the START_AGENT message as AttachedFile[] with name, mimeType, and base64 data. They are immediately seeded into the session's VFS.
On step 1 only, image attachments are also injected into the LLM prompt as inline multimodal content so the agent can see what the user uploaded.
When the agent calls click with an uploadFileId parameter, the flow is:
- Background retrieves the file from VFS by UUID
- Writes it to a temporary disk path via the CDP IO domain
- Uses DOM.setFileInputFiles to inject the file path directly into the <input type="file"> element
- The OS file picker never opens
A preemptive guard is also installed via Page.setInterceptFileChooserDialog + JS-level overrides of HTMLInputElement.prototype.click and window.showOpenFilePicker to suppress any unexpected file dialogs.
src/utils/navigation-guard.ts tracks the action history per session. If the same click or scroll action appears 3+ times consecutively (shouldPivot()), the agent is flagged to change strategy rather than repeat the same failing action.
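A minimal sketch of the repetition check, assuming each history record carries an action type and some fingerprint of its target. The ActionRecord fields and the equality rule are illustrative assumptions; only the threshold of three consecutive click or scroll actions comes from the text.

```typescript
// Hypothetical record shape for this sketch.
interface ActionRecord {
  type: string;   // e.g. "click" or "scroll"
  target: string; // fingerprint of the target (element ID, direction, ...)
}

const PIVOT_THRESHOLD = 3;

// True when the last PIVOT_THRESHOLD actions are the same click or scroll.
function shouldPivot(history: ActionRecord[]): boolean {
  if (history.length < PIVOT_THRESHOLD) return false;
  const recent = history.slice(-PIVOT_THRESHOLD);
  const last = recent[recent.length - 1];
  if (last.type !== "click" && last.type !== "scroll") return false;
  return recent.every((a) => a.type === last.type && a.target === last.target);
}
```

Because only the tail of the history is inspected, an intervening different action resets the count.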
The system prompt includes explicit guidance for these situations:
- Try a different element or interaction path
- Navigate to a reconstructed URL directly
- Call ask_user if the ambiguity requires human judgment
The agent is also constrained to one UI action per turn, which makes each step individually auditable and provides a clear retry boundary.
Opticlick includes a standalone web sandbox (located in the sandbox/ directory) that allows developers to run, preview, and test the sidepanel UI inside a mock browser environment directly in the browser—perfect for Pull Request previews, CI diagnostics, and local web-based testing.
- Mock Browser Pane: Simulates chrome tab navigation, history (back/forward), refresh, and tab locking.
- Service Worker Proxy: Intercepts iframe network requests dynamically to bypass CORS limits on standard web pages.
- Self-Hosted CORS Proxy: Routes all network requests through a custom Cloudflare Worker proxy, fully supporting POST, PUT, and other HTTP methods.
- Settings Dashboard: Configure your self-hosted Cloudflare Worker URL and LangSmith tracing variables directly in the UI.
To adhere to OCP and SRP, the sandbox debugger mock (sandbox/src/chrome-mock/debugger.ts) delegates all simulated CDP calls to a CDPCommandRegistry (defined in sandbox/src/chrome-mock/cdp-handlers.ts). Individual CDP commands are implemented as discrete handler classes (e.g. CaptureScreenshotHandler, DispatchMouseEventHandler), ensuring command routing is decoupled from implementation details and open for extension.
classDiagram
class CDPContext {
+Window win
+Document doc
+Map objectIdMap
+Map virtualFiles
+getHtml2Canvas()
}
class CDPCommandHandler {
<<interface>>
+string method
+execute(params, ctx)
}
class CDPCommandRegistry {
-Map handlers
+register(handler)
+get(method)
}
class CaptureScreenshotHandler {
+string method
+execute(params, ctx)
}
class DispatchMouseEventHandler {
+string method
+execute(params, ctx)
}
CDPCommandHandler <|.. CaptureScreenshotHandler
CDPCommandHandler <|.. DispatchMouseEventHandler
CDPCommandRegistry o--> CDPCommandHandler
debuggerShim --> CDPCommandRegistry : delegates sendCommand
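The handler-class pattern in the diagram can be sketched as follows. The context is reduced to a hypothetical events log, and the handler body simply records the call; the real handlers operate on the sandbox window and document.

```typescript
// Simplified context for this sketch; the real CDPContext carries
// window/document handles and other maps.
interface CDPContext {
  events: string[];
}

interface CDPCommandHandler {
  readonly method: string;
  execute(params: Record<string, unknown>, ctx: CDPContext): unknown;
}

class CDPCommandRegistry {
  private handlers = new Map<string, CDPCommandHandler>();

  register(handler: CDPCommandHandler): void {
    this.handlers.set(handler.method, handler);
  }

  get(method: string): CDPCommandHandler | undefined {
    return this.handlers.get(method);
  }
}

// Example handler: records the event instead of touching a real page.
class DispatchMouseEventHandler implements CDPCommandHandler {
  readonly method = "Input.dispatchMouseEvent";

  execute(params: Record<string, unknown>, ctx: CDPContext): unknown {
    ctx.events.push(`${String(params.type)}@${String(params.x)},${String(params.y)}`);
    return {};
  }
}

const cdpRegistry = new CDPCommandRegistry();
cdpRegistry.register(new DispatchMouseEventHandler());

// What a debugger shim's sendCommand could do: look up the handler and delegate.
function sendCommand(method: string, params: Record<string, unknown>, ctx: CDPContext): unknown {
  const handler = cdpRegistry.get(method);
  if (!handler) throw new Error(`Unsupported CDP method: ${method}`);
  return handler.execute(params, ctx);
}
```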
To start the sandbox development server:
1. Install root dependencies and run preparation scripts:

       npm install
       npm run copy-icons

2. Navigate to the sandbox/ directory and install its dependencies:

       cd sandbox
       npm install

3. Start the dev server:

       npm run dev

   This will spin up the Vite development server at http://localhost:5174.

4. Build the sandbox:

       npm run build

   The static build assets will be generated in sandbox/dist/.
Since the sandbox runs in a standard HTTPS origin (e.g., GitHub Pages or custom preview domains) and cannot directly access cross-origin resources, it requires a CORS proxy. To prevent exposing secrets in GitHub Actions for forks, you must deploy your own proxy worker.
1. Navigate to the cors-proxy/ folder:

       cd cors-proxy

2. Deploy to Cloudflare Workers (Free Tier, 100k requests/day):

       npx wrangler deploy

3. Log in to your Cloudflare account via the CLI when prompted.

4. Once deployed, copy your worker URL (e.g., https://opticlick-cors-proxy.<your-subdomain>.workers.dev).

5. Open the Sandbox in your browser, scroll to the bottom of the sidebar settings, expand CORS Proxy Settings, paste the URL, and click Save. This will dynamically configure the Service Worker to route all network traffic through your custom proxy.
- Node.js 20+
- A Gemini API key (for cloud models) or Ollama running locally
npm install
# Development (hot reload)
npm run dev
# Production build
npm run build
# Package for submission
npm run zip

Load the unpacked extension from .output/chrome-mv3/ in chrome://extensions with Developer Mode enabled.
# Unit + integration + DOM + e2e tests
npm test
# Lint
npm run lint
npm run lint:fix

Tests are organized under tests/:
- tests/unit/ (pure logic): tool parsing, todo mutations, scratchpad, memory formatting, navigation guard
- tests/integration/ (Chrome API stubs): CDP input, screenshots, IndexedDB, agent state
- tests/dom/ (jsdom): element discovery, visibility, occlusion detection
- tests/e2e/ (real Chromium): full agent loop
Optional LangSmith tracing (for debugging LLM calls):
VITE_LANGSMITH_TRACING=true
VITE_LANGSMITH_ENDPOINT=https://api.smith.langchain.com
VITE_LANGSMITH_API_KEY=<your key>
VITE_LANGSMITH_PROJECT=opticlick