This repository contains a Python-based desktop automation agent built on top of the UI-TARS v1.5 (7B) multimodal UI interaction model. The project explores how large UI-capable models can be used to drive scriptable, headless desktop automation without relying on heavyweight desktop clients or manual intervention.
The agent is designed to interpret high-level task instructions, reason over the current screen state, and execute corresponding UI actions in a repeatable and extensible manner.
- CLI-first automation – The agent is designed to run entirely from the command line, making it suitable for server-side execution and integration into larger automation pipelines.
- Model-driven UI interaction – Instead of hard-coded selectors or brittle UI rules, the system relies on a UI interaction model to reason about screen state and determine the next action.
- Prompt-centric control layer – A dedicated prompt engineering module converts human-readable instructions into structured prompts optimised for UI reasoning and action selection.
- Workflow-friendly execution – The agent can be executed non-interactively, enabling orchestration by external automation or scheduling tools.
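To make the prompt-centric layer concrete, the sketch below shows one way a plain instruction might be turned into a structured prompt. It is illustrative only: `PromptContext`, `build_prompt`, the template wording, and the action vocabulary are hypothetical and are not taken from `prompts.py`.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical action vocabulary; the real prompts.py may define a different set.
ALLOWED_ACTIONS = ("click", "type", "hotkey", "scroll", "wait", "finished")


@dataclass
class PromptContext:
    """Inputs assumed to be available when a prompt is built."""
    instruction: str
    history: list[str] = field(default_factory=list)


def build_prompt(ctx: PromptContext) -> str:
    """Assemble a structured text prompt from the task and prior actions."""
    lines = [
        "You are a GUI agent. Decide the next action for the task below.",
        f"Allowed actions: {', '.join(ALLOWED_ACTIONS)}",
        f"Task: {ctx.instruction.strip()}",
    ]
    if ctx.history:
        lines.append("Previous actions:")
        lines.extend(f"{i}. {step}" for i, step in enumerate(ctx.history, start=1))
    # Ask for a fixed reply shape so the response is easier to parse.
    lines.append("Respond with 'Thought: ...' followed by 'Action: ...'.")
    return "\n".join(lines)
```

Keeping the template in one function makes it easy to swap prompt strategies without touching the execution code.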
The agent mirrors the core logic of the UI-TARS desktop system, but without the GUI. Its components include:
- Prompt Layer – Constructs structured prompts for UI-TARS reasoning.
- Action Parser – Converts model output into executable desktop actions.
- Agent Core – Manages state, retries, and execution loop.
- Desktop Controller – Executes UI actions (mouse, keyboard, window focus) via PyAutoGUI.
- CLI Wrappers – `run_with_arguments.py` for single instructions and `run_agent_loop.py` for batch execution.
This modular design keeps decision-making model-driven while maintaining deterministic execution at the system level.
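The Action Parser's role can be sketched as follows. This is a minimal example that assumes the model reply ends with a line such as `Action: click(x=120, y=340)`. That reply format, the `Action` dataclass, and `parse_action` are assumptions for illustration; the real `action_parser.py` and the actual UI-TARS output grammar may differ.

```python
import ast
import re
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    args: dict


# Assumed reply shape: a single line "Action: name(key=value, ...)".
_ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$", re.MULTILINE)
# Values are limited to single-quoted strings and integers in this sketch.
_ARG_RE = re.compile(r"(\w+)\s*=\s*('(?:[^'\\]|\\.)*'|-?\d+)")


def parse_action(model_output: str) -> Action:
    """Extract the action name and keyword arguments from a model reply."""
    match = _ACTION_RE.search(model_output)
    if match is None:
        raise ValueError("no 'Action:' line found in model output")
    name, raw_args = match.groups()
    # literal_eval safely converts '...' to str and digits to int.
    args = {key: ast.literal_eval(value) for key, value in _ARG_RE.findall(raw_args)}
    return Action(name=name, args=args)
```

Raising `ValueError` on unparseable output gives the agent core a single, explicit failure to handle, for example by retrying or reporting an agent error.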
At a high level, the automation loop follows this pattern:
- Capture the current screen state
- Construct a structured prompt using task instructions and screen context
- Invoke the UI-TARS v1.5 (7B) model
- Parse the model’s response into executable actions
- Perform the corresponding desktop interactions
- Repeat until the task is completed
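The steps above can be sketched as a small loop. The example is deliberately decoupled from any real model or desktop: screen capture, model invocation, and action execution are passed in as callables, and `run_task`, the `finished` completion signal, and the `max_steps` limit are hypothetical names rather than the contents of `desktop_agent_core.py`.

```python
from typing import Callable, List


def run_task(
    instruction: str,
    capture_screen: Callable[[], bytes],
    query_model: Callable[[str, bytes], str],
    execute: Callable[[str], None],
    max_steps: int = 20,
) -> bool:
    """Run the capture/prompt/act loop; return True if the model signals completion."""
    history: List[str] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                        # 1. capture screen state
        prompt = f"Task: {instruction}\nHistory: {history}"  # 2. build the prompt
        reply = query_model(prompt, screenshot)              # 3. invoke the model
        action = reply.split("Action:", 1)[-1].strip()       # 4. parse the reply
        if action.startswith("finished"):                    # assumed completion signal
            return True
        execute(action)                                      # 5. perform the interaction
        history.append(action)                               # 6. repeat with updated history
    return False  # step budget exhausted without completion
```

Injecting the three callables keeps the loop testable with fakes, and the step limit guards against a model that never reports completion.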
- `desktop_agent_core.py` – Main agent loop and state management
- `prompts.py` – System and user prompt definitions for UI-TARS
- `action_parser.py` – Converts model output into structured actions
- `desktop_controller.py` – Executes UI actions on the desktop
- `run_with_arguments.py` – CLI entry point for single instructions
- `run_agent_loop.py` – Batch instruction orchestrator
The scripts use consistent exit codes for integration and orchestration:
| Code | Meaning | Notes |
|---|---|---|
| 0 | Success | Instruction completed successfully |
| 1 | User intervention required | Model requested manual action |
| 2 | Agent error | Parsing or execution failure |
| 3 | Authentication required | OTP / mobile input needed |
| 130 | User cancelled | Ctrl+C (CLI only) |
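An orchestrating script might interpret these codes along the following lines. The numeric values come from the table above; the `ExitCode` enum, `describe`, `should_retry`, and `run_and_classify` are hypothetical helpers, and the retry policy is an assumption rather than documented behaviour.

```python
import subprocess
import sys
from enum import IntEnum
from typing import List


class ExitCode(IntEnum):
    SUCCESS = 0
    USER_INTERVENTION = 1
    AGENT_ERROR = 2
    AUTH_REQUIRED = 3
    USER_CANCELLED = 130


def describe(returncode: int) -> str:
    """Map a process return code to a readable name."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"UNKNOWN({returncode})"


def should_retry(returncode: int) -> bool:
    # Assumption: only agent errors (parsing or execution failures) are
    # worth retrying automatically; the other codes need a human.
    return returncode == ExitCode.AGENT_ERROR


def run_and_classify(argv: List[str]) -> str:
    """Run a command and return the name of its exit code."""
    completed = subprocess.run(argv, check=False)
    return describe(completed.returncode)


if __name__ == "__main__":
    # Stand-in child process that exits with code 3; a real caller would
    # pass the command line of one of the CLI wrappers instead.
    print(run_and_classify([sys.executable, "-c", "import sys; sys.exit(3)"]))
```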
A typical automated run may involve:
- Initialising a clean execution environment
- Loading task-specific context or input data
- Interpreting a high-level instruction
- Driving UI navigation via UI-TARS reasoning
- Extracting structured information from the UI
- Returning results or execution status
- Minimise tight coupling to any single UI or application
- Avoid brittle, pixel-perfect automation logic
- Favour modularity and composability
- Enable easy experimentation with prompt strategies and model behaviour
This short demo shows the UI-TARS v1.5 (7B) model driving a mobile browser-based chat interface to perform automated tasks via ADB.
It illustrates the same underlying reasoning and prompt-driven execution used in the desktop agent, adapted for a lightweight phone workflow.
Note: This demo uses a simplified interface and does not contain any personal data.
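How the demo talks to the phone is not detailed here. As a rough sketch, model-chosen actions could be translated into standard `adb` invocations as shown below. The helper names are hypothetical; the underlying commands (`adb shell input tap`, `adb shell input swipe`, `adb exec-out screencap -p`) are the stock Android tooling, and the functions only build argument lists so they can be inspected before being passed to `subprocess.run`.

```python
from typing import List, Optional


def _adb_prefix(serial: Optional[str]) -> List[str]:
    """Target a specific device with -s when a serial is given."""
    return ["adb", "-s", serial] if serial else ["adb"]


def tap_command(x: int, y: int, serial: Optional[str] = None) -> List[str]:
    """Build the argv for a single tap at pixel coordinates (x, y)."""
    return _adb_prefix(serial) + ["shell", "input", "tap", str(x), str(y)]


def swipe_command(
    x1: int, y1: int, x2: int, y2: int,
    duration_ms: int = 300,
    serial: Optional[str] = None,
) -> List[str]:
    """Build the argv for a swipe from (x1, y1) to (x2, y2)."""
    coords = [str(v) for v in (x1, y1, x2, y2, duration_ms)]
    return _adb_prefix(serial) + ["shell", "input", "swipe"] + coords


def screencap_command(serial: Optional[str] = None) -> List[str]:
    """Build the argv that writes a PNG screenshot to stdout."""
    return _adb_prefix(serial) + ["exec-out", "screencap", "-p"]
```

A caller would capture the screenshot with `subprocess.run(screencap_command(), capture_output=True).stdout` and feed the PNG bytes to the model, mirroring the desktop capture step.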
- This is a proof-of-concept implementation, not a production-hardened system
- UI automation remains inherently sensitive to layout and rendering changes
- Additional safeguards would be required for large-scale or long-running workloads
All proprietary systems, credentials, datasets, and business-specific logic have been removed or abstracted.
This repository focuses solely on the technical approach, architecture, and automation strategy.
This project is provided for demonstration and experimentation purposes only. It is not intended for direct production use without further testing, security review, and operational hardening.