Online-Mind2Web Example (Rebased) #156

Genteki · 2025-10-01T13:22:56Z

Online-Mind2Web Evaluation

The file changes here have been rebased and modified for hud-python==0.4.50. Original PR: #145

Update

browser/server/computer_tools/:
- AnthropicComputerToolWithRecord: Claude computer tool with screenshot recording, using callback function
- OpenAIComputerToolWithRecord OpenAI computer tool with screenshot recording, using callback function
browser/server/evaluate/online_mind2web/:
- webjudge_online_mind2web.py] Evaluating method, using GPT-4o, based on problem description, screenshot history and action history. [Reference: Online-Mind2Web]
- autonomous_eval.py: Evaluating method, using GPT-4o, based on problem description, final screenshot and action history. [Reference: Online-Mind2Web]
browser/server/main.py:
- replace AnthropicComputerTool, OpenAIComputerTool with the ones with history recording

Updates Oct 8, 2025

Online-Mind2Web supported in remote-browser env

Note

Introduce Anthropic/OpenAI computer tools that auto-save screenshots and record actions, and add Online‑Mind2Web evaluators (autonomous_eval, webjudge, overall_judge) wired into both browser and remote-browser environments.

Evaluation (Online‑Mind2Web):
- Add evaluate/online_mind2web with autonomous_eval and webjudge using GPT‑4o, leveraging screenshot and action history.
- Wire router into environments/browser/server/evaluate/__init__.py and include in server.
- Remote browser: add evaluate/autonomous_eval.py, evaluate/webjudge.py, and evaluate/overall_judge.py; register via evaluate hub.
Computer Tools (screenshot/action recording):
- New AnthropicComputerToolWithRecord and OpenAIComputerToolWithRecord that:
  - Save screenshots to /screenshot on key actions.
  - Append unified action logs to /action_history/action_history.txt.
- Replace prior tools with recording variants in browser/server/main.py and remote server wiring; export from local tool packages.
Server wiring:
- Register new tools in both environments and ensure Playwright/browser executor integration.
- Add imports/routers; keep hidden evaluate inclusion where applicable.
Dependencies:
- Update hud-python pin to @main in browser/server/pyproject.toml.

^{Written by Cursor Bugbot for commit a2cfa28. This will update automatically on new commits. Configure here.}

promptless · 2025-10-01T13:32:08Z

📝 Documentation updates detected!

Updated existing suggestion: Add comprehensive Mind2Web evaluation documentation (updated for PR #156)

cursor · 2025-10-09T17:32:55Z

environments/remote_browser/src/hud_controller/evaluate/webjudge.py

+    if openai_api_key:
+        logging.info(
+            f"DEBUG: Raw key repr: {repr(openai_api_key[:10])}"
+        )  # Show first 50 chars with repr to see any weird characters


Bug: Logging Mismatch: Key Truncation

The logging statement for openai_api_key shows only the first 10 characters, but its inline comment indicates it should display the first 50. This mismatch between the code's behavior and its description can be confusing during debugging.

Additional Locations (1)

environments/browser/server/evaluate/online_mind2web/webjudge.py#L33-L36

cursor · 2025-10-09T17:32:55Z

environments/remote_browser/src/hud_controller/evaluate/webjudge.py

+        if main_score >= score_threshold:
+            # Include high-scoring screenshots in final evaluation
+            final_images = []
+            for screenshot_b64 in screenshot_history:  # All screenshots


Bug: Screenshot Count Mismatch Causes API Overuse

The webjudge function uses the last 10 screenshots for image judgment, which is inconsistent with the browser version's use of the last 5. This difference may increase API usage and affect performance.

Online-Web2Mind example

0d1a940

This comment was marked as outdated.

Sign in to view

Genteki added 2 commits October 3, 2025 09:31

fix bug

2c375cc

ruff format

5109c33

This comment was marked as outdated.

Sign in to view

Genteki and others added 3 commits October 3, 2025 10:04

Remove redundant code

5af8ffc

Merge branch 'hud-evals:main' into Example--Online-Mind2Web

c8b17f3

remote browser update

c84b3a8

This comment was marked as outdated.

Sign in to view

detailed output message

a2cfa28

cursor bot reviewed Oct 9, 2025

View reviewed changes

Genteki mentioned this pull request Oct 13, 2025

Online-Mind2Web Folder #168

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Online-Mind2Web Example (Rebased) #156

Online-Mind2Web Example (Rebased) #156

Uh oh!

Genteki commented Oct 1, 2025 •

edited by cursor bot

Loading

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

promptless bot commented Oct 1, 2025

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

cursor bot Oct 9, 2025

Uh oh!

cursor bot Oct 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Online-Mind2Web Example (Rebased) #156

Are you sure you want to change the base?

Online-Mind2Web Example (Rebased) #156

Uh oh!

Conversation

Genteki commented Oct 1, 2025 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Online-Mind2Web Evaluation

Update

Updates Oct 8, 2025

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

promptless bot commented Oct 1, 2025

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

cursor bot Oct 9, 2025

Choose a reason for hiding this comment

Bug: Logging Mismatch: Key Truncation

Uh oh!

cursor bot Oct 9, 2025

Choose a reason for hiding this comment

Bug: Screenshot Count Mismatch Causes API Overuse

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Genteki commented Oct 1, 2025 •

edited by cursor bot

Loading