Contributor

@Genteki Genteki commented Oct 1, 2025

Online-Mind2Web Evaluation

The file changes here have been rebased and modified for hud-python==0.4.50. Original PR: #145

Update

  • browser/server/computer_tools/:
    • AnthropicComputerToolWithRecord: Claude computer tool that records screenshots through a callback function
    • OpenAIComputerToolWithRecord: OpenAI computer tool that records screenshots through a callback function
  • browser/server/evaluate/online_mind2web/:
    • webjudge_online_mind2web.py: Evaluation method using GPT-4o, based on the problem description, screenshot history, and action history. [Reference: Online-Mind2Web]
    • autonomous_eval.py: Evaluation method using GPT-4o, based on the problem description, final screenshot, and action history. [Reference: Online-Mind2Web]
  • browser/server/main.py:
    • Replace AnthropicComputerTool and OpenAIComputerTool with their history-recording variants
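
The callback-based recording described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `ToolResult`, `FakeComputerTool`, `ComputerToolWithRecord`, and the `on_record` callback signature are all hypothetical stand-ins, and wrapping by composition (rather than subclassing the hud-python tools) is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical callback type: receives the action name, its arguments,
# and the base64 screenshot returned by the tool (if any).
RecordCallback = Callable[[str, dict, Optional[str]], None]


@dataclass
class ToolResult:
    """Stand-in for a computer tool result; not the hud-python type."""
    output: str
    base64_image: Optional[str] = None


class FakeComputerTool:
    """Stand-in for AnthropicComputerTool / OpenAIComputerTool."""

    async def __call__(self, action: str, **kwargs) -> ToolResult:
        return ToolResult(output=f"did {action}", base64_image="aGVsbG8=")


class ComputerToolWithRecord:
    """Wraps a tool and reports every call to a callback."""

    def __init__(self, tool, on_record: RecordCallback):
        self._tool = tool
        self._on_record = on_record

    async def __call__(self, action: str, **kwargs) -> ToolResult:
        result = await self._tool(action, **kwargs)
        # Record after the action so the screenshot reflects its effect.
        self._on_record(action, kwargs, result.base64_image)
        return result


history: list = []
tool = ComputerToolWithRecord(
    FakeComputerTool(),
    lambda action, args, shot: history.append((action, args, shot)),
)
result = asyncio.run(tool("click", x=10, y=20))
```

Keeping the recording logic in a callback means the same wrapper could serve both the Anthropic and OpenAI variants, with the evaluator reading whatever the callback stored.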

Updates Oct 8, 2025

Online-Mind2Web is now supported in the remote-browser environment.


Note

Introduce Anthropic/OpenAI computer tools that auto-save screenshots and record actions, and add Online‑Mind2Web evaluators (autonomous_eval, webjudge, overall_judge) wired into both browser and remote-browser environments.

  • Evaluation (Online‑Mind2Web):
    • Add evaluate/online_mind2web with autonomous_eval and webjudge using GPT‑4o, leveraging screenshot and action history.
    • Wire router into environments/browser/server/evaluate/__init__.py and include in server.
    • Remote browser: add evaluate/autonomous_eval.py, evaluate/webjudge.py, and evaluate/overall_judge.py; register via evaluate hub.
  • Computer Tools (screenshot/action recording):
    • New AnthropicComputerToolWithRecord and OpenAIComputerToolWithRecord that:
      • Save screenshots to /screenshot on key actions.
      • Append unified action logs to /action_history/action_history.txt.
    • Replace prior tools with recording variants in browser/server/main.py and remote server wiring; export from local tool packages.
  • Server wiring:
    • Register new tools in both environments and ensure Playwright/browser executor integration.
    • Add imports/routers; keep hidden evaluate inclusion where applicable.
  • Dependencies:
    • Update hud-python pin to @main in browser/server/pyproject.toml.
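
As a rough sketch of the screenshot and action-log persistence summarized above: the helper names, the numbered-PNG file naming, and the `action: detail` line format are assumptions made for illustration, and a temporary directory stands in for the `/screenshot` and `/action_history` paths named in the summary.

```python
import base64
import tempfile
from pathlib import Path


def save_screenshot(screenshot_dir: Path, index: int, screenshot_b64: str) -> Path:
    """Decode a base64 screenshot and write it as a numbered PNG file."""
    screenshot_dir.mkdir(parents=True, exist_ok=True)
    path = screenshot_dir / f"{index:04d}.png"
    path.write_bytes(base64.b64decode(screenshot_b64))
    return path


def append_action(history_file: Path, action: str, detail: str) -> None:
    """Append one action per line so both tools share a single log format."""
    history_file.parent.mkdir(parents=True, exist_ok=True)
    with history_file.open("a", encoding="utf-8") as f:
        f.write(f"{action}: {detail}\n")


# A temporary directory stands in for the container root used in the PR.
root = Path(tempfile.mkdtemp())
shot = save_screenshot(root / "screenshot", 1, base64.b64encode(b"fake-png").decode())
log = root / "action_history" / "action_history.txt"
append_action(log, "click", "x=10, y=20")
append_action(log, "type", "text='hello'")
lines = log.read_text(encoding="utf-8").splitlines()
```

An append-only text log keeps the action history readable by the evaluators regardless of which tool produced it.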

Written by Cursor Bugbot for commit a2cfa28. This will update automatically on new commits.

cursor[bot]

This comment was marked as outdated.

Contributor

promptless bot commented Oct 1, 2025

📝 Documentation updates detected!

Updated existing suggestion: Add comprehensive Mind2Web evaluation documentation (updated for PR #156)

if openai_api_key:
    logging.info(
        f"DEBUG: Raw key repr: {repr(openai_api_key[:10])}"
    )  # Show first 50 chars with repr to see any weird characters
Bug: Logging Mismatch: Key Truncation

The logging statement for openai_api_key shows only the first 10 characters, but its inline comment indicates it should display the first 50. This mismatch between the code's behavior and its description can be confusing during debugging.
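
One way to keep the slice length and its description from drifting apart is to derive both from a single constant. The sketch below is only a suggestion, not the fix applied in this PR; `KEY_PREVIEW_CHARS` and `key_preview` are hypothetical names, and the key shown is a placeholder.

```python
import logging

KEY_PREVIEW_CHARS = 10  # single source of truth for the preview length


def key_preview(key: str, n: int = KEY_PREVIEW_CHARS) -> str:
    """Return repr() of the first n characters so stray whitespace or quotes are visible."""
    return repr(key[:n])


logging.basicConfig(level=logging.INFO)
openai_api_key = " sk-example-not-a-real-key\n"  # placeholder value
if openai_api_key:
    # The message states the length from the same constant used for slicing.
    logging.info(
        "DEBUG: Raw key repr (first %d chars): %s",
        KEY_PREVIEW_CHARS,
        key_preview(openai_api_key),
    )
```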

if main_score >= score_threshold:
    # Include high-scoring screenshots in final evaluation
    final_images = []
    for screenshot_b64 in screenshot_history:  # All screenshots
Bug: Screenshot Count Mismatch Causes API Overuse

The webjudge function uses the last 10 screenshots for image judgment, which is inconsistent with the browser version's use of the last 5. This difference may increase API usage and affect performance.
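
A shared limit would let both environments send the same number of screenshots to the judge. The following is a hedged sketch: `MAX_JUDGE_SCREENSHOTS`, `select_recent`, and `to_image_parts` are hypothetical names, the value 5 is assumed from the browser version mentioned above, and the image parts follow the data-URL `image_url` format accepted by the OpenAI chat completions API.

```python
MAX_JUDGE_SCREENSHOTS = 5  # assumed shared limit; the PR does not state the final value


def select_recent(screenshot_history: list, limit: int = MAX_JUDGE_SCREENSHOTS) -> list:
    """Keep only the most recent `limit` screenshots, preserving order."""
    return screenshot_history[-limit:] if limit > 0 else []


def to_image_parts(screenshots_b64: list) -> list:
    """Build image content parts using base64 data URLs."""
    return [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        for b64 in screenshots_b64
    ]


# Placeholder strings stand in for real base64-encoded screenshots.
screenshot_history = [f"shot{i}" for i in range(12)]
final_images = to_image_parts(select_recent(screenshot_history))
```

Importing the constant in both the browser and remote-browser evaluators would make any future change to the limit apply in one place.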
