parameterize agent execution model in eval runner #1093

tkattkat · 2025-09-23T19:02:48Z

why

currently, the agent eval run does not use an execution model by default

what changed

a default execution model is now set, which can be overridden with env var AGENT_EVAL_EXECUTION_MODEL

test plan

tested locally

…onfigu… (#1073) …ration settings

# why
Updated docs to match the new fingerprint params in the Browserbase docs here: https://docs.browserbase.com/guides/stealth-customization#customization-options

# what changed
Update browser configuration docs to reflect the docs changes.

# test plan

# why
Currently, Playwright's screenshot command is not available when the execution environment is Stagehand. A customer has indicated they would prefer to use Playwright's native screenshot command instead of CDP when using Browserbase, as the CDP screenshot causes unexpected behavior on their target site.

# what changed
- added a StagehandScreenshotOptions type with a useCDP argument
- extended the page type to accept custom stagehand screenshot options
- updated the screenshot proxy to default useCDP to true if the env is browserbase, and to use the playwright screenshot if useCDP is false
- added an eval for screenshot with and without cdp

# test plan
- tested and confirmed functionality with the eval and an external example script (not committed)
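The useCDP defaulting rule described in the screenshot change above can be sketched roughly as follows. This is an illustration only, not the code from that commit: `StagehandScreenshotOptions` and `useCDP` are named in the commit message, while `ExecutionEnv`, `resolveUseCDP`, and the choice that a non-Browserbase env defaults to the Playwright screenshot are assumptions made for the example.

```typescript
// Hypothetical sketch of the defaulting rule; not the actual Stagehand source.
// `ExecutionEnv` and `resolveUseCDP` are illustrative names.
type ExecutionEnv = "BROWSERBASE" | "LOCAL";

interface StagehandScreenshotOptions {
  // When true, take the screenshot over CDP; when false, use Playwright's native screenshot.
  useCDP?: boolean;
}

function resolveUseCDP(
  env: ExecutionEnv,
  options: StagehandScreenshotOptions = {},
): boolean {
  // An explicit caller choice wins; otherwise default to CDP only on Browserbase.
  // (Assumption: a local env falls back to the Playwright screenshot.)
  return options.useCDP ?? (env === "BROWSERBASE");
}
```

Under these assumptions, a caller on Browserbase who wants Playwright's native screenshot would pass `{ useCDP: false }`.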

…1057)

# why
We want to build a best-in-class agent in stagehand. Therefore, we need more eval benchmarks.

# what changed
- Added Web-bench evals dataset
- Added a subset of OS World evals: those that can be run in a Chrome browser (desktop-based tasks omitted)
- Added LICENSE notices to the copied evals tasks
- Added ground truth / expected result to some WebVoyager tasks using reference_answer.json from the Browser Use public evals repo
- Improved `pnpm run evals -man` to better describe how to run evals

# test plan
Evals should run locally and on bb for these new benchmarks.

# why
The initial instructions didn't mention the uv or pip prerequisites and also didn't mention venv. The fix reduces friction for first-timers.

# what changed
- added link to install uv
- added details for initializing venv
- adjusted the code example accordingly

# test plan
docs change

… with LanguageModelV1 + LiteLLM works for python (#1086)

# why
1. aisdk not yet available through npm package
2. customLLM provider only works with LanguageModelV1
3. LiteLLM compatible providers are supported in python

# what changed
1. change docs to install stagehand from git repo
2. pin versions that use LanguageModelV1

# test plan
local test

# why
Our existing screenshot service is a dummy, time-triggered service. It does not trigger on any actions of the agent.

# what changed
- Added an image hash diff algorithm (quick check with MSE, verified with SSIM) to see whether there was an actual UI change; a screenshot is stored in the buffer only if there was.
- Added a screenshot interceptor that copies each screenshot the agent takes to a buffer (if different enough from the previous screenshot) to be used later for evals.
- There's also a small refactor of the agent initialization config to enable the screenshot collector service to be attached.

# test plan
Tests pass locally

Co-authored-by: Miguel <[email protected]>
Co-authored-by: miguel <[email protected]>
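The two-stage check described above (cheap MSE first, SSIM to confirm) could look something like the sketch below. It is not the code from that commit: the `Frame` representation (flat grayscale arrays of equal length), the function names, both thresholds, and the use of a single global SSIM window instead of the usual local windows are all assumptions chosen to keep the example short.

```typescript
// Hypothetical sketch of "quick check with MSE, verify with SSIM".
// Frames are flat grayscale arrays (0..255) of equal length; thresholds are illustrative.
type Frame = number[];

// Mean squared error between two frames.
function mse(a: Frame, b: Frame): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum / a.length;
}

// Global (single-window) SSIM; production implementations usually average local windows.
function globalSsim(a: Frame, b: Frame): number {
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= n;
  meanB /= n;
  let varA = 0;
  let varB = 0;
  let cov = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    varA += da * da;
    varB += db * db;
    cov += da * db;
  }
  varA /= n;
  varB /= n;
  cov /= n;
  // Standard stabilizing constants for 8-bit images: (0.01 * 255)^2 and (0.03 * 255)^2.
  const c1 = 6.5025;
  const c2 = 58.5225;
  return (
    ((2 * meanA * meanB + c1) * (2 * cov + c2)) /
    ((meanA * meanA + meanB * meanB + c1) * (varA + varB + c2))
  );
}

// True when the new frame should be stored, i.e. the UI appears to have changed.
function hasUiChanged(
  prev: Frame,
  next: Frame,
  mseThreshold = 30,
  ssimThreshold = 0.98,
): boolean {
  // Stage 1: nearly identical frames are rejected cheaply.
  if (mse(prev, next) < mseThreshold) return false;
  // Stage 2: confirm with the structural comparison.
  return globalSsim(prev, next) < ssimThreshold;
}
```

The point of the ordering is that MSE is a single pass with no statistics, so most unchanged frames never reach the more expensive SSIM step.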

changeset-bot · 2025-09-23T19:02:51Z

🦋 Changeset detected

Latest commit: f340c37

The changes in this PR will be included in the next version bump.

greptile-apps

Greptile Overview

Summary

This PR adds parameterization for the agent execution model in eval runners. Previously, agent evaluations ran without an execution model set. Now, non-CUA models get a default execution model (google/gemini-2.5-flash) that can be overridden via the AGENT_EVAL_EXECUTION_MODEL environment variable.

Added environment variable AGENT_EVAL_EXECUTION_MODEL for execution model configuration
Set default execution model to google/gemini-2.5-flash for non-CUA models
CUA models (computer-use-preview and claude) continue to work without execution model

Confidence Score: 4/5

This PR is safe to merge with minimal risk
The change is straightforward and well-contained, adding proper fallback behavior for execution models. The logic correctly differentiates between CUA and non-CUA models, and the default value choice is reasonable
No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| evals/initStagehand.ts | 4/5 | Added parameterized execution model with environment variable fallback and default value for non-CUA models |

Sequence Diagram

sequenceDiagram
    participant Eval as Eval Runner
    participant Init as initStagehand
    participant Env as Environment
    participant Agent as Stagehand Agent

    Eval->>Init: Call initStagehand with modelName
    Init->>Init: Check if isCUAModel(modelName)
    
    alt CUA Model (computer-use-preview or claude)
        Init->>Agent: Create agentConfig with model & provider
        Note right of Init: executionModel not set for CUA models
    else Non-CUA Model
        Init->>Env: Check AGENT_EVAL_EXECUTION_MODEL
        Env-->>Init: Return env var or undefined
        Init->>Init: Set executionModel = env ?? "google/gemini-2.5-flash"
        Init->>Agent: Create agentConfig with model & executionModel
    end
    
    Init->>Agent: stagehand.agent(agentConfig)
    Agent-->>Eval: Return configured agent
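
The branching in the diagram can be sketched in TypeScript as shown below. This is a hedged illustration, not the contents of evals/initStagehand.ts: `AGENT_EVAL_EXECUTION_MODEL`, `executionModel`, `isCUAModel`, and the `google/gemini-2.5-flash` default come from the summary above, while `buildAgentConfig`, the `AgentConfig` shape, the substring-based body of `isCUAModel`, and passing the environment in as a plain object are assumptions made so the example is self-contained.

```typescript
// Hypothetical sketch of the execution-model fallback; not the actual eval runner code.
const DEFAULT_EXECUTION_MODEL = "google/gemini-2.5-flash";

interface AgentConfig {
  model: string;
  executionModel?: string;
}

// Assumption: CUA models are detected by name, per "computer-use-preview or claude".
function isCUAModel(modelName: string): boolean {
  return modelName.includes("computer-use-preview") || modelName.includes("claude");
}

// `env` stands in for process.env so the sketch runs without Node typings.
function buildAgentConfig(
  modelName: string,
  env: Record<string, string | undefined>,
): AgentConfig {
  if (isCUAModel(modelName)) {
    // CUA models run without an execution model.
    return { model: modelName };
  }
  return {
    model: modelName,
    // Env var override first, then the default.
    executionModel: env.AGENT_EVAL_EXECUTION_MODEL ?? DEFAULT_EXECUTION_MODEL,
  };
}
```

With this shape, setting `AGENT_EVAL_EXECUTION_MODEL` changes the execution model for non-CUA runs only; CUA runs ignore it.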

1 file reviewed, no comments

tkattkat and others added 19 commits September 10, 2025 13:40

add playwright arguments to agent (#1066)

9daa584

# why
solves #1060; patches a regression where playwright arguments were removed from the agent execute response

# what changed
agent.execute now returns playwright arguments in its response

# test plan
tested locally

[docs] add info on not needing project id in browserbase session params to docs (#1065)

f6f05b0

# why
reflect project id changes in docs

# what changed
advanced configuration comments

# test plan
reviewed via mintlify on localhost

Export aisdk (#1058)

c886544

# why
Easier to use for Custom LLM Clients and keep users up to date with our aisdk file

# what changed
added export of aisdk to lib/index.ts

# test plan
build local stagehand, import local AISdkClient, run Azure Stagehand session

[docs] export aisdk (#1074)

3c39a05

# why
Updating docs to reflect aisdk can be imported directly

# what changed
The model page

# test plan
Reviewed page with mintlify dev locally

Fix zod peer dependency support (#1032)

bf2d0e7

add stagehand agent to api (#1077)

7f38b3a

# why
Currently, we do not support stagehand agent within the api

# what changed
When api is enabled, stagehand agent now routes through the api

# test plan
Tested locally

update xpath in observe_vantechjournal (#1088)

b9c8102

# why
- webpage structure changed, so the xpath in the expected locator needed updating

Fix session create logs on api (#1089)

536f366

Improve failed act logs (#1090)

8ff5c5a

pass stagehand, instead of stagehandPage to agent (#1082)

8c0fd01

# why
currently we pass the stagehand page to the agent, which causes our page management to have issues when new tabs are opened

# what changed
the stagehand object is now passed instead of stagehandPage

# test plan
tested locally

Eval metadata (#1092)

f89b13e

# why
To help make sense of eval test cases and results

# what changed
Added metadata to eval runs, cleaned up deprecated code

# test plan

parameterize agent execution model in eval runner

61d1472

update example env

f340c37

greptile-apps bot reviewed Sep 23, 2025

miguelg719 force-pushed the main branch from 4994eab to bd0a799 Compare October 29, 2025 16:15

tkattkat closed this Oct 29, 2025

9 participants
