docs: Add guide about integrating Stagehand #1290

Open
wants to merge 5 commits into base: master

Conversation

Collaborator

@Mantisus Mantisus commented Jul 8, 2025

Description

  • Add a guide on integrating stagehand-python v0.4.0

Issues

@Mantisus Mantisus requested review from vdusek and Pijukatel July 8, 2025 02:31
Collaborator Author

Mantisus commented Jul 8, 2025

I had to use cast to avoid bloating the guide with extra code just for the sake of typing.
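
For readers unfamiliar with the trick, here is a minimal sketch of what using cast for typing purposes can look like. The class and function names are hypothetical stand-ins, not the guide's code or the Stagehand API. Note that typing.cast only informs the type checker; it performs no conversion or check at runtime.

```python
from typing import cast


class PlainPage:
    """Hypothetical stand-in for a Playwright-style page."""

    def title(self) -> str:
        return 'plain'


class AIPage(PlainPage):
    """Hypothetical stand-in for a page with extra AI helpers."""

    def act(self, instruction: str) -> str:
        return f'acted: {instruction}'


def open_page() -> PlainPage:
    # Declared to return the base type, although an AIPage is created.
    return AIPage()


# cast() is a no-op at runtime; it only tells the type checker to treat the
# value as AIPage, so no extra wrapper types are needed just for typing.
page = cast(AIPage, open_page())
result = page.act('click the button')
```

The trade-off is that the type checker trusts the cast blindly, so a wrong cast surfaces only as a runtime error.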

@Mantisus Mantisus self-assigned this Jul 8, 2025
Collaborator

@Pijukatel Pijukatel left a comment

Cool tool and nice guide. I just have a few small comments about the CrawleeStagehandPage wrapper.

Collaborator

@vdusek vdusek left a comment

Good job Max!

The integration itself is not as easy as I expected. Maybe this could show us the direction in which we could improve/simplify the browsers/Playwright-related interface.

And/or we could introduce a dedicated crawler for this directly in Crawlee, something like PlaywrightStagehandCrawler. Then the guide could focus solely on its usage, showing how to use AI-based selectors for web scraping.

Let's further discuss it with @B4nan and maybe @janbuchar once they're back from their vacations.

@@ -0,0 +1,66 @@
---
Collaborator

I suppose we can't use "Run on Apify" for these examples as it contains more than 1 file.

Collaborator Author

That's right.

And in a single file, it would look very cumbersome.

Collaborator Author

Mantisus commented Jul 8, 2025

> The integration itself is not as easy as I expected. Maybe this could show us the direction in which we could improve/simplify the browsers/Playwright-related interface.

I think the integration comes out more complicated because of the current Stagehand API. Even though it's a wrapper around Playwright, and the documentation says it's the same Playwright but with AI capabilities, the current code doesn't match that.

I hope that they will improve their API, and then the guide can be simplified.
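
To illustrate the gap being described, here is a minimal sketch of one way a wrapper can expose both the AI helpers and the underlying page's interface through attribute delegation. Everything in it is an assumption made for illustration: the classes are synchronous fakes, not Stagehand's or Playwright's real (async) classes, and this is not the CrawleeStagehandPage wrapper from the guide.

```python
from typing import Any


class FakePlaywrightPage:
    """Hypothetical stand-in for a Playwright page."""

    def __init__(self) -> None:
        self.url = 'about:blank'

    def goto(self, url: str) -> str:
        self.url = url
        return url


class FakeAIPage:
    """Hypothetical AI page that keeps the real page in a private `_page`."""

    def __init__(self, page: FakePlaywrightPage) -> None:
        self._page = page

    def act(self, instruction: str) -> str:
        return f'acted: {instruction}'


class DelegatingPage:
    """Looks up attributes on the AI page first, then on the wrapped page."""

    def __init__(self, ai_page: FakeAIPage) -> None:
        self._ai_page = ai_page

    def __getattr__(self, name: str) -> Any:
        # __getattr__ runs only when normal lookup fails, so accessing
        # `self._ai_page` here does not recurse.
        if hasattr(self._ai_page, name):
            return getattr(self._ai_page, name)
        return getattr(self._ai_page._page, name)  # noqa: SLF001


page = DelegatingPage(FakeAIPage(FakePlaywrightPage()))
visited = page.goto('https://example.com')  # served by the wrapped page
acted = page.act('click the button')        # served by the AI page
```

Dynamic delegation like this is invisible to static type checkers, which is one reason such a wrapper tends to need cast or explicit protocol definitions.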

Collaborator

@janbuchar janbuchar left a comment

Looks OK to me, just some minor things to address at will.


self._total_opened_pages += 1

# Wrap StagehandPage to provide Playwright Page interface
Collaborator

This code comment seems inaccurate.

Comment on lines +59 to +68
pw_page = page._page # noqa: SLF001

# Handle page close event
pw_page.on(event='close', f=self._on_page_close)

# Update internal state
self._pages.append(pw_page)
self._last_page_opened_at = datetime.now(timezone.utc)

self._total_opened_pages += 1
Collaborator

There is quite a bit of code copied over from PlaywrightBrowserController, isn't there? Any chance we could improve the PlaywrightBrowserController internal API so that integrating libraries that extend Playwright is easier?

Collaborator Author

Perhaps context creation should be moved into a separate public method, along with the state updates. That would make this kind of integration a bit cleaner.

But I would say that the main problem with this integration is that you have to do things like pw_page = page._page.
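
A rough sketch of the refactoring idea discussed in this thread: page creation and state updates are split into separate overridable methods, so a subclass replaces only the creation step instead of copying the bookkeeping. All class and method names here are hypothetical and use synchronous fakes; this is not the actual PlaywrightBrowserController API.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any, Callable


class FakePage:
    """Hypothetical stand-in for a Playwright page."""

    def __init__(self) -> None:
        self.handlers: dict[str, Callable[..., Any]] = {}

    def on(self, event: str, f: Callable[..., Any]) -> None:
        self.handlers[event] = f


class FakeAIPage:
    """Hypothetical AI page that hides the real page in `_page`."""

    def __init__(self) -> None:
        self._page = FakePage()


class BaseController:
    def __init__(self) -> None:
        self._pages: list[FakePage] = []
        self._last_page_opened_at: datetime | None = None
        self._total_opened_pages = 0

    def create_page(self) -> Any:
        """Hook: subclasses override this to create another page type."""
        return FakePage()

    def unwrap_page(self, page: Any) -> FakePage:
        """Hook: return the underlying page used for bookkeeping."""
        return page

    def register_page(self, pw_page: FakePage) -> None:
        """Shared state updates, written once in the base class."""
        pw_page.on(event='close', f=self._on_page_close)
        self._pages.append(pw_page)
        self._last_page_opened_at = datetime.now(timezone.utc)
        self._total_opened_pages += 1

    def _on_page_close(self, page: FakePage) -> None:
        self._pages.remove(page)

    def new_page(self) -> Any:
        page = self.create_page()
        self.register_page(self.unwrap_page(page))
        return page


class AIController(BaseController):
    def create_page(self) -> FakeAIPage:
        return FakeAIPage()

    def unwrap_page(self, page: FakeAIPage) -> FakePage:
        # The private access remains, but it is confined to one place.
        return page._page  # noqa: SLF001


controller = AIController()
ai_page = controller.new_page()
```

This does not remove the need to reach into page._page, which would still depend on the wrapped library exposing its underlying page publicly.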

Development

Successfully merging this pull request may close these issues.

Integrate Stagehand into PlaywrightCrawler
4 participants