scrapy-seleniumbase-cdp


Scrapy downloader middleware that uses SeleniumBase's pure CDP mode to make requests, making it possible to bypass most anti-bot protections (e.g. Cloudflare).

Using SeleniumBase's pure CDP mode also makes the middleware more platform-independent, as no WebDriver is required.

Installation

pip install scrapy-seleniumbase-cdp

Configuration

  1. Add the SeleniumBaseAsyncCDPMiddleware to the downloader middlewares:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_seleniumbase_cdp.SeleniumBaseAsyncCDPMiddleware': 800
    }
  2. If needed, configuration can be provided to the SeleniumBase browser instance:

    SELENIUMBASE_BROWSER_OPTIONS = {
        # …
    }
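
As a sketch, the setting might look like the following (headless and user_agent are assumed keys shown only to illustrate the shape of the dictionary, not documented options; check the SeleniumBase documentation for what is actually supported):

SELENIUMBASE_BROWSER_OPTIONS = {
    'headless': True,             # assumed option: run the browser without a window
    'user_agent': 'my-crawler',   # assumed option: override the default user agent
}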

Usage

To have SeleniumBase handle requests, use the scrapy_seleniumbase_cdp.SeleniumBaseRequest instead of Scrapy's built-in Request:

from scrapy_seleniumbase_cdp import SeleniumBaseRequest

async def start(self):
    yield SeleniumBaseRequest(url=url, callback=self.parse_result)
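
For context, a complete spider might look like the following sketch (the spider name, URL and parse logic are illustrative placeholders, not part of the middleware's API):

import scrapy

from scrapy_seleniumbase_cdp import SeleniumBaseRequest


class ExampleSpider(scrapy.Spider):
    name = 'example'  # hypothetical spider name

    async def start(self):
        # Hand the request to SeleniumBase instead of Scrapy's default downloader.
        yield SeleniumBaseRequest(
            url='https://example.com',  # placeholder URL
            callback=self.parse_result,
        )

    def parse_result(self, response):
        # The response body contains the HTML rendered by the browser.
        yield {'title': response.css('title::text').get()}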

Additional arguments

The scrapy_seleniumbase_cdp.SeleniumBaseRequest accepts five additional arguments. They are executed in the order presented below:

wait_for / wait_timeout

When used, SeleniumBase will wait for the element matching the given CSS selector to appear. The default timeout is 10 seconds but can be changed if needed.

yield SeleniumBaseRequest(
    url=url,
    callback=self.parse_result,
    wait_for='h1.some-class',
    wait_timeout=5)
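
Once the element has appeared, it can be extracted in the callback as usual (a sketch reusing the selector from the example above):

def parse_result(self, response):
    # The element waited on is present in the rendered HTML.
    yield {'heading': response.css('h1.some-class::text').get()}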

browser_callback

If needed, an async callback can be provided to interact with the browser instance and/or its tabs. Its return value is stored in response.meta['callback'].

async def start(self):
    async def maximize_window(browser: Browser):
        await browser.main_tab.maximize()

    yield SeleniumBaseRequest(…, browser_callback=maximize_window)
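
The return value can then be read in the Scrapy callback. A sketch, assuming the tab object exposes an evaluate() coroutine (an assumption about the underlying CDP driver API, which may differ between versions):

async def start(self):
    async def browser_info(browser: Browser):
        # Whatever is returned here ends up in response.meta['callback'].
        # evaluate() is assumed from the underlying CDP driver API.
        return await browser.main_tab.evaluate('navigator.userAgent')

    yield SeleniumBaseRequest(url=url, callback=self.parse_result,
                              browser_callback=browser_info)

def parse_result(self, response):
    yield {'user_agent': response.meta['callback']}  # value returned by browser_info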

script

When used, SeleniumBase will execute the provided JavaScript code.

yield SeleniumBaseRequest(
    # …
    script='window.scrollTo(0, document.body.scrollHeight)')

If the script returns a Promise, it is possible to await its result:

yield SeleniumBaseRequest(
    # …
    script={
        'await_promise': True,
        'script': '''
            document.getElementById('onetrust-accept-btn-handler').click()
            new Promise(resolve => setTimeout(resolve, 1000))
        '''
    })

The result of the JavaScript code is stored in response.meta['script'].
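
For example, the evaluated value can be read back in the Scrapy callback (sketch):

yield SeleniumBaseRequest(
    url=url,
    callback=self.parse_result,
    script='document.title')

def parse_result(self, response):
    yield {'title': response.meta['script']}  # result of the evaluated script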

screenshot

When used, SeleniumBase will take a screenshot of the page and the binary data will be stored in response.meta['screenshot']:

yield SeleniumBaseRequest(url=url, callback=self.parse_result, screenshot=True)


def parse_result(self, response):
    # …
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])

You can also specify additional configuration options:

yield SeleniumBaseRequest(…, screenshot={'format': 'jpg', 'full_page': False})

Or provide a path to automatically save the screenshot (in this case, the image data is not stored in the response):

yield SeleniumBaseRequest(…, screenshot={'path': 'output/image.png'})

Available configuration keys:

  • path: File path where the screenshot will be saved. Use auto for SeleniumBase's default path. Leave it unset to have the image data returned in response.meta['screenshot'] instead.
  • format: Image format; defaults to png, jpg is also available.
  • full_page: Capture the full page or only the viewport; defaults to True.
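
The arguments can also be combined in a single request; as noted above, they are applied in the order wait_for/wait_timeout, browser_callback, script, screenshot. A sketch:

yield SeleniumBaseRequest(
    url=url,
    callback=self.parse_result,
    wait_for='h1.some-class',
    wait_timeout=5,
    script='window.scrollTo(0, document.body.scrollHeight)',
    screenshot={'format': 'jpg', 'full_page': False})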

License

This project is licensed under the MIT License. It is a fork of Quartz-Core/scrapy-seleniumbase which was originally released under the WTFPL.
