Scrapy downloader middleware that uses SeleniumBase's pure CDP mode to make requests, which lets it bypass most anti-bot protections (e.g. Cloudflare).
Using SeleniumBase's pure CDP mode also makes the middleware more platform-independent, as no WebDriver is required.
pip install scrapy-seleniumbase-cdp
- Add the SeleniumBaseAsyncCDPMiddleware to the downloader middlewares:
  DOWNLOADER_MIDDLEWARES = { 'scrapy_seleniumbase_cdp.SeleniumBaseAsyncCDPMiddleware': 800 }
- If needed, configuration can be provided to the SeleniumBase browser instance:
  SELENIUMBASE_BROWSER_OPTIONS = {
      # …
  }
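The two settings above can also be assembled in one place, for example to assign them to a spider's `custom_settings` instead of the project-wide settings module. The following is a minimal sketch: `seleniumbase_settings` and `MIDDLEWARE_PATH` are hypothetical names introduced for illustration, and it assumes the middleware reads its options from the crawler settings, so per-spider settings behave the same as global ones.

```python
from typing import Any, Optional

# Import path of the middleware, as given in the setup instructions.
MIDDLEWARE_PATH = 'scrapy_seleniumbase_cdp.SeleniumBaseAsyncCDPMiddleware'


def seleniumbase_settings(
    browser_options: Optional[dict[str, Any]] = None,
    priority: int = 800,
) -> dict[str, Any]:
    """Return the Scrapy settings that enable the middleware (hypothetical helper)."""
    settings: dict[str, Any] = {
        'DOWNLOADER_MIDDLEWARES': {MIDDLEWARE_PATH: priority},
    }
    # Browser options are only set when some were passed in. Their keys
    # are not listed in this README, so none are invented here.
    if browser_options:
        settings['SELENIUMBASE_BROWSER_OPTIONS'] = dict(browser_options)
    return settings
```

In a spider, the result could be used as `custom_settings = seleniumbase_settings()`.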
To have SeleniumBase handle requests, use the
scrapy_seleniumbase_cdp.SeleniumBaseRequest instead of Scrapy's built-in
Request:
from scrapy_seleniumbase_cdp import SeleniumBaseRequest

async def start(self):
    yield SeleniumBaseRequest(url=url, callback=self.parse_result)

The scrapy_seleniumbase_cdp.SeleniumBaseRequest accepts five additional arguments. They are processed in the order presented below:
When wait_for is used, SeleniumBase will wait for the element matching the given CSS selector to appear. The default timeout is 10 seconds and can be changed with wait_timeout.
yield SeleniumBaseRequest(
    url=url,
    callback=self.parse_result,
    wait_for='h1.some-class',
    wait_timeout=5)

If needed, an async callback can be provided through browser_callback to interact with the browser instance and/or its tabs. The return value of the callback is stored in response.meta['callback'].
async def start(self):
    async def maximize_window(browser: Browser):
        await browser.main_tab.maximize()

    yield SeleniumBaseRequest(…, browser_callback=maximize_window)

When script is used, SeleniumBase will execute the provided JavaScript code.
yield SeleniumBaseRequest(
    # …
    script='window.scrollTo(0, document.body.scrollHeight)')

If the script returns a Promise, it is possible to await its result:
yield SeleniumBaseRequest(
    # …
    script={
        'await_promise': True,
        'script': '''
            document.getElementById('onetrust-accept-btn-handler').click();
            new Promise(resolve => setTimeout(resolve, 1000))
        '''
    })

The result of the JavaScript code is stored in response.meta['script'].
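When the same "click, then pause" pattern is needed for several elements, the `script` dictionary can be generated rather than written by hand. The sketch below is only an illustration: `click_then_wait` is a hypothetical helper, not part of the package, and it simply reproduces the dictionary shape shown above (`await_promise` and `script` keys).

```python
import json


def click_then_wait(element_id: str, delay_ms: int = 1000) -> dict:
    """Build a `script` argument that clicks an element, then waits (hypothetical helper)."""
    if delay_ms < 0:
        raise ValueError('delay_ms must be non-negative')
    # json.dumps yields a double-quoted, escaped string literal, so ids
    # containing quotes or backslashes cannot break the generated script.
    js_id = json.dumps(element_id)
    script = (
        f'document.getElementById({js_id}).click();\n'
        f'new Promise(resolve => setTimeout(resolve, {int(delay_ms)}))'
    )
    # The Promise is the last expression, so it is the value that gets awaited.
    return {'await_promise': True, 'script': script}
```

The returned dictionary would be passed as `script=click_then_wait('onetrust-accept-btn-handler')`.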
When screenshot is used, SeleniumBase will take a screenshot of the page and store the binary data in response.meta['screenshot']:
yield SeleniumBaseRequest(url=url, callback=self.parse_result, screenshot=True)

def parse_result(self, response):
    # …
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])

You can also specify additional configuration options:
yield SeleniumBaseRequest(…, screenshot={'format': 'jpg', 'full_page': False})

Or provide a path to automatically save the screenshot (in this case, the image data is not stored in the response):

yield SeleniumBaseRequest(…, screenshot={'path': 'output/image.png'})

Available configuration keys:
- path: File path where the screenshot will be saved. Use auto for the SeleniumBase default path. Leave empty to return the data in the response meta.
- format: Image format. Defaults to png; jpg is also available.
- full_page: Capture the full page or just the viewport. Defaults to True.
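Since the screenshot options are passed as a plain dictionary, a typo in a key or format only surfaces at request time. One way to catch it earlier is a small builder that validates the values against the keys listed above. This is a sketch under stated assumptions: `screenshot_options` and `ALLOWED_FORMATS` are hypothetical names, and the defaults mirror the ones documented here rather than anything read from the package.

```python
from typing import Any, Optional

# Formats named in the key list above.
ALLOWED_FORMATS = ('png', 'jpg')


def screenshot_options(
    path: Optional[str] = None,
    image_format: str = 'png',
    full_page: bool = True,
) -> dict[str, Any]:
    """Build a validated `screenshot` argument (hypothetical helper)."""
    if image_format not in ALLOWED_FORMATS:
        raise ValueError(
            f'unsupported format {image_format!r}, expected one of {ALLOWED_FORMATS}'
        )
    options: dict[str, Any] = {'format': image_format, 'full_page': full_page}
    # Omitting `path` keeps the image data in the response meta;
    # 'auto' requests the SeleniumBase default path.
    if path:
        options['path'] = path
    return options
```

For example, `screenshot=screenshot_options(path='output/image.png')` saves the file to disk, while `screenshot=screenshot_options(image_format='jpg', full_page=False)` returns viewport-only JPEG data in the response.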
This project is licensed under the MIT License. It is a fork of Quartz-Core/scrapy-seleniumbase which was originally released under the WTFPL.