This repository was archived by the owner on Dec 28, 2023. It is now read-only.

Rebranding #17

Open
wants to merge 6 commits into master
Changes from all commits
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright 2020 Scrapinghub
Copyright (c) 2021 Zyte Group Ltd

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
68 changes: 34 additions & 34 deletions README.md
@@ -1,11 +1,11 @@
# Scrapy Middleware for Crawlera Simple Fetch API
[![actions](https://github.com/scrapy-plugins/scrapy-crawlera-fetch/workflows/Build/badge.svg)](https://github.com/scrapy-plugins/scrapy-crawlera-fetch/actions)
[![codecov](https://codecov.io/gh/scrapy-plugins/scrapy-crawlera-fetch/branch/master/graph/badge.svg)](https://codecov.io/gh/scrapy-plugins/scrapy-crawlera-fetch)
# Scrapy Middleware for Zyte Smart Proxy Manager Simple Fetch API
[![actions](https://github.com/scrapy-plugins/scrapy-zyte-proxy-fetch/workflows/Build/badge.svg)](https://github.com/scrapy-plugins/scrapy-zyte-proxy-fetch/actions)
[![codecov](https://codecov.io/gh/scrapy-plugins/scrapy-zyte-proxy-fetch/branch/master/graph/badge.svg)](https://codecov.io/gh/scrapy-plugins/scrapy-zyte-proxy-fetch)

This package provides a Scrapy
[Downloader Middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html)
to transparently interact with the
[Crawlera Fetch API](https://doc.scrapinghub.com/crawlera-fetch-api.html).
[Zyte Smart Proxy Manager Fetch API](https://docs.zyte.com/smart-proxy-manager/fetch-api.html).


## Requirements
@@ -18,70 +18,70 @@ to transparently interact with the

Not yet available on PyPI. However, it can be installed directly from GitHub:

`pip install git+ssh://git@github.com/scrapy-plugins/scrapy-crawlera-fetch.git`
`pip install git+ssh://git@github.com/scrapy-plugins/scrapy-zyte-proxy-fetch.git`

or

`pip install git+https://github.com/scrapy-plugins/scrapy-crawlera-fetch.git`
`pip install git+https://github.com/scrapy-plugins/scrapy-zyte-proxy-fetch.git`


## Configuration

Enable the `CrawleraFetchMiddleware` via the
Enable the `SmartProxyManagerFetchMiddleware` via the
[`DOWNLOADER_MIDDLEWARES`](https://docs.scrapy.org/en/latest/topics/settings.html#downloader-middlewares)
setting:

```
DOWNLOADER_MIDDLEWARES = {
"crawlera_fetch.CrawleraFetchMiddleware": 585,
"zyte_proxy_fetch.SmartProxyManagerFetchMiddleware": 585,
}
```

Please note that the middleware needs to be placed before the built-in `HttpCompressionMiddleware`
middleware (which has a priority of 590), otherwise incoming responses will be compressed and the
Crawlera middleware won't be able to handle them.
Smart Proxy Manager middleware won't be able to handle them.

### Settings

* `CRAWLERA_FETCH_ENABLED` (type `bool`, default `False`). Whether or not the middleware will be enabled,
i.e. requests should be downloaded using the Crawlera Fetch API. The `crawlera_fetch_enabled` spider
* `ZYTE_PROXY_FETCH_ENABLED` (type `bool`, default `False`). Whether or not the middleware will be enabled,
i.e. requests should be downloaded using the Smart Proxy Manager Fetch API. The `zyte_proxy_fetch_enabled` spider
attribute takes precedence over this setting.

* `CRAWLERA_FETCH_APIKEY` (type `str`). API key to be used to authenticate against the Crawlera endpoint
* `ZYTE_PROXY_FETCH_APIKEY` (type `str`). API key to be used to authenticate against the Smart Proxy Manager endpoint
(mandatory if enabled)

* `CRAWLERA_FETCH_URL` (Type `str`, default `"http://fetch.crawlera.com:8010/fetch/v2/"`).
The endpoint of a specific Crawlera instance
* `ZYTE_PROXY_FETCH_URL` (Type `str`, default `"http://fetch.crawlera.com:8010/fetch/v2/"`).
The endpoint of a specific Smart Proxy Manager instance

* `CRAWLERA_FETCH_RAISE_ON_ERROR` (type `bool`, default `True`). Whether or not the middleware will
* `ZYTE_PROXY_FETCH_RAISE_ON_ERROR` (type `bool`, default `True`). Whether or not the middleware will
raise an exception if an error occurs while downloading or decoding a request. If `False`, a
warning will be logged and the raw upstream response will be returned upon encountering an error.

* `CRAWLERA_FETCH_DOWNLOAD_SLOT_POLICY` (type `enum.Enum` - `crawlera_fetch.DownloadSlotPolicy`,
* `ZYTE_PROXY_FETCH_DOWNLOAD_SLOT_POLICY` (type `enum.Enum` - `zyte_proxy_fetch.DownloadSlotPolicy`,
default `DownloadSlotPolicy.Domain`).
Possible values are `DownloadSlotPolicy.Domain`, `DownloadSlotPolicy.Single`,
`DownloadSlotPolicy.Default` (Scrapy default). If set to `DownloadSlotPolicy.Domain`, please
consider setting `SCHEDULER_PRIORITY_QUEUE="scrapy.pqueues.DownloaderAwarePriorityQueue"` to
make better usage of concurrency options and avoid delays.

* `CRAWLERA_FETCH_DEFAULT_ARGS` (type `dict`, default `{}`)
Default values to be sent to the Crawlera Fetch API. For instance, set to `{"device": "mobile"}`
* `ZYTE_PROXY_FETCH_DEFAULT_ARGS` (type `dict`, default `{}`)
Default values to be sent to the Smart Proxy Manager Fetch API. For instance, set to `{"device": "mobile"}`
to render all requests with a mobile profile.
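
A minimal sketch of a project's `settings.py` combining the settings above. The API key value is a placeholder, and importing `DownloadSlotPolicy` from the top-level package is an assumption based on the type given for `ZYTE_PROXY_FETCH_DOWNLOAD_SLOT_POLICY`:

```python
# Illustrative sketch only; values are placeholders, not recommended defaults.
from zyte_proxy_fetch import DownloadSlotPolicy  # import location assumed

ZYTE_PROXY_FETCH_ENABLED = True
ZYTE_PROXY_FETCH_APIKEY = "<your API key>"  # placeholder
ZYTE_PROXY_FETCH_DOWNLOAD_SLOT_POLICY = DownloadSlotPolicy.Domain
ZYTE_PROXY_FETCH_DEFAULT_ARGS = {"device": "mobile"}

DOWNLOADER_MIDDLEWARES = {
    "zyte_proxy_fetch.SmartProxyManagerFetchMiddleware": 585,
}

# Suggested above when using the per-domain slot policy
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.DownloaderAwarePriorityQueue"
```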

### Spider attributes

* `crawlera_fetch_enabled` (type `bool`, default `False`). Whether or not the middleware will be enabled.
Takes precedence over the `CRAWLERA_FETCH_ENABLED` setting.
* `zyte_proxy_fetch_enabled` (type `bool`, default `False`). Whether or not the middleware will be enabled.
Takes precedence over the `ZYTE_PROXY_FETCH_ENABLED` setting.
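
A hypothetical spider enabling the middleware for itself through this attribute (spider name and URL are placeholders):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    # Takes precedence over the ZYTE_PROXY_FETCH_ENABLED setting
    zyte_proxy_fetch_enabled = True

    start_urls = ["https://example.org"]

    def parse(self, response):
        self.logger.info("Downloaded %s", response.url)
```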

### Log formatter

Since the URL for outgoing requests is modified by the middleware, by default the logs will show
the URL for the Crawlera endpoint. To revert this behaviour you can enable the provided
the URL for the Smart Proxy Manager endpoint. To revert this behaviour you can enable the provided
log formatter by overriding the [`LOG_FORMATTER`](https://docs.scrapy.org/en/latest/topics/settings.html#log-formatter)
setting:

```
LOG_FORMATTER = "crawlera_fetch.CrawleraFetchLogFormatter"
LOG_FORMATTER = "zyte_proxy_fetch.SmartProxyManagerLogFormatter"
```

Note that the ability to override the error messages for spider and download errors was added
@@ -92,7 +92,7 @@ to the `Request.flags` attribute, which is shown in the logs by default.
## Usage

If the middleware is enabled, by default all requests will be redirected to the specified
Crawlera Fetch endpoint, and modified to comply with the format expected by the Crawlera Fetch API.
Smart Proxy Manager Fetch endpoint, and modified to comply with the format expected by the Smart Proxy Manager Fetch API.
The three basic processed arguments are `method`, `url` and `body`.
For instance, the following request:

@@ -103,7 +103,7 @@ Request(url="https://httpbin.org/post", method="POST", body="foo=bar")
will be converted to:

```python
Request(url="<Crawlera Fetch API endpoint>", method="POST",
Request(url="<Smart Proxy Manager Fetch API endpoint>", method="POST",
body='{"url": "https://httpbin.org/post", "method": "POST", "body": "foo=bar"}',
headers={"Authorization": "Basic <derived from APIKEY>",
"Content-Type": "application/json",
@@ -112,12 +112,12 @@ Request(url="<Crawlera Fetch API endpoint>", method="POST",

### Additional arguments

Additional arguments could be specified under the `crawlera_fetch.args` `Request.meta` key. For instance:
Additional arguments could be specified under the `zyte_proxy_fetch.args` `Request.meta` key. For instance:

```python
Request(
url="https://example.org",
meta={"crawlera_fetch": {"args": {"region": "us", "device": "mobile"}}},
meta={"zyte_proxy_fetch": {"args": {"region": "us", "device": "mobile"}}},
)
```

@@ -127,26 +127,26 @@ is translated into the following body:
'{"url": "https://example.org", "method": "GET", "body": "", "region": "us", "device": "mobile"}'
```

Arguments set for a specific request through the `crawlera_fetch.args` key override those
set with the `CRAWLERA_FETCH_DEFAULT_ARGS` setting.
Arguments set for a specific request through the `zyte_proxy_fetch.args` key override those
set with the `ZYTE_PROXY_FETCH_DEFAULT_ARGS` setting.

### Accessing original request and raw Crawlera response
### Accessing original request and raw Zyte Smart Proxy Manager response

The `url`, `method`, `headers` and `body` attributes of the original request are available under
the `crawlera_fetch.original_request` `Response.meta` key.
the `zyte_proxy_fetch.original_request` `Response.meta` key.

The `status`, `headers` and `body` attributes of the upstream Crawlera response are available under
the `crawlera_fetch.upstream_response` `Response.meta` key.
The `status`, `headers` and `body` attributes of the upstream Smart Proxy Manager response are available under
the `zyte_proxy_fetch.upstream_response` `Response.meta` key.
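
For illustration, a callback could inspect both keys as sketched below; the exact dictionary layout of each entry is an assumption based on the attribute names listed above:

```python
def parse(self, response):
    # Original request as recorded by the middleware (assumed dict layout)
    original = response.meta["zyte_proxy_fetch"]["original_request"]
    # Raw upstream Smart Proxy Manager response (assumed dict layout)
    upstream = response.meta["zyte_proxy_fetch"]["upstream_response"]
    self.logger.info("Original URL: %s", original["url"])
    self.logger.info("Upstream status: %s", upstream["status"])
```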

### Skipping requests

You can instruct the middleware to skip a specific request by setting the `crawlera_fetch.skip`
You can instruct the middleware to skip a specific request by setting the `zyte_proxy_fetch.skip`
[Request.meta](https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta)
key:

```python
Request(
url="https://example.org",
meta={"crawlera_fetch": {"skip": True}},
meta={"zyte_proxy_fetch": {"skip": True}},
)
```
2 changes: 0 additions & 2 deletions crawlera_fetch/__init__.py

This file was deleted.

12 changes: 6 additions & 6 deletions setup.py
@@ -6,15 +6,15 @@


setuptools.setup(
name="scrapy-crawlera-fetch",
name="scrapy-zyte-proxy-fetch",
version="0.0.1",
license="BSD",
description="Scrapy downloader middleware to interact with Crawlera Simple Fetch API",
description="Scrapy downloader middleware to interact with Zyte Smart Proxy Manager Fetch API",
long_description=long_description,
author="Scrapinghub",
author_email="info@scrapinghub.com",
url="https://github.com/scrapy-plugins/scrapy-crawlera-fetch",
packages=["crawlera_fetch"],
author="Zyte",
author_email="opensource@zyte.com",
url="https://github.com/scrapy-plugins/scrapy-zyte-proxy-fetch",
packages=["zyte_proxy_fetch"],
classifiers=[
"Development Status :: 1 - Planning",
"License :: OSI Approved :: BSD License",
8 changes: 4 additions & 4 deletions tests/data/__init__.py
@@ -1,6 +1,6 @@
SETTINGS = {
"CRAWLERA_FETCH_ENABLED": True,
"CRAWLERA_FETCH_URL": "https://example.org",
"CRAWLERA_FETCH_APIKEY": "secret-key",
"CRAWLERA_FETCH_APIPASS": "secret-pass",
"ZYTE_PROXY_FETCH_ENABLED": True,
"ZYTE_PROXY_FETCH_URL": "https://example.org",
"ZYTE_PROXY_FETCH_APIKEY": "secret-key",
"ZYTE_PROXY_FETCH_APIPASS": "secret-pass",
}
18 changes: 9 additions & 9 deletions tests/data/requests.py
@@ -15,7 +15,7 @@ def get_test_requests():
url="https://httpbin.org/anything",
method="GET",
meta={
"crawlera_fetch": {
"zyte_proxy_fetch": {
"args": {
"render": "no",
"region": "us",
@@ -26,19 +26,19 @@
},
)
expected1 = Request(
url=SETTINGS["CRAWLERA_FETCH_URL"],
url=SETTINGS["ZYTE_PROXY_FETCH_URL"],
callback=foo_spider.foo_callback,
method="POST",
headers={
"Authorization": basic_auth_header(
SETTINGS["CRAWLERA_FETCH_APIKEY"], SETTINGS["CRAWLERA_FETCH_APIPASS"]
SETTINGS["ZYTE_PROXY_FETCH_APIKEY"], SETTINGS["ZYTE_PROXY_FETCH_APIPASS"]
),
"Content-Type": "application/json",
"Accept": "application/json",
"X-Crawlera-JobId": "1/2/3",
},
meta={
"crawlera_fetch": {
"zyte_proxy_fetch": {
"args": {
"render": "no",
"region": "us",
@@ -72,22 +72,22 @@
original2 = FormRequest(
url="https://httpbin.org/post",
callback=foo_spider.foo_callback,
meta={"crawlera_fetch": {"args": {"device": "desktop"}}},
meta={"zyte_proxy_fetch": {"args": {"device": "desktop"}}},
formdata={"foo": "bar"},
)
expected2 = FormRequest(
url=SETTINGS["CRAWLERA_FETCH_URL"],
url=SETTINGS["ZYTE_PROXY_FETCH_URL"],
method="POST",
headers={
"Authorization": basic_auth_header(
SETTINGS["CRAWLERA_FETCH_APIKEY"], SETTINGS["CRAWLERA_FETCH_APIPASS"]
SETTINGS["ZYTE_PROXY_FETCH_APIKEY"], SETTINGS["ZYTE_PROXY_FETCH_APIPASS"]
),
"Content-Type": "application/json",
"Accept": "application/json",
"X-Crawlera-JobId": "1/2/3",
},
meta={
"crawlera_fetch": {
"zyte_proxy_fetch": {
"args": {"device": "desktop"},
"original_request": request_to_dict(original2, spider=foo_spider),
"timing": {"start_ts": mocked_time()},
@@ -116,7 +116,7 @@ def get_test_requests():
"original": Request(
url="https://example.org",
method="HEAD",
meta={"crawlera_fetch": {"skip": True}},
meta={"zyte_proxy_fetch": {"skip": True}},
),
"expected": None,
}
24 changes: 12 additions & 12 deletions tests/data/responses.py
@@ -15,7 +15,7 @@
test_responses.append(
{
"original": HtmlResponse(
url=SETTINGS["CRAWLERA_FETCH_URL"],
url=SETTINGS["ZYTE_PROXY_FETCH_URL"],
status=200,
headers={
"Content-Type": "application/json",
@@ -26,9 +26,9 @@
"Connection": "close",
},
request=Request(
url=SETTINGS["CRAWLERA_FETCH_URL"],
url=SETTINGS["ZYTE_PROXY_FETCH_URL"],
meta={
"crawlera_fetch": {
"zyte_proxy_fetch": {
"timing": {"start_ts": mocked_time()},
"original_request": request_to_dict(
Request("https://fake.host.com"),
@@ -51,7 +51,7 @@
test_responses.append(
{
"original": HtmlResponse(
url=SETTINGS["CRAWLERA_FETCH_URL"],
url=SETTINGS["ZYTE_PROXY_FETCH_URL"],
status=200,
headers={
"Content-Type": "application/json",
@@ -62,9 +62,9 @@
"Connection": "close",
},
request=Request(
url=SETTINGS["CRAWLERA_FETCH_URL"],
url=SETTINGS["ZYTE_PROXY_FETCH_URL"],
meta={
"crawlera_fetch": {
"zyte_proxy_fetch": {
"timing": {"start_ts": mocked_time()},
"original_request": request_to_dict(
Request("https://httpbin.org/get"),
@@ -97,7 +97,7 @@
test_responses.append(
{
"original": HtmlResponse(
url=SETTINGS["CRAWLERA_FETCH_URL"],
url=SETTINGS["ZYTE_PROXY_FETCH_URL"],
status=200,
headers={
"Content-Type": "application/json",
@@ -108,9 +108,9 @@
"Connection": "close",
},
request=Request(
url=SETTINGS["CRAWLERA_FETCH_URL"],
url=SETTINGS["ZYTE_PROXY_FETCH_URL"],
meta={
"crawlera_fetch": {
"zyte_proxy_fetch": {
"timing": {"start_ts": mocked_time()},
"original_request": request_to_dict(
Request("https://example.org"),
@@ -164,17 +164,17 @@
test_responses.append(
{
"original": HtmlResponse(
url=SETTINGS["CRAWLERA_FETCH_URL"],
url=SETTINGS["ZYTE_PROXY_FETCH_URL"],
status=200,
headers={
"Content-Type": "application/json",
"Content-Encoding": "gzip",
"Date": "Fri, 24 Apr 2020 18:22:10 GMT",
},
request=Request(
url=SETTINGS["CRAWLERA_FETCH_URL"],
url=SETTINGS["ZYTE_PROXY_FETCH_URL"],
meta={
"crawlera_fetch": {
"zyte_proxy_fetch": {
"timing": {"start_ts": mocked_time()},
"original_request": request_to_dict(
Request("http://httpbin.org/ip"),