integration for web-poet's support on additional requests and Meta #62
Merged
77 commits
d43e8e6
add basic integration for web-poet's support on additional requests
BurnzZ 9bc60d0
create provider for web-poet's new HttpClient and GenericRequest
BurnzZ 7505176
enable tox dep in draft branch of web-poet for CI
BurnzZ c30546f
use the new status and headers from ResponseData
BurnzZ c7918eb
accept either web_poet.GenericRequest and scrapy.Request
BurnzZ 7f539bb
create provider for web_poet.page_inputs.Meta
BurnzZ ed2c489
use 'po_args' inside a Request meta instead of using the entire meta
BurnzZ 0bd3b80
use web-poet's new Request container
BurnzZ 5488504
sync dep to WIP branch to run tox tests
BurnzZ b5e9c56
add tests
BurnzZ 4dd19b8
remove ContextVar approach and use Dependency Injection in Provider i…
BurnzZ 2a155f5
update CHANGELOG to new support on additional requests
BurnzZ e8f4c10
add docs for supporting web-poet's HttpClient and Meta
BurnzZ 8340ced
Update to use HttpReponse which replaces ResponseData
BurnzZ ae4d8a5
remove unused imports
BurnzZ ba0d8fe
add basic integration for web-poet's support on additional requests
BurnzZ 81df664
create provider for web-poet's new HttpClient and GenericRequest
BurnzZ 1316090
add tests
BurnzZ f8a7efe
remove ContextVar approach and use Dependency Injection in Provider i…
BurnzZ cc97213
update CHANGELOG to new support on additional requests
BurnzZ a25b61e
update callback_for() to have async support
BurnzZ eb3e837
add docs mentioning async support in callback_for()
BurnzZ 7b2d4cf
force callback_for() to have 'is_async' to be keyword-only param
BurnzZ 5c4326f
update async test spider to use async PO as well
BurnzZ b79dfa8
remove 'is_async' param in callback_for
BurnzZ a74d264
remove duplicated test
BurnzZ af0b802
remove unrelated file
BurnzZ d23a169
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ 42bf6da
update imports after web_poet refactoring
BurnzZ 9ce67ae
Merge branch 'master' of http://github.com/scrapinghub/scrapy-poet in…
BurnzZ 3983ed6
fix duplicated entry in CHANGELOG
BurnzZ d76db34
Remove implementation details about callback_for() in the docs
BurnzZ f1126fb
remove else block in callback_for()
BurnzZ c2bfe89
Update docs/intro/basic-tutorial.rst
BurnzZ 10b56e3
Merge pull request #66 from scrapinghub/async_callback_for
kmike 75d5a13
Fix tests
Gallaecio 1df25a8
Handle additional request IgnoreError as per web-poet #38
Gallaecio 2d6da5a
Fix the documentation build
Gallaecio 063ef20
Support a non-asyncio Twisted reactor
Gallaecio 2a9e8d0
Fix tests
Gallaecio f47816b
backend: handle unexpected exceptions as HttpRequestError
Gallaecio c1e8b93
Additional requests: prevent HEAD redirects
Gallaecio e7989ed
Additional requests: do not filter out duplicate requests
Gallaecio 6252526
Make the latest tests compatible with Pytho 3.7
Gallaecio 0358146
po_args → po_meta
Gallaecio 0146c83
Document the peculiarities of additional request handling
Gallaecio b4ce395
Test both asyncio and non-asyncio reactors
Gallaecio bedbb68
GitHub Actions: test both reactors
Gallaecio 98a47ba
Use raise-from syntax for additional request exceptions
Gallaecio 4d49428
Fix syntax error
Gallaecio 1d7ef88
Move request conversion into a function
Gallaecio 9f9dfa1
On request conversion, silently ignore unknown attributes
Gallaecio 6f5218f
Contextualize additional request exception handling
Gallaecio 850ad4e
Pass user-defined encoding on response conversion
Gallaecio 5f483c4
Support non-string values as meta keys
Gallaecio 5752664
Merge remote-tracking branch 'origin/master' into po-additional-requests
Gallaecio 19a3283
Remove request convertion TypeError handling
Gallaecio 4dcca57
Provide integration tests for good and bad additional responses
Gallaecio aa09109
Meta → PageParams
Gallaecio dcb6716
Implement test_additional_requests_connection_issue
Gallaecio 381eb25
Implement test_additional_requests_ignored_request
Gallaecio 3429d49
Implement test_additional_requests_unhandled_downloader_middleware_ex…
Gallaecio 902d41c
Fix pre-3.9 syntax error
Gallaecio 776b768
Implement test_additional_requests_dont_filter
Gallaecio a1c52f8
Remove unneeded test
Gallaecio 64eeb72
Update docs/intro/advanced-tutorial.rst
Gallaecio 5210b2a
Update docs/intro/advanced-tutorial.rst
Gallaecio 8aecfab
backend → download_func
Gallaecio dc36f5a
test_additional_requests_dont_filter: ensure additional requests are …
Gallaecio b839eb1
Cast url from HttpRequest and HttpResponse before comparisons
Gallaecio 995e9b9
fix examples when using callback_for()
BurnzZ 6c07ce4
Merge pull request #75 from scrapinghub/callback-for-docs
kmike be786fc
backend → downloader
Gallaecio ded8354
Merge remote-tracking branch 'origin/po-additional-requests' into po-…
Gallaecio 3398dc7
Update instal requirements
Gallaecio 566e727
GitHub Actions: test minimum dependency versions
Gallaecio 98ce454
Revert mypy changes
Gallaecio
@@ -0,0 +1,168 @@
.. _`intro-advanced-tutorial`:

=================
Advanced Tutorial
=================

This section covers the **web-poet** features that **scrapy-poet** supports:

* ``web_poet.HttpClient``
* ``web_poet.PageParams``

**scrapy-poet** supports them mainly by implementing **providers** for them:

* :class:`scrapy_poet.page_input_providers.HttpClientProvider`
* :class:`scrapy_poet.page_input_providers.PageParamsProvider`

.. _`intro-additional-requests`:

Additional Requests
===================

Page Objects that make additional requests need nothing special from the
spider. They work as-is because
:class:`scrapy_poet.page_input_providers.HttpClientProvider` is enabled out of
the box, and it supplies the Page Object with the ``web_poet.HttpClient``
instance it needs.

The HTTP client implementation that **scrapy-poet** provides to
``web_poet.HttpClient`` handles requests as follows:

- Requests go through downloader middlewares, but they do not go through
  spider middlewares or through the scheduler.

- Duplicate requests are not filtered out.

- In line with the web-poet specification for additional requests,
  ``Request.meta['dont_redirect']`` is set to ``True`` for requests with the
  ``HEAD`` HTTP method.

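The last two rules can be pictured as a small conversion step. The sketch below is illustrative only: ``additional_request_kwargs`` is a hypothetical helper, not part of scrapy-poet, and it only builds the keyword arguments one might pass to ``scrapy.Request``. The names ``dont_filter`` and ``meta['dont_redirect']`` are real Scrapy options, but whether scrapy-poet sets them exactly this way internally is an assumption.

```python
from typing import Any, Dict


def additional_request_kwargs(url: str, method: str = "GET") -> Dict[str, Any]:
    """Build keyword arguments for a scrapy.Request-like object.

    Hypothetical helper for illustration; not scrapy-poet's actual code.
    """
    kwargs: Dict[str, Any] = {
        "url": url,
        "method": method.upper(),
        # Assumption: duplicates are kept by opting out of duplicate filtering.
        "dont_filter": True,
        "meta": {},
    }
    if kwargs["method"] == "HEAD":
        # HEAD requests must not follow redirects.
        kwargs["meta"]["dont_redirect"] = True
    return kwargs


head = additional_request_kwargs("https://example.com/item", method="head")
get = additional_request_kwargs("https://example.com/item")
print(head["meta"])
print(get["meta"])
```

Only the ``HEAD`` request carries ``dont_redirect``; other methods keep Scrapy's usual redirect handling in this sketch.
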
Suppose we have the following Page Object:

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            response: web_poet.HttpResponse = await self.http_client.get(
                f"https://api.example.com/v2/images?id={item['product_id']}"
            )
            item["images"] = response.css(".product-images img::attr(src)").getall()
            return item

It can be used directly inside the spider:

.. code-block:: python

    import scrapy

    # ProductPage is the Page Object defined above.


    class ProductSpider(scrapy.Spider):

        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        async def parse(self, response, page: ProductPage):
            return await page.to_item()

Note that ``parse()`` must be an ``async`` method, because the ``to_item()``
method of the Page Object it uses is an ``async`` method and has to be awaited.

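To see why the callback has to await ``to_item()``, here is a minimal standard-library sketch. ``FakePage`` is a hypothetical stand-in for a Page Object and involves no Scrapy or web-poet code; it only shows that calling an ``async`` method without ``await`` produces a coroutine object rather than the item.

```python
import asyncio
import inspect


class FakePage:
    """Hypothetical stand-in for a Page Object with an async to_item()."""

    async def to_item(self):
        return {"name": "example"}


async def parse(page):
    # Awaiting runs the coroutine and gives back the item.
    return await page.to_item()


page = FakePage()

# Calling without awaiting only creates a coroutine object.
pending = page.to_item()
is_coroutine = inspect.iscoroutine(pending)
pending.close()  # avoid a "coroutine was never awaited" warning

item = asyncio.run(parse(page))
print(is_coroutine, item)
```
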
Page params
===========

``web_poet.PageParams`` allows the Scrapy spider to pass arbitrary information
into the Page Object.

Suppose we update the earlier Page Object so that the additional request is
optional. A page parameter acts as a switch that changes the behavior of the
Page Object:

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient
        page_params: web_poet.PageParams

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            if self.page_params.get("enable_extracting_all_images"):
                response: web_poet.HttpResponse = await self.http_client.get(
                    f"https://api.example.com/v2/images?id={item['product_id']}"
                )
                item["images"] = response.css(".product-images img::attr(src)").getall()

            return item

The spider can pass the ``enable_extracting_all_images`` page parameter into
the Page Object through **Scrapy's** ``Request.meta`` attribute. Specifically,
the ``dict`` stored under the ``page_params`` key of ``Request.meta`` is passed
into ``web_poet.PageParams``.

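The hand-off described above can be sketched with plain dictionaries. This is not scrapy-poet's provider code: ``page_params_from_meta`` is a hypothetical helper, a plain ``dict`` stands in for ``web_poet.PageParams``, and falling back to an empty mapping when the key is absent is an assumption of the sketch.

```python
from typing import Any, Dict


def page_params_from_meta(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the dict stored under meta["page_params"].

    Hypothetical helper; a plain dict stands in for web_poet.PageParams.
    Falling back to an empty dict is an assumption of this sketch.
    """
    return dict(meta.get("page_params", {}))


with_flag = page_params_from_meta(
    {"page_params": {"enable_extracting_all_images": True}}
)
without_flag = page_params_from_meta({})

# .get() returns None for a missing key, so a truthiness check on the flag
# skips the additional request when the spider sets nothing.
print(with_flag.get("enable_extracting_all_images"))
print(without_flag.get("enable_extracting_all_images"))
```
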
Let's see it in action:

.. code-block:: python

    import scrapy

    # ProductPage is the Page Object defined above.


    class ProductSpider(scrapy.Spider):

        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url=url,
                    callback=self.parse,
                    meta={"page_params": {"enable_extracting_all_images": True}},
                )

        async def parse(self, response, page: ProductPage):
            return await page.to_item()
@@ -1,3 +1,3 @@
-Scrapy >= 2.1.0
+Scrapy >= 2.6.0
 Sphinx >= 3.0.3
 sphinx-rtd-theme >= 0.4