Add SwitchPage #103
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,5 @@ | ||
| .. _advanced-requests: | ||
| .. _page-objects: | ||
|
|
||
| =================== | ||
| Additional Requests | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,154 @@ | ||
| .. _layouts: | ||
|
|
||
| =============== | ||
| Webpage layouts | ||
| =============== | ||
|
|
||
| Different webpages may show the same *type* of page, but different *data*. For | ||
| example, in an e-commerce website there are usually many product detail pages, | ||
| each showing data from a different product. | ||
|
|
||
| The code that those webpages have in common is their **webpage layout**. | ||
|
|
||
| Coding for webpage layouts | ||
| ========================== | ||
|
|
||
| Webpage layouts should inform how you organize your data extraction code. | ||
|
|
||
| A good practice to keep your code maintainable is to have a separate :ref:`page | ||
| object class <page-objects>` per webpage layout. | ||
|
|
||
| Trying to support multiple webpage layouts with the same page object class can | ||
| make your class hard to maintain. | ||
|
|
||
|
|
||
| Identifying webpage layouts | ||
| =========================== | ||
|
|
||
| There is no precise way to determine whether 2 webpages have the same or a | ||
| different webpage layout. You must decide based on what you know, and be ready | ||
| to adapt if things change. | ||
|
|
||
| It is also often difficult to identify webpage layouts before you start writing | ||
| extraction code. Completely different webpage layouts can have the same look, | ||
| and very similar webpage layouts can look completely different. | ||
|
|
||
| It can be a good starting point to assume that, for a given combination of | ||
| data type and website, there is going to be a single webpage layout. For | ||
| example, assume that all product pages of a given e-commerce website will have | ||
| the same webpage layout. | ||
|
|
||
| Then, as you write a :ref:`page object class <page-objects>` for that webpage | ||
| layout, you may find out more, and adapt. | ||
|
|
||
| When the same piece of information must be extracted from a different place for | ||
| different webpages, that is a sign that you may be dealing with more than 1 | ||
| webpage layout. For example, if on some webpages the product name is in an | ||
| ``h1`` element, but on some webpages it is in an ``h2`` element, chances are | ||
| there are at least 2 different webpage layouts. | ||
|
|
||
| However, whether you continue to work as if everything uses the same webpage | ||
| layout, or split your page object class into 2 page object classes, each | ||
| targeting one of the webpage layouts you have found, is entirely up to you. | ||
|
|
||
| Ask yourself: Is supporting all webpage layout differences making your page | ||
| object class implementation only a few lines of code longer, or is it making it | ||
| an unmaintainable bowl of spaghetti code? | ||
|
|
||
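As an illustration of the trade-off described above, here is a minimal sketch of a single class that covers both the ``h1`` and the ``h2`` layout with a fallback. It uses only the Python standard library rather than web-poet, and the names `first_text` and `ProductPage` are hypothetical. While the fallback stays this small, one class is probably fine; once several fields need this kind of branching, splitting into one class per layout is likely easier to maintain.

```python
from html.parser import HTMLParser
from typing import Optional


class _FirstText(HTMLParser):
    """Collect the text of the first element with the given tag name."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag
        self.inside = False
        self.text: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and self.text is None:
            self.inside = True

    def handle_data(self, data):
        if self.inside and self.text is None:
            self.text = data.strip()

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False


def first_text(html: str, tag: str) -> Optional[str]:
    """Return the text of the first <tag> element, or None if there is none."""
    parser = _FirstText(tag)
    parser.feed(html)
    return parser.text


class ProductPage:
    """Hypothetical page object class that tolerates two webpage layouts."""

    def __init__(self, html: str) -> None:
        self.html = html

    @property
    def name(self) -> Optional[str]:
        # One layout keeps the product name in <h1>, the other in <h2>.
        return first_text(self.html, "h1") or first_text(self.html, "h2")
```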
|
|
||
| Mapping webpage layouts | ||
| ======================= | ||
|
|
||
| Once you have written a :ref:`page object class <page-objects>` for a webpage | ||
| layout, you need to make it so that your page object class is used for webpages | ||
| that use that webpage layout. | ||
|
|
||
| URL patterns | ||
| ------------ | ||
|
|
||
| Webpage layouts are often associated with specific URL patterns. For example, all | ||
| the product detail pages of an e-commerce website usually have similar URLs, | ||
| such as ``https://example.com/product/<product ID>``. | ||
|
|
||
| When that is the case, you can :ref:`associate your page object class with the | ||
| corresponding URL pattern <rules-intro>`. | ||
|
|
||
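To make the idea of mapping URL patterns to page object classes concrete, the following is a rough standard-library sketch. It is not web-poet's rule mechanism: `RULES`, `page_class_for`, both placeholder classes, and the category URL pattern are assumptions made up for this illustration.

```python
import re
from typing import Optional, Type


class ProductPage:
    """Hypothetical page object class for product detail pages."""


class CategoryPage:
    """Hypothetical page object class for category listing pages."""


# Each rule pairs a URL pattern with the class that handles that layout.
RULES = [
    (re.compile(r"^https://example\.com/product/[^/]+$"), ProductPage),
    (re.compile(r"^https://example\.com/category/[^/]+$"), CategoryPage),
]


def page_class_for(url: str) -> Optional[Type]:
    """Return the first class whose pattern matches the URL, or None."""
    for pattern, page_class in RULES:
        if pattern.match(url):
            return page_class
    return None
```

A lookup such as `page_class_for("https://example.com/product/123")` would select `ProductPage`, while a URL that matches no rule yields `None`.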
|
|
||
| .. _switch: | ||
|
|
||
| Switch page object classes | ||
| -------------------------- | ||
|
|
||
| Sometimes it is impossible to know, based on the target URL, which webpage | ||
| layout you are getting. For example, during `A/B testing`_, you could get a | ||
| random webpage layout on every request. | ||
|
|
||
| .. _A/B testing: https://en.wikipedia.org/wiki/A/B_testing | ||
|
|
||
| For these scenarios, we recommend that you create a special “switch” page | ||
| object class, and use it to switch to the right page object class at run time | ||
| based on the input you receive. | ||
|
|
||
| Your switch page object class should: | ||
|
|
||
| #. Request all the inputs that the candidate page object classes may need. | ||
|
|
||
|    For example, if there are 2 candidate page object classes, and 1 of them | ||
|    requires browser HTML as input, while the other one requires an HTTP | ||
|    response, your switch page object class must request both. | ||
|
|
||
|    If combining different inputs is a problem, consider refactoring the | ||
|    candidate page object classes to require similar inputs. | ||
|
|
||
| #. In its :meth:`~web_poet.pages.ItemPage.to_item` method: | ||
|
|
||
|    #. Determine, based on the inputs, which candidate page object class to | ||
|       use. | ||
|
|
||
|    #. Create an instance of the selected candidate page object class with the | ||
|       necessary input, call its :meth:`~web_poet.pages.ItemPage.to_item` | ||
|       method, and return its result. | ||
|
|
||
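The steps above can be sketched without any base class. This is a simplified, standard-library illustration only: plain dataclasses stand in for page object classes, an `html` string stands in for the real inputs, and `_between` is a naive hypothetical helper rather than a proper HTML parser.

```python
import asyncio
from dataclasses import dataclass


def _between(html: str, start: str, end: str) -> str:
    """Naive extraction helper, good enough for this illustration only."""
    return html.split(start, 1)[1].split(end, 1)[0]


@dataclass
class Header:
    text: str


@dataclass
class H1Page:
    html: str  # stand-in for an HTTP response body

    async def to_item(self) -> Header:
        return Header(text=_between(self.html, "<h1>", "</h1>"))


@dataclass
class H2Page:
    html: str

    async def to_item(self) -> Header:
        return Header(text=_between(self.html, "<h2>", "</h2>"))


@dataclass
class HeaderSwitchPage:
    # Step 1: request every input that any candidate may need.
    html: str

    async def to_item(self) -> Header:
        # Step 2a: choose the candidate class based on the inputs.
        page_class = H1Page if "<h1>" in self.html else H2Page
        # Step 2b: instantiate it with its inputs and delegate to it.
        return await page_class(html=self.html).to_item()


item = asyncio.run(HeaderSwitchPage(html="<body><h2>b</h2></body>").to_item())
```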
| You may use :class:`~web_poet.pages.SwitchPage` as a base class for your switch | ||
| page object class, so you only need to implement the | ||
| :meth:`~web_poet.pages.SwitchPage.switch` method that determines which | ||
| candidate page object class to use. For example: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
|     import attrs | ||
|     from web_poet import field, handle_urls, HttpResponse, Injectable, ItemPage, SwitchPage | ||
|
|
||
|     @attrs.define | ||
|     class Header: | ||
|         text: str | ||
|
|
||
|     @attrs.define | ||
|     class H1Page(ItemPage[Header]): | ||
|         response: HttpResponse | ||
|
|
||
|         @field | ||
|         def text(self) -> str: | ||
|             return self.response.css("h1::text").get() | ||
|
|
||
|     @attrs.define | ||
|     class H2Page(ItemPage[Header]): | ||
|         response: HttpResponse | ||
|
|
||
|         @field | ||
|         def text(self) -> str: | ||
|             return self.response.css("h2::text").get() | ||
|
|
||
|     @handle_urls("example.com") | ||
|     @attrs.define | ||
|     class HeaderSwitchPage(SwitchPage[Header]): | ||
|         response: HttpResponse | ||
|
|
||
|         async def switch(self) -> Injectable: | ||
|             if self.response.css("h1::text"): | ||
|                 return H1Page | ||
|             return H2Page | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -10,6 +10,7 @@ | |
| ItemT, | ||
| ItemWebPage, | ||
| Returns, | ||
| SwitchPage, | ||
| WebPage, | ||
| is_injectable, | ||
| ) | ||
|
|
@@ -33,6 +34,74 @@ def to_item(self) -> dict: | |
| } | ||
|
|
||
|
|
||
| @pytest.mark.asyncio | ||
| async def test_switch_page_object(): | ||
|
|
||
|     @attrs.define | ||
|     class Header: | ||
|         text: str | ||
|
|
||
|     @attrs.define | ||
|     class H1Page(ItemPage[Header]): | ||
|         response: HttpResponse | ||
|
|
||
|         @field | ||
|         def text(self) -> str: | ||
|             return self.response.css("h1::text").get() | ||
|
|
||
|     @attrs.define | ||
|     class H2Page(ItemPage[Header]): | ||
|         response: HttpResponse | ||
|
|
||
|         @field | ||
|         def text(self) -> str: | ||
|             return self.response.css("h2::text").get() | ||
|
|
||
|     @attrs.define | ||
|     class HeaderSwitchPage(SwitchPage[Header]): | ||
|         response: HttpResponse | ||
|
|
||
|         async def switch(self) -> Injectable: | ||
|             if self.response.css("h1::text"): | ||
|                 return H1Page | ||
|             return H2Page | ||
|
|
||
|     html_h1 = b""" | ||
|     <!DOCTYPE html> | ||
|     <html lang="en"> | ||
|     <head> | ||
|     <title>h1</title> | ||
|     </head> | ||
|     <body> | ||
|     <h1>a</h1> | ||
|     </body> | ||
|     </html> | ||
|     """ | ||
|     html_h2 = b""" | ||
|     <!DOCTYPE html> | ||
|     <html lang="en"> | ||
|     <head> | ||
|     <title>h2</title> | ||
|     </head> | ||
|     <body> | ||
|     <h2>b</h2> | ||
|     </body> | ||
|     </html> | ||
|     """ | ||
|
|
||
|     response1 = HttpResponse("https://example.com", body=html_h1) | ||
|     response2 = HttpResponse("https://example.com", body=html_h2) | ||
|
|
||
|     item1 = await HeaderSwitchPage(response=response1).to_item() | ||
|     item2 = await HeaderSwitchPage(response=response2).to_item() | ||
|
|
||
|     assert item1.text == "a" | ||
|     assert item2.text == "b" | ||
|
|
||
|
|
||
|
Contributor
Let's also add a test for use cases when a

    class MyMultiLayoutPage(MultiLayoutPage[SomeItem]):
        response: HttpResponse
        layout_page_us: LayoutPageUS
        layout_page_uk: LayoutPageUK

        async def get_layout(self) -> ItemPage[SomeItem]:
            if self.response.css(".origin::text").get() == "us":
                return self.layout_page_us.get_layout()
            return self.layout_page_uk.get_layout()

Might also be worth creating a doc about this as well.
||
| def test_web_page_object(book_list_html_response) -> None: | ||
| class MyWebPage(WebPage): | ||
| def to_item(self) -> dict: # type: ignore | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -68,6 +68,28 @@ async def to_item(self) -> ItemT: | |
| ) | ||
|
|
||
|
|
||
| class SwitchPage(Injectable, Returns[ItemT]): | ||
|     """Base class for :ref:`switch page object classes <switch>`. | ||
|
|
||
|     Subclasses must reimplement the :meth:`switch` method. | ||
|     """ | ||
|
|
||
|     @abc.abstractmethod | ||
|     async def switch(self) -> Injectable: | ||
|         """Return the right :ref:`page object class <page-objects>` based on | ||
|         the received input.""" | ||
|         raise NotImplementedError | ||
|
|
||
|     async def to_item(self) -> ItemT: | ||
|         """Create an instance of the class that :meth:`switch` returns with the | ||
|         required input, and return the output of its | ||
|         :meth:`~web_poet.pages.ItemPage.to_item` method.""" | ||
|         page_object_class = await self.switch() | ||
|         # page_object = page_object_class(...)  # TODO: pass the right inputs | ||
|         page_object = page_object_class(response=self.response) | ||
|         return await page_object.to_item() | ||
|
|
||
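The TODO in `to_item` above concerns passing each candidate only the inputs it declares, instead of hard-coding `response=self.response`. One possible approach, sketched here with standard-library dataclasses, is to copy same-named attributes from the switch page object. This is not what the PR implements: `build_candidate` and the three classes are hypothetical, and the sketch assumes that candidates declare their inputs as fields whose names match those on the switch page object. With attrs classes, `attrs.fields` could play the role of `dataclasses.fields`.

```python
import dataclasses
from typing import Any, Type


def build_candidate(switch_page: Any, candidate_class: Type) -> Any:
    """Instantiate candidate_class with the same-named inputs of switch_page."""
    kwargs = {
        f.name: getattr(switch_page, f.name)
        for f in dataclasses.fields(candidate_class)
    }
    return candidate_class(**kwargs)


@dataclasses.dataclass
class NeedsResponse:
    response: str


@dataclasses.dataclass
class NeedsBrowserHtml:
    browser_html: str


@dataclasses.dataclass
class Switch:
    # The switch page object requests the union of the candidates' inputs.
    response: str
    browser_html: str


switch = Switch(response="raw", browser_html="rendered")
```

Calling `build_candidate(switch, NeedsResponse)` passes only `response`, while `build_candidate(switch, NeedsBrowserHtml)` passes only `browser_html`.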
|
|
||
| @attr.s(auto_attribs=True) | ||
| class WebPage(ItemPage[ItemT], ResponseShortcutsMixin): | ||
| """Base Page Object which requires :class:`~.HttpResponse` | ||
|
|
||