Add-on (#216)
Gallaecio authored Jan 17, 2025
1 parent c62ee50 commit 2f75c54
Showing 19 changed files with 286 additions and 136 deletions.
36 changes: 23 additions & 13 deletions README.rst
@@ -27,6 +27,8 @@ scrapy-poet
With ``scrapy-poet`` it is possible to make a single spider that supports many sites with
different layouts.

Requires **Python 3.9+** and **Scrapy >= 2.6.0**.

Read the `documentation <https://scrapy-poet.readthedocs.io>`_ for more information.

License is BSD 3-clause.
@@ -48,24 +50,32 @@ Installation
pip install scrapy-poet
Requires **Python 3.9+** and **Scrapy >= 2.6.0**.

Usage in a Scrapy Project
=========================

Add the following inside Scrapy's ``settings.py`` file:

.. code-block:: python
DOWNLOADER_MIDDLEWARES = {
"scrapy_poet.InjectionMiddleware": 543,
"scrapy.downloadermiddlewares.stats.DownloaderStats": None,
"scrapy_poet.DownloaderStatsMiddleware": 850,
}
SPIDER_MIDDLEWARES = {
"scrapy_poet.RetryMiddleware": 275,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_poet.ScrapyPoetRequestFingerprinter"
- Scrapy ≥ 2.10:

.. code-block:: python
ADDONS = {
"scrapy_poet.Addon": 300,
}
- Scrapy < 2.10:

.. code-block:: python
DOWNLOADER_MIDDLEWARES = {
"scrapy_poet.InjectionMiddleware": 543,
"scrapy.downloadermiddlewares.stats.DownloaderStats": None,
"scrapy_poet.DownloaderStatsMiddleware": 850,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_poet.ScrapyPoetRequestFingerprinter"
SPIDER_MIDDLEWARES = {
"scrapy_poet.RetryMiddleware": 275,
}
Developing
==========
17 changes: 13 additions & 4 deletions docs/api_reference.rst
@@ -11,13 +11,22 @@ API
:members:
:no-special-members:

Injection Middleware
====================
Scrapy components
=================

.. autoclass:: scrapy_poet.DownloaderStatsMiddleware
:members:

.. autoclass:: scrapy_poet.InjectionMiddleware
:members:

.. autoclass:: scrapy_poet.RetryMiddleware
:members:

.. automodule:: scrapy_poet.downloadermiddlewares
.. autoclass:: scrapy_poet.ScrapyPoetRequestFingerprinter
:members:

Page Input Providers
Page input providers
====================

.. automodule:: scrapy_poet.page_input_providers
4 changes: 2 additions & 2 deletions docs/index.rst
@@ -21,7 +21,7 @@ testability and reusability.
Concrete integrations are not provided by ``web-poet``, but
``scrapy-poet`` makes them possible.

To get started, see :ref:`intro-install` and :ref:`intro-tutorial`.
To get started, see :ref:`setup` and :ref:`intro-tutorial`.

:ref:`license` is BSD 3-clause.

@@ -34,7 +34,7 @@ To get started, see :ref:`intro-install` and :ref:`intro-tutorial`.
:caption: Getting started
:maxdepth: 1

intro/install
intro/setup
intro/basic-tutorial
intro/advanced-tutorial
intro/pitfalls
16 changes: 0 additions & 16 deletions docs/intro/advanced-tutorial.rst
@@ -77,14 +77,6 @@ It can be directly used inside the spider as:
class ProductSpider(scrapy.Spider):
custom_settings = {
"DOWNLOADER_MIDDLEWARES": {
"scrapy_poet.InjectionMiddleware": 543,
"scrapy.downloadermiddlewares.stats.DownloaderStats": None,
"scrapy_poet.DownloaderStatsMiddleware": 850,
}
}
def start_requests(self):
for url in [
"https://example.com/category/product/item?id=123",
@@ -152,14 +144,6 @@ Let's see it in action:
class ProductSpider(scrapy.Spider):
custom_settings = {
"DOWNLOADER_MIDDLEWARES": {
"scrapy_poet.InjectionMiddleware": 543,
"scrapy.downloadermiddlewares.stats.DownloaderStats": None,
"scrapy_poet.DownloaderStatsMiddleware": 850,
}
}
start_urls = [
"https://example.com/category/product/item?id=123",
"https://example.com/category/product/item?id=989",
51 changes: 0 additions & 51 deletions docs/intro/install.rst

This file was deleted.

56 changes: 56 additions & 0 deletions docs/intro/setup.rst
@@ -0,0 +1,56 @@
.. _setup:

=====
Setup
=====

.. _intro-install:

Install from PyPI::

    pip install scrapy-poet

Then configure:

-   For Scrapy ≥ 2.10, install the add-on:

    .. code-block:: python
        :caption: settings.py

        ADDONS = {
            "scrapy_poet.Addon": 300,
        }

    .. _addon-changes:

    This is what the add-on changes:

    -   In :setting:`DOWNLOADER_MIDDLEWARES`:

        -   Sets :class:`~scrapy_poet.InjectionMiddleware` with value ``543``.

        -   Replaces
            :class:`scrapy.downloadermiddlewares.stats.DownloaderStats`
            with :class:`scrapy_poet.DownloaderStatsMiddleware`.

    -   Sets :setting:`REQUEST_FINGERPRINTER_CLASS` to
        :class:`~scrapy_poet.ScrapyPoetRequestFingerprinter`.

    -   In :setting:`SPIDER_MIDDLEWARES`, sets
        :class:`~scrapy_poet.RetryMiddleware` with value ``275``.

-   For Scrapy < 2.10, manually apply :ref:`the add-on changes
    <addon-changes>`. For example:

    .. code-block:: python
        :caption: settings.py

        DOWNLOADER_MIDDLEWARES = {
            "scrapy_poet.InjectionMiddleware": 543,
            "scrapy.downloadermiddlewares.stats.DownloaderStats": None,
            "scrapy_poet.DownloaderStatsMiddleware": 850,
        }
        REQUEST_FINGERPRINTER_CLASS = "scrapy_poet.ScrapyPoetRequestFingerprinter"
        SPIDER_MIDDLEWARES = {
            "scrapy_poet.RetryMiddleware": 275,
        }
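
Reviewer note (not part of the commit): a project that must run on both sides of the Scrapy 2.10 cut-off could choose between the two configurations above at import time. The sketch below is one possible way to do that. The helper name `scrapy_poet_settings` is hypothetical, and it assumes the version string contains at least a major and a minor number.

```python
import re


def scrapy_poet_settings(scrapy_version: str) -> dict:
    """Return the scrapy-poet settings matching a Scrapy version string.

    Hypothetical helper: only (major, minor) are compared, and 2.10 is the
    cut-off named in the setup documentation.
    """
    # "2.10.0rc1" -> ["2", "10", "0", "1"]; keep the first two numbers.
    major, minor = (int(n) for n in re.findall(r"\d+", scrapy_version)[:2])
    if (major, minor) >= (2, 10):
        # Add-ons are available: let the add-on apply every change.
        return {"ADDONS": {"scrapy_poet.Addon": 300}}
    # Older Scrapy: apply the same changes by hand.
    return {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_poet.InjectionMiddleware": 543,
            "scrapy.downloadermiddlewares.stats.DownloaderStats": None,
            "scrapy_poet.DownloaderStatsMiddleware": 850,
        },
        "REQUEST_FINGERPRINTER_CLASS": "scrapy_poet.ScrapyPoetRequestFingerprinter",
        "SPIDER_MIDDLEWARES": {"scrapy_poet.RetryMiddleware": 275},
    }
```

In a `settings.py` this could be applied with something like `globals().update(scrapy_poet_settings(scrapy.__version__))`, although simply pinning the Scrapy version and writing one of the two forms is usually easier to read.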
16 changes: 3 additions & 13 deletions example/example/settings.py
@@ -9,26 +9,16 @@

from example.autoextract import AutoextractProductProvider

from scrapy_poet import ScrapyPoetRequestFingerprinter

BOT_NAME = "example"

SPIDER_MODULES = ["example.spiders"]
NEWSPIDER_MODULE = "example.spiders"

SCRAPY_POET_PROVIDERS = {AutoextractProductProvider: 500}

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

DOWNLOADER_MIDDLEWARES = {
"scrapy_poet.InjectionMiddleware": 543,
"scrapy.downloadermiddlewares.stats.DownloaderStats": None,
"scrapy_poet.DownloaderStatsMiddleware": 850,
ADDONS = {
"scrapy_poet.Addon": 300,
}

REQUEST_FINGERPRINTER_CLASS = ScrapyPoetRequestFingerprinter

SPIDER_MIDDLEWARES = {
"scrapy_poet.RetryMiddleware": 275,
}
SCRAPY_POET_PROVIDERS = {AutoextractProductProvider: 500}
1 change: 1 addition & 0 deletions scrapy_poet/__init__.py
@@ -4,3 +4,4 @@
from .page_input_providers import HttpResponseProvider, PageObjectInputProvider
from .spidermiddlewares import RetryMiddleware
from ._request_fingerprinter import ScrapyPoetRequestFingerprinter
from ._addon import Addon
103 changes: 103 additions & 0 deletions scrapy_poet/_addon.py
@@ -0,0 +1,103 @@
from logging import getLogger

from scrapy.downloadermiddlewares.stats import DownloaderStats
from scrapy.settings import BaseSettings
from scrapy.utils.misc import load_object

from ._request_fingerprinter import ScrapyPoetRequestFingerprinter
from .downloadermiddlewares import DownloaderStatsMiddleware, InjectionMiddleware
from .spidermiddlewares import RetryMiddleware

logger = getLogger(__name__)


# https://github.com/zytedata/zyte-spider-templates/blob/1b72aa8912f6009d43bf87a5bd1920537d458744/zyte_spider_templates/_addon.py#L33C1-L88C37
def _replace_builtin(
    settings: BaseSettings, setting: str, builtin_cls: type, new_cls: type
) -> None:
    setting_value = settings[setting]
    if not setting_value:
        logger.warning(
            f"Setting {setting!r} is empty. Could not replace the built-in "
            f"{builtin_cls} entry with {new_cls}. Add {new_cls} manually to "
            f"silence this warning."
        )
        return None

    if new_cls in setting_value:
        return None
    for cls_or_path in setting_value:
        if isinstance(cls_or_path, str):
            _cls = load_object(cls_or_path)
            if _cls == new_cls:
                return None

    builtin_entry: object = None
    for _setting_value in (setting_value, settings[f"{setting}_BASE"]):
        if builtin_cls in _setting_value:
            builtin_entry = builtin_cls
            pos = _setting_value[builtin_entry]
            break
        for cls_or_path in _setting_value:
            if isinstance(cls_or_path, str):
                _cls = load_object(cls_or_path)
                if _cls == builtin_cls:
                    builtin_entry = cls_or_path
                    pos = _setting_value[builtin_entry]
                    break
        if builtin_entry:
            break

    if not builtin_entry:
        logger.warning(
            f"Settings {setting!r} and {setting + '_BASE'!r} are both "
            f"missing built-in entry {builtin_cls}. Cannot replace it with {new_cls}. "
            f"Add {new_cls} manually to silence this warning."
        )
        return None

    if pos is None:
        logger.warning(
            f"Built-in entry {builtin_cls} of setting {setting!r} is disabled "
            f"(None). Cannot replace it with {new_cls}. Add {new_cls} "
            f"manually to silence this warning. If you had replaced "
            f"{builtin_cls} with some other entry, you might also need to "
            f"disable that other entry for things to work as expected."
        )
        return

    settings[setting][builtin_entry] = None
    settings[setting][new_cls] = pos


# https://github.com/scrapy-plugins/scrapy-zyte-api/blob/a1d81d11854b420248f38e7db49c685a8d46d943/scrapy_zyte_api/addon.py#L12
def _setdefault(settings, setting, cls, pos):
    setting_value = settings[setting]
    if not setting_value:
        settings[setting] = {cls: pos}
        return
    if cls in setting_value:
        return
    for cls_or_path in setting_value:
        if isinstance(cls_or_path, str):
            _cls = load_object(cls_or_path)
            if _cls == cls:
                return
    settings[setting][cls] = pos


class Addon:
    def update_settings(self, settings: BaseSettings) -> None:
        settings.set(
            "REQUEST_FINGERPRINTER_CLASS",
            ScrapyPoetRequestFingerprinter,
            priority="addon",
        )
        _setdefault(settings, "DOWNLOADER_MIDDLEWARES", InjectionMiddleware, 543)
        _setdefault(settings, "SPIDER_MIDDLEWARES", RetryMiddleware, 275)
        _replace_builtin(
            settings,
            "DOWNLOADER_MIDDLEWARES",
            DownloaderStats,
            DownloaderStatsMiddleware,
        )
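
Reviewer note (not part of the commit): the two helpers in `_addon.py` share one idea. A component may be registered either as a class object or as an import-path string, so both forms must be checked before adding or replacing an entry. The sketch below models that idea on plain dictionaries, without Scrapy. It is a simplification rather than the code above: it omits the warnings and the empty-setting check, and the names `resolve`, `find_key`, `set_default`, `replace_builtin` and `Replacement` are all hypothetical. `collections.OrderedDict` only stands in for a built-in component that can be imported by path.

```python
from collections import OrderedDict  # stand-in for a built-in component
from importlib import import_module


def resolve(cls_or_path):
    """Return the object behind a dotted import path, or the object itself."""
    if not isinstance(cls_or_path, str):
        return cls_or_path
    module_path, _, name = cls_or_path.rpartition(".")
    return getattr(import_module(module_path), name)


def find_key(components: dict, cls):
    """Return the key (class or path string) that refers to ``cls``, or None."""
    for key in components:
        if resolve(key) is cls:
            return key
    return None


def set_default(components: dict, cls, pos: int) -> None:
    """Register ``cls`` at ``pos`` unless it is already configured."""
    if find_key(components, cls) is None:
        components[cls] = pos


def replace_builtin(components: dict, base: dict, builtin_cls, new_cls) -> bool:
    """Disable ``builtin_cls`` and give ``new_cls`` its position.

    Returns True if ``new_cls`` ends up configured, False otherwise.
    """
    if find_key(components, new_cls) is not None:
        return True  # the user already configured the replacement
    # User-level entries take precedence over the base (default) entries.
    for mapping in (components, base):
        key = find_key(mapping, builtin_cls)
        if key is None:
            continue
        pos = mapping[key]
        if pos is None:
            return False  # built-in is disabled, so there is no slot to reuse
        components[key] = None
        components[new_cls] = pos
        return True
    return False  # built-in not found anywhere


class Replacement:
    """Hypothetical replacement component."""


base = {"collections.OrderedDict": 850}  # plays the role of a *_BASE setting
user: dict = {}                          # plays the role of the user setting
replaced = replace_builtin(user, base, OrderedDict, Replacement)
```

After the last line, `user` disables the built-in under the same key that the base setting used and registers `Replacement` at the built-in's former position, which mirrors what the add-on does for the downloader stats middleware.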