Description
https://github.com/scrapinghub/scrapy-poet/blob/master/scrapy_poet/page_input_providers.py#L165-L180
Currently, the HttpResponseProvider creates a new HttpResponse instance each time it's called:

    class HttpResponseProvider(PageObjectInputProvider, CacheDataProviderMixin):
        """This class provides ``web_poet.page_inputs.HttpResponse`` instances."""

        provided_classes = {HttpResponse}
        name = "response_data"

        def __call__(self, to_provide: Set[Callable], response: Response):
            """Builds a ``HttpResponse`` instance using a Scrapy ``Response``"""
            return [
                HttpResponse(
                    url=response.url,
                    body=response.body,
                    status=response.status,
                    headers=HttpResponseHeaders.from_bytes_dict(response.headers),
                )
            ]

From another thread:
Suppose the average HTML size for a particular website is 256 KB. Let's also suppose that we have 12 page objects (POs) that we need to support in our MultiLayoutPage subclass. This means that every multi-layout PO instance holds at least 256 KB * 12 = 3 MB in memory. Assuming that we're parsing at a rate of 10 pages per second, we're allocating at least 3 MB * 10 pages = 30 MB of memory per second.
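
For reference, the estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and it assumes binary units (1 MB = 1024 KB) and that every page object ends up with its own full copy of the HTML.

```python
KB = 1024
MB = 1024 * KB

# Figures taken from the quoted estimate.
avg_html_size = 256 * KB        # average HTML size per page
page_objects_per_layout = 12    # POs supported by the MultiLayoutPage subclass
pages_per_second = 10           # parsing rate

# Memory held by one multi-layout PO if each sub-PO gets its own copy.
per_page = avg_html_size * page_objects_per_layout

# Memory allocated per second at the assumed parsing rate.
per_second = per_page * pages_per_second

print(per_page / MB)    # 3.0
print(per_second / MB)  # 30.0
```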
It's not a crucial issue for now, but it can certainly be made more efficient by having the provider return the same HttpResponse instance for a given response identifier. HttpResponseProvider already inherits from CacheDataProviderMixin. Perhaps we can use an in-memory cache to determine whether we can return the same instance instead of creating a new one.
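
One possible shape for such an in-memory cache is sketched below. It is not the scrapy-poet implementation: `ScrapyResponseStub`, `HttpResponseStub`, and `CachingHttpResponseProvider` are hypothetical stand-ins so that the example runs without Scrapy or web-poet installed. The sketch uses the Scrapy response object itself as the "response identifier" and assumes that the same provider instance is reused across calls and that the response object is hashable and weak-referenceable.

```python
import weakref
from dataclasses import dataclass


class ScrapyResponseStub:
    """Hypothetical stand-in for scrapy.http.Response."""

    def __init__(self, url: str, body: bytes, status: int = 200):
        self.url = url
        self.body = body
        self.status = status


@dataclass(frozen=True)
class HttpResponseStub:
    """Hypothetical stand-in for web_poet.page_inputs.HttpResponse."""

    url: str
    body: bytes
    status: int


class CachingHttpResponseProvider:
    """Returns the same HttpResponseStub for the same response object."""

    def __init__(self):
        # Weak keys: an entry is dropped automatically once the response
        # object it belongs to is garbage collected, so the cache does not
        # keep finished responses alive.
        self._instances = weakref.WeakKeyDictionary()

    def __call__(self, to_provide, response):
        cached = self._instances.get(response)
        if cached is None:
            # First request for this response: build and remember it.
            cached = HttpResponseStub(
                url=response.url,
                body=response.body,
                status=response.status,
            )
            self._instances[response] = cached
        return [cached]


provider = CachingHttpResponseProvider()
scrapy_response = ScrapyResponseStub("https://example.com", b"<html></html>")
other_response = ScrapyResponseStub("https://example.com/2", b"<html></html>")

first = provider({HttpResponseStub}, scrapy_response)[0]
second = provider({HttpResponseStub}, scrapy_response)[0]
third = provider({HttpResponseStub}, other_response)[0]

same_instance = first is second          # repeated call reuses the instance
different_instance = first is not third  # another response gets its own
```

Weak references keep the cache from growing without bound, at the cost of requiring that the cached value never holds a strong reference back to the response used as its key.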