A more memory-efficient HttpResponseProvider #95

@BurnzZ

Description

https://github.com/scrapinghub/scrapy-poet/blob/master/scrapy_poet/page_input_providers.py#L165-L180

Currently, the HttpResponseProvider creates a new HttpResponse instance each time it's called:

class HttpResponseProvider(PageObjectInputProvider, CacheDataProviderMixin):
    """This class provides ``web_poet.page_inputs.HttpResponse`` instances."""

    provided_classes = {HttpResponse}
    name = "response_data"

    def __call__(self, to_provide: Set[Callable], response: Response):
        """Builds a ``HttpResponse`` instance using a Scrapy ``Response``"""
        return [
            HttpResponse(
                url=response.url,
                body=response.body,
                status=response.status,
                headers=HttpResponseHeaders.from_bytes_dict(response.headers),
            )
        ]

From another thread:

Suppose the average HTML size for a particular website is 256 KB, and that our MultiLayoutPage subclass needs to support 12 POs. Each multi-layout PO instance then holds at least 256 KB * 12 = 3 MB in memory. At a parsing rate of 10 pages per second, we're holding at least 3 MB * 10 pages = 30 MB of additional memory for every second of parsing.
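The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the figures are the ones quoted from the other thread, and the helper names (`bytes_per_page`, `bytes_per_second`) are made up for illustration.

```python
# Back-of-the-envelope estimate using the figures quoted above.
KB = 1024
MB = 1024 * KB

AVG_HTML_SIZE = 256 * KB   # assumed average HTML size per page
PAGE_OBJECTS = 12          # POs supported by the MultiLayoutPage subclass
PAGES_PER_SECOND = 10      # assumed parsing rate


def bytes_per_page(html_size: int, n_page_objects: int) -> int:
    """Memory held per page if every PO gets its own copy of the body."""
    return html_size * n_page_objects


def bytes_per_second(html_size: int, n_page_objects: int, rate: int) -> int:
    """Memory newly held for each second of parsing at the given rate."""
    return bytes_per_page(html_size, n_page_objects) * rate


per_page = bytes_per_page(AVG_HTML_SIZE, PAGE_OBJECTS)
per_second = bytes_per_second(AVG_HTML_SIZE, PAGE_OBJECTS, PAGES_PER_SECOND)
print(f"{per_page / MB:.0f} MB per page, {per_second / MB:.0f} MB per second")
```

If the provider returned one shared instance per response, the factor of 12 would drop out of the first calculation.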

This isn't a crucial issue for now, but the provider could be made more efficient by returning the same HttpResponse instance for a given response identifier. HttpResponseProvider already inherits from CacheDataProviderMixin. Perhaps we can use an in-memory cache to decide whether an existing instance can be returned instead of creating a new one.
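A minimal sketch of what such an in-memory cache could look like, assuming the Scrapy response object itself can serve as the "response identifier" and that the same provider instance handles every call for a given response. To keep the snippet self-contained, `FakeScrapyResponse`, `FakeHttpResponse` and `CachingResponseProvider` are hypothetical stand-ins, not scrapy, web-poet or scrapy-poet classes, and the sketch does not use CacheDataProviderMixin.

```python
import weakref
from dataclasses import dataclass
from typing import List


class FakeScrapyResponse:
    """Hypothetical stand-in for scrapy.http.Response."""

    def __init__(self, url: str, body: bytes, status: int = 200):
        self.url = url
        self.body = body
        self.status = status


@dataclass(frozen=True)
class FakeHttpResponse:
    """Hypothetical stand-in for web_poet.HttpResponse."""

    url: str
    body: bytes
    status: int


class CachingResponseProvider:
    """Returns one shared FakeHttpResponse per source response."""

    def __init__(self) -> None:
        # Weak keys: an entry disappears once the source response is
        # garbage collected, so the cache does not keep pages alive.
        self._cache = weakref.WeakKeyDictionary()

    def __call__(self, response: FakeScrapyResponse) -> List[FakeHttpResponse]:
        cached = self._cache.get(response)
        if cached is None:
            # First request for this response: build and remember it.
            cached = FakeHttpResponse(
                url=response.url,
                body=response.body,
                status=response.status,
            )
            self._cache[response] = cached
        return [cached]


provider = CachingResponseProvider()
page = FakeScrapyResponse("https://example.com", b"<html></html>")
first = provider(page)[0]
second = provider(page)[0]
print(first is second)
```

Keying on the response object with weak references avoids having to pick an eviction policy. Whether that fits how scrapy-poet actually creates and invokes providers would need to be checked against the real code.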

Labels: enhancement (New feature or request)
