Description
Problem statement
A common scenario when using the Scrapy middleware to auto-extract pages (e.g. product pages) is that some of the URLs respond with a 404 status.
However, the library does not provide a way to handle the resulting AutoExtractError exceptions. It seems that the middleware returns only successful requests (successful with respect to the crawled domain, not the AutoExtract API), while unsuccessful ones are simply logged before the exception is raised:
...
if result.get('error'):
    self.inc_metric('autoextract/errors/result_error')
    self._log_debug_error(response, body)
    raise AutoExtractError('Received error from AutoExtract for {}: {}'.format(url, result["error"]))
...
Example
This is the output I get when I try to crawl the 404 URL: https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html
2021-12-13 12:54:43 [scrapy_autoextract.middlewares] DEBUG: Process AutoExtract request for product URL <GET https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html>
2021-12-13 12:54:55 [scrapy_autoextract.middlewares] DEBUG: AutoExtract response status=200 headers={'date': 'Mon, 13 Dec 2021 10:54:44 GMT', 'content-type': 'application/json', 'strict-transport-security': 'max-age=0; includeSubDomains; preload'} content=[{"query":{"id":"1639392884013-e7d673376b493f68","domain":"dosfarma.com","userAgent":"scrapy-autoextract/0.5.2 scrapy/2.4.1","userQuery":{"url":"https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html","pageType":"product"}},"error":"Downloader error: http404"}]
2021-12-13 12:54:55 [scrapy.core.scraper] ERROR: Error downloading <POST https://autoextract.scrapinghub.com/v1/extract>
Traceback (most recent call last):
File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
result = current_context.run(gen.send, result)
StopIteration: <200 https://autoextract.scrapinghub.com/v1/extract>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
result = current_context.run(gen.send, result)
File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 55, in process_response
response = yield deferred_from_coro(method(request=request, response=response, spider=spider))
File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/scrapy_autoextract/middlewares.py", line 190, in process_response
'{}: {}'.format(url, result["error"]))
scrapy_autoextract.middlewares.AutoExtractError: Received error from AutoExtract for https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html: Downloader error: http404
2021-12-13 13:00:21 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-13 13:00:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'autoextract/errors/result_error': 1,
'autoextract/request_count': 1,
'downloader/request_bytes': 460,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 445,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 31.248149,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 13, 11, 0, 21, 791393),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 9,
'memusage/max': 70422528,
'memusage/startup': 70422528,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2021, 12, 13, 10, 59, 50, 543244)}
With only this information (a DEBUG-level log entry and an increment of the autoextract/errors/result_error metric), the user has no access to the information contained in the unsuccessful responses, which may well be important for many applications. Parsing the DEBUG logs is a poor workaround, since deployed applications typically log only at WARNING level and above.
Proposal
Refactor at least the process_response method of the AutoExtractMiddleware so that it returns a unified response covering all cases. For example, responses that are unsuccessful with respect to the crawled domain (not the AutoExtract API) should carry the error message, here Downloader error: http404.
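To make the proposal more concrete, here is a minimal sketch of what a unified result could look like. It is not the middleware's actual code: the helper name unify_autoextract_results and the record fields (url, page_type, ok, error, data) are hypothetical, and the sample body is a trimmed copy of the JSON from the DEBUG log above. The idea is only that each entry of the AutoExtract response maps to one record, whether or not it contains an error.

```python
import json
from typing import Any, Dict, List


def unify_autoextract_results(body: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: map every entry of an AutoExtract response body
    to a uniform record instead of raising on the first error."""
    records = []
    for result in json.loads(body):
        user_query = result.get("query", {}).get("userQuery", {})
        # Same truthiness check as the middleware snippet quoted above.
        error = result.get("error") or None
        records.append({
            "url": user_query.get("url"),
            "page_type": user_query.get("pageType"),
            "ok": error is None,
            "error": error,
            # Keep the raw payload only for successful extractions.
            "data": result if error is None else None,
        })
    return records


# Trimmed copy of the response body from the DEBUG log in this issue.
SAMPLE_BODY = json.dumps([{
    "query": {
        "domain": "dosfarma.com",
        "userQuery": {
            "url": "https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html",
            "pageType": "product",
        },
    },
    "error": "Downloader error: http404",
}])

records = unify_autoextract_results(SAMPLE_BODY)
for record in records:
    print(record["url"], record["ok"], record["error"])
```

With something along these lines, the spider callback could branch on record["ok"] and still see the error string for failed pages, rather than having to recover it from DEBUG logs.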