Description
Demo spider with settings:

```python
DNS_RESOLVER = "scrapy.resolver.CachingHostnameResolver"
```

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo_spider'
    start_urls = ['https://[2402:4e00:40:40::2:3b6]']

    def parse(self, response, **kwargs):
        print(response.body)
        print(response)
```

Command to start the spider:

```shell
scrapy crawl demo_spider -s JOBDIR=./jobs/run-1
```

When I use the JOBDIR parameter, it causes an exception:
```
Traceback (most recent call last):
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/commands/crawl.py", line 27, in run
    self.crawler_process.start()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/crawler.py", line 348, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1318, in run
    self.mainLoop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1328, in mainLoop
    reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/reactor.py", line 51, in __call__
    return self._func(*self._a, **self._kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 147, in _next_request
    while not self._needs_backout() and self._next_request_from_scheduler() is not None:
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 176, in _next_request_from_scheduler
    request = self.slot.scheduler.next_request()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 263, in next_request
    request = self._dqpop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 299, in _dqpop
    return self.dqs.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/pqueues.py", line 99, in pop
    m = q.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/squeues.py", line 78, in pop
    return request_from_dict(request, spider=self.spider)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/request.py", line 124, in request_from_dict
    return request_cls(**kwargs)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 60, in __init__
    self._set_url(url)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 100, in _set_url
    s = safe_url_string(url, self.encoding)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/w3lib/url.py", line 103, in safe_url_string
    parts.port,
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/parse.py", line 178, in port
    raise ValueError(message) from None
builtins.ValueError: Port could not be cast to integer value as '4e00:40:40::2:3b6'
```
The same traceback is also logged at 2022-10-09 13:57:19 as `[twisted] CRITICAL: Unhandled Error`. I debugged and found that the problem is in urllib.parse#L202.
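The failure can be reproduced with the standard library alone. This minimal sketch shows what `urllib.parse` does once the brackets are gone: without them, the netloc is split at the first `:`, so the rest of the IPv6 address is treated as the port.

```python
from urllib.parse import urlsplit

# With brackets, urlsplit recognizes the IPv6 literal
ok = urlsplit("https://[2402:4e00:40:40::2:3b6]")
print(ok.hostname)  # 2402:4e00:40:40::2:3b6
print(ok.port)      # None (no port given)

# Without brackets, the netloc is split at the first ':' instead
bad = urlsplit("https://2402:4e00:40:40::2:3b6")
print(bad.hostname)  # 2402
try:
    bad.port  # int('4e00:40:40::2:3b6') fails
except ValueError as exc:
    print(exc)  # Port could not be cast to integer value as '4e00:40:40::2:3b6'
```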
When I stopped using the JOBDIR parameter and debugged again, the problem still existed. This time it shows up in middlewares such as CookieJar, RetryMiddleware, and RobotsTxtMiddleware.
The problem seems to be in the creation of the Request instance: it calls self._set_url, which turns the URL https://[2402:4e00:40:40::2:3b6] into https://2402:4e00:40:40::2:3b6.
When a middleware then creates another Request instance from Request.url, calling self._set_url returns the wrong hostname and port.
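One way brackets can get lost in such a round trip can be sketched with the standard library alone (this is not Scrapy's or w3lib's actual code, just the same `urllib.parse` primitives): rebuilding a netloc from `.hostname` drops the brackets an IPv6 literal needs, and parsing the rebuilt URL then fails exactly as in the traceback above.

```python
from urllib.parse import urlsplit, urlunsplit

url = "https://[2402:4e00:40:40::2:3b6]/page"
parts = urlsplit(url)

# .hostname returns the bare address, without the brackets
# that an IPv6 literal needs inside a netloc
rebuilt = urlunsplit((parts.scheme, parts.hostname, parts.path, "", ""))
print(rebuilt)  # https://2402:4e00:40:40::2:3b6/page

# Parsing the rebuilt URL now misreads host and port
try:
    urlsplit(rebuilt).port
except ValueError as exc:
    print(exc)  # Port could not be cast to integer value as '4e00:40:40::2:3b6'
```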
Versions

```shell
$ scrapy version --verbose
Scrapy       : 2.6.3
lxml         : 4.9.1.0
libxml2      : 2.9.4
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 2.0.1
Twisted      : 22.8.0
Python       : 3.9.6 (default, Sep 13 2022, 22:03:16) - [Clang 14.0.0 (clang-1400.0.29.102)]
pyOpenSSL    : 22.0.0 (OpenSSL 3.0.5 5 Jul 2022)
cryptography : 37.0.4
Platform     : macOS-12.6-arm64-arm-64bit
```

