Can I fix urllib IncompleteRead() errors by re-requesting?
2022-01-11
I'm running a script that scrapes hundreds of pages on a website, but recently I've started hitting IncompleteRead() errors. From stackoverflow I understand these errors can happen for many unknown reasons. Based on my searching, I believe the error is raised at random by the Request() call:
for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" +
                  ec, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
3.5.2.3
2.1.3.15
2.5.1.72
1.5.1.2
6.1.1.9
3.2.2.27
Traceback (most recent call last):
  File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 554, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 521, in _read_next_chunk_size
    return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 571, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 556, in _get_chunk_left
    raise IncompleteRead(b'')
IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<ipython-input-20-82f1876d3006>", line 5, in <module>
    html = urlopen(url).read()
  File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 464, in read
    return self._readall_chunked()
  File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 578, in _readall_chunked
    raise IncompleteRead(b''.join(value))
IncompleteRead: IncompleteRead(1772944 bytes read)
The error occurs at random, in that it is not always the same URL that triggers it; https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27 is the one that caused this particular traceback.

Some solutions seem to introduce a try clause, but in the except they store the partial data (I think). Why do that instead of just resubmitting the request? If re-requesting is the answer, how do I rerun the request, since doing that manually usually seems to solve the problem? Beyond that, I don't know how to get around this.
2 Answers
The stack trace shows that you are reading a chunked transfer encoding response and that, for some reason, you lost the connection between 2 chunks.

As you said, this can happen for various reasons, and its occurrence is random. So:

- you cannot predict when or for which file it will happen
- you cannot prevent it from happening
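The ValueError at the top of the traceback shows where this goes wrong: each chunk is preceded by a hexadecimal size line, and when the connection drops mid-body, http.client reads an empty size line, fails to parse it, and converts the failure into IncompleteRead, attaching the bytes already received. A minimal illustration of those two steps:

```python
from http.client import IncompleteRead

# http.client parses each chunk-size line as a hexadecimal number.
# A dropped connection yields an empty line, so the parse fails:
try:
    int(b'', 16)
except ValueError as e:
    print(e)  # invalid literal for int() with base 16: b''

# http.client then raises IncompleteRead, exposing whatever bytes
# were already received via the .partial attribute:
err = IncompleteRead(b'<html>... truncated body')
print(err.partial)
```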
The best you can do is to catch the error and retry, optionally after a delay. For example:
import http.client
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" +
                  ec, headers={'User-Agent': 'Mozilla/5.0'})
    for i in range(4):
        try:
            html = urlopen(url).read()
            break
        except http.client.IncompleteRead:
            if i == 3:
                raise  # give up after 4 attempts
            # optionally add a delay here
    soup = BeautifulSoup(html, 'html.parser')
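If you need this in several places, the retry loop can be factored into a small helper. The sketch below is not part of the original answer; the names retry and fetch, and the exponential-backoff delays, are my own choices:

```python
import time
import http.client
from urllib.request import Request, urlopen

def retry(func, attempts=4, base_delay=1.0):
    """Call func(), retrying on IncompleteRead with exponential backoff."""
    for i in range(attempts):
        try:
            return func()
        except http.client.IncompleteRead:
            if i == attempts - 1:
                raise                          # give up after the last attempt
            time.sleep(base_delay * 2 ** i)    # wait 1s, 2s, 4s, ...

def fetch(url):
    """One plain request, as in the loop above."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read()

# usage inside the loop:
# html = retry(lambda: fetch("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec))
```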
Serge Ballesta
2022-01-12
I faced the same problem and found this solution. After some small changes, the code looks like this:
from http.client import IncompleteRead, HTTPResponse
from urllib.request import urlopen
from urllib.error import URLError, HTTPError
import json
...

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except IncompleteRead as e:
            return e.partial  # return the bytes received so far
    return inner

HTTPResponse.read = patch_http_response_read(HTTPResponse.read)

try:
    response = urlopen(my_url)
    result = json.loads(response.read().decode('UTF-8'))
except HTTPError as e:  # HTTPError is a subclass of URLError, so catch it first
    print('HTTP Error code: ', e.code)
except URLError as e:
    print('URL Error Reason: ', e.reason)
I'm not sure this is the best approach, but it works for me. I'd be glad if these suggestions are useful to you, or help you find something that works better. Happy coding!
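Note that patching HTTPResponse.read changes the behavior of every response in the process. If you only need partial-read tolerance in one place, a more local sketch of the same idea (the helper name read_allowing_partial is mine, not from the answer) catches IncompleteRead at the call site and uses its partial attribute directly:

```python
from http.client import IncompleteRead

def read_allowing_partial(response):
    """Read a response body; on IncompleteRead, return the bytes received so far."""
    try:
        return response.read()
    except IncompleteRead as e:
        return e.partial  # whatever arrived before the stream broke

# usage:
# body = read_allowing_partial(urlopen(my_url))
```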
Aleksey Ozimkov
2023-01-14