
Can I fix a urllib IncompleteRead() error by re-requesting?

2022-01-11

I am running a script that scrapes hundreds of pages on a website, and recently I have started running into IncompleteRead() errors. From Stack Overflow I understand that these can happen for any number of unknown reasons.

From my searching, I believe the error is raised at random by the Request() call:

    for ec in unq:
        print(ec)
        url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" +
                              ec, headers={'User-Agent': 'Mozilla/5.0'})
        html = urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')


    3.5.2.3
    2.1.3.15
    2.5.1.72
    1.5.1.2
    6.1.1.9
    3.2.2.27
    Traceback (most recent call last):
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 554, in _get_chunk_left
        chunk_left = self._read_next_chunk_size()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 521, in _read_next_chunk_size
        return int(line, 16)
    
    ValueError: invalid literal for int() with base 16: b''
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 571, in _readall_chunked
        chunk_left = self._get_chunk_left()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 556, in _get_chunk_left
        raise IncompleteRead(b'')
    
    IncompleteRead: IncompleteRead(0 bytes read)
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    
      File "<ipython-input-20-82f1876d3006>", line 5, in <module>
        html = urlopen(url).read()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 464, in read
        return self._readall_chunked()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 578, in _readall_chunked
        raise IncompleteRead(b''.join(value))
    
    IncompleteRead: IncompleteRead(1772944 bytes read)

The error occurs at random, in that it is not always the same URL that triggers it; https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27 is the one that caused this particular traceback.

Some solutions seem to introduce a try clause, but inside the except they store the partial data (I think). Why do that, rather than simply resubmitting the request?

If retrying is the way to go, how do I re-run the request? Doing so usually seems to fix the problem, and beyond that I have no idea how to work around it.

2 Answers

The stack trace suggests that you are reading a chunked transfer-encoded response and that, for some reason, you lost the connection between two chunks.
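
For context: in a chunked response, every chunk is preceded by its size written as a hexadecimal line, and that is exactly what the int(line, 16) call in your traceback is parsing. A minimal sketch of the failure mode (the byte values are illustrative):

    # http.client reads the next chunk-size line and parses it as hex,
    # e.g. b'6d2' means 1746 bytes of data follow.  If the connection
    # drops between two chunks, readline() returns b'' instead, so:
    int(b'6d2', 16)   # fine: 1746
    int(b'', 16)      # ValueError, which http.client then reports as
                      # IncompleteRead(0 bytes read)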

As you said, this can happen for many reasons, and it happens at random. So:

  • you cannot predict when, or for which file, it will happen
  • you cannot prevent it from happening

The best you can do is catch the error and retry, after an optional delay.

For example:

    import http.client

    for ec in unq:
        print(ec)
        url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" +
                      ec, headers={'User-Agent': 'Mozilla/5.0'})
        for i in range(4):
            try:
                html = urlopen(url).read()
                break
            except http.client.IncompleteRead:
                if i == 3:
                    raise       # give up after 4 attempts
                # optionally add a delay here (see the sketch below)
        soup = BeautifulSoup(html, 'html.parser')
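
If you want that optional delay, a minimal sketch with exponential backoff (the time.sleep schedule of 1, 2, then 4 seconds is an assumption, not part of the original answer):

    import http.client
    import time
    from urllib.request import Request, urlopen

    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27",
                  headers={'User-Agent': 'Mozilla/5.0'})
    for i in range(4):
        try:
            html = urlopen(url).read()
            break
        except http.client.IncompleteRead:
            if i == 3:
                raise               # give up after 4 attempts
            time.sleep(2 ** i)      # back off before retrying
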
Serge Ballesta
2022-01-12

I ran into the same problem and found this solution.

After some minor changes, the code looks like this:

    import json
    from http.client import IncompleteRead, HTTPResponse
    from urllib.request import urlopen
    from urllib.error import URLError, HTTPError
    ...


    def patch_http_response_read(func):
        def inner(*args):
            try:
                return func(*args)
            except IncompleteRead as e:
                # Return whatever arrived before the stream broke
                return e.partial
        return inner

    # Monkey-patch every HTTPResponse.read call in the process
    HTTPResponse.read = patch_http_response_read(HTTPResponse.read)

    try:
        response = urlopen(my_url)
        result = json.loads(response.read().decode('UTF-8'))
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        print('HTTP Error code: ', e.code)
    except URLError as e:
        print('URL Error Reason: ', e.reason)
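
One caveat: the patch above replaces HTTPResponse.read for the entire process, so every response will silently return truncated data instead of raising. A narrower sketch that applies the same fallback to a single response, without monkey-patching (read_allow_partial is a hypothetical helper, not from the original answer):

    from http.client import IncompleteRead
    from urllib.request import urlopen

    def read_allow_partial(response):
        """Return the full body, or whatever arrived before the stream broke."""
        try:
            return response.read()
        except IncompleteRead as e:
            return e.partial    # bytes received before the connection dropped

    my_url = "https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27"  # example URL from the question
    response = urlopen(my_url)
    html = read_allow_partial(response)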
    

I am not sure whether this is the best approach, but it works for me. I would be glad if these suggestions help you, or help you find another, better solution. Happy coding!

Aleksey Ozimkov
2023-01-14