
Can I fix a urllib IncompleteRead() error by re-requesting?

2022-01-11

I am running a script that scrapes hundreds of pages on a website, and recently I have started running into IncompleteRead() errors. From Stack Overflow I understand that these can happen for any number of unknown reasons.

From my searching, I believe the error is raised at random by the Request() call:

    for ec in unq:
        print(ec)
        url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" +
                              ec, headers={'User-Agent': 'Mozilla/5.0'})
        html = urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')


    3.5.2.3
    2.1.3.15
    2.5.1.72
    1.5.1.2
    6.1.1.9
    3.2.2.27
    Traceback (most recent call last):
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 554, in _get_chunk_left
        chunk_left = self._read_next_chunk_size()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 521, in _read_next_chunk_size
        return int(line, 16)
    
    ValueError: invalid literal for int() with base 16: b''
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 571, in _readall_chunked
        chunk_left = self._get_chunk_left()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 556, in _get_chunk_left
        raise IncompleteRead(b'')
    
    IncompleteRead: IncompleteRead(0 bytes read)
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    
      File "<ipython-input-20-82f1876d3006>", line 5, in <module>
        html = urlopen(url).read()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 464, in read
        return self._readall_chunked()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 578, in _readall_chunked
        raise IncompleteRead(b''.join(value))
    
    IncompleteRead: IncompleteRead(1772944 bytes read)

The error occurs at random, in that it is not always the same URL that triggers it; https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27 is the one that caused this particular traceback.

Some solutions seem to introduce a try clause, but inside the except they store the partial data (I think). Why do that, rather than simply resubmitting the request?

If retrying is the way to go, how do I re-run the request? Doing so usually seems to fix the problem, and beyond that I have no idea how to work around it.

2 Answers

The stack trace suggests that you are reading a chunked transfer-encoded response and that, for some reason, you lost the connection between two chunks.
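
For context: in a chunked response, every chunk is preceded by its size written as a hexadecimal line, and that is exactly what the int(line, 16) call in your traceback is parsing. A minimal sketch of the failure mode (the byte values are illustrative):

    # http.client reads the next chunk-size line and parses it as hex,
    # e.g. b'6d2' means 1746 bytes of data follow.  If the connection
    # drops between two chunks, readline() returns b'' instead, so:
    int(b'6d2', 16)   # fine: 1746
    int(b'', 16)      # ValueError, which http.client then reports as
                      # IncompleteRead(0 bytes read)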

As you said, this can happen for many reasons, and it happens at random. So:

  • you cannot predict when, or for which file, it will happen
  • you cannot prevent it from happening

The best you can do is catch the error and retry, after an optional delay.

For example:

    import http.client

    for ec in unq:
        print(ec)
        url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" +
                      ec, headers={'User-Agent': 'Mozilla/5.0'})
        for i in range(4):
            try:
                html = urlopen(url).read()
                break
            except http.client.IncompleteRead:
                if i == 3:
                    raise       # give up after 4 attempts
                # optionally add a delay here (see the sketch below)
        soup = BeautifulSoup(html, 'html.parser')
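
If you want that optional delay, a minimal sketch with exponential backoff (the time.sleep schedule of 1, 2, then 4 seconds is an assumption, not part of the original answer):

    import http.client
    import time
    from urllib.request import Request, urlopen

    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27",
                  headers={'User-Agent': 'Mozilla/5.0'})
    for i in range(4):
        try:
            html = urlopen(url).read()
            break
        except http.client.IncompleteRead:
            if i == 3:
                raise               # give up after 4 attempts
            time.sleep(2 ** i)      # back off before retrying
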
Serge Ballesta
2022-01-12

I ran into the same problem and found this solution.

After some minor changes, the code looks like this:

    import json
    from http.client import IncompleteRead, HTTPResponse
    from urllib.request import urlopen
    from urllib.error import URLError, HTTPError
    ...


    def patch_http_response_read(func):
        def inner(*args):
            try:
                return func(*args)
            except IncompleteRead as e:
                # Return whatever arrived before the stream broke
                return e.partial
        return inner

    # Monkey-patch every HTTPResponse.read call in the process
    HTTPResponse.read = patch_http_response_read(HTTPResponse.read)

    try:
        response = urlopen(my_url)
        result = json.loads(response.read().decode('UTF-8'))
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        print('HTTP Error code: ', e.code)
    except URLError as e:
        print('URL Error Reason: ', e.reason)
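
One caveat: the patch above replaces HTTPResponse.read for the entire process, so every response will silently return truncated data instead of raising. A narrower sketch that applies the same fallback to a single response, without monkey-patching (read_allow_partial is a hypothetical helper, not from the original answer):

    from http.client import IncompleteRead
    from urllib.request import urlopen

    def read_allow_partial(response):
        """Return the full body, or whatever arrived before the stream broke."""
        try:
            return response.read()
        except IncompleteRead as e:
            return e.partial    # bytes received before the connection dropped

    my_url = "https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27"  # example URL from the question
    response = urlopen(my_url)
    html = read_allow_partial(response)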
    

I am not sure whether this is the best approach, but it works for me. I would be glad if these suggestions help you, or help you find another, better solution. Happy coding!

Aleksey Ozimkov
2023-01-14