Incomplete web scraping of an HTML page in Python

2020-11-21

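I am trying to scrape two tables from a page.

But when I use soup.find('table'), it finds nothing. Also, when I print the soup object, the table portion of the HTML is not printed. Is there a way to fix this?

My code so far: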

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/opcoes/posicoes-em-aberto/posicoes-em-aberto-8AE490CA64BA055F0164CCCAE1F1460A.htm?empresaEmissora=AMBEV%20S.A.&data=19/11/2020&dataVencimento=21/12/20&f=0'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'lxml')

table = soup.find('div').find_all('table')

print(table)

Output:

[]
[Finished in 3.4s]

When I run this:

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/opcoes/posicoes-em-aberto/posicoes-em-aberto-8AE490CA64BA055F0164CCCAE1F1460A.htm?empresaEmissora=AMBEV%20S.A.&data=19/11/2020&dataVencimento=21/12/20&f=0'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'lxml')

table = soup.find('tbody').find_all('tr')

print(table)

I get the following error, even though in the page's HTML the table data is inside tbody > tr, just like the tables I have scraped before:

Traceback (most recent call last):
  File "C:\Users\jvbf9\Documents\data-science\scraping_thiago\main.py", line 11, in <module>
    table = soup.find('tbody').find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'
[Finished in 7.2s with exit code 1]
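
The traceback above follows from how find() behaves: it returns None when no tag matches, so chaining .find_all() onto it raises AttributeError. A minimal sketch of guarding against that, using hypothetical inline markup instead of the real page; the helper name rows_in_first_tbody and the 'html.parser' backend are assumptions chosen for illustration, not part of the original code:

```python
from bs4 import BeautifulSoup


def rows_in_first_tbody(markup):
    """Return the <tr> tags of the first <tbody>, or an empty list if there is none."""
    soup = BeautifulSoup(markup, 'html.parser')
    tbody = soup.find('tbody')  # find() returns None when nothing matches
    if tbody is None:
        return []
    return tbody.find_all('tr')


# Hypothetical sample documents, one with a table body and one without.
with_table = '<table><tbody><tr><td>1</td></tr><tr><td>2</td></tr></tbody></table>'
without_table = '<div><p>no table in this markup</p></div>'

print(len(rows_in_first_tbody(with_table)))     # 2
print(len(rows_in_first_tbody(without_table)))  # 0
```

An empty result here means the parsed markup contains no tbody at all, which matches the observation that the table is missing from the printed soup.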

1 Answer

When you create the parser, pass it the response content (r.content) rather than the text (r.text):

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/opcoes/posicoes-em-aberto/posicoes-em-aberto-8AE490CA64BA055F0164CCCAE1F1460A.htm?empresaEmissora=AMBEV%20S.A.&data=19/11/2020&dataVencimento=21/12/20&f=0'

r = requests.get(url)

soup = BeautifulSoup(r.content, 'lxml')

table = soup.find('div').find_all('table')

print(table)

That should be the problem.
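
One way to check whether switching from r.text to r.content changes anything is to count the table tags found in each form. The sketch below does this on a hypothetical inline document; count_tables, the sample markup, and the 'html.parser' backend are illustrative assumptions. It calls find_all('table') on the whole soup, whereas soup.find('div').find_all('table') only searches inside the first div.

```python
from bs4 import BeautifulSoup


def count_tables(markup):
    """Count <table> tags in markup given as str (like r.text) or bytes (like r.content)."""
    soup = BeautifulSoup(markup, 'html.parser')
    return len(soup.find_all('table'))  # searches the entire document


# Hypothetical stand-in for a downloaded page, as text and as raw bytes.
sample_text = '<html><body><div><table><tr><td>AMBEV</td></tr></table></div></body></html>'
sample_bytes = sample_text.encode('utf-8')

print(count_tables(sample_text))   # 1
print(count_tables(sample_bytes))  # 1
```

With the real page, one would pass r.text and r.content in turn. If both counts are zero, the tables are probably not in the HTML that requests receives at all, in which case changing the parser input alone would not help.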

Israel Adelaja
2020-11-21