Python: requests.get returns the wrong HTML file
2021-02-03
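I am trying to scrape data from https://essentials.swissdox.ch, which is only reachable over a VPN. My approach is to build a URL with query parameters and fetch the corresponding HTML file. The problem is that, although the link works, Python gives me the HTML file of the https://essentials.swissdox.ch start page. I would greatly appreciate any help!
Instead, I get the HTML file of this page: https://essentials.swissdox.ch/View/log/index.jsp?reset=true
This is what I have so far: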
import urllib.parse
import urllib.request

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Set keywords for URL
keyword_queries = ['lissabon']
startdate = "2007-01-01"
enddate = "2007-01-01"

# Encode and hit URL
for keyword in keyword_queries:
    html_keyword = urllib.parse.quote_plus(keyword)
    URL = "https://essentials.swissdox.ch/View/log/index.jsp#&search=true&sortorder=pubDateTime%20desc&formdata=%5B%7B%22name%22%3A%22SEARCH_mltid%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_sc%22%2C%22value%22%3A%22swissdox%22%7D%2C%7B%22name%22%3A%22SEARCH_query%22%2C%22value%22%3A%22" + html_keyword + "%22%7D%2C%7B%22name%22%3A%22SEARCH_exact%22%2C%22value%22%3A%22true%22%7D%2C%7B%22name%22%3A%22dateDropdown%22%2C%22value%22%3A%22-1%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_lower%22%2C%22value%22%3A%22" + startdate + "%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_upper%22%2C%22value%22%3A%22" + enddate + "%22%7D%2C%7B%22name%22%3A%22SEARCH_tiall%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_source%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_author%22%2C%22value%22%3A%22%22%7D%5D"
    weburl = urllib.request.urlopen(URL)

    # Hit the URL with a random User-Agent.
    # Headers must be passed by keyword: the second positional
    # argument of requests.get is params, not headers.
    ua = UserAgent()
    page = requests.get(URL, headers={"User-Agent": ua.random})
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find('div', class_='documentlist')
    print(page.content)
1 Answer
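It looks like you are using "#" instead of "?" in the URL. Normally, "?" starts the query string, whose parameters are given as key-value pairs with "=" between each key and its value and "&" between pairs.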
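To illustrate what a conventional query string looks like, here is a minimal sketch using only the standard library. The two parameter names are taken from the URL in the question; whether the server actually reads them from the query string is an assumption, not something confirmed here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Base address taken from the question.
base = "https://essentials.swissdox.ch/View/log/index.jsp"

# Two of the parameters from the question's URL, as plain key-value pairs.
params = {"search": "true", "sortorder": "pubDateTime desc"}

# urlencode joins key and value with "=" and pairs with "&",
# percent-encoding (or "+"-encoding) anything unsafe.
query_url = base + "?" + urlencode(params)

parts = urlsplit(query_url)
print(parts.query)             # search=true&sortorder=pubDateTime+desc
print(parse_qsl(parts.query))  # [('search', 'true'), ('sortorder', 'pubDateTime desc')]
```

With requests, the same effect can be had by passing the dictionary as `params=` and letting the library build the query string.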
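A "#" starts a fragment, which tells the browser to jump to a specific section within the page. The fragment is never sent to the server, so the request effectively goes to https://essentials.swissdox.ch/View/log/index.jsp, which is the response you get. Simply changing "#" to "?" in the original URL seems to raise an error about invalid characters. Make sure you use valid characters when percent-encoding the query parameters.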
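The difference can be checked locally without contacting the site. The sketch below rebuilds the `formdata` value from the question as JSON, percent-encodes it with `urlencode`, and compares how the "#" and "?" variants are parsed. `build_formdata` is a hypothetical helper introduced only for this illustration, and it is an assumption that the server would accept these values as query parameters at all.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://essentials.swissdox.ch/View/log/index.jsp"


def build_formdata(keyword, startdate, enddate):
    """Hypothetical helper: serialize the search form fields as compact JSON."""
    fields = [
        ("SEARCH_mltid", ""),
        ("SEARCH_sc", "swissdox"),
        ("SEARCH_query", keyword),
        ("SEARCH_exact", "true"),
        ("dateDropdown", "-1"),
        ("SEARCH_pubDate_lower", startdate),
        ("SEARCH_pubDate_upper", enddate),
        ("SEARCH_tiall", ""),
        ("SEARCH_source", ""),
        ("SEARCH_author", ""),
    ]
    return json.dumps(
        [{"name": name, "value": value} for name, value in fields],
        separators=(",", ":"),
    )


formdata = build_formdata("lissabon", "2007-01-01", "2007-01-01")

# urlencode percent-encodes brackets, braces, quotes, and colons for us.
encoded = urlencode(
    {"search": "true", "sortorder": "pubDateTime desc", "formdata": formdata}
)

fragment_url = BASE + "#&" + encoded  # shape used in the question
query_url = BASE + "?" + encoded      # conventional query string

frag_parts = urlsplit(fragment_url)
query_parts = urlsplit(query_url)

# With "#", the query is empty: everything sits in the fragment,
# which an HTTP client does not send to the server.
print(repr(frag_parts.query))
print(frag_parts.path)

# With "?", the parameters are part of the request and decode cleanly.
decoded = json.loads(parse_qs(query_parts.query)["formdata"][0])
print(decoded[2])
```

Building the value with `json.dumps` and `urlencode` avoids hand-written percent escapes, which is where invalid characters tend to creep in.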
Ilango Rajagopal
2021-02-03