
Python: requests.get returns the wrong HTML file

2021-02-03

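I am trying to scrape data from https://essentials.swissdox.ch, which is only reachable over a VPN. To do that, I build a URL with query parameters and try to fetch the corresponding HTML file. The problem is that, although the link works, Python gives me the HTML file of the https://essentials.swissdox.ch start page instead. I would really appreciate any help!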

Example: I want the HTML file for the following URL: https://essentials.swissdox.ch/View/log/index.jsp#&search=true&filter_de=la&sortorder=pubDateTime%20desc&formdata=%5B%7B%22name%22%3A%22SEARCH_mltid%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_sc%22%2C%22value%22%3A%22swissdox%22%7D%2C%7B%22name%22%3A%22SEARCH_query%22%2C%22value%22%3A%22lissabon%22%7D%2C%7B%22name%22%3A%22filter_de%22%2C%22value%22%3A%22de%22%7D%2C%7B%22name%22%3A%22SEARCH_exact%22%2C%22value%22%3A%22true%22%7D%2C%7B%22name%22%3A%22dateDropdown%22%2C%22value%22%3A%22-1%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_lower%22%2C%22value%22%3A%222020-02-04%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_upper%22%2C%22value%22%3A%222020-02-04%22%7D%2C%7B%22name%22%3A%22SEARCH_tial%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_source%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_author%22%2C%22value%22%3A%22%22%7D%5D

Instead, I get the HTML file of this page: https://essentials.swissdox.ch/View/log/index.jsp?reset=true

Here is what I have so far:

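import urllib.parse
import urllib.request

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # assumed source of UserAgent; not shown in the original post

# Set keywords for the URL
keyword_queries = ['lissabon']
startdate = "2007-01-01"
enddate = "2007-01-01"

# Encode the keyword and build the URL
for keyword in keyword_queries:
    html_keyword = urllib.parse.quote_plus(keyword)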
    URL = "https://essentials.swissdox.ch/View/log/index.jsp#&search=true&sortorder=pubDateTime%20desc&formdata=%5B%7B%22name%22%3A%22SEARCH_mltid%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_sc%22%2C%22value%22%3A%22swissdox%22%7D%2C%7B%22name%22%3A%22SEARCH_query%22%2C%22value%22%3A%22" + html_keyword + "%22%7D%2C%7B%22name%22%3A%22SEARCH_exact%22%2C%22value%22%3A%22true%22%7D%2C%7B%22name%22%3A%22dateDropdown%22%2C%22value%22%3A%22-1%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_lower%22%2C%22value%22%3A%22" + startdate + "%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_upper%22%2C%22value%22%3A%22" + enddate + "%22%7D%2C%7B%22name%22%3A%22SEARCH_tiall%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_source%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_author%22%2C%22value%22%3A%22%22%7D%5D"
    weburl  = urllib.request.urlopen(URL)

    
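    # Hit the URL with a random User-Agent header
    # (headers must be passed by keyword; the second positional argument is params)
    ua = UserAgent()
    page = requests.get(URL, headers={"User-Agent": ua.random})
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find('div', class_='documentlist')
    print(page.content)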
1 Answer

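It looks like you are using "#" in the URL instead of "?". Normally "?" starts the query parameters, which are written as key-value pairs with "=" between the key and the value.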
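To see how a URL parser treats the two characters, here is a minimal sketch using the standard library's urllib.parse.urlsplit. The URL is a shortened, illustrative version of the one in the question, not the full search URL.

```python
from urllib.parse import urlsplit

# Shortened, illustrative version of the URL from the question.
url = ("https://essentials.swissdox.ch/View/log/index.jsp"
       "#&search=true&sortorder=pubDateTime%20desc")

parts = urlsplit(url)
print(repr(parts.path))      # '/View/log/index.jsp'
print(repr(parts.query))     # '' because the URL contains no "?"
print(repr(parts.fragment))  # '&search=true&sortorder=pubDateTime%20desc'
```

Everything after "#" ends up in the fragment, and the query string is empty.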

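A "#" introduces a fragment, which tells the browser to jump to a specific part of the page. The fragment is not sent to the server, so the request actually goes to https://essentials.swissdox.ch/View/log/index.jsp, and that is the response you get. Simply changing "#" to "?" in the original URL seems to raise an error about invalid characters, so make sure the query parameters contain only valid characters when you percent-encode them.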
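One way to avoid invalid characters is to let the standard library do the percent-encoding rather than concatenating pre-encoded strings by hand. The following is only a sketch: the helper name build_search_url is hypothetical, the field list is a subset of the names in the question's URL, and whether the server accepts these values as a real query string (rather than reading them from the fragment in the browser) is an assumption that has not been verified.

```python
import json
from urllib.parse import quote, urlencode

BASE_URL = "https://essentials.swissdox.ch/View/log/index.jsp"


def build_search_url(keyword, startdate, enddate):
    """Hypothetical helper: build a '?'-style URL with percent-encoded values."""
    # Field names copied from the question's URL (subset, for brevity).
    formdata = [
        {"name": "SEARCH_sc", "value": "swissdox"},
        {"name": "SEARCH_query", "value": keyword},
        {"name": "SEARCH_exact", "value": "true"},
        {"name": "SEARCH_pubDate_lower", "value": startdate},
        {"name": "SEARCH_pubDate_upper", "value": enddate},
    ]
    params = {
        "search": "true",
        "sortorder": "pubDateTime desc",
        # Compact JSON, matching the layout seen in the question's URL.
        "formdata": json.dumps(formdata, separators=(",", ":")),
    }
    # quote_via=quote encodes spaces as %20 instead of "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_search_url("lissabon", "2007-01-01", "2007-01-01")
print(url)
```

With requests, the same dictionary could instead be passed as requests.get(BASE_URL, params=params, headers=...), which performs the encoding itself (spaces then appear as "+" rather than "%20").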

Wiki - URL syntax

Ilango Rajagopal
2021-02-03