Scraping data from multiple tooltips with Python and Selenium

2022-01-05

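I am trying to scrape hydrogen sulfide data from this website using Python and Selenium. What I am stuck on is how to get the data for each tooltip (site ID, site name, date, value, unit, etc.). As you can see, there are seven monitoring points, A through G, each with its own data. I have done a lot of research but am still stuck. I wrote the following code to scrape the data for a specific date, but ran into an error. Please see the code below.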

from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")

# Navigate to monitors
button = driver.find_element_by_xpath("//div[@class='nav-link-text']")   
button.click()

# Navigate to dropdown button
dropdown = driver.find_element_by_xpath("//i[@class='arrow-down parameter-arrow']") 
dropdown.click()

# Select Hydrogen Sulfide and click
h2s = driver.find_element_by_xpath("//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]")
h2s.click()

res = []
test = driver.find_elements_by_xpath("//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
for ele in test:
    hover = ActionChains(driver).move_to_element(ele)
    hover.perform()
    try:
        site_id = driver.find_element_by_css_selector(".LAR-tooltip-site-id > p")
        site_name = driver.find_element_by_css_selector(".LAR-tooltip-site-name")
        date = driver.find_element_by_css_selector(".LAR-tooltip-localtime")
        value = driver.find_element_by_css_selector(".LAR-tooltip-data-value")
        unit = driver.find_element_by_css_selector(".LAR-tooltip-data-unit")
        para_mdl = driver.find_element_by_css_selector(".tooltip-parameter-mdl")
        res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
    except:
        pass

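I would really appreciate it if anyone could help me solve this. I would also like to use the code above to scrape data over a time window (say, from August 1, 2021 to January 1, 2022), so any feedback is very welcome.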

3 Answers

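It looks like all your code needs is a few WebDriverWaits. If I remember correctly, React-based websites are somewhat harder to automate because of the many async operations and the virtual DOM. I have refactored your code with WebDriverWaits where needed (and also removed several lines, though you can keep them if you prefer better readability). Here is the code: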

from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")
# Navigate to monitors
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='nav-link-text']"))).click()
# Navigate to dropdown button
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//i[@class='arrow-down parameter-arrow']"))).click()
# Select Hydrogen Sulfide and click
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]"))).click()
# Wait until the map markers are visible
WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")))
# Open the date picker
driver.find_element(By.CSS_SELECTOR, ".arrow-down.date-arrow").click()
req_month = 'Aug'
req_year = '2021'
req_timeline = req_month + " " + req_year
print(f"Timeline Selected is: {req_timeline}")
for i in range(11):
    month = driver.find_element(By.XPATH, "//th[@class='month']").text
    if month == req_timeline:
        break
    else:
        driver.find_element(By.XPATH, "//th[@class='prev available']").click()
driver.find_element(By.XPATH, "//*[@class='table-condensed']//td[text()='1']").click()
driver.find_element(By.XPATH, "//*[text()='Apply']").click()
time.sleep(8)
res = []
test = driver.find_elements(By.XPATH, "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
for ele in test:
    # Hover over the marker so that its tooltip is rendered
    hover = ActionChains(driver).move_to_element(ele)
    hover.perform()
    time.sleep(1)
    try:
        site_id = driver.find_element(By.CSS_SELECTOR, ".LAR-tooltip-site-id > p")
        site_name = driver.find_element(By.CSS_SELECTOR, ".LAR-tooltip-site-name")
        date = driver.find_element(By.CSS_SELECTOR, ".LAR-tooltip-localtime")
        value = driver.find_element(By.CSS_SELECTOR, ".LAR-tooltip-data-value")
        unit = driver.find_element(By.CSS_SELECTOR, ".LAR-tooltip-data-unit")
        para_mdl = driver.find_element(By.CSS_SELECTOR, ".tooltip-parameter-mdl")
        res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
    except Exception:
        # Skip markers whose tooltip fields could not be read
        pass
print(res)

Here is the result:

Timeline Selected is: Aug 2021
[('F', 'Point Monitor', '7:55 AM', '1.80', 'ppb', 'MDL: 0.40 ppb'), ('B', 'Point Monitor', '7:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '7:55 AM', '1.10', 'ppb', 'MDL: 0.40 ppb'), ('A', 'Point Monitor', '7:55 AM', '0.40', 'ppb', 'MDL: 0.40 ppb')]

Process finished with exit code 0

As you can see, even with WebDriverWaits in place, some spots still need a hard stop with time.sleep; otherwise the test becomes flaky.

Anand Gautam
2022-01-06
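
One way to reduce the fixed time.sleep(1) after each hover is to poll until the tooltip actually contains text. The following is a minimal sketch, not the answerer's method: `read_tooltip` and `TOOLTIP_FIELDS` are hypothetical names, the CSS selectors are copied from the code above, and `wait` is assumed to be a `WebDriverWait(driver, timeout)` instance. The string "css selector" is the value of `By.CSS_SELECTOR`, used here so the helper needs no imports of its own.

```python
# CSS selectors copied from the answer above; the dict itself is a hypothetical helper.
TOOLTIP_FIELDS = {
    "site_id": ".LAR-tooltip-site-id > p",
    "site_name": ".LAR-tooltip-site-name",
    "date": ".LAR-tooltip-localtime",
    "value": ".LAR-tooltip-data-value",
    "unit": ".LAR-tooltip-data-unit",
    "para_mdl": ".tooltip-parameter-mdl",
}

CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def read_tooltip(driver, wait):
    """Wait until the tooltip's site id has text, then return all fields as a dict.

    driver: a Selenium WebDriver (assumed).
    wait:   a WebDriverWait bound to that driver (assumed).
    """

    def site_id_has_text(d):
        text = d.find_element(CSS, TOOLTIP_FIELDS["site_id"]).text.strip()
        return text or False  # a falsy result makes the wait poll again

    wait.until(site_id_has_text)
    return {
        name: driver.find_element(CSS, selector).text
        for name, selector in TOOLTIP_FIELDS.items()
    }
```

In the hover loop this could replace the sleep and the try block, for example `res.append(read_tooltip(driver, WebDriverWait(driver, 5)))` after `hover.perform()`, with `TimeoutException` caught for markers that show no tooltip. It does not detect a tooltip left over from the previous marker, so comparing the site ID with the last one read may still be needed.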

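@ThaiNguyen, I am adding another answer so that the previous one stays as it is. I tried some brute-force approaches and, after many attempts, got it working, but treat it with caution because I only iterated over 3 dates in August. The refactored code is pasted below, but before you look at it, let me explain the problems I ran into so that you can flag them for handling.

I had to add a lot of sleeps to let the DOM settle after each action (as you know, time.sleep is very unreliable with async behavior). Even with the waits in place, I kept seeing the code fail on stale elements, and adding sleeps helped me handle them (for now).

The other thing, which is a big one for me: even though this code fetched the results successfully, I cannot guarantee that it will do so for every date in August (let alone all the months you need), because the code behaves very erratically against the rendered DOM. I do not want to blame the code at this point (my Selenium knowledge is limited), but if I am not mistaken, the DOM is heavily asynchronous.

So I would say that with this code you cannot expect to get everything in one go. Instead, you may have to spend time refactoring and improving it, or collect the data in chunks over multiple runs, a few dates of each month at a time, which is very frustrating because it gets flaky so easily.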

from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
def h2s_selection():
    driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")
    # Navigate to monitors
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='nav-link-text']"))).click()
    # Navigate to dropdown button
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//i[@class='arrow-down parameter-arrow']"))).click()
    # Select Hydrogen Sulfide and click
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]"))).click()
    # Wait until the map markers are visible
    WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")))

def aug_date():
    driver.find_element(By.CSS_SELECTOR, ".arrow-down.date-arrow").click()
    req_month = 'Aug'
    req_year = '2021'
    req_timeline = req_month + " " + req_year
    print(f"Timeline Selected is: {req_timeline}")
    for i in range(11):
        month = driver.find_element(By.XPATH, "//th[@class='month']").text
        if month == req_timeline:
            break
        else:
            driver.find_element(By.XPATH, "//th[@class='prev available']").click()
    dt = ['1', '2', '3']
    for i in dt:
        time.sleep(5)
        each_date = driver.find_element(By.XPATH, f"//*[@class='table-condensed']//td[text()='{i}']")
        print(f"Date is {each_date.text}")
        each_date.click()
        driver.find_element(By.XPATH, "//*[text()='Apply']").click()
        time.sleep(10)
        tooltips()
        time.sleep(5)
        driver.find_element(By.CSS_SELECTOR, ".arrow-down.date-arrow").click()

def tooltips():
    # time.sleep(8)
    res = []
    test = driver.find_elements(By.XPATH, "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
    for ele in test:
        # Hover over the marker so that its tooltip is rendered
        hover = ActionChains(driver).move_to_element(ele)
        hover.perform()
        time.sleep(1)
        try:
            site_id = driver.find_element(By.CSS_SELECTOR, ".LAR-tooltip-site-id > p")
            site_name = driver.find_element(By.CSS_SELECTOR, ".LAR-tooltip-site-name")
            date = driver.find_element(By.CSS_SELECTOR, ".LAR-tooltip-localtime")
            value = driver.find_element(By.CSS_SELECTOR, ".LAR-tooltip-data-value")
            unit = driver.find_element(By.CSS_SELECTOR, ".LAR-tooltip-data-unit")
            para_mdl = driver.find_element(By.CSS_SELECTOR, ".tooltip-parameter-mdl")
            res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
        except Exception:
            # Skip markers whose tooltip fields could not be read
            pass
    print(res)


if __name__ == "__main__":
    h2s_selection()
    aug_date()

Output:

Timeline Selected is: Aug 2021
Date is 1
[('F', 'Point Monitor', '10:55 AM', '0.90', 'ppb', 'MDL: 0.40 ppb'), ('B', 'Point Monitor', '10:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '10:55 AM', '1.30', 'ppb', 'MDL: 0.40 ppb'), ('A', 'Point Monitor', '10:55 AM', '0.60', 'ppb', 'MDL: 0.40 ppb')]
Date is 2
[('B', 'Point Monitor', '10:25 PM', '1.70', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '10:25 PM', '1.90', 'ppb', 'MDL: 0.40 ppb')]
Date is 3
[('F', 'Point Monitor', '9:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('B', 'Point Monitor', '9:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '9:55 AM', '1.90', 'ppb', 'MDL: 0.40 ppb'), ('A', 'Point Monitor', '9:55 AM', '0.50', 'ppb', 'MDL: 0.40 ppb')]

Process finished with exit code 0
Anand Gautam
2022-01-08
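
The hard-coded month ('Aug', '2021') and day list (['1', '2', '3']) in the code above could be generated for the whole window the question asks about (August 1, 2021 to January 1, 2022). This is only a sketch under assumptions: `days_by_month` is a hypothetical helper, and the "Aug 2021" label format is taken from the calendar header text that the code compares against. Grouping by month fits the suggestion to collect the data in chunks.

```python
from datetime import date, timedelta

# English month abbreviations, hard-coded so the labels do not depend on the locale.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def days_by_month(start, end):
    """Group every day from start to end (inclusive) under a 'Mon YYYY' label.

    Returns a dict such as {"Aug 2021": ["1", "2", ...], ...}; keys appear in
    chronological order because dicts keep insertion order.
    """
    groups = {}
    current = start
    while current <= end:
        label = f"{MONTH_ABBR[current.month - 1]} {current.year}"
        groups.setdefault(label, []).append(str(current.day))
        current += timedelta(days=1)
    return groups


if __name__ == "__main__":
    plan = days_by_month(date(2021, 8, 1), date(2022, 1, 1))
    for label, days in plan.items():
        print(label, days[0], "to", days[-1], f"({len(days)} days)")
```

Each label could then play the role of `req_timeline`, and each day string the value compared with `text()`. Because the navigation loop above only clicks the prev arrow, working through the months from latest to earliest may fit it best.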

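@AnandGautam, I realized that whenever I try to scrape a whole month of data (say, September 2021), everything goes smoothly until I reach the 29th, because August 29 and September 29 appear on the same calendar. So, to make the XPath unique, I modified it slightly by adding @data-title=. But I ran into some errors. I tried validating the XPath and found it to be valid, so I still do not know why the error occurs. Please see the code below.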

from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
def h2s_selection():
    driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='nav-link-text']"))).click()
    # Navigate to monitors
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//i[@class='arrow-down parameter-arrow']"))).click()
    # Navigate to dropdown button
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]"))).click()
    # Select Hydrogen Sulfide and click
    WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")))

def month_data(req_month, req_year, data):
    driver.find_element_by_css_selector(".arrow-down.date-arrow").click()
    req_timeline = req_month + " " + req_year
    print(f"Timeline Selected is: {req_timeline}")
    for i in range(11):
        month = driver.find_element(By.XPATH, "//th[@class='month']").text
        if month == req_timeline:
            break
        else:
            driver.find_element(By.XPATH, "//th[@class='prev available']").click()
    
    for k, v in data.items():
        time.sleep(5)
        each_date = driver.find_element(By.XPATH, f"//*[@class='table-condensed']//td[text()={k} and @data-title={v}]")
        #print(f"Date is {each_date.text}")
        each_date.click()
        driver.find_element(By.XPATH, "//*[text()='Apply']").click()
        time.sleep(10)
        tooltips()
        time.sleep(5)
        driver.find_element_by_css_selector(".arrow-down.date-arrow").click()

def tooltips():
    # time.sleep(8)
    res = []
    test = driver.find_elements_by_xpath("//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
    for ele in test:
        hover = ActionChains(driver).move_to_element(ele)
        hover.perform()
        time.sleep(1)
        try:
            site_id = driver.find_element_by_css_selector(".LAR-tooltip-site-id > p")
            site_name = driver.find_element_by_css_selector(".LAR-tooltip-site-name")
            date = driver.find_element_by_css_selector(".LAR-tooltip-localtime")
            value = driver.find_element_by_css_selector(".LAR-tooltip-data-value")
            unit = driver.find_element_by_css_selector(".LAR-tooltip-data-unit")
            para_mdl = driver.find_element_by_css_selector(".tooltip-parameter-mdl")
            res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
        except:
            pass
    print(res)


if __name__ == "__main__":
    h2s_selection()
    data_dict = {'29': 'r4c3', '30': 'r4c4'}
    month_data(req_month='Sep', req_year='2021', data=data_dict)

I would appreciate any pointers or feedback on how to solve this. Thank you!

Thai Nguyen
2022-01-12
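
A likely cause of the error, assuming it is a NoSuchElementException: in the f-string above, the value of @data-title is interpolated without quotes, so the XPath becomes @data-title=r4c3. That is syntactically valid, which would explain why a validator accepts it, but XPath reads the bare r4c3 as the name of a child element rather than as a string, so the comparison never matches. (text()=29 still works because 29 is parsed as a number.) A minimal sketch of a quoted version follows; `day_cell_xpath` is a hypothetical helper name.

```python
def day_cell_xpath(day, data_title):
    """Build an XPath for one calendar cell, quoting both values as string literals."""
    day = str(day)
    data_title = str(data_title)
    for value in (day, data_title):
        # A single quote inside the value would break the quoted XPath literal.
        if "'" in value:
            raise ValueError(f"value must not contain a single quote: {value!r}")
    return (
        "//*[@class='table-condensed']"
        f"//td[text()='{day}' and @data-title='{data_title}']"
    )


if __name__ == "__main__":
    # Same day-to-cell mapping as data_dict in the post above.
    for day, title in {"29": "r4c3", "30": "r4c4"}.items():
        print(day_cell_xpath(day, title))
```

With this, the lookup in `month_data` would read `driver.find_element(By.XPATH, day_cell_xpath(k, v))`.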