
HTML elements not being selected in Puppeteer

2021-05-08

I pulled the following HTML from a web page:

<li class="PaEvOc tv5olb wbTnP gws-horizon-textlists__li-ed">
  <!-- random div/element stuff inside here -->
</li>
<li class="PaEvOc tv5olb gws-horizon-textlists__li-ed">
  <!-- random div/element stuff inside here as well -->
</li>

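I'm not sure how to copy the HTML properly, but if you search for "events near [location]" in Google Chrome, this is the content I'm looking at and trying to scrape data from: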

https://i.sstatic.net/fv4a4.png

To start with, I'm just trying to figure out how to select these elements correctly in Puppeteer:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ args: [
  '--no-sandbox'
  ]});
  const page = await browser.newPage();
  page.once('load', () => console.log('Page loaded!'));
  await page.goto('https://www.google.com/search?q=events+near+poughkeepsie+today&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail');
  console.log('Hit wait for selector')
  const test = await page.waitForSelector(".PaEvOc");
  console.log('finished waiting for selector');
  const seeMoreEventsButton = await page.$(".PaEvOc");

  console.log('seeMoreEventsButton is ' + seeMoreEventsButton);
  console.log('test is ' + test);
})();

What exactly is going wrong here? Any help is much appreciated, thanks!

1 Answer

I'd suggest reading this: https://intoli.com/blog/not-possible-to-block-chrome-headless/

Basically, the site detects that you're scraping it with a headless browser, but you can work around that.
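One signal a site can look at is the user agent: headless Chrome's default user agent string contains "HeadlessChrome". What exactly Google checks is not something I can confirm, so treat the following as a minimal sketch of the idea rather than a guaranteed fix. The helper names `maskHeadlessUserAgent` and `openMaskedPage` are made up for illustration; `browser.userAgent()` and `page.setUserAgent()` are Puppeteer calls.

```javascript
// Hypothetical helper (not part of Puppeteer): drops the "Headless" marker
// from a user agent string and leaves everything else untouched.
const maskHeadlessUserAgent = (ua) => ua.replace('HeadlessChrome', 'Chrome');

// Hypothetical helper: opens a page whose user agent is the browser's own
// default with the headless marker removed. `browser` is assumed to be the
// object returned by puppeteer.launch().
async function openMaskedPage(browser) {
  const page = await browser.newPage();
  const defaultUa = await browser.userAgent();
  await page.setUserAgent(maskHeadlessUserAgent(defaultUa));
  return page;
}

// Illustrative input only; the version numbers are an example.
const sampleUa = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) HeadlessChrome/90.0.4430.93 Safari/537.36';
console.log(maskHeadlessUserAgent(sampleUa));
```

Deriving the string from `browser.userAgent()` keeps the reported Chrome version in line with the browser that is actually running, whereas a hard-coded string can go stale.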

Here is how I got your console logs to print something useful:

const puppeteer = require('puppeteer');

(async () => {
  // Present a regular desktop Chrome user agent instead of the headless default.
  const preparePageForTests = async (page) => {
    const userAgent = 'Mozilla/5.0 (X11; Linux x86_64) ' +
      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36';
    await page.setUserAgent(userAgent);
  };

  const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
  const page = await browser.newPage();
  await preparePageForTests(page);

  page.once('load', () => console.log('Page loaded!'));
  await page.goto('https://www.google.com/search?q=events+near+poughkeepsie+today&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail');

  console.log('Hit wait for selector');
  const test = await page.waitForSelector(".PaEvOc");

  console.log('finished waiting for selector');
  const seeMoreEventsButton = await page.$(".PaEvOc");

  console.log('seeMoreEventsButton is ' + seeMoreEventsButton);
  console.log('test is ' + test);

  // Close the browser so the Node process can exit.
  await browser.close();
})();
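
Note that `test` and `seeMoreEventsButton` are element handles, so the logs above confirm that the selector matched but do not show any event data. As a possible next step, here is a minimal sketch that reads the visible text of every matching `<li>` with `page.$$eval()`. The names `collectText` and `scrapeEventText` are made up for illustration, and the sketch assumes the `.PaEvOc` class from the question is still what Google renders.

```javascript
// Hypothetical helper: runs inside the page and turns the matched <li>
// nodes into trimmed strings, dropping any that are empty.
const collectText = (items) =>
  items.map((li) => li.innerText.trim()).filter(Boolean);

// Hypothetical helper: `page` is assumed to be a Puppeteer Page that has
// already navigated to the events results.
async function scrapeEventText(page) {
  await page.waitForSelector('.PaEvOc');
  // $$eval runs collectText in the browser over all matching elements.
  return page.$$eval('.PaEvOc', collectText);
}

// Usage, inside the async block above and before browser.close():
//   const events = await scrapeEventText(page);
//   console.log(events);
```

`collectText` is kept free of outer variables because Puppeteer serializes the function and evaluates it in the page context, where Node-side variables are not available.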
olore
2021-05-08