Puppeteer 中未选择 HTML 元素
2021-05-08
777
因此,我从网页中摘录了以下 HTML 内容:
<li class="PaEvOc tv5olb wbTnP gws-horizon-textlists__li-ed">
//random div/element stuff inside here
</li>
<li class ="PaEvOc tv5olb gws-horizon-textlists__li-ed">
//random div/element stuff inside here as well
</li>
不确定如何正确复制 HTML,但如果您在 Google Chrome 上查看“ 位置 附近的事件”,我正在查看这些内容并尝试从中抓取数据:
https://i.sstatic.net/fv4a4.png
首先,我只是想弄清楚如何在 Puppeteer 中正确选择这些元素:
(async () => {
const browser = await puppeteer.launch({ args: [
'--no-sandbox'
]});
const page = await browser.newPage();
page.once('load', () => console.log('Page loaded!'));
await page.goto('https://www.google.com/search?q=events+near+poughkeepsie+today&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail');
console.log('Hit wait for selector')
const test = await page.waitForSelector(".PaEvOc");
console.log('finished waiting for selector');
const seeMoreEventsButton = await page.$(".PaEvOc");
console.log('seeMoreEventsButton is ' + seeMoreEventsButton);
console.log('test is ' + test);
})();
这里到底是什么问题?非常感谢任何帮助,谢谢!
1个回答
我建议阅读此文: https://intoli.com/blog/not-possible-to-block-chrome-headless/
基本上,网站会检测到您正在抓取数据,但您可以绕过它。
下面是我让您的控制台日志打印一些有用内容的方法
const puppeteer = require('puppeteer');
(async () => {
const preparePageForTests = async (page) => {
const userAgent = 'Mozilla/5.0 (X11; Linux x86_64)' +
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36';
await page.setUserAgent(userAgent);
}
const browser = await puppeteer.launch({ args: [
'--no-sandbox'
]});
const page = await browser.newPage();
await preparePageForTests(page);
page.once('load', () => console.log('Page loaded!'));
await page.goto('https://www.google.com/search?q=events+near+poughkeepsie+today&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail');
console.log('Hit wait for selector')
const test = await page.waitForSelector(".PaEvOc");
console.log('finished waiting for selector');
const seeMoreEventsButton = await page.$(".PaEvOc");
console.log('seeMoreEventsButton is ' + seeMoreEventsButton);
console.log('test is ' + test);
})();
olore
2021-05-08