PHP - How do I use preg_match_all to get the src of img tags with a specific class name?
2019-05-16
I'm trying to build a scraper for Amazon product search listing pages.
Function:
function getHTMLcode($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 10.10; labnol;) ctrlq.org");
    curl_setopt($curl, CURLOPT_ENCODING, 'identity');
    curl_setopt($curl, CURLOPT_FAILONERROR, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($curl);
    curl_close($curl);
    return $html;
}
Calling the function:
$url="http://www.amazon.com/s/?url=search-alias%3Daps&field-keywords=iphone";
$html= getHTMLcode($url);
$image = '/src="(?P<img>[^"]*)"/';
preg_match_all($image,$html,$data);
var_dump($data);
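Problem: this returns every src attribute on the page. I only need the ones from product images with
class="s-image"
and it does not return the h2 (product title) or the price tags.
Question: how can I get only the images with a specific class name, along with the titles and prices, from an Amazon product search listing? Amazon returns: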
<img src="https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL436_.jpg" class="s-image" alt="Apple iPhone Xs Max with FaceTime - 256GB, 4G LTE, Space Gray" srcset="https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL436_.jpg 1x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL654_FMwebp_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL872_FMwebp_QL65_.jpg 2x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL1090_FMwebp_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL1308_FMwebp_QL65_.jpg 3x" data-image-index="0" data-image-load="" data-image-latency="s-product-image" data-image-source-density="1">
Similarly, to get the product title and price, I'm trying:
$title = '/<h2 class="a-size-mini a-spacing-none a-color-base s-line-clamp-2">(?P<val>[^>]*)<\/h2>/';
preg_match_all($title,$html,$value);
var_dump($value);
$price ='/<span class="a-price-whole><span class="a-price-symbol"> <\/span>(?P<price>[^>]*)<\/span>/';
preg_match_all($price,$html,$cost);
var_dump($cost);
1 Answer
You're using the wrong tool. You should use an HTML parser for this, with an XPath query to find what you need:
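<?php
$url = "http://www.amazon.com/s/?url=search-alias%3Daps&field-keywords=iphone";
$html = getHTMLcode($url); // getHTMLcode() is the function from the question

// Collect libxml warnings about malformed markup instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the src attribute of every img whose class contains "s-image"
$nodes = $xpath->query("//img[contains(@class, 's-image')]/@src");
$data = [];
foreach ($nodes as $node) {
    $data[] = $node->textContent;
}
print_r($data);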
Output:
Array
(
[0] => https://m.media-amazon.com/images/I/418H4DiygbL._AC_UL436_.jpg
[1] => https://m.media-amazon.com/images/I/61IzJCh8i8L._AC_UL436_.jpg
[2] => https://m.media-amazon.com/images/I/71RYhD1uzpL._AC_UL436_.jpg
[3] => https://m.media-amazon.com/images/I/41jUosGQiDL._AC_UL436_.jpg
[4] => https://m.media-amazon.com/images/I/51CBPR-l2VL._AC_UL436_.jpg
[5] => https://m.media-amazon.com/images/I/813nLXVhnwL._AC_UL436_.jpg
[6] => https://m.media-amazon.com/images/I/61WpoMEdpoL._AC_UL436_.jpg
[7] => https://m.media-amazon.com/images/I/913VoEdo-4L._AC_UL436_.jpg
[8] => https://m.media-amazon.com/images/I/81s7ZLOGOWL._AC_UL436_.jpg
[9] => https://m.media-amazon.com/images/I/81s7ZLOGOWL._AC_UL436_.jpg
[10] => https://m.media-amazon.com/images/I/513R4aVg1cL._AC_UL436_.jpg
[11] => https://m.media-amazon.com/images/I/51BbI-8wpTL._AC_UL436_.jpg
[12] => https://m.media-amazon.com/images/I/61pRPj+-IYL._AC_UL436_.jpg
[13] => https://m.media-amazon.com/images/I/71x3e0x+M2L._AC_UL436_.jpg
[14] => https://m.media-amazon.com/images/I/6165FLUs1+L._AC_UL436_.jpg
[15] => https://m.media-amazon.com/images/I/81ZJNQZBFCL._AC_UL436_.jpg
[16] => https://m.media-amazon.com/images/I/51sTR66B1UL._AC_UL436_.jpg
[17] => https://m.media-amazon.com/images/I/71QxMMTKiVL._AC_UL436_.jpg
[18] => https://m.media-amazon.com/images/I/61OUrdtiDcL._AC_UL436_.jpg
[19] => https://m.media-amazon.com/images/I/71ktNlpWWdL._AC_UL436_.jpg
[20] => https://m.media-amazon.com/images/I/51x3FM83EQL._AC_UL436_.jpg
[21] => https://m.media-amazon.com/images/I/41-Mv2nSrNL._AC_UL436_.jpg
)
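The question also asked for titles and prices. The same approach might extend to those by looping over each result container and running relative XPath queries inside it, so that image, title and price stay paired per product. Below is a minimal sketch under stated assumptions: the markup is a made-up, simplified stand-in for the real page, the `s-result-item` container class and the `hasClass()` helper are hypothetical, and only `s-image`, `h2` and `a-price-whole` come from the question. Amazon's real structure may differ and can change at any time.

```php
<?php
// Hypothetical, simplified markup standing in for a search results page.
// The real page structure is an assumption and may differ or change.
$html = <<<'HTML'
<div class="s-result-item">
  <img src="https://example.com/a.jpg" class="s-image" alt="Phone A">
  <h2 class="a-size-mini"><a href="#"><span>Phone A</span></a></h2>
  <span class="a-price"><span class="a-price-whole">999</span></span>
</div>
<div class="s-result-item">
  <img src="https://example.com/b.jpg" class="s-image extra" alt="Phone B">
  <h2 class="a-size-mini"><a href="#"><span>Phone B</span></a></h2>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Hypothetical helper: builds an XPath test that matches a whole class token,
// so 's-image' does not also match something like 's-image-wrapper'.
function hasClass(string $name): string {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$products = [];
// Walk each result container, then query relative to it so the fields stay paired.
foreach ($xpath->query('//div[' . hasClass('s-result-item') . ']') as $item) {
    $products[] = [
        'image' => $xpath->evaluate('string(.//img[' . hasClass('s-image') . ']/@src)', $item),
        'title' => trim($xpath->evaluate('string(.//h2)', $item)),
        // string() of an empty node set is '', so a missing price becomes an empty string
        'price' => trim($xpath->evaluate('string(.//span[' . hasClass('a-price-whole') . '])', $item)),
    ];
}
print_r($products);
```

Querying relative to each container (the leading `.` in `.//h2`) is the design choice that matters here: three separate page-wide queries would give three lists that fall out of step as soon as one product has no price.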
miken32
2019-05-16