PHP: How to use preg_match_all to get the src of img tags with a specific class name?

2019-05-16

I am trying to build a scraper for Amazon product search listing pages.

Method:

function getHTMLcode($url) {

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 10.10; labnol;) ctrlq.org");
    curl_setopt($curl, CURLOPT_ENCODING, 'identity');
    curl_setopt($curl, CURLOPT_FAILONERROR, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($curl);
    curl_close($curl);

    return $html;

}

Method call:

  $url="http://www.amazon.com/s/?url=search-alias%3Daps&field-keywords=iphone";

  $html= getHTMLcode($url);
  $image = '/src="(?P<img>[^"]*)"/';  
  preg_match_all($image,$html,$data);
  var_dump($data);

Problem: this returns every src attribute on the page. I only need the images with class="s-image", and it does not return the h2 (product title) or price tags.

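Question: how do I get only the images with a specific class name, along with the title and price tags, from an Amazon product search listing? Amazon returns: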

<img src="https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL436_.jpg" class="s-image" alt="Apple iPhone Xs Max with FaceTime - 256GB, 4G LTE, Space Gray" srcset="https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL436_.jpg 1x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL654_FMwebp_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL872_FMwebp_QL65_.jpg 2x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL1090_FMwebp_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL1308_FMwebp_QL65_.jpg 3x" data-image-index="0" data-image-load="" data-image-latency="s-product-image" data-image-source-density="1">

Similarly, to get the product title and price, I am trying:

$title = '/<h2 class="a-size-mini a-spacing-none a-color-base s-line-clamp-2">(?P<val>[^>]*)<\/h2>/';
preg_match_all($title, $html, $value);
var_dump($value);

$price = '/<span class="a-price-whole"><span class="a-price-symbol">&nbsp;&nbsp;<\/span>(?P<price>[^>]*)<\/span>/';
preg_match_all($price, $html, $cost);
var_dump($cost);

1 Answer

You are using the wrong tool. You should use an HTML parser for this, with an XPath query to find what you need:

<?php
$url = "http://www.amazon.com/s/?url=search-alias%3Daps&field-keywords=iphone";
$html = getHTMLcode($url);
$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select the src attribute of every img whose class contains "s-image".
$nodes = $xpath->query("//img[contains(@class, 's-image')]/@src");
$data = [];
foreach ($nodes as $node) {
    $data[] = $node->textContent;
}
print_r($data);

Output:

Array
(
    [0] => https://m.media-amazon.com/images/I/418H4DiygbL._AC_UL436_.jpg
    [1] => https://m.media-amazon.com/images/I/61IzJCh8i8L._AC_UL436_.jpg
    [2] => https://m.media-amazon.com/images/I/71RYhD1uzpL._AC_UL436_.jpg
    [3] => https://m.media-amazon.com/images/I/41jUosGQiDL._AC_UL436_.jpg
    [4] => https://m.media-amazon.com/images/I/51CBPR-l2VL._AC_UL436_.jpg
    [5] => https://m.media-amazon.com/images/I/813nLXVhnwL._AC_UL436_.jpg
    [6] => https://m.media-amazon.com/images/I/61WpoMEdpoL._AC_UL436_.jpg
    [7] => https://m.media-amazon.com/images/I/913VoEdo-4L._AC_UL436_.jpg
    [8] => https://m.media-amazon.com/images/I/81s7ZLOGOWL._AC_UL436_.jpg
    [9] => https://m.media-amazon.com/images/I/81s7ZLOGOWL._AC_UL436_.jpg
    [10] => https://m.media-amazon.com/images/I/513R4aVg1cL._AC_UL436_.jpg
    [11] => https://m.media-amazon.com/images/I/51BbI-8wpTL._AC_UL436_.jpg
    [12] => https://m.media-amazon.com/images/I/61pRPj+-IYL._AC_UL436_.jpg
    [13] => https://m.media-amazon.com/images/I/71x3e0x+M2L._AC_UL436_.jpg
    [14] => https://m.media-amazon.com/images/I/6165FLUs1+L._AC_UL436_.jpg
    [15] => https://m.media-amazon.com/images/I/81ZJNQZBFCL._AC_UL436_.jpg
    [16] => https://m.media-amazon.com/images/I/51sTR66B1UL._AC_UL436_.jpg
    [17] => https://m.media-amazon.com/images/I/71QxMMTKiVL._AC_UL436_.jpg
    [18] => https://m.media-amazon.com/images/I/61OUrdtiDcL._AC_UL436_.jpg
    [19] => https://m.media-amazon.com/images/I/71ktNlpWWdL._AC_UL436_.jpg
    [20] => https://m.media-amazon.com/images/I/51x3FM83EQL._AC_UL436_.jpg
    [21] => https://m.media-amazon.com/images/I/41-Mv2nSrNL._AC_UL436_.jpg
)
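
One caveat about the query above: contains(@class, 's-image') is a substring test, so it would also match a class such as "s-image-wide". The following is a minimal sketch of a whole-token class match, assuming a small made-up HTML fragment in place of the real page; the hasClass() helper is hypothetical and not part of PHP or of the answer above.

```php
<?php
// Hypothetical sample markup; a real Amazon page will look different.
$html = <<<'HTML'
<div>
  <img src="a.jpg" class="s-image">
  <img src="b.jpg" class="s-image-wide">
  <img src="c.jpg" class="thumb s-image lazy">
</div>
HTML;

/**
 * Hypothetical helper: builds an XPath 1.0 predicate that matches
 * $class only as a complete, whitespace-delimited class token.
 */
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Substring match: also picks up "s-image-wide".
$loose = [];
foreach ($xpath->query("//img[contains(@class, 's-image')]/@src") as $attr) {
    $loose[] = $attr->nodeValue;
}

// Token match: only elements that carry the exact class "s-image".
$exact = [];
foreach ($xpath->query('//img[' . hasClass('s-image') . ']/@src') as $attr) {
    $exact[] = $attr->nodeValue;
}

print_r($loose); // a.jpg, b.jpg, c.jpg
print_r($exact); // a.jpg, c.jpg
```

Whether the stricter match matters depends on which other class names the page actually uses; for the markup quoted in the question, both forms select the same image.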
miken32
2019-05-16
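
The question also asks for the title and price, which the answer's query does not cover. The same XPath approach may extend to them by iterating over each result container and running relative queries inside it. This is only a sketch under assumptions: the sample markup is invented, the "s-result-item" container class is assumed, and the "a-price-whole" class is taken from the question's regex rather than verified against a live page.

```php
<?php
// Hypothetical markup standing in for a fetched results page.
$html = <<<'HTML'
<div class="s-result-item">
  <img src="https://example.com/one.jpg" class="s-image" alt="Phone One">
  <h2 class="a-size-mini s-line-clamp-2"><a href="#"><span>Phone One</span></a></h2>
  <span class="a-price-whole">999</span>
</div>
<div class="s-result-item">
  <img src="https://example.com/two.jpg" class="s-image" alt="Phone Two">
  <h2 class="a-size-mini s-line-clamp-2"><a href="#"><span>Phone Two</span></a></h2>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$products = [];
// "s-result-item" is an assumed container class, one element per product.
foreach ($xpath->query("//div[contains(@class, 's-result-item')]") as $item) {
    // string(...) yields the first match's text, or '' when nothing matches.
    // The leading "." keeps each query relative to the current container.
    $src   = $xpath->evaluate("string(.//img[contains(@class, 's-image')]/@src)", $item);
    $title = $xpath->evaluate("string(.//h2)", $item);
    $price = $xpath->evaluate("string(.//span[contains(@class, 'a-price-whole')])", $item);

    $products[] = [
        'image' => $src,
        'title' => trim($title),
        'price' => $price === '' ? null : trim($price), // null when no price is listed
    ];
}

print_r($products);
```

Querying relative to each container keeps an image, title, and price from the same product together, which three independent page-wide queries would not guarantee when some results lack a price.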