A while ago I had to crawl some websites to gather information about products. In the past I’ve used RegExp to parse the HTML, knowing it’s not the best method, but I just felt that PHP’s DOMDocument was clumsy.
I started coding the crawler with CakePHP 2.5.x and the following libraries: electrolinux/phpquery and php-curl-class/php-curl-class.
php-curl-class is pretty straightforward: it just makes curl easier to work with. phpQuery, in turn, is a library that lets you use CSS3 selectors just like you do with jQuery.
I know it’s lame, but as an example let’s get the title of SaveWalterWhite.
<?php
$curl = new \Curl\Curl();
$curl->get("http://www.savewalterwhite.com");
$pq = phpQuery::newDocument($curl->response);
echo $pq->find('title')->text();
?>
Obviously you can do more complex stuff, like getting all the image paths that are inside list items of the #walter-container div.
<?php
$curl = new \Curl\Curl();
$curl->get("http://www.savewalterwhite.com");
$pq = phpQuery::newDocument($curl->response);
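// attr() only returns the value for the first match, so loop over every matched <img>.
$pics = array();
foreach ($pq->find('div#walter-container li img') as $img) {
    // Each item in the loop is a plain DOMElement.
    $pics[] = $img->getAttribute('src');
}
if (!empty($pics)) { var_dump($pics); }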
?>
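For comparison, the same kind of lookup can be sketched with PHP’s built-in DOMDocument and DOMXPath. The HTML below is a made-up fixture, not the real page, and the XPath expression is only my rough translation of the CSS selector, so treat it as an illustration rather than a drop-in replacement.

```php
<?php
// Hypothetical markup standing in for a fetched page; not the real site's HTML.
$html = '<html><head><title>Demo</title></head><body>'
      . '<div id="walter-container"><ul>'
      . '<li><img src="/img/one.jpg"></li>'
      . '<li><img src="/img/two.jpg"></li>'
      . '</ul></div>'
      . '<img src="/img/outside.jpg">'
      . '</body></html>';

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Approximate XPath counterpart of the CSS selector 'div#walter-container li img'.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="walter-container"]//li//img');

// Collect the src attribute of every matched image.
$srcs = array();
foreach ($nodes as $img) {
    $srcs[] = $img->getAttribute('src');
}

var_dump($srcs);
```

It works, but the XPath is noticeably harder to read than the one-line CSS selector, which is the main reason phpQuery felt like the better fit here.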
You can also build the selector inside a loop, for example to pick one list item at a time:
<?php
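// $pq is the phpQuery document loaded earlier; $limit must be set to the highest list position to check.
$images = array();
for ($i = 1; $i <= $limit; $i++)
{
    // Select the link inside the i-th list item and read its zoom image path.
    $pics = $pq->find('div#product-detail ul li:nth-child('.$i.') a')->attr('data-image-zoom');
    if (!empty($pics))
    {
        $images[] = $pics;
    }
}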
?>
Check out the phpQuery manual for further information. This library is handy and saved me a lot of time.