
Simple PHP Crawler

June 3rd, 2008 by patlatyj

I recently made a WordPress plugin for generating really big sites, something like 30k pages in a few minutes. It first adds all the keywords to the database as post titles with an empty post_content, and then generates the content when a page is displayed. The content actually comes from Google, so I was worried about getting banned if Googlebot requested a lot of pages at once and caused excessive scraping of the search engine. So here is a little PHP crawler that requests every page within a domain, pausing DELAY seconds between requests, just for lulz.

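set_time_limit(60*60*24); // allow the script to run for up to 24 hours
define('DELAY', 3); // minimum number of seconds between two requests
$tocrawl = array('http://DOMAIN.COM'); // seed URL
$crawled = array(); // URLs that have already been requested
$curlopts = array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    // CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14',
    CURLOPT_USERAGENT => 'Roddik Crawler',
    CURLOPT_MAXREDIRS => 3,
    CURLOPT_TIMEOUT => 30,
    CURLOPT_COOKIEJAR => 'cookies.txt',
    CURLOPT_COOKIEFILE => 'cookies.txt',
    CURLOPT_SSL_VERIFYPEER => false, // note: this turns off TLS certificate checks
);
$dom = new DOMDocument();
$ch = curl_init();
curl_setopt_array($ch, $curlopts);
$host = parse_url($tocrawl[0], PHP_URL_HOST);
$lu = 0; // time at which the previous request was started
while (count($tocrawl)) {
    // throttle: wait until at least DELAY seconds have passed since the last request
    $elapsed = microtime(true) - $lu;
    if ($elapsed < DELAY) usleep((int) (1000000 * (DELAY - $elapsed)));
    $lu = microtime(true);

    $link = array_shift($tocrawl);
    $crawled[] = $link;
    curl_setopt($ch, CURLOPT_URL, $link);
    $page = curl_exec($ch);
    if ($page === false || $page === '') continue; // request failed or returned nothing
    @$dom->loadHTML($page); // @ hides the warnings caused by sloppy markup
    $xpath = new DOMXPath($dom);
    // only absolute http:// links to the seed host are followed; relative links are ignored
    $list = $xpath->query("//a[starts-with(@href, 'http://$host')]");
    foreach ($list as $a) {
        $href = $a->getAttribute('href');
        // drop a trailing slash so the same page is not queued twice
        if (substr($href, -1) == '/') $href = substr($href, 0, -1);
        if (!in_array($href, $crawled) && !in_array($href, $tocrawl))
            $tocrawl[] = $href;
    }
}
curl_close($ch);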
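
The crawler above checks every new link with in_array() against both lists, and those scans get slower as the lists grow toward the 30k pages mentioned earlier. Here is a minimal sketch of one way around that: keep the already-seen URLs as array keys and look them up with isset(). It assumes PHP 7 or later for the type declarations. The helper names normalize_url, enqueue and throttle_wait are made up for this illustration and are not part of the script above.

```php
<?php
// Hypothetical helpers; the names are illustrative and not part of the original script.

// Strip a single trailing slash so "http://a/b/" and "http://a/b" count as one page.
function normalize_url(string $url): string
{
    return substr($url, -1) === '/' ? substr($url, 0, -1) : $url;
}

// Queue a URL unless it was seen before. $seen is keyed by URL, so the
// duplicate check is an isset() hash lookup rather than an in_array() scan.
function enqueue(string $url, array &$queue, array &$seen): bool
{
    $url = normalize_url($url);
    if (isset($seen[$url])) {
        return false;
    }
    $seen[$url] = true;
    $queue[] = $url;
    return true;
}

// Seconds still to wait so that consecutive requests start at least $delay apart.
function throttle_wait(float $lastStart, float $now, float $delay): float
{
    return max(0.0, $delay - ($now - $lastStart));
}

$queue = [];
$seen = [];
enqueue('http://DOMAIN.COM', $queue, $seen);
enqueue('http://DOMAIN.COM/', $queue, $seen);      // duplicate once the slash is stripped
enqueue('http://DOMAIN.COM/page/', $queue, $seen);
```

Inside the crawl loop the sleep would then become something like usleep((int) (1000000 * throttle_wait($lu, microtime(true), DELAY))), and each href found on a page would go through enqueue() instead of the two in_array() checks.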
