Simple PHP Crawler
I recently wrote a WordPress plugin for generating really big sites, around 30k pages in a few minutes. It first inserts all the keyword post titles into the database with an empty post_content, and then generates the content when a page is displayed for the first time. The generation actually relies on Google, so I was worried about getting banned if Googlebot requested a lot of pages at once and triggered excessive scraping of the search engine. So here is a little PHP crawler that requests all pages within a domain at a throttled pace, just for lulz:
set_time_limit(60*60*24);
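define('DELAY', 3); // minimum number of seconds between two requests
$tocrawl = array('http://DOMAIN.COM'); // seed URL, without a trailing slash
$crawled = array(); // URLs that have already been requested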
$curlopts = array(CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
// CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14',
CURLOPT_USERAGENT => 'Roddik Crawler',
CURLOPT_MAXREDIRS => 3,
CURLOPT_TIMEOUT => 30,
CURLOPT_COOKIEJAR => 'cookies.txt',
CURLOPT_COOKIEFILE => 'cookies.txt',
CURLOPT_SSL_VERIFYPEER => false);
$dom = new DOMDocument();
$ch = curl_init();
curl_setopt_array($ch, $curlopts);
$host = parse_url($tocrawl[0], PHP_URL_HOST);
$lu = 0;
while (count($tocrawl)) {
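// throttle: sleep until at least DELAY seconds have passed since the last request
$lu = microtime(true)-$lu;
if ($lu<DELAY) usleep((int)(1000000*(DELAY-$lu)));
$lu = microtime(true);
// take the next URL from the queue and fetch it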
$link = array_shift($tocrawl);
$crawled[] = $link;
curl_setopt($ch, CURLOPT_URL, $link);
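$page = curl_exec($ch);
if ($page === false || $page === '') continue; // skip failed or empty responses
@$dom->loadHTML($page); // @ silences warnings about malformed HTML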
$xpath = new DOMXPath($dom);
$list = $xpath->query("//a[starts-with(@href, 'http://$host')]");
foreach ($list as $a) {
$href = $a->getAttribute('href');
if (substr($href, -1) == '/') $href = substr($href, 0, -1);
if (!in_array($href, $crawled) && !in_array($href, $tocrawl))
$tocrawl[] = $href;
}
}
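The XPath filter above only picks up absolute links that start with http://$host, so relative links such as /about/ or page2.html are never queued. Here is a minimal sketch of a helper that could fill that gap. The name normalize_link and its behaviour are my own assumptions, not part of the script above, and it does not try to resolve ../ segments or handle every URL edge case.

```php
<?php
// Hypothetical helper (not in the original script): turn an href found on the
// page $base into an absolute same-host URL, or return null to skip the link.
function normalize_link(string $href, string $base): ?string
{
    $href = trim($href);
    // Skip empty links, in-page anchors and non-HTTP schemes.
    if ($href === '' || $href[0] === '#' || preg_match('/^(mailto|javascript|tel):/i', $href)) {
        return null;
    }
    $b = parse_url($base);
    if (empty($b['scheme']) || empty($b['host'])) {
        return null; // the base must itself be an absolute URL
    }
    $root = $b['scheme'] . '://' . $b['host'];
    if (isset($b['port'])) {
        $root .= ':' . $b['port'];
    }
    if (preg_match('#^https?://#i', $href)) {
        $url = $href;                        // already absolute
    } elseif (substr($href, 0, 2) === '//') {
        $url = $b['scheme'] . ':' . $href;   // protocol-relative
    } elseif ($href[0] === '/') {
        $url = $root . $href;                // root-relative
    } else {
        // Relative to the directory part of the base path.
        $path = isset($b['path']) ? $b['path'] : '/';
        $dir = substr($path, 0, strrpos($path, '/') + 1);
        $url = $root . $dir . $href;
    }
    $url = preg_replace('/#.*$/', '', $url); // drop the fragment
    // Stay on the same host as the base URL.
    if (strcasecmp((string) parse_url($url, PHP_URL_HOST), $b['host']) !== 0) {
        return null;
    }
    return rtrim($url, '/'); // same trailing-slash rule as the crawler
}
```

To use it, the loop would query every //a[@href] instead of filtering in XPath, pass each href through normalize_link($href, $link), and queue the result only when it is not null and not already in $crawled or $tocrawl.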