I've moved! You can find my new blog at EricMartindale.com.


Monday, June 04, 2007

PHP5, Scraping, and XPath

I've been building a scraper using PHP5 and the newly added XPath functionality. The idea here, as an exercise in programming, is to scrape complete records from Google Maps, including name, address, and phone number.

Here's a snippet of what I've been trying to do. This probably isn't the best approach, but I can't quite figure out how to pull a child of a resulting element; PHP keeps returning an error whenever I try to use firstchild.

//start our result counter and our output buffer
$i = 0;
$output = '';

//try setting higher than 1000
while ($i < 1000)
{
    //show status so we don't get lost
    echo "Currently extracting data from records ".$i." through ".($i + 10)."...";

    $raw = new DOMDocument;
    $clean = new DOMDocument;

    //special to Google
    //urlencode() keeps spaces and ampersands in the search terms from breaking the query string
    $url = 'http://maps.google.com/maps?f=l&hl=en&q='.urlencode($what).'&near='.urlencode($where).'&view=text&start='.$i."&radius=".$radius;

    //@ hides the warnings the parser raises on malformed HTML
    @$raw->loadHTMLFile($url);

    $HTML = $raw->saveHTML();
    @$clean->loadHTML($HTML);

    //created for XPath queries, but not used yet; the loop below filters by hand
    $xpath = new DOMXPath($clean);
    $xNodes = $clean->getElementsByTagName('td');

    foreach ($xNodes as $xNode)
    {
        if ($xNode->getAttribute('valign') == "top")
        {
            //echo $xNode->nodeValue."\n";
            $output .= $xNode->nodeValue."";
        }
    }

    echo "...done\n";

    //add to our counter
    //10 results per page, so we add 10
    $i = $i + 10;

}

//fix bugged double comma, can't figure out where this is happening
$output = preg_replace("/,,/",",",$output);

$somecontent = make_csv(strip_non_ascii($output));
echo $somecontent;


There's a bit of extra and unrelated code here, but that's the basic process I'm using.
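
For the child-element problem, here is a minimal sketch of how the lookup might work, not a confirmed fix. It assumes the error comes from the spelling or from calling it like a method: in PHP, firstChild is a case-sensitive property of a DOM node, so `$node->firstchild` and `$node->firstChild()` both fail. The sample markup and the `$names` array are made up for illustration; the real Google Maps HTML will look different.

```php
<?php
// Hypothetical sample markup standing in for one results page.
// The real page structure is an assumption and will differ.
$html = '<table><tr>'
      . '<td valign="top"><b>Example Cafe</b> 123 Main St</td>'
      . '<td valign="middle">ignored</td>'
      . '<td valign="top"><b>Sample Diner</b> 456 Oak Ave</td>'
      . '</tr></table>';

$doc = new DOMDocument();
// Suppress parser warnings, as the snippet above does.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

$names = array();
// Let XPath pick out the valign="top" cells instead of testing each td by hand.
foreach ($xpath->query('//td[@valign="top"]') as $cell) {
    // firstChild is a property: exact capitalization, no parentheses.
    $first = $cell->firstChild;
    if ($first !== null) {
        $names[] = trim($first->nodeValue);
    }
}

print_r($names);
```

With this sample markup, `$names` ends up holding "Example Cafe" and "Sample Diner". One thing to watch for on real pages: if there is whitespace between the opening td tag and its first element, firstChild will be a text node holding that whitespace rather than the element itself.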
