Scraping Yahoo Results

Thursday, 1 January 2009

Hoya!

Simple little script for ya here. This scraper basically pulls all the data it can for your given keyword.

It returns the number of results, the suggested keywords and the first 10 results (broken into title, blurb and URL).

I suggest you run this in a development-only environment (WampServer is a great option for Windows users who are new to PHP and want to play around with scripts!)

The Code:
<?php
# config
$yah_search = 'http://search.yahoo.com/search?ei=UTF-8&fr=yfp-t-802&rs=more&rs=all&p=';
$ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11';

$keyword = 'poker';

# Load the page
$yahoo_data = $yah_search.urlencode($keyword);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $yahoo_data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $ua);
$yahoo_data = curl_exec($ch);
curl_close($ch);

# Get the search results total
$total = ''; # initialise first so .= does not raise an undefined variable notice
preg_match_all("#1 - 10 of (.+?) for #", $yahoo_data, $yah_total);
foreach($yah_total[1] as $key=>$val):
$total.= $val;
endforeach;

# Get the suggested keywords
$keywords = ''; # initialise before appending
preg_match_all("#=rs-top\">(.+?)<\/a>,<\/li>#", $yahoo_data, $yah_keywords);
foreach($yah_keywords[1] as $key=>$val):
$keywords.= strip_tags($val).'<br />';
endforeach;

# Get the first 10 results
$res = ''; # initialise before appending
preg_match_all("#<h3><a class=\"yschttl\" (.+?) >(.+?)<\/h3>#", $yahoo_data, $title);
preg_match_all("#<div class=\"abstr\">(.+?)<\/div>#", $yahoo_data, $rs);
preg_match_all("#<span class=url>(.+?)<\/span>#", $yahoo_data, $urls);
# $title[2], $rs[1] and $urls[1] are matched up by position
foreach($title[2] as $key => $none):
$res.= '<b>'.strip_tags($title[2][$key]).'</b><br />'
.wordwrap(strip_tags($rs[1][$key])).'<br />'
.strip_tags($urls[1][$key])."<br /><br />\n";
endforeach;

# Output it all!
echo'<pre><scraped>';
echo $total.' Results<br /><br />';
echo'<b>Suggested Keywords</b>:<br />'. $keywords.'<br />';
echo $res;
echo'</scraped></pre>';

Really simple!
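If you want to poke at the parsing step without hitting Yahoo every time, here is a rough sketch that wraps the same three patterns in a function and runs them against a hard-coded string. The function name parse_results and the sample markup are made up for illustration (the markup is just shaped like what the patterns expect, it is not real Yahoo output). It also falls back to an empty string when a result has no blurb or URL, so the arrays cannot get out of step.

```php
<?php
// Hypothetical helper: run the post's three patterns over a page of HTML
// and return one array per result.
function parse_results($html)
{
    preg_match_all('#<h3><a class="yschttl" (.+?) >(.+?)</h3>#', $html, $title);
    preg_match_all('#<div class="abstr">(.+?)</div>#', $html, $rs);
    preg_match_all('#<span class=url>(.+?)</span>#', $html, $urls);

    $rows = array();
    foreach ($title[2] as $key => $raw_title) {
        $rows[] = array(
            'title' => strip_tags($raw_title),
            // A result may be missing its blurb or URL, so check before reading.
            'blurb' => isset($rs[1][$key]) ? strip_tags($rs[1][$key]) : '',
            'url'   => isset($urls[1][$key]) ? strip_tags($urls[1][$key]) : '',
        );
    }
    return $rows;
}

// Made-up sample markup (an assumption, not captured from Yahoo).
$sample = '<h3><a class="yschttl" href="http://www.example.com/" >Example <b>Poker</b></a></h3>'
    . '<div class="abstr">Play poker online.</div>'
    . '<span class=url>www.example.com</span>'
    . '<h3><a class="yschttl" href="http://www.example.org/" >Second hit</a></h3>';

$rows = parse_results($sample);
print_r($rows);
```

In the real script you would pass $yahoo_data in place of $sample.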

Maybe too simple? What if you want more than 10 results? To show 100 results, simply look for the first config line:

$yah_search = 'http://search.yahoo.com/search?ei=UTF-8&fr=yfp-t-802&rs=more&rs=all&p=';

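And change it to the line below. Bear in mind the results-total pattern is hard-coded to "1 - 10 of", so it will probably need adjusting to match the new range: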

$yah_search = 'http://search.yahoo.com/search?n=100&ei=UTF-8&va_vt=any&vo_vt=any&ve_vt=any&vp_vt=any&vd=all&vst=0&vf=all&vm=p&fl=0&fr=sfp&p=';
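If you would rather not hand-edit query strings, something along these lines might do. It is only a sketch: build_yahoo_url is a made-up name, and it sends just the n, ei, fr and p parameters seen in the URLs above. Whether Yahoo is happy without the rest of that long query string is an assumption, so keep the full version if it is not.

```php
<?php
// Hypothetical helper: build the search URL for a keyword and a result count.
// Parameter names (n, ei, fr, p) are copied from the URLs in this post.
function build_yahoo_url($keyword, $count = 10)
{
    $params = array(
        'n'  => (int) $count,
        'ei' => 'UTF-8',
        'fr' => 'sfp',
        'p'  => $keyword,
    );
    // http_build_query() URL-encodes each value, so no separate urlencode() call.
    return 'http://search.yahoo.com/search?' . http_build_query($params, '', '&');
}

$url = build_yahoo_url('texas holdem', 100);
echo $url, "\n";
```

The result can go straight into CURLOPT_URL in place of $yah_search.urlencode($keyword).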

Maybe you want the URLs to become hyperlinks?

Piece of piss! Just look for this:

foreach($title[2] as $key => $none):
$res.= '<b>'.strip_tags($title[2][$key]).'</b><br />'
.wordwrap(strip_tags($rs[1][$key])).'<br />'
.strip_tags($urls[1][$key])."<br /><br />\n";
endforeach;

And change it to:

foreach($title[2] as $key => $none):
$res.= '<b>'.strip_tags($title[2][$key]).'</b><br />'
    .wordwrap(strip_tags($rs[1][$key])).'<br />'
    .'<a href="http://'.strip_tags($urls[1][$key]).'">'.strip_tags($urls[1][$key])."</a><br /><br />\n";
endforeach;
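One thing to watch: scraped text can contain quotes or ampersands that break the href attribute. Here is a rough sketch of a safer version. The function name render_result is made up for illustration, and it assumes the URL text has no scheme (the same assumption the snippet above makes by adding http:// itself).

```php
<?php
// Hypothetical helper: format one result as HTML with a clickable URL.
function render_result($title, $blurb, $url)
{
    // The last argument (false) stops entities already in the scraped
    // text, such as &amp;, from being encoded a second time.
    $title = htmlspecialchars(strip_tags($title), ENT_QUOTES, 'UTF-8', false);
    $blurb = htmlspecialchars(wordwrap(strip_tags($blurb)), ENT_QUOTES, 'UTF-8', false);
    $url   = htmlspecialchars(strip_tags($url), ENT_QUOTES, 'UTF-8', false);

    return '<b>' . $title . '</b><br />'
        . $blurb . '<br />'
        . '<a href="http://' . $url . '">' . $url . "</a><br /><br />\n";
}

$html = render_result('<b>Poker</b> Rooms', 'Play "poker" online', 'www.example.com/a?b=1&c=2');
echo $html;
```

Inside the loop you would then write $res .= render_result($title[2][$key], $rs[1][$key], $urls[1][$key]);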

There are loads of things you can do!

How about saving the results in a database and randomly choosing 10 blurbs for filling a space? Might add a solution if anyone asks!

For now, enjoy, and pop back to this page for further ideas and implementations for this scraper!
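To give a flavour of the database idea, here is a minimal sketch, not a finished solution. It assumes the pdo_sqlite extension is enabled. The table name blurbs and the functions save_blurbs and random_blurbs are made up for illustration, and the blurbs are dummy data standing in for the $rs[1] array from the scraper.

```php
<?php
// Assumption: pdo_sqlite is available. An in-memory database keeps the
// sketch self-contained; use a file path such as 'sqlite:blurbs.db' to
// keep the data between runs.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE blurbs (id INTEGER PRIMARY KEY, keyword TEXT, blurb TEXT)');

// Hypothetical helper: store every blurb scraped for a keyword.
function save_blurbs(PDO $db, $keyword, array $blurbs)
{
    $stmt = $db->prepare('INSERT INTO blurbs (keyword, blurb) VALUES (?, ?)');
    foreach ($blurbs as $blurb) {
        $stmt->execute(array($keyword, strip_tags($blurb)));
    }
}

// Hypothetical helper: pick up to $limit stored blurbs at random.
function random_blurbs(PDO $db, $keyword, $limit = 10)
{
    $stmt = $db->prepare(
        'SELECT blurb FROM blurbs WHERE keyword = ? ORDER BY RANDOM() LIMIT ?'
    );
    $stmt->bindValue(1, $keyword, PDO::PARAM_STR);
    $stmt->bindValue(2, (int) $limit, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Dummy data standing in for $rs[1].
$sample_blurbs = array();
for ($i = 1; $i <= 25; $i++) {
    $sample_blurbs[] = '<b>Blurb</b> number ' . $i;
}

save_blurbs($db, 'poker', $sample_blurbs);
$picked = random_blurbs($db, 'poker', 10);
echo implode("<br />\n", $picked), "\n";
```

ORDER BY RANDOM() is fine for a small table like this. For a big one you would want a smarter way of picking rows.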