What To Do With All Them Scrapings

Wednesday 21 January 2009

There comes a point where you must stop scraping for the sake of scraping and actually do something creative/profitable/useful!

Some ideas for your scrapings, both good and bad, and in no particular order:

  • Create a search box and backend for one of your sites, scraping as you go if it has not yet been scraped. Depending on your sources and how you process the scrapings, you might just be able to answer everyone's searches and stay (kinda) under the radar!
  • Create an online database of resources and stuff to do with a certain keyword. For example, scrape videos to build a niche library and encourage rating and commenting to give it some value. Maybe even have users suggest sources!
  • Create an ebook and become a hope-peddling guru with the latest known secret to making millions of dollars per second without having to do anything. They just buy the ebook and instantly become rich, famous and successful. (Then after 12 months, sell it again as version 2.0!!!)
  • Create tools to monitor shit like your current position in the serps, analyze certain keywords in the serps and whatever else your fancy coding arse can be bothered churning out. Sell access to them as SEO tools. Charge for access to more tools. Have customers send in their hottest (and old enough) daughter. (Has potential for a 2.0 version too!)
  • Churn out page after page of keyword-rich scrapings alongside plenty of link building and cloaking to cash in on affiliate promotions. It's easy to rank for long-tailed keywords, and search traffic usually converts better, so find some profitable keywords in any niche and own em with junk sites!
  • Wrap it up in any old template, register a 99c .info domain and sell it on ebay as a live and established website, with 3 months' free hosting, then only $50 per month after. They can move the site any time, but they must pay $95 to cover our goodbye email.
  • Wrap them up in brown paper, give them to your friends and family for Christmas, birthdays, weddings etc, then wonder why they think you are stoned or something!

Dah, I'm just weird today!

XML To Article Generator

Friday 9 January 2009

Still toying around with ideas for a content generator template generator command center, here are a few ideas for an article generator.

Idea: RSS feeds! Not the fastest if pulled on the fly, but providing the XML sources are researched properly to ensure that relevant articles appear, and that they are actually updated often, with just a few lines of code you are able to manipulate the data as easily as a drunk college chick!

Save this as xml.php

<?php
/************************************************************************
* Copyright 2005 Niklas Angebrand <niklas.a@gamershell.com>            *
************************************************************************
* This program is free software; you can redistribute it and/or modify *
* it under the terms of the GNU General Public License as published by *
* the Free Software Foundation; either version 2 of the License, or    *
* (at your option) any later version.                                  *
*                                                                      *
* This program is distributed in the hope that it will be useful,      *
* but WITHOUT ANY WARRANTY; without even the implied warranty of       *
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the        *
* GNU General Public License for more details.                         *
************************************************************************/
class RSSParser {
var $in_item, $tag, $item, $items, $parser, $rss_data, $error;
function RSSParser($rss_file = "http://hwhell.com/xml/feature_feed.rss") {
  $this->rss_data = file_get_contents($rss_file);
  $this->parser = xml_parser_create();
  xml_set_object($this->parser, $this);
  xml_set_element_handler($this->parser, 'startElement', 'endElement');
  xml_set_character_data_handler($this->parser, 'characterData');
}

function parse() {
  if (!xml_parse($this->parser, $this->rss_data, true)) {
    $this->error = xml_error_string(xml_get_error_code($this->parser)).": ".xml_get_current_line_number($this->parser);
    return false;
  }
  return $this->items;
}

function startElement($parser, $tagname, $attrs) {
  if ($this->in_item) {
    $this->tag = $tagname;
  } else if ($tagname == 'ITEM') {
    $this->in_item = true;
  }
}

function endElement($parser, $tagname) {
  $this->tag = '';
  if ($this->in_item && $tagname == 'ITEM') {
    unset($this->item['']);
    $this->items[] = $this->item;
    $this->in_item = false;
  }
}

function characterData($parser, $data) {
  $this->item[$this->tag] = $data;
}
}

?>

This handy little class I found while looking to scrape gamershell.

There are loads of ways of parsing XML, but this little puppy has served me well for over a year, my only mods being the removal of some whitespace. I just use it for the simple fact that this class makes an appearance a few times in my development and construction needs!

To output the feed as some kind of article:

<?php
include'xml.php';// the xml class
$rss = new RSSParser('http://www.yourfeedsource.com/xml.xml');// rss feed
$items = $rss->parse();
$i = 0;// counter
echo'<p>';
foreach ($items as $item):
 echo $item['DESCRIPTION'];
 $i++;
 if ($i % 3 == 0) echo"</p>\n<p>";// start a new paragraph every 3 descriptions
endforeach;
echo"</p>\n";
?>

All we are doing here is stripping the feed items down to nothing but the descriptions, then making paragraphs out of them. There you go, one fresh article! Depending on your sources, you could have some pretty decent, legible results that may even fly under the radar!
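One thing worth sorting from the idea at the top: pulling the feed on the fly for every page load is slow. A dead simple file cache fixes that. This is just a rough sketch of how I would do it, still assuming the xml.php class above; the cache filename and the 1 hour lifetime are my own picks, tweak to taste:

<?php
include 'xml.php';// the xml class from above

$feed  = 'http://www.yourfeedsource.com/xml.xml';// rss feed
$cache = './cache_'.md5($feed).'.xml';// our local copy of the feed

// Only bother the feed source if our copy is missing or over an hour old
if (!file_exists($cache) || (time() - filemtime($cache)) > 3600):
 file_put_contents($cache, file_get_contents($feed));
endif;

// The class reads the feed with file_get_contents, so a local path works just fine
$rss = new RSSParser($cache);
$items = $rss->parse();
?>

From there the paragraph building loop above carries on exactly as before.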

Oh Lord, Not Another Template Generator

Wednesday 7 January 2009

As I mentioned, I am already working on a template generator to work alongside my content generator, so I thought I would put my messy thoughts here!

This time it's not so much a list, more just recent thoughts!

The idea is to be able to generate a random template at the click of a button. These templates will have placeholders ready for our saved data.

My initial plans were rather large, and then as I started to think more into it, I realized that a much simpler, but scalable, solution would be required.

This generator (as well as the content generator) will not be run on the fly from a live box. To free up resources on the server, maintain some sort of standard (by being able to quickly review the final item), and be able to decide how far I can take things (how fast the content drips, link building, yahdeeyah...), they will be used to simply create everything needed to just upload to a server and do its thing, all the content and templating ready to roll.

Kinda like an outta the box, set and forget solution, minus all the marketing bullshit!

Idea 1: Using just a standard DW-generated 2 col, header and footer fixed template, randomize everything about it! I mean everything! Different style sheet name, use of attributes, random class and id names and numbers, spacing and indenting, and whatever else will not invalidate the markup. (Yes, a clean and valid document helps! Just do not waste time linking to prove it!)

Start the script by deciding which style to use from a list of pre-defined styles. Some coders will use tabs at the start of new lines, or maybe just a space or 2, while others will have many lines between elements and sections. By thinking of as many ways to indent and format our HTML as we can, we are removing more footprints. This would maybe include an array such as:
$attribute_style[]='"'; # double quotes
$attribute_style[]="'"; # single quotes
$attribute_style[]="`"; # them things!
$attribute_style[]='';  # nothing!
$att_style = $attribute_style[array_rand($attribute_style)];
$rand_class = 'box'.mt_rand(1, 999); # whatever random class name the generator picked
echo'<div class='.$att_style.$rand_class.$att_style.'></div>';

Note: Do note that changing the quotes in the DOCTYPE appears to invalidate the markup! That's shit, innit!
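Same trick for the spacing and indenting footprints. A rough sketch of the kind of arrays I have in mind; nothing here is final, it is just more of the pick-a-style-at-random idea, and the {ARTICLE} placeholder is only there for show:

<?php
$indent_style[] = "\t";   # tabs
$indent_style[] = ' ';    # 1 space
$indent_style[] = '  ';   # 2 spaces
$indent_style[] = '';     # sod all

$line_gap[] = "\n";       # next element straight on the next line
$line_gap[] = "\n\n";     # a blank line between elements
$line_gap[] = "\n\n\n";   # some coders really spread things out

// Pick a style once per generated template and stick to it
$indent = $indent_style[array_rand($indent_style)];
$gap    = $line_gap[array_rand($line_gap)];

echo '<div>'.$gap.$indent.'<p>{ARTICLE}</p>'.$gap.'</div>';
?>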

Using scraped images and GD we could try and build some kind of graphical header too, but the amount of editing and filtering would be too much effort, and ongoing too I feel. I generally only fuck around with a few niches, so an evening with a big bong and Photoshop can produce plenty of unique (my arse) graphics for mashing up into our generator.

This idea is pretty exhaustive and boring to code (big array, randomize it, big array, randomize it...), and over time will only become too prone to leaving footprints (fingerprints, arseprints, whatever you wanna call em!). Not to be written off though, as the concept of having random styles and formatting is the very essence of this generator!

Idea 2: Applying the big bong approach again, this time we are getting into the template-searching zone and downloading as many fresh templates as our lungs can handle! *Maybe write a script to notify us of new templates!

Using a script to grab a random template, we preg all the classes and ids from the .css and .html files and change them! Apply a bit of the randomization techniques mentioned in idea 1 to restyle it, then ...

... Shit! We need to add place holders in the right places for our data (articles, images, videos and other crap). If we are getting our templates from many sources then that just means that every one of them is going to be different!

We know for a fact that each template will have head tags, so we can add relevant meta tags if needed as well as add / remove anything else we see fit. We can assume that h1 will be used and hopefully h2; there is a chance they may always use .headerdiv and .footerdiv too, but a lot more common and relevant ground is needed. So, with a bit of digging round we may be able to find some decent sources that use a standard way of creating their stuff. I would rather target specific sites than waste resources and time going in blind for anything and ending up with a fucked up end product most of the time!
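A quick sketch of that common ground idea, leaning on the head tags and the h1 we can (mostly) count on. The file path and the {PLACEHOLDER} names are just made up for the example:

<?php
// Load a downloaded template (path is only an example)
$html = file_get_contents('./templates/some_template/index.html');

// Every template has a head, so slip our own meta placeholders in just before it closes
$meta = '<meta name="description" content="{DESCRIPTION}" />'."\n"
      . '<meta name="keywords" content="{KEYWORDS}" />'."\n";
$html = str_ireplace('</head>', $meta.'</head>', $html);

// Assume an h1 is knocking about and drop the article placeholder straight after the first one
$html = preg_replace('#</h1>#i', "</h1>\n{ARTICLE}", $html, 1);

file_put_contents('./templates/some_template/index.html', $html);
?>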

Once template sources have been found and a few downloaded, a bit of randomizing will decide which one to use, then preg'ing and randomizing again will take care of the placeholders and any crap we can add, edit or remove.

This idea requires lots of time sourcing the templates. Then we need to write a shit load of regexes for each source in order to clean up any prints (maybe even remove each used template from the folder we randomly choose from too!).
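For the preg'ing and renaming bit, something along these lines is what I am picturing. Very rough sketch: the folder layout is made up, and it only bothers with classes (ids would be handled the same way):

<?php
// Grab a downloaded template's stylesheet and markup (paths are just examples)
$dir  = './templates/some_template/';
$css  = file_get_contents($dir.'style.css');
$html = file_get_contents($dir.'index.html');

// Pull every class name out of the stylesheet
preg_match_all('#\.([a-zA-Z][a-zA-Z0-9_-]*)#', $css, $m);
$classes = array_unique($m[1]);

// Give each one a random junk name
$map = array();
foreach ($classes as $class):
 $map[$class] = 'c'.substr(md5($class.mt_rand()), 0, mt_rand(4, 8));
endforeach;

// Swap them in both files. strtr() tries the longest names first, so .header
// will not mangle .headerdiv. It is still a blunt tool though -- a class name
// that also shows up as plain text in the page will get swapped too.
$css  = strtr($css, $map);
$html = strtr($html, $map);

file_put_contents($dir.'style.css', $css);
file_put_contents($dir.'index.html', $html);
?>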

Idea 3: Make each template by hand each time I need one, remembering at least once a day to wash, eat, watch porn, feed the pet girlfriend and generally live on the human side!

<robotvoice style="dalek">I AM NOT A ROBOT</robotvoice>

I have got a whole day to myself this week, so I will put some code together, all being well.

Multi SERP, Suggested Keywords Scraper

Tuesday 6 January 2009

This script uses curl_multi_exec to scrape Live, Ask, Yahoo and Google for suggested keywords based on a keyword you provide.

At the moment it just outputs the keywords as hyperlinks that fetch longer-tailed keywords based on the actual suggestion. But with just a minor tweak it can output plain text, and with a couple of lines of extra code it could even save em to a file/db/milky way.

It's really simple stuff, but pretty slick thanks to curl_multi_exec.

<?php
// Form to fetch keywords
echo'<form method="get" action="">';
echo'<input type="text" name="keyword" />';
echo'<input type="submit" value="Query!" />';
echo'</form>';

// Prepare our keyword
$keyword = isset($_GET['keyword']) ? trim(str_replace(array('_','+'),' ',strip_tags($_GET['keyword']))) : '';

// Build http header
$header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"; 
$header[] = "Cache-Control: max-age=0"; 
$header[] = "Connection: keep-alive"; 
$header[] = "Keep-Alive: 300"; 
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
$header[] = "Accept-Language: en-us,en;q=0.5"; 
$header[] = "Pragma: ";

// User agent
$ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/0.4.154.29 Safari/525.19';

if ($keyword):
  // A keyword has been submitted 
 
 // Our search urls
 $nodes[] = 'http://search.yahoo.com/search?ei=UTF-8&fr=yfp-t-802&rs=more&rs=all&p='.urlencode($keyword);
 $nodes[] = 'http://www.google.com/search?hl=en&btnG=Google+Search&q='.urlencode($keyword);
 $nodes[] = 'http://uk.ask.com/web?search=search&dm=all&qsrc=0&o=312&l=dir&siteid=&q='.urlencode($keyword);
 $nodes[] = 'http://search.live.com/results.aspx?form=QBRE&q='.urlencode($keyword);
 $node_count = count($nodes);
 
 $curl_arr = array();
 $master = curl_multi_init();

 // Loop through urls
 for($i = 0; $i < $node_count; $i++):
  $url = $nodes[$i];
  $curl_arr[$i] = curl_init($url);
  curl_setopt($curl_arr[$i], CURLOPT_URL, $url); 
  curl_setopt($curl_arr[$i], CURLOPT_USERAGENT, $ua); 
  curl_setopt($curl_arr[$i], CURLOPT_HTTPHEADER, $header); 
  curl_setopt($curl_arr[$i], CURLOPT_ENCODING, 'gzip,deflate'); 
  curl_setopt($curl_arr[$i], CURLOPT_TIMEOUT, 10); 
  curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
  curl_setopt($curl_arr[$i], CURLOPT_FOLLOWLOCATION, true);
  curl_multi_add_handle($master, $curl_arr[$i]);
 endfor;

 do{curl_multi_exec($master, $running);}
 while($running > 0);

 // Get our keywords
 for($i = 0; $i < $node_count; $i++):
  $results = curl_multi_getcontent($curl_arr[$i]);
  switch($i):
   case 0: // yahoo.com
    preg_match_all("//=rs-top\">(.+?)<\/a>,<\/li>//", $results, $yah_keywords);
    break;
   case 1: // google.com
    preg_match_all("//<td style=\"padding:0 0 7px;padding-right:34px;vertical-align:top\"><a (.+?)&(.+?)\">(.+?)<\/a>//", $results, $goo_keywords);
    break;
   case 2: // ask.com
    preg_match_all("//<div class=\"zm\" ><a href=(.+?)ec:\'19\'\}\)\" >(.+?)<\/a><\/div>//", $results, $ask_keywords);
    break;
   case 3: // live.com
    preg_match_all("//<li><a href=(.+?);FORM=QSRE(.+?),this\)\">(.+?)<\/a> <\/li>//", $results, $msn_keywords);
    break;
  endswitch;
 endfor;

 // Join keywords and make sure the list is unique
 $wordz = array_unique(array_merge($yah_keywords[1], $goo_keywords[3], $ask_keywords[2], $msn_keywords[3]));

 $words = genKeywords($wordz); // HTML
 //$words = genKeywords($wordz,1); // TEXT

 // Good Place To Maybe Save Em?!
 // Make a folder in the same folder as this script, name it keywords, then uncomment below the save the keyword lists
// $file = str_replace(" ", "_", $keyword);
// if ($fp = fopen('./keywords/'.$keyword.'.txt', "w+")):
//  fwrite($fp, genKeywords($wordz,1));
//  fclose($fp);
//  echo'<h3>Saved!</h3>';
// endif;
 
 // output!
 echo $words;

endif;

function genKeywords($kw, $linked=0)
{
 // Helper function for outputting the data
 // The second parameter, if set to 1, will output the data as plain text
 $res = '';
 $rem = array(' ','+','-',);
 foreach ($kw as $keyword):
  $keyword = str_replace($rem, ' ', strip_tags($keyword));
  if (0 == $linked):
   $res.= '<a href="?keyword='.str_replace($rem, '_', $keyword).'">'.$keyword."</a><br />\n";
  else: $res.= trim(str_replace('_', ' ', $keyword))."\r\n";
  endif;
 endforeach;
 return $res;
}

?>

Keyword List Building Made Easy!

Please, Not Another Content Generator

Sunday 4 January 2009

Every now and then, I get the urge to start programming a new tool to gather BH content for BH sites.

The idea is that I wake up in the morning, grab a coffee and then feed the generator a keyword; it then scrapes articles from numerous sites. These articles will then be loaded into a fresh website ready to be drip fed over time.

I am also working on a template generator too, so combining the two means I can have a fresh niche website created within minutes. My coffee will still be hot and the bifter barely chuffed!

Here are some things I need to remember and do. It's not a complete list, and until I have detailed everything I will not start coding.
Requirements:
  • Search a couple of decent article directories for given keywords
  • Create a list of urls to the actual content
  • Open up all the urls and scrape 'n' save the data
  • +as they come!
Notes:
  • Depending on where we are searching, filters may have to be written to avoid any junk such as ads
  • If available, use advanced settings for searching (display as many as poss' and any other useful shit!)
  • Check for duplicate articles while grabbing / saving (of course!)
  • Maybe save the directory serps as we go!
  • Maybe spin / markov the scrapings!
  • +as they come!
The Directories:
  • * RESEARCHING THE LIST *
Poplan:
  • Display a form to send keywords to the script. Use GET data.
  • Script curls our directories. Will look into the best / fastest / most appropriate way. Using something along the lines of curl_multi_exec
  • Preg our data. (titles, urls, blurbs and other crap!) **Hhmm
  • Save data. **Hhmm
  • Curl our urls, preg'ing the content and saving as we go. Using curl_multi_exec again.
  • With data saved, do as you wish! (a rough sketch of this flow is just below)
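And because my brain works better with code than with lists, here is a very rough sketch of that plan. The directory url and both preg patterns are pure placeholders until the sources have been researched, so treat this as the shape of the thing rather than a working scraper:

<?php
// Keyword comes in from the form via GET
$keyword  = isset($_GET['keyword']) ? trim(strip_tags($_GET['keyword'])) : 'poker';
$dir_serp = 'http://www.example-article-directory.com/search?q=';// placeholder source

// 1. curl the directory serps for our keyword
// (plain curl for now; curl_multi_exec once there is a proper list of directories)
$ch = curl_init($dir_serp.urlencode($keyword));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
$serp = curl_exec($ch);
curl_close($ch);

// 2. preg out the article urls (the pattern will be different per source)
preg_match_all('#<a class="article-title" href="(.+?)">#', $serp, $m);
$urls = array_unique($m[1]);// crude duplicate check

// 3. open each url, preg the content and save it, ready for drip feeding later
foreach ($urls as $url):
 $page = file_get_contents($url);
 if (preg_match('#<div class="article-body">(.+?)</div>#s', $page, $body)):
  file_put_contents('./articles/'.md5($url).'.txt', strip_tags($body[1]));
 endif;
endforeach;

// 4. maybe spin / markov the lot before it goes anywhere near a site
?>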
This does not need to be BH at all. You can use the articles as they are meant to be used and leave a reference, just building a niche article site. Some directories may only allow a certain number of articles to be used on 1 site, but that's not really an issue as we are gonna scrape as many sites as our lazy arses can!
And anyway, if you want whitehat stuff then this really ain't the place, mate.
Piss off!

Scraping Yahoo Results

Thursday 1 January 2009

Hoya!

Simple little script for ya here. This scraper basically pulls all the data it can for your given keyword.

Returns the number of results, suggested keywords and the 1st 10 results (broken into title, blurb and url).

I suggest you run this in a development-only environment (WampServer is a great suggestion for Windows users new to PHP and wanting to play around with scripts!)

The Code:
<?php
# config
$yah_search = 'http://search.yahoo.com/search?ei=UTF-8&fr=yfp-t-802&rs=more&rs=all&p=';
$ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11';

$keyword = 'poker';
$total = $keywords = $res = '';# start these off empty so the .= below don't throw notices

# Load the page
$yahoo_data = $yah_search.urlencode($keyword);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $yahoo_data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $ua);
$yahoo_data = curl_exec($ch);
curl_close($ch);

# Get the search results total
preg_match_all("#1 - 10 of (.+?) for #", $yahoo_data, $yah_total);
foreach($yah_total[1] as $key=>$val):
$total.= $val;
endforeach;

# Get the suggested keywords
preg_match_all("#=rs-top\">(.+?)<\/a>,<\/li>#", $yahoo_data, $yah_keywords);
foreach($yah_keywords[1] as $key=>$val):
$keywords.= strip_tags($val).'<br />';
endforeach;

# Get the 1st 10 results
preg_match_all("#<h3><a class=\"yschttl\" (.+?) >(.+?)<\/h3>#", $yahoo_data, $title);
preg_match_all("#<div class=\"abstr\">(.+?)<\/div>#", $yahoo_data, $rs);
preg_match_all("#<span class=url>(.+?)<\/span>#", $yahoo_data, $urls);
foreach($title[2] as $key => $none):
$res.= '<b>'.strip_tags($title[2][$key]).'</b><br />'
.wordwrap(strip_tags($rs[1][$key])).'<br />'
.strip_tags($urls[1][$key])."<br /><br />\n";
endforeach;

# Output it all!
echo'<pre><scraped>';
echo $total.' Results<br /><br />';
echo'<b>Suggested Keywords</b>:<br />'. $keywords.'<br />';
echo $res;
echo'</scraped></pre>';

Really simple!

Maybe too simple? What if you want more than 10 results? To show 100 results, simply look for the 1st line:

$yah_search = 'http://search.yahoo.com/search?ei=UTF-8&fr=yfp-t-802&rs=more&rs=all&p=';

And change it to:

$yah_search = 'http://search.yahoo.com/search?n=100&ei=UTF-8&va_vt=any&vo_vt=any&ve_vt=any&vp_vt=any&vd=all&vst=0&vf=all&vm=p&fl=0&fr=sfp&p=';

Maybe you want the urls to become hyperlinks?

Piece of piss! Just look for this:

foreach($title[2] as $key => $none):
$res.= '<b>'.strip_tags($title[2][$key]).'</b><br />'
.wordwrap(strip_tags($rs[1][$key])).'<br />'
.strip_tags($urls[1][$key])."<br /><br />\n";
endforeach;

And change it to:

foreach($title[2] as $key => $none):
$res.= '<b>'.strip_tags($title[2][$key]).'</b><br />'
    .wordwrap(strip_tags($rs[1][$key])).'<br />
    <a href="http://'.strip_tags($urls[1][$key])."\">".strip_tags($urls[1][$key])."</a><br /><br />\n";
endforeach;

There is loads you can do!

How's about saving the results in a database and randomly choosing 10 blurbs for filling a space? Might add a solution if anyone asks!
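If anyone does ask, it would probably look something like this. A rough sketch using SQLite (via PDO) so there is nothing to set up; it assumes the $keyword and the $rs blurbs array from the scraper above, and the table and file names are just what I grabbed for the example:

<?php
// Open (or create) a little SQLite db next to the script
$db = new PDO('sqlite:./blurbs.db');
$db->exec('CREATE TABLE IF NOT EXISTS blurbs (id INTEGER PRIMARY KEY, keyword TEXT, blurb TEXT)');

// Save every scraped blurb against the keyword it came from
$ins = $db->prepare('INSERT INTO blurbs (keyword, blurb) VALUES (?, ?)');
foreach ($rs[1] as $blurb):
 $ins->execute(array($keyword, trim(strip_tags($blurb))));
endforeach;

// Later on: pull 10 random blurbs back out to fill a space somewhere
$sel = $db->prepare('SELECT blurb FROM blurbs WHERE keyword = ? ORDER BY RANDOM() LIMIT 10');
$sel->execute(array($keyword));
foreach ($sel->fetchAll(PDO::FETCH_COLUMN) as $blurb):
 echo '<p>'.$blurb."</p>\n";
endforeach;
?>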

For now, Enjoy, and pop back to this page for further ideas and implementations for this scraper!

Scraping Data Because It's There

Scraping data is not a set-and-forget process. Unfortunately, your sources change things about their sites every now and then, which means your scraping tools need a tweak from time to time.

Luckily, it's a piece of piss to do. With a few PHP basics under your belt, adjusting, editing and even writing scraper scripts is not too challenging.

Over time, this blog will be my notebook, code depot and place to let off some steam (Bennett!)

My name is Dork. Full name, Dork Hairyarse Mingebreath.

My parents have a sense of humour and very open minds.

I play around with websites, tinker with PHP and generally hover around the edges.

If any of that shit amuses you, then add this site to your reader and enjoy what's to come.

Be sure to have your say in the comments too.