Please, Not Another Content Generator

Sunday 4 January 2009

Every now and then, I get the urge to start programming a new tool to gather BH content for BH sites.
The idea is that I wake up in the morning, grab a coffee and feed the generator a keyword. It then scrapes articles from numerous sites, and those articles get loaded into a fresh website, ready to be drip fed over time.
I am also working on a template generator, so combining the two means I can have a fresh niche website created within minutes. My coffee will still be hot and bifter barely chuffed!
Here are some things I need to remember and do. It's not a complete list, and until I have detailed everything I will not start coding.
Requirements:
  • Search a few decent article directories for the given keywords
  • Create a list of urls to the actual content
  • Open up all the urls and scrape 'n' save the data
  • +as they come!
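A rough sketch of the "create a list of urls" step, in Python just to get the idea down. The regex, the function name and the sample HTML are all made up for illustration; every real directory will need its own pattern, and this one assumes each result is a plain double-quoted <a href="..."> link.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: assumes results are plain <a href="...">Title</a> links.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)

def extract_article_links(html, base_url):
    """Return an ordered list of unique (absolute_url, title) pairs."""
    seen = set()
    links = []
    for href, inner in LINK_RE.findall(html):
        url = urljoin(base_url, href)                    # make relative links absolute
        title = re.sub(r'<[^>]+>', '', inner).strip()    # drop tags inside the link text
        if url in seen or not title:
            continue                                     # skip repeats and empty links
        seen.add(url)
        links.append((url, title))
    return links

# Made-up search results page, standing in for a real directory.
sample = '''
<div class="result"><a href="/articles/dog-training-101">Dog Training <b>101</b></a></div>
<div class="result"><a href="http://example.com/articles/puppy-tips">Puppy Tips</a></div>
<div class="result"><a href="/articles/dog-training-101">Dog Training 101</a></div>
'''

for url, title in extract_article_links(sample, "http://example.com/search?q=dogs"):
    print(url, "|", title)
```

Regex on HTML is fragile, so a pattern like this will probably break the moment a directory changes its markup. Good enough for a first pass though.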
Notes:
  • Depending on where we are searching, filters may have to be written to strip out any junk such as ads
  • If available, use advanced settings for searching (display as many results per page as poss' and any other useful shit!)
  • Check for duplicate articles while grabbing / saving (of course!)
  • Maybe save the directory SERPs as we go!
  • Maybe spin / Markov the scrapings!
  • +as they come!
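The duplicate check from the notes above could be as simple as hashing each article body before saving. A minimal sketch in Python, assuming everything is held in memory; `fingerprint` and `ArticleStore` are names invented here, not part of any existing tool.

```python
import hashlib
import re

def fingerprint(text):
    """Hash the text after lowercasing and squashing punctuation/whitespace."""
    normalised = re.sub(r'[^a-z0-9]+', ' ', text.lower()).strip()
    return hashlib.sha1(normalised.encode('utf-8')).hexdigest()

class ArticleStore:
    """Hypothetical in-memory store that refuses articles it has already seen."""

    def __init__(self):
        self._seen = set()
        self.articles = []

    def add(self, title, body):
        """Save the article unless the same body is stored; return True if saved."""
        key = fingerprint(body)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.articles.append({"title": title, "body": body})
        return True

store = ArticleStore()
print(store.add("First copy", "Ten tips for   training your dog!"))
print(store.add("Same article, other directory", "ten tips for training your dog"))
print(len(store.articles))
```

This only catches copies that are identical once case, punctuation and spacing are ignored. The same article with a paragraph reworded will sail straight through, so near-duplicate detection would need something smarter.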
The Directories:
  • * RESEARCHING THE LIST *
The Plan:
  • Display a form to send keywords to the script. Use GET data.
  • Script curls our directories. Will look into the best / fastest / most appropriate way, using something along the lines of curl_multi_exec
  • Preg our data (titles, urls, blurbs and other crap!). **Hhmm
  • Save data. **Hhmm
  • Curl our urls, preg'ing the content and saving as we go. Using curl_multi_exec again.
  • With data saved, do as you wish!
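The plan names curl_multi_exec for grabbing lots of pages at once. The sketch below is not that; it is the same idea in Python with a thread pool swapped in, purely to show the shape of the parallel fetch step. `fetch_all`, `http_get` and `fake_fetch` are invented names, and the demo uses the fake fetcher so nothing actually hits the network.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def http_get(url, timeout=10):
    """Fetch one URL and return its body as text (bad bytes get replaced)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode('utf-8', errors='replace')

def fetch_all(urls, fetch=http_get, workers=8):
    """Fetch URLs in parallel; return {url: body}, with None for failures."""
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None  # one dead URL should not kill the whole batch

    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = list(pool.map(safe_fetch, urls))  # map keeps input order
    return dict(zip(urls, bodies))

def fake_fetch(url):
    """Stand-in fetcher for the demo: no network, fails on 'broken' URLs."""
    if "broken" in url:
        raise IOError("simulated failure")
    return "<html>" + url + "</html>"

pages = fetch_all(["http://example.com/a", "http://example.com/broken"],
                  fetch=fake_fetch)
for url, body in pages.items():
    print(url, "->", "FAILED" if body is None else len(body))
```

Passing the fetcher in as a parameter means the scraping and saving steps can be tested without touching a real directory. Swap `fake_fetch` for the default `http_get` when it is time to go live.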
This does not need to be BH at all. You can use the articles as they are meant to be used, leave the reference in place and just build a niche article site. Some directories may only allow a certain number of articles to be used on one site, but that's not really an issue as we are gonna scrape as many sites as our lazy arses can!
And anyway, if you want whitehat stuff then this really ain't the place mate.
Piss off!