On Monday, like most of the rest of the search world, I was reading the SEOmoz blog and marvelling at their amazing accomplishment. For those of you who don’t keep up with what’s going on, Rand and the SEOmoz team have built a database of 30 billion web pages, including a link map with information about which links are nofollowed and the anchor text used. All this is available for only $79/month.
An Index for Everyone
On Sphinn, MarkeD commented, “I can see this being the next trend, private indexes for each SEO agency”, which got me thinking about the best way to collect data similar to the SEOmoz index whilst minimising costs. Crawling the whole web is never going to be cheap; the bandwidth and storage requirements are too big. But for most applications a crawl of the whole web is not necessary. Ask uses a “hubs and authorities” model of the web, and Google is believed to use something similar in its rankings, called Hilltop, so I believe a crawl centred on the main players in a site’s vertical will give most of the relevant data.
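To make the “hubs and authorities” idea concrete, here is a toy sketch of the HITS-style iteration that model is based on. The pages, links and iteration count are invented purely for illustration; a real vertical crawl would feed in thousands of URLs rather than five.

```java
import java.util.*;

// A toy hubs-and-authorities (HITS) iteration over a hand-built link graph.
// The URLs and links below are made up purely for illustration.
public class HubsAndAuthorities {
  public static void main(String[] args) {
    // adjacency: page -> pages it links out to
    Map<String, List<String>> links = new HashMap<>();
    links.put("hub-blog.example", Arrays.asList("auth-a.example", "auth-b.example"));
    links.put("hub-dir.example",  Arrays.asList("auth-a.example", "auth-c.example"));
    links.put("auth-a.example",   Arrays.asList("auth-b.example"));
    links.put("auth-b.example",   Collections.<String>emptyList());
    links.put("auth-c.example",   Collections.<String>emptyList());

    Map<String, Double> hub = new HashMap<>(), auth = new HashMap<>();
    for (String page : links.keySet()) { hub.put(page, 1.0); auth.put(page, 1.0); }

    for (int i = 0; i < 20; i++) {
      // authority score: sum of the hub scores of pages linking in
      Map<String, Double> newAuth = new HashMap<>();
      for (String page : links.keySet()) newAuth.put(page, 0.0);
      for (Map.Entry<String, List<String>> e : links.entrySet())
        for (String target : e.getValue())
          newAuth.put(target, newAuth.get(target) + hub.get(e.getKey()));

      // hub score: sum of the authority scores of pages linked out to
      Map<String, Double> newHub = new HashMap<>();
      for (String page : links.keySet()) {
        double h = 0.0;
        for (String target : links.get(page)) h += newAuth.get(target);
        newHub.put(page, h);
      }
      auth = normalise(newAuth);
      hub = normalise(newHub);
    }
    System.out.println("hubs:        " + hub);
    System.out.println("authorities: " + auth);
  }

  // scale scores so they sum to 1, which keeps the iteration from blowing up
  private static Map<String, Double> normalise(Map<String, Double> scores) {
    double total = 0.0;
    for (double v : scores.values()) total += v;
    Map<String, Double> out = new HashMap<>();
    for (Map.Entry<String, Double> e : scores.entrySet())
      out.put(e.getKey(), total == 0 ? 0.0 : e.getValue() / total);
    return out;
  }
}
```

Pages that link out to strong authorities end up with high hub scores, and pages linked to by strong hubs end up with high authority scores, which is exactly the kind of community structure a vertical crawl would map out.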
Anyone can Crawl
So how would I accomplish this? The open source search engine Nutch (used by Wikipedia) has a web crawler that can gather all the data used by the Linkscape tool. Nutch uses the Apache License, so if you know Java you can modify the whole thing to suit your needs. Most of the Nutch documentation goes straight over my head, but there is an easier-to-understand review of the crawler on java.net.
The official Nutch site has a simple tutorial on how to set up a whole-web crawl. By using a seed list appropriate to your vertical and a smaller crawl depth, you can build a smaller link map that only covers sites in your target community. You can then apply your own metrics to this data and decide on your strategy.
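As a rough sketch of the “stay inside the vertical” idea, the class below builds a host whitelist from a seed list and checks whether a URL is in scope. This is not how Nutch’s own URL filters are configured (Nutch has its own plugin and regex mechanisms); the seeds.txt file name and the host-whitelist rule are my own assumptions for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Crude "stay in the vertical" check: a URL is in scope if its host
// matches the host of one of the seed URLs. Purely illustrative.
public class VerticalUrlFilter {
  private final Set<String> allowedHosts = new HashSet<>();

  public VerticalUrlFilter(List<String> seedUrls) {
    for (String seed : seedUrls) {
      try {
        String host = URI.create(seed.trim()).getHost();
        if (host != null) allowedHosts.add(host.toLowerCase());
      } catch (IllegalArgumentException badSeed) {
        // skip lines that are not valid URLs
      }
    }
  }

  public boolean inScope(String url) {
    try {
      String host = URI.create(url).getHost();
      return host != null && allowedHosts.contains(host.toLowerCase());
    } catch (IllegalArgumentException badUrl) {
      return false; // unparsable URLs are ignored
    }
  }

  public static void main(String[] args) throws IOException {
    // seeds.txt is assumed to hold one seed URL per line
    List<String> seeds = Files.readAllLines(Paths.get("seeds.txt"));
    VerticalUrlFilter filter = new VerticalUrlFilter(seeds);
    System.out.println(filter.inScope("http://example.com/some/page"));
  }
}
```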
Sharing the Load
But what if you don’t agree with the hubs and authorities model, or what if you think that Hilltop doesn’t have much influence on most rankings? These opinions are justified; pre-Hilltop Google massively outperformed Ask, so the value of a search model based on web communities is debatable. What is a small SEO company to do? A single small company can’t do very much on its own, but many small companies together could use the Java distributed computing framework Hadoop (inspired by Google’s MapReduce and the Google File System), which is well supported by Nutch. This means that, as a community, SEO companies could have a comprehensive index of the web, including pretty much any information we’d like, with the cost dependent on the number of participants.
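To give a feel for how Hadoop would carry the load, here is a minimal MapReduce sketch that aggregates anchor text and nofollow counts per target URL. The tab-separated input format is an assumption of mine, not what Nutch actually emits, and a real shared index would need a lot more (URL canonicalisation, deduplication, crawl politeness and so on).

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnchorTextJob {

  // Map: each input line is assumed to be "sourceUrl \t targetUrl \t nofollow \t anchorText".
  public static class AnchorMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 4);
      if (fields.length < 4) return;                     // skip malformed records
      String targetUrl = fields[1];
      String nofollow  = fields[2];                      // assumed to be "true" or "false"
      String anchor    = fields[3];
      // Key by the page being linked to; keep the nofollow flag and anchor as the value.
      context.write(new Text(targetUrl), new Text(nofollow + "\t" + anchor));
    }
  }

  // Reduce: collect every anchor pointing at a URL, counting followed vs nofollowed links.
  public static class AnchorReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text targetUrl, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      int followed = 0, nofollowed = 0;
      StringBuilder anchors = new StringBuilder();
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if ("true".equals(parts[0])) nofollowed++; else followed++;
        if (parts.length > 1) anchors.append(parts[1]).append(" | ");
      }
      context.write(targetUrl, new Text(
          "followed=" + followed + " nofollowed=" + nofollowed + " anchors=" + anchors));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "anchor text aggregation");
    job.setJarByClass(AnchorTextJob.class);
    job.setMapperClass(AnchorMapper.class);
    job.setReducerClass(AnchorReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // extracted link records
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // aggregated per-URL output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each participating company would only need to contribute a few machines to the cluster; Hadoop takes care of splitting the input across mappers and merging the per-URL results in the reducers.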
$79/month? This could work out cheaper, but the initial investment of time would be large; it took SEOmoz 12 months to set up their database, and they were all working for the same company with a clearly defined goal. I don’t think my idea will survive in the wild; Rand, your investment is safe.