Building the perfect SEO crawler
A guy can dream, right?
I’ve fiddled with crawler technologies for years. A good web spider is an essential tool for any SEO. There’s Xenu, and Screaming Frog, and Scrapy, and lots of others. They’re all nice. But I have this wish list of features I’d like to see in a perfect SEO crawler.
I’d always told myself that I’d code something up with all these features when I had the spare time. Since I probably won’t have any of that until I’m mummified, here’s my specification for a perfect SEO crawler:
My crawler can’t barf everywhere on sites larger than 100,000 URLs. To make that work, it should be:
- Multi-threaded: Runs several ‘workers’ simultaneously, all crawling the same site.
- Distributed: Able to use multiple computers to crawl and retrieve URLs.
- Polite: Monitors site performance. If response times climb, then the crawler slows down. No unintended denial of service attacks, and no angry calls from IT managers, OK?
- Smart storage: Store pages in a NoSQL database for fast retrieval. Then store page attributes, URLs, response codes, etc. in a SQL database.
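Here’s a rough sketch of the ‘polite’ part, since it’s the one that keeps IT off my back. It’s one possible approach, not how any existing crawler does it: the class name `PoliteThrottle` and every parameter (`base_delay`, `max_delay`, `slow_factor`, `smoothing`) are made up for illustration. Workers share one throttle, report how long each request took, and sleep for whatever delay comes back.

```python
import threading


class PoliteThrottle:
    """Hypothetical helper: grows the pause between requests when the site slows down."""

    def __init__(self, base_delay=0.5, max_delay=30.0, slow_factor=2.0, smoothing=0.3):
        self.base_delay = base_delay    # pause (seconds) while the site is healthy
        self.max_delay = max_delay      # never wait longer than this
        self.slow_factor = slow_factor  # "slow" means this many times the baseline
        self.smoothing = smoothing      # weight given to the newest healthy sample
        self.baseline = None            # moving average of healthy response times
        self.delay = base_delay
        self._lock = threading.Lock()   # several workers share one throttle

    def record(self, response_seconds):
        """Report one response time; returns the delay to sleep before the next request."""
        with self._lock:
            if self.baseline is None:
                # First sample: take it as the baseline.
                self.baseline = response_seconds
            elif response_seconds > self.baseline * self.slow_factor:
                # The site is struggling: double the pause, up to the cap.
                self.delay = min(self.delay * 2, self.max_delay)
            else:
                # Healthy response: ease back toward the base delay
                # and fold the sample into the baseline.
                self.delay = max(self.delay / 2, self.base_delay)
                self.baseline = ((1 - self.smoothing) * self.baseline
                                 + self.smoothing * response_seconds)
            return self.delay
```

Each worker would time its own fetch, call `throttle.record(elapsed)`, and then `time.sleep()` for the returned number of seconds. Doubling on slow responses and halving on healthy ones is just one simple back-off rule; the thresholds would need tuning per site.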
Just racing through a site and saying “this page is good, this page is bad” isn’t enough. I need to build a real index that stores pages, an inverted index of the site, and response codes/other page data found along the way. That will let me:
- Find the topics and terms pages emphasize, and detect clusters of related content.
- Search for and detect crawl issues ‘off-line’ after the crawl is complete.
- Mine for coding problems and errors.
- Check for ‘quality’ signals like grammar and writing level.
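To make ‘inverted index’ concrete, here’s a minimal in-memory sketch. It assumes the crawl has already produced a dict of URL to page text; the function names (`tokenize`, `build_inverted_index`, `search`) are mine, and a real index would live in a database and handle stemming, stop words and so on.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric terms."""
    return TOKEN.findall(text.lower())


def build_inverted_index(pages):
    """pages maps URL -> page text. Returns term -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] = index[term].get(url, 0) + 1
    return index


def search(index, query):
    """Return URLs containing every query term, most occurrences first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    urls = set(postings[0])
    for posting in postings[1:]:
        urls &= set(posting)  # keep only pages that contain all terms
    # Rank by total term count; break ties alphabetically by URL.
    return sorted(urls, key=lambda u: (-sum(p[u] for p in postings), u))
```

With something like this in place, questions such as “which pages emphasize this term?” become lookups after the crawl is done, rather than a reason to crawl again.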
I also need a crawler that behaves the way the big kids do: Google renders pages. I need my crawler to enable the same behavior. (I know, all you purists will say “The crawler doesn’t do that, the indexing mechanism does.” This is my dream. I get to mess with semantics a little. Phhbbbt.)
On large sites, this would provide critical insight. Designers are so often obsessed with hiding all that nasty content. That’s fine, if we can see where content needs to be revealed. With a rendering tool, we could do that.
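Real rendering needs a headless browser, which is more than I can show in a few lines. As a much cruder stand-in, here’s a sketch that scans raw HTML for text sitting inside elements hidden with an inline `display: none`, `visibility: hidden` or the `hidden` attribute. It’s a static check, not rendering: it won’t see anything hidden by a stylesheet or by JavaScript, and it assumes reasonably well-formed markup with closing tags. The class name `HiddenTextFinder` is hypothetical.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
# Elements that never have a closing tag, so they must not touch the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenTextFinder(HTMLParser):
    """Collects text found inside elements hidden by inline style or attribute."""

    def __init__(self):
        super().__init__()
        self.stack = []        # one flag per open element: is it (or an ancestor) hidden?
        self.hidden_text = []  # text fragments found in hidden elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        hidden_here = ("hidden" in attrs
                       or bool(HIDDEN_STYLE.search(attrs.get("style") or "")))
        inherited = self.stack[-1] if self.stack else False
        self.stack.append(inherited or hidden_here)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1]:
            self.hidden_text.append(text)
```

Feed it a page with `finder.feed(html)` and read `finder.hidden_text` afterward. Stored alongside the URL, that list gives a first pass at where content might need to be revealed.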
Reporting over time
Finally, fold all of this data together into a tool that shows me rankings, organic search traffic and site attributes, all in one place. Then, I can finally show my clients what changes they made, how those changes affected results, and why they should stop rolling their eyes every time I make a suggestion.
Get me that tomorrow, OK?
I know this is a really tall order. Like I said, I’ve been plinking away at this for years. But I do think it’s all possible. Cloud storage and processor time are cheap. Crawling technologies are ubiquitous. So, who’s with me?