Building the perfect SEO crawler

A guy can dream, right?

I’ve fiddled with crawler technologies for years. A good web spider is an essential tool for any SEO. There’s Xenu, and Screaming Frog, and Scrapy, and lots of others. They’re all nice. But I have this wish list of features I’d like to see in a perfect SEO crawler.

I’d always told myself that I’d code something up with all these features when I had the spare time. Since I probably won’t have any of that until I’m mummified, here’s my specification for a perfect SEO crawler:

Scalable

My crawler can’t barf everywhere on sites larger than 100,000 URLs. To make that work, it should be:

 

  • Multi-threaded: Runs several ‘workers’ simultaneously, all crawling the same site.
  • Distributed: Able to use multiple computers to crawl and retrieve URLs.
  • Polite: Monitors site performance. If response times climb, then the crawler slows down. No unintended denial of service attacks, and no angry calls from IT managers, OK?
  • Smart storage: Stores pages in a NoSQL database for fast retrieval, then stores page attributes, URLs, response codes, etc. in a SQL database.
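
Here's a rough sketch of how the multi-threaded and polite pieces might fit together. Everything in it is my own assumption, not a finished design: the `PoliteThrottle` and `crawl` names, the doubling/halving back-off, and the `fetch` function you'd pass in (any callable that takes a URL and returns something).

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class PoliteThrottle:
    """Shared delay that grows when the site slows down and shrinks when it recovers."""

    def __init__(self, base_delay=0.5, max_delay=30.0, slow_factor=2.0):
        self.base_delay = base_delay      # delay between requests on a healthy site
        self.max_delay = max_delay        # never wait longer than this
        self.slow_factor = slow_factor    # "slow" = this many times the baseline
        self.delay = base_delay
        self.baseline = None              # smoothed "normal" response time (seconds)
        self._lock = threading.Lock()

    def record(self, response_time):
        """Update the shared delay after a request and return the new delay."""
        with self._lock:
            if self.baseline is None:
                # First observation defines what "normal" looks like.
                self.baseline = response_time
            elif response_time > self.baseline * self.slow_factor:
                # Site is struggling: double the delay, up to the cap.
                self.delay = min(self.delay * 2, self.max_delay)
            else:
                # Site looks healthy: ease back toward the base delay
                # and fold this sample into the baseline.
                self.delay = max(self.delay / 2, self.base_delay)
                self.baseline = 0.8 * self.baseline + 0.2 * response_time
            return self.delay

    def wait(self):
        """Sleep for the current delay before making a request."""
        with self._lock:
            delay = self.delay
        time.sleep(delay)


def crawl(urls, fetch, throttle, workers=4):
    """Fetch every URL with a pool of workers that share one throttle."""

    def worker(url):
        throttle.wait()
        start = time.monotonic()
        result = fetch(url)
        throttle.record(time.monotonic() - start)
        return url, result

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(worker, urls))
```

Because all the workers share one throttle, a slowdown spotted by any of them slows the whole crawl. Distribution and storage are left out here; they'd sit on either side of `fetch`.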

 

Index building

Just racing through a site and saying “this page is good, this page is bad” isn’t enough. I need to build a real index that stores pages, an inverted index of the site, and response codes/other page data found along the way. That will let me:

 

  • Find the topics and terms pages emphasize, and detect clusters of related content.
  • Search for and detect crawl issues ‘off-line’ after the crawl is complete.
  • Mine for coding problems and errors.
  • Check for ‘quality’ signals like grammar and writing level.
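
To make "inverted index" concrete, here's a minimal sketch. It assumes the crawl has already produced a dict of URL to page text; the function names and the simple lowercase tokenizer are mine, and a real index would add stemming, stop words and on-disk storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each term to the pages it appears on.

    pages: dict of url -> page text.
    Returns: dict of term -> {url: number of occurrences}.
    """
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] = index[term].get(url, 0) + 1
    return index


def search(index, query):
    """Return URLs containing every query term, most occurrences first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Keep only URLs present in every term's posting list.
    matches = set(postings[0]).intersection(*postings[1:])
    # Sort by total occurrences (descending), then by URL for stable output.
    return sorted(matches, key=lambda url: (-sum(p[url] for p in postings), url))


pages = {
    "/a": "Red shoes and red socks",
    "/b": "Blue shoes",
    "/c": "Red hats",
}
index = build_inverted_index(pages)
print(search(index, "red"))        # ['/a', '/c']
print(search(index, "red shoes"))  # ['/a']
```

Once the index exists, the "off-line" questions become lookups rather than re-crawls.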

 

Page rendering

I also need a crawler that behaves the way the big kids do: Google renders pages. I need my crawler to enable the same behavior. (I know, all you purists will say “The crawler doesn’t do that, the indexing mechanism does.” This is my dream. I get to mess with semantics a little. Phhbbbt.)

If my crawler can store and index pages, then it can render them, too. Give me a crawler that can detect pages where 75% of the content is hidden via JavaScript, or where critical links are pushed down the page. Then I’ll be happy.

On large sites, this would provide critical insight. Designers are so often obsessed with hiding all that nasty content. That’s fine, if we can see where content needs to be revealed. With a rendering tool, we could do that.
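
As a very rough stand-in for real rendering, here's a static heuristic that estimates how much of a page's text sits inside hidden elements. It only sees inline `style="display:none"`, `visibility:hidden` and the `hidden` attribute, so it misses anything hidden by a stylesheet or a script; catching those needs an actual rendering engine, which this sketch does not attempt. The class and function names and the 75% threshold check are assumptions for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is code, not page content.
SKIP_TAGS = {"script", "style"}


class HiddenTextCounter(HTMLParser):
    """Counts text characters overall and inside inline-hidden elements."""

    def __init__(self):
        super().__init__()
        self.stack = []          # one (tag, is_hidden) entry per open element
        self.total_chars = 0
        self.hidden_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        hidden_here = ("hidden" in attrs
                       or "display:none" in style
                       or "visibility:hidden" in style)
        # An element inside a hidden parent is hidden too.
        parent_hidden = self.stack[-1][1] if self.stack else False
        self.stack.append((tag, hidden_here or parent_hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.stack[-1][0] in SKIP_TAGS:
            return
        n = len(data.strip())
        self.total_chars += n
        if self.stack and self.stack[-1][1]:
            self.hidden_chars += n


def hidden_fraction(html):
    """Return the share of text characters inside inline-hidden elements."""
    parser = HiddenTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.hidden_chars / parser.total_chars


page = '<div><p>abcd</p><div style="display: none"><p>efghijklmnop</p></div></div>'
print(hidden_fraction(page) >= 0.75)  # True: 12 of 16 characters are hidden
```

Run across a stored crawl, a check like this could produce the list of pages worth a closer look.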

Reporting over time

Finally, fold all of this data together into a tool that shows me rankings, organic search traffic and site attributes, all in one place. Then, I can finally show my clients what changes they made, how much the changes influenced results, and why they should stop rolling their eyes every time I make a suggestion.

Get me that tomorrow, OK?

I know this is a really tall order. Like I said, I’ve been plinking away at this for years. But I do think it’s all possible. Cloud storage and processor time are cheap. Crawling technologies are ubiquitous. So, who’s with me?

10 Comments

  1. We have it all in our next release 😉

    Really like the post & some excellent ideas there Ian. A few areas might be a tall order, but all is possible. We have some of the above on our own development / wish list.

    There is huge scope and potential for enterprise level crawling, monitoring and reporting solutions that are integrated in this way.

  2. @ScreamingFrog very much looking forward to the next release!

  3. 😉 This is not a “SEO crawler” but a decent search engine.

    Be prepared to burn your pensions.

  4. @Ryan Yep, it is indeed. See above, where I said it’s not strictly a crawler, so much as a crawler plus search engine. The insights it can provide would be priceless, though.

  5. I like all of your ideas, Ian. I’m surprised that there isn’t a crawler that can classify content like some of the advanced text mining tools. For example, RapidMiner will show you duplicate content and content clusters based on similarity (it drains your computer’s RAM though).

  6. I like what Ranks.NL has done with their Page Analyzer. The tool is a keyword density/prominence analyzer. For every phrase that it finds on the page, the tool does a Google Ranking check. It uses SEMRush data, so it won’t have ranking information on every page. But it’s still a step in the right direction.

  7. Ian Lurie wrote: “@Ryan Yep, it is indeed. See above, where I said it’s not strictly a crawler, so much as a crawler plus search engine. The insights it can provide would be priceless, though.”

    Yes Ian, the insights are great. Actually this is the dream of any SEO.

    Btw, I don’t see any link analysis methods mentioned. Since you are going to build an index and do offline computation anyway, why not also create a link matrix to do a PageRank-style node importance calculation?

  8. Polite is an important one, and why I rarely use a multi-threaded crawler for a single site (they are handy when crawling lots of different URLs though, such as checking validity of backlinks or a listing from SERPs). Think the HTTP spec (RFC 2616) did recommend keeping it to 2 connections at a time max between a single server/client – although that is when you could use distribution or proxies.

    I’ve been wanting to do clustering/classification based on HTML features, but never got around to it, and like the idea of mixing NoSQL and SQL databases for storage.

    Boilerplate extraction based on the HTML hierarchy and tag/text ratio has worked well for me, as has phrase based indexing for extracting phrases and identifying the pages/components they appear in.

    Also PageRank is ok for analysing structure, although ideally it needs to be combined with some weighting for where the links are located on the page, which I’ve not yet nailed. I do some hierarchical analysis as well to map out the site, but this gets unwieldy with anything but the simplest of sites.

    Using cross-entropy to measure the similarity of pages is also reasonably effective at identifying similar content, although if I had the time would probably look at something based on block-level HTML to give an indication of just what components are being duplicated.

  9. @Screaming Frog – I think it would be great if you guys provided an exportable summary of all the issues found – preferably an Excel workbook with separate worksheets for every issue.

  10. That’s exactly why we’ve built Botify.
    Hope you’ve all started your free trial 😉
    Please share your feedback with the team.
    Thanks
