Building the perfect SEO crawler
Written by Ian Lurie
Monday, 13 February 2012 00:00

A guy can dream, right?

I've fiddled with crawler technologies for years. A good web spider is an essential tool for any SEO. There's Xenu, and Screaming Frog, and Scrapy, and lots of others. They're all nice. But I have this wish list of features I'd like to see in a perfect SEO crawler.


I'd always told myself that I'd code something up with all these features when I had the spare time. Since I probably won't have any of that until I'm mummified, here's my specification for a perfect SEO crawler:

Scalable

My crawler can't barf everywhere on sites larger than 100,000 URLs. To make that work, it should be:

 

  • Multi-threaded: Runs several 'workers' simultaneously, all crawling the same site.
  • Distributed: Able to use multiple computers to crawl and retrieve URLs.
  • Polite: Monitors site performance. If response times climb, the crawler slows down. No unintended denial-of-service attacks, and no angry calls from IT managers, OK?
  • Smart storage: Store pages in a NoSQL database for fast retrieval. Then store page attributes, URLs, response codes, etc. in a SQL database.
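
To make the 'polite' and 'multi-threaded' parts concrete, here is a minimal Python sketch; the class name, thresholds and back-off numbers are illustrative assumptions, not a finished implementation:

```python
# Illustrative only: several workers fetch from a shared frontier, and a
# shared delay grows when the site's response times climb (politeness).
import threading
import time
from queue import Queue

import requests


class PoliteCrawler:
    def __init__(self, baseline_secs=0.5):
        self.frontier = Queue()        # URLs waiting to be fetched
        self.baseline = baseline_secs  # response time we consider "healthy"
        self.delay = 0.0               # extra pause added when the site slows
        self.lock = threading.Lock()

    def fetch(self, url):
        start = time.time()
        resp = requests.get(url, timeout=30)
        elapsed = time.time() - start
        with self.lock:
            if elapsed > self.baseline:
                self.delay = min(self.delay + 0.5, 10.0)  # back off
            else:
                self.delay = max(self.delay - 0.1, 0.0)   # recover
        return resp

    def worker(self):
        while True:
            url = self.frontier.get()
            if url is None:            # sentinel: this worker is done
                self.frontier.task_done()
                break
            try:
                self.fetch(url)
            except requests.RequestException:
                pass                   # a real crawler would log and retry
            finally:
                time.sleep(self.delay)
                self.frontier.task_done()

    def crawl(self, urls, workers=4):
        for url in urls:
            self.frontier.put(url)
        for _ in range(workers):
            self.frontier.put(None)
        threads = [threading.Thread(target=self.worker) for _ in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

The 'distributed' piece is the same idea with the frontier moved to a shared store, so workers on different machines can pull from it.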

 

Index building

Just racing through a site and saying "this page is good, this page is bad" isn't enough. I need to build a real index that stores pages, an inverted index of the site, and response codes/other page data found along the way. That will let me:

 

  • Find the topics and terms pages emphasize, and detect clusters of related content.
  • Search for and detect crawl issues 'off-line' after the crawl is complete.
  • Mine for coding problems and errors.
  • Check for 'quality' signals like grammar and writing level.
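
As a toy illustration of what a real index buys you offline, here is a bare-bones inverted index over already-extracted page text; the tokenizer is deliberately naive and the example pages are made up:

```python
# Toy inverted index: maps each term to the set of URLs it appears on,
# which is enough to answer "which pages mention X" after the crawl.
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def build_inverted_index(pages):
    """pages: dict of {url: extracted_text}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


pages = {
    "/red-widgets": "Red widgets for sale. Widgets ship free.",
    "/blue-widgets": "Blue widgets and other blue things.",
}
index = build_inverted_index(pages)
print(index["widgets"])   # {'/red-widgets', '/blue-widgets'}
print(index["blue"])      # {'/blue-widgets'}
```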

 

Page rendering

I also need a crawler that behaves the way the big kids do: Google renders pages. I need my crawler to do the same. (I know, all you purists will say "The crawler doesn't do that, the indexing mechanism does." This is my dream. I get to mess with semantics a little. Phhbbbt.)

If my crawler can store and index pages, then it can render them, too. Give me a crawler that can detect pages where 75% of the content is hidden via JavaScript, or where critical links are pushed down the page. Then I'll be happy.

On large sites, this would provide critical insight. Designers are so often obsessed with hiding all that nasty content. That's fine, if we can see where content needs to be revealed. With a rendering tool, we could do that.
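
One way to approximate this, purely as a sketch: fetch the raw HTML, render the page in a headless browser (Playwright here, as an assumed stand-in for whatever renderer the crawler actually uses), and compare how much of the markup's text is still visible afterwards. The 75% threshold mirrors the example above and is arbitrary:

```python
# Sketch: flag pages where most of the markup's text never becomes visible
# once the page is rendered. Assumes Playwright is installed.
import re

import requests
from playwright.sync_api import sync_playwright


def visible_text_ratio(url):
    raw_html = requests.get(url, timeout=30).text
    raw_html = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", " ", raw_html)
    raw_text = re.sub(r"<[^>]+>", " ", raw_html)            # crude tag strip
    raw_len = len(re.sub(r"\s+", " ", raw_text).strip())

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        rendered = page.inner_text("body")                  # what a user can see
        browser.close()
    rendered_len = len(re.sub(r"\s+", " ", rendered).strip())

    return rendered_len / max(raw_len, 1)


if visible_text_ratio("https://example.com/") < 0.25:
    print("More than ~75% of this page's text is hidden after rendering.")
```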

Reporting over time

Finally, fold all of this data together into a tool that shows me rankings, organic search traffic and site attributes, all in one place. Then I can finally show my clients what changes they made, how well the changes influenced results, and why they should stop rolling their eyes every time I make a suggestion.
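
As a sketch of the 'one place' idea, assuming SQLite and an invented schema (every table and column name here is made up), the fold-together step is mostly a join keyed on URL and date:

```python
# Sketch: line up crawl findings, rankings and organic traffic per URL per
# date, so a change to a page can be tied to what happened afterwards.
import sqlite3

conn = sqlite3.connect("seo_history.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS crawl_snapshots (
    crawl_date TEXT, url TEXT, status_code INTEGER,
    title TEXT, word_count INTEGER
);
CREATE TABLE IF NOT EXISTS rankings (
    check_date TEXT, url TEXT, keyword TEXT, position INTEGER
);
CREATE TABLE IF NOT EXISTS organic_traffic (
    visit_date TEXT, url TEXT, visits INTEGER
);
""")

# "What did the page look like, where did it rank, what traffic did it get?"
report = conn.execute("""
SELECT c.crawl_date, c.url, c.status_code, r.keyword, r.position, t.visits
FROM crawl_snapshots c
LEFT JOIN rankings r        ON r.url = c.url AND r.check_date = c.crawl_date
LEFT JOIN organic_traffic t ON t.url = c.url AND t.visit_date = c.crawl_date
ORDER BY c.url, c.crawl_date
""").fetchall()
```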

Get me that tomorrow, OK?

I know this is a really tall order. Like I said, I've been plinking away at this for years. But I do think it's all possible. Cloud storage and processor time are cheap. Crawling technologies are ubiquitous. So, who's with me?

Ian Lurie

Ian Lurie is Chief Marketing Curmudgeon and President at Portent, an internet marketing company he started in 1995. Portent is a full-service internet marketing company whose services include SEO, SEM and strategic consulting. He started practicing SEO in 1997 and has been addicted ever since. Ian rants and raves, with a little teaching mixed in, on his internet marketing blog, Conversation Marketing. He recently co-published the Web Marketing for Dummies All In One Desk Reference. In it, he wrote the sections on SEO, blogging, social media and web analytics.

Last Updated on Monday, 13 February 2012 13:44
 

Comments  

 
0 #1 Screaming Frog 2012-02-13 15:15
We have it all in our next release ;-)

Really like the post & some excellent ideas there, Ian. A few areas might be a tall order, but all is possible. We have some of the above on our own development / wish list.

There is huge scope and potential for enterprise level crawling, monitoring and reporting solutions that are integrated in this way.
 
 
0 #2 Elevatelocal 2012-02-13 16:24
@ScreamingFrog very much looking forward to the next release!
 
 
0 #3 Ryan Chooai 2012-02-13 17:14
;-) This is not an "SEO crawler" but a decent search engine.

Be prepared to burn your pensions.
 
 
+1 #4 Ian Lurie 2012-02-13 19:06
@Ryan Yep, it is indeed. See above, where I said it's not strictly a crawler, so much as a crawler plus search engine. The insights it can provide would be priceless, though.
 
 
0 #5 SEO Newbie 2012-02-13 22:49
I like all of your ideas, Ian. I'm surprised that there isn't a crawler that can classify content like some of the advanced text-mining tools. For example, RapidMiner will show you duplicate content and content clusters based on similarity (it drains your computer's RAM though).
 
 
0 #6 SEO Newbie 2012-02-13 23:03
I like what Ranks.NL has done with their Page Analyzer. The tool is a keyword density/prominence analyzer. For every phrase that it finds on the page, the tool does a Google Ranking check. It uses SEMRush data, so it won't have ranking information on every page. But it's still a step in the right direction.
 
 
0 #7 Ryan Chooai 2012-02-14 06:18
Quoting Ian Lurie:
@Ryan Yep, it is indeed. See above, where I said it's not strictly a crawler, so much as a crawler plus search engine. The insights it can provide would be priceless, though.


Yes Ian, the insights are great. Actually this is the dream of any SEO.

Btw, I don't see any link analysis methods mentioned. Since you are going to build an index and do offline computation anyway, why not also build a link matrix and do a PageRank-style node importance calculation?
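
A toy illustration of the kind of offline calculation Ryan is suggesting - a PageRank-style importance score computed by power iteration over an internal-link graph (the graph below is invented):

```python
# Toy PageRank-style importance over an internal-link graph, by power
# iteration. Dangling pages simply drop their share here; this is a sketch.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict of {page: [pages it links to]}."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] = new_rank.get(target, 0.0) + share
        rank = new_rank
    return rank


links = {
    "/": ["/products", "/blog"],
    "/products": ["/"],
    "/blog": ["/", "/products"],
}
print(pagerank(links))
```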
 
 
0 #8 Chris McGiffen 2012-02-14 12:43
Polite is an important one, and it's why I rarely use a multi-threaded crawler for a single site (they are handy when crawling lots of different URLs, though, such as checking the validity of backlinks or a listing from SERPs). I think the HTTP definition does recommend keeping it to 3 requests at a time max between a single server/client - although that is where you could use distribution or proxies.

I've been wanting to do clustering/classification based on HTML features, but never got around to it, and I like the idea of mixing NoSQL and SQL databases for storage.

Boilerplate extraction based on the HTML hierarchy and tag/text ratio has worked well for me, as has phrase-based indexing for extracting phrases and identifying the pages/components they appear in.

Also, PageRank is OK for analysing structure, although ideally it needs to be combined with some weighting for where the links are located on the page, which I've not yet nailed. I do some hierarchical analysis as well to map out the site, but this gets unwieldy with anything but the simplest of sites.

Using cross-entropy to measure the similarity of pages is also reasonably effective at identifying similar content, although if I had the time I would probably look at something based on block-level HTML to give an indication of just what components are being duplicated.
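
For what it's worth, a minimal sketch of the cross-entropy idea Chris mentions - comparing pages by how well one page's term distribution predicts the other's (the smoothing constant and the tokenization are arbitrary choices):

```python
# Rough sketch: score how well page B's wording "covers" page A's wording.
# Lower cross-entropy means more similar content; epsilon smooths unseen terms.
import math
from collections import Counter


def term_distribution(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def cross_entropy(text_a, text_b, epsilon=1e-6):
    p = term_distribution(text_a)
    q = term_distribution(text_b)
    return -sum(p_t * math.log(q.get(term, epsilon)) for term, p_t in p.items())


print(cross_entropy("red widgets for sale", "red widgets for sale today"))
print(cross_entropy("red widgets for sale", "a completely unrelated article"))
```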
 
 
0 #9 SEO Newbie 2012-02-14 16:54
@Screaming Frog - I think it would be great if you guys provided an exportable summary of all the issues found - preferably an Excel workbook with separate worksheets for every issue.
 
