| Building the perfect SEO crawler |
| Written by Ian Lurie | |
| Monday, 13 February 2012 00:00 | |
| A guy can dream, right? I've fiddled with crawler technologies for years. A good web spider is an essential tool for any SEO. There's Xenu, and Screaming Frog, and Scrapy, and lots of others. They're all nice. But I have this wish list of features I'd like to see in a perfect SEO crawler. I'd always told myself that I'd code something up with all these features when I had the spare time. Since I probably won't have any of that until I'm mummified, here's my specification for a perfect SEO crawler:

Scalable

My crawler can't barf everywhere on sites larger than 100,000 URLs. To make that work, it should be:

Index building

Just racing through a site and saying "this page is good, this page is bad" isn't enough. I need to build a real index that stores pages, an inverted index of the site, and response codes/other page data found along the way. That will let me

Page rendering

I also need a crawler that behaves the way the big kids do: Google renders pages. I need my crawler to enable the same behavior. (I know, all you purists will say "The crawler doesn't do that, the indexing mechanism does." This is my dream. I get to mess with semantics a little. Phhbbbt.)

If my crawler can store and index pages, then it can render them, too. Give me a crawler that can detect pages where 75% of the content is hidden via JavaScript, or where critical links are pushed down the page. Then I'll be happy.

On large sites, this would provide critical insight. Designers are so often obsessed with hiding all that nasty content. That's fine, if we can see where content needs to be revealed. With a rendering tool, we could do that.

Reporting over time

Finally, fold all of this data together into a tool that shows me rankings, organic search traffic and site attributes, all in one place. Then I can finally show my clients what changes they made, how well the changes influenced results, and why they should stop rolling their eyes every time I make a suggestion.

Get me that tomorrow, OK?

I know this is a really tall order. Like I said, I've been plinking away at this for years. But I do think it's all possible. Cloud storage and processor time are cheap. Crawling technologies are ubiquitous. So, who's with me? | |
| Last Updated on Monday, 13 February 2012 13:44 |
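The index the article asks for (stored pages, an inverted index of the site, and response codes found along the way) could be sketched roughly as follows. This is only a minimal in-memory illustration, not anything the author built: the `SiteIndex` class, its method names, and the simple lowercase word tokenizer are all assumptions made for the example, and a crawler for 100,000+ URLs would need persistent storage instead of Python dictionaries.

```python
import re
from collections import defaultdict


class SiteIndex:
    """Toy in-memory index (hypothetical): page records plus an inverted index."""

    def __init__(self):
        self.pages = {}                    # url -> {"status": int, "text": str}
        self.postings = defaultdict(set)   # term -> set of URLs containing it

    def add_page(self, url, status, text):
        """Store the page record and index every word it contains."""
        self.pages[url] = {"status": status, "text": text}
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(url)

    def search(self, query):
        """Return the set of URLs that contain every term in the query."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

    def by_status(self, status):
        """Return a sorted list of URLs that answered with this response code."""
        return sorted(u for u, p in self.pages.items() if p["status"] == status)


index = SiteIndex()
index.add_page("/", 200, "Blue widgets and red widgets")
index.add_page("/blue", 200, "Blue widgets on sale")
index.add_page("/old", 404, "")
print(index.search("blue widgets"))
print(index.by_status(404))
```

Keeping response codes in the same store as the text is what makes the offline questions cheap: finding every broken page, or every page that mentions a phrase, becomes a lookup instead of a fresh crawl.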
Comments
Really like the post, and some excellent ideas there, Ian. A few areas might be a tall order, but all of it is possible. We have some of the above on our own development wish list.
There is huge scope and potential for enterprise level crawling, monitoring and reporting solutions that are integrated in this way.
Be prepared to burn your pensions.
Yes Ian, the insights are great. Actually this is the dream of any SEO.
Btw, I don't see any link analysis methods mentioned. Since you are going to build an index and do offline computation anyway, why not also create a link matrix and run a PageRank-style node importance calculation?
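A PageRank-style calculation over a crawled site's internal links might look roughly like the sketch below. It is the textbook power-iteration form, not a description of any search engine's actual implementation. The `pagerank` function name, the `links` dictionary format, and the default damping factor of 0.85 and 50 iterations are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each URL to the list of URLs it links to
    (a hypothetical format; a crawler would fill it from its index).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            if not targets:
                continue
            # Each page splits its damped rank equally among its outlinks.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


site = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/"], "/c": ["/"]}
for url, score in sorted(pagerank(site).items(), key=lambda item: -item[1]):
    print(url, round(score, 3))
```

On a real site the interesting output is usually the mismatch: important pages that score low because few internal links point at them.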
I've been wanting to do clustering/classification based on HTML features, but never got around to it, and I like the idea of mixing NoSQL and SQL databases for storage.
Boilerplate extraction based on the HTML hierarchy and tag/text ratio has worked well for me, as has phrase-based indexing for extracting phrases and identifying the pages/components they appear in.
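As a rough illustration of the tag/text ratio idea, here is a simplified, line-based sketch. It ignores the HTML hierarchy entirely and strips tags with a regular expression rather than a real parser, so it is an assumption-laden toy and not the commenter's method. The function names and the threshold of 10 text characters per tag are made up for the example.

```python
import re

TAG = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line):
    """Characters of visible text per tag on one line of HTML source."""
    tags = len(TAG.findall(line))
    text = len(TAG.sub("", line).strip())
    return text / max(tags, 1)


def extract_content(html, threshold=10.0):
    """Keep the text of lines whose ratio meets the (illustrative) threshold."""
    kept = []
    for line in html.splitlines():
        if text_to_tag_ratio(line) >= threshold:
            kept.append(TAG.sub("", line).strip())
    return kept


sample = """<ul><li><a href="/">Home</a></li><li><a href="/seo">SEO</a></li></ul>
<p>A good web spider is an essential tool for any SEO.</p>
<div><a href="/rss">RSS</a></div>"""
print(extract_content(sample))
```

Navigation and footer markup tends to be many tags wrapped around very little text, which is why a low ratio is a usable boilerplate signal.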
Also, PageRank is OK for analysing structure, although ideally it needs to be combined with some weighting for where the links are located on the page, which I've not yet nailed. I do some hierarchical analysis as well to map out the site, but this gets unwieldy with anything but the simplest of sites.
Using cross-entropy to measure the similarity of pages is also reasonably effective at identifying similar content, although if I had the time I would probably look at something based on block-level HTML to give an indication of just what components are being duplicated.
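One way to read the cross-entropy idea is as a comparison of word distributions between two pages: the fewer bits page B's word frequencies need to encode page A's words, the more alike the pages are. The sketch below assumes unigram word counts and add-alpha smoothing, neither of which is stated in the comment, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens (a deliberately simple, assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cross_entropy(text_p, text_q, alpha=1.0):
    """Cross-entropy H(p, q) in bits per word, with add-alpha smoothing on q."""
    p_counts = Counter(tokens(text_p))
    q_counts = Counter(tokens(text_q))
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    if p_total == 0:
        return 0.0
    # Smoothing keeps q non-zero for words that appear only in text_p.
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    entropy = 0.0
    for word, count in p_counts.items():
        p = count / p_total
        q = (q_counts[word] + alpha) / q_total
        entropy -= p * math.log2(q)
    return entropy


page_a = "blue widgets on sale today"
page_b = "blue widgets on sale now"
page_c = "contact our support team by email"
print(cross_entropy(page_a, page_b), cross_entropy(page_a, page_c))
```

Lower values mean more similar pages. Note that the measure is not symmetric, so `cross_entropy(a, b)` and `cross_entropy(b, a)` can differ; a duplicate-content check might compute both directions.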