<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Building the perfect SEO crawler</title>
		<description>Discuss Building the perfect SEO crawler</description>
		<link>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html</link>
		<lastBuildDate>Sat, 18 May 2013 16:38:56 +0000</lastBuildDate>
		<generator>JComments</generator>
		<atom:link href="http://searchnewscentral.com/feed/com_content/251/Page-1.html" rel="self" type="application/rss+xml" />
		<item>
			<title>SEO Newbie says:</title>
			<link>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1108</link>
			<description><![CDATA[@Screaming Frog - I think it would be great if you guys provided an exportable summary of all the issues found - preferably an Excel workbook with a separate worksheet for every issue.]]></description>
			<dc:creator>SEO Newbie</dc:creator>
			<pubDate>Tue, 14 Feb 2012 16:54:44 +0000</pubDate>
			<guid>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1108</guid>
		</item>
		<item>
			<title>Chris McGiffen says:</title>
			<link>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1107</link>
			<description><![CDATA[Polite is an important one, and it's why I rarely use a multi-threaded crawler for a single site (they are handy when crawling lots of different URLs though, such as checking the validity of backlinks or a listing from SERPs). I think the HTTP/1.1 spec (RFC 2616) recommends a maximum of 2 simultaneous connections between a single client and server, although that is where distribution or proxies could help. I've been wanting to do clustering/classification based on HTML features but never got around to it, and I like the idea of mixing NoSQL and SQL databases for storage. Boilerplate extraction based on the HTML hierarchy and tag/text ratio has worked well for me, as has phrase-based indexing for extracting phrases and identifying the pages/components they appear in. Also, PageRank is OK for analysing structure, although ideally it needs to be combined with some weighting for where the links are located on the page, which I've not yet nailed. I do some hierarchical analysis as well to map out the site, but this gets unwieldy with anything but the simplest of sites. Using cross-entropy to measure the similarity of pages is also reasonably effective at identifying similar content, although if I had the time I would probably look at something based on block-level HTML to give an indication of just which components are being duplicated.]]></description>
			<dc:creator>Chris McGiffen</dc:creator>
			<pubDate>Tue, 14 Feb 2012 12:43:40 +0000</pubDate>
			<guid>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1107</guid>
		</item>
		<item>
			<title>Ryan Chooai says:</title>
			<link>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1106</link>
			<description><![CDATA[Yes Ian, the insights are great. Actually this is the dream of any SEO. Btw, I don't see any link analysis methods mentioned. Since you are going to build an index and do offline computation anyway, why not also create a link matrix and do a PageRank-style node importance calculation?]]></description>
			<dc:creator>Ryan Chooai</dc:creator>
			<pubDate>Tue, 14 Feb 2012 06:18:08 +0000</pubDate>
			<guid>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1106</guid>
		</item>
		<item>
			<title>SEO Newbie says:</title>
			<link>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1105</link>
			<description><![CDATA[I like what Ranks.NL has done with their Page Analyzer. The tool is a keyword density/prominence analyzer. For every phrase that it finds on the page, the tool does a Google Ranking check. It uses SEMRush data, so it won't have ranking information on every page. But it's still a step in the right direction.]]></description>
			<dc:creator>SEO Newbie</dc:creator>
			<pubDate>Mon, 13 Feb 2012 23:03:25 +0000</pubDate>
			<guid>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1105</guid>
		</item>
		<item>
			<title>SEO Newbie says:</title>
			<link>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1104</link>
			<description><![CDATA[I like all of your ideas, Ian. I'm surprised that there isn't a crawler that can classify content like some of the advanced text mining tools. For example, RapidMiner will show you duplicate content and cluster content based on similarity (it drains your computer's RAM though).]]></description>
			<dc:creator>SEO Newbie</dc:creator>
			<pubDate>Mon, 13 Feb 2012 22:49:02 +0000</pubDate>
			<guid>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1104</guid>
		</item>
		<item>
			<title>Ian Lurie says:</title>
			<link>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1103</link>
			<description><![CDATA[@Ryan Yep, it is indeed. See above, where I said it's not strictly a crawler, so much as a crawler plus search engine. The insights it can provide would be priceless, though.]]></description>
			<dc:creator>Ian Lurie</dc:creator>
			<pubDate>Mon, 13 Feb 2012 19:06:26 +0000</pubDate>
			<guid>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1103</guid>
		</item>
		<item>
			<title>Ryan Chooai says:</title>
			<link>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1102</link>
			<description><![CDATA[;-) This is not an "SEO crawler" but a decent search engine. Be prepared to burn your pensions.]]></description>
			<dc:creator>Ryan Chooai</dc:creator>
			<pubDate>Mon, 13 Feb 2012 17:14:44 +0000</pubDate>
			<guid>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1102</guid>
		</item>
		<item>
			<title>Elevatelocal says:</title>
			<link>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1101</link>
			<description><![CDATA[@ScreamingFrog very much looking forward to the next release!]]></description>
			<dc:creator>Elevatelocal</dc:creator>
			<pubDate>Mon, 13 Feb 2012 16:24:40 +0000</pubDate>
			<guid>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1101</guid>
		</item>
		<item>
			<title>Screaming Frog says:</title>
			<link>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1100</link>
			<description><![CDATA[We have it all in our next release ;-) Really like the post & some excellent ideas there, Ian. A few areas might be a tall order, but all is possible. We have some of the above on our own development / wish list. There is huge scope and potential for enterprise-level crawling, monitoring and reporting solutions that are integrated in this way.]]></description>
			<dc:creator>Screaming Frog</dc:creator>
			<pubDate>Mon, 13 Feb 2012 15:15:44 +0000</pubDate>
			<guid>http://searchnewscentral.com/20120213251/Technical/building-the-perfect-seo-crawler.html#comment-1100</guid>
		</item>
	</channel>
</rss>
