Solving Duplicate Content Issues Arising From Faceted Navigation
Written by Barry Adams
Wednesday, 01 June 2011 14:25
I'm a big fan of faceted navigation on ecommerce websites, also known as layered navigation. With faceted nav users can find exactly what they're looking for with just a few clicks, even on websites that contain tens of thousands of products. A good implementation of faceted nav is a user experience dream come true.

Faceted nav also has SEO benefits, in that these facets serve as keyword-rich links and 'tags' of sorts that add semantic relevance to the products contained within each facet.

But it's not all good news: faceted nav can also result in problems with indexation, specifically duplicate content issues. Because many different facets often contain nearly identical sets of products, search engine spiders can end up in crawling loops where they crawl slightly different product lists over and over again.

One example is an ecommerce site under development I came across recently. Built in Magento, this site uses faceted nav and contains about 1200 products. But when I unleashed Xenu on it, it kept finding new pages until I finally aborted the crawl at over 40,000 URLs crawled.

There are several different ways to solve these faceted nav indexation issues:

Block facets with robots.txt

Using robots.txt to block search engines from crawling faceted navigation pages is probably the most brute force approach to the problem. It will undoubtedly solve the duplicate content issue, as it will block search engines from crawling vast numbers of pages, but it has several side-effects that make this a less than ideal solution.

For one, it will mean that the flow of PageRank within your site will be severely distorted. A natural flow of PageRank within your site should come from a solid site structure. Blocking faceted nav pages with robots.txt effectively distorts your site structure as it is perceived by search engine crawlers, as large parts of your site are basically blacked out for search engines.
Also, you lose any semantic SEO value the faceted navigation has. One small upside is that if your site has a low PageRank and a large number of products, you're less likely to run out of crawl budget before all your product pages are indexed.

Verdict: don't do this unless you really don't have any other choice.

Nofollow your faceted nav links

You can tag your faceted navigation links with rel=nofollow, thus preventing search engines from indexing your faceted nav pages. A slightly less blunt instrument than the robots.txt blocking approach, this solution nonetheless suffers from similar problems: it distorts the flow of PageRank within your site, as nofollowed links cause PR to evaporate.

Verdict: don't do this.

Use rel=canonical on all faceted nav pages

By using the canonical tag on all faceted nav pages and making sure they refer to the most relevant/important facet (or a 'view all products' single page), you can ensure the duplicate content faceted nav pages are not included in the search engines' indices. The flow of PageRank is unaffected, and you also preserve the semantic value of your facets. However, search engines will still crawl all the duplicate content pages, which means your crawl budget could be used up before all product pages are indexed.

Verdict: best used in conjunction with one of the other preferred solutions.

Use JavaScript/Ajax to hide faceted nav links

With smart use of JavaScript or AJAX you can ensure that search engines don't actually see the faceted navigation links at all, thus preventing the issue from occurring in the first place. What you do is load all the products in a single page, which you then paginate and divide into facets with JS or AJAX. Search engines see the whole page with all products and will crawl & index all of them, while users are presented with the user-friendly faceted navigation. This is a very solid solution, but it has one caveat: the semantic value of the faceted nav is lost.
Verdict: good solution if your main focus for faceted nav is user experience.

Meta noindex/follow tag on faceted nav pages

In order to prevent search engine crawlers from indexing all your duplicate content pages, you can tell them to keep these pages out of their indices but to still follow the links contained within. With the meta robots tag using the noindex,follow value, you do just that. The pages that have this meta tag will not appear in search engines, but crawlers will still find the products that are contained within these faceted nav pages. The flow of PageRank is preserved, and the semantic value of the facets is also intact. However, as with some other solutions, low PR sites may run out of crawl budget.

Verdict: a very good solution, especially when combined with canonical tags and static URLs.

Static URLs for faceted nav pages

Often a CMS that supports faceted navigation uses parameters in its URLs. Every time a facet is used to filter the listed products, another parameter is appended to the URL. As each URL is different, it will be treated as a separate webpage by search engine spiders, even if it contains the exact same products.

To prevent duplicate content issues arising from these parameter-driven URLs, you can configure the CMS to use static URLs for predefined facets, regardless of the order in which that facet was reached. This will drastically reduce the number of URLs on your site, and thus prevent duplicate content issues. So if a user refines a product listing first by price and then by colour, the URL of the page they end up on will be identical to the page reached by a user that refines first by colour and then by price.

Verdict: if you have faceted nav and you don't do this, you're an idiot.

Conclusion

Faceted navigation is a very potent instrument, but you need to implement it the right way.
In my opinion the best approach to prevent duplicate content and indexation issues is using static URLs for your facets, combined with meta noindex,follow for facets that have no SEO value. Throw in rel=canonical tags that point to your core facets, and the result is the best of both worlds: a solid user experience and the full SEO value.

There are probably some other solutions out there to faceted navigation issues. If you know of other/better approaches, leave them in the comments.
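
As a rough illustration of the combination recommended above, the sketch below builds one static URL per set of facets regardless of click order, and picks the head tags for each page. The facet names, the `/products` base path, the `FACET_ORDER` and `CORE_FACETS` lists and both function names are assumptions made up for this example, not something any particular CMS provides.

```python
from typing import List, Mapping
from urllib.parse import quote

# Hypothetical facet priority: facets always appear in the URL in this order.
FACET_ORDER = ["type", "brand", "colour", "price"]
# Hypothetical "core" facets that are considered worth indexing.
CORE_FACETS = {"type", "brand"}


def facet_path(selected: Mapping[str, str], base: str = "/products") -> str:
    """Build a static URL for a set of facets, independent of click order."""
    unknown = set(selected) - set(FACET_ORDER)
    if unknown:
        raise ValueError(f"unknown facets: {sorted(unknown)}")
    parts = [
        f"{facet}-{quote(selected[facet].lower(), safe='')}"
        for facet in FACET_ORDER  # fixed order, not the user's click order
        if facet in selected
    ]
    return "/".join([base, *parts])


def head_tags(selected: Mapping[str, str], base: str = "/products") -> List[str]:
    """Return a canonical link to the core facet page, plus noindex,follow
    when the page is filtered by any non-core facet."""
    core = {k: v for k, v in selected.items() if k in CORE_FACETS}
    tags = [f'<link rel="canonical" href="{facet_path(core, base)}">']
    if set(selected) - CORE_FACETS:
        tags.append('<meta name="robots" content="noindex,follow">')
    return tags


# Refining by price then colour yields the same URL as colour then price.
by_price_first = facet_path({"price": "under-50", "colour": "Red"})
by_colour_first = facet_path({"colour": "Red", "price": "under-50"})
```

Here `by_price_first` and `by_colour_first` are the same string, which is the whole point of the static URL approach. Which facets count as "core" is a judgment call for each site.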
Comments
I should thank some folks as well who provided me with content and tips for this blog post, such as Jeroen Smeekens, Jeroen van Eck, and the peeps in the SEO Facebook group - you know who you are.
Just had one quick question. Do you think there's a chance on the horizon that the engines will be able to see through a Javascript/Ajax solution?
I ask because I recently advised a client to go with a static url/noindex/follow solution rather than JS/Ajax. My reasoning was a concern that somewhere down the line the latter may no longer work properly.
This concern was not born from any factual info though, just speculation. I guess it would depend on how clever the solution is. Would be great to get a bit more insight for future reference though.
Thanks.
In theory search engines can already crawl JS/AJAX code that's embedded within a page. One way around this is to load the JS code from an external .js file, and put this file in a directory that you block with robots.txt. That should, theoretically, prevent search engines from seeing the JS - and thus the faceted nav - at all.
In practice, you never really know how search engines go about things. In my opinion as long as you have the best interests of the user in mind, and aren't trying to deliberately deceive, you'll be OK.
I have an eCom client who, when we started working with them, had a number of "Googlebot found an extremely high number of URLs on your site" messages sitting in Google Webmaster Tools. For anyone who hasn't seen these messages, they look like this - www.matthewsdiehl.com/wp-content/uploads/2011/03/googlebot-extremely-high-number-urls.png
By "extremely high" we were looking at several million URLs for a product base that was nowhere near that number.
We are working through the pairing of rel=canonical with having the developers re-code how the faceted navigation URLs are generated (not a small task). The combination will hopefully clean up what is a can of worms for the crawlers as the site sits today.
If you Nofollow your faceted nav links the pages will still be indexed but the PR will not flow through.
I am currently in the process of rolling out a few e-commerce sites with layered navigation, and will definitely implement your recommended solutions.
Thank you!
Thanks for a great article
@Mark: yes sometimes static URLs for all facets is simply not feasible. In that case I would suggest you do implement static URLs for your most important facets (ideally facets that people would naturally filter by, such as product type and brand) and the rest can then be parameter-driven URLs. I would then use rel=canonical on these parameter-URLs that point to the most relevant core facet with a static URL. That way you'll preserve PR flow on your core facets. The one downside here is that search engines still need to crawl all those parameter-URLs.
I must admit I didn't quite understand the 'Nofollow your faceted nav links' points you made. I can't really understand how adding rel=nofollow to the faceted links will prevent search engines from indexing the faceted nav pages. Nofollow will tell search engines not to follow the links, but there is no guarantee these pages won't get indexed. A noindex on the faceted pages would keep the pages from getting indexed.
- this post assumes that blocking certain parts of your website (certain facets/filters) with robots.txt kills your pagerank flow. I'm not sure if that is the case.
- blocking parts of the site with robots.txt does have an awesome advantage: it's damn cheap. That's gotta count for something as well :)
- rel=canonical should never ever be used. It's just a lame excuse for bad information architecture and more often goes wrong than right. It should have never been invented in the first place (even though it wasn't intended for seo purposes, originally)
- the js option is probably best with regards to performance, but also quite difficult to implement and mostly not very scalable, especially when compared to the robots.txt version
For a recent client of mine I had to work with faceted search (it's not navigation, but search. SEOs use it for navigation, but hey, it depends on which viewpoint you take if it's navigation or search ;)). We did it like this:
- every facet had a nice url
- all facets were prioritized, so the same facets of a result page were always displayed in the same order -> to prevent unnecessary duplicate urls
- we looked at analytics extensively to see which types of facets were searched-for, so as to decide which types of facets we wanted to have indexed, and which types should be blocked. We chose about 3 facets that were allowed to be indexed
- we created a url-scheme to recognize which kinds of facet-urls should be blocked and which not. We made it simple: if it had a parameter in the url, it would be blocked with a robots.txt wildcard; if it was a normal url, it would be allowed
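
A minimal sketch of the URL scheme described in the list above, assuming the rule really is as simple as "has a parameter, gets blocked". The example paths and the `is_blocked` helper are hypothetical; the robots.txt lines in the comment show the kind of wildcard rule such a scheme would rely on (Google documents support for `*` in robots.txt paths).

```python
from urllib.parse import urlsplit

# The kind of robots.txt rule this scheme assumes:
#   User-agent: *
#   Disallow: /*?


def is_blocked(url: str) -> bool:
    """True if the URL has a non-empty query string, i.e. it is a
    parameter-driven facet URL that the wildcard rule would block."""
    return bool(urlsplit(url).query)


# Hypothetical facet URLs: one "nice" static URL, one parameter-driven URL.
urls = [
    "/shoes/brand-acme",
    "/shoes/brand-acme?colour=red&sort=price",
]
crawlable = [u for u in urls if not is_blocked(u)]
```

Only the static URL ends up in `crawlable`, so promoting a facet to "indexable" later means giving it a nice URL instead of a parameter.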
Re: robots.txt & PR flow, we have to work on some basic assumptions here, and one that I have is that if a page is blocked with robots.txt it cannot distribute PR properly. So far I have yet to hear a compelling argument to the contrary, but if you have evidence that I might be wrong I'm very keen to see it. Never too old to learn and all that. :)
Re: rel=canonical, I disagree. It was invented for a purpose, and yes, while it is an artificial fix for a real problem, it is often simply not feasible to re-design a site's IA entirely just to avoid having to use it. Like you said about robots.txt, the rel=canonical solution is cheap and it works.
Re: JS option, yes it's not always scalable but it's definitely an option to consider, especially when building a new site from the ground up and you have the opportunity to implement it.
Re: faceted nav vs faceted search, I think it's a mistake to approach these things from the coder's angle. Yes it might function as a search in the back-end, but it's the user experience that counts. And for a user it's a form of navigation - clicking on links and all that. Putting the user central is vital imho, which is why I insist on calling it faceted navigation.
As always, YMMV, all roads lead to Rome, and all that jazz. Regardless of what the SEO theory declares to be best (my theories included), use whatever works. That's the only real measure.
but of course we continuously monitor analytics for these pages. If we see that a page becomes searched-for, we make it available
I do know the difference between noindex,follow metatag and robots.txt -> what I meant is that with this method, these many, many pages accrue much less pagerank, and thus we don't need to distribute it. It prevents a problem instead of fixing it
But it certainly doesn't mean that this is the best way for all sites (or probably even for this site). There are many ways to Rome :)
Noindex/follow will not close off a spider trap, and canonicalizing variously filtered pages to the unfiltered state is not recommended by at least 1 Googler I've met. Filtered pages are not the same as unfiltered. Canonicalization is for substantially similar pages (a silent 301).
Again, it also won't close a spider trap, esp. in the context of multiple selection. Multiple selection must be excluded. It's something like N! + (N-1)! ... 0! in scope.
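
To get a feel for the scale, here is a small sketch that counts URLs under one possible reading of the estimate above: N selectable facet values, multiple selection allowed, and the URL recording the order in which values were clicked. That reading, and both function names, are assumptions for illustration; the counts are compared with the case where the URL order is fixed.

```python
from math import perm


def ordered_urls(n: int) -> int:
    """Distinct URLs when click order is encoded: every ordered selection
    of k values out of n, summed over k = 0..n."""
    return sum(perm(n, k) for k in range(n + 1))


def normalised_urls(n: int) -> int:
    """Distinct URLs when facet order is fixed: one per subset of values."""
    return 2 ** n


for n in (5, 10):
    print(n, ordered_urls(n), normalised_urls(n))
# prints:
# 5 326 32
# 10 9864101 1024
```

Even with a fixed order the count still doubles with every extra value, which is why multiple selection is singled out above.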