Google’s so dang nice. I could just hug them all.
Recently, they announced that we no longer have to worry about duplicate content. See, Google will sort it all out for us.
So, if you have the same article on your site at: www.mysite.com/article/, www.mysite.com/article/?from=home, and www.mysite.com/article.html, they’ll be all charitable and figure out which one to use. We’re saved, Google says. Go about our business.
I immediately got 30 snippy e-mails from developers who already hate me, telling me I’ve been wasting their time making them clean up duplication issues.
I could elaborate on ‘No’, but it’d require cursing, so I’m going to stick with ‘No’ and explain why duplicate content still sucks.
Wasted crawl budget
I don’t care if Google can suss out every instance of duplicate content on the web. You’re still forcing them to suss it out.
If you have a 10,000 page web site, and 9,000 of those pages are duplicates, then Googlebot still has to crawl 9,000 pages it doesn’t need. There. is. no. way. that that is a good thing. Use rel=canonical if you want; Google still has to hit each URL. You’re wasting their time. No one likes having their time wasted.
This is all about crawl efficiency. Don’t waste a search engine’s time if you don’t have to. Let a visiting spider grab what it needs and go on its way.
Duplicate content still sucks.
There is another search engine
Ever hear of Bing? It’s not so speedy or clever. But it does generate 10-15% of all web traffic. If you think that’s not worth bothering about, you’re in better shape than I am. I’ll take any smidgen of relevant traffic I can get.
Duplicate content will still wreak havoc on Bing, as well as on many vertical search engines, Facebook’s proto-search engine and everything else people use to crawl the web.
Duplicate content still sucks!
I come to your site and find the article to which I want to link at www.site.com/?blah=foo, and then someone else finds the same content at www.site.com/?blah=foo&dir=dem and links to it there. Congratulations! You just split your link authority in half for that page! Nice job.
Except it’s not a nice job. It’s a stupid job. And again, rel=canonical may help sort out the link chaos, but not as well as just doing it right in the first f@#)($* place.
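Doing it right in the first place mostly means collapsing every variant URL to one canonical form before you link to it or redirect to it. Here's a minimal Python sketch of that idea; it is not the post's own code, and the parameter names (`from`, `dir`, `utm_*`) are assumed tracking-only examples that you would tune for your own site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of parameters that track clicks but never change content.
TRACKING_PARAMS = {"from", "dir", "utm_source", "utm_medium", "utm_campaign"}

def canonicalize_url(url: str) -> str:
    """Map every tracked variant of a URL back to one canonical form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Keep only parameters that actually change the content served.
    kept = [(k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))
```

With this in place, both www.site.com/?blah=foo and www.site.com/?blah=foo&dir=dem resolve to the same address, so inbound links pool on one page instead of splitting.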
Duplicate content still sucks!!!
First thing you do to improve server performance is set up some kind of caching. Caching stores a copy of all, or the most-accessed, pages on your site. But most caching schemes are keyed on page URLs. Say you have the same exact article at three different URLs. Your web server or caching server will have to store three copies of the same page.
That wastes storage, memory and resources on your server. It also means that, until all three versions of the page are cached, you’re still not delivering the performance improvement caching normally generates.
Duplicate content still sucks!!!!!!
Trying to track the attention a single page on your site gets? Duplicate content turns it into a shell game. Multiple versions of each page mean tracking down each version, averaging time-on-page, averaging bounce rate, etc.
The irony is that many developers create duplication trying to make analytics easier: They’ll add something like ?from=topnav to all links in the top navigation so that these show up as separate clicks in traffic reports.
Not smart. You can track which clicks come from which areas using tools like ClickTale or CrazyEgg. And you’ve created a total mess for engagement analysis.
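The shell game above can be undone after the fact, but only with extra work. Here's a sketch of that cleanup, using made-up stats: merging per-URL (views, average time-on-page) records onto the bare path, with a views-weighted average so the numbers stay honest.

```python
from urllib.parse import urlsplit

def aggregate(pageviews: dict[str, tuple[int, float]]) -> dict[str, tuple[int, float]]:
    """Merge per-URL (views, avg time-on-page) stats onto the bare path."""
    merged: dict[str, tuple[int, float]] = {}
    for url, (views, avg_time) in pageviews.items():
        path = urlsplit(url).path  # collapse ?from=topnav and friends
        prev_views, prev_avg = merged.get(path, (0, 0.0))
        total = prev_views + views
        # views-weighted average of time on page across the duplicates
        merged[path] = (total, (prev_views * prev_avg + views * avg_time) / total)
    return merged
```

If you never duplicate the URLs in the first place, this step simply isn't needed.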
Duplicate. Content. Is. The. El. Sucko.
You get my point
Hopefully by now you get the point. Duplicate content is bad for plenty of reasons. Google’s latest questionable claim is another excuse for doing it wrong. Don’t buy it. Build your site right, fix duped content and you’ll have a faster, better-ranking, easier-to-measure site.
Well said, sir. It often seems to be a constant battle against developers and their (usually) open source content management systems, which find ever-more ingenious ways of producing half a dozen addresses for each page.
Just like the old arguments about whether validation helps ranking (when the argument should be does good coding help produce good websites) another ill-considered Google quote gives them all the excuse to carry on doing it wrong.
And if one more programmer tells me that “it uses a 302 redirect because that’s the default” then I may just turn into Freddie and start slicing and dicing…
Definitely a great post about duplicate content!
It’s obvious for many of us, but having too many similar pages is a pain, for crawlers and people alike!
😀 thanks.. I myself have a site of 300 pages and some are similar.. helpful post for me.
No doubt this post is very informative and has the facts: there are many issues you can face because of duplicate content beyond the typical “content duplication penalty”. Analytics experts should refer to it for their practices too, because they do nourish duplicate pages.
Thanks for this information, as I do tend to use those tags for analytics and I also have a site with around 40k pages.
Totally agree. Sculpting is just as important now with competition on the rise.
Good article, Ian. I’ve been somewhat on the fence regarding dup. content. But I’ve always leaned toward the cautious side. It’s just not good practice, IMO, and it certainly does nothing for the user experience, even if Google does catch it all.
There’s a seemingly endless list of ways to screw up – developers seem to be getting ever more inventive at how to produce sites that are “teh suck”.
Most forum, blog, cart and CMS software is riddled with these problems. Sure, Google will “clean it up for you” – by taking a guess which URL to use. You can be quite sure that it will not be the one that you would have chosen.
Good post. Maybe this time someone will sit up and take notice, but I am not holding my breath.
Ian – do you have any resources you could share regarding this announcement from Google?
Would love to read more about it…
Personally I gave up reading Google announcements ‘cos they always lack substance and are full of generalities and idealised situations (and often a bit of FUD too). They said 10 years ago that they could easily detect hidden text, yet their serps have been full of sites using it blatantly ever since.
Watch what they do not what they say!
I agree that we should remove duplicates, especially when we know they exist — it is a waste of the search engine’s time. While one may think the task is tedious, it makes for a better website.