The Problem of Duplicate Content and How to Solve it

This entry is part 6 of 20 in the SEO Fundamentals for Web Designers Session

One problem that we see on almost every website is duplicate content. Larger websites with hundreds of pages are especially prone to this. But what exactly qualifies as duplicate content? Why does having duplicate content lead to problems and how can we avoid them? We’ll cover all that and more within this article.


What is Duplicate Content?

Duplicate content is exactly what you think it is: two or more pieces of content which are identical, the only difference being the URL.

Google sees every URL as a separate page. Owing to this, it would consider the following URLs to be completely different pages:

  • Original page with red shirts: http://website.com/shirts/red
  • Same page, but ordered by price: http://website.com/shirts/red?order=asc

The problem here is that we’re basically looking at the same page with the same content. The only difference is that the content at the second URL is sorted differently. Google sees this as duplicate content.


Why is Duplicate Content Bad?

Duplicate content confuses search engines. Why? Because they have a hard time deciding which page is most relevant for a search query.

Search engines avoid displaying two identical pieces of content in the SERPs. They do this to protect search quality: seeing the same content twice is of little use to the user.

Another problem is the ranking power of duplicate pages. Instead of having a single page with a lot of authority, you have multiple pages with diluted, suboptimal performance. This might cost you a lot of organic traffic.


How Duplicate Content is Created

Duplicate content can be created deliberately or by accident. Either way, the result is the same.

An example of deliberate duplicate content is the print version of a page. It’s effectively the same page with the same content, so when this print version gets indexed, there’s an issue with duplicate content.

However, there are plenty of situations where duplicate content is created unintentionally. There can be several causes, such as:

  • Session IDs
  • Sorting options
  • Affiliate codes
  • Domains

Session IDs

A session ID is a randomly generated string of numbers and/or letters that is used to keep track of visitors. Session IDs are often used for shopping carts, for example:

http://website.com/?sessionid=5649612

The problem with session IDs is obvious: they can create hundreds, perhaps even thousands of duplicates. Storing session IDs in cookies can solve this problem, but if you rely on this option, don’t forget about the EU cookie law.

Sorting Options

When people think about sorting options, they usually think about web shop product catalogues where users can sort by price, date, etc. But sorting functions are often found on other websites too. The following URL uses a typical blog sorting function:

http://website.com/category?sort=asc

The URL with the sorting option and the original are basically the same page. It’s the same content, only sorted in a different manner.

Affiliate Codes

Affiliate codes are popping up all over the web. They are used to identify the referrer, who is in turn rewarded for bringing in a new visitor. An affiliate code can look like this, for example:

http://website.com/product?ref=name

Once again, this code can create a duplicate of the original page.

Domains

Even something as simple as a domain name can sometimes be problematic. Take a look at the following URLs:

http://website.com

http://www.website.com

Search engines have come a long way, but occasionally they still get this one wrong. Both URLs probably point to the homepage, but because both URLs look different they are sometimes seen as different pages.
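
The four causes above share one trait: the extra part of the URL does not change what the page shows. As a rough illustration, the Python sketch below collapses such variants into a single form. The parameter names in IGNORED_PARAMS and the preferred host are assumptions taken from the example URLs in this article; a real site would need its own list.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these parameters track or reorder, but never change the content.
IGNORED_PARAMS = {"sessionid", "sort", "order", "ref"}

def normalize(url, preferred_host="website.com"):
    """Collapse URL variants that serve the same content into one form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Treat the www and non-www hosts as the same site.
    if host in (preferred_host, "www." + preferred_host):
        host = preferred_host
    # Drop the ignored parameters and keep everything else.
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in IGNORED_PARAMS]
    # An empty path and a trailing slash both mean the same page here.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, host, path, urlencode(query), ""))

print(normalize("http://website.com/shirts/red?order=asc"))
print(normalize("http://www.website.com"))
```

Running a list of crawled URLs through a function like this, and counting how many raw URLs map to each normalized one, gives a quick picture of how many duplicates a site exposes.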


How to Identify Duplicate Content

We’ve talked about how duplicate content is created, but how can you identify duplicate content issues on your site?

The easiest way to do this is via Google Webmaster Tools. Log in to your account and go to Optimization > HTML Improvements. Here you’ll find a list of duplicate titles (which usually point to duplicate content).

Figure: Duplicate titles in Google Webmaster Tools

Alternatively, you can use the site: search operator in Google to find pages from a specific domain (e.g. site:webdesign.tutsplus.com). This method is very useful if you suspect that a particular page has several duplicates: combine the site: operator with a couple of sentences pasted from the suspicious page. If you get a message from Google saying “In order to show you the most relevant results, we have omitted some entries…”, you probably have duplicate content.

Finally, you could also use site crawlers. Software such as Xenu and Screaming Frog can be used to gather necessary information. Analyse the page titles in the crawl report and check for duplicates.
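
The title check described above can be sketched in a few lines of Python. This is only an illustration: it assumes the crawl report has already been reduced to (url, title) pairs, and the sample rows are made up from the example URLs in this article rather than taken from a real Xenu or Screaming Frog export.

```python
from collections import defaultdict

def find_duplicate_titles(rows):
    """Group crawled URLs by page title and keep titles used more than once.

    `rows` is an iterable of (url, title) pairs taken from a crawl report.
    """
    by_title = defaultdict(list)
    for url, title in rows:
        # Normalize whitespace and case so trivial differences do not hide a match.
        key = " ".join(title.split()).lower()
        by_title[key].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}

# Hypothetical crawl data for illustration only.
crawl = [
    ("http://website.com/shirts/red", "Red Shirts"),
    ("http://website.com/shirts/red?order=asc", "Red Shirts"),
    ("http://website.com/shirts/blue", "Blue Shirts"),
]

duplicates = find_duplicate_titles(crawl)
for title, urls in duplicates.items():
    print(title, urls)
```

A shared title is only a hint, not proof: two different pages can carry the same title, so each group still needs a manual look.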


Solving Duplicate Content Issues

As the saying goes: “every illness has a cure”. Fortunately, there are several ways to cure duplicate content issues:

301 Redirect

A simple way to prevent duplicate content from being indexed is a 301 (permanent) redirect. This way both users and search engines are redirected from the duplicate to the original. As a result, all link juice is sent to the original page.

A 301 redirect is implemented on Apache servers by adding rules to your server’s .htaccess file. Keep in mind that this method ‘deletes’ the copy. If you don’t want to get rid of the duplicate page(s), you should use the following method.

Rel=canonical

There’s another way to tell search engines about duplicate content: the rel="canonical" tag. This piece of code should be placed in the <head> of a web page.

Let’s say we have Page B that is a duplicate of Page A. If we want to inform search engines of this, we would put the following code in the markup of Page B:

<link href="http://website.com/Page-A" rel="canonical" />

This code states that the current page is actually a copy of the URL mentioned above. After implementing it, most link juice will be transferred to the original page, improving that page’s ranking power. Unlike with a 301 redirect, the duplicate pages will still be accessible.

Meta Robots Tag

We’ve already discussed the robots meta tag in detail during a previous tutorial. By adding a meta robots tag with the “noindex” parameter, you can prevent the duplicate page from being indexed.

URL Rewriting

This is a more advanced solution. It’s more difficult to implement if you have a limited understanding of code, but it can be useful on a number of occasions.

As mentioned before, the domain name can often cause duplicate content issues (www vs non-www version). You can solve this problem by adding a URL-rewrite rule to your .htaccess file (something else we’ve covered before on Webdesigntuts+). Choose your preferred domain (www or non-www) and automatically rewrite URLs to the specified domain.

Another problem we’ve talked about is the use of session IDs. The same URL with a different session ID appended can be seen as duplicate content. Once again, the .htaccess file can be used to disable these parameters. Read Disable session ID’s passed via URL by Constantin Bejenaru to learn how to do this.
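
To make the preferred-domain rewrite more concrete, here is the same logic as a small Python WSGI application. Note that this is not the .htaccess approach described above; it is a sketch of the equivalent decision for a site served by Python. It assumes the www version is preferred, and PREFERRED_HOST and the function names are made up for illustration.

```python
PREFERRED_HOST = "www.website.com"  # assumption: the www version is preferred

def redirect_target(host, path, query=""):
    """Return the URL to redirect to, or None if the host is already preferred."""
    if host.lower() == PREFERRED_HOST:
        return None
    url = "http://" + PREFERRED_HOST + (path or "/")
    if query:
        url += "?" + query
    return url

def app(environ, start_response):
    """Minimal WSGI app: send any non-preferred host a permanent redirect."""
    target = redirect_target(
        environ.get("HTTP_HOST", ""),
        environ.get("PATH_INFO", "/"),
        environ.get("QUERY_STRING", ""),
    )
    if target:
        # 301 tells browsers and search engines that the move is permanent.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from the preferred domain"]

print(redirect_target("website.com", "/shirts/red"))
```

Keeping the decision in its own function (redirect_target) makes it easy to check the rule without starting a server.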

Google Webmaster Tools

In the previous section we talked about automatic URL-rewriting for domain names. An easier way to do this is via Google Webmaster Tools. Simply log in to your account, go to Configuration, click on Settings and set a preferred domain.

Figure: Preferred domain setting in Google Webmaster Tools

If you’re using dynamic URL parameters, you can tell Google how to handle them, including which parameters should be ignored. This can often solve a lot of duplicate content issues. Visit Google Webmaster Tools and go to Configuration > URL Parameters. More information can be found at Google Support, but be sure to use this feature only if you know how parameters work, otherwise you may inadvertently block pages.


Language Targeting

This issue is related to duplicate content, but there are some differences.

Let’s say a company which sells products in North America has two websites: company.us and company.ca. The first one is targeted at the United States, the latter at Canada. On both websites we find content that is similar because the webmasters didn’t want to rewrite several pages of text.

It’s possible that the US version will outperform the Canadian version (even on Google.ca) because it has more authority. How can we fix this targeting problem?

There’s a simple solution: the rel="alternate" hreflang="x" annotation.

If we use our previous example, we need to add the following code in the <head> section of the .us domain:

<link rel="alternate" hreflang="en-CA" href="http://company.ca/page-example" />

On the .ca domain we need to place this code:

<link rel="alternate" hreflang="en-US" href="http://company.us/page-example" />

In essence you’re telling Google that there’s an alternative version (or duplicate) aimed at another language or region. The hreflang attribute uses the ISO 639-1 format to identify the language. Optionally, you can add the region in ISO 3166-1 Alpha 2 format, as in en-CA above.
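
When a site has more than two regional versions, writing these annotations by hand gets error-prone. The Python sketch below generates them from a single mapping. It follows the pattern used in this article, where each page links to the other versions; the VERSIONS mapping and the function name are assumptions made for illustration.

```python
from html import escape

# Hypothetical mapping of hreflang code to the URL of that regional version.
VERSIONS = {
    "en-US": "http://company.us/page-example",
    "en-CA": "http://company.ca/page-example",
}

def alternate_links(current, versions=VERSIONS):
    """Return <link> tags pointing from the `current` version to every other one."""
    return [
        '<link rel="alternate" hreflang="{}" href="{}" />'.format(escape(code), escape(url))
        for code, url in versions.items()
        if code != current
    ]

# Tags to place in the <head> of the .us page.
for tag in alternate_links("en-US"):
    print(tag)
```

Generating both directions from the same mapping keeps the annotations on the two domains in sync with each other.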


Closing Remarks

Prevention is better than cure… Consistent internal linking can prevent the creation of duplicate content. If you have http://www.website.com as a preferred domain, don’t point your internal links to the non-www version. The same tip applies to inbound links. If you link to your own site from another domain, use a consistent link structure.

Don’t intentionally create duplicate content by copying large chunks of text from other websites. Google will likely find out about it and the consequences might not be so pleasant:

In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users…the ranking of the site may suffer, or the site might be removed entirely from the Google index.


Conclusion

Duplicate content is something you see on almost every site. It can have several causes, whether accidental or otherwise.

Unless you want to prevent access to the page via a 301 redirect, it’s best to use the rel=canonical annotation. Alternatively, you could use the meta robots tag or automatic URL rewriting. Google Webmaster Tools also offers some ways of preventing duplicate content.

Finally, it’s best to be consistent in your linking. Internal links and inbound links should appear the same.

Kevin Vertommen is Sybe on Graphicriver
Tags: seo
  • http://www.paulund.co.uk/ Paul

    Thanks for the language tip I need to use this on my next project.

  • Chris B

    Back to your first example of duplicate content… My understanding was that Google spiders ignored content after the “?” in the URL string — so the two red shirt pages wouldn’t be considered as two separate pages (because really they’re not!) and wouldn’t be duplicate content.

    In a similar fashion, a site that has a million users (http://website.com/?sessionid=5649612) Wouldn’t actually have 1,000,000 individual pages listed in GWT — whether they utilised cookies or not.

    My general impression has always been that the duplicate content filters were set up to stop blatant plagiarism and the same site being reproduced on different domains — and not really for penalising webmasters that omitted the rel=canonical meta tag.

    It’s pretty difficult to keep to-the-minute with SEO, and things may have changed, so can you clarify?

    Enjoying this series!

  • paddyotoole12

    Duplicate content on a website can get it into trouble at any time. After the Google Panda update, everyone is concentrating on getting good-quality content for their site. For more information, you can visit this blog post: http://www.dpfoc.com/blog/is-content-important-in-seo

  • Lukasz

    Does anyone know how to solve the problem of duplicate content on a multilingual website?
    For example: you have an About page which is in English, German and Spanish.
    I am trying to make the website as accessible as possible. Translating via Google Translate is efficient, since 90% is done and I then just correct the errors, but I heard that Google treats it as duplicate content. Is that true?

    • http://twitter.com/kevinverto Kevin Vertommen

      Check the ‘language targeting’ chapter of this article.

  • http://twitter.com/screamingfrog Dan Sharp

    Cheers for the mention Kevin.

    Just to mention – Under the ‘URI’ tab in the Screaming Frog SEO spider we also have a ‘duplicate’ filter which does an md5 algorithmic check for duplicate content. The URLs have to be exact duplicates, so it won’t pick up partial duplicate content currently.

    But if you hit the ‘duplicate’ filter, the SEO spider will list any URLs which have an exact duplicate. You’ll see this from the matching hash values next to each URL.

    Cheers,

    Dan

  • ianyates

    Hi Daniel,

    I appreciate what you’re saying and I’m always striving to find that perfect balance (what readers want) with content on Webdesigntuts+. In terms of SEO it’s true that the basics can be found elsewhere, but I want this site to be a solid, long term resource where web designers can come to find the essentials in *all* topics they should be versed in. Building “findable” websites is absolutely relevant to our audience, which is why Kevin has put together this solid session covering what’s needed from a designer’s perspective.

    In any case, don’t worry about content strategy on Webdesigntuts+ – I’ll make sure you get plenty of what you’re after. Thanks :)

    • http://twitter.com/Bongo_IT Bongo IT

      I have an issue with my blog page where each post shows as duplicate content. I’m in two minds as to whether rel=canonical will solve the issue or cause more issues. Any advice? http://www.bongoit.co.uk/blog.html

  • http://www.kidstoysmalaysia.com.my/ Phil Polaski

    Great information! I’m working on an eCommerce site and have duplicate content due to products listed in multiple categories. I’m thinking that a redirect would be the easiest way to resolve the issue of duplicate content. Is one method better than the other?

    Thanks,

    Phil

  • Rishikant

    Thanks for this topic.

  • Amit

    Nice content you have posted. I also write a blog for my own site. Do you have any tips to increase global rank? Please help me out with this. My URL is: http://knowledgeheights.in

  • http://www.cbil360.com/ Web Design Company

    Duplicate content is very harmful to any website. As said above, if duplicate content is found on your site, search engines get confused about which page is more relevant to display in the search engine result pages. So you must avoid any type of duplicate content (or related issues, i.e. session IDs, URL canonical issues, etc.) for better search engine visibility for your website, which directly results in an increase in traffic.

  • http://www.andykuiper.com/ Andy Kuiper – SEO Analyst

    It’s amazing how many sites just seem to not care about dup content… and it’s such an easy thing to fix. Thanks for the article Kevin :-)

  • NASConline

    Very informative article. We have a problem that may not have an easy answer but one of your suggestions may work for us.

    We have an IT Association in which we would like to share some of the great articles, such as yours, with our members. Several ideas have been suggested but we aren’t sure which one is best for us and our members. Our goal would be to hopefully get some kind of link juice for the article, give proper credit to the author and give our members the benefit of the great articles we have found on different subjects. Sort of a repository for things that we thought would be useful. The one we were about to go with is:

    1. Write a preface and a summary of the document, making sure that we add useful content amounting to at least 50% of the original article. Show a link to the original article, attempt to contact the author and show complete credits going to the author.

    We were told that if we did this the article would be seen as a new article, we would get link juice for it and the author would get their rightful credit.

    One problem we would have, if the 50% rule is true, is actually writing this much content that would be useful and not just fluff, unless it’s a short article.

    We would rather show the entire article but what if we just posted the title and a link to the original article?

  • Md Riya Alam

    Thanks for the information, but I have a question regarding violations of Google’s norms and policies. If two or more websites have exactly the same content and very similar domain names, where and how can I report this? Should I call Google customer care directly, or do something else? Please help me out.
    Thank you.

  • Brian

    New to web design. I was working on phase 2 of my website and created a duplicate (test version) of my original website; both are “live” online now. I copied the old website from an old domain and “pasted” it into the new test sub-domain so I could work on the updates there and not touch the original site. I wanted to see the test results live, but on the test subdomain. I was slow to make changes to the new subdomain site as I was editing on paper first, then on the new subdomain test site. Well, I can’t access either site now. From reading online I see Google has “penalized” me for content duplication. Any idea on how to get out of this mess I created?