The Problem of Duplicate Content and How to Solve it
One problem that we see on almost every website is duplicate content. Larger websites with hundreds of pages are especially prone to this. But what exactly qualifies as duplicate content? Why does having duplicate content lead to problems and how can we avoid them? We'll cover all that and more within this article.
What is Duplicate Content?
Duplicate content is exactly what you think it is: two or more pieces of content which are identical, the only difference being the URL.
Google sees every URL as a separate page. Owing to this, it would consider the following URLs to be completely different pages:
- Original page with red shirts: http://website.com/shirts/red
- Same page, but ordered by price: http://website.com/shirts/red?order=asc
The problem here is that we're basically looking at the same page with the same content. The only difference is that the content on the second URL is in a different order. Google sees this as duplicate content.
Why is Duplicate Content Bad?
Duplicate content confuses search engines. Why? Because they have a hard time deciding which page is most relevant for a search query.
Search engines try not to display two identical pieces of content in the SERPs. They filter duplicates to keep search quality high; seeing the same content twice is not very interesting for the user.
Another problem is the ranking power of duplicate pages. Instead of having a single page with a lot of authority, you have multiple pages with diluted, suboptimal performance. This might cost you a lot of organic traffic.
How Duplicate Content is Created
Duplicate content can be created deliberately or by accident. Either way, the result is the same.
An example of deliberate duplicate content is the print version of a page. It’s effectively the same page with the same content, so when this print version gets indexed, there's an issue with duplicate content.
However, there are plenty of situations where duplicate content is created unintentionally. There can be several causes, such as:
- Session IDs
- Sorting options
- Affiliate codes
A session ID is a string of randomly generated numbers and/or letters used to keep track of a visitor. Session IDs are often used for shopping carts, for example:
The problem with session IDs is obvious: they can create hundreds, perhaps even thousands of duplicates. Storing session IDs in cookies can solve this problem, but if you rely on this option, don’t forget about the EU cookie law.
When people think about sorting options, they usually think about web shop product catalogues where users can sort by price, date, etc. But sorting functions are often found on other websites too. The following URL uses a typical blog sorting function:
The URL with the sorting option and the original are basically the same page. It’s the same content, only sorted in a different manner.
Affiliate codes are popping up all over the web. They are used to identify the referrer, who is in turn rewarded for bringing in a new visitor. An affiliate code can look like this, for example:
Once again, this code can create a duplicate of the original page.
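To make the three causes above concrete, here is a minimal Python sketch that strips such parameters from a URL and groups the URLs that collapse to the same page. The parameter names `sessionid`, `sort` and `ref` are assumptions chosen for illustration (only `order` appears earlier in this article), so replace them with whatever your own site uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that change the URL but not the content.
IGNORED_PARAMS = {"sessionid", "order", "sort", "ref"}


def normalize_url(url):
    """Return the URL without the parameters listed in IGNORED_PARAMS."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    # Rebuild the URL; the fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Map each normalized URL to the list of original URLs that reduce to it."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize_url(url), []).append(url)
    return groups
```

Running `group_duplicates` over a list of crawled URLs gives a quick overview of which addresses are likely to be seen as copies of one another. It only compares URLs, not page content, so treat the result as a starting point.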
Even something as simple as a domain name can sometimes be problematic. Take a look at the following URLs:
Search engines have come a long way, but occasionally they still get this one wrong. Both URLs probably point to the homepage, but because both URLs look different they are sometimes seen as different pages.
How to Identify Duplicate Content
We've talked about how duplicate content is created, but how can you identify duplicate content issues on your site?
The easiest way to do this is via Google Webmaster Tools. Log in to your account and go to Optimization > HTML Improvements. Here you’ll find a list of duplicate titles, which probably point to duplicate content.
Alternatively, you can use the site: search operator in Google to find pages from a specific domain (e.g. site:webdesign.tutsplus.com). This method is very useful if you suspect that a particular page has several duplicates. Combine the site: operator with a couple of sentences pasted from the suspicious page. If you get a message from Google saying “In order to show you the most relevant results, we have omitted some entries...”, you probably have duplicate content.
Solving Duplicate Content Issues
As the saying goes: “every illness has a cure”. Fortunately, there are several ways to cure duplicate content issues:
A simple way to prevent duplicate content from being indexed is a 301 redirect. This way the user and search engines are redirected from the duplicate to the original. As a result, all link juice is sent to the original page.
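As a rough illustration of what a 301 redirect does, here is a minimal Python sketch written as a WSGI application. This is not the Apache approach this article relies on; it only shows the status line and `Location` header that the browser and the search engine receive. The `/shirts/red/print` path and the `REDIRECTS` mapping are hypothetical.

```python
# Hypothetical mapping of duplicate paths to their original pages.
REDIRECTS = {"/shirts/red/print": "/shirts/red"}


def app(environ, start_response):
    """Answer known duplicates with a 301, everything else with a 200."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 tells clients that the move is permanent.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"original page"]
```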
A 301 redirect is implemented on Apache servers by adding rules to your server's .htaccess file. Keep in mind that this method ‘deletes’ the copy. If you don’t want to get rid of the duplicate page(s), you should use the following method.
There’s another way to tell search engines about duplicate content: the rel="canonical" tag. This piece of code should be implemented in the <head> of a web page.
Let's say we have Page B that is a duplicate of Page A. If we want to inform search engines of this, we would put the following code in the markup of Page B:
<link href="http://website.com/Page-A" rel="canonical" />
This code states that the current page is actually a copy of the URL in the href attribute. After implementing it, most link juice will be transferred to the original page, improving that page's ranking power. Unlike with a 301 redirect, the duplicate pages will still be accessible.
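If you want to check which canonical URL a page declares, a small script can read it from the markup. The following is a minimal sketch using Python's standard `html.parser` module; the class name `CanonicalFinder` and the function `find_canonical` are names made up for this example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> in the markup."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, so split before comparing.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Feeding it the markup of Page B from the example above should return the URL of Page A. The sketch does not check that the tag sits inside the `<head>`, so it is a convenience check rather than a validator.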
Meta Robots Tag
We've already discussed the robots meta tag in detail during a previous tutorial. By adding a meta robots tag with the “noindex” parameter, you can prevent the duplicate page from being indexed.
URL rewriting is a more advanced solution. It’s more difficult to implement if you have a limited understanding of code, but it can be useful on a number of occasions.
As mentioned before, the domain name can often cause duplicate content issues (www vs non-www version). You can solve this problem by adding a URL-rewrite rule to your .htaccess file (something else we've covered before on Webdesigntuts+). Choose your preferred domain (www or non-www) and automatically rewrite URLs to the specified domain.
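The logic behind such a rewrite rule is simple enough to sketch in a few lines of Python. This is an illustration of the decision the server makes, not a replacement for the .htaccess rule itself. The preferred host `www.website.com` is an assumption borrowed from the examples in this article.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferred domain; use the non-www host here if that is your choice.
PREFERRED_HOST = "www.website.com"


def strip_www(host):
    """Return the host name without a leading 'www.'."""
    return host[4:] if host.startswith("www.") else host


def rewrite_to_preferred(url, preferred_host=PREFERRED_HOST):
    """Point www and non-www variants of our own domain at the preferred host."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Only rewrite our own domain; URLs on other hosts are left untouched.
    # Hosts that carry an explicit port are not handled in this sketch.
    if host != preferred_host and strip_www(host) == strip_www(preferred_host):
        return urlunsplit(
            (parts.scheme, preferred_host, parts.path, parts.query, parts.fragment)
        )
    return url
```

On a live server the rewritten URL would be sent back as the target of a 301 redirect, so that visitors and search engines both end up on the preferred domain.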
Another problem we've talked about is the use of session IDs. The same URL with a different session ID appended can be seen as duplicate content. Once again the .htaccess file can be used to disable these parameters. Read Disable session ID’s passed via URL by Constantin Bejenaru to learn how to do this.
Google Webmaster Tools
In the previous section we talked about automatic URL-rewriting for domain names. An easier way to do this is via Google Webmaster Tools. Simply log in to your account, go to Configuration, click on Settings and set a preferred domain.
If you’re using dynamic URL parameters, you can tell Google how to handle them, including which parameters should be ignored. This can often solve a lot of duplicate content issues. Visit Google Webmaster Tools and go to Configuration > URL Parameters. More information can be found at Google Support, but be sure to use this feature only if you know how parameters work; otherwise you may inadvertently block pages.
A related issue is similar content on separate regional websites. It resembles duplicate content, but there are some differences.
Let’s say a company which sells products in North America has two websites: company.us and company.ca. The first one is targeted at the United States, the latter at Canada. On both websites we find content that is similar because the webmasters didn’t want to rewrite several pages of text.
It’s possible that the US version will outperform the Canadian version (even on Google.ca) because it has more authority. How can we fix this targeting problem?
There’s a simple solution: the rel="alternate" hreflang="x" annotation.
If we use our previous example, we need to add the following code in the <head> section of the .us domain:
<link rel="alternate" hreflang="en-CA" href="http://company.ca/page-example" />
On the .ca domain we need to place this code:
<link rel="alternate" hreflang="en-US" href="http://company.us/page-example" />
In essence you’re telling Google that there’s an alternative version of the page for another language or region. The hreflang attribute uses ISO 639-1 to identify the language. Optionally you can add the region in ISO 3166-1 Alpha 2 format, as in en-CA.
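With more than a handful of pages, writing these annotations by hand becomes error-prone. Here is a minimal Python sketch that generates them for any page, following the pattern above where each site points to the other regional version. The `VARIANTS` mapping and the function name `alternate_links` are assumptions made for this example.

```python
from html import escape

# Assumed regional sites, keyed by their hreflang value.
VARIANTS = {
    "en-US": "http://company.us",
    "en-CA": "http://company.ca",
}


def alternate_links(current, path, variants=VARIANTS):
    """Build the hreflang tags to place on the `current` version of `path`."""
    tags = []
    for code, base_url in sorted(variants.items()):
        if code == current:
            continue  # this sketch only links to the other regional versions
        tags.append(
            '<link rel="alternate" hreflang="{}" href="{}" />'.format(
                escape(code), escape(base_url + path)
            )
        )
    return tags
```

Calling `alternate_links("en-US", "/page-example")` produces the tag for the .us domain, and `alternate_links("en-CA", "/page-example")` the one for the .ca domain.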
Prevention is better than cure... Consistent internal linking can prevent the creation of duplicate content. If you have http://www.website.com as a preferred domain, don’t point your internal links to the non-www version. The same tip applies to inbound links. If you link to your own site from another domain, use a consistent link structure.
Don’t intentionally create duplicate content by copying large chunks of text from other websites. Google will likely find out about it and the consequences might not be so pleasant:
In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users...the ranking of the site may suffer, or the site might be removed entirely from the Google index.
Duplicate content is something you see on almost every site. It can have several causes, whether accidental or otherwise.
Unless you want to prevent access to the duplicate page with a 301 redirect, it’s best to use the rel=canonical annotation. Alternatively, you could use the meta robots tag or automatic URL rewriting. Google Webmaster Tools also offers some ways of preventing duplicate content.
Finally, it’s best to be consistent in your linking. Internal links and inbound links should use the same URL format.