Don’t you just hate it when you arrive at a party to find someone else wearing the exact same outfit as you? Funnily enough, Google isn’t too keen on similarities either, especially duplicate content.
Duplicate content is often created without us really knowing that we’ve done it. That’s because the majority of web developers and webmasters aren’t familiar with SEO (search engine optimization) best practices.
However, recent algorithm updates such as Google Panda have made it necessary for us to find and eliminate duplication on our sites. Otherwise we’ll take a blow from the big Panda paw all the way down the rankings.
What is duplicate content?
Duplicate content is where identical pieces of content appear at multiple URLs. Google sees every URL as a unique, separate page and will index these duplicate pages just like any other regular page. This is where the trouble begins.
Google will see the following URLs as completely different pages:
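For example, using a hypothetical domain:

```
http://www.domain.com/
http://www.domain.com/?
```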
We’re basically looking at the same page, but the bottom URL has a “/?” at the end of the web address, which is a way of adding a parameter (such as a tracking code) to the URL. Google will treat it as a different page from the original URL.
Why is duplicate content bad for you?
Duplicate content sends a search engine’s head into a spin. Google will not know which page to return as the most relevant result for a search query, and it will never show two identical pages of content in its listings. If duplicates exist, Google decides which page to display, often resulting in the wrong page being returned. That damages your SEO efforts and creates a bad user experience.
Ranking power is also affected. Instead of having one powerful, authoritative page, the link juice will be diluted across the multiple duplicated pages. Your levels of organic traffic will also suffer.
How does Google combat duplicate content?
Because Google treats every URL as unique, unscrupulous webmasters saw this as an opportunity to duplicate content across domains in an attempt to manipulate rankings or gain traffic.
Google retaliated with the Panda update: an algorithmic update designed to evaluate websites in terms of quality in order to improve its search results. It will assign a penalty to any website in violation of its guidelines, usually resulting in your site disappearing from the rankings.
Aside from demoting you for having duplicate content, Panda also dislikes thin, low quality content, slow page load times and high bounce rates.
So what are the main types of duplicate content you should look out for to avoid a smack from the hairy Panda paw?
Non-www and www versions of your pages
If you haven’t selected a preferred domain for your site, pages on your site may be accessed via both the www and non-www versions.
In less common cases, having a secure version (https:) and a non-secure version (http:) of your site will also cause duplicate content problems.
Let’s look at some examples:
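Using a hypothetical domain, each of these URLs could serve exactly the same page:

```
http://www.domain.com/
http://domain.com/
https://www.domain.com/
```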
Because the URLs are different Google will see them as separate pages, but the content is the same. This is why Google will view it as duplicate content.
Session IDs
A session ID is a string of randomly generated numbers or letters that appears at the end of a URL.
Some ecommerce sites assign a unique session ID to each new visitor to keep track of them. It enables visitors to store items in the shopping cart whilst continuing to browse the rest of the site, for example. They generally look like this:
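For illustration, here are two hypothetical URLs with the session ID appended as a parameter:

```
http://www.domain.com/?sessionid=12345abcde
http://www.domain.com/dresses/?sessionid=67890fghij
```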
Sometimes this session parameter ends up in the URL itself, which is then indexed.
The scale of this particular type of duplicate content can be huge. Think about a very busy ecommerce site: it will probably have thousands upon thousands of visitors a day, all with their own unique session IDs. That’s a lot of duplicate content. The way to solve this is by removing the string from the URL and storing it in a cookie instead. There’s no need for these URLs even to be crawled by a search engine bot.
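If the session parameter can’t be removed straight away, one way to keep crawlers off these URLs is to block the parameter in robots.txt. A minimal sketch, assuming the parameter is called sessionid:

```
User-agent: *
Disallow: /*?sessionid=
```

Google supports the * wildcard in robots.txt, so this pattern matches any path carrying a sessionid query parameter.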
Pagination
Pagination is a way of organizing a lot of data or information in a more manageable, user-friendly way. For example, large ecommerce sites use this method to display their products across multiple pages.
For example, you head to a fashion website knowing that you’re looking for an evening dress, so you sort the clothing by the category “dresses”. Instead of viewing hundreds of dresses on one page, you can umm and ahh over 20 dresses per page across ten pages. That’s a lot of dresses.
Any time that you split data or products across multiple pages you’ll create duplicate content. Using the dress example above, you’ll start seeing duplicate URLs like this:
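With a hypothetical fashion site, the paginated URLs might look like this:

```
http://www.domain.com/dresses/
http://www.domain.com/dresses/?page=2
http://www.domain.com/dresses/?page=3
```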
Depending on how many pages of dresses there are, that can spin out many duplicates. Although the results may differ, elements such as the page title, meta description, copy and the template essentially remain the same.
In theory pagination is a good idea for ecommerce sites, but if it’s not implemented correctly it can cause major duplicate content issues. In the past Google has said that it will sort out pagination issues itself, but this has left something to be desired. I’ll explain later how to sort out this tricky issue.
How will you know if your site is suffering from duplicate content?
You can use our old friend Google to find out.
You’ll need to use two advanced search operators. Firstly, we’ll restrict the search to our own site using the site: operator, and then add the intitle: operator with the title of the page we’re concerned about.
Note how there isn’t a gap between the colons and the search terms.
Let’s say for example that you’re worried Google has indexed duplicates of your home page. Here is how you’d check:
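Assuming a hypothetical site at domain.com whose home page title is “Acme Widgets”, the search would look like this:

```
site:domain.com intitle:"Acme Widgets"
```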
Put the title in “quotes” and always leave the “www.” off the domain.
You can also head to your Webmaster Tools account to find out if you have duplicate content. Head to “Search Appearance” and click on “HTML Improvements”:
Fortunately, at this point I don’t have any duplicate content issues, but you’ll be notified of duplicate content under the Meta Descriptions and Title Tag headings.
How do we solve these issues?
The fact is, any of the duplicate URLs on your site might be the right one, the one you want visitors to see. You’ll have to choose the correct version, otherwise known as the canonical URL.
If your duplicate content is vast, you’ll then have to start a process of canonicalization.
What are the next steps?
1. Choose your preferred domain: non-www. or www?
2. Add 301 redirects
3. Add a canonical link
4. Address pagination issues
Setting your preferred domain
When you first set up your site, it’s important that you select only one domain URL and stick to using the same URL for all webpages.
If you’ve already got two versions of your domain then you’ll have to start adding 301 redirects to the preferred domain versions of those webpages. Don’t attempt to do this yourself – please seek the advice of your SEO professional.
The 301 redirect is permanent, so it will transfer all of the link equity to the new destination. All it does is direct one URL to another.
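On an Apache server, for instance, a 301 from the non-www to the www version can be set up in the .htaccess file. A sketch, assuming a hypothetical domain.com and that mod_rewrite is enabled:

```apache
# Permanently redirect http://domain.com/... to http://www.domain.com/...
RewriteEngine On
RewriteCond %{HTTP_HOST} ^domain\.com$ [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]
```

Any request for http://domain.com/page will then be permanently redirected to http://www.domain.com/page, carrying its link equity with it.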
Adding a canonical link
You may have duplicate versions of pages that you don’t want to permanently redirect. This is where the “rel=canonical” link plays a major role.
You’ll simply add the “rel=canonical” element to the code of the pages that are the duplicates.
So, for example, you may want this page to rank well: http://www.domain.com/notebooks/
You add this link:
<link rel="canonical" href="http://www.domain.com/notebooks/" />
to the head of the duplicate page(s).
This will instruct Google to list the canonical one in the results.
Addressing pagination issues
Pagination issues are often the most difficult to resolve.
Google introduced a way to address this problem. You can tell Google how paginated content connects by using a pair of tags similar to rel=canonical: they’re called rel=prev and rel=next. Implementation is a bit tricky, but here’s a simple example using pagination of notebooks on an ecommerce site:
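Assuming a hypothetical notebooks category paginated with a ?page= parameter, page 3’s head would carry both tags:

```html
<!-- In the <head> of page 3 of the hypothetical notebooks listing -->
<link rel="prev" href="http://www.domain.com/notebooks/?page=2" />
<link rel="next" href="http://www.domain.com/notebooks/?page=4" />
```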
In this example the search bot has landed on page 3 and decided to rank that one, so you’ll have to use the rel=prev tag to tell Google there’s a previous page and the rel=next tag to tell Google there’s a following page. These are quite difficult to implement and you may have to generate them dynamically.
So here are your two other options:
- You can add a meta noindex,follow tag to page 2 and any subsequent pages of search results. Let Google crawl the paginated content, but don’t let it index it.
- Create a “View All” option. This links to all the results at one URL and allows Google to auto-detect it. This tends to be Google’s other preferred option.
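The noindex route from the first option is just a meta tag in the head of page 2 onwards; a minimal sketch:

```html
<!-- On page 2 and beyond of paginated results -->
<meta name="robots" content="noindex, follow" />
```

Google can still follow the links running through the paginated pages, but won’t add those pages to its index.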
Either way, don’t take to your website’s code with trigger-happy fingers. Make sure you get the help of a professional first, like us.
If you’ve made the terrifying discovery of duplicate content, get in touch with our SEO specialists, who can assess the extent of the duplication and advise on the best ways to resolve the issue.