Duplicate Content & SEO

Duplicate Content | Effects from the Panda Updates

Google has been refreshing and updating its Panda algorithm nearly every month since February 2011 in order to promote sites that are high quality (little duplicate content, well-written, authoritative, high user engagement) and demote sites that are low quality (technical or editorial duplicate content, poor grammar/spelling, thin/shallow content, poor user engagement). One of the biggest factors in these updates is duplicate content. Note: If you’re wondering what Google deems a high-quality website, read Amit Singhal’s blog post on the topic.

Many websites have been penalized by Google’s Panda Update for either intentionally or unintentionally creating duplicate content. No matter the Content Management System (WordPress, Magento, Drupal, Expression Engine, Community Server, etc.), any website is vulnerable if the proper precautions are not taken. The following Duplicate Content SEO Guide is designed to be a complete resource for checking for duplicate content, fixing it, and preventing it, in order to protect your website from future Google Panda updates.

Copying & Pasting: Duplicating Content

The first precaution to take when preventing duplicate content, and the first place to start when addressing potential existing duplicate content issues, is the manual act of copying and pasting content: duplicating content across your own website, copying content from another website onto your own, or having someone else (often with more domain authority) steal your content. A few simple tips will help prevent such situations:

  1. Ensure that you are not creating any new duplicate content manually. Never post the same article/post/page on more than one of your websites, or copy/paste content into a new page on your website such that the majority of that page ends up being duplicate content. At most, you can “quote” snippets of content from other pages on your site (this is common and okay to do), but the vast majority of the page should be unique content.
  2. Identify any existing duplicate content that you clearly know was copied and pasted onto a page of your website, whether from another page on your site or from another website. Either set those pages to “noindex,follow” via the meta robots tag (see the snippet after this list), or delete them from your website. If your website has pages whose only content is a data feed (from other websites), then these pages either need enough unique content written atop the page, or need to be set to “noindex,follow” (or deleted).
  3. Find non-compelling, thin content that currently exists on your site…and either make it more robust, set it to “noindex,follow”, or simply redirect the page to a more relevant page on the topic. Thin content, for example, could be a blog post that has 1–3 sentences…and that’s it. Although members of Google have mentioned in Google’s support forums that short content can still rank well, keep in mind that it’s incredibly difficult to be compelling with such short content unless you’re a well-known brand or very, very creative. The best way to gauge your success is by looking at the bounce rates of short content pages in Google Analytics. If the bounce rate is well above your site average, the page is doing you more harm than good.
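
Implementing the meta robots tag is a one-line change. Here is a minimal sketch of what it looks like in a page’s <head> (the surrounding markup is illustrative only):

    <head>
      <title>Example Page</title>
      <!-- Don’t index this page, but do follow the links on it -->
      <meta name="robots" content="noindex,follow" />
    </head>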

Case Study: Manually Pasting Content from Other Websites

Below is an example of what can happen to organic search engine traffic (from Google) when so much content is copied and pasted from an external website that the majority of the site becomes cross-domain duplicate content. Notice the clear drop in traffic around the time of the late April (2012) Panda updates. After setting all of this content to “noindex,follow” (yes, the majority of the website), the traffic ramped back up over time.
[Chart: Google Panda Penalty – organic traffic drop after copying content from other websites, and recovery after setting it to “noindex,follow”]

Google’s goal is to provide the highest quality, most authoritative content on a topic to their search users. Your goal is to create the same. Whether Google is 100% accurate or not…well that’s a different story!

Technical Sources of Duplicate Content

The following is a list of technical sources of duplicate content that SEO and eCommerce teams frequently need to address:

  • Removing search result pages from search engines’ indexes. If your site has an internal search feature, then unless your site is so big that search result pages effectively act as category pages (e.g., eBay), you absolutely must set these pages to “noindex,follow” via the meta robots tag. These are pure duplicate content and compete with the real content of your website (home page, category pages, product pages, and curated landing pages).
  • Ensuring canonical URLs are on all pages that you wish search engines to index. When editors, affiliates, Twitter, and other referrers link to your site with tracking code added to the end of URLs, this creates a real problem: search engines can end up seeing hundreds of different versions of the URL for a single page…and, as a result, think they are all different pages when in fact they are not. This duplicate content problem is solved by using canonical URLs, which define a single version of the URL for search engines to ensure it is the only URL they ever index (see the first sketch after this list).
  • Removing checkout/cart pages from search engines’ indexes. If you run an eCommerce site, search engines have no business indexing any content on your site from the point of clicking the “Add to Cart” button onward. As a result, it is best to set all of these pages to “noindex,nofollow” and disallow access to them within your robots.txt file (see the robots.txt sketch after this list) so search engines do not index them.
  • Removing Tag pages on WordPress sites from search engines’ indexes. When editors create new content, the lifeblood of WordPress sites, they often add multiple tags per post. When this is done, a new URL/page for each tag is created. This page is simply repurposed article snippets, and has 0% unique content. Thus, it’s advised that all WordPress sites have Tag pages set to “noindex,follow” to prevent/reverse Panda penalties.
  • Adding unique content atop Category pages. Category pages exist on almost all websites, often target very general, highly searched keywords, are often linked to from top navigation menus, and thus carry a lot of weight on a website. They can also attract more links. However, unless unique, introductory description content is added atop these pages…they are just like the Tag pages above. To see an example, visit the SEO Blog page on this site. Simply add 100+ words of unique content to Category pages to avoid the page being considered pure duplicate content by Google.
    • Note: When Gaiam (a well-known yoga equipment manufacturer) redesigned its website, they dropped the intro content on their main Yoga Mats page and lost their #1 ranking in Google for “yoga mats.” As of this writing, Gaiam is ranking #4 in Google. Clearly, category pages are vulnerable to duplicate content and can cost your company serious revenue if not accounted for.
  • Fixing duplicate/empty meta tags. If you haven’t set up your website with Google and Bing’s Webmaster Tools, you’re missing out on incredible free tools. These are online applications that give webmasters incredible insight into how these search engines view their websites. One of the most important areas is the “HTML Improvements” section, which points out which pages on your site have duplicate meta titles, missing meta titles, duplicate meta descriptions, missing meta descriptions, etc. While these are easy to fix, either manually or with script modifications, they are often clues to deeper duplicate content problems within the actual body content of the page.
  • Improving the efficiency of search bots in crawling a website. Although this is not a source of duplicate content…the positive effect of blocking access to unnecessary pages via robots.txt, canonicalizing URLs, and setting unnecessary pages to “noindex,follow” is that you improve the likelihood that Google, Bing, and other search engines will fully crawl and index all of the real content pages on your website.
  • Rewriting duplicate product descriptions on product pages. This is critical. Most eCommerce sites take the easy route and simply use the manufacturer’s description for their product pages. Guess what? So does (nearly) every other site that sells the same product. Unless you have unique reviews on your product pages (unlikely…getting reviews is very difficult), your product pages are not going to rank well in Google if you don’t have high Domain Authority. The solution is to make your product descriptions unique. Use a site like The Crowder to review the most popular crowdsourcing sites that can help with rewriting product pages. It’s going to require an investment, but you need to spend money to make money. And…rewriting product descriptions is a lot cheaper than link building, which is required to build Domain Authority.
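
To make the canonical URL fix concrete, here is a minimal sketch (the domain and tracking parameters are hypothetical). However a visitor reaches the page – http://www.example.com/widgets?utm_source=twitter, http://www.example.com/widgets?sort=price, and so on – the page’s <head> declares one authoritative URL:

    <head>
      <!-- Every parameterized variation of this page points search engines to one URL -->
      <link rel="canonical" href="http://www.example.com/widgets" />
    </head>

And for checkout/cart pages, a robots.txt entry along these lines (the paths are hypothetical; adjust them to your platform) blocks crawler access:

    User-agent: *
    Disallow: /cart
    Disallow: /checkout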

These are just a few examples of technical duplicate content, and we are continuously hunting for more. Have you found anything of a similar nature?

Content Syndication = Duplicate Content (unless…)

Content syndication is a common source of cross-domain duplicate content. This could consist of:

  • Syndicating your content out to other websites
  • Syndicating other websites’ content onto your website
  • Syndicating content between owned websites

All of these content development methods are used in today’s online world (at times for valid reasons), but they can pose serious risks for organic search engine optimization. Let’s consider a situation where you give X number of articles per week (originally posted on your website) to an external website via a content syndication agreement. In exchange, you get a decent amount of referral traffic back to your site via links to your site (from your article on their site). One would think this is an okay strategy (especially if they link to the source article on your site), but it still leaves the door open for a search engine like Google to “get it wrong” and see your own original content (on your site) as the duplicate in situations where the full article is posted on both websites. This often happens when the site you’re syndicating your content onto has higher domain authority than your own.

Solutions for Avoiding Duplicate Content:

  • Set the duplicated content to be “noindex,follow” via the meta robots tag in the source code of the page. This probably sounds like Greek to most of you, but it’s important to remember the phrase “noindex,follow” as an online publisher. It essentially means: “Dear Google, please don’t index this page, but do follow the links on it.” This allows you to put your content on other websites (or on other websites of your own), or put other websites’ content on your own website, while preventing search engines from indexing the duplicated page…which avoids the duplicate content problems that can get you penalized and cause traffic declines.
  • Set the canonical URL for the duplicated page to be the original article’s URL in the source code of the page. This also probably sounds like Greek to most of you, but it’s important to know what the word “canonical” means as an online publisher.
    • Canonical URL = The one authoritatively correct URL for a page. When a page can be accessed via multiple URLs, a canonical URL should be identified (in the source code of the page). (source)

    So, we could post www.domain1.com/article on www.domain2.com/article, but require that the content syndication partner (the other website owner) place the following line of code into the source code of the web page on their website:

    • <link rel=’canonical’ href=’http://www.domain1.com/article’ />

    This way…we’re telling search engines: “No no…it may look like http://www.domain2.com/article is the URL for this page; however, http://www.domain1.com/article is the only URL that you should index.” Search engines accept this, and encourage us to use it.
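
    Putting the two solutions side by side, here is a minimal sketch of what the content syndication partner’s page <head> might contain (using the same hypothetical domains as above). A partner would implement one option or the other, not both:

      <head>
        <!-- Option 1: keep the syndicated copy out of the index entirely -->
        <meta name="robots" content="noindex,follow" />
        <!-- Option 2: credit the original article as the one URL to index -->
        <link rel='canonical' href='http://www.domain1.com/article' />
      </head>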

These are two of the most commonly accepted solutions for avoiding duplicate content created by content syndication. If your content syndication partner balks at implementing such technical requests, then you need to seriously weigh the value of the relationship against the real danger of a Google Panda penalty. Is it worth losing the majority of your traffic from Google?

Using Google’s Site: Command to Check for Duplicate Content

The site: command can be used in a Google query to find specific pages on any website. It’s a flexible tool that can be used in conjunction with other search operators to help find internal duplicate content. The following are some of the more effective ways of using this tool:

  • site:www.websiteurl.com – This pulls up the number of pages in Google’s index under this URL. Compare this count to the number of real content pages on your website.
  • site:www.websiteurl.com “this block of text content from a page” – This shows you if the same body of content appears on more than one page of your site. A variation of this (searching the quoted block of text without the site: operator) can quickly show you if other sites are taking your content.
  • site:www.websiteurl.com inurl:sort= – Use this command to see if duplicate URLs are being indexed by Google due to a sort/search function of the site.
  • site:websiteurl.com inurl:https – Can be used to show if the secure/https version of your URLs is potentially causing duplicate content issues. For example, site:www.writersmarket.com inurl:https shows Writer’s Market has only two such URLs indexed.

Learn more about Google’s search operators in order to broaden your knowledge of using these commands to find duplicate content.

Managing Stolen/Scraped Content

Scrapers and unethical webmasters across the internet will steal your content if you’re not careful. If you find that your content has been published elsewhere online without your written consent, you have a few options:

  • Email the Webmaster or Author - This is the first place to start. Look for a “Contact” page on the website that is stealing your content. If there is none, look up the domain owner via www.whois.com and either email or call if there is contact information listed. If you are able to make contact, be polite and start off the conversation casually, letting the webmaster know that you found your content while browsing their site. If only a portion of your page is being duplicated, then use this as a link building opportunity: simply offer to let them keep the content on their site in exchange for a link back to the full source of that content snippet on your website. If your entire page is being duplicated, then ask that the webmaster remove it, and kindly mention that duplicate content will hurt the traffic on both of your sites. As a consolation, you can offer a short summary/snippet of your content for the webmaster to use in exchange for a link to the full content on your website. This is a win/win for both parties involved, and you get a link!
  • File a Search Removal/DMCA complaint with Google.
    • Search Removal – If manually reaching out to the webmaster provided no resolution, then you have the option to file a DMCA complaint by using Google’s Search Result Removal Tool. Be sure to select “web search” and specify the nature of your request (privacy, trademark infringement, etc.).
    • Reporting Scrapers – If your website content was copied by a scraper, use this tool. Scraper websites merely copy other sites’ content and re-publish it on their own site(s) in order to drive search traffic for their own revenue purposes. Simply put, it’s stealing. Option #1 be darned with these folks – they are scum and must be eradicated!
    • Stolen YouTube Videos – If you need to file a removal request or DMCA complaint for a YouTube video of yours which was stolen, use this tool.