Search engines consider every URL to be a unique object or page. Every instance of duplicated content, regardless of the purpose of the page, will negatively affect rankings if search engines are allowed to crawl it. It is sometimes necessary to have two (or more) pages with the same content; however, even if the content is helpful to users and makes sense, its presence in the search engine indices will cause ranking problems. It is recommended to exclude exact (or even similar) copies of any content from the search engines or, if possible, to avoid having duplicate content to begin with.
Duplicate content can be caused by a number of things, including URL parameters, printer-friendly versions of pages, session IDs, and sorting functions. These kinds of pages tend to be a normal, helpful part of a website, but they still need to be addressed in order to avoid serving a duplicate page to the search engines. There are several recommended methods for fixing duplicate content: 301 redirects, the rel="canonical" tag, robots.txt exclusions, and the noindex meta tag.
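Before choosing among these methods, it helps to know which URLs actually serve the same content. The sketch below is one rough way to find out: it fingerprints the text of each page and groups URLs that share a fingerprint. The function names and the sample pages are hypothetical, and the comparison only catches exact copies (after collapsing whitespace and letter case), not pages that are merely similar.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Hash the page text after collapsing whitespace and letter case."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group URLs whose text is identical.

    `pages` maps each URL to the text extracted from that page.
    Returns a list of URL groups, each holding two or more URLs.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl results: the print version repeats the original text.
pages = {
    "http://www.example.com/original-content.html": "Blue widgets are great.",
    "http://www.example.com/print/original-content.html": "Blue  widgets are GREAT.",
    "http://www.example.com/about.html": "About our company.",
}
duplicates = find_duplicates(pages)
```

Each group returned by find_duplicates is a candidate for one of the fixes described in this section.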
A 301 redirect, or permanent redirect, sends both users and spiders who arrive on a duplicate page directly to the original content page. These redirects can be used across subfolders, subdomains, and entire domains.
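As an illustration only, a permanent redirect can be sketched as a small WSGI application in Python. The REDIRECTS table and its paths are hypothetical; in practice redirects are often configured in the web server itself rather than in application code.

```python
# Hypothetical mapping from duplicate paths to the original content page.
REDIRECTS = {
    "/print/original-content.html": "/original-content.html",
}


def app(environ, start_response):
    """Minimal WSGI app: answer duplicate paths with a 301 redirect."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 tells both browsers and spiders that the move is permanent.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"original content"]
```

The app can be served locally with wsgiref.simple_server.make_server from the standard library to watch the redirect happen in a browser.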
The rel="canonical" attribute acts similarly to a 301 redirect, with a few key differences. The first is that while the 301 redirect sends both spiders and humans to a different page, the canonical attribute is strictly for search engines. With this method, webmasters can still track visitors to unique URLs without incurring any penalty. The tag that carries the canonical attribute is structured as follows.
<link rel="canonical" href="http://www.example.com/original-content.html" />
The <link> tag is placed in the <head> of the duplicate HTML document and points to the page that the search engines should deem the original.
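To check which page a document declares as its original, the canonical link can be read back out of the HTML. The sketch below uses Python's standard html.parser module; the class and function names are hypothetical, and for brevity it does not verify that the tag sits inside the <head>.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> encountered."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in `html`, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonical


page = (
    "<html><head><title>Widgets</title>"
    '<link rel="canonical" href="http://www.example.com/original-content.html" />'
    "</head><body></body></html>"
)
declared = find_canonical(page)
```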
Webmasters can also exclude pages from search engines through the use of a noindex meta tag on specific pages. Using the noindex meta tag, webmasters can ensure the content of that page will not be indexed and displayed in the search result pages.
- <meta name="robots" content="noindex" />
- <meta name="robots" content="noindex, nofollow" />
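A quick audit can confirm that the intended pages really carry the noindex directive. The following is a minimal sketch using Python's standard html.parser module; the names RobotsMetaReader, robots_directives, and is_indexable are hypothetical. It treats the value "none" as equivalent to "noindex, nofollow", which is how major search engines document it.

```python
from html.parser import HTMLParser


class RobotsMetaReader(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives found in `html`."""
    reader = RobotsMetaReader()
    reader.feed(html)
    reader.close()
    return reader.directives


def is_indexable(html):
    """True unless the page asks not to be indexed."""
    return robots_directives(html).isdisjoint({"noindex", "none"})
```

For example, is_indexable('<meta name="robots" content="noindex, nofollow" />') evaluates to False, while a page with no robots meta tag is treated as indexable.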
The final recommended method involves using a robots.txt file. Using robots.txt, webmasters can provide directives to search engine spiders to keep them from crawling certain parts of a website. The URLs of these pages may still show up in some search engine indexes, but only if the URL of the page is searched for specifically. Tip: While official search engine bots (spiders) will follow the robots.txt protocol, malicious bots often ignore it entirely.
If placed within the robots.txt file, the directive below would prohibit the Bing spiders from crawling and indexing the 'widgets' directory.
User-agent: bingbot
Disallow: /widgets
To learn more about the robots.txt file, please see our robots.txt Tutorial (going live soon!)
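The bingbot rule above can be tested before it goes live with Python's standard urllib.robotparser module. This is a sketch under the assumption that the file contains only that one rule; the sample URLs and the googlebot user agent are illustrative.

```python
from urllib.robotparser import RobotFileParser

# The directive from the text, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: bingbot
Disallow: /widgets
"""


def build_parser(robots_txt):
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# bingbot is blocked from anything whose path begins with /widgets.
bing_directory = parser.can_fetch("bingbot", "http://www.example.com/widgets/blue.html")
bing_page = parser.can_fetch("bingbot", "http://www.example.com/widgets.html")
# Other spiders are unaffected because no rule names them.
other_spider = parser.can_fetch("googlebot", "http://www.example.com/widgets/blue.html")
```

Because Disallow values are matched as path prefixes, /widgets also covers /widgets.html. Writing the rule as Disallow: /widgets/ would limit it to the directory.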
Thin content describes both pages that have very little content and pages that have a lot of content of little value. The latter is the more accurate description, as there can be pages with very little content that are useful (e.g., if a topic takes only a few sentences to cover, it is not necessary to generate encyclopedic volumes of content for it).
According to Matt Cutts, the head of Google’s web spam team, thin content contributes either very little or no new information to a given search. This problem is particularly common for e-commerce sites that may have hundreds or thousands of pages for different products with only minimal product details and information.
The best long-term solution to this problem is simply to create unique content for every web page which might contain duplicate or lackluster information. By supplementing repeated information with sections of unique text, like a thorough description, review, opinion, video, or brief editorial, webmasters can increase their website’s relevance to search engines.
The canonical tag can be used to help avoid creating duplicate content by specifying the original publication page of a piece of content. One specific use for the canonical tag is on a page that lists products and has sorting functions that produce different URLs depending on how the products are sorted. In this case, any variation in sorting from the default presentation can use a canonical tag to indicate that the original URL is the only one that should be indexed.
For example, if a webpage lists a variety of widgets at the URL www.example.com/widgets.html and offers a sorting link (www.example.com/widgets.html?sort=price) that allows the widgets to be sorted by price, then the canonical tag is needed on www.example.com/widgets.html?sort=price to indicate that the original content is housed at www.example.com/widgets.html and that www.example.com/widgets.html?sort=price should not be indexed.
The canonical tag on www.example.com/widgets.html?sort=price would look like this and be placed in the <head> of the document.
<link rel="canonical" href="http://www.example.com/widgets.html">
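On a site that generates its pages dynamically, the tag for each sorted variation can be produced from the requested URL. The sketch below assumes the sort order is passed as a query-string parameter named sort (written as ?sort=price); the function name and the SORT_PARAMS set are hypothetical.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of parameters that only change the ordering of the list.
SORT_PARAMS = {"sort"}


def canonical_link(request_url):
    """Build the canonical <link> tag for a possibly sorted listing URL."""
    parts = urlsplit(request_url)
    # Keep every parameter except the ones that merely reorder the page.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in SORT_PARAMS
    ]
    canonical = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )
    # Escape the URL so it is safe inside a double-quoted attribute.
    return '<link rel="canonical" href="%s">' % escape(canonical, quote=True)


tag = canonical_link("http://www.example.com/widgets.html?sort=price")
```

Parameters that do change the content, such as a hypothetical color filter, are kept in the canonical URL, so only the sort order is collapsed onto the default page.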