Synopsis — A while ago, contributor Jaimie Sirovich wrote a piece for Search Marketing Standard on defusing the duplicate content situation, with specific tips on how to avoid penalties for duplicate content from the search engines. In this article, he updates the advice in light of Google’s Panda, and expands the discussion. The focus this time is on three newer tools that will help you deal with duplicate content issues on your website with fewer application changes needed. The first part of this topic can be found at Defusing Duplicate Content.
Defusing The Duplicate Content Situation: Part II –
Batteries Programmers Not Included
Nobody has one canonical solution (no pun intended) to problems caused by Panda updates, but removing and canonicalizing thin or duplicate content is sensible regardless. Thankfully, search engines are also providing more tools to help us do so, making it possible without complicated changes to the underlying application architecture. In Part I of this article, we reviewed the then-available tools for curing duplicate content, including:
1. Canonicalization via rel=canonical
2. Parameter handling
3. Exclusion via robots.txt or meta tag
4. Partial exclusion via meta tag (noindex, follow)
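As a quick refresher, the exclusion options above boil down to a robots.txt Disallow line or a robots meta tag in the page head. A minimal sketch of the two meta-tag variants:

```html
<!-- Full exclusion: keep the page out of the index and do not follow its links -->
<meta name="robots" content="noindex, nofollow">

<!-- Partial exclusion: keep the page out of the index, but still follow
     its links so deeper pages can be discovered -->
<meta name="robots" content="noindex, follow">
```

The robots.txt route accomplishes full exclusion with a Disallow line for the path in question, but it cannot express the partial-exclusion case.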
Even if your website has plenty of unique content, advanced navigational features and architectural missteps may still cause damage, and algorithms designed to weed out content farms and other low-quality content may start nipping at your traffic. Particularly irksome edge cases also exist, such as the quasi-duplicate content arising from pagination and faceted navigation.
Fortunately, three relatively new tools will help combat the problems. Even better, these new tools require fewer application changes as compared to other available approaches.
1. Google’s URL Parameter Handling Tool
Help that little robot out with parameter hints via the rechristened URL Parameters (discussed at googlewebmastercentral.blogspot.com/2011/07/improved-handling-of-urls-with.html).
Google has added a new “how does this change page content” hint to its parameter handling tool, which allows the webmaster to explicitly indicate what a URL parameter does: nothing at all (as with a tracking parameter), or that it filters (as in faceted navigation), paginates, or sorts content. Previously, Google had to determine this on its own with heuristics.
Note that if a parameter is rewritten as part of a URL path, this feature is rendered completely useless. Google states that, “This is difficult for Google to interpret and this parameter configuration tool cannot help for such case.” Naive URL rewriting is also something I cautioned about in Professional Search Engine Optimization with PHP (2007), where I pointed out that, in the case of duplicate content, using static-looking URLs may actually exacerbate the problem. This is because while dynamic URLs make parameter names and values obvious, rewritten static URLs obscure them. Search engines have been known, for example, to attempt to drop a parameter they heuristically guess is a session ID and thereby eliminate duplicate content. If the session parameter were rewritten into the path, a search engine would not be able to do this at all.
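To illustrate with hypothetical URLs (invented for this example, not taken from Google’s documentation): in the dynamic form the session parameter is easy for an engine to spot and strip, while the rewritten form makes the same session ID indistinguishable from real path content.

```html
<!-- Dynamic URL: the PHPSESSID parameter is obvious, and an engine
     can heuristically strip it to collapse duplicates -->
<a href="http://www.example.com/products.php?cat=7&amp;PHPSESSID=a1b2c3">Skirts</a>

<!-- Naively rewritten URL: the same session ID now looks like an
     ordinary path segment, so it cannot be safely dropped -->
<a href="http://www.example.com/products/7/a1b2c3">Skirts</a>
```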
Harkening back to Part I in this series, it is interesting to note that this tool can theoretically solve the duplicate content problem arising from faceted navigation without any application architecture changes. However, according to Google’s examples, in the case of a “color” filter:
“[These pages] seem like new pages (the set of items are different from all other pages), but there is actually no new content on them, since all the blue skirts were already included in the original three pages. There’s no need to crawl URLs that narrow the content by color, since the content served on those URLs was already crawled.”
However, most information architects would maintain that sub-categories may also be modeled as facets and vice-versa, so this does seem to be an over-generalization by Google about whether facets are worth indexing. Therefore one should use this technique to address faceted navigation with caution — especially in advanced navigation such as Endeca’s InFront or our adeptCommerce Suite, where a perfectly viable landing page with content may indeed be a facet-filtered category page.
Thus, there may be some caveats for this tool, but the pagination and “none” URL-parameter hints are incredibly useful for websites experiencing indexation problems caused by tracking parameters or URL-based sessions, especially when those are required by a legacy application and architecture changes are impractical.
I wrote in detail about alternative approaches for coping with duplicate content arising from faceted navigation in the Summer 2010 edition. It is also worth noting that Bing has no features that parallel these new advanced parameter hints, which may limit performance in Bing. Bing is likely to follow suit in the future, however, as it has adopted a number of Google’s de facto standards, and even coordinated with Google on a few others.
2. Rel=prev and next
Rest in peace “Noindex, follow” for pagination
Like their sister rel=canonical, the rel=prev and rel=next duo provide Google with an indication that quasi-duplicate (and typically less important) pagination content is just that: pagination content. In their absence, Google used heuristics to detect pagination links. Used in coordination with a pagination hint in Google’s “URL Parameters” tool, this new tool should theoretically make other proposed approaches to pagination obsolete, as it provides more specific information and leaves the decision of how to index the paginated pages up to the search engine.
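A minimal sketch of the markup, assuming a hypothetical three-page category on example.com: a middle page carries both links in its head, while the first page omits rel=prev and the last omits rel=next.

```html
<!-- In the <head> of page 2 of a hypothetical paginated series -->
<link rel="prev" href="http://www.example.com/skirts?page=1">
<link rel="next" href="http://www.example.com/skirts?page=3">
```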
Previously, search engine marketing professionals proposed solutions such as using the dubiously supported “noindex,follow,” canonicalizing all pagination pages to page one, or canonicalizing to one show-all page. None of these approaches are ideal.
While “noindex, follow” may actually be a valid approach, it is most likely inferior to rel=prev/next, which communicates the pagination relationship much more explicitly.
Both Maile Ohye and John Mu warn against using a rel=canonical tag on paginated results pointing back to page one. They note, respectively, that bots will either ignore the tag or be prevented from accessing the paginated pages.
Canonicalizing to one show-all page is still valid according to Google. However, this would not be viable on sites with thousands of products in any category, as often occurs on facet-enabled websites. This would yield an excessively heavy and slow web page with many thousands of links.
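Where a show-all page is practical, the canonicalization Google sanctions would look roughly like this in the head of each component page (the view-all URL is hypothetical):

```html
<!-- On /skirts?page=1, /skirts?page=2, etc., all pointing
     to a single show-all version of the same category -->
<link rel="canonical" href="http://www.example.com/skirts/view-all">
```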
Another plus of this approach is that it involves only minor application changes. Indeed, this seems to be a theme among many of the recent new tools: search engine optimization should be addressable without having to make changes to application code.
3. Schema.org and Microdata
Schema.org and its brethren help search engines to understand and process the data contained within HTML documents, and these on-page mark-up standards can be used to indicate explicitly that content is navigational rather than actual content. Yahoo proposed a similar technique many years ago with its “class=robots-nocontent” (http://www.ysearchblog.com/?p=444).
Doing so helps search engines identify the meat of an HTML document within its navigational framework. On the plus side, these changes can be made with relatively minor alterations to an HTML template in most eCommerce platforms.
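As a sketch of what such mark-up looks like, here is a hypothetical product snippet using schema.org microdata (the item types and property names come from the schema.org Product and Offer vocabularies; the product itself is invented):

```html
<!-- Marking up the "meat" of the page so engines can distinguish
     it from the surrounding navigational framework -->
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">Blue A-Line Skirt</h1>
  <span itemprop="description">A knee-length skirt in cotton twill.</span>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">49.99</span>
    <meta itemprop="priceCurrency" content="USD">
  </div>
</div>
```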
To reiterate the conclusion of Part I of this series, duplicate content creates a set of difficult problems. It’s difficult not just for web developers, but search engines as well. This is underscored by the new accoutrements that search engines provide us to assist them in processing information. These tools are aimed at allowing the webmaster to address duplicate content and crawlability issues in new ways — and often without making complex changes to a web application or hiring programmers at every turn.
Note: Jaimie has written a chapter on usability and faceted navigation in Greg Nudelman’s new book Designing Search: UX Strategies for eCommerce Success (John Wiley & Sons, May 2011).