Duplicate Content & Multiple Site Issues

Hello everyone. On August 12th I gave a talk
at SES San Jose about duplicate content and multiple site issues. In order to get that
information to as broad an audience as possible, we’re repeating some of that on our Google
Webmaster Channel. So, first of all, my name is Greg Grothaus. I’m a Software Engineer
at Google who works in Search Quality. I’ve been here for about 4 years, and what we do
in Search Quality is find the right information and rank it right. I’m a passionate Googler,
and what we’re doing right here is part of my job of webmaster outreach. So we reach
out to the community and explain a lot of things about how Google Search Quality works. So,
we’re gonna be talking about duplicate content today. First and foremost, I wanna clear up
a myth that kind of goes around about duplicate content, called the duplicate content penalty.
Generally speaking, people worry that Google has this penalty for sites that have duplicate
content on them. I think – personally – I think that the reason this happens is people
will see this message. They’re doing a query, they see, “in order to show you the most relevant
results, we have omitted some entries very similar to the ones already displayed. If
you like, you can repeat the search with the omitted results included.” And they click
the “repeat the search” link and they see, “oh no, my website has been omitted from Google
Search Results.” And I’ve seen evidence of people kind of getting worried about this
and thinking that this is actually a penalty that Google’s applying on their site. What’s
actually happening is that we’re looking at the query the user’s doing and we want diversity
in the results that we show a user. So, if someone searches for “fluffy bunnies” we wanna
show as maybe page one the Wikipedia article on fluffy bunnies, but we don’t wanna show
as page two the print version of the same article with the exact same text. So, what
we’re doing for that specific query is we’re omitting the print article. This is not a
penalty. In fact, if you adjusted your query to search for “fluffy bunnies print version,” you’d
probably get the reverse effect, where the print version would be showing instead of
the initial fluffy bunnies Wikipedia article. And so, on this slide I’m gonna show you a
little bit about some information from our Webmaster Guidelines on duplicate content.
Just a couple of snippets. You can find out more information about this by searching for
duplicate content guidelines on Google, but essentially what we say is, “We recognize
that most duplicate content is not deceptive in origin, and so, as a result, we’re not
trying to penalize it.” We’re just trying to show in our search results examples of
pages that are distinct and have useful information that are different from the results we’ve
already shown you above. This is very much a per-query thing. There are some exceptions to this,
and this is what we call “spam” in Search Quality. The exceptions, though, aren’t really
a penalty for duplicate content; they’re a penalty for spam. So, someone for example, who comes
along and creates a web page that’s an exact copy of articles on Wikipedia or some other
source without any extra valid content and marks it up so that they can drive traffic
to their e-commerce or something like that, is really doing a disservice to our users
and to Google. As a result, we like to take that kind of content out or reduce its ranking.
This is what we call “intent to manipulate rankings and deceive our users.” Just like
spammers might use bold tags on the page, spammers might use duplicate content. Just
because the bold tags are there doesn’t mean we’re removing or penalizing someone for using
bold tags. In the same way, we’re not penalizing someone for using duplicate content; we’re
penalizing them for spam, and duplicate content might be there as well. You can find a lot
more information about this in our Google Webmaster Guidelines. Alright, so on this
slide I’m gonna show you a little bit about what duplicate content is, now that we’ve gone
past the question of this myth. Here I show eight different example URLs. The URLs are
different but the content is pretty much the same, and you can see that pretty clearly. Does this
really happen? Yes. Right here I show three different versions of the royal.gov.uk website,
the British Monarchy. Each version is the exact same content but slightly different
URLs. So this is a pretty big website – lots of websites have this exact problem. So, why
is this a problem? What is the deal here? Obviously there is no penalty associated with
this – we’re not gonna remove the Royal Monarchy website. But what’s going on here is, you
can have some side effects that are much more second-order. So, one example is that your
links, if you have links to different versions of the page, you’re not accumulating all that
link juice in one place. So let’s say you have two pages with the same content but
different URLs, and you have 10 links to one and 10 links to the other, instead of having 20
links to one page, which would get that one page to rank really highly. Both of those
pages now have 10 links to them, and they’re both gonna disappear in the rankings – or
potentially, depending on the query. So that’s one problem. The second problem is, Google
will automatically try to figure out that these pages are the same and will collapse
them together in the search results and show only one of the URLs. When we do this, it’s
likely that we’re gonna pick the best URL for the user, but sometimes we pick the
wrong one, and as the webmaster you are the best person to know which URL you would like
your users to see. So, if you can help us by making sure that you only have one URL,
you can make sure that user-friendly URLs are in our search results and users will actually
click on those more often than the not-friendly stuff. And last but not least, if there
are more pages that we’re having to crawl on your website that are essentially all the
same stuff we’re not gonna get as deep into the new stuff on your website that you’d like
us to see. So, it’s always in everyone’s interest to have Google crawl as much of their content
as they can, but if we’re crawling the same thing over and over again, you’ll end up
with us not seeing everything we could. So, how do you fix these duplicate content
issues? The first thing is to understand what we call the canonical, and the canonical means
the simplest version of the content that you can come up with without any loss of generality.
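The idea of canonicalization can be sketched in code. Below is an illustrative Python sketch (not anything Google actually runs) that collapses common URL variants, such as an extra “www.” prefix, default ports, trailing slashes, and tracking parameters, onto one canonical URL. The parameter names and the https/non-www choices are assumptions made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of query parameters that don't change the content.
TRACKING_PARAMS = {"sessionid", "utm_source", "utm_medium"}

def canonicalize(url: str) -> str:
    """Map common variants of the same page onto one canonical URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):              # assumed policy: non-www is canonical
        host = host[4:]
    host = host.removesuffix(":80").removesuffix(":443")  # drop default ports
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    path = parts.path.rstrip("/") or "/"     # drop trailing slash, keep root
    return urlunsplit(("https", host, path, query, ""))  # assumed policy: https

# All three variants collapse to the same canonical URL.
variants = [
    "http://www.example.com/products/?sessionid=42",
    "https://example.com:443/products",
    "http://example.com/products/",
]
assert len({canonicalize(u) for u in variants}) == 1
```

Which form you pick as canonical (www vs non-www, http vs https) is a policy decision; the point is only that every variant resolves to a single URL that all your links and redirects agree on.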
So, the canonical is actually referring to the URL that you want to show for that content.
So if you have content that’s available on two different URLs, pick which one you like;
that’s what’s called your canonical URL. Now, once you’ve picked the one you want, there’s
lots of ways you can tell Google that information. The best way to do it is to structure your
site such that all of your links go to the canonical version, and generally the users
end up on this canonical version, so when they link to it they link to the canonical
version as well. But in addition to that you can do a couple of other tricks: a 301
redirect, or there’s a new option – new since SES last year – which is the rel=canonical
tag. We’ll talk about those in a second here. So, for 301 redirects: these are an HTTP
response header that gets sent along with your file when you’re sending it to the user.
What it does is, it tells both the browser and Google,
“this is not the URL you want. The URL you want is somewhere else.” And what happens
to a user is, their browser will redirect them to a new place. And Google will more
or less treat it the same way. So if your user arrives on your website at the wrong
URL, using a 301 will take them to the right place and it will also work quite well for
Google. One of the really common places where this gets used is for moving a site, either
to a new domain, new host, or if you just want to modify the structure of the URLs on
your site. And we’ve actually got a lot more information on that on the Webmaster Help
Center if you wanna take a look at it – a little more information that I can talk about
here today. So, 301’s are great, and we’ve had that as an option around for a very long
time. But there’s some cases, we noticed, that really just don’t fit what 301’s should
be used for. A really good example is a Wikipedia page that has some content on it, and then
there’s a link on the left-hand side that goes to the print version of the same page. If
your 301 sends the user who wanted the print version back to the original content, there’s
no way for the user to, you know, be able to make use of that print version, so it’s
essentially a broken system. What we’re offering here is
a new tag, rel=canonical, and I’ll explain that here in just a second. Here’s another
example which is essentially, let’s say you use your URLs as a UI device for your users.
So let’s say you come to stuff.com and you wanna buy some red tent bags. You click on
tents, you click on bags, and you buy red tent bags. Another person comes along, they
click on bags, they click on tents, and then they buy red tent bags. You’re showing in
the URLs different breadcrumbs, which indicate the user’s path. It gives the user a feeling
of where they’ve been, where they’re going, and it helps them structure in their mind
how your site works. So, the same content can be found by different
paths. This is OK, but the problem is, if you had to use 301s to fix that, you’d lose
that value of the URL being a UI component. These are some of the tough issues that rel=canonical
hopes to solve. So what is rel=canonical? It’s just an HTML tag. You put it on your
pages and you say, “this page here, I want to splat its canonical over to this other
page.” So, let’s say the Googlebot arrives on the red-tent-bags version and you
want the red-bags-tent version to be the canonical. So, you just tell the Googlebot with this little
tag on the first page that the canonical is this other page and Googlebot essentially
treats it as a 301 redirect, whereas your users won’t see anything. This works really great
for any of the cases I’ve described before, but it also works really great if you don’t control
the HTTP responses your server sends back to users, or for any other reason that you just
might not want to use a 301. So, let’s go through a couple of questions and answers
on this. So, how does rel=canonical work, and what are the rules for using rel=canonical?
The rule is that you can splat from one URL to another as long as it’s on the exact same
domain. This works across hosts – different hosts. So, for example, zeta.zappos.com could
splat over to www.zappos.com. But it doesn’t work across domains, so zeta.zappos.com couldn’t
redirect over to google.com, for example, with rel=canonical. You can use it across protocols,
like HTTP vs HTTPS, and you can use it across ports as well. Should you use 301s or should you
use rel=canonical? It’s totally up to you. This is really just another tool in your arsenal;
another option you’ve got. And the last question that we get a lot is, do these
pages have to be identical? One of the problems is, if we come along – as Google – to the
same page two different times we may see, like, a date on this page, like, last updated
on blah or whatever – we might notice that they’re slightly different. So we recognize
that clearly these pages, we can’t expect them to be always completely identical, but
we do expect them to be very similar. So slight differences are totally OK. So let’s talk
a little bit about multiple domains really quick. It’s a pretty common problem for webmasters
to try to figure out what they wanna do when they wanna have multiple domains. This really
commonly arises in the case of different domains for different country codes, like I wanted
a German version of the site and a French version of the site, so maybe I’d have a .de
and a .fr. Google thinks these are great, and we think multiple domains are totally
fine. But there’s a couple of things to keep in mind with this. The same concerns we raised
before with your content split across multiple URLs apply here. So, for example, if you have
a German and a French version of your site, and you have links to the German version and
you have links to the French version, those don’t get accumulated; they apply individually
to each domain. So, you’re making a trade-off here. Maybe you want to have that reputation
accumulated per language or maybe you wanna accumulate it onto one site across all of
the countries that you’re servicing. Also, Google’s gonna tend to pick only one of the
domains for a single query. We’re gonna pick the one that’s best. So, if you have content
on two different domains, let’s say in the same language, say an Australian version and
a British version of the same page, both in English and on different domains, we might
notice that and we’re gonna pick one. This can get you sometimes – most of the time
we’re gonna pick exactly the one that you’d want us to pick, the .co.uk for the British
and the .com.au for Australia, but in some cases we’ll get it wrong, and there’s a lot you
can do to help us out with the webmaster console. You can log in and set each domain for a particular
locale. But by splitting this up you do run the risk of us getting it wrong once in a
while. And there’s this last little thing that most people probably don’t think is
that important, but I really like, which is that you lose the advantage of a “tabbed”
UI. So if you do a Google search for something that would bring up your website and there’s
two different pages on that site that match that query, we’ll show the first result normally
in the search results, and the second result right below it, tabbed over. And we’ll also
show a link that says “show more results from mysite.com.” So, all this stuff really draws
users’ attention to that block of information on the Google Search results page. It’ll get
a lot more attention and possibly a lot more clicks, so it’s a pretty useful feature. But
if you’ve got that content split across different domains, we’re no longer gonna give you that
advantage. So that’s another trade-off to consider when you go into multiple domains.
But that’s it. In a lot of cases multiple domains are really useful, especially when
you’re talking about different languages because users really want to see the stuff in their
own locale and you wanna create that experience per country that you’re working with. So,
that’s generally everything that we’ve got on duplicate content here. It was a pretty
short session at SES, and thank you very much.
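To make the two fixes from the talk concrete, here is a minimal Python sketch of a request handler: requests arriving on a non-canonical host get a 301 pointing at the canonical URL, while a print version stays reachable at its own URL but carries a rel=canonical link back to the main page. The host name example.com and the /print path convention are assumptions made up for this illustration.

```python
# Sketch of the two fixes discussed above, under assumed site details:
# example.com is the canonical host, and print pages live at <page>/print.

CANONICAL_HOST = "example.com"

def handle(host: str, path: str) -> tuple[int, dict, str]:
    """Return (status, headers, body) for an incoming request."""
    if host != CANONICAL_HOST:
        # 301 redirect: both the browser and Googlebot are sent to the
        # canonical URL, so links and ranking accumulate in one place.
        return 301, {"Location": f"https://{CANONICAL_HOST}{path}"}, ""
    if path.endswith("/print"):
        # The print version must stay usable, so no redirect here; instead
        # a rel=canonical link tells Googlebot which URL to index.
        canonical = f"https://{CANONICAL_HOST}{path.removesuffix('/print')}"
        body = (f'<html><head><link rel="canonical" href="{canonical}">'
                f"</head><body>printable version</body></html>")
        return 200, {"Content-Type": "text/html"}, body
    return 200, {"Content-Type": "text/html"}, "<html>regular page</html>"

status, headers, _ = handle("www.example.com", "/tents/red")
assert status == 301 and headers["Location"] == "https://example.com/tents/red"
```

The choice between the two is exactly the trade-off from the talk: a 301 when the variant URL has no value of its own, and rel=canonical when users still need to land on it.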

Comments

  1. BijouMind Interactive

    What we’ve known all along is now being acknowledged – Content is KING – EVERYWHERE. Throw in some links from relevant, high-PR sites to that content and it will rank.

  2. iMarket

    The Internet is a research medium, not an advertising medium. Unique information gives you the most traffic. Keep your documents subject-specific. Mixing up information results in keyword washout. All important pages should have a text link directly from the home page. Meta tags should be to the point. They should answer HOW, WHAT and WHERE. The longer your title, description or keywords are, the less important each word becomes.

    Gary Skrzek
    iMarket Canada

  3. CreativeProfiteer98

    Whoa. So that means all those people autoblogging could possibly get hit with a spam penalty. That’s why I stay far away from that stuff.

  4. TORMY VAN COOL

    I have a question. On one website I use WordPress 3.1.1 with its SEO package. In WordPress I can select one of the pages to appear as the index (website/index.php, for instance). Nevertheless, I can also reach the page at its original link (i.e. website/page-name.html). In this case what can I do? Will I be penalized by Google?

  5. humanyoda

    Greg, you are a smart guy, but you speak too fast. You swallow the endings of words, etc., which makes understanding your speech more difficult than necessary and takes the joy out of listening to you. Getting information is important, but enjoying the process is also important.

  6. pcpancake

    Wow. This was way above me but quite interesting, Greg. Well spoken, and I enjoyed the annotation about you at 0:22. I am a geek. I understood your speech perfectly. I had been searching whether it was bad to post the same video to more than one YouTube channel… any thoughts on that? Have a great day.

  7. Thomas van Schalkwyk

    Google should not force us to implement all kinds of workarounds and limitations on our websites. Instead they should come up with a better solution.

    Right now I need to have two different domain names pointing to the same website (a business agreement between two companies to share content, but with different branding). But I just learned that I cannot use the canonical attribute across domains, so I’m screwed again.

    Does anyone have any ideas on how to overcome this?

  8. Master Hughes

    Google ranking does not work like they claim; it’s designed to force people to pay for advertising, and they are always indirect when you ask what’s allowed and what’s not. They won’t be direct and tell you.

  9. Ernő Horváth

    I’m pretty confused about where the limit of content duplication is ‘today’. I personally try to avoid duplicate content, however I see some big companies still using it and honestly it’s working 🙁
    freessl com and rapidssl com are different on the main page, however if you open another page it’s exactly the same. Both are ranked in Google and there seems to be no penalty…

  10. CleanCreative

    REMEMBER: All of these “solutions” require that the end user not only knows, but also applies, the correct usage of Google and search engines in general.

    I find an overwhelming majority haven’t a clue how to fine-tune and really use search engines; they just type & hit search. Either that or time is a constraint, as it is with most people.

    Therefore there is no “one fix” which works for all in all instances. If most don’t use an engine correctly, most may not see something you intend them to.

  11. CleanCreative

    Keep in mind that that translates to: you could be getting technically “penalized”, not by Google, but rather by something not showing up for many users, because they don’t know that.

    For instance, site A is supposed to show but site B does, so the user just searches for something else or leaves because site B isn’t what they wanted and they don’t know there’s a site A.

  12. Vleporama vinyl car decals and 3d prints

    OK, so two pages on the same website. Product A and product B are 99% the same. They’ve got different pictures. So, for me there is no point in changing them because they are different products. Both of them have different canonicals. Is this OK or should it be changed?

  13. Stewart W.

    So the new Search Console does not have the facility which the old version had!

    It was a simple matter there to specify the exact version of the page, site-wide, which you wanted indexed!

    That appears now to be gone. I have 301 redirects and canonicals operating to ensure that result, but Google still will not correctly index the pages!

    Pathetic.
