Googlebot: SEO Mythbusting


A lot of confusion revolves around SEO because no one understands how Googlebot actually works. Hello and welcome to another episode of SEO Mythbusting. With me today is Suz Hinton from Microsoft. Suz, what do you do at work, and what is your experience with front-end and SEO?

Right now I’m doing less front-end. I focus more on IoT.

So in the time you were a front-end developer...

Yeah, I was a front-end developer for, I think, 12 or 13 years, so I got to work in lots of different contexts in front-end development, different websites, things like that.

Cool.

Today I wanted to address a bunch of stuff about Google, but specifically nerd out about Googlebot, because that was the side of things that I was the most confused about at the time.

So Googlebot is basically a program that we run that does three things.

The first thing is it crawls, then it indexes, and then, last but not least, there’s another thing that is not really Googlebot anymore: the ranking bit. So we basically have to grab the content from the internet, and then we have to figure out what this content is about and what the stuff is that we can put out to users looking for these things. And then, last but not least, which of the many things that we picked for the index is the best thing for this particular query at this particular time, right? But the ranking, the last bit where we move things around, is informed by Googlebot; it’s not part of Googlebot.

Is that because there’s this bit in the middle, the indexing? Googlebot is responsible for the indexing and making sure that the content is useful for the ranking engine to kind of...

Yes, absolutely. You can imagine it like a library: someone has to figure out what the books are about and get the index of the books into a catalog, the catalog being our index, really. And then someone else is using that index to make informed decisions and going, “Here, this book is what you’re looking for.”

I’m really glad you used that analogy, because I worked in a library for four years.

And I was that person. People would be like, “I want Italian cookbooks,” and I’m like, “Well, that’s at 641.5495.”

You just say... If I came to you as a librarian and asked a very specific question, like, “What is the best book on making apple pies really quickly?”, would you be able to figure that out from the index? You probably had lots of cookbooks.

We did. Yeah, we had a lot, but given that I also put lots of books back on the shelf, I knew which ones were popular. I’ve no idea if we can link this back to Googlebot.

But it does. It’s pretty much that. So you have the index, which probably doesn’t really change that much unless you add new books or new editions.

Right, exactly.

Yeah, so you have this index, which Googlebot provides you with, but then we have the second part, the librarian, that basically, based on how the interactions with the index work, figures out which books to recommend to someone asking for them.
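The split between cataloging and recommending in the library analogy can be illustrated with a toy Python sketch. Everything here (the book titles, keywords, and borrow counts) is invented for illustration, and it says nothing about how Google’s index or ranking actually work; it only shows that building an index and choosing among the matches are separate steps.

```python
from collections import defaultdict

# Hypothetical toy catalog: title -> (subject keywords, times borrowed).
BOOKS = {
    "Quick Apple Pies": ({"apple", "pie", "baking"}, 42),
    "Italian Home Cooking": ({"italian", "cooking"}, 30),
    "The Big Book of Pies": ({"pie", "baking"}, 17),
}


def build_index(books):
    """Indexing step: map each keyword to the titles filed under it."""
    index = defaultdict(set)
    for title, (keywords, _borrowed) in books.items():
        for keyword in keywords:
            index[keyword].add(title)
    return index


def recommend(index, books, query_words):
    """Ranking step: among titles matching every query word, put popular ones first."""
    matches = None
    for word in query_words:
        titles = index.get(word, set())
        matches = titles if matches is None else matches & titles
    return sorted(matches or [], key=lambda title: books[title][1], reverse=True)


index = build_index(BOOKS)
print(recommend(index, BOOKS, ["pie"]))
```

The index only changes when books are added, while the ordering depends on a separate popularity signal, which mirrors the librarian knowing which books get borrowed most.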

So that’s pretty much the exact same thing there: someone figures out what goes into the catalog, and then someone uses that.

I love this. This makes total sense to me.

But I guess that’s still not necessarily all the answers you need, right?

Yeah, I just want to know: what does it actually do? How often does it crawl sites? What does it do when it gets there? How is it generally behaving? Does it behave like a web browser?

That’s a good question. Generally speaking, it behaves a little bit like a browser; at least part of it does. The very first step, the crawling bit, is pretty much a browser coming to your page, either because we found a link somewhere, or you submitted a sitemap, or there’s something else that basically fed that into our systems. You can use Search Console to give us a hint and ask for reindexing, and that triggers a crawl.

I’ve done that before. I’ve asked for it to be done.

And that is perfectly fine. But the problem then, obviously, is how often do you crawl things, how much do you have to crawl, and how much can the server bear, right? If you’re on the back-end side, you know that you have a bunch of load, and that might not always be the same: if it’s Black Friday, then the load is probably higher than on any other day. So what Googlebot does is it tries to figure out, from what we have in the index already, whether this is something that looks like we need to check it more often.

Is that probably something like a newspaper?

Got it.

Or is that something like a retail site that has offerings that change every couple of weeks? Or do they not change at all, because this is actually the site of a museum that changes very rarely: for the exhibitions maybe, but a few bits and pieces don’t change that much. So we try to segregate our index data into something that we call “daily” or “fresh”, and that gets crawled relatively frequently, and then it becomes less and less frequent as we discover more. And if it’s something that is super spammy or super broken, we might not crawl it as often. Or if you specifically tell us, “Do not index this, do not put this in the index, this is something that I don’t want to show up in the search results,” then we don’t come back every day and check, right? So you might want to use the reindex feature if that changes. You might have a page where you go, “No, this shouldn’t be here,” and then once it has to be there, you want to make sure that we are coming back and indexing it again.

So that’s the browser bit, that’s the crawler part. But then a whole slew of stuff happens in between us fetching the content from your server and the index having the data that is then being served and ranked. So the first thing is we have to make sure that we discover if you have any other resources on your page, right?

The crawling cycle is very important. So what we do is, the moment we have some HTML from you, we check if we have any links in there, or images for that matter, or video, something that we want to crawl as well, and that feeds right back into the crawling mechanism. Now, if you have a gigantic retail site, let’s say, just hypothetically speaking, we can’t just crawl all the pages at once, both because of our own resource constraints and because we don’t want to overwhelm your service. So we basically try to figure out how much strain we can put on your service and how many resources we’ve got available as well, and that’s oftentimes called the “crawl budget”. But it’s pretty tricky to determine. So one thing that we do is we crawl a little bit and then basically ramp it up, and when we start seeing errors, we ramp it down a little bit more.

So like, “Oh, sorry for that,” whenever your server serves us 500 errors. And there are certain tools in Search Console that allow you to say, “Hey, can you maybe chill out a little bit?” But generally we don’t try to get all of it at once and then ramp down. We’re trying to carefully ramp up, ramp down again, ramp up again, ramp down, so it fluctuates a little bit.

There’s a lot more detail in there than I was even expecting. I guess I never considered that a Googlebot crawling event could put strain on somebody’s website. That sounds like it’s a lot more common than I even thought.

It does happen, especially if we discover, say, a page that has lots of links to subpages; then all of these go into the crawling queue.

Got it.

And then these might have links too. Let’s say you have 30 different categories of stuff, and each of these has a few thousand products, and then a few thousand pages of products. So we might go, “Oh, cool, crawl,” and then we might crawl a few hundred thousand pages if we don’t spread that out a little bit.
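The “ramp up carefully, ramp down on errors” behavior described here can be sketched as a small rate controller in Python. This is only an illustration of the general idea: the class name, the starting rate, the step size, and the halving rule are all assumptions made for the example, not Googlebot’s actual algorithm or numbers.

```python
class CrawlRateController:
    """Toy controller: ramp up slowly on success, back off on server errors."""

    def __init__(self, start_rate=1.0, max_rate=10.0, min_rate=0.1):
        # Rates are in requests per second; all values are hypothetical.
        self.rate = start_rate
        self.max_rate = max_rate
        self.min_rate = min_rate

    def record(self, status_code):
        """Adjust the crawl rate after one response and return the new rate."""
        if 500 <= status_code < 600:
            # The server looks overloaded: halve the rate (assumed policy).
            self.rate = max(self.min_rate, self.rate / 2)
        else:
            # Healthy response: increase the rate a little (assumed step).
            self.rate = min(self.max_rate, self.rate + 0.5)
        return self.rate


controller = CrawlRateController()
for status in [200, 200, 200, 500, 200]:
    print(status, controller.record(status))
```

Increasing in small steps and cutting sharply on a 5xx response makes the rate fluctuate around whatever the server can bear, which matches the up-and-down pattern described in the conversation.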

So it’s a weird balance, right? On one hand, if you add a new product, you want that to be surfaced in Search as quickly as possible; on the other hand, you don’t want us to take all the bandwidth that you serve. I mean, cloud computing makes that a little less scary, I guess. But I remember the days, I’m not sure if you remember the days, when you had to call someone, and they asked you to send a form or fax a form, and then two weeks later you got the confirmation letter that your server had been set up.

Yes, I remember the days when we would have to call and then basically pay $200 to have a human go down the aisles and push the physical reset button on the server.

So yeah, those times. And then imagine you’re basically renting five servers somewhere in a data center, and that taking a week, and then we come in and scoop up all your bandwidth. “Hey, we’re offline today because Google has its crawl day.” That’s not what we want.

Yeah, these days it’s more of a Hacker News kind of moment.

Yeah, exactly.

So I feel like you’re much more considerate.

Yeah, we try not to overwhelm anyone, and we respect robots.txt. So that works within the crawl step as well. And once we have the content, we can’t put strain on your infrastructure anymore, so that’s fantastic. But with modern web apps being mostly JavaScript, we then put that in a queue, and then once we have the resources to render it, we actually use another headless browser kind of thing.

We call that the Web Rendering Service. Then there are other crawlers as well that might not have the capacity or the need to run JavaScript. These are social media bots, for instance. They come and look for metadata, and if that meta tag is coming in via JavaScript, you usually have a bad time, and they’re just like, “Sorry.”

Yeah, so that’s always been a big myth. I remember when single-page applications, or SPAs, really came into vogue, a lot of people were really concerned.

There’s a lot of FUD around: if crawlers in general don’t execute JavaScript, then they’re going to see a blank page, and how do you get around that? So contextually, within Googlebot, it sounds like Googlebot executes JavaScript, even if it does so at a later point.

Yes.

So that’s good. But is there anything that people need to be aware of beyond just, “Oh well, it’ll just run it, and then it’ll see exactly the same thing as a human with a phone or a desktop”?

Let’s see.

There’s a bunch of things that you need to be aware of. The most important thing is, again, as you said, it’s deferred. It happens at a later point. So if you want us to crawl your stuff as quickly as possible, that also means we have to wait to find these links that JavaScript injects. Basically, we crawl, we have to wait until the JavaScript is executed, then we get the rendered HTML, and then we find the links. So the nice little short loop that finds links relatively quickly right after crawling will not work, right? We will only see the links after we render, and this rendering can take a while because the web is surprisingly big.

Yeah, just a little bit.

Like 30 trillion docs in 2016, so I’d say now there’s way more than that.
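To make the deferred-link point concrete, here is a minimal Python sketch using the standard library’s `html.parser`. The two HTML strings are made up for the example: `RAW_HTML` stands for what a crawler fetches, and `RENDERED_HTML` is a hand-written stand-in for what a rendering step might produce after running the script. No real rendering happens in this code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags found in the markup itself."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical page as fetched: one static link, one link added by script.
RAW_HTML = """
<a href="/about">About</a>
<div id="app"></div>
<script>
  var link = document.createElement("a");
  link.href = "/products";
  document.getElementById("app").appendChild(link);
</script>
"""

# Hand-written stand-in for the same page after JavaScript has run.
RENDERED_HTML = """
<a href="/about">About</a>
<div id="app"><a href="/products"></a></div>
"""

print(extract_links(RAW_HTML))       # only the static link is visible
print(extract_links(RENDERED_HTML))  # the injected link appears after rendering
```

Links that exist in the initial HTML can be queued right after the fetch, while links injected by JavaScript only become discoverable once rendering has finished.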

Yes, more than that. So robots.txt is very effective at being able to tell bots how to do a certain thing. But in this scenario, how do you tell that it’s Googlebot visiting your site?

Good question. So as we are basically using a browser in two steps, one being the crawling and one being the actual rendering, at both of these moments we do give you the user agent header. Basically, there’s literally the string “Googlebot” in there.

That’s pretty straightforward.

Yes, and you can actually use that to help with your SPA performance as well.
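Checking for that string on the server could look roughly like the following Python sketch. The function name is made up for the example, the check is a plain case-insensitive substring match, and the sample strings are only illustrative user agent values.

```python
def is_googlebot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the string 'Googlebot'."""
    return "googlebot" in user_agent.lower()


# Illustrative header values.
EXAMPLES = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/120.0",
]

for user_agent in EXAMPLES:
    print(is_googlebot(user_agent), user_agent)
```

A server could branch on this result when deciding what to send back. Since any client can send any user agent string, a check like this is best treated as a hint rather than proof of who is visiting.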

So, as you can detect on the server side, “Oh, this is a Googlebot user agent requesting,” you might consider sending us a pre-rendered static HTML version. And you can do the same thing for the others: all the other search engines and social media bots have a specific string saying that they are a robot.

Okay.

So you can then basically go, “Oh, in this case I’m not giving you the real deal, the single-page app. I’m giving you this HTML that we pre-rendered for you.” That’s called dynamic rendering. We have docs on that as well.

The one thing that still doesn’t quite make sense to me is: does Googlebot have different contexts? I think of it as this little mythical creature that’s pretending to do certain things. So does it pretend to be on mobile and then desktop? Are there different user agents, even though it still says Googlebot, and can you differentiate between them?

You’re asking great questions, because yes, we have different user agents. I’m not sure if you’ve heard about mobile-first indexing being rolled out.

I’ve heard that it’s going to affect how you’re ranked.

Potentially. Those are two different things that get conflated so often. Mobile-first indexing is about us discovering your content using a mobile user agent and a mobile viewport. So we are using mobile user agents, and the user agent string says so: it says something about Android in the name, and then you’re like, “Aha, so this is the mobile Googlebot.”

You have documentation on that?

There’s literally a Help Center article that lists all these things. So we try to index mobile content to make sure that we have something nice to serve for people who are on mobile. But we’re not pretending to be random user agents or anything; we stick to the user agent strings that we have documented. And that’s mobile-first indexing, where we try to get your mobile content into the index rather than the desktop content.

Huh.

And then there’s mobile readiness, or mobile-friendliness. If your page is mobile-friendly, it makes sure that everything is within the viewport and you have large enough tap targets and all these kinds of lovely things, and that is just a quality indicator. We call these signals; we have over 200 of them.

That’s a lot.

So Googlebot collects all these signals and then stuffs them as metadata into the index. And then when we rank, we’re like, “Okay, so this user’s on mobile, so maybe this thing that has a really good mobile-friendliness signal attached to it might be a better one than the thing where they have to pinch-zoom all the way out to be able to read anything and then can’t actually deal with the different links because they’re too close to each other.” So it’s not the signal; it’s one of the over 200 signals to deal with.

I had no idea.

There were 200, right? That’s like... I know that you’re not allowed to share what they all are, because there has to be a certain mystique around it, because I guess a lot of SEOs abused that in the past.

Yeah, unfortunately that is a game that is still being played, and people are doing weird stuff to try to game us. And the interesting thing with the 200 signals is that it’s really hard to say which one gets what weight, and they keep moving and they keep changing. So I love it when people are like, “No, let’s do this, and look, my rank changes.” Yeah, for this one query, but you lost on all the other queries because you did really weird and funky stuff for that.

So just build good content for the users and then you’ll be fine.

I feel like that’s less effort as well than constantly trying to...

Yeah, but it’s not an easy answer, right? You pay me to make you more successful on search engines, and I come to you and say, “So who are your users, and what do they need, and how could you express that so that they know that it’s what they need?” That’s a hard one, because that means I basically bring the ball back to you, and you have to think about stuff and figure things out strategically. Whereas I could say, “Okay, I’m just going to get you links or do some funky tricks here, and then you’ll be ranking number one.”

That’s an easier answer. It’s the wrong answer, but it’s the easier answer. So people are like, “Links are the most important metric ever,” and I’m like, “No, we have over 200, and it’s important, but it’s not that important, and chill out, everybody.” But this still happens.

Yeah. I’m so glad it’s better now. I feel we’re at peace in general with SEO as well.

Suz, thank you so much for being with me here. It has been a great pleasure.

Thanks for answering all of my weird and wonderful questions about Googlebot.

Did we bust some myths?

I feel like we did.

Fantastic. I think that’s worth a high five.

Thanks.

Thanks. Join us again for the next episode of SEO Mythbusting, where Jamie Alberico and I will discuss if JavaScript and SEO can be friends and how to get there.