Just like a school trip, we got in early and we got the back seats in the Studio at Brighton SEO. This morning’s session was called ‘Crawl’ – and here’s what we learnt.
The first talk was by Dawn Anderson, titled Hunting For Googlebot – The Quest For ‘Crawl Rank.’ The talk focused on the core principles of crawling websites and Google’s crawlers themselves. There was a large amount of research, so it’s best to have a good read of the slides here.
Second up was Oliver Mason with a talk titled ‘Server Logs After Excel Fails.’ It was a talk I can relate to all too well. So often we end up with files that are too big for Excel to open and have to find other tools to work with them. Oliver’s talk discussed in detail how to use the command line to make sense of, and make use of, these otherwise unmanageable files. As most of the presentation was screenshots of chunks of code, I’d suggest checking out the slides.
Having really enjoyed Oliver’s talk, Barry had to step up with something special to maintain the momentum. And that he did. Barry was dosed up on painkillers as he’d put his back out earlier in the week. Not sure if that was also his excuse for his foul language throughout (which the audience enjoyed).
Barry’s talk was titled ‘How to Identify and Fix Crawl Optimisation Issues.’ Whilst most of Barry’s points were based on core fundamentals, it was refreshing to hear someone so passionate discuss a topic that can have a real and quick impact on organic results.
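The slides cover the command-line approach itself. Purely as an illustration of the same idea, and not something taken from Oliver’s slides, here is a minimal Python sketch that reads a log line by line (so file size stops being a problem) and counts requests whose user agent mentions Googlebot. It assumes the common ‘combined’ access log format, and the sample lines are made up.

```python
import re
from collections import Counter
from typing import Iterable

# Assumes the "combined" access log format; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines: Iterable[str]) -> Counter:
    """Count requests per path where the user agent mentions Googlebot.

    Note: this trusts the user-agent string, which can be spoofed.
    """
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "googlebot" in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits


# Made-up sample lines. With a real file, pass the open file object instead,
# e.g. `with open("access.log") as f: googlebot_hits(f)`.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [01/Jan/2016:09:00:01 +0000] "GET /shoes?sort=price HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [01/Jan/2016:09:00:02 +0000] "GET /shoes?sort=price HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [01/Jan/2016:09:00:03 +0000] "GET /shoes HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(googlebot_hits(sample).most_common())
```

Because the file is iterated rather than loaded in one go, memory use stays flat however large the log gets.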
How Google Crawls Your Site
Barry kicked off by talking about Google’s crawl sources. Googlebot needs to find a website’s pages, and here are a few ways that it does that:
- Site crawl
- XML sitemaps
- Inbound links
- DNS records
- Domain registrations
- Browsing data
Deep Crawl is one of Barry’s favourite tools – I’m not sure if they’re buying his beers – but I’m keen to give it a go following this morning’s talk. He showed us an example of one project he worked on, where a website had 96,000 URLs that had been crawled by the bot, but only 412 of these were unique pages.
This is huge crawl waste – Barry made it very clear (including colourful language) that the developers messed this one up. Barry also sang the praises of Screaming Frog (a free tool that also comes with a premium subscription) – we use this one on a daily basis. A second example was a website with over 73,000 URLs crawled; however, Google’s ‘site:’ operator showed only 250 or so pages in its index – Barry definitely highlighted that someone had a big problem here.
However, the height of Barry’s frustration and language came through as he discussed receiving a message through Google’s Search Console – ‘High number of URLs found on…’ – something we’ve helped remedy on behalf of one of our clients previously. So, now we know we’ve got a crawl issue and we’re wasting Google’s time, what should we do about it?
How to optimise crawl time
Check your XML sitemap. A common issue here is the same URL listed both with and without a trailing slash. It’s a simple fix, but spot it early to avoid wasting Google’s time.
Optimise your XML sitemaps:
- Ensure your sitemap contains final URLs only
- Minimise 301-redirects or other non-200 status codes
- Use multiple sitemaps to identify crawl waste in GSC
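The trailing-slash check mentioned above is easy to automate. Here is a rough sketch, assuming a standard sitemap using the sitemaps.org namespace; the function name and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def trailing_slash_duplicates(sitemap_xml: str) -> list[list[str]]:
    """Group sitemap URLs that differ only by a trailing slash."""
    root = ET.fromstring(sitemap_xml)
    groups = defaultdict(list)
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        key = url.rstrip("/")
        # Skip empty entries and exact repeats; we only want slash variants.
        if url and url not in groups[key]:
            groups[key].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


# Hypothetical sitemap content for illustration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/shoes</loc></url>
  <url><loc>https://example.com/shoes/</loc></url>
  <url><loc>https://example.com/boots/</loc></url>
</urlset>"""
print(trailing_slash_duplicates(sample))
```

Each group it returns is a pair you’d want to collapse down to the one final URL.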
One of Barry’s favourite topics is pagination and faceted navigation – mostly because so many people mess it up. This is particularly an issue for ecommerce websites. Another example that Barry shared was an ecommerce store with 1,500 products but 1.5m pages in Google’s index.
How to optimise paginated listings
- List more items on a single page – it’s simple really
- Implement rel=prev/next tags (pagination link tags) – these are awesome. Barry also advises having a ‘list all products’ page.
- Block sorting parameters in robots.txt. You can do this by adding Disallow: /*?sort=* to your robots.txt file; it’s also possible to do this via Google Search Console.
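To get a feel for what a wildcard rule like the one above will and won’t catch, here is a small sketch of Google-style pattern matching, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. It’s a simplification: it ignores Allow rules and rule precedence, the helper names are made up, and the example paths are hypothetical. As far as I know, Python’s built-in urllib.robotparser doesn’t handle these wildcards, hence the hand-rolled matcher.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Google-style robots.txt path rule into a regex.

    `*` matches any run of characters and a trailing `$` anchors the end;
    everything else is treated literally.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(path: str, disallow_rules: list[str]) -> bool:
    """True if any Disallow rule matches the path (query string included)."""
    # Pattern.match anchors at the start of the path, as robots.txt rules do.
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)


rules = ["/*?sort=*"]
for path in ["/shoes?sort=price", "/shoes?colour=red&sort=price", "/shoes"]:
    print(path, is_blocked(path, rules))
```

One thing this makes visible: the rule only matches when sort is the first parameter (straight after the `?`), so a URL with `&sort=` further along would slip through under this reading. Worth checking against your own URL patterns.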
Faceted navigation can also lead to an unoptimised crawl. Barry provided a few points on how to combat this:
How to optimise faceted navigation
- Decide which facets have SEO value and build static pages for these, e.g. categories or attribute pages. Yes this can take a while, but it’s the right thing to do – so do it.
- For all other facets, simply disallow Googlebot. You can do that via the robots.txt file, similar to the previous example.
Further crawl waste can be minimised by properly managing your internal site search results pages. Googlebot may try throwing random keywords into your site search – meaning you’ll end up with random indexed pages. Again, Barry advises blocking these pages using the robots.txt file.
Barry emphasised that editing your robots.txt can lead to problems if done incorrectly. Therefore, you should always test any updates in Google Search Console. It’s easy to mess this up – we’re only human – so always give it a test.
Internal redirects are one of Barry’s bugbears. Redirect chains are bad – they have a dampening effect – it has been reported that they can lose about 15% of your link value. If you have multiple redirects, you’re really reducing your link building impact. Don’t have internal redirects if you can avoid it.
Check for layers of redirects – trying to keep a reasonably flat website structure will help with this. To check your current redirects, use the Screaming Frog spider tool to crawl your site and filter to 301/302 redirects.
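Once you have a list of redirects out of a crawl, spotting the chains is straightforward. Here is a minimal sketch, assuming you’ve already turned the crawl export into a simple source-to-target mapping; the function names and URLs are hypothetical.

```python
def redirect_chain(url: str, redirects: dict[str, str], max_hops: int = 10) -> list[str]:
    """Follow `url` through the redirect map and return every URL visited, in order."""
    chain = [url]
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        chain.append(target)
        if chain.count(target) > 1:  # redirect loop: stop rather than go round again
            break
    return chain


def multi_hop_chains(redirects: dict[str, str]) -> dict[str, list[str]]:
    """Return only the starting URLs that need more than one redirect to resolve."""
    found = {}
    for url in redirects:
        chain = redirect_chain(url, redirects)
        if len(chain) > 2:  # start URL + final URL = one hop, so > 2 means a chain
            found[url] = chain
    return found


# Hypothetical redirect map: source URL -> redirect target.
redirects = {
    "/old-shoes": "/shoes-2015",
    "/shoes-2015": "/shoes",
    "/promo": "/sale",
}
print(multi_hop_chains(redirects))
```

Anything it flags is a candidate for pointing the first URL straight at the final destination.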
Another one of Barry’s favourite issues is canonicalised pages. One example Barry showed had 158,276 canonicalised pages.
Therefore, we must use canonicalisation wisely. Adding a canonical tag won’t fix or reduce crawl waste – Google still needs to crawl the page!
Don’t use canonicalisation for:
- Faceted navigation
- Pagination and sorting
- Site search pages
Do use it for:
- Separate mobile URLs
- Session-specific URL parameters
- Content syndication
- Unavoidable content duplication
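Barry’s point that Google still needs to crawl a canonicalised page makes sense when you look at where the tag lives: it sits in the page’s own HTML, so the page has to be fetched before the tag can be read. Here is a rough sketch using Python’s standard html.parser; the helper names and example markup are made up.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if there isn't one."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def is_canonicalised(page_url: str, html: str) -> bool:
    """True when the page points to a different URL as its canonical."""
    canonical = find_canonical(html)
    return canonical is not None and canonical != page_url


# Hypothetical page: a sorted listing that points back at the main listing.
page = '<html><head><link rel="canonical" href="https://example.com/shoes"></head><body></body></html>'
print(find_canonical(page))
print(is_canonicalised("https://example.com/shoes?sort=price", page))
```

Run over a crawl, a check like this gives you a count of canonicalised pages, which is the kind of number Barry was showing.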
Barry touched on site speed at the end of his talk. Google has a set amount of time to crawl your website. If you’ve got a slow load speed, you’re limiting your crawl time from Google.
Therefore, you should look to optimise your load speed. Time To First Byte (TTFB) is used to understand the responsiveness of a web server. Google takes direction from TTFB to know how quickly to crawl your website. Barry emphasised having lightweight pages and making use of caching and compression.
To summarise, Barry said ‘don’t let the search engines do the hard work’ – this will have a negative impact on your website’s organic performance.
The time flew by. It’s now 11am and the end of the first session. Time to jump into the Corn Exchange and check out some of the sponsors!