It’s now 2018 and yet many digital marketers we speak with still know little about the power of log file analysis for search engine optimisation (and user experience) gains.
What is a log file?
When websites and content delivery networks serve your files to your users or search engine bots, a small trace is left behind in a log file — one line per asset served.
The format varies between web servers and their configuration, but typically the same information is recorded: remote IP address, timestamp, HTTP method, URL requested, status code, size in bytes, user agent string, etc. You can also customise the format to suit specific needs.
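As a rough illustration of those fields, here is a minimal Python sketch that splits one line into named parts. It assumes the widely used "combined" log format; the sample line, IP address and URL are made up for the example, and the pattern will need adjusting if your format differs.

```python
import re

# Assumes the "combined" access log format; adjust if your format differs.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample line, for illustration only.
sample = (
    '66.249.66.1 - - [12/Mar/2018:10:15:32 +0000] '
    '"GET /blog/log-files/ HTTP/1.1" 200 5316 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
record = parse_line(sample)
print(record["url"], record["status"], record["size"])
```

Lines that do not match return None rather than raising, so a malformed entry will not stop a long run.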
It’s from these footprints that technical SEOs can piece together a search engine crawler bot’s activity on a web host, and in turn evaluate, over time, how the site is being crawled. On larger websites, a major concern of technical SEOs is the allocation of Google’s crawl budget.
In addition, without filtering bot traffic, you can sample your entire web server’s browsing history and look for user-related errors in order to resolve them.
How can you analyse a log file?
My favourite post and related talk by far on the subject is this one on ohgm.co.uk, as it caters for enterprise-level, Excel-breaking data. The favourite tool at Impression, though, is Screaming Frog’s Log File Analyser, which brings data into a readable format a little more easily.
If your data set isn’t too large, then importing it into Excel is a good beginner route: with a quick “Text to Columns” transformation you can set up a pivot table to begin querying the data.
What can you find?
In short, analysing a log file can be done in many ways depending on what you’re looking for, but typically you might want to:
Filter Googlebot traffic
Understanding Google’s allocation of crawl budget is key to technical SEO. Uncover bots visiting pages you don’t want them to, or discover directories getting more than their fair share of time or download resource. Also, identify pages which are visited regularly and ensure these fit into your perception of important pages on your site.
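One way to see where crawl activity is going is to count Googlebot requests per top-level directory. The sketch below assumes you already have (URL, user agent) pairs extracted from your log; the function names and sample URLs are hypothetical. It matches on the user agent string only, which can be spoofed, so treat the counts as indicative.

```python
from collections import Counter

def top_level_section(url):
    """Map '/blog/post-1/?x=1' to '/blog/'; root-level pages map to '/'."""
    path = url.split("?", 1)[0]
    segments = path.strip("/").split("/")
    if not segments[0] or (len(segments) == 1 and not path.endswith("/")):
        return "/"
    return "/" + segments[0] + "/"

def googlebot_hits_by_section(records):
    """Count requests whose user agent mentions Googlebot, per section."""
    hits = Counter()
    for url, agent in records:
        # User agent match only; this does not verify the requester.
        if "Googlebot" in agent:
            hits[top_level_section(url)] += 1
    return hits

GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# Hypothetical records, for illustration only.
records = [
    ("/blog/a/", GOOGLEBOT),
    ("/blog/b/", GOOGLEBOT),
    ("/search/?q=x", GOOGLEBOT),
    ("/blog/a/", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
    ("/", GOOGLEBOT),
]
hits = googlebot_hits_by_section(records)
for section, count in hits.most_common():
    print(section, count)
```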
Compare desktop and mobile bot user agent activity
Your browser (or bot) will have a set user agent string – a small sentence that describes the type of machine or software you’re using to request the web page from the server. It’s in this small detail that you can filter in or out traffic from different sources. You can see all of Google’s own bot names and user agent strings listed here.
A few examples:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
or, for the smartphone crawler:
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
With Google continuously rolling out better mobile experiences, it’s always a good idea to check that your site is being interpreted as you intend by the smartphone bot variants.
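To compare the two, you could bucket Googlebot requests by user agent. The rule used here (a Googlebot string containing "Mobile" is the smartphone crawler) is a simplifying assumption based on the example strings above, not an official classification, and the function name is hypothetical.

```python
from collections import Counter

DESKTOP = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SMARTPHONE = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 "
    "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)

def googlebot_variant(agent):
    """Return 'smartphone', 'desktop', or None for non-Googlebot agents."""
    if "Googlebot" not in agent:
        return None
    # Heuristic: only the smartphone crawler's string contains "Mobile".
    return "smartphone" if "Mobile" in agent else "desktop"

# Hypothetical list of user agents pulled from a log.
agents = [DESKTOP, SMARTPHONE, SMARTPHONE, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"]
split = Counter(v for v in map(googlebot_variant, agents) if v)
print(split)
```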
Compare crawls and your sitemaps
Google’s Search Console gives nice stats about the number of submitted vs. indexed pages, but if you want to get granular and identify problem pages, you’ll need to use an indexation checker, like Greenlane’s, and also head to your log files to check that the unindexed pages weren’t crawled.
From there you’ll know whether or not a page was originally crawled — so if it was, and still isn’t indexed, then you’ve likely got other problems at hand.
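That comparison boils down to a few set operations. The sketch below assumes you have three lists of URL paths to hand: those in your sitemap, those Googlebot requested according to your logs, and those an indexation checker reported as unindexed. All names and sample paths are hypothetical.

```python
def crawl_gaps(sitemap_paths, crawled_paths, unindexed_paths):
    """Compare sitemap, crawled and unindexed URL paths."""
    sitemap = set(sitemap_paths)
    crawled = set(crawled_paths)
    unindexed = set(unindexed_paths)
    return {
        # In the sitemap, but the bot never requested them.
        "never_crawled": sorted(sitemap - crawled),
        # Requested by the bot, but missing from the sitemap.
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
        # Crawled yet still unindexed: look for other problems here.
        "crawled_but_unindexed": sorted(unindexed & crawled),
    }

# Hypothetical inputs, for illustration only.
gaps = crawl_gaps(
    sitemap_paths=["/", "/services/", "/blog/a/", "/blog/b/"],
    crawled_paths=["/", "/blog/a/", "/tag/old/"],
    unindexed_paths=["/blog/a/", "/blog/b/"],
)
for name, paths in gaps.items():
    print(name, paths)
```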
Establish crawl frequency on top files
If you run a busy website and you’re interested in knowing how search engines are accessing key files, like your sitemap or robots.txt, then you can take a long-term view over the file’s pathname in the logs and check accurate crawl frequency, per bot.
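A minimal sketch of that long-term view: count requests for one path per day, per bot. It assumes timestamps in the common log style (for example 12/Mar/2018:10:15:32 +0000) and identifies bots by a substring of the user agent; the BOTS tuple and the sample records are assumptions for the example.

```python
from collections import Counter
from datetime import datetime

# Substrings to look for in the user agent; extend as needed.
BOTS = ("Googlebot", "bingbot")

def daily_hits(records, path):
    """Count requests for `path` per (day, bot)."""
    counts = Counter()
    for timestamp, url, agent in records:
        if url.split("?", 1)[0] != path:
            continue
        day = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z").date()
        for bot in BOTS:
            if bot in agent:
                counts[(day.isoformat(), bot)] += 1
    return counts

GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BINGBOT = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
# Hypothetical records, for illustration only.
records = [
    ("12/Mar/2018:10:15:32 +0000", "/robots.txt", GOOGLEBOT),
    ("12/Mar/2018:18:02:11 +0000", "/robots.txt", GOOGLEBOT),
    ("13/Mar/2018:09:00:00 +0000", "/robots.txt", BINGBOT),
    ("13/Mar/2018:09:05:00 +0000", "/sitemap.xml", GOOGLEBOT),
]
robots_hits = daily_hits(records, "/robots.txt")
for (day, bot), count in sorted(robots_hits.items()):
    print(day, bot, count)
```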
Identify accurate server and URL errors
Again, Search Console tries to give you this information via its Crawl Report, but if you have intermittent issues on your website, then it’s unlikely Google’s report will identify these.
The issues I’m discussing are HTTP responses with status codes of 400 and above (404 and 500 being common ones). Rather than filtering on user agents, look at ALL traffic requests and filter on HTTP response codes instead. This will also include AJAX form submissions, data requested or loaded after a page load, etc. Identify all troublesome URLs and get down to investigating!
This tip can also be handy outside the realms of SEO: if a form fails to submit 1 time in 10, or a payment gateway times out on 2% of requests, you’ll soon know about it!
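Intermittent failures show up most clearly as an error rate per URL. The following sketch assumes (URL, status code) pairs taken from all traffic, not just bots; the function name, the min_requests parameter and the sample data are hypothetical.

```python
from collections import defaultdict

def error_rates(records, min_requests=1):
    """Return (url, error_rate) pairs, worst first, for statuses >= 400."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for url, status in records:
        totals[url] += 1
        if status >= 400:
            errors[url] += 1
    rates = {
        url: errors[url] / totals[url]
        for url in errors
        if totals[url] >= min_requests
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Hypothetical traffic: a checkout that fails 1 request in 10.
records = (
    [("/checkout", 200)] * 9
    + [("/checkout", 504)]
    + [("/old-page", 404)] * 2
    + [("/", 200)]
)
rates = error_rates(records)
for url, rate in rates:
    print(f"{url}: {rate:.0%} of requests failed")
```

Raising min_requests filters out URLs that were only requested once or twice, where a single error would otherwise look like a 100% failure rate.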
Audit request file sizes
If you’re hosting large files on your website, then you may again be wasting your crawl budget. Ensure the file size (in bytes) is included in your access log format, and check for extremely large files. Think about whether or not these files are useful to a search engine, and then either compress them, host them elsewhere, or even consider blocking access to the resource in order to streamline crawl budget.
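A quick way to surface the heaviest resources is to keep the largest response size seen for each URL and list anything over a threshold. This sketch assumes (URL, size in bytes) pairs; the 1 MB default threshold and the sample paths are arbitrary choices for the example.

```python
def large_responses(records, threshold_bytes=1_000_000):
    """Return (url, largest_size) pairs at or above the threshold, largest first."""
    largest = {}
    for url, size in records:
        largest[url] = max(size, largest.get(url, 0))
    big = [(url, size) for url, size in largest.items() if size >= threshold_bytes]
    return sorted(big, key=lambda item: item[1], reverse=True)

# Hypothetical records, for illustration only.
records = [
    ("/brochure.pdf", 8_400_000),
    ("/img/hero.png", 2_100_000),
    ("/img/hero.png", 0),  # e.g. a response with no body
    ("/blog/a/", 5316),
]
heavy = large_responses(records)
for url, size in heavy:
    print(url, round(size / 1_000_000, 1), "MB")
```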
Identify rogue query parameters
Does your website suffer from excessive faceted navigation and an over-reliance on query parameters (URLs with lots of question marks and ampersands in them)? If so, or if it has in the past, check that these URLs aren’t still being crawled. As long as the base page URL still exists, the query parameter versions of that URL will usually still resolve too, so bots can keep requesting them.
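To see which parameters are still attracting requests, you could tally parameter names across the requested URLs. This sketch uses Python's standard urllib.parse module; the function name and sample URLs are hypothetical, and in practice you would feed it only the URLs requested by the bot you care about.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

def parameter_counts(urls):
    """Count how often each query parameter name appears."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # keep_blank_values so that '?sort=' is still counted.
        for name, _value in parse_qsl(query, keep_blank_values=True):
            counts[name] += 1
    return counts

# Hypothetical crawled URLs, for illustration only.
urls = [
    "/shoes?colour=red&size=9",
    "/shoes?colour=blue",
    "/shoes",
    "/shoes?sort=price&colour=red",
    "/shoes?sort=",
]
params = parameter_counts(urls)
for name, count in params.most_common():
    print(name, count)
```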
Check up on response times
Ensure that your log format records how long each request took to serve (for example, NGiNX’s $request_time variable, or Apache’s %D directive, which logs microseconds) so that you can complete this step. Google downloads only a predetermined amount of data per website, but it can also run out of its allocated time budget. Time is money, after all, especially with cloud computing!
Audit pages which take longer than the average and establish what might be causing this issue. If you run a dynamically templated website (as most content management systems are), then consider heavily caching your website so that it behaves as if the rendered HTML comes from static files. This drastically reduces your server overhead and massively speeds up response times, too.
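As a starting point for that audit, the sketch below averages the response time per URL and lists the URLs that are slower than the overall average. It assumes (URL, seconds) pairs, so microsecond values would need dividing by 1,000,000 first; the function name and sample timings are made up.

```python
from collections import defaultdict
from statistics import mean

def slower_than_average(records):
    """Return the overall mean and the (url, mean_seconds) pairs above it."""
    by_url = defaultdict(list)
    for url, seconds in records:
        by_url[url].append(seconds)
    overall = mean(seconds for _url, seconds in records)
    slow = [
        (url, mean(times))
        for url, times in by_url.items()
        if mean(times) > overall
    ]
    return overall, sorted(slow, key=lambda item: item[1], reverse=True)

# Hypothetical timings in seconds, for illustration only.
records = [
    ("/", 0.1),
    ("/", 0.3),
    ("/search", 1.2),
    ("/about", 0.2),
]
overall, slow = slower_than_average(records)
print(f"overall average: {overall:.2f}s")
for url, seconds in slow:
    print(url, f"{seconds:.2f}s")
```

A mean is easily skewed by a handful of very slow requests, so a median or percentile may be a better yardstick on busy sites.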
How to get to your access logs
Depending on the hosting environment you’re using, this may vary slightly.
On NGiNX, your log file can typically be found here: /var/log/nginx/access.log. Bear in mind that it will not necessarily rotate automatically, so it can keep growing until it exhausts your storage capacity. See this guide on how to configure access log rotation on NGiNX.
On Apache, your log files are typically stored at /var/log/apache2/access.log or /var/log/httpd/access.log or similar.
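Once rotation is in place, a directory usually holds the current log plus older copies, some of them gzip-compressed. The sketch below reads them all as one stream of lines. It assumes rotated files share the base name (for example access.log.1 or access.log.2.gz), which depends on how your rotation is configured, and it reads files in filename order rather than strict chronological order.

```python
import glob
import gzip
import os

def iter_log_lines(log_dir, basename="access.log"):
    """Yield lines from `basename` and any rotated copies (plain or .gz)."""
    pattern = os.path.join(log_dir, basename + "*")
    for path in sorted(glob.glob(pattern)):
        # Rotated logs are often compressed; pick the matching opener.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

For example, `sum(1 for _ in iter_log_lines("/var/log/nginx"))` would count every logged request across the current and rotated files, provided you have permission to read them.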
CDN / other users
Typically, some setup may be required to ensure full access logging is activated and stored in a useful way. We use CloudFront for CDN hosting, and to achieve access logging there, a configuration with Amazon S3 is required.
This is only a short introductory post – fortunately many industry commentators have already gone into good depth on this topic before. Here are a few of our favourite posts on the topic:
Built Visible: The Ultimate Guide to Log File Analysis
Screaming Frog: 22 Ways to Analyse Log Files
Slideshare: Log File analysis with Big Query