There are many ways to run a server log analysis, but each one requires a long data preparation phase: retrieving the logs, grouping them, splitting the data into columns, cleaning, and so on. These steps can seem really tedious, especially if they are done manually…
In this technical guide, I will show you how to get through all these steps quickly and exploit the logs of an OVH web hosting plan with Dataiku, a data science tool.
This tool is a small revolution in the data processing field and you will soon understand why.
I forgot…
There is a FREE version of Dataiku, which has the added advantage of being installed locally, so your data stays on your machine.
Contents
- Chapter 1: Getting Started with Dataiku DSS for Logs Analysis
- Install Dataiku DSS
- Connecting OVH server logs to Dataiku
- Your first analysis
- Chapter 2: Preparing logs for SEO analysis
- Cleaning server logs
- Filter search engine spiders and robots (GoogleBot, BingBot, etc)
- Chapter 3: Advanced Log Analysis for SEO
- HTTP Status code and crawl errors
- Crawl frequency per URL
- Crawler behaviours on mobile and AMP pages
- Screaming Frog Crawl + Server log Analytics
- How does your website influence crawlers? Find your own correlations
PART 1: Your first server log analysis with Dataiku
Install Dataiku DSS
Start by installing Dataiku DSS on your computer: https://www.dataiku.com/dss/trynow/
Connect your OVH logs to Dataiku
Go to the administration panel (gear icon at the top right), then open the Plugins tab.
Search for the OVH Logs Importer plugin (by Jérémy Grèze) and install it.
Go back to the home page and click on + at the top left. Choose “New Project”.
Click the “+ Import your first dataset” button and select “OVH logs”.
A new page appears.
Fill in the fields with your OVH credentials, the domain name you wish to study and the analysis period you are interested in. Then click the “Test & get Schema” button to start retrieving your server logs.
Which period to study?
I advise you to study at least the last 30 days for a representative analysis. It will allow you, for example, to discover the pages ignored by Google Bot.
The analysis may take a few minutes depending on the period you have selected.
Once the recovery is complete, a sample of your logs appears on the right of the screen. As you can see, the data is already neatly split into columns: magic! Give your dataset a name and click the “Create” button at the top of the screen.
You will now be able to start analyzing the server logs of your website: analyze the traffic, check the status codes encountered by visitors and bots, and so on.
We will start with some basic manipulations so that you can take Dataiku DSS in hand quickly.
Note: Dataiku DSS works on a sample of 10,000 rows to speed up processing. You can enlarge the sample according to your computer's capacity: just click on “10,000 rows” under Design Sample in the top left corner.
Your first analysis
On your dataset, click the LAB button at the top right. A pop-up appears; click on “NEW Prepare data and build models” to create your analysis.
Click the green “+ ADD A NEW STEP” button and choose the library you are interested in.
You will have to repeat this operation each time I mention creating an analysis.
Status codes encountered by visitors
Click the “status” column heading, then “Analysis…”. Congratulations, you have just run your first server log analysis!
Very simplistic, but really interesting: here you have the distribution of all the status codes actually encountered by the visitors (humans and bots) of your site.
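If you ever want to reproduce this outside the DSS interface, for example in a Python recipe, a minimal pandas sketch gives the same distribution. The file name and the status/host/request column names are assumptions based on the schema shown above:

```python
import pandas as pd

# Load the prepared OVH logs (file name is an assumption for this sketch)
logs = pd.read_csv("ovh_logs.csv")

# Distribution of all HTTP status codes encountered by visitors (humans and bots)
print(logs["status"].value_counts())

# Example: focus on the 404 errors only
print(logs[logs["status"].astype(str) == "404"][["host", "request"]].head())
```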
Tip: Clicking the gear icon of each row lets you perform processing on your dataset. For example, display only the rows that contain the status code 404 to prioritize your strategy.
Visitor IP addresses geolocation
Dataiku DSS offers several ways to enrich your datasets via external APIs and databases.
Here we will enrich the IP addresses with geographical information: countries, regions, cities.
Click on the column heading “host” and then on “Resolve GeoIP”. 7 new columns appear.
You have just geolocated all the IP addresses that have visited your site. You now know the origin of your visitors by country, region and city. To get an overview of the geographical distribution of the visits, click again on the heading of the column that interests you, then on “Analysis…”.
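For comparison, the same kind of enrichment can be scripted with the geoip2 library and a MaxMind GeoLite2 database. This is only a sketch under those assumptions; DSS's “Resolve GeoIP” step does this resolution for you with its own GeoIP data:

```python
import geoip2.database
import geoip2.errors

# Assumes a GeoLite2-City database file downloaded from MaxMind
reader = geoip2.database.Reader("GeoLite2-City.mmdb")

def locate(ip: str):
    """Return (country, region, city) for an IP address, or Nones if unknown."""
    try:
        r = reader.city(ip)
        return r.country.name, r.subdivisions.most_specific.name, r.city.name
    except geoip2.errors.AddressNotFoundError:
        return None, None, None

print(locate("66.249.66.1"))  # illustrative call on a Googlebot-range IP
```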
Google Bot filtering in your server logs
Here is the part that SEOs should enjoy the most!
There are two options to filter Googlebot's hits in the server logs:
The first one is to click the “agent” column heading, then “Filter” on the character string “Googlebot”. This method has the disadvantage of also catching the jokers who pretend to be Googlebot… If you want to know the full user-agent strings used by Google, check this page.
The second method, which is more reliable, is to filter Googlebot IPs by manually entering the 66.249 prefix in the “host” column.
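For reference, both filtering options translate into a single line of pandas each. A rough sketch, assuming the agent and host columns from the OVH log schema and an illustrative file name:

```python
import pandas as pd

logs = pd.read_csv("ovh_logs.csv")  # file name is an assumption

# Option 1: filter on the user-agent string (may also catch fake Googlebots)
googlebot_by_agent = logs[logs["agent"].str.contains("Googlebot", case=False, na=False)]

# Option 2: filter on the 66.249 IP prefix (more reliable)
googlebot_by_ip = logs[logs["host"].str.startswith("66.249", na=False)]
```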
Once you have chosen your method, click “+ Add as a step” to validate the operation, using the same settings as in the screenshot below.
Only the rows matching the Googlebot user agent remain.
This is a good basis for performing much more detailed analyses of Googlebot's behavior on your website: status codes encountered, crawled pages, and so on.
I hope you enjoyed these first “exercises”. Let's move on to a log analysis for SEO, but before that, let's prepare the data.
PART 2: Web Server Logs Preparation
Cleaning logs for an SEO analysis
The goal here is to keep only the web server logs that matter for an SEO analysis, so we'll remove the rows that concern CSS, JS, images, etc.
Create a new analysis (as seen in PART 1, “Your first analysis”) and choose the “Filter rows/cells on value” library in the “Filter data” set.
Then set the filtering of your logs as follows:
The values to be filtered may vary depending on the type of site you are analyzing. Here, the values (“has values”) are particularly suited to the analysis of a WordPress website, but do not hesitate to inspect your dataset to complete them.
Finally, for better readability, we will remove the “HTTP/1.1” at the end of the requests.
In the request column, double-click any cell containing HTTP/1.1, then select “Replace HTTP/1.1 by…”.
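The same two preparation steps can be sketched in pandas if you prefer scripting. The extension list, file name and column names below are assumptions to adapt to your own dataset:

```python
import pandas as pd

logs = pd.read_csv("ovh_logs.csv")  # file name is an assumption

# Drop rows for static assets (extend the extension list to match your own site)
static_assets = r"\.(css|js|png|jpe?g|gif|svg|ico|woff2?|ttf)(\?|\s|$)"
logs = logs[~logs["request"].str.contains(static_assets, case=False, na=False, regex=True)]

# Strip the trailing protocol from requests such as "GET /page HTTP/1.1"
logs["request"] = logs["request"].str.replace(" HTTP/1.1", "", regex=False)
```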
Filtering crawlers and bots: GoogleBot, BingBot, etc
Start by classifying the user agents by type; this will let us keep only the bots. To do this, click the agent column header, then “Classify user agent”. New columns appear with additional details: agent_type, agent_category, and agent_brand.
To only keep rows that concern bots, click a cell containing the “bot” value in the agent_type column, and then select “Keep only rows equal to bots.”
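If you want a scripted equivalent of this classification, the user-agents Python package exposes an is_bot flag. A sketch, assuming an agent column (it will not give you the agent_brand detail that DSS provides):

```python
import pandas as pd
from user_agents import parse  # pip install user-agents

logs = pd.read_csv("ovh_logs_clean.csv")  # assumption: the cleaned dataset from the previous step

# Flag bot user agents, then keep only those rows
logs["is_bot"] = logs["agent"].apply(lambda ua: parse(str(ua)).is_bot)
bots_only = logs[logs["is_bot"]]
```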
Tip: Dataiku DSS offers a data anonymization option thanks to the “Data Anonymizer” plugin. Once installed, you will be able to anonymize any column of your datasets in a few clicks (on the Analysis tab, click “+ ADD A NEW STEP”, look for the “Anonymize data processor” and follow the instructions).
PART 3: Advanced Log Analysis for SEO
In this part, we will go further in studying the crawlers' behaviour: visit frequency, status codes encountered, and behaviour on mobile pages.
Status codes encountered by bots
Go to the Charts tab.
On the left are all the variables in your dataset. Drag them onto the X and Y axes on the right to create a visualization of your data.
The first visualization is a macro view of the status codes encountered by the bots: it combines the number of visits of each bot on the site with the status codes they encountered.
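The same macro view can be computed as a simple cross-tabulation. A sketch, assuming the bot-only dataset from PART 2 with its agent_brand and status columns:

```python
import pandas as pd

bots = pd.read_csv("ovh_logs_bots.csv")  # file name is an assumption

# Number of hits per bot and per status code, the same macro view as the chart
print(pd.crosstab(bots["agent_brand"], bots["status"]))
```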
Do you like this kind of analysis?
Let’s go on!
Bot hits per URL and page type
Let's now analyze in detail the number of hits from all the bots on each of your URLs, in order to detect particular patterns. We will also classify the pages according to whether they are AMP or not.
Start by creating a new dataset based on our previous preparation steps (filtering, cleaning). We will need it to group hits per URL.
To do this, click the yellow DEPLOY SCRIPT button at the top right. Name it, choose its location and format (CSV), and click DEPLOY.
Is your new dataset ready?
Go to the FLOW (Dataiku menu, top left). Consider the FLOW as a history of your operations.
We will now group the rows by request. The goal is to count the number of bot hits per URL.
Click on your new dataset, then, in the Visual recipes section, click the yellow “Group” icon.
Select the group keys “request” and “agent_brand”, then validate your new dataset by clicking the RUN button at the bottom.
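For the record, this Group recipe boils down to a groupby. A sketch under the same assumptions as before:

```python
import pandas as pd

bots = pd.read_csv("ovh_logs_bots.csv")  # assumption: the bot-only dataset

# Count hits per URL and per crawler, like the Group recipe with keys request + agent_brand
hits_per_url = (
    bots.groupby(["request", "agent_brand"])
        .size()
        .reset_index(name="count")
        .sort_values("count", ascending=False)
)
```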
Go back to the FLOW: your new dataset now appears. Click on it and then click the blue “LAB” button on the right.
A window appears: click on “NEW Prepare data and build models” to start a new analysis.
With a table view (Charts tab), you can build an overview of bot hits on each URL in a few clicks:
In the Charts tab, with a few drag-and-drops you can easily display the pages most visited by crawlers.
I also invite you to try a visualization of the hits per day for each crawler. This will allow you to discover abnormal behavior (e.g. a bot that suddenly stops visiting some of your pages) and also to find the date each URL was last crawled.
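A sketch of that per-day view, assuming the prepared logs keep a date column (the OVH logs include one; adjust the name if yours differs):

```python
import pandas as pd

bots = pd.read_csv("ovh_logs_bots.csv", parse_dates=["date"])  # "date" column name is an assumption

# Hits per day and per crawler: useful to spot a bot that suddenly stops crawling
hits_per_day = bots.groupby([bots["date"].dt.date, "agent_brand"]).size().reset_index(name="hits")

# Last crawl date of each URL
last_crawl = bots.groupby("request")["date"].max()
```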
Bot behaviour on mobile and AMP pages
We will now split AMP and non-AMP URLs.
Split the request column with the delimiter “/” and specify the name of the generated column (in the “Output columns prefix” box). Set Output to “Multiple columns” and check “Truncate”. Set the number of columns to keep to 1 and choose “Starting from end”.
You should get something like this:
The column you just created now contains pieces of URLs, including the “amp” value we are particularly interested in here. To keep only “amp” and delete the contents of the other cells without erasing the rows, proceed as follows:
Click on a cell containing “amp” and choose “Keep only rows containing amp”; then, in the Script panel on the left of the screen, change the action to “Clear non-matching cells” for “amp”.
Your column is now cleaned and allows you to locate the rows of your dataset corresponding to amp pages at a glance, which will be very useful for filtering.
You can now filter the crawl hits per bot (Google, Bing, Yahoo, etc.) and per page type (AMP / non-AMP), and even sort the rows by number of hits for better readability.
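The split-and-clear steps above amount to tagging the last URL segment. A pandas sketch, assuming the grouped dataset with request, agent_brand and count columns:

```python
import pandas as pd

hits = pd.read_csv("hits_per_url.csv")  # assumption: the grouped dataset

# Last URL segment, like the Split step (delimiter "/", keep 1 column, starting from end)
last_segment = hits["request"].str.rstrip("/").str.split("/").str[-1]

# Keep "amp" and clear the other values, like "Clear non-matching cells"
hits["amp"] = last_segment.where(last_segment.str.contains("amp", na=False), "")

# Example: Googlebot hits on AMP pages, sorted by number of hits
google_amp = (
    hits[(hits["agent_brand"] == "Google") & (hits["amp"] != "")]
    .sort_values("count", ascending=False)
)
```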
Let’s go back to the Charts tab to see the distribution of the crawl on the AMP vs non-AMP pages.
To go further, you can also enrich the visualization with filters: by page category, by type of bot (mobile vs. non-mobile).
Cross server logs with crawl data from Screaming Frog
Run a crawl and server log analysis
A cross-analysis will allow you to detect areas of the website where Googlebot does not go, and to make sure it spends its time on the right pages. It is a precious help for discovering orphan pages or spider traps.
Start by crawling your website with Screaming Frog to retrieve your website’s onsite data for each URL (level, inlinks, size, etc.), filter the HTML and export the data in .xlsx format.
Go back to Dataiku DSS and re-open your previous dataset (the one with the 4 columns: request, amp, agent_brand, count). We'll prepare it for the merge with the Screaming Frog crawl data. Now create a new analysis.
In the “request” column, replace “GET /” with your domain name as it appears in your Screaming Frog internal export. This step will allow us to match the rows between the two datasets.
For this analysis, you can also keep only the rows corresponding to Googlebot (Keep rows where agent_brand is Google).
Click DEPLOY SCRIPT at the top right. A new page opens; click RUN to create a new dataset based on your latest preparation steps. We will call it “ovh_crawl_logs_ready”.
Let’s move to the merge.
Click + DATASET and upload your Screaming Frog file.
Before you can join the two files and keep all the data from both datasets, you must store them in a PostgreSQL database, as explained on the Dataiku forum.
Click on your ovh_crawl_logs_ready dataset, then click Export (in the FLOW sidebar). A window opens; go to the EXPORT TO DATASET tab. Give your dataset a new name (ovh_crawl_logs_ready_postgresql), then choose Store in PostgreSQL.
Do the same with your Screaming Frog dataset, “internal_html”.
Once you have finished creating your PostgreSQL databases, click on your new “ovh_crawl_logs_ready_postgresql” dataset to bring up the FLOW sidebar.
In the Visual recipes area of the sidebar, click “Join with”.
A window appears. The value of First input dataset is “ovh_crawl_logs_ready_postgresql”. For the Second input dataset, choose internal_html_postgreSQL and click CREATE RECIPE. A new page appears.
Let’s move on to the merge setting.
In the Join step, click the line that connects Address to request. A window appears. Click the JOIN TYPE tab and select Outer join to keep not only the rows that match between the two datasets but also those that do not.
This will allow us, for example, to discover the orphan pages (pages seen by the bots but not linked on the site) or the pages not crawled by Google even though they are linked on the site.
To speed up the processing, we will filter some columns. To do this, go to the Selected columns step.
The most interesting data in your Screaming Frog file are in the Address, Size, Word_Count, Text Ratio, Level, Inlinks, Outlinks, External Outlinks and Response Time columns. You can add a prefix (e.g. SF_) for easier reading in your future dataset.
Click RUN to start merging the 2 datasets.
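If the PostgreSQL detour is not an option for you, the same outer join can be sketched in pandas. File names and the Screaming Frog column headers below are assumptions to adjust to your own exports:

```python
import pandas as pd

logs = pd.read_csv("ovh_crawl_logs_ready.csv")   # request, amp, agent_brand, count
crawl = pd.read_excel("internal_html.xlsx")      # Screaming Frog export (requires openpyxl)

# Keep the most useful crawl columns and prefix them for readability
keep = ["Address", "Size", "Word_Count", "Text Ratio", "Level",
        "Inlinks", "Outlinks", "External Outlinks", "Response Time"]
crawl = crawl[keep].add_prefix("SF_")

# Outer join on the URL: "request" was rewritten with the full domain earlier,
# so it matches SF_Address; unmatched rows from both sides are kept
merged = logs.merge(crawl, how="outer", left_on="request", right_on="SF_Address")
merged.to_csv("ovh_logs_joined_sf.csv", index=False)
```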
You are now ready for a cross logs / crawl analysis! Open your new dataset.
Orphaned pages and never-crawled pages
Your dataset has two similar columns: request and SF_Address.
Play with the filters to create the following three configurations (a pandas sketch follows the list):
- request empty / URL in SF_Address: pages never crawled but properly linked on your site
- URL in request / URL in SF_Address: everything is OK, crawlers see pages that are well linked on your site
- URL in request / SF_Address empty: these are orphan pages, pages that are no longer linked on your site but are still crawled by bots. If you have heard of crawl budget, dealing with these pages should be part of your SEO strategy.
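As announced above, here is a pandas sketch of those three filters, assuming the joined dataset was saved as ovh_logs_joined_sf.csv:

```python
import pandas as pd

merged = pd.read_csv("ovh_logs_joined_sf.csv")  # assumption: output of the join

# 1. Linked on the site but never crawled
never_crawled = merged[merged["request"].isna() & merged["SF_Address"].notna()]

# 2. Crawled and properly linked: everything is OK
crawled_and_linked = merged[merged["request"].notna() & merged["SF_Address"].notna()]

# 3. Orphan pages: crawled by bots but no longer linked on the site
orphan_pages = merged[merged["request"].notna() & merged["SF_Address"].isna()]
```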
Find your own correlations between logs and crawl
Go to the Charts tab. The goal here is to analyze the impact of each factor (from the Screaming Frog analysis) on the crawl frequency.
Choose count (AVG) on the Y axis and test different combinations on the X axis.
Here are some ideas: crawl hits vs. Level (click depth), vs. Inlinks, vs. Word_Count, vs. Response Time (all columns you kept from the Screaming Frog export).
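Outside the Charts tab, a quick correlation matrix can point you to the most promising factors. A sketch, with column names that assume the SF_ prefix chosen earlier:

```python
import pandas as pd

merged = pd.read_csv("ovh_logs_joined_sf.csv")  # assumption

# Correlation between crawl hits ("count") and each on-site factor from Screaming Frog
factors = ["count", "SF_Level", "SF_Inlinks", "SF_Size", "SF_Word_Count", "SF_Response Time"]
print(merged[factors].corr()["count"].sort_values(ascending=False))
```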
What’s next?
The analyses we have done so far are descriptive.
In an upcoming post, we will use Dataiku DSS to make predictions using machine learning algorithms.
Do not hesitate to ask your questions in the comments 😉 I will enrich the post based on your feedback!
Follow me on Twitter (@Remibacha) to be informed of updates and upcoming posts.