Do you really need backlinks? (R tutorial)


“Everyone knows, backlinks can boost rankings on Google!”

Are you really sure?

You have probably already noticed that it is not uncommon to see pages that rank without ANY backlink. In this tutorial, I share a first approach to analyzing your link needs, in the form of an exploratory SERP analysis.

Prerequisites

We will use the R language and the RStudio IDE to run our script.

You will also need a Majestic API key, which will be used to automate the retrieval of netlinking data.

If this is the first tool you develop with R, don't worry: I will explain a simplified method here and we will build the tool step by step.

What will we do in practice?


We will create a pre-audit tool for netlinking that will perform 3 operations:

  • Get the Google Top 100 URLs for a keyword using web scraping
  • Automatically retrieve more than 70 netlinking variables (CF, TF, Root Domain, etc.) for each URL, using the Majestic API
  • Analyze the correlations between the position of URLs on Google and their Majestic data
Note that if you want to go further, a multivariate analysis method that takes into account netlinking as well as other SEO criteria (content, speed, etc.) is taught in the “Machine Learning” module of the Data SEO Labs (Level 1) training. You will discover in particular how to build a Google ranking prediction function.

Ready to analyze the SERP? Let’s GO!

1. Install needed R packages

The following script checks whether the packages are already in your library and installs only the ones you are missing (a piece of code to bookmark).

list.of.packages <- c("stringr", "httr","XML", "dplyr", "majesticR","Hmisc","corrplot","PerformanceAnalytics")

new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(stringr)
library(httr)
library(XML)
library(dplyr)
library(majesticR)
library(Hmisc)
library(corrplot)
library(PerformanceAnalytics)

2. Get the Google Top 100 URLs

We start by creating our SERP scraper, a function that we will call get_google_serp_urls(). It builds the SERP URL from a keyword, retrieves the page content, and then extracts the result URLs.

The function here takes 5 parameters:

  • querie: the keyword whose SERP you want to study
  • number_of_results: the number of results you want to retrieve (10, 100, etc.)
  • country_code: the SERP country (see the list of country codes on Wikipedia)
  • language_code: the SERP language (see the list on Wikipedia)
  • user_agent: the user agent with which you want to make the request

Copy and paste the following code into an R script and run it to create the get_google_serp_urls() function.

#Create the R function to scrape the SERP
get_google_serp_urls <- function(querie, number_of_results, country_code, language_code, user_agent){
  
  #Build the Google search URL from the keyword and the SERP settings
  serp_url <- paste0("https://www.google.com/search?q=",querie,"&num=",number_of_results,"&cr=country",country_code,"&lr=lang_",language_code)

  serp_url <- str_replace_all(serp_url,"\\s+","+") #stringr
  serp_url <- as.character(serp_url)
  
  #Request the SERP with the chosen user agent
  request <- GET(serp_url, user_agent(user_agent)) #httr
  
  #Parse the HTML response
  doc <- htmlParse(request, asText = TRUE, encoding = "UTF-8") #XML
  
  #Extract the result URLs and number them by position
  urls_df <- data.frame(url = xpathSApply(doc, '//*[@class="r"]/a', xmlGetAttr, 'href'))
  urls_df <- urls_df %>% mutate(position = row_number()) #dplyr
  
  return(urls_df)
  
}

Here is an example use of the get_google_serp_urls() function. Try it to get your Google Top 100:

#Set the 5 parameters
querie <- "machine learning"
number_of_results <- 100
country_code <- "FR"
language_code <- "fr"
user_agent <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"

#Let's try the scraper
my_url_dataset <- get_google_serp_urls(querie, number_of_results, country_code, language_code, user_agent)

We now have a dataset (my_url_dataset) that contains the first 100 URLs that rank on the “machine learning” query.
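
If you want a quick look at what the scraper returned, two base R calls are enough (a small sketch; the exact content depends on what Google returns when you run it):

#Quick check of the scraped SERP
head(my_url_dataset)  #first rows: url + position
nrow(my_url_dataset)  #should be close to number_of_results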


NB: If for any reason you want to avoid scraping the results page of your favorite search engine yourself, note that SEMrush also offers a SERP URL retrieval function. I have integrated this feature into SEMrushR, the R package I created to use the SEMrush API (documentation in French).

Now let's retrieve the Majestic data for each of these URLs.

3. Get netlinking data with majesticR (API)

As you may have seen on my Twitter account, I have been working on majesticR, an R package to easily connect to the Majestic API and get data. It is now an official R package (#OFFISHAL!) published on CRAN, the reference repository for R packages.

For the rest of the tutorial, I will use the majestic_url() function of my majesticR package, which retrieves the data (CF, TF, External backlinks, etc.) for a URL. In order to automatically recover the data for all the URLs in the list we created above with our web scraper (my_url_dataset), I have integrated the majestic_url() function into a loop.
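
Before automating everything, you can test the function on a single URL. Here is a minimal sketch (replace the XXX with your own Majestic API key; the URL is just an example):

#Quick single-URL test
api_Key <- "XXXXXXXXXXXXXXXXXXXXXXXXXXX"
majestic_url("https://www.r-project.org/", api_Key)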

Run the following script, replacing the XXX with your Majestic API key, and enjoy life: you have just created a bot that recovers the data for you:

#Your Majestic API key (replace the XXX)
api_Key <- "XXXXXXXXXXXXXXXXXXXXXXXXXXX"

#Empty data frame that will store the Majestic variables
majestic_out <- data.frame(matrix(nrow=0, ncol=71))

#Loop over the scraped URLs and stack the Majestic data row by row
for (i in 1:nrow(my_url_dataset)) {
  majestic_doing <- majestic_url(my_url_dataset[i, "url"], api_Key)
  majestic_out <- rbind(majestic_out, majestic_doing)
}

The recovery of CF, TF and other Majestic data for a list of URLs takes only a few seconds.

Bravo! You have just retrieved all the netlinking data for the Google Top 100 on the “machine learning” query, and everything has been stored in a dataset called “majestic_out”. We will now be able to analyze all this!
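
If you want to check what came back before analyzing it, a quick inspection works here too (a sketch; the columns shown are the ones we will use later in this tutorial):

#Quick look at the Majestic data
dim(majestic_out)  #number of URLs x number of Majestic variables
head(majestic_out[, c("ItemNum","ExtBackLinks","RefDomains","CitationFlow","TrustFlow")])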

4. Prepare data

Before going to the analysis step (correlations between the position and the different netlinking variables that we have recovered), a quick data preparation step is required. We will add 3 new columns here (which we will use in the rest of this tutorial), then create a new dataset that will contain only the columns we will analyze.

#Replace ItemNum with the Google position and rename the column
majestic_out$ItemNum <- my_url_dataset$position
colnames(majestic_out)[1] <- "position"

#Create 3 new columns to help us find correlations
majestic_out$isTop3 <- 1
majestic_out$isTop3[majestic_out$position>3] <- 0

majestic_out$isTop5 <- 1
majestic_out$isTop5[majestic_out$position>5] <- 0

majestic_out$isTop10 <- 1
majestic_out$isTop10[majestic_out$position>10] <- 0

#Keep only some variables
myvars <- c("position","ExtBackLinks","RefDomains","RefIPs","CitationFlow","TrustFlow")
subset_majestic_out <- majestic_out[myvars]

[TEASING] In an upcoming tutorial, we will use the 3 categorical columns we have just created (isTop3, isTop5 and isTop10) to analyze the influence of backlinks on the Google ranking in the Top 10, Top 5 and Top 3.

The dataset is now composed only of numerical data.
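
Before moving on, a quick sanity check on the three flags we just created: each table() should show exactly 3, 5 and 10 URLs flagged with 1 (assuming the scraper returned a full Top 100):

#Sanity check on the categorical columns
table(majestic_out$isTop3)
table(majestic_out$isTop5)
table(majestic_out$isTop10)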

What you just did is super cool! Be proud of it.

Data recovery and preparation are essential steps (loved by data scientists) in any data science project. The next step is the exploratory analysis of the data, to find interesting things.
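
A first glance at the distributions can be obtained with summary() (a minimal sketch on the subset we just built; the output will of course depend on your own SERP):

#First exploratory glance at the variables we kept
summary(subset_majestic_out)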

5. Analyze correlations between netlinking and ranking

5.1. Analysis with Majestic URL data

Let's start with simple correlation tests between 2 variables by testing Pearson and Spearman correlations:

#Correlation between 2 continuous variables
cor(subset_majestic_out$position,subset_majestic_out$ExtBackLinks, method="pearson", use = "complete.obs")
#[1] -0.1516199

cor(subset_majestic_out$position,subset_majestic_out$ExtBackLinks, method="spearman", use = "complete.obs")
#[1] -0.1076594

cor(subset_majestic_out$position,subset_majestic_out$RefDomains, method="pearson", use = "complete.obs")
#[1] -0.2188013

cor(subset_majestic_out$position,subset_majestic_out$RefDomains, method="spearman", use = "complete.obs")
#[1] -0.09217801

As a reminder, here is how the values you just obtained (the correlation coefficients) are interpreted:

Feel free to visit mathsisfun.com for some reminders about correlations.

If you want to evaluate the dependence between several variables at the same time, I advise you to create a correlation matrix, which contains the correlation coefficients (used to measure the strength of the relationship between two variables), as well as a matrix of p-values (which tells you how reliable each of those correlations is):

To learn more about p-values, I recommend this excellent video from La statistique expliquée à mon chat (in French).

Let's go back to our matrices.

my_big_matrix <- rcorr(as.matrix(subset_majestic_out))
my_big_matrix

# Extract the correlation coefficients
correlation_matrix <- round(my_big_matrix$r,2)

# Extract p-values
pvalue_matrix <- round(my_big_matrix$P,2)

The two resulting matrices: correlation coefficients and p-values.

Let’s finish with some visualization.

We will create a correlogram, a visualization of the correlation matrix. At a glance, the correlogram shows the intensity of each correlation (the size of the circles) and its direction: positive correlations are in blue and negative ones in red. Crosses mark correlations that are not significant according to their p-value. Here I set the significance level to 0.05 (sig.level = 0.05).

corrplot(correlation_matrix, type="upper", 
         p.mat = pvalue_matrix, sig.level = 0.05, insig = "pch",
         tl.col = "black", tl.srt = 45)


Finally, the view that brings everything together.

chart.Correlation(subset_majestic_out, histogram=TRUE, pch=19)


This view may seem scary at first, but it all becomes clear when you know where to look!

We can see:

  • The value of the correlation coefficient in the upper “triangle”
  • A scatter plot visualization in the lower “triangle”, with a line that runs through the points and gives the direction of correlation
  • The distribution of each variable in the diagonal (cells with variable names)
  • The level of significance (p-value) of each correlation through the symbols *** (p-value very close to 0), ** (0.001), * (0.05), . (0.1) and empty space (1)

6. Interpretation of the analysis + next steps

The correlation coefficients between the page positions and the 5 netlinking variables from Majestic's point of view (ExtBackLinks, RefDomains, RefIPs, CitationFlow, TrustFlow) are quite low here.

The hypothesis that emerges from this first analysis is that CF and TF seem to have some influence on the position for the “machine learning” query on Google FR.
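
To put figures on that hypothesis, you can pull the corresponding coefficients (and their p-values) straight out of the matrices built in step 5 (a sketch; the values depend on your own crawl):

#Coefficients behind the CF/TF hypothesis
correlation_matrix["position", c("CitationFlow","TrustFlow")]
pvalue_matrix["position", c("CitationFlow","TrustFlow")]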

This is a first approach to try to take advantage of the data at our disposal. It is now necessary to go further and deepen the analysis:

  • Use the categorical variables (isTop3, isTop5 and isTop10) to analyze correlations (see the sketch after this list)
  • Add other variables corresponding to the different ranking factors (page speed, number of internal links, content size, semantic score, backlinks to the root domain, etc… etc…)
  • Work with more observations (data augmentation) and do a multivariate analysis using a machine learning approach in order to be as close as possible to search engine behaviour, and to take into account the interactions between the different variables.
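
For the first point, here is a minimal sketch of the idea: a point-biserial correlation, i.e. Pearson applied to one of the 0/1 flags we created earlier (isTop10):

#Sketch: correlation between the isTop10 flag and two Majestic variables
cor(majestic_out$isTop10, majestic_out$TrustFlow, use = "complete.obs")
cor(majestic_out$isTop10, majestic_out$RefDomains, use = "complete.obs")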

I hope you enjoyed this new tutorial! Feel free to comment if you are interested in the next analysis steps!

To receive other R scripts and tutorials, subscribe to my newsletter and follow me on Twitter.

About the author

Rémi Bacha
