Python statistics for beginners
Pearson correlation coefficient
What is correlation?
All of us have heard the word “correlation” before. Even if we are not sure what exactly it means, we know that it is some kind of indicator of how strongly two variables are related to each other. And you know what? That is pretty close to the truth.
Correlation itself is a mathematical technique for examining the relationship between two quantitative variables, such as the price of a car and its engine size.
So what types of correlation are there? Let us draw some graphs to get a better understanding:
# Import working libraries
import numpy as np
import matplotlib.pyplot as plt

idx = np.arange(25)

# Positive correlation: values increase along the index
x = np.arange(start=0, stop=25, step=1)
plt.scatter(idx, x)

# Negative correlation: values decrease along the index
x = np.arange(start=25, stop=0, step=-1)
plt.scatter(idx, x)

# No correlation: values are completely random
x = np.random.rand(25)
plt.scatter(idx, x)
Positive correlation indicates that two variables move in the same direction. In other words, if one variable increases, the other increases as well, and if one variable decreases, the other decreases accordingly. A positive correlation can be illustrated as follows:
A negative correlation is a relationship between two variables in which an increase in one variable leads to a decrease in the other. A good example of a negative correlation is the amount of oxygen in relation to altitude. With increasing altitude, the oxygen level in the air decreases (a common problem for extreme mountaineers). A negative correlation looks like this:
And last but not least, the third form: no correlation. In this case, the data points are completely random and do not show any sign of correlation. No correlation can be illustrated like this:
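Before moving on, we can also put numbers on these three cases. As a quick sketch, the data from the plots above can be fed into NumPy's `corrcoef`, which returns the correlation coefficient we will formally introduce further below:

```python
import numpy as np

rng = np.random.default_rng(42)
idx = np.arange(25)

pos = np.arange(start=0, stop=25, step=1)    # steadily increasing
neg = np.arange(start=25, stop=0, step=-1)   # steadily decreasing
rand = rng.random(25)                        # completely random

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r_pos = np.corrcoef(idx, pos)[0, 1]
r_neg = np.corrcoef(idx, neg)[0, 1]
r_rand = np.corrcoef(idx, rand)[0, 1]
print(r_pos, r_neg, r_rand)
```

The first two coefficients come out at exactly 1 and -1, while the random series lands somewhere much closer to 0.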
Now, correlation comes in different forms. It can be linear, non-linear or monotonic. Let us draw some more plots to illustrate the differences:
# Import additional working libraries
import seaborn as sns

# Linear correlation plot
data1 = 20 * np.random.random(100) + 100
data2 = data1 + (10 * np.random.random(100) + 50)
sns.regplot(x=data1, y=data2)

# Non-linear correlation plot
x = np.arange(0, 10, 0.1)
ytrue = np.exp(-x / 10000) + 2 * np.sin(x / 3)
y = ytrue + np.random.normal(size=len(x))
sns.regplot(x=x, y=y, lowess=True)

# Monotonic non-linear correlation plot (logistic curve with random parameters)
def f(x, a, b, c, d):
    return a / (1. + np.exp(-c * (x - d))) + b

a, c = np.random.exponential(size=2)
b, d = np.random.randn(2)
n = 100
x = np.linspace(-10., 10., n)
y_model = f(x, a, b, c, d)
y = y_model + a * .2 * np.random.randn(n)
sns.regplot(x=x, y=y, lowess=True)
A linear correlation is a trend in the data where both variables change at a constant rate. For example, suppose a car dealer wants to estimate the impact of fuel consumption on car prices. They find that for every additional litre of fuel consumption the price of a car decreases by $1000. This describes a linear relationship between fuel consumption and car prices, where car prices are influenced by fuel consumption at the constant rate of $1000. A linear correlation can be identified by a straight line:
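The dealer's rule of thumb can be written down directly. The base price below is made up for illustration; the point is that a perfectly constant rate of change produces a coefficient of exactly -1:

```python
import numpy as np

base_price = 30_000  # hypothetical price of a (theoretical) zero-consumption car
consumption = np.array([5.0, 6.0, 7.0, 8.0, 9.0])  # litres per 100 km
price = base_price - 1_000 * consumption           # $1000 less per extra litre

# A constant rate of change yields a perfect (here: negative) linear relation
r = np.corrcoef(consumption, price)[0, 1]
print(price)
print(r)  # → -1.0
```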
A correlation is non-linear when the two variables do not change at a constant rate. As a result, the relationship between the variables does not graph as a straight line but produces a somewhat curved pattern in the data. The following graph illustrates what that can look like:
Similar to a linear relationship, both variables in a monotonic relationship move in the same direction. However, the variables do not necessarily have to move at a constant rate. The next plot shows this behaviour, where both variables are concurrently increasing but not at the same rate.
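A quick way to see the difference in numbers: on data that increases monotonically but not at a constant rate (here y = x³, chosen purely for illustration), a rank-based measure such as Spearman's still reports a perfect relationship, while the linear Pearson measure we meet in the next section does not:

```python
import numpy as np
from scipy import stats

x = np.arange(1, 26, dtype=float)
y = x ** 3  # strictly increasing, but not at a constant rate

pearson_r, _ = stats.pearsonr(x, y)    # measures the linear relationship
spearman_r, _ = stats.spearmanr(x, y)  # measures the monotonic relationship
print(pearson_r, spearman_r)
```

Spearman's coefficient is exactly 1 because the ranks of x and y agree perfectly, while Pearson's coefficient is high but clearly below 1.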
Pearson’s correlation coefficient
Graphing the data and its relationship helps us to determine which type of correlation we are confronted with and which form it has. However, it doesn’t tell us how strong the correlation we are looking at actually is. To quantify the relationship between the two variables, we need to calculate the correlation coefficient.
The correlation coefficient is a statistical measure that quantifies the relationship between two variables. The coefficient’s value ranges between -1.0 and 1.0; a calculated value outside this range indicates an error in the calculation. A coefficient of -1.0 shows a perfect negative correlation and 1.0 a perfect positive correlation. A coefficient of 0.0, on the other hand, means that there is no relationship between the two variables.
There are many different ways to calculate the correlation coefficient of two variables. The most common one is the so-called Pearson’s correlation coefficient (r). It is a test to measure the strength of a linear relationship between two normally distributed variables. If the data is not normally distributed, the Kendall and Spearman tests can be used instead.
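All three tests are available in scipy.stats. A minimal sketch on synthetic, normally distributed data (the names and numbers here are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)                       # normally distributed variable
y = 0.8 * x + rng.normal(scale=0.5, size=200)  # noisy linear dependence on x

r, p_r = stats.pearsonr(x, y)        # linear strength, assumes normality
rho, p_rho = stats.spearmanr(x, y)   # rank-based alternative
tau, p_tau = stats.kendalltau(x, y)  # rank-based, robust for small samples
print(f"Pearson r={r:.2f}, Spearman rho={rho:.2f}, Kendall tau={tau:.2f}")
```

Each function returns the coefficient together with a p-value, which we will need later for the significance test.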
Since (r) ranges between -1.0 and 1.0, its interpretation can become difficult when approaching zero. There are many rules of thumb on how to interpret the different values of (r), but probably the most common one was published in 2003 by Dennis E. Hinkle and his co-authors in their introductory text on applied statistics for the behavioral sciences. It was designed to teach students and provides a rule of thumb for interpreting the size of a correlation coefficient:
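The bands from that text are commonly quoted as negligible below 0.3, low up to 0.5, moderate up to 0.7, high up to 0.9, and very high above that. As a small helper of my own, we can map the absolute value of (r) to its label:

```python
def interpret_r(r):
    """Map |r| to the interpretation bands commonly attributed to Hinkle et al."""
    size = abs(r)
    if size >= 0.9:
        return "very high"
    if size >= 0.7:
        return "high"
    if size >= 0.5:
        return "moderate"
    if size >= 0.3:
        return "low"
    return "negligible"

print(interpret_r(0.87))   # → high
print(interpret_r(-0.95))  # → very high
```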
By measuring the (r) value we have quantified how strong the relationship between the two variables is. But that only tells us half the story, right? The reason is that the correlation coefficient we have calculated only represents a sample, not the entire population. So while we know what the correlation looks like in our sample, we cannot be sure whether it is representative of the entire population. We are therefore now going to conduct a statistical significance test that can tell us whether our observation can be expected to hold in the entire population or not. Let us set up our test step by step:
Step 1: Formulating a hypothesis
In hypothesis testing we always have to formulate two hypotheses. One is called the null hypothesis while the other one is called the alternative hypothesis. They usually state opposite outcomes:
The null hypothesis (Ho) is the hypothesis that we are trying to disprove. In our case, it is the hypothesis that there is no significant linear correlation between the two variables in the given population.
The alternative hypothesis (Ha) is the hypothesis that we try to provide evidence for. In this example, we will try to prove that there is a linear correlation in the population.
Step 2: Conducting a T-Test
Without going into the heavy math, the T-Test (also called Student’s T-Test) allows us to test an assumption on an entire population. In our case it will help us to find out whether the correlation we have observed within our sample can be reproduced within the entire population.
The T-Test will give us a number (t) that we in turn have to interpret. The higher (t), the higher the likelihood that the correlation is reproducible within the population. But why is that?
Simply put, the T-Test calculates how much two groups differ. In our case, we use the T-Test to compare one group with no correlation and one with a proven relationship. If the correlation we have observed occurred by coincidence, (t) will be close to 0, indicating that the two populations are the same.
However, if the two groups differ significantly, the t-value will be much higher and located at the extremes of the population. This indicates that the correlation we have observed is not due to coincidence and cannot be found in a population with no correlation:
# Creating a bell curve for illustration purposes
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.arange(-4, 4, 0.001)
y = norm.pdf(x, 0, 1)
fig, ax = plt.subplots(figsize=(9, 6))
ax.plot(x, y)
Step 3: Interpreting the p-value
Since (t) is a theoretical test statistic that depends on the population, we do not know whether the t-value we observe is high or not. In other words, a population with a range from 0 to 1,000 will have different t-values than a population with a range from 0 to 1. To overcome this little problem, every T-Test also calculates the so-called p-value (p).
The p-value shows us the probability of observing this t-value within the population of Ho. You can also think of it as a game of darts: the higher (p), the higher the chance that your dart lands somewhere near 0 in the graph above; the lower (p), the lower the chance that you will hit the middle.
So the lower (p), the more likely it is that our t-value is located at one of the extremes and is therefore “high enough” to indicate a significant difference between our two populations. Or in other words: the lower (p), the higher the probability that our observed correlation has not occurred by coincidence.
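For Pearson's r, the t-statistic can be computed from the coefficient and the sample size alone via the textbook formula t = r·√(n−2) / √(1−r²), with n−2 degrees of freedom. A small sketch (the helper name r_to_p is my own):

```python
import numpy as np
from scipy import stats

def r_to_p(r, n):
    """Two-sided p-value for Pearson's r from a sample of size n."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

# The same coefficient is far more convincing in a large sample...
p_large = r_to_p(0.87, 200)
# ...than in a tiny one
p_small = r_to_p(0.87, 5)
print(p_large, p_small)
```

With 200 observations the p-value is vanishingly small, while the same r = 0.87 from only 5 observations does not even clear the usual 0.05 threshold.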
Step 4: Deciding upon the hypothesis
Let us get to the final part and look at the last question — how low does (p) need to be?
Again, it depends. It depends on the significance level (a) you set for your experiment. The significance level is your threshold for the p-value. Typically, (a) is set to 0.05 or 5%, meaning that a p-value larger than that will lead to retaining the null hypothesis. Or in other words: if you are not 95% sure that the two groups differ significantly, you cannot reject the null hypothesis.
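The decision rule itself is a one-liner, sketched here as a hypothetical helper:

```python
def decide(p, alpha=0.05):
    """Hypothetical helper: compare a p-value to the significance level."""
    if p < alpha:
        return "reject H0"
    return "fail to reject H0"

print(decide(0.002))  # → reject H0
print(decide(0.2))    # → fail to reject H0
```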
Assume we receive a p-value of 0.002 while having set (a) to 0.05. In this case (p) is below our significance level and we can come to the following conclusion:
We can reject Ho in favour of Ha because the correlation we have observed in our sample is very unlikely to have occurred by coincidence and can be expected to hold in the larger population.
But enough of the theory for now — let us look at a real life example and how to do all that with Python!
What do you need?
To work through this tutorial you will need to have the following specs:
- An editor like VS Code or Spyder
- Python 3.8.3 or higher
- The Automobile Data Set from the UCI Machine Learning Repository
What will we do?
In this tutorial we will have a look at the Automobile Data Set provided by the UCI Machine Learning Repository. It is a data set that includes technical and price data as well as insurance information for cars from the year 1985. The data set is open source and typically used to train/test regression models.
We will perform a number of formattings on this data set to make it work for us and then explore the correlation of a few variables with the variable price.
Step 1: Importing working libraries
First, we need to import the libraries we will need to perform our analysis. If you have not installed them yet, please check the respective documentation and do so before executing the below code.
# Import working libraries
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
Step 2: Load the data
If you have not done so already, please download the “imports-85.data” file from the UCI website and save it in your working directory. Please note that you have to change “.data” to “.csv” to execute the following code.
# Load data and show first 5 rows
df = pd.read_csv("imports-85.csv", header=None)
df.head()
Step 3: Formatting the data
Have you noticed that our data set has no headers and that some values show a question mark? That is something we should fix before we run our correlation analysis, as our code otherwise probably won’t work and the obtained results might be distorted.
Let us first add some column names so that the data set becomes easier to understand and work with.
# Add headers to df and display first 5 rows
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size","fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
df.columns = headers
df.head()
As that seemed to work well, we should now have a look at those question marks in the normalized-losses column. We cannot use them for our further analysis, so we will replace them with NaN values. Note that we will call the pandas replace function on the whole dataframe and not just on the one column. We do that to ensure that we do not later encounter another question mark somewhere else in the data set.
# Replace ? with NaN and show first 5 rows of df
df = df.replace("?", np.nan)
df.head()
Since we have no overview of where the missing values are now, we should run a quick analysis to find the NaN values. We can do that by checking the data frame for NaN values first and then printing out our analysis using a for loop.
# Check df for missing values
missing_data = df.isnull()

# Run for loop to check for missing values
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
If you scroll through the analysis, you will find that we have missing values (=True) in the following columns: normalized-losses, num-of-doors, bore, stroke, horsepower, peak-rpm and price.
Since we need a functioning data set, we have to find a way to deal with the missing data. In general, there are two options. Either we drop columns or rows with missing data or we replace the missing data.
Since our data set is not large enough to simply drop all rows/columns with missing values, we will replace the missing values with the means of the remaining data.
The only exception will be the num-of-doors column as this data is not numerical. Here we will replace the missing data with the most common variant. In this case, that is “four”.
# Replacing NaNs with mean / most common variant
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_norm_loss)

avg_bore = df["bore"].astype("float").mean(axis=0)
df["bore"] = df["bore"].replace(np.nan, avg_bore)

avg_stroke = df["stroke"].astype("float").mean(axis=0)
df["stroke"] = df["stroke"].replace(np.nan, avg_stroke)

avg_horsepower = df["horsepower"].astype("float").mean(axis=0)
df["horsepower"] = df["horsepower"].replace(np.nan, avg_horsepower)

avg_peakrpm = df["peak-rpm"].astype("float").mean(axis=0)
df["peak-rpm"] = df["peak-rpm"].replace(np.nan, avg_peakrpm)

avg_price = df["price"].astype("float").mean(axis=0)
df["price"] = df["price"].replace(np.nan, avg_price)

df["num-of-doors"] = df["num-of-doors"].replace(np.nan, "four")
What’s left? Right! Checking the data types. We can do that easily by executing the following code:
# Checking df types
df.dtypes
As you can see, some columns still have the wrong data type (e.g. bore should be a float). So let us correct that quickly:
# Format data types
df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
There is much more we could do in terms of data standardization and normalization but we do not need that for this tutorial, so let us move on.
Step 4: Explore the data
Let us quickly recap what we want to do: We want to look for variables that have some kind of correlation to the price of cars, right? So we need to first check the type and form of the relationship between the two variables — and the easiest way to do that is to draw some more graphs!
To do that we will use the seaborn library and its regplot function. This function plots the data points and a linear regression line for the selected variables on a 2D graph. While this does not help us to quantify the correlation, it helps us to identify the type and form of the relationship between the two variables.
But without further ado, let’s get started and have a look at the price in relation to:
- engine-size
- highway-mpg
- peak-rpm
Have a look at the graph below. We can observe that the linear regression line fits the plotted data quite well and that it has a steep upward gradient. But what does that tell us about the type and form of the correlation?
While the clearly increasing gradient of the regression line shows us that we can expect a positive correlation between the engine size and price, the good fit of the regression line indicates that the variables have a linear correlation.
# Create a plot for engine-size and price
sns.regplot(x="engine-size", y="price", data=df)
The graph below shows the regression plot for highway miles per gallon (mpg) in relation to the cars’ prices. Here, we can observe that the regression line is decreasing, indicating a negative correlation. However, we can see that the regression line does not cover the outliers between 0 and 25 on the x-axis. This indicates that a linear regression is not a good fit and suggests that the relationship might be non-linear.
# Create a plot for highway-mpg and price
sns.regplot(x="highway-mpg", y="price", data=df)
The graph for peak-rpm and price looks different again. We see that the data points vary greatly and that the regression line is almost horizontal. Based on that observation we cannot determine which form of correlation to expect, but assuming a linear relationship, the nearly flat regression line indicates a correlation coefficient around 0. We would therefore conclude that there is no correlation between peak-rpm and price.
# Create a plot for peak-rpm and price
sns.regplot(x="peak-rpm", y="price", data=df)
Since it would take ages to look at each variable pair individually, we can shorten the process by using pandas’ corr() function:
# Using pandas corr() function
df.corr(numeric_only=True)
As you can see, pandas’ corr() function creates a correlation matrix with all numeric variables contained in our data set. You might ask: why didn’t we do this from the beginning?
The answer is relatively simple. While the corr() function calculates Pearson correlation coefficients by default, it does not give you any information about the form of the correlation. It can therefore serve as a great starting point to identify potentially high correlations, but checking the variable pairs individually will still be necessary.
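For example, on a toy data frame (the column names mirror our data set, but the numbers below are synthetic and purely for illustration), sorting the price column of the correlation matrix immediately shortlists the candidates:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 50
engine_size = rng.uniform(60, 300, size=n)
toy = pd.DataFrame({
    "engine-size": engine_size,
    "highway-mpg": 60 - 0.15 * engine_size + rng.normal(scale=3, size=n),
    "peak-rpm": rng.uniform(4000, 6500, size=n),
    "price": 100 * engine_size + rng.normal(scale=3000, size=n),
})

# Sort every variable's correlation with price to shortlist candidates
candidates = toy.corr(numeric_only=True)["price"].sort_values(ascending=False)
print(candidates)
```

The same one-liner works on our real df once the formatting steps above have been applied.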
Step 5: Calculate correlation and (p) values
Let us quickly summarize the outcome of our above analysis before we move on:
- We can expect a positive, linear correlation between engine size and price
- We have observed a negative correlation between highway-mpg and price that seems to be non-linear.
- We assume no correlation between peak-rpm and price and we have no reliable information of the form of correlation.
We can thus conclude that the variables highway-mpg and peak-rpm do not seem suitable for further analysis with the Pearson correlation coefficient.
To have a closer look at the engine size, we can compute the Pearson correlation coefficient as well as the p-value with the help of the scipy.stats library.
# Calculate pearson coefficient and p-value
pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
As a result, we receive a Pearson correlation coefficient of ~0.87 and a p-value of pretty much 0.
Since our coefficient is >0.7 but <0.9, we conclude that we are observing a high positive correlation between engine size and price.
Finally, we can observe that our p-value is definitely below our significance level (a) of 0.05. We can therefore conclude that the correlation we have calculated above is not a coincidence and therefore significant.
This would not be a good tutorial if I did not make the following famous remark:
Correlation does not imply causation.
We have learned that correlation is a measure to describe the extent of a relationship between two variables. Causation, on the other hand, is a cause-and-effect relationship between two variables.
Even if we have observed a correlation, we cannot conclude that one variable causes a change in the other. Unless we have been able to account for all other factors, we have to assume that there is still a chance of coincidence or that a third factor might be causing both of our variables to change.
It is therefore always important to conduct a more comprehensive experiment before drawing final conclusions regarding correlation figures.
Python statistics for beginners: Pearson correlation coefficient was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.