### Python statistics for beginners

#### Pearson correlation coefficient

### What is correlation?

All of us have heard the word “correlation” before. Although we might not be sure what it exactly means, we know that it is some kind of indicator on how strong two variables are related to each other. And you know what? That is pretty close to the truth.

Correlation itself is a mathematical technique to examine a relationship between two quantitative variables as for example the price of a car and its engine size.

#### Correlation types

So what types of correlation are there? Let us draw some graphs to get a better understanding:

```python
# Import working libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Positive correlation
x = np.arange(start=0, stop=25, step=1)
plt.figure()
plt.plot(x, 'o')

# Negative correlation
x = np.arange(start=25, stop=0, step=-1)
plt.figure()
plt.plot(x, 'o')

# No correlation
x = np.random.rand(25)
plt.figure()
plt.plot(x, 'o')
```

**Positive correlation** indicates that two variables move in the same direction. In other words, if one variable increases, the other will increase as well, and if one variable decreases, the other decreases equivalently. A positive correlation can be illustrated as follows:

A **negative correlation** is a relationship between two variables in which the increase in one variable leads to the decrease in the other. A good example for a negative correlation is the amount of oxygen in relation to altitude. With an increase in altitude the oxygen levels in the air will decrease (a common problem for extreme mountaineers). A negative correlation looks like this:

And last but not least, the third form: **no correlation**. In this case, the data plots are completely random and do not show any sign of correlation. No correlation can be illustrated like this:

#### Correlation forms

Now, correlation comes in different forms. It can be **linear**, **non-linear** or **monotonic**. Let us draw some more plots to illustrate the differences:

```python
# Import additional working libraries
import seaborn as sns

# Linear correlation plot
data1 = 20 * np.random.random(100) + 100
data2 = data1 + (10 * np.random.random(100) + 50)
sns.regplot(x=data1, y=data2)

# Non-linear correlation plot
x = np.arange(0, 10, 0.1)
ytrue = np.exp(-x / 10000) + 2 * np.sin(x / 3)
y = ytrue + np.random.normal(size=len(x))
sns.regplot(x=x, y=y, lowess=True)

# Monotonic non-linear correlation plot
def f(x, a, b, c, d):
    return a / (1. + np.exp(-c * (x - d))) + b

a, c = np.random.exponential(size=2)
b, d = np.random.randn(2)
n = 100
x = np.linspace(-10., 10., n)
y_model = f(x, a, b, c, d)
y = y_model + a * .2 * np.random.randn(n)
sns.scatterplot(x=x, y=y)
sns.lineplot(x=x, y=y_model)
```

A **linear correlation** is a trend in the data where both variables change at a constant rate. For example, suppose a car dealer wants to estimate the impact of fuel consumption on car prices. They find that for every additional litre of fuel consumption the price of a car decreases by $1000. This describes a linear relationship between fuel consumption and car prices, where car prices are influenced by fuel consumption at the constant rate of $1000. A linear correlation can be identified by a straight line:
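The dealer example can be written down as a tiny pricing function. Note that the base price and the $1000-per-litre rate are made up purely for illustration:

```python
# Hypothetical linear relationship: every additional litre of fuel
# consumption lowers the price by a constant $1000
def estimated_price(base_price, fuel_consumption_litres):
    return base_price - 1000 * fuel_consumption_litres

print(estimated_price(30000, 5))  # 25000
print(estimated_price(30000, 6))  # 24000, i.e. one more litre, $1000 less
```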

A correlation is **non-linear** when the two variables do not change at a constant rate. As a result, the relationship between the variables does not graph as a straight line but produces a somewhat curved pattern in the data. The following graph illustrates what that can look like:

Similar to a linear relationship, both variables in a **monotonic** relationship will move in the same direction. However, the variables do not necessarily have to move at a constant rate. The next plot shows this behaviour, where both variables are concurrently increasing but not at the same rate.

### Pearson’s correlation coefficient

Graphing the data and its relationship helps us to determine which type of correlation we are confronted with and which form it has. However, it doesn’t tell us how strong the correlation we are looking at actually is. To quantify the relationship between the two variables, we need to calculate the **correlation coefficient**.

The **correlation coefficient** is a statistical measure that quantifies the relationship between two variables. The coefficient’s value ranges between -1.0 and 1.0; a calculated value outside this range indicates an error in the calculation. A coefficient of -1.0 shows a perfect **negative correlation** and 1.0 a perfect **positive correlation**. A coefficient of 0.0, on the other hand, means that there is no relationship between the two variables.
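To see these bounds in action, here is a small sketch (not part of the original tutorial) using NumPy’s `corrcoef`, which returns a 2x2 correlation matrix whose off-diagonal entry is the coefficient for the pair:

```python
import numpy as np

x = np.arange(10)
y = 2 * x + 3            # y moves exactly with x

r_pos = np.corrcoef(x, y)[0, 1]    # perfect positive correlation
r_neg = np.corrcoef(x, -y)[0, 1]   # perfect negative correlation

print(round(r_pos, 4))   # 1.0
print(round(r_neg, 4))   # -1.0
```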

There are many different ways to calculate the correlation coefficient of two variables. The most common one is the so-called **Pearson’s correlation coefficient (r)**. It is a test that measures the strength of a linear relationship between two normally distributed variables. If the data is not normally distributed, the Kendall or Spearman tests can be used instead.

Since *(r)* ranges between -1.0 and 1.0, its interpretation can become difficult when it approaches zero. There are many rules of thumb on how to interpret the different values of *(r)*, but probably the most common one was published in 2003 by Dennis E. Hinkle and his co-authors in their introductory text on applied statistics for the behavioral sciences. It was designed to teach students and provides a rule of thumb for interpreting the size of a correlation coefficient:

| *(r)* | Interpretation |
| --- | --- |
| 0.90 to 1.00 (-0.90 to -1.00) | Very high positive (negative) correlation |
| 0.70 to 0.90 (-0.70 to -0.90) | High positive (negative) correlation |
| 0.50 to 0.70 (-0.50 to -0.70) | Moderate positive (negative) correlation |
| 0.30 to 0.50 (-0.30 to -0.50) | Low positive (negative) correlation |
| 0.00 to 0.30 (0.00 to -0.30) | Negligible correlation |

### Significance

By measuring the *(r)* value we have quantified how strong the relationship between the two variables is. But that only tells us half the story, right? The reason is that the correlation coefficient we have calculated only represents a sample, not the entire population. So while we know what the correlation looks like in our sample, we cannot be sure that the correlation we have quantified is representative of the entire population. We are therefore now going to conduct a statistical **significance test** that can tell us whether our observation can be expected to hold in the entire population or not. Let us set up our test step by step:

#### Step 1: Formulating a hypothesis

In hypothesis testing we always have to formulate two hypotheses. One is called the **null hypothesis** while the other is called the **alternative hypothesis**. They usually state opposite outcomes:

The **null hypothesis (Ho)** is the hypothesis that we are trying to disprove. In our case, it is the hypothesis that there is not a significant linear correlation between the two variables in the given population.

The **alternative hypothesis (Ha)** is the hypothesis that we try to provide evidence for. In this example, we will try to prove that there is a linear correlation in the population.

#### Step 2: Conducting a T-Test

Without going into the heavy math, the **T-Test** (also called Student’s T-Test) allows us to test an assumption about an entire population. In our case, it will help us find out whether the correlation we have observed within our sample can be reproduced within the entire population.

The T-Test will give us a number *(t)* that we in turn have to interpret. The higher *(t)*, the higher the likelihood that the correlation is repeatable within the population. But why is that?

Simply put, the T-Test calculates how much two groups differ. In our case, we use the T-Test to **compare** one group with no correlation and one with a proven relationship. If the correlation we have observed occurred due to coincidence, *(t)* will be somewhere around 0, indicating that the two populations are the **same**.

However, if the two groups **differ **significantly, the t-value will be much higher and located at the extremes of the population. This indicates that the correlation we have observed is not due to coincidence and cannot be found in a population with no correlation:
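For the curious: the t-value for a correlation can be derived directly from *(r)* and the sample size n via t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below, on made-up data, checks that this formula reproduces the p-value scipy reports:

```python
import numpy as np
from scipy import stats

# Made-up sample with a genuine linear relationship
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + rng.normal(size=50)

r, p = stats.pearsonr(x, y)
n = len(x)

# t statistic for a correlation coefficient
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)

print(np.isclose(p, p_manual))  # True
```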

```python
# Creating a bell curve for illustration purposes
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.arange(-4, 4, 0.001)
y = norm.pdf(x, 0, 1)

fig, ax = plt.subplots(figsize=(9, 6))
ax.plot(x, y)
plt.show()
```

#### Step 3: Interpreting the p-value

Since *(t)* is a theoretical test statistic that depends on the population, we do not know whether the t-value we observe is high or not. In other words, a population with a range from 0 to 1,000 will have different t-values than a population with a range from 0 to 1. To overcome this little problem, every T-Test also calculates the so-called **p-value (p)**.

The **p-value** shows us the probability that we can observe this t-value within the population of *Ho*. You can also think of it as a game of darts. The higher *(p)*, the higher the chance that your dart lands somewhere near 0 in the graph above. The lower *(p)*, the lower the chances that you will hit the middle.

So the lower *(p)*, the more **likely** it is that our t-value is located at one of the extremes and is therefore “high enough” to indicate a significant difference between our two populations. Or in other words: the lower *(p)*, the higher the probability that our observed correlation has not occurred due to coincidence.

#### Step 4: Deciding upon the hypothesis

Let us get to the final part and look at the last question: **how low does (p) need to be?**

Again, it depends. It depends on the **significance level (a)** you set for your experiment. The significance level is your threshold for the p-value. Typically, *(a)* is set to 0.05 or 5%, meaning that a p-value larger than that will lead to accepting the null hypothesis. Or in other words: if you are not 95% sure that the two groups differ significantly, you **cannot reject** the null hypothesis.

Assume we receive a p-value of 0.002 while we have set *(a)* to 0.05. In this case *(p)* is **below** our significance level and we can come to the following conclusion:

We can reject *Ho* in favour of *Ha* because we have found that the correlation we observed in our sample can be expected in the larger population and did not occur by coincidence.
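The decision rule itself boils down to a one-line comparison; the numbers below are the ones from this example:

```python
# Reject H0 when the p-value falls below the significance level
alpha = 0.05      # significance level (a)
p_value = 0.002   # p-value from the example above

reject_h0 = p_value < alpha
print(reject_h0)  # True, so we can reject H0 in favour of Ha
```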

### Python example

But enough of the theory for now — let us look at a real life example and how to do all that with Python!

#### What do you need?

To work through this tutorial you will need the following:

- An **editor** like VS Code or Spyder
- **Python 3.8.3** or higher
- The **Automobile Data Set**

#### What will we do?

In this tutorial we will have a look at the **Automobile Data Set** provided by the UCI Machine Learning Repository. It is a data set that includes **technical** and **price** data as well as **insurance** information for cars from the year 1985. The data set is open source and typically used to train/test regression models.

We will perform a number of formattings on this data set to make it work for us and then explore the **correlation **of a few variables with the variable **price**.

#### Step 1: Importing working libraries

First, we need to import the libraries we will need to perform our analysis. If you have not installed them yet, please check the respective documentation and do so before executing the below code.

```python
# Import working libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
```

#### Step 2: Load the data

If you have not done so already, please download “imports-85.data” from the UCI website and save it in your working directory. Please note that you have to change “.data” to “.csv” to execute the following code.

```python
# Load data and show first 5 rows
# header=None because the file contains no header row
df = pd.read_csv("imports-85.csv", header=None)
df.head(5)
```

#### Step 3: Formatting the data

Have you noticed that our data set has no headers and that some values show a question mark? That is something we should **fix** before we run our correlation analysis, as our code otherwise probably won’t work and the obtained results might be distorted.

Let us, in a first step, **add** some **column names** so that it will be easier for us to understand the data set and work with it.

```python
# Add headers to df and display first 5 rows
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]
df.columns = headers
df.head(5)
```

As that seemed to work well, we should now have a look at those question marks in the normalized-losses column. We cannot use them for our further analysis, so we will **replace** them with **NaN** values. Note that we will call the pandas `replace` function on the **whole** dataframe and not just on the one column. We do that to ensure that we will not later encounter another question mark **somewhere else** in the data set.

```python
# Replace ? with NaN and show first 5 rows of df
df = df.replace('?', np.nan)
df.head(5)
```

Since we have no overview of **where** the missing values now are, we should run a quick analysis to find them. We can do that by **checking** the data frame for NaN values first and then printing out the result using a **for loop**.

```python
# Check df for missing values
missing_data = df.isnull()
missing_data.head(5)

# Run for loop to check for missing values
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
```

If you scroll through the analysis, you will find that we have **missing values (=True)** in the following **columns**:

- normalized-losses
- num-of-doors
- bore
- stroke
- horsepower
- peak-rpm
- price
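As a side note, `df.isnull().sum()` condenses the for loop above into a single line. The sketch below uses a hypothetical miniature frame standing in for the real data set:

```python
import pandas as pd
import numpy as np

# Hypothetical miniature frame standing in for the Automobile Data Set
demo = pd.DataFrame({
    "price": [13495, np.nan, 16500],
    "num-of-doors": ["two", "four", np.nan],
})

# Count NaNs per column in one line
missing_counts = demo.isnull().sum()
print(missing_counts["price"])         # 1
print(missing_counts["num-of-doors"])  # 1
```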

Since we need a functioning data set, we have to find a way to deal with the missing data. In general, there are two options. Either we **drop **columns or rows with missing data or we **replace **the missing data.

Since our data set is not large enough to simply drop all rows/columns with missing values, we will **replace** the missing values with the **means** of the remaining data.

The only exception will be the num-of-doors column as this data is **not** numerical. Here we will **replace **the missing data with the most common **variant**. In this case, that is “four”.
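Rather than hard-coding “four”, the most common variant can also be looked up programmatically; here is a sketch on a hypothetical sample of the column:

```python
import pandas as pd
import numpy as np

# Hypothetical sample of the num-of-doors column
doors = pd.Series(["two", "four", "four", np.nan, "four", "two"])

# value_counts() ignores NaN by default; idxmax() returns the most frequent label
most_common = doors.value_counts().idxmax()
print(most_common)  # four
```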

```python
# Replacing NaNs with mean / most common variant
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_norm_loss)

avg_bore = df["bore"].astype("float").mean(axis=0)
df["bore"] = df["bore"].replace(np.nan, avg_bore)

avg_stroke = df["stroke"].astype("float").mean(axis=0)
df["stroke"] = df["stroke"].replace(np.nan, avg_stroke)

avg_horsepower = df["horsepower"].astype("float").mean(axis=0)
df["horsepower"] = df["horsepower"].replace(np.nan, avg_horsepower)

avg_peakrpm = df["peak-rpm"].astype("float").mean(axis=0)
df["peak-rpm"] = df["peak-rpm"].replace(np.nan, avg_peakrpm)

avg_price = df["price"].astype("float").mean(axis=0)
df["price"] = df["price"].replace(np.nan, avg_price)

# num-of-doors is categorical, so use the most common variant instead
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, "four")
```

What’s left? Right! Checking the **data types**. We can do that easily by executing the following code:

```python
# Checking df types
df.dtypes
```

As you can see, some columns still have the **wrong** data type (e.g. bore should contain floats, not strings). So let us correct that quickly:

```python
# Format data types
df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
```

There is much more we could do in terms of data **standardization** and **normalization**, but we do not need that for this tutorial, so let us move on.

#### Step 4: Explore the data

Let us quickly recap what we want to do: We want to look for **variables** that have some kind of **correlation** to the **price** of cars, right? So we first need to check the **type** and **form** of the relationship between the two variables, and the easiest way to do that is to draw some more graphs!

To do that we will use the **seaborn** library and its **regplot** function. This function plots the **data points** and a **linear regression line** for our selected variables on a 2D graph. While this does not help us quantify the correlation, it helps us **identify** the type and form of the relationship between the two variables.

But without further ado, let’s get started and have a look at the price in relation to:

**Engine size**

Have a look at the graph below. We can observe that the linear regression line **fits** the plotted data quite **well**. We can also observe that the regression line has a **steep upwards gradient**. But what does that tell us considering type and form of correlation?

While the clearly increasing gradient of the regression line shows us that we can expect a **positive correlation** between engine size and price, the good fit of the regression line indicates that the variables have a **linear correlation**.

```python
# Create a plot for engine-size and price
sns.regplot(x="engine-size", y="price", data=df)
```

**Highway-mpg**

The graph below shows the regression plot for highway miles per gallon (mpg) in relation to the cars’ prices. Here, we can observe that the regression line is **decreasing**, indicating a **negative correlation**. However, we can see that the regression line does **not cover** the outliers between 0 and 25 on the x-axis. This indicates that a linear regression is not a good fit and suggests that the data **might** be **non-linear**.

```python
# Create a plot for highway-mpg and price
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
```

**Peak-rpm**

The graph for peak-rpm and price looks different again. We see that the data points show great **variation** and the regression line is almost **horizontal**. Based on that observation we **cannot** determine what form of correlation to expect, but **assuming** there is a linear relationship, the nearly flat regression line indicates a correlation coefficient around **0**. We would therefore conclude that there is **no correlation** between peak rpm and price.

```python
# Create a plot for peak-rpm and price
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
```

Since it would take ages to look at each variable pair individually, we can shorten the process by using pandas’ **corr()** function:

```python
# Using pandas corr() function (numeric_only keeps non-numeric columns out)
df.corr(numeric_only=True)
```

As you can see, the *corr()* function creates a **correlation matrix** with all numeric variables in our data set. You might ask: why didn’t we do this from the beginning?

The answer is relatively simple. While the *corr()* function calculates **Pearson correlation coefficients** by default, it does not give you any information about the **form** of correlation. It can therefore serve as **a great starting point** for identifying potentially high correlations, but checking the variable pairs individually will still be necessary.
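One way to make such a matrix easier to scan is a seaborn heatmap. The sketch below uses a small random frame as a stand-in; with the real data set you would pass the correlation matrix of `df` instead:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical numeric frame; column d is built to correlate strongly with a
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
demo["d"] = demo["a"] * 2 + rng.normal(scale=0.1, size=100)

corr = demo.corr()

# A heatmap makes large coefficients easy to spot at a glance
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```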

#### Step 5: Calculate correlation and (p) values

Let us quickly summarize the outcome of our above analysis before we move on:

- We can expect a **positive, linear correlation** between engine size and price.
- We have observed a **negative correlation** between highway-mpg and price that seems to be **non-linear**.
- We assume **no correlation** between peak-rpm and price, and we have **no reliable information** on the form of correlation.

We can thus conclude that the variables highway-mpg and peak-rpm seem **not suitable** for further analysis with the Pearson correlation coefficient.
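As an aside, the Spearman test mentioned earlier only assumes a monotonic relationship, so it would be a natural fallback for a pair like highway-mpg and price. A sketch on made-up monotonic, non-linear data:

```python
import numpy as np
from scipy import stats

# Hypothetical monotonic but non-linear data: y falls as x rises, along a curve
rng = np.random.default_rng(2)
x = np.linspace(1, 10, 100)
y = 1 / x + rng.normal(scale=0.01, size=100)

r_pearson, _ = stats.pearsonr(x, y)     # assumes a linear relationship
rho_spearman, _ = stats.spearmanr(x, y) # only assumes a monotonic one

# Spearman captures the monotonic trend more fully than Pearson here
print(round(r_pearson, 2), round(rho_spearman, 2))
```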

To have a closer look at the engine size, we can compute the Pearson correlation coefficient as well as the p-value with the help of the scipy.stats library.

```python
# Calculate Pearson coefficient and p-value for engine size vs. price
pearson_coef, p_value = stats.pearsonr(df["engine-size"], df["price"])
print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value)
```

As a result, we receive a Pearson correlation coefficient of **~ 0.87** and a p-value of pretty much **0**.

Since our coefficient is >0.7 but <0.9, we conclude that we are observing a **high positive** correlation between engine size and price.

Finally, we can observe that our p-value is definitely **below** our significance level *(a)* of 0.05. We can therefore conclude that the correlation we calculated above is not a coincidence and is therefore **significant**.

### Last remarks

This would not be a good tutorial if I did not make the following, famous remark:

> Correlation does not imply causation.

We have learned that **correlation** is a measure to describe the extent of a relationship between two variables. **Causation**, on the other hand, is a cause-and-effect relationship between two variables.

Even if we have observed a correlation, we **cannot** conclude that one variable causes a change in the other. Unless we have been able to account for every other factor, we **have to** assume that there is still a chance of **coincidence** or that a **third factor** might be causing both of our variables to change.

It is therefore always important to conduct **a more comprehensive experiment **before drawing final conclusions regarding correlation figures.

*Python statistics for beginners: Pearson correlation coefficient* was originally published in Towards Data Science on Medium.