
Data normalization is an elegant technique for reducing data inconsistency, especially when we are dealing with a huge dataset.
In this article, I will walk you through the steps I followed and the choices I made along the way.

First of all, let me introduce the issue, describe the expected outcome, and then explain my steps to reach the goal.

Please find here the dummy data that we will use in this article.

The issue:

Suppose we wish to normalize the scores of points of interest (POIs) by reshaping the distribution of the scores.

The expected goals are:

  • All POIs should end up with a score between 0 and 10
  • There should be significantly more POIs with a score between k-1 and k than between k and k+1 for all k in [1..9]
  • Very few POIs should end up with a score between 9 and 10 (such high scores should really only go to the very best POIs in the world — say the top 0.1%-0.2% of POIs)

Note that we have lots of POIs in our database about which we know very little. For such POIs we have no indication that they’re any good (so we score them low) or that they’re any better/worse than other POIs that we don’t know much about (so they should end up scoring similarly). Hence the bottom-heavy distribution.

We will use a Python script to read POI scores from a file and normalize them according to the above criteria.
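If you don't have the dummy data file at hand, the following sketch generates a stand-in. I am assuming here that raw_scores.txt holds one "poi_id score" pair per line, space-separated (which is what the read_csv call below expects), and I use an exponential draw purely as a placeholder for bottom-heavy raw scores:

import numpy as np

# Generate a stand-in for the dummy data (assumed layout: "poi_id score" per line)
rng = np.random.default_rng(42)
n = 91897  # same total row count as the original dummy data
scores = np.clip(rng.exponential(scale=1.5, size=n), 0, 10)  # placeholder bottom-heavy scores

with open('raw_scores.txt', 'w') as f:
    for poi_id, score in enumerate(scores):
        f.write(f"{poi_id} {score:.4f}\n")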

My strategy:

First of all, I started by importing my Python libraries and reading the data:

import pandas as pd
import matplotlib.pyplot as plt
from scipy import special
import numpy as np


# Read the data: column 0 is the POI id, column 1 is the raw score
data = pd.read_csv('raw_scores.txt', sep=' ', header=None)
x = data[1]

After that, I performed some standard data analysis operations: checking the data shape, checking for duplicated lines, and checking for outliers (values outside the range 0 to 10).

# Check data description and shape
print(data.describe())
print(data.shape)


# Check for duplicated lines
duplicateID = data[data.duplicated(subset=0)]
print("Duplicated lines count is:", duplicateID.shape[0])


## Check for outliers (outside the range (0, 10))
outl = data[data[1] >= 10.0]
print("Outliers count (POI score >= 10) =", outl.shape[0])
outl = data[data[1] <= 0.0]
print("Outliers count (POI score <= 0) =", outl.shape[0])

Then, I plotted the actual distribution to see what it looks like and how far it is from the requested result.

# Draw a bar plot of counts per [k, k+1] bin to see the data
data[1].value_counts(bins=[0,1,2,3,4,5,6,7,8,9,10]).sort_index().plot.bar(figsize=(10,5))
plt.xlabel('POI score range')
plt.ylabel('Occurrences')
plt.tight_layout()
plt.show()

Next, I started some normalization experiments, taking into account the requirements below:

Requirement 1: All POIs should end up with a score between 0 and 10.

Requirement 2: There should be significantly more POIs with a score between k-1 and k than between k and k+1, for all k in [1..9].
In our experiments we define dif = the minimum difference between the counts in [k-1, k] and [k, k+1]; it should be greater than 0 (we raise a bingo if dif >= 5).

Requirement 3: Very few POIs should end up with a score between 9 and 10 (t2 = the number of POIs in [9, 10], which should be very small).

Requirement 4: The relative order of the POI scores should be preserved.
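Although I did not need it in the final script, Requirement 4 can be verified with a small helper once a candidate transform has been applied; it simply checks that sorting the transformed scores by the raw scores yields a non-decreasing sequence:

def order_preserved(raw, transformed, tol=1e-12):
    # True if the transformed scores keep the ranking of the raw scores
    raw = np.asarray(raw)
    transformed = np.asarray(transformed)
    return bool(np.all(np.diff(transformed[np.argsort(raw)]) >= -tol))

Calling order_preserved(x, data[2]) after the final normalization should return True if the chosen transform is indeed monotonic.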

To do this experimentation, I created some useful functions :

countBin(): calculates how many POIs fall in each bin [k, k+1]


def countBin(l, i):
    # Count how many values of l fall into the bin [i, i+1]
    counts = l.value_counts(bins=[i, i + 1]).tolist()
    if len(counts) == 0:
        return 0
    else:
        return counts[0]

check_requirements(): checks whether the results are aligned with the above requirements.

def check_requirements(l):
    t1 = countBin(l, 0)
    print("range [ 0 - 1 ]", t1)
    t2 = countBin(l, 9)  # number of POIs scoring in [9, 10]
    dif = 10
    for i in range(1, 10):
        print("range [", i, "-", i + 1, "]", countBin(l, i))
        t1 = t1 + countBin(l, i)
        if dif > (countBin(l, i - 1) - countBin(l, i)):
            dif = countBin(l, i - 1) - countBin(l, i)
    print("total=", t1, "dif=", dif, "t2=", t2)
    # t1 == 91897: every one of the 91897 POIs lands in [0, 10] (Requirement 1)
    # dif >= 5: each bin [k-1, k] holds at least 5 more POIs than [k, k+1] (Requirement 2)
    # t2 in range(5, 250): very few POIs score in [9, 10] (Requirement 3)
    if (t1 == 91897) and (dif >= 5) and (t2 in range(5, 250)):
        print("==========================================")
        print("============== BINGO =====================")
        print("==========================================")

Experiment_dis(): experiments with different distribution models and tries to find the model parameters that best fit the requirements.

def Experiment_dis(distribution, l, n, m, step):
    # Try shape parameters i in [n, m) with the given step for the chosen distribution
    for i in np.arange(n, m, step):
        if distribution == "zipfian":
            y = (l + 1) ** (-i) / special.zetac(i)
        if distribution == "pareto":
            y = i / l ** (i + 1)
        if distribution == "binomial":
            y = (1 / i) ** (0.4 * l)
        if distribution == "lomax":
            y = 1 / (i + l) ** 4
        if distribution == "weibull":
            y = (5 / i) * (l / i) ** (5 - 1) * np.exp(-(l / i) ** 5)

        y = 1 / y  # preserve the relative order (Requirement 4), since all these mappings reverse it
        y = 10 * (y - min(y)) / (max(y) - min(y))  # rescale to [0, 10] (Requirement 1)
        print("i=", i)
        check_requirements(y)
        print("-----")
    data[2] = y
    print(data.head())

The distribution models that we experiment with here are:
the binomial, Pareto, Lomax, Weibull, and Zipfian distributions.

Why did we choose exactly these distribution models?

Because their graphical representations suggest shapes that fit closely with the requirements; a quick sketch of their curves is shown below:
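The following sketch plots the candidate mapping curves over the raw score range, using the best shape parameters found in the experiments below purely for illustration:

# Quick sketch to visualise the shape of the candidate mappings (before inversion/rescaling)
import numpy as np
import matplotlib.pyplot as plt
from scipy import special

s = np.linspace(0.01, 10, 500)  # raw score axis (start slightly above 0 to avoid division issues)
curves = {
    "zipfian (i=2.6)": (s + 1) ** (-2.6) / special.zetac(2.6),
    "pareto (i=1.2)": 1.2 / s ** (1.2 + 1),
    "binomial (i=1.8)": (1 / 1.8) ** (0.4 * s),
    "lomax (i=7.7)": 1 / (7.7 + s) ** 4,
}

for name, y in curves.items():
    plt.plot(s, y / y.max(), label=name)  # rescale each curve to [0, 1] for comparison
plt.xlabel("raw POI score")
plt.ylabel("relative weight")
plt.legend()
plt.tight_layout()
plt.show()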

So we ran our experiments:

Experiment_dis("zipfian",x,1,5,0.1)
#best score obtained is dif=10 t2=7 for i=2.6

Experiment_dis("pareto",x,1,5,0.1)
#best score obtained is dif=9 t2=7 for i=1.2

Experiment_dis("binomial",x,1,5,0.1)
#best score obtained is dif=10 t2=6 for i=1.8

Experiment_dis("lomax",x,1,10,0.1)
#best score obtained is dif=9 t2=7 for i=7.7

Experiment_dis("weibull",x,1,2,0.1)
# Did not give good results, hence not suitable

We then choose the Zipfian model with shape parameter 2.6, since it gives the best score with regard to the requirements.
Experiment_dis("zipfian",x,2.5,2.6,0.1)

Here is what our POI scores look like after normalization:


## Draw a plot of the new distribution after normalisation using the Zipfian model
data[2].value_counts(bins=[0,1,2,3,4,5,6,7,8,9,10]).sort_index().plot.bar(figsize=(10,5))
plt.xlabel('POI score range')
plt.ylabel('Occurrences')
plt.tight_layout()
plt.show()


## Saving the output into CSV
data.to_csv(r'submission1.csv')

I have also shared the code in a public gist.

This article gives a simple insight into how normalization can organize our data and fit it more closely to reality without losing information. In general, normalization is a requirement for many machine learning algorithms. If you are unsure which type of normalization suits your data, see the Feature scaling resources for more details.
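For instance, the two most common generic scalers take only a couple of lines (a generic illustration, not specific to the POI dataset):

v = np.array([3.0, 7.5, 1.2, 9.9])

minmax = (v - v.min()) / (v.max() - v.min())  # min-max scaling to [0, 1]
zscore = (v - v.mean()) / v.std()             # z-score standardisation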

As always, I hope you’ve learned something new :)

Cheers.

