A really important part of any machine learning model is the data, especially the features used. In this article, we will go over where feature engineering falls in the machine learning pipeline, and how to do some feature engineering on numbers using binning, transformations, and normalization. The real benefit of feature engineering is being able to improve your models' predictions without making the models more complex. We will do some hands-on coding with data from Goodreads.com, which I found on Kaggle.com [1].

The data used in this article is about book ratings, and it can be downloaded here: https://www.kaggle.com/jealousleopard/goodreadsbooks

### High Level Overview of Machine Learning Pipeline

A general machine learning pipeline is composed of 5 steps:

- **Ask** — What the business or customer wants to do (e.g. predict sales for next quarter, detect cancer, etc.)
- **Data** — A piece of reality in digital form. For example, the number of sales in a quarter.
- **Features** — Data transformed into numbers, so that a model may understand it. Some of you might say, "Hey, my model accepts red or green as a feature!" Well, under the hood of your model, the code actually transforms that red or green into a 0 or 1, which is typically done with one-hot encoding.
- **Model Selection** — The model is the tool that bridges the gap between data and predictions. For those of you comfortable with math, think of it like a function: your data is the input, the model is the function, and your prediction is the output.
- **Predictions** — The answer to the request of the "ask." To relate it to the examples above: 10,000 units sold in the first quarter (Q1), or patient XYZ has cancer.
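As a quick sketch of that last point about features, here is a minimal one-hot encoding example using `pandas.get_dummies` on a made-up color column (the column and values are illustrative, not from the book data):

```python
import pandas as pd

# Hypothetical categorical feature: a color column with "red" or "green"
df = pd.DataFrame({'color': ['red', 'green', 'green', 'red']})

# One-hot encoding turns each category into its own 0/1 column
encoded = pd.get_dummies(df['color'])
print(encoded)
```

Each row now carries a 1 in the column matching its original category and a 0 everywhere else, which is exactly the numeric form a model can consume.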

The figure below shows a similar process for constructing a machine learning pipeline, taken from Microsoft [2]:

The feature engineering techniques shown below (e.g. binning, log transformation, feature scaling) are simply extra tools for you to manipulate the data in order to, hopefully, make better predictions.

### Binning

Binning is great when you have a data set whose values need to be clumped together into groups. Making these groups, or bins, can help the model improve prediction accuracy. For example, say you have a range of incomes and want to predict credit default; one way to apply binning is to group the incomes into tax-bracket-style levels. Now, we’ll begin using binning with histograms on book ratings counts.
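As a quick sketch of the income example, `pandas.cut` is one way to bin a numeric column. The incomes, bracket edges, and labels below are made up for illustration, not real tax brackets:

```python
import pandas as pd

# Hypothetical incomes and bracket edges (illustrative, not real tax brackets)
incomes = pd.Series([9000, 42000, 87000, 210000])
bins = [0, 10000, 40000, 85000, 165000, float('inf')]
labels = ['bracket_1', 'bracket_2', 'bracket_3', 'bracket_4', 'bracket_5']

# Each income is replaced by the label of the bin it falls into
income_bracket = pd.cut(incomes, bins=bins, labels=labels)
print(income_bracket.tolist())
```

The model then sees a small set of categories instead of a raw continuous range, which can smooth out noise in the feature.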

```python
# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
import sklearn.preprocessing as pp

# Load data into a dataframe
books_file = pd.read_csv('books.csv', error_bad_lines=False)
```

Pass “error_bad_lines=False” to the pd.read_csv function, since there are a few unclean rows. Otherwise, Python will yell at us.

```python
# Taking a look at the data
books_file.head()
```

Now, we’ll start plotting the histograms.

```python
# Plotting a histogram - no feature engineering
sns.set_style('whitegrid')  # Picking a background color
fig, ax = plt.subplots()
books_file['ratings_count'].hist(ax=ax, bins=10)  # State how many bins you want
ax.tick_params(labelsize=10)
ax.set_xlabel('Ratings Count')
ax.set_ylabel('Event')  # How many times each ratings count happened
```

What’s wrong here? It appears that most of the counts are heavily weighted toward the lower ratings counts. That makes sense, since there are a lot of books that only get a few ratings. What if we want to capture more of the heavily reviewed books in our graph? The answer: log scaling.

### Log Transformation

```python
# Plotting a histogram - log scale
sns.set_style('whitegrid')  # Picking a background color
fig, ax = plt.subplots()
books_file['ratings_count'].hist(ax=ax, bins=10)  # State how many bins you want
ax.set_yscale('log')  # Rescaling to log, since large numbers can skew the weightings for some models
ax.tick_params(labelsize=10)
ax.set_xlabel('Ratings Count')
ax.set_ylabel('Event')  # How many times each ratings count happened
```

Now we’ll make rating predictions from normal values vs. log transformed values of rating counts.

For the code below, we are adding +1, since the log of 0 is undefined, which would cause our computer to blow up (kidding, sort of).
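A quick toy sketch (with made-up counts, not the book data) of why the +1 helps and how log10 compresses values that span several orders of magnitude:

```python
import numpy as np

# Toy ratings counts spanning several orders of magnitude
counts = np.array([0, 9, 99, 999, 999999])

# +1 so a count of 0 maps to log10(1) = 0 instead of -inf
log_counts = np.log10(counts + 1)
print(log_counts)  # [0. 1. 2. 3. 6.]
```

Values that differed by factors of up to a million now sit within a span of six, which keeps a few blockbuster books from dominating the feature.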

```python
books_file['log_ratings_count'] = np.log10(books_file['ratings_count'] + 1)
```

```python
# Using ratings count to predict average rating. Cross-validation (CV) is normally 5 or 10 folds.
base_model = linear_model.LinearRegression()
base_scores = cross_val_score(base_model, books_file[['ratings_count']],
                              books_file['average_rating'], cv=10)

log_model = linear_model.LinearRegression()
log_scores = cross_val_score(log_model, books_file[['log_ratings_count']],
                             books_file['average_rating'], cv=10)

# Display the R^2 values. STD*2 for an approximate 95% confidence level
print("R^2 of base data: %0.4f (+/- %0.4f)" % (base_scores.mean(), base_scores.std() * 2))
print("R^2 of log data: %0.4f (+/- %0.4f)" % (log_scores.mean(), log_scores.std() * 2))
```

```
R² of base data: -0.0024 (+/- 0.0125)
R² of log data: 0.0107 (+/- 0.0365)
```

Both of the models are quite terrible, which is not too surprising given we are only using one feature. I find it kind of funny seeing a negative R squared from a standard statistics 101 perspective. A negative R squared in our case means the fitted line actually predicts the held-out data worse than simply predicting the mean of the target.
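To make that concrete, here is a tiny sketch (toy numbers, not the book data) of how R squared can go negative. R² = 1 − SS_res/SS_tot, and if a model's squared error on held-out data exceeds the error of just predicting the mean, the ratio tops 1 and R² dips below zero:

```python
import numpy as np

# Toy held-out targets and some poor model predictions (illustrative only)
y_true = np.array([3.0, 4.0, 5.0])
y_pred = np.array([5.0, 2.0, 7.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # model's squared error
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
r_squared = 1 - ss_res / ss_tot
print(r_squared)  # -5.0: the model does worse than a flat mean prediction
```

This is why cross-validated R² (unlike in-sample R² from a stats class) has no floor at zero.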

### Feature Scaling

We’ll go over three ways to feature scale: min-max scaling, standardization, and L2 normalization.

```python
# Min-max scaling
books_file['minmax'] = pp.minmax_scale(books_file[['ratings_count']])

# Standardization
books_file['standardized'] = pp.StandardScaler().fit_transform(books_file[['ratings_count']])

# L2 normalization
books_file['l2_normalization'] = pp.normalize(books_file[['ratings_count']], axis=0)  # Needs axis=0 to normalize the column
```

```python
# Plotting histograms of scaled features
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(8, 7))
fig.tight_layout(h_pad=2.0)

# Normal ratings counts
books_file['ratings_count'].hist(ax=ax1, bins=100)
ax1.tick_params(labelsize=10)
ax1.set_xlabel('Review ratings count', fontsize=10)

# Min-max scaling
books_file['minmax'].hist(ax=ax2, bins=100)
ax2.tick_params(labelsize=10)
ax2.set_xlabel('Min max scaled ratings count', fontsize=10)

# Standardization
books_file['standardized'].hist(ax=ax3, bins=100)
ax3.tick_params(labelsize=10)
ax3.set_xlabel('Standardized ratings count', fontsize=10)

# L2 normalization
books_file['l2_normalization'].hist(ax=ax4, bins=100)
ax4.tick_params(labelsize=10)
ax4.set_xlabel('L2 normalized ratings count', fontsize=10)
```

Now, we’ll make predictions from our 3 feature scaled data.

```python
# Using the scaled ratings counts to predict average rating, again with 10-fold CV
base_model = linear_model.LinearRegression()
base_scores = cross_val_score(base_model, books_file[['ratings_count']],
                              books_file['average_rating'], cv=10)

minmax_model = linear_model.LinearRegression()
minmax_scores = cross_val_score(minmax_model, books_file[['minmax']],
                                books_file['average_rating'], cv=10)

standardized_model = linear_model.LinearRegression()
standardized_scores = cross_val_score(standardized_model, books_file[['standardized']],
                                      books_file['average_rating'], cv=10)

l2_normalization_model = linear_model.LinearRegression()
l2_normalization_scores = cross_val_score(l2_normalization_model, books_file[['l2_normalization']],
                                          books_file['average_rating'], cv=10)

# Display R^2 values. STD*2 for an approximate 95% confidence level
print("R^2 of base data: %0.4f (+/- %0.4f)" % (base_scores.mean(), base_scores.std() * 2))
print("R^2 of minmax scaled data: %0.4f (+/- %0.4f)" % (minmax_scores.mean(), minmax_scores.std() * 2))
print("R^2 of standardized data: %0.4f (+/- %0.4f)" % (standardized_scores.mean(), standardized_scores.std() * 2))
print("R^2 of L2 normalized data: %0.4f (+/- %0.4f)" % (l2_normalization_scores.mean(), l2_normalization_scores.std() * 2))
```

```
R² of base data: -0.0024 (+/- 0.0125)
R² of minmax scaled data: 0.0244 (+/- 0.0298)
R² of standardized data: 0.0244 (+/- 0.0298)
R² of L2 normalized data: 0.0244 (+/- 0.0298)
```

A slight improvement over the log transformation. Since all three scaling types produce the same shape graphically, it is no surprise they give the same R squared. A few things to note about each scaling method: min-max scaling squeezes all feature values into the range 0 to 1. Standardization rescales a feature to mean 0 and variance 1 (normal-distribution style). L2 normalization divides the feature by its Euclidean (L2) norm, so the column has unit length. An important note: feature scaling does not change the shape of your feature's distribution, since under the hood it only shifts by and divides by constants.
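Here is a minimal sketch of those three formulas applied by hand to a toy array (not the book data), showing each is just a shift and/or divide, so the relative spacing of values — the shape — is preserved:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0])  # toy feature values

minmax = (x - x.min()) / (x.max() - x.min())    # squeezed into [0, 1]
standardized = (x - x.mean()) / x.std()         # mean 0, variance 1
l2_normalized = x / np.linalg.norm(x)           # unit Euclidean length

print(minmax)
print(standardized)
print(l2_normalized)
```

In every case the gaps between consecutive values keep the same proportions as in the raw data, which is why the histograms above all look alike.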

### Conclusion

Awesome, we covered a brief overview of the machine learning pipeline and where feature engineering fits in. Then we went over binning, log transformation, and various forms of feature scaling. Along the way, we also viewed how feature engineering affects linear regression model predictions with book reviews.

Personally, I find min-max scaling to work well when dealing with probability-like features alongside the other features in a dataset. On a similar note, standardization and L2 normalization work well for scaling really big numbers down to ranges comparable with the other features of the dataset being analyzed.

*Disclaimer: All things stated in this article are of my own opinion and not of any employer.*

[1] Kaggle, Goodreads-books (2019), https://www.kaggle.com/jealousleopard/goodreadsbooks

[2] Microsoft, What are machine learning pipelines? (2019), https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines

Machine Learning Pipelines: Feature Engineering Numbers was originally published in Towards Data Science on Medium.