A really important part of any machine learning model is the data, especially the features used. In this article, we will go over where feature engineering falls in the machine learning pipeline and how to do some feature engineering on numbers using binning, transformations, and normalization. The real benefit of feature engineering is being able to improve your models' predictions without making the models more complex. We will do some hands-on coding with data taken from Goodreads.com, which I found on Kaggle.com [1].

The data used in this article is about book ratings, and it can be downloaded here: https://www.kaggle.com/jealousleopard/goodreadsbooks

High Level Overview of Machine Learning Pipeline

A general machine learning pipeline is composed of 5 steps:

  1. Ask — What the business or customer wants to do (e.g. predict sales for next quarter, detect cancer, etc.)
  2. Data — A piece of reality in digital form. For example, the number of sales in a quarter.
  3. Features — Data transformed into numbers so that a model can understand it. Some of you might say, “Hey, my model accepts red or green as a feature!” Well, under the hood of your model, the code actually transforms that red or green into a 0 or 1, which is typically done with one hot encoding (see the short sketch after this list).
  4. Model Selection — The model is the tool you use to bridge the gap between data and predictions. For those of you comfortable with math, think of it like a function: your data is the input, the model is the function, and your prediction is the output.
  5. Predictions — The answer to the request of the “ask.” To relate it to the examples above: 10,000 units sold in the first quarter (Q1), or patient XYZ has cancer.
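As a quick aside to step 3, here is a minimal sketch of that red or green example using pandas' get_dummies function; the column name color and the tiny dataframe are made up purely for illustration.

#One hot encoding sketch on a made-up categorical column
import pandas as pd
colors = pd.DataFrame({'color': ['red', 'green', 'green', 'red']})
#Each category becomes its own 0/1 column
encoded = pd.get_dummies(colors, columns=['color'], dtype=int)
print(encoded) #Columns: color_green, color_red

scikit-learn's OneHotEncoder does the same job when you want the encoding to live inside a modeling pipeline.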

The figure below shows a similar process for constructing a machine learning pipeline, which was taken from Microsoft [2]:

The feature engineering techniques shown below (e.g. binning, log transformation, feature scaling) are simply extra tools for you to manipulate the data in order to hopefully make better predictions.

Binning

Binning is great when you have a dataset whose values need to be clumped together into groups. Making these groups, or bins, can help a model improve its prediction accuracy. For example, say you have a range of incomes and want to predict credit default. One way to apply binning is to group the incomes into tax-bracket levels, as in the short sketch below. After that, we'll start binning the book ratings count by way of histograms.
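Here is a minimal sketch of that idea using pandas' cut function; the income values and bracket edges are made up purely for illustration and are not real tax brackets.

#Binning sketch on made-up income values (illustrative brackets only)
import pandas as pd
incomes = pd.Series([9500, 42000, 87000, 165000, 520000])
bracket_edges = [0, 10000, 40000, 85000, 160000, float('inf')]
bracket_labels = ['bracket_1', 'bracket_2', 'bracket_3', 'bracket_4', 'bracket_5']
income_brackets = pd.cut(incomes, bins=bracket_edges, labels=bracket_labels)
print(income_brackets) #Each income is replaced by its bracket label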

#Load Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
import sklearn.preprocessing as pp
#load data into dataframe
books_file = pd.read_csv('books.csv', error_bad_lines=False)

Pass error_bad_lines=False to the pd.read_csv function, since there are a few unclean rows in the file. Otherwise, Python will yell at us.

#Taking a look at the data
books_file.head()

Now, we’ll start plotting the histograms.

#Plotting a histogram - no feature engineering
sns.set_style('whitegrid') #Picking a background color
fig, ax = plt.subplots()
books_file['ratings_count'].hist(ax=ax, bins=10) #State how many bins you want
ax.tick_params(labelsize=10)
ax.set_xlabel('Ratings Count')
ax.set_ylabel('Event') #How many times the specific numbers of ratings count happened

What’s wrong here? It appears that most of the counts are heavily weighted toward the lower end of the ratings count. That makes sense, since there are a lot of books that only get a few ratings. What if we want to capture more of the popularly reviewed books in our graph? The answer: log scaling.

Log Transformation

#Plotting a histogram - log scale
sns.set_style('whitegrid') #Picking a background color
fig, ax = plt.subplots()
books_file['ratings_count'].hist(ax=ax, bins=10) #State how many bins you want
ax.set_yscale('log') #Rescaling to log, since large numbers can mess up the weightings for some models
ax.tick_params(labelsize=10)
ax.set_xlabel('Ratings Count')
ax.set_ylabel('Event') #How many times the specific numbers of ratings count happened

Now we’ll make rating predictions from normal values vs. log transformed values of rating counts.

For the code below, we are adding +1, since the log of 0 is undefined, which would cause our computer to blow up (kidding, sort of).

books_file['log_ratings_count'] = np.log10(books_file['ratings_count']+1)
#Using ratings count to predict average rating.  Cross Validation (CV) is normally 5 or 10.
base_model = linear_model.LinearRegression()
base_scores = cross_val_score(base_model, books_file[['ratings_count']],
                              books_file['average_rating'], cv=10)
log_model = linear_model.LinearRegression()
log_scores = cross_val_score(log_model, books_file[['log_ratings_count']],
                             books_file['average_rating'], cv=10)
#Display the R^2 values.  STD*2 for 95% confidence level
print("R^2 of base data: %0.4f (+/- %0.4f)" % (base_scores.mean(), base_scores.std()*2))
print("R^2 of log data: %0.4f (+/- %0.4f)" % (log_scores.mean(), log_scores.std()*2))
R^2 of base data: -0.0024 (+/- 0.0125)
R^2 of log data: 0.0107 (+/- 0.0365)

Both of the models are quite terrible, which is not too surprising when only one feature is used. Coming from a standard statistics 101 perspective, I find it kind of funny to see a negative R squared. A negative R squared here means that the straight line fit on that one feature actually predicts worse than simply predicting the mean rating for every book.
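To make that concrete, here is a minimal sketch with made-up numbers showing how scikit-learn's r2_score drops below zero once predictions are worse than simply predicting the mean.

#Negative R^2 sketch with made-up numbers
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([3.9, 4.0, 4.1, 4.2])
#Always predicting the mean gives an R^2 of exactly 0
mean_pred = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, mean_pred)) #0.0
#Predictions that miss the trend entirely score below 0
bad_pred = np.array([4.4, 3.5, 4.5, 3.6])
print(r2_score(y_true, bad_pred)) #Clearly negative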

Feature Scaling

We’ll go over three ways to feature scale: min max scaling, standardization, and L2 normalization.

#Min-max scaling
books_file['minmax'] = pp.minmax_scale(books_file[['ratings_count']])
#Standardization
books_file['standardized'] = pp.StandardScaler().fit_transform(books_file[['ratings_count']])
#L2 Normalization
books_file['l2_normalization'] = pp.normalize(books_file[['ratings_count']], axis=0) #axis=0 normalizes the column (the feature) instead of each row
#Plotting histograms of scaled features
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4,1, figsize=(8, 7))
fig.tight_layout(h_pad=2.0)
#Normal rating counts
books_file['ratings_count'].hist(ax=ax1, bins=100)
ax1.tick_params(labelsize=10)
ax1.set_xlabel('Review ratings count', fontsize=10)
#Min max scaling
books_file['minmax'].hist(ax=ax2, bins=100)
ax2.tick_params(labelsize=10)
ax2.set_xlabel('Min max scaled ratings count', fontsize=10)
#Standardization
books_file['standardized'].hist(ax=ax3, bins=100)
ax3.tick_params(labelsize=10)
ax3.set_xlabel('Standardized ratings count', fontsize=10)
#L2 Normalization
books_file['l2_normalization'].hist(ax=ax4, bins=100)
ax4.tick_params(labelsize=10)
ax4.set_xlabel('L2 normalized ratings count', fontsize=10)

The graph above shows histograms of the raw data, the min max scaled transformation, the standardized transformation, and the L2 normalized transformation. Overall, the transformations look pretty similar, but you would pick one over the others depending on the other features in your dataset.

Now, we’ll make predictions from our three feature-scaled versions of the data.

#Using ratings count to predict average rating. Cross Validation (CV) is normally 5 or 10.
base_model = linear_model.LinearRegression()
base_scores = cross_val_score(base_model, books_file[['ratings_count']],
                              books_file['average_rating'], cv=10)
minmax_model = linear_model.LinearRegression()
minmax_scores = cross_val_score(minmax_model, books_file[['minmax']],
                                books_file['average_rating'], cv=10)
standardized_model = linear_model.LinearRegression()
standardized_scores = cross_val_score(standardized_model, books_file[['standardized']],
                                      books_file['average_rating'], cv=10)
l2_normalization_model = linear_model.LinearRegression()
l2_normalization_scores = cross_val_score(l2_normalization_model, books_file[['l2_normalization']],
                                          books_file['average_rating'], cv=10)
#Display R^2 values. STD*2 for 95% confidence level
print("R^2 of base data: %0.4f (+/- %0.4f)" % (base_scores.mean(), base_scores.std()*2))
print("R^2 of minmax scaled data: %0.4f (+/- %0.4f)" % (minmax_scores.mean(), minmax_scores.std()*2))
print("R^2 of standardized data: %0.4f (+/- %0.4f)" % (standardized_scores.mean(), standardized_scores.std()*2))
print("R^2 of L2 normalized data: %0.4f (+/- %0.4f)" % (l2_normalization_scores.mean(), l2_normalization_scores.std()*2))
R^2 of base data: -0.0024 (+/- 0.0125)
R^2 of minmax scaled data: 0.0244 (+/- 0.0298)
R^2 of standardized data: 0.0244 (+/- 0.0298)
R^2 of L2 normalized data: 0.0244 (+/- 0.0298)

A slight improvement over the log transformation. Since the scaling methods produced the same shape graphically, it's no surprise they gave the same R squared values. A few things to note about each scaling method. Min max scaling squeezes all feature values into the range 0 to 1. Standardization rescales a feature to have mean = 0 and variance = 1 (the scale of a standard normal distribution, though it does not make the distribution normal). L2 normalization rescales the feature so that its Euclidean (L2) norm equals 1. An important note: feature scaling does not change the shape of your feature's distribution, since under the hood it only shifts and divides by constants.
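As a rough sanity check on those descriptions, the three scalings can also be written out by hand with numpy. This is just a sketch on a made-up array, not a replacement for the scikit-learn preprocessing calls used above.

#Hand-rolled versions of the three scalings on a made-up array
import numpy as np
x = np.array([1.0, 5.0, 10.0, 50.0, 100.0])
#Min max scaling: squeeze values into the range 0 to 1
minmax = (x - x.min()) / (x.max() - x.min())
#Standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()
#L2 normalization: divide by the Euclidean (L2) norm of the column
l2_normalized = x / np.linalg.norm(x)
#All three only shift and divide by constants, so the shape is preserved
print(minmax)
print(standardized)
print(l2_normalized)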

Conclusion

Awesome, we covered a brief overview of the machine learning pipeline and where feature engineering fits in. Then we went over binning, log transformation, and various forms of feature scaling. Along the way, we also viewed how feature engineering affects linear regression model predictions with book reviews.

Personally, I find min max scaling to work well when most of the other features in the dataset are probability-like values that already fall between 0 and 1. On a similar note, standardization and L2 normalization work well for scaling very large numbers down to a range comparable to the other features in the dataset being analyzed.

Disclaimer: All things stated in this article are of my own opinion and not of any employer.

[1] Kaggle, Goodreads-books (2019), https://www.kaggle.com/jealousleopard/goodreadsbooks
[2] Microsoft, What are machine learning pipelines? (2019), https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines

