Preparing your data for machine learning

In 2016, CrowdFlower published its yearly Data Science Report: an attempt to shed some light on what it meant to be a data scientist, a term that had only recently appeared in common vocabulary. One of the key insights emerging from that report was that the average data scientist spends about 80% of their time collecting, cleaning, wrangling or otherwise preparing data for analysis.

2016 Data Science Report, CrowdFlower

A long list of headlines followed, the general idea being that this is the most mundane part of the job, and basically a waste of time and productivity.

The 80/20 rule is pretty accurate, but the idea that this reduces productivity, or takes away energy and resources that would be better spent training models, is misguided.

Articles drawing attention to the 80/20 rule.

Pre-processing involves many steps. Some of them may seem trivial, like resolving inconsistencies, while others require a deeper knowledge of statistical theory, like the imputation of missing data, but they all have one thing in common: they’re all decision points, and the answers can’t be found in any textbook or manual, because they depend heavily on the problem you’re trying to solve and on the specific dataset you’re working with.

For this reason, pre-processing isn’t a trivial step of the process; it adds a lot of value to the project. As you work your way through inconsistencies, outliers and missing data, you gain a much better understanding both of the problem at hand and of the data available to solve it, which in turn makes building a model much easier in the end.

This article describes some of the problems addressed during pre-processing, and the decisions that have to be made at each step.

Machine learning models are only as good as the data they’re trained on. This can be summarized by the concept of garbage in, garbage out: if your training set is noisy, has errors, or incorporates biases, it doesn’t matter how fancy your math is or how much time you spend optimizing your parameters; the results of your model will inevitably be inaccurate.

What is data quality?

Data quality, more than the complexity of the algorithm, can make the difference between a good and a bad model. But how is quality data defined? A rigorous definition doesn’t exist, but here’s one I like from the Wikipedia page on data quality:

The state of completeness, conformity, consistency, timeliness, duplication, integrity, and accuracy that makes data appropriate for a specific use.

In more generic terms, quality data is data that is “fit for use”, and correctly represents the reality to which it refers.

Let’s look at an example and see what kind of problems we may encounter preparing data for analysis, and the steps we can take to address them.

The mental health in tech survey

The mental health in tech survey, run yearly by OSMI (Open Sourcing Mental Illness), measures attitudes towards mental health in the tech workplace, and examines the frequency of mental health disorders among tech workers.

In the 2014 version, 1259 respondents answered 25 questions. Several of these questions allowed free-text answers, so the data presents plenty of inconsistencies and errors that need to be addressed before using it in any analysis or feeding it to a model.

What follows is a selection of examples of inconsistencies, outliers and missing data. A notebook with the complete analysis of this dataset is available on my GitHub.

Spotting inconsistencies

Step one is to check your data for inconsistencies. This is anything from irrelevant data to syntax errors. Here’s a non-exhaustive list with some examples.

  • Irrelevant data, e.g. rows and columns that shouldn’t be included.
  • Duplicated data.
  • Type conversion errors, e.g. a numerical column that’s been interpreted as a string.
  • Syntax errors, e.g. misspellings of category names.
  • Non-standard missing values, e.g. “N/A”, “null”, “Not Applicable”.

Running some simple descriptive statistics generally helps spot these and other inconsistencies. This step is sometimes called profiling.
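As a concrete starting point, here is a minimal profiling sketch in pandas. It assumes the survey has been downloaded as a CSV file (the filename below is a placeholder) and loaded into the DataFrame df used throughout this article.

import numpy as np
import pandas as pd

# Load the survey; the filename is just a placeholder.
df = pd.read_csv('survey.csv')

# Column types and non-null counts: a quick check for type conversion errors.
df.info()

# Count fully duplicated rows.
print(df.duplicated().sum())

# Normalize common non-standard missing values to proper NaN.
df = df.replace(['N/A', 'null', 'Not Applicable'], np.nan)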

The first oddity we notice in the mental health in tech survey is the age distribution.

df['Age'].describe()
# count           1259.000
# mean        79428148.311
# std       2818299442.982
# min            -1726.000
# 25%               27.000
# 50%               31.000
# 75%               36.000
# max      99999999999.000
# Name: Age, dtype: float64

Initial histogram of the Age distribution

We’re going to deal with outliers later, and for now, we’ll impose a simple business rule — age should be between 14 and 100 — and we’ll consider all values outside of that interval as missing.
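In pandas, applying that rule is a one-liner (a sketch, using the Age column shown above):

import numpy as np

# Keep ages in the plausible 14-100 range; everything else becomes missing (NaN).
df['Age'] = df['Age'].where(df['Age'].between(14, 100), np.nan)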

The histogram of Age now makes a lot more sense.

Histogram of the Age distribution after adding the rule that the age of a respondent has to be between 14 and 100.

Gender was another open-ended answer in the questionnaire, so this column includes many, many ways of spelling male, female and non-binary. Fortunately, most of them can be corrected using regular expressions and some fuzzy matching.
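Below is a sketch of the regular-expression part of that cleanup. The patterns are illustrative rather than the exact ones used in the notebook, and the fuzzy-matching pass is left out.

import re
import pandas as pd

def clean_gender(value):
    """Map free-text gender answers onto three broad categories."""
    if pd.isna(value):
        return value
    text = str(value).strip().lower()
    if re.fullmatch(r'(cis\s*)?m(ale)?|man', text):
        return 'male'
    if re.fullmatch(r'(cis\s*)?f(emale)?|woman', text):
        return 'female'
    # Everything else is grouped under non-binary in this sketch.
    return 'non-binary'

df['Gender'] = df['Gender'].apply(clean_gender)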

Initial distribution of Gender (counts) in the survey results.
Distribution of Gender (counts) after cleanup.

Moving on, the distribution of Country doesn’t appear to have the same problem (no misspellings), but there is a large group of countries with very few occurrences. No particular action is required here, but one possible option is to limit the analysis to a subset of countries — which ones will depend on the kind of answers we’re looking for. Here, for example, we could restrict the analysis to respondents from the United States.
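If we did want to restrict the analysis, both the check and the filter are short (assuming the country is recorded as 'United States' in the data):

# How thin is the tail of the Country distribution?
print(df['Country'].value_counts())

# Optionally keep only respondents from the United States.
df_us = df[df['Country'] == 'United States']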

Outliers

There is no rigorous definition for outliers either. An outlier is a data point that differs significantly from other observations. This can be caused by actual variability in the measurement, or by an error.

Because there isn’t a single definition of outlier, the way they are defined project by project informs the methodology used to detect them. A common methodology is Tukey’s fences, also known as the box-plot rule. This method is based on the interquartile range (IQR), i.e. the difference between the third quartile Q3 and the first quartile Q1: any observation outside the range [Q1 - k * IQR, Q3 + k * IQR] is flagged as an outlier. Traditionally k=1.5 for regular outliers and k=3 for large outliers.
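A sketch of the box-plot rule applied to Age, assuming the cleaned df from the previous steps:

def tukey_fences(series, k=1.5):
    """Boolean mask flagging observations outside Tukey's fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Regular (k=1.5) and large (k=3) outliers in Age.
print(tukey_fences(df['Age'], k=1.5).sum())
print(tukey_fences(df['Age'], k=3).sum())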

Boxplot of Age. The box represents the distribution of Age between the first and third quartile, respectively represented by the lower and upper side of the box. The solid line between Q1 and Q3 represents the median, while the dotted line represents the average. The red and dark blue dots represent regular (k=1.5) and large (k=3) outliers respectively.

Because some models (e.g. linear regression) are less robust than others to outliers, it’s good to be aware of their presence before choosing a model — while we don’t need to act on these right now, we’ll flag them and move on to the next step.

Note that this method works best for data that is somewhat normally distributed, but would be inappropriate, for example, for data with a strong skew in either direction.

Missing data

Missing data are data points for which the type is known but not the value, e.g. an unanswered question in a survey. Most machine learning models don’t handle missing data, meaning missing values have to be either deleted or replaced. There are three main mechanisms that can lead to missing data, and it’s important to be able to distinguish between them, because the underlying mechanism determines how the missing data should be dealt with.

  • Missing Completely At Random (MCAR): the factors that lead to a particular value being missing are independent of both observable and unobservable variables. For example, “some people just don’t fill in surveys on mental health” implies MCAR — there is no specific reason why this happens.
  • Missing At Random (MAR): there is a non-random but observable mechanism behind the missing data. For example, “men are less likely to fill in surveys on mental health, for no other reason than being men” implies MAR: as long as we can observe the variable gender, we can account for the non-randomness in the missing data and build unbiased estimators.
  • Missing Not At Random (MNAR): missingness is due to some unobservable factor. For example, “men are less likely to fill in surveys on mental health because of their mental health” implies MNAR.

How to treat missing data then? In some cases, for example when a column has a large number of missing values, deletion is a valid approach. However, in most cases, missing values are replaced with some substitute value. This process is called imputation. There are many possible approaches to imputation, some more complex than others.

  • Numerical imputation: replace all missing observations with a single value, usually the mean (a one-line sketch follows this list).
  • Random selection from the distribution.
  • Hot deck imputation: missing values are imputed from similar records in the dataset.
  • KNN imputation: determine the k nearest neighbors and use some aggregate of their values (e.g. the average) to impute the missing one.
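The single-value version in the first bullet is the simplest of the lot; in pandas it is one line (the column name here is just a placeholder):

# Mean imputation: fill every missing value in a column with the column mean.
df['some_column'] = df['some_column'].fillna(df['some_column'].mean())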

There is no single rule to decide which method to use, and the choice depends on a combination of factors such as the number of records with missing data, the type of data we’re dealing with, the specific use case, and the underlying mechanism for missingness. An important rule to keep in mind is that imputation shouldn’t change the distribution of the data, so it’s good practice to compare the distribution before and after imputation to make sure we haven’t injected any bias.

Let’s look at a couple of examples from the mental health in tech survey.

The variable for self-employment has a few missing values. For each of them, we randomly select a value from the set of possible values (yes/no), with probability proportional to the observed distribution of yes/no in the dataset. This ensures that the distribution after imputation stays the same.
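A sketch of that imputation in pandas and NumPy; the column name self_employed is an assumption about how the survey field is stored, and the random seed is arbitrary:

import numpy as np

col = 'self_employed'  # assumed column name for the self-employment question
p_original = df[col].value_counts(normalize=True)  # observed yes/no distribution

# Draw a replacement for each missing value, with probability proportional to the
# observed distribution, so that the overall distribution is preserved.
missing = df[col].isna()
rng = np.random.default_rng(42)
df.loc[missing, col] = rng.choice(p_original.index.to_numpy(), size=missing.sum(), p=p_original.to_numpy())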

The probability of being self employed before (p_original) and after (p_imputed) imputation.

In the first stage of this process, we found that Age had a few values that were out of range and we decided to treat them as missing, so it’s now time to impute them. For this variable, we’ll use a fancier method: KNN imputation. For each respondent with missing age, we find the k most similar neighbors and impute the missing value with the average Age of those neighbors.
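Here is a sketch using scikit-learn’s KNNImputer. It assumes the survey answers used as features have already been encoded as numbers in a DataFrame called df_encoded (a hypothetical name) that includes the Age column, and the number of neighbors is an arbitrary choice.

import pandas as pd
from sklearn.impute import KNNImputer

# df_encoded: hypothetical all-numeric encoding of the survey, including Age.
imputer = KNNImputer(n_neighbors=5)  # 5 neighbors is an arbitrary choice
imputed = pd.DataFrame(
    imputer.fit_transform(df_encoded),
    columns=df_encoded.columns,
    index=df_encoded.index,
)

# Copy the imputed ages back into the working DataFrame.
df['Age'] = imputed['Age']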

Age distribution before and after imputation.

Process and best practices

Pre-processing is NOT a linear process. You may go back, edit and repeat all steps at any given point of the process.

There are typically 4 main steps.

  1. Inspection: detect unexpected, incorrect, and inconsistent data (Data profiling and Visualization).
  2. Cleaning: remove, correct or impute incorrect data.
  3. Verifying: is the data correct now? Re-inspect and test to verify that no assumptions are violated.
  4. Reporting: keep a log of the inconsistencies and errors found and how they were addressed.

These steps are conceptually sequential, but the process is far from linear! At any given point, you may have to go back, edit and repeat previous steps, and this may even happen after you’ve fed the data to a model. For this reason, it’s important that every single step is well documented and reproducible, and that nothing is edited manually, no matter how simple the change — future you will be very grateful.

Summary

This isn’t a complete guide to pre-processing, but it should give a clearer perspective on what the process looks like. There are many decisions to be made and the solution isn’t always straightforward, which creates many possible failure points. Spending time profiling and preparing your data will give you a much better understanding of the problem at hand, while also ensuring that the data fed to predictive models really is representative of the problem being solved. This is especially important because an increasing number of decision processes rely on machine learning models to make fair decisions, so it’s imperative that these models are trained on data that is reliable, consistent and unbiased.