Detecting and Treating Outliers In Python — Part 3
Hands-On Tutorial On Treating Outliers — Winsorizing and Imputation
An Exploratory Data Analysis (EDA) is crucial when working on data science projects. Understanding your underlying data, its nature, and structure can simplify decision making on features, algorithms or hyperparameters. A critical part of the EDA is the detection and treatment of outliers. Outliers are observations that deviate strongly from the other data points in a random sample of a population.
In two previously published articles, I discussed how to detect different types of outliers using well-known statistical methods. One article focuses on univariate and the other on multivariate outliers.
In this final post, I want to discuss how to treat extreme values once they are detected. After a theoretical introduction, I will provide two practical examples written in Python. For this, I will use the Boston housing data set, as in my previous posts.
Treating outliers: A subjective task
Just like not detecting outliers at all, treating them carries the risk of substantially affecting the outcome of an analysis or machine learning model. In practice, it is often not obvious what to do with outlying observations.
The good news is that, from a mathematical point of view, there is no right or wrong answer on how to treat outlying observations. Besides the mathematics, the qualitative information you have about an outlier plays an important role in the decision. For example, knowing how an outlier arose in the first place helps you decide what to do with it.
Therefore, I want to highlight possible sources of outliers before going deeper into available options for outlier treatment.
Sources of outliers
Next to the distinction between univariate and multivariate, extreme values can be differentiated by source. An error outlier is an outlying observation that stems from an inaccurate measurement, a wrong data entry, or data manipulation. Such data points are usually not part of the population of interest.
On the other hand, non-error outliers, also called interesting or random outliers, are part of the population of interest and may hold interesting information.
Handling error outliers
Error outliers should be either removed or corrected. The easiest way is to remove the observations that emerged through inaccurate measurements or data manipulation. If a raw version of the underlying data is available, it can be worth retracing the original entry of a data point and correcting it, to avoid a significant loss of information through deletion.
However, if you do not have a raw data version at hand but you are sure that you are looking at an error outlier (e.g., a human height measurement of 4 meters, roughly 157 inches, or a fourth class in a categorical variable that you know has only three classes), your best option is to simply remove these entries.
Handling non-error outliers
There are three options for treating non-error outliers: keeping, deleting, or recoding them.
Keeping outliers is a good strategy when most of the detected outliers are non-error outliers and rightfully belong to the population of interest. Also, you often cannot easily tell whether an extreme value is part of the population of interest, which is another reason to keep it.
When keeping outliers, be aware that they can distort the results of your actual task, e.g., lead to a rejection of the null hypothesis or an over-optimistic prediction. Therefore, it can be worthwhile to report your findings both with and without outliers to highlight the impact they can have.
Another option is to use robust methods for your actual prediction task or analysis. These methods reduce the influence of extreme values by using more robust statistics (e.g., the median) or non-parametric approaches (e.g., rank tests, bootstrapping, or Support Vector Machines).
For univariate and multivariate outliers:
- Collect qualitative information by including and excluding outliers in your analysis to assess their actual impact
- Use robust methods to reduce the impact of outliers
- Keep outliers if they are likely to belong to the population of interest and beware of the risks they bring when making decisions
- And always and most importantly: Report all findings!
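The points above can be illustrated with a minimal sketch. The sample below is hypothetical (nine ordinary values plus one extreme one), and the cut-off of 10 is only an illustrative way to separate them, not a detection rule. It reports the mean and the median with and without the extreme value, which shows how much less the robust statistic moves.

```python
import numpy as np

# Hypothetical sample: nine ordinary values plus one extreme observation.
values = np.array([2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.4, 40.0])

# Illustrative cut-off only; in practice use a proper detection method.
without_outlier = values[values < 10]

# Report every statistic twice: including and excluding the outlier.
report = {
    "mean_with": values.mean(),
    "mean_without": without_outlier.mean(),
    "median_with": np.median(values),
    "median_without": np.median(without_outlier),
}

for name, stat in report.items():
    print(f"{name}: {stat:.2f}")

# How far each statistic moves when the outlier is dropped.
mean_shift = abs(report["mean_with"] - report["mean_without"])
median_shift = abs(report["median_with"] - report["median_without"])
print(f"mean shift: {mean_shift:.2f}, median shift: {median_shift:.2f}")
```

The median barely changes while the mean moves by several units, which is the reason robust statistics are a reasonable choice when outliers are kept.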
The most straightforward option is to delete any outlying observation. However, this strategy bears a high risk of losing information. Try to avoid it, especially if you find many outlying data points. Also, deleting interesting and influential outliers (points that belong to the population of interest) can distort any output you aim to achieve, e.g., a prediction or test result.
For univariate and multivariate outliers:
- Remove outliers only on a small scale, and mainly those that are likely to come from another population
- If you choose deletion, always provide two reports of your analysis or outcomes: one with and one without outlying observations
Recoding outliers is a good option to treat outliers while keeping as much information as possible. This option should always be accompanied by sound reasoning and explanation. There are several methods to recode an outlier, and in this article, I want to focus on two widely used ones: winsorizing and imputation.
Winsorizing was introduced by Tukey & McLaughlin in 1963 and is often recommended in research papers (e.g., 2013 or 2019) dealing with outlier treatment. With winsorizing, any value of a variable beyond the k-th percentile on either side of the variable's distribution is replaced with the value of that percentile itself. For example, if k=5, all observations above the 95th percentile are recoded to the value of the 95th percentile, and all observations below the 5th percentile are recoded to the value of the 5th percentile. Compared to trimming, winsorizing is a less extreme option because it recodes outliers rather than cutting them altogether.
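To make the difference between winsorizing and trimming concrete, here is a minimal NumPy sketch. The helper names winsorize_percentile and trim_percentile are hypothetical, and the data is a synthetic right-skewed sample rather than the Boston data. Note that clipping at interpolated percentiles can differ slightly from library implementations that work on order statistics.

```python
import numpy as np

def winsorize_percentile(values, k=5):
    """Clip values to the k-th and (100 - k)-th percentiles (hypothetical helper)."""
    lower, upper = np.percentile(values, [k, 100 - k])
    return np.clip(values, lower, upper)

def trim_percentile(values, k=5):
    """Drop values outside the k-th and (100 - k)-th percentiles (hypothetical helper)."""
    lower, upper = np.percentile(values, [k, 100 - k])
    return values[(values >= lower) & (values <= upper)]

# Synthetic, right-skewed sample used purely for illustration.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=200)

winsorized = winsorize_percentile(sample, k=5)
trimmed = trim_percentile(sample, k=5)

# Winsorizing keeps every observation; trimming shrinks the sample.
print("observations:", len(sample), len(winsorized), len(trimmed))
print("max before / after winsorizing:", sample.max(), winsorized.max())
```

Winsorizing leaves the sample size untouched and only pulls the tails in, whereas trimming removes the tail observations entirely.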
Winsorizing also doubles as outlier detection: every data point above or below the chosen threshold is treated, so no independent detection method is needed. However, it goes hand in hand with Tukey's boxplot method, as k is often recommended to be set close to a sample's outer fence (3 times the interquartile range below the first or above the third quartile). This is often around k=5, which is therefore commonly used as the default value.
Let’s look at an example from the previously used Boston housing data set. For the crime rate per capita by town, we found 30 probable outliers (using the Tukey method). First, I will re-use some code from the first tutorial to determine the outer fence.
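The code from the first tutorial is not reproduced here, so the following is only a sketch of how the fences can be computed. The function name tukey_fences is hypothetical, and because the Boston data is not bundled with this article, a synthetic right-skewed sample (crim_like) stands in for the "CRIM" variable. Its numbers will therefore not match the ones quoted in the text.

```python
import numpy as np

def tukey_fences(values):
    """Return ((inner_low, inner_high), (outer_low, outer_high)) per Tukey's rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # beyond this: possible outliers
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # beyond this: probable outliers
    return inner, outer

# Synthetic stand-in for a skewed, non-negative variable such as a crime rate.
rng = np.random.default_rng(42)
crim_like = rng.lognormal(mean=-1.0, sigma=2.0, size=506)

inner, outer = tukey_fences(crim_like)
probable_outliers = crim_like[(crim_like < outer[0]) | (crim_like > outer[1])]

print("inner fence:", inner)
print("outer fence:", outer)
print("probable outliers:", len(probable_outliers))
```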
The upper outer fence for the variable “CRIM” is roughly 14.46, while the lower end is below zero. Because a crime rate below zero is not meaningful, the data should only be winsorized on its right tail. Now, we can look at values at different percentiles to set k.
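One way to compare candidate percentiles against the fence is sketched below. It again uses a synthetic stand-in (crim_like) instead of the real "CRIM" column, and the list of candidate levels is my own assumption, so the printed values will differ from those discussed in the text.

```python
import numpy as np

# Synthetic stand-in for the skewed crime-rate variable.
rng = np.random.default_rng(42)
crim_like = rng.lognormal(mean=-1.0, sigma=2.0, size=506)

# Upper outer fence following Tukey: Q3 + 3 * IQR.
q1, q3 = np.percentile(crim_like, [25, 75])
upper_outer_fence = q3 + 3.0 * (q3 - q1)

# Candidate right-tail percentiles (an illustrative choice of levels).
levels = [90.0, 92.5, 95.0, 97.5, 99.0]
values_at_levels = np.percentile(crim_like, levels)

for level, value in zip(levels, values_at_levels):
    print(f"{level:>5.1f}th percentile: {value:.2f}")

# Pick the level whose value lies closest to the fence.
closest = levels[int(np.argmin(np.abs(values_at_levels - upper_outer_fence)))]
print(f"upper outer fence: {upper_outer_fence:.2f}, closest level: {closest}")
```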
It looks like the values at 92.5% (13.54) and 95% (15.79) are closest to the upper outer fence. As 95% is more common, I will winsorize the data at k=5 using the winsorize function from scipy:
With winsorizing at the 95th percentile, the mean crime rate per capita dropped from 3.61 to 2.80.
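A right-tail-only winsorization with scipy could look like the sketch below. It uses scipy.stats.mstats.winsorize, whose limits argument takes the fractions to recode on the lower and upper end. The data is again a synthetic stand-in (crim_like), so the means will not reproduce the figures reported for the Boston data.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Synthetic stand-in for the skewed crime-rate variable.
rng = np.random.default_rng(42)
crim_like = rng.lognormal(mean=-1.0, sigma=2.0, size=506)

# limits=(lower fraction, upper fraction):
# leave the left tail untouched and recode the top 5% of values.
crim_winsorized = np.asarray(winsorize(crim_like, limits=(0.0, 0.05)))

# Report the main statistics before and after winsorizing.
print(f"mean before: {crim_like.mean():.2f}, after: {crim_winsorized.mean():.2f}")
print(f"std  before: {crim_like.std():.2f}, after: {crim_winsorized.std():.2f}")
print(f"max  before: {crim_like.max():.2f}, after: {crim_winsorized.max():.2f}")
```

Setting the lower limit to zero reflects the reasoning above: a rate cannot fall below zero, so only the right tail is recoded.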
For univariate outliers:
- Winsorize to keep as much data as possible
- To find the right winsorization level, know your data! A percentile close to the outer fence is considered best practice
- A limit of zero on one tail (winsorizing one side only) can be meaningful, e.g., if a variable cannot take values below zero
- Report main statistics (e.g., mean, std) before and after winsorizing
For multivariate outliers:
- Winsorizing is done on the ellipsoid (which holds information from more than one variable)
- There doesn’t seem to be an existing Python package that deals with winsorization on ellipsoids. However, the R package sdcMicro provides a function called mvTopCoding that winsorizes outliers on the ellipsoid defined by the (robust) Mahalanobis distance.
Imputation is a method that is often used when handling missing data. However, it is also applied when dealing with extreme values. When using imputation, outliers are removed (and with that become missing values) and are replaced with estimates based on the remaining data.
There are several imputation techniques. One that is often used, yet comes with a strong bias, is simple mean substitution. Here, all outliers or missing values are substituted by the variable's mean. A better and more robust alternative is multiple imputation.
In multiple imputation, missing values or outliers are replaced by M plausible estimates retrieved from a prediction model. The variable containing the outlier becomes the dependent variable of a prediction model (e.g., regression, random forest, etc.), and its value is estimated from the remaining non-missing, non-outlier values in that observation.
Choosing the right number of plausible estimates M for a missing value or outlier is frequently discussed in the literature, and a common recommendation is:
Using m=5−20 will be enough under moderate missingness […]
Practically, multiple imputation is not as straightforward in Python as it is in R (e.g., mice, missForest, etc.). However, the sklearn library has an iterative imputer that can be used for multiple imputation. It is inspired by the R package mice and is still in an experimental phase.
Sklearn's default setup is fairly basic: it initializes missing values with the column mean and then refines them with a BayesianRidge regressor. It can be adjusted easily by passing other regressors to the imputer, such as linear regression, KNN, or a decision tree. To receive multiple estimates, as mice provides in R, the imputer needs to be run multiple times (e.g., in a for-loop).
Coming back to our example of crime rate per capita, we first need to transform the outliers into missing values. For this, I use the list of probable outliers detected by the Tukey method (see in article 1).
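Turning detected outliers into missing values can be sketched as follows. The DataFrame, its column names (crim_like, other_feature), and the detection step are stand-ins I made up for illustration. In the article's setting, outlier_index would be the list of probable outliers from the first tutorial.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "crim_like": rng.lognormal(mean=-1.0, sigma=2.0, size=506),
    "other_feature": rng.normal(size=506),
})

# Stand-in detection step: values above Tukey's upper outer fence.
q1, q3 = df["crim_like"].quantile([0.25, 0.75])
upper_outer_fence = q3 + 3.0 * (q3 - q1)
outlier_index = df.index[df["crim_like"] > upper_outer_fence]

# Work on a copy so the original values stay available for reporting.
df_missing = df.copy()
df_missing.loc[outlier_index, "crim_like"] = np.nan

print("outliers set to NaN:", df_missing["crim_like"].isna().sum())
```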
As an estimator, I chose regularized linear regression (BayesianRidge), and for simplicity, I will set m=1 (single imputation). As already mentioned and also written in sklearn's user guide, the imputer can be used for multiple imputations
“by applying it repeatedly […] with different random seeds when sample_posterior=True”.
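The imputation step can be sketched with sklearn's IterativeImputer as shown below. The data is synthetic (columns x1, x2, target are hypothetical), and 20 randomly chosen rows stand in for the detected outliers. The first part is a single imputation (m=1). The loop afterwards is one possible way to obtain several estimates by varying the random seed with sample_posterior=True, and averaging the draws is my simplification rather than a full pooling procedure.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic data: 'target' depends on two fully observed features.
rng = np.random.default_rng(42)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
target = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "target": target})

# Pretend 20 rows were flagged as outliers and turned into missing values.
flagged = rng.choice(n, size=20, replace=False)
df.loc[flagged, "target"] = np.nan

# Single imputation (m=1) with a regularized linear regression.
single = IterativeImputer(estimator=BayesianRidge(), random_state=0)
df_single = pd.DataFrame(single.fit_transform(df), columns=df.columns)

# Multiple imputation: m runs with posterior sampling and different seeds.
m = 5
draws = []
for seed in range(m):
    imputer = IterativeImputer(
        estimator=BayesianRidge(), sample_posterior=True, random_state=seed
    )
    draws.append(imputer.fit_transform(df)[flagged, 2])  # column 2 is 'target'
draws = np.vstack(draws)      # shape: (m, number of flagged rows)
pooled = draws.mean(axis=0)   # simple average across the m estimates

print("mean after single imputation:", round(df_single["target"].mean(), 2))
print("spread across draws (first row):", round(draws[:, 0].std(), 3))
```

The spread across the m draws gives a rough sense of how uncertain each imputed value is, which a single imputation cannot show.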
Again, the mean crime rate per capita changed from 3.61 to 2.36.
For univariate outliers:
- Besides treating missing data, imputation is an often-used technique for recoding outliers
- Multiple imputation is more robust, while single (e.g., mean) imputation is biased
- For imputation, R offers more mature and flexible packages than Python
For multivariate outliers:
- Imputation does not really make sense for multivariate outliers, as they are defined as outlying observations across multiple variables (for a multivariate outlier, all entries would turn into missing values, leaving little information for prediction)
The treatment of outlying data points is a highly subjective task, as there is no mathematically right or wrong solution. Qualitative information, such as knowing the source of an outlier or an outlier’s influence, can simplify treatment decisions. Error outliers are best corrected or deleted, while non-error outliers can be kept, deleted, or recoded. Several recoding methods exist for extreme values, with winsorizing and multiple imputation being among the popular ones.
An essential task is to report the outcome with and without outliers and to supply sound reasoning and explanation when treating outliers.
Detecting and Treating Outliers In Python — Part 3 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.