Feature Selection — Machine Learning

In this article, we will discuss the importance of the feature selection process, why it is required, and the different types of feature selection.

So, let's get started…

What is the Feature Selection Process?

  • It is the process of selecting the features that have the most impact on the output variable.
  • In other words, we keep only those features (independent variables) that are strongly related to the output variable.
  • It is one of the most important steps in building a machine learning model.

Why is Feature Selection Important?

Consider a dataset containing thousands of features. When we train a model on many irrelevant or redundant features, its accuracy tends to go down. This problem is known as the curse of dimensionality.

So, to solve this problem, we use only those features that have the most impact on the dependent variable (the output variable). There are many techniques for measuring the impact of the independent variables on the dependent variable.
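As a minimal sketch of this idea (with synthetic data and made-up feature indices, not a definitive recipe), we might rank features by the absolute value of their correlation with the target and keep the top few:

```python
import numpy as np

# Synthetic data: 200 samples, 5 candidate features.
# Only features 0 and 2 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Absolute Pearson correlation of each feature with the target
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Keep the k features most related to the output variable
k = 2
top_features = np.argsort(scores)[::-1][:k]
print("scores:", scores.round(2))
print("selected feature indices:", top_features)
```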

What are the Different Types of Feature Selection?

The figure below shows the different types of feature selection.

Fig 1.1: Feature Selection Techniques

We will discuss filter methods first.

  1. Pearson’s correlation (linear).
  2. Spearman’s rank correlation (monotonic).
  3. ANOVA F-test (linear).
  4. Kendall’s rank coefficient (monotonic, rank-based).
  5. Chi-Squared test (contingency tables).
  6. Mutual Information (any dependency; see the sketch after this list).
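To see why mutual information is listed alongside the correlation statistics, here is a small sketch (synthetic data, my own illustration, not from the original figures): a quadratic relationship has near-zero Pearson correlation, but mutual information still detects the dependency.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y = x ** 2 + rng.normal(scale=0.05, size=500)  # non-linear, non-monotonic

pearson = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {pearson:.3f}")  # close to 0
print(f"Mutual information:  {mi:.3f}")       # clearly positive
```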

The figure below gives an idea of when to use which method:

Fig 1.2: How to Choose Filter Methods

The mapping in the figure can be summarized as follows:

  • Input Variable Numerical, Output Variable Numerical: This is a regression problem.
  1. Pearson’s correlation for a linear relationship.
  2. Spearman’s correlation for a monotonic relationship.
  • Input Variable Numerical, Output Variable Categorical: This is a classification problem.
  1. ANOVA F-test for a linear relationship.
  2. Kendall’s rank coefficient for a monotonic relationship.
  • Input Variable Categorical, Output Variable Numerical: This is a regression problem, although we face this type of problem very rarely.
  1. ANOVA F-test for a linear relationship.
  2. Kendall’s rank coefficient for a monotonic relationship.
  • Input Variable Categorical, Output Variable Categorical: This is a classification problem (a worked example of this case follows the list).
  1. Chi-Squared test.
  2. Mutual Information.
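As a quick illustration of the last case (categorical input, categorical output), a Chi-Squared test on a contingency table might look like the sketch below. The counts are invented purely for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: levels of a (hypothetical) categorical feature.
# Columns: output classes. Counts are made up for illustration.
table = np.array([[30, 10],
                  [15, 25],
                  [ 5, 40]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
# A small p-value suggests the feature and the output variable are related,
# so the feature is worth keeping.
```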

Sklearn implements several of these methods in sklearn.feature_selection: f_classif and f_regression (ANOVA F-tests), chi2 (Chi-Squared), and mutual_info_classif / mutual_info_regression (Mutual Information). They are typically combined with a selector such as SelectKBest.
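For example, SelectKBest with the ANOVA F-test (f_classif) keeps the k highest-scoring features. This is a standard scikit-learn pattern, shown here on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Numerical inputs, categorical output -> classification, so use f_classif.
X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("original shape:", X.shape)           # (150, 4)
print("reduced shape: ", X_selected.shape)  # (150, 2)
print("F-scores:", selector.scores_.round(1))
```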

SciPy implements the correlation-based statistics in scipy.stats: pearsonr (Pearson’s correlation), spearmanr (Spearman’s rank correlation), and kendalltau (Kendall’s rank coefficient).
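A small sketch comparing the three on synthetic data: for a monotonic but non-linear relationship, the rank-based statistics report a stronger association than Pearson’s linear correlation.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Monotonic but non-linear relationship with some noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=300)
y = np.exp(x) + rng.normal(scale=5.0, size=300)

print(f"Pearson:  {pearsonr(x, y)[0]:.3f}")   # penalized by the non-linearity
print(f"Spearman: {spearmanr(x, y)[0]:.3f}")  # close to 1
print(f"Kendall:  {kendalltau(x, y)[0]:.3f}") # also high
```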

Summary:

In this blog, we discussed the importance of the feature selection process, why it is required, and the different types of feature selection.

We mainly focused on filter-based feature selection methods and their types.

In the next articles, I will show you how to implement these methods in practice and cover the other types of feature selection.

