An explanation for why the bagging fraction is 63.2%
If you have read about Bootstrap and Out of Bag (OOB) samples in Random Forest (RF), you have almost certainly come across the claim that the fraction of training observations that end up in the ‘bag’ when RF is built with bootstrap is around 63.2%.
This post is a crisp explanation for the origins of the number 63.2%.
The post is organized as:
- Recap of RF terminologies
- Example of Bootstrap
- Generalizing the example
Recap of RF terminologies
RF is an ensemble learning technique based on Bagging.
Bagging = Bootstrap + Aggregation
Bootstrap means that instead of training on all the observations, each tree of RF is trained on a subset of the observations. The chosen subset is called the bag, and the remaining are called Out of Bag samples.
Multiple trees are trained on different bags, and later the results from all the trees are aggregated. The aggregation step helps reduce Variance.
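The variance-reduction effect of aggregation can be illustrated with a toy sketch (not actual trees, just noisy estimators standing in for them): averaging many roughly independent predictions shrinks the variance of the combined prediction.

```python
import random
import statistics

rng = random.Random(0)
true_value = 5.0

# Predictions from a single noisy "tree" (noise std = 1)
single = [true_value + rng.gauss(0, 1) for _ in range(2000)]

# Predictions averaged over 25 independent noisy "trees"
averaged = [
    statistics.fmean(true_value + rng.gauss(0, 1) for _ in range(25))
    for _ in range(2000)
]

# Averaging 25 independent estimators cuts the variance by roughly 25x
print(statistics.pvariance(single))
print(statistics.pvariance(averaged))
```

In a real forest the trees are correlated (they see overlapping data), so the reduction is less than the ideal 1/B factor, which is exactly why RF works to decorrelate trees via bootstrapping and column sampling.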
Example of Bootstrap
Now, coming back to choosing bootstrap samples.
If we have 1000 training observations, then each bag is built by drawing 1000 observations from the training set.
Each draw is made with replacement.
Meaning, when the 1st observation is drawn, there are 1000 options to choose from.
For the next draw, again, there are the same 1000 options to choose from.
This process is repeated 1000 times to fill Bag1.
So it is quite possible that some observations appear in the bag more than once.
All the observations that do not make it to the Bag1 are called the OOB for Bag1.
Multiple such Bags are created for training multiple trees.
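The draw-with-replacement procedure above can be sketched in a few lines of Python. The function name `make_bag` is illustrative, not from any library:

```python
import random

def make_bag(observations, seed=0):
    """Draw a bootstrap bag: len(observations) draws with replacement."""
    rng = random.Random(seed)
    bag = [rng.choice(observations) for _ in observations]
    in_bag = set(bag)
    oob = [x for x in observations if x not in in_bag]  # Out of Bag samples
    return bag, oob

observations = list(range(1000))
bag, oob = make_bag(observations)
# The bag has 1000 entries, but with repetitions; the distinct
# observations in it should be roughly 63.2% of the training set.
print(f"unique in bag: {len(set(bag))}, OOB: {len(oob)}")
```

Running this with different seeds, the number of distinct in-bag observations hovers around 632 out of 1000, which is the fraction derived in the next section.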
Generalizing the example
The number of observations in training set = n
Size of Bag1 = n
Probability of an observation being selected in a draw = 1/n
Probability of an observation not being selected in a draw = 1-(1/n)
Number of draws = Size of Bag = n
Therefore, the probability of an observation not being selected in all the draws for Bag1 = [1-(1/n)]^n
As n increases, this value tends to 1/e ≈ 36.8%
Probability of an observation making it into the bag after n draws
= 1 – 36.8% = 63.2%
Let us look at a simulation with multiple values of n.
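A minimal version of such a simulation, comparing the closed-form probability [1-(1/n)]^n against an empirical estimate (the helper `in_bag_fraction` is a name chosen for this sketch):

```python
import math
import random

def in_bag_fraction(n, trials=200, seed=42):
    """Empirical fraction of distinct observations landing in a bag of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        bag = {rng.randrange(n) for _ in range(n)}  # n draws with replacement
        total += len(bag) / n
    return total / trials

for n in (10, 100, 1000, 10000):
    theory = 1 - (1 - 1 / n) ** n
    print(f"n={n:>6}  theory={theory:.4f}  simulated={in_bag_fraction(n):.4f}")

# Limiting value as n grows
print(f"limit: 1 - 1/e = {1 - 1 / math.e:.4f}")
```

Both columns converge quickly toward 1 - 1/e ≈ 0.6321; even for n = 100 the in-bag fraction is already within a fraction of a percent of 63.2%.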
I hope this post helps you understand the reason behind the fraction 63.2%. While the number itself is rarely discussed, the process of bootstrapping is, as it forms the core of Random Forests.
By training multiple trees on such bootstrapped samples, combined with techniques like column sampling, each tree sees a different region of the observation space, and thus multiple uncorrelated trees are created.
Additionally, the existence of an OOB sample helps in calculating Feature Importance, and the OOB error gives a good estimate of the Validation error.