Monte Carlo estimation is an essential part of a machine learning engineers’ toolbox. Here we intuitively delve into it to get to know its use-cases, performance and things to keep in mind when using it in general.
Like any other good algorithm introduction, we start with a story about the problem setting that we are trying to solve. The setting is an estimation of integrals.
Suppose that I give you a function f, that is really complicated. I tell you that you can evaluate f, but I am interested in the integral of this function. Unfortunately for you, you cannot analytically compute the integral of this function — maybe it’s just impossible, or maybe your math skills are not at a high enough level. Moreover, the function f might not be explicitly given, this might be some kind of simulation that you need to run to compute f at a certain value. One of the options that you can do is Monte Carlo estimation of the integral.
We are dealing with the problem of estimating integrals, because the function f might be too complicated to allow for analytic computation of the solution.
Indeed, as you might suspect because of the name, the origins of this algorithm stem from gambling games, as a way to calculate expected wins, or expected amount of money that you can win in a gambling game based on samples. Obviously, gambling repeatedly to get more samples in order to improve your estimate is not very smart, but it’s a valid approach to computing your expected wins (losses).
In probability, expectations are defined as integrals for the continuous case.
So, let’s get to the bottom of it. In Monte Carlo sampling we sample from the domain of the function f and we just take the average of the samples to estimate the expected value of the function. Why is this a valid approach? TLDR; It’s all going to boil down to looking at the first two moments of the estimator, i.e. the mean and the variance. Let’s take a look at the math. Very roughly formally we can define the Monte-Carlo estimator of the expectation as the following:
Note the red bolded part above. I want you to keep in mind that all of the samples are drawn from the distribution, i.e. according to the density p(x) with respect to which the target expectation is defined. This is going to allow us to manipulate the expectations in the derivation of the expected value of the estimator. So, without further delay, let's jump into deriving the expected value. Take a look at the following derivation and mediate a bit:
Indeed, our estimated mean of the random variable f(x), which we denote with μ-hat is has the true \mu as the expected value. Why did we arrive at this conclusion? Well because the samples from the distribution p(x) are independent and identically distributed — i.i.d. for short. The rest follows from the linearity of expectation. This also means that our estimator of the true mean is unbiased, which is a fancy way of saying that in expectation, asymptotically it converges to the true mean.
Unbiasedness means that the estimator converges to the true mean
Now you might ask if this kind of estimator is “good”, arguably you would say sure, it’s good because it converges to the true mean and it is unbiased. One more thing that you might want to consider though is the speed of the convergence with the number of samples. For this, we want to take a look at its variance, i.e. expected squared distance to its mean (second (central) moment). We arrive at the following derivation:
In the end we arrive to the variance of the function f divided by the number of samples for the estimator, S. What does this mean? This means that the squared error of the estimator drops with the factor O(1/S). I we want to take a look at the absolute error, taking the square root of the term we would arrive to O(1/sqrt(S)).
Is this convergence rate good? Let’s take a look at an example, estimating the number π. Suppose that we don’t know the first N digits of the number Pi and we want to estimate them. First, we might notice the following, that the are of the unit circle is defined as π. If we take a look at the quarter of the area of the unit circle, we might be able to estimate it by sampling some points from the unit distribution.
Take a look at the following figure, we are looking at the square containing the area of a quarter of the unit circle. What we can notice is that the probability of a point falling into the circle when sampling them uniformly (x,y between 0 and 1) is π / 4, i.e. area of that part of the circle divided by the area of the square. We can estimate this by sampling points and labeling them depending on if they fall within the circle, essentially we are inferring the value of a Bernoulli distribution parameter θ which should converge to π / 4 (its expectation-probability of falling within the circle).
I’ve coded up a simple implementation of this method in JAX, because JAX is fun and I recommend looking at it, but equivalent code could also be written in Numpy. As you can see, it is conceptually really simple, it involves us drawing samples from a uniform distribution, labeling the samples that fall within the circle (line 5) and then calculating and returning the estimate of π (line 6).
The following figure illustrates how the error of the estimator looks like with increasing the number of samples. As expected, the error develops roughly according to the inverse square root of S law O(1/sqrt(S)) that we have derived previously. It is important to keep in mind that we have derived the expected error of the estimator, meaning that estimating it point-wise might fluctuate outside of the expectation, which is also shown by the figure.
Moreover, if we plot the previous figure in log-scale looking at the absolute error we notice that the error drops by an order of magnitude when the sample size increases by an order of magnitude.
Now we are coming back to the question of if this estimator is any good? 1/sqrt(S) convergence rate is not that good for precise calculations, and I will tell you why. Consider the example with the number π. In order to roughly on average be precise (I say roughly because this is a stochastic algorithm which is sample based) at the 2nd digit of the number π, we need 4 orders of magnitude number of samples (10k samples). Similarly, if we want to be roughly precise to the 3rd digit, we would need a million samples. That is quite a lot of samples to estimate only the first 3 digits of π! Luckily there are better methods out there that allow for faster computation.
- Monte Carlo Algorithm is an algorithm that allows us to calculate sample-based estimations of expected values, i.e. to calculate estimations of integrals.
- The error of the estimate in the case of Monte Carlo drops with a rate of O(1/sqrt(S))
- This convergence rate is not that good.
- But the algorithm is simple to implement and it can be used as a first try at estimating a certain quantity before moving on to better approaches.
- Don’t go gambling with it!
Something to Think About
- In this algorithm, we might have just irresponsibly ignored the fact that we assume that we can sample from the distribution p(x), what happens if we can’t sample efficiently from this distribution?
- What if sampling from p(x) is expensive?