Data Science from the Ground Up
Building an intuition for the most important statistical result for data science.
Underlying every poll you see on the news and every estimate of an effect size in a scientific study are two statistical results: the Law of Large Numbers and the Central Limit Theorem. One of these, the Law of Large Numbers, usually strikes people as fairly intuitive, but the other is a little harder to wrap your head around. It’s a subtle result and its practical uses are non-obvious, yet it might be the single most important piece of statistics for applied data science. It underpins how statisticians and data scientists quantify margins of error, how they test whether an effect is significant and why normal distributions (those classic, bell-curve-like distributions) are so common in real-world variables. If you work with applied statistics in basically any capacity and want to understand why your tools work the way they do, you first need to understand the Central Limit Theorem.
Refresher on the Law of Large Numbers
The Law of Large Numbers states something that for many people may seem obvious: that the average value of numerous repeated trials should be about, well, the average expected value of any one of those trials. Any one trial might deviate from the average, but as you take more and more trials into consideration, their combined average will almost invariably move closer and closer to the expected average. A common example is flipping a coin. If you flip a coin enough times, you should get about half heads and half tails. If you were to only flip it a couple of times, it’s very possible to get a string of just heads or tails, but as you keep going the probability of having a ratio of heads to tails that’s substantially different from fifty-fifty drops away.
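If you want to see this convergence for yourself, a few lines of simulation will do it. This is just a sketch (the function name and fixed seed are my own choices, there for reproducibility):

```python
import random

def fraction_of_heads(num_flips, seed=0):
    """Flip a fair coin num_flips times and return the fraction of heads."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(num_flips)) / num_flips

# As the number of flips grows, the fraction of heads settles toward 0.5.
for n in (10, 1_000, 100_000):
    print(n, fraction_of_heads(n))
```

With only 10 flips the result can stray quite far from one half; by 100,000 flips it is pinned close to 0.5.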
The Law of Large Numbers is also important for surveys and polls. For instance, let’s say that you wanted to figure out the average height of people in an age before you could easily look that up on the internet. You decide that performing a survey might be a sensible way to go about this, and set out to ask random people how tall they are. The result of one random survey is itself a random variable — you might pick someone taller than average or someone shorter than average, and if you choose them randomly you won’t know which ahead of time. You can’t survey just one person — what are the odds you’ll randomly get someone who is exactly the average height? — but if you survey numerous people, your sample average will likely be close to the true average in the overall population, and if you survey more and more people, your sample average will move ever closer to the true average. A sample of 10 is better than a sample of 1, and a sample of 100 is better than a sample of 10.
This is the Law of Large Numbers at work — the bigger your sample, the closer to the true average you expect to be. You may realize that there are some caveats here, that it isn’t quite as simple as ‘more surveys equals better results’. In particular, this only works if your trials or surveys are random and belong to the same distribution. You’ll run into trouble if, say, you select your survey subjects randomly from among people in the locker room following an NBA game. The heights of NBA stars are not distributed in the same way as the heights of the general population at large, so don’t be surprised if your sample average doesn’t reflect the general population. Slightly more subtly, you’ll also have trouble if your choices of who to survey are not independent of each other. Say, for instance, you call up a random person and ask them how tall they are. Great, one more height to add to your survey, but, while you’ve got them on the line, you figure you can save some time calling up strangers and also ask them about the heights of the other people in their family. Now your sample is no longer random. The statistical jargon for a sample or set of random variables that satisfies these requirements is that they are ‘independent and identically distributed’ — or i.i.d.
(There are actually a couple more requirements for the LLN to hold — that the distribution has a well-defined expected value, for instance — but these issues rarely come up in real-world use cases.)
The next step: the Central Limit Theorem
Armed with the LLN, you know that all you need to get a good estimate is a sufficiently large sample. But how large a sample is ‘sufficiently large’? To answer that question we’ll need the Central Limit Theorem. The Central Limit Theorem holds that a sample statistic like the sample average is itself a random variable that becomes approximately normally distributed as the size of the sample increases, regardless of the distribution of the population from which the sample is drawn.* There’s a lot there to unpack, so let’s consider what each part of the theorem means.
A sample statistic like the sample average: An important thing to understand up front is that the CLT doesn’t tell us anything about the distribution of some feature in the overall population, or even the distribution of that feature in our sample. Instead, the CLT tells us something interesting about the distribution of the sample average. Remember that a single measurement in our survey was a random variable. The sample average is also a random variable, since it is entirely derived from the measurements, which were themselves random variables. Because of the CLT, we know something important about the shape of the distribution of the sample average — namely, that as the sample size grows, it becomes normally distributed.
Normal distribution: the normal distribution is a class of distributions that people tend to be familiar with as the ‘bell curve’: thin tails surrounding a thicker middle, symmetrically centered around an average value. There’s a formula for the normal distribution, but for our purposes we don’t need to dwell too much on it. What’s important to understand is that the normal distribution is, essentially, just a type of shape with well understood properties, in the same way that ‘circles’ are a type of shape. Just like a circle can be larger or smaller, normal distributions come in different sizes — some are taller and narrower and some flatter and wider — but all have certain things in common. In order to perfectly describe a circle you only need two pieces of information, its radius and where its center is located. Similarly in order to perfectly describe a normal distribution you only need two pieces of information — its average value (analogous to its ‘center’) and its standard deviation.
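For the curious, that formula is compact enough to write down in a few lines of Python. This is just a sketch of the density function, included to show that the mean and standard deviation really are the only two inputs:

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x, given its mean and standard deviation."""
    return math.exp(-((x - mean) / sd) ** 2 / 2) / (sd * math.sqrt(2 * math.pi))

# The mean and standard deviation fully determine the curve;
# the peak always sits at the mean.
print(round(normal_pdf(0, 0, 1), 3))  # 0.399
```

Change the mean and the whole curve slides left or right; change the standard deviation and it gets taller and narrower or flatter and wider, just like resizing a circle.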
To get a sense for what it means that the sample average will be normally distributed as the sample size grows, let’s follow a simple example. Instead of a sample of heights, we’ll consider an even simpler random variable, the value of a dice roll. You’ll notice that the distribution of probabilities of a dice roll (assuming it’s a fair die) is decidedly not shaped like a bell curve. Each number is expected to come up with equal likelihood:
There is an ‘average’ expected value to this distribution, but that average is actually 3.5, a number we’ll never see on any single roll. It doesn’t matter how many times you roll a die, the probabilities for the next roll stay the same. If we roll a single die a large number of times and simply record what it landed on, we should get a graph that looks very similar to these basic probabilities, plus or minus some differences due to the random chance of our rolls:
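You can reproduce this flat distribution with a short simulation. A sketch (the seed and the number of rolls are my own choices, fixed so the counts are reproducible):

```python
import random
from collections import Counter

rng = random.Random(42)
counts = Counter(rng.randint(1, 6) for _ in range(60_000))

# Each face comes up roughly 10,000 times, even though the expected
# value of a roll, 3.5, never appears on any single die.
for face in range(1, 7):
    print(face, counts[face])
```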
If instead of rolling one die at a time, we roll a handful at once and record the average, something interesting starts to happen. Let’s try rolling 5 dice at once:
When we roll the dice five at a time, we start to see a peak in the middle. This is in a way the beginning of the Law of Large Numbers — when we roll one die we get values far away from the overall expected value like 1 and 6 as often as not, but when we roll multiple dice and consider the average of their values, we tend to get numbers closer to the expected value of 3.5. Additionally, the distribution is beginning to get the bell shape. Trying again with even more rolls continues the trend:
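Here’s a sketch of that five-dice experiment in code (the sample count and the half-point cutoff around 3.5 are my own illustrative choices):

```python
import random
import statistics

rng = random.Random(0)

def mean_roll(n_dice):
    """Average of n_dice fair six-sided dice rolled together."""
    return statistics.mean(rng.randint(1, 6) for _ in range(n_dice))

samples = [mean_roll(5) for _ in range(20_000)]

# With 5 dice per roll, more than half of the averages land within
# half a point of the expected value of 3.5.
near_center = sum(3.0 <= s <= 4.0 for s in samples) / len(samples)
print(round(near_center, 2))
```

For a single die, only a third of rolls (the 3s and 4s) land that close to 3.5; averaging just five dice already pulls most results into that window.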
Regardless of the distribution of the population from which the sample is drawn: You may be thinking that maybe this isn’t all that impressive a result, that something as simple as rolling dice might reasonably tend towards this sort of outcome. What’s incredible is that this result holds regardless of the underlying distribution of the random variable you’re considering. Say, for instance, you have a loaded die which doesn’t land on each number with equal probability. Consider these probabilities:
This die favors some numbers, like 4, and actually never lands on 5. If we roll it a number of times, we get the sort of results you would expect:
But, again, if we roll 5 of these dice at a time and average the results, we start to see the beginnings of our bell-curve forming:
And if we roll 25 at a time, we can no longer really tell that we’re generating these sample averages from a skewed underlying distribution:
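A sketch of the loaded-die experiment in code. The exact probabilities below are my own stand-ins matching the description (favoring 4, never landing on 5); any skewed choice shows the same effect:

```python
import random
import statistics

rng = random.Random(1)
faces = [1, 2, 3, 4, 6]                   # this die never lands on 5
weights = [0.15, 0.15, 0.15, 0.40, 0.15]  # and it favors 4

def loaded_mean(n_dice):
    """Average of n_dice rolls of the loaded die."""
    return statistics.mean(rng.choices(faces, weights=weights, k=n_dice))

expected = sum(f * w for f, w in zip(faces, weights))  # 3.4 for these weights
samples = [loaded_mean(25) for _ in range(10_000)]

# Averages of 25 loaded dice still pile up symmetrically around the
# die's expected value, despite the skewed underlying distribution.
print(round(expected, 2), round(statistics.mean(samples), 2))
```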
It doesn’t matter what sort of distribution the underlying random variable has, the distribution of the sample average will become closer and closer to a normal distribution as the sample size grows.
Remember that any normal distribution can be described with just two pieces of information — the average value and the standard deviation. What will these two values be for our distribution of sample averages? Well, the average value will be the true population average. The standard deviation, however, shrinks as the size of our sample goes up; this is essentially just the LLN again: as the sample grows, the average should move closer to the true population average. As it turns out, the standard deviation of this normal curve is the standard deviation of the population divided by the square root of the sample size.
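In code, that relationship (often called the ‘standard error’ of the mean) is a one-liner. Notice the square root at work: quadrupling the sample size only halves the spread of the sample average.

```python
import math

def standard_error(pop_sd, sample_size):
    """Standard deviation of the sample average: the population
    standard deviation over the square root of the sample size."""
    return pop_sd / math.sqrt(sample_size)

print(standard_error(2.0, 25))   # 0.4
print(standard_error(2.0, 100))  # 0.2
```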
Why is this so important?
Why is it so useful to know that the sample average is normally distributed, particularly when you only ever see one sample average? Well, unlike the distribution of whatever you’re studying, whose shape you may not even know, the normal distribution has a few well known and reliable features. With a reasonably sized sample, the Law of Large Numbers suggests we should find a sample average that’s near to the true population average, but how near? Knowing that the sample average is normally distributed helps. Recall that the normal distribution is a shape with certain predictable properties, one of which is that it is densest around its average value. Most of the area of the normal distribution is centered around the mean, and the tails drop away pretty quickly. A number picked at random from a normal distribution is much more likely to come from the center of the curve than from out in the tails.
What’s more, while some normal distributions are taller and narrower than others, we can quantify the amount of area under the curve with the same ratio of standard deviations. To illustrate what I mean by this, consider the following normal curve:
The blue vertical line represents the mean value of the distribution. The red vertical lines are one standard deviation above and below the mean. The red shaded area between them — that is, everything under the curve between -1 and 1 standard deviations from the mean — represents just about 68% of the area under the curve. Similarly, the yellow vertical lines represent two standard deviations above and below the mean, and the area between them represents a bit more than 95% of the distribution’s total area. If you pick a value from a normal distribution at random, 95% of the time you’ll get a value from between those two lines. This is true regardless of what values the normal distribution’s mean and standard deviation take on.
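You can check the 68% and 95% figures empirically by drawing from a normal distribution yourself. A quick sketch (seed fixed for reproducibility):

```python
import random

rng = random.Random(7)
draws = [rng.gauss(0, 1) for _ in range(100_000)]

# Fraction of draws within 1 and 2 standard deviations of the mean:
within_1 = sum(abs(x) <= 1 for x in draws) / len(draws)
within_2 = sum(abs(x) <= 2 for x in draws) / len(draws)
print(round(within_1, 2), round(within_2, 2))  # roughly 0.68 and 0.95
```

Rerun it with any mean and standard deviation you like; as long as you measure distances in standard deviations, the same fractions come back.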
As it turns out, even though we only see one sample and one sample average, these facts about the distribution of the sample average and normal distributions in general give us a lot to work with. The sample average is an unbiased estimator of the true population average, and the sample’s standard deviation is an estimator of the population standard deviation (there’s a minor tweak to the sample standard deviation to correct for some bias, but it doesn’t change the logic here, and most statistical packages will have applied it for you anyway). Recall that the standard deviation for the distribution of sample averages is derived from the population’s standard deviation (by dividing by the square root of the sample size). We have all the information we need to describe the distribution of the sample average from this one sample!
Furthermore, we’re pretty sure that this particular sample average falls somewhere in the center of the distribution. We don’t know if it’s a bit above the true population average or below the true population average, but we’re 95% sure that it’s somewhere in between two standard deviations above and below the mean. If we take our sample average and add two standard deviations to create a top end and subtract two standard deviations to get a bottom, we have a range. In 95% of ranges created this way, the true population mean will fall somewhere in the range. This is called the ‘confidence interval’, because it quantifies a 95% level of confidence. This is also how margins of error on polls are created.
It may help to see this in action, so let’s consider another simple example. Again, we’ll roll dice (or, rather, I’ll have the computer simulate rolling dice). We’ll roll 20-sided dice, so the ‘population’ of dice rolls is every number from one to twenty. The expected, ‘average’ value for this population is 10.5. Each sample will be 20 rolls taken together. From this sample we’ll derive an estimate for the population average (the sample average) and a sample standard deviation. From these we’ll construct our 95% confidence intervals in the manner discussed above. Here are the results of 25 simulated trials:
The red dashes are the sample averages and the blue bars surrounding them are the confidence intervals. For instance, in the first trial there was a sample average of close to 12 and a confidence interval that ranged from 9 to around 14. For easy reference, I’ve put the vertical purple line at 10.5 to represent the true population expected value. You’ll notice that the sample averages bounce around due to random chance: sometimes your sample average will be above the true average, sometimes below. But when you construct the confidence interval, the interval almost always contains the true population average. Out of the 25 simulated trials here, only two samples were so far from the average that the true population average did not appear somewhere in the confidence interval. Of course, with something like rolling dice we sort of know ahead of time what the true expected value should be, but when dealing with some other random variable in the wild, building a confidence interval like this is critically important.
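Here’s a sketch of the simulation behind these trials: a 20-sided die, 20 rolls per sample, and a ±2 standard error interval. Across many repetitions, roughly 95% of the intervals should capture the true mean of 10.5 (the seed and trial count are my own choices):

```python
import random
import statistics

rng = random.Random(3)

def confidence_interval(n_rolls=20):
    """Roll a 20-sided die n_rolls times; return an approximate
    95% confidence interval for the mean as (low, high)."""
    sample = [rng.randint(1, 20) for _ in range(n_rolls)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / n_rolls ** 0.5  # stdev applies the bias tweak
    return mean - 2 * se, mean + 2 * se

# Count how often the true mean (10.5) falls inside the interval.
trials = [confidence_interval() for _ in range(2_000)]
coverage = sum(low <= 10.5 <= high for low, high in trials) / len(trials)
print(round(coverage, 2))
```

With samples this small the coverage comes in a touch under 95%, since we’re estimating the standard deviation from the sample itself, but the idea is the same.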
The CLT can also be used to answer the question of how big a sample you need. Recall that as the sample size grows, the standard deviation of the sample average falls (it’s the population standard deviation divided by the square root of the sample size). If you ran the same experiment up above again, but this time had each trial include 100 dice rolls instead of only 20, you might see results like this:
This looks a lot like the first time we ran this experiment, until you look at the x-axis and notice that the scale is smaller! Our sample averages are generally closer to the true population average and our confidence intervals are tighter. Using the CLT and these insights about the shape of the normal distribution, we can answer the question of how big the sample needs to be in order to hit some pre-specified level of accuracy.
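Turning that relationship around gives a quick sample-size calculation: solve for the smallest sample whose margin of error (about two standard errors) meets your target. A sketch, with the d20’s standard deviation of roughly 5.77 as the example input:

```python
import math

def required_sample_size(pop_sd, target_margin):
    """Smallest sample size whose ~95% margin of error
    (2 standard errors) is at most target_margin."""
    return math.ceil((2 * pop_sd / target_margin) ** 2)

# To pin a d20's average down to within +/- 0.5, you'd need roughly:
print(required_sample_size(5.77, 0.5))  # 533
```

Note the cost of precision: halving the target margin of error quadruples the required sample size.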
*Formal statements of the CLT typically refer to the normalized sum of independent random variables rather than their simple average, but the conclusion holds for the sample average, and for purposes of this introduction, it’s more straightforward to think about the sample average.