Bayesian Inference — Intuition and Example

with Python Code

Why did someone have to invent the Bayesian Inference?

In one sentence: to update the probability as we gather more data.

The core of Bayesian Inference is to combine two different distributions (likelihood and prior) into one “smarter” distribution (posterior). Posterior is “smarter” in the sense that the classic maximum likelihood estimation (MLE) doesn’t take into account a prior. Once we calculate the posterior, we use it to find the “best” parameters and the “best” is in terms of maximizing the posterior probability, given the data. This process is called Maximum A Posteriori (MAP). The optimization used in MAP is the same as the one used in typical machine learning, such as gradient descent or Newton’s method, etc.

When I look at the famous Bayes rule, understanding the equation analytically is pretty easy. But how would you implement it with the data?

One of the most famous equations in the world of Statistics

Specifically, we will have a large number of data points X. How do we multiply the probability wrt X with the probability wrt θ? There will be so many possible combinations of θ and X.

Apparently, the art of Bayesian Inference lies in how you implement it.

Example:

I have about 2,000 readers per day visiting my Medium blog. Some people clap after reading articles and some don’t. I’d like to make predictions about what percentage of people will engage and clap when I post a new piece in the future. The clapping data that I gathered is binary. 1 indicates a clap and 0 means no clap.

This kind of problem is widely applicable. Try to apply this to your own modeling work — the Click-Through Rate of an advertisement, the conversion rate of customers actually purchasing on your website, the percentage of people who would agree to go on a date with you, the survival chance for women with breast cancer, etc.

Generating Data:

Let’s generate the data X.
(In real life, you don’t have any control over X. This is what you are going to observe.)

import numpy as np
np.set_printoptions(threshold=100)
# Generating 2,000 readers' reponse. 
# Assuming the claps follow a Bernoulli process - a sequence of binary (success/failure) random variables.
# 1 means clap. 0 means no clap.
# We pick the success rate of 30%.
clap_prob = 0.3
# IID (independent and identically distributed) assumption
clap_data = np.random.binomial(n=1, p=clap_prob, size=2000)
In [1]: clap_data
Out[1]: array([0, 0, 0, ..., 0, 1, 0])
In [2]: len(clap_data)
Out[2]: 2000

Bayesian Inference has three steps.

Step 1. [Prior] Choose a PDF to model your parameter θ, aka the prior distribution P(θ). This is your best guess about parameters before seeing the data X.

Step 2. [Likelihood] Choose a PDF for P(X|θ). Basically you are modeling how the data X will look like given the parameter θ.

Step 3. [Posterior] Calculate the posterior distribution P(θ|X) and pick the θ that has the highest P(θ|X).

And the posterior becomes the new prior. Repeat step 3 as you get more data.

Step 1. Prior P(θ)

The first step is to choose the PDF to model the parameter θ.

What does the parameter θ represent?

The clapping 👏 probability.

Then, what kind of probability distributions should we use to model a probability?

To represent a probability, there are a few conditions to meet. First, the domain should be ranged from 0 to 1. Second, it should be a continuous distribution.

Then there are two well-known probability distributions that I can think of:

Beta and Dirichlet.

Dirichlet is for multivariate and Beta is for univariate. We have only one thing to predict, which is a probability, so let’s use the Beta distribution (link).

(An interesting side note: It’s easy to create an arbitrary distribution over (0,1). Just take any function that doesn’t blow up anywhere between 0 and 1 and stays positive. Then, simply integrate it from 0 to 1 and divide the function with that result.)

To use a Beta distribution, there are two parameters, α & β, that we need to decide. You can think of α as How many people clap (the number of successes) and β as how many people don’t clap (the number of failures). These parameters — how big or small α & β are — will determine the shape of the distribution.

Let’s say you have data from yesterday and observed 400 people clapped out of total 2,000 visitors.

How would you write this in terms of the beta distribution?

import scipy.stats as stats
import matplotlib.pyplot as plt
a = 400
b = 2000 - a
# domain θ
theta_range = np.linspace(0, 1, 1000)
# prior P(θ)
prior = stats.beta.pdf(x = theta_range, a=a, b=b)

Let’s plot the prior distribution with respect to all θ values.

# Plotting the prior distribution
plt.rcParams['figure.figsize'] = [20, 7]
fig, ax = plt.subplots()
plt.plot(theta_range, prior, linewidth=3, color='palegreen')
# Add a title
plt.title('[Prior] PDF of "Probability of Claps"', fontsize=20)
# Add X and y Label
plt.xlabel('θ', fontsize=16)
plt.ylabel('Density', fontsize=16)
# Add a grid
plt.grid(alpha=.4, linestyle='--')
# Show the plot
plt.show()
Not familiar with the term “Density” on Y-axis? → Read “PDF is NOT a probability”

It spikes at 20% (400 claps / 2000 readers) as expected. Two thousand data points seem to produce a strong prior. If we use fewer datapoints, say, 100 readers, the curve will be much less spiky. Try it with α = 20 & β = 80.

For those who are wondering how probability density can be greater than 1. 👉Probability Density is NOT a probability.

Step 2. Likelihood P(X|θ)

Choose a probability model for P(X|θ), the probability of seeing the data X given a particular parameter θ. Likelihood is also called a sampling distribution. To me, the term “sampling distribution” is much more intuitive than “likelihood”.

To choose which probability distribution to use to model the sampling distribution, we need to first ask:

What does our data X look like?

X is a binary array [0,1,0,1,...,0,0,0,1].

We also have the total number of visitors (n) and the number of claps, therefore the probability of clap (p) from X.

Ok, n & p What do they scream to you?

Binomial distribution with n & p to predict the # of claps.

# The sampling dist P(X|θ) with a given clap_prob(θ)
likelihood = stats.binom.pmf(k = np.sum(clap_data), n = len(clap_data), p = clap_prob)

Let’s see the graph of P(X|θ) for all possible X’s and a specific θ(=0.3).

# Domain (# of claps)
X = np.arange(0, len(clap_data)+1)
# Likelihood P(X|θ) for all X's
likelihood = stats.binom.pmf(k = X, n = len(clap_data), p = clap_prob)
# Create the plot
fig, ax = plt.subplots()
plt.plot(X, likelihood, linewidth=3, color='yellowgreen')
# Add a title
plt.title('[Likelihood] Probability of Claps' , fontsize=20)
# Add X and y Label
plt.xlabel(’X’, fontsize=16)
plt.ylabel(’Probability’, fontsize=16)
# Add a grid
plt.grid(alpha=.4, linestyle='--')
# Show the plot
plt.show()
Likelihood P(X|θ) given a single θ = 30%

This is the probability of seeing X successes in n trials, assuming a specific parameter θ = 30%.

Step 3. Posterior. P(θ|X)

Finally, let’s answer the question we asked in the beginning:

Specifically, we will have a large number of data points X. How do we multiply the probability wrt X with the probability wrt θ? There will be so many possible combinations of θ and X.
The whole point of Bayesian Inference is to find θ that fits the data the best.

Your initial guess about parameters was P(θ). Now you are upgrading a simple P(θ) into something more informative — P(θ|X) — as more data become available.
P(θ|X) is still the probability of θ, just like P(θ) is. However, P(θ|X) is a smarter version of P(θ).

How do we multiply multiple probabilities wrt X with multiple probabilities wrt θ?

Even though there are thousands of data points, we can convert them into a single scalar — the probability P(X|θ)  by plugging data into the model that you chose (in this example, the binomial distribution.)

Then, calculate P(θ) & P(X|θ) for a specific θ and multiply them together. If you do this for every possible θ, you can pick the highest P(θ) * P(X|θ) among different θ’s.

Basically, we can multiply P(θ) with P(X|θ) by conditioning on θ.

The code is worth a thousand words:

# (cont.)
theta_range_e = theta_range + 0.001 
prior = stats.beta.cdf(x = theta_range_e, a=a, b=b) - stats.beta.cdf(x = theta_range, a=a, b=b) 
# prior = stats.beta.pdf(x = theta_range, a=a, b=b)
likelihood = stats.binom.pmf(k = np.sum(clap_data), n = len(clap_data), p = theta_range) 
posterior = likelihood * prior 
normalized_posterior = posterior / np.sum(posterior)
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(20,7))
plt.xlabel('θ', fontsize=24)
axes[0].plot(theta_range, prior, label="Prior", linewidth=3, color='palegreen')
axes[0].set_title("Prior", fontsize=16)
axes[1].plot(theta_range, likelihood, label="Likelihood", linewidth=3, color='yellowgreen')
axes[1].set_title("Sampling (Likelihood)", fontsize=16)
axes[2].plot(theta_range, posterior, label='Posterior', linewidth=3, color='olivedrab')
axes[2].set_title("Posterior", fontsize=16)
plt.show()

When you look at the posterior graph (the 3rd one), notice it is where the likelihood shifted toward the prior. The clapping probability for the prior was 20%. The clapping probability for the data was given as 30%. Now, the posterior has its peak around 0.25%.

Also, notice the width of the bell curves in prior/likelihood has shrunk in the posterior. Because we incorporated more information through sampling, the range of possible parameters is now narrower.

The more data you gather, the graph of the posterior will look more like that of the likelihood and less like that of the prior. In other words, as you get more data, the original prior distribution matters less.

Finally, we pick θ that gives the highest posterior computed by numerical optimization, such as the Gradient Descent or newton method. This whole iterative procedure is called Maximum A Posteriori estimation (MAP).

Footnote: We calculated the prior by subtracting two stats.beta.cdf instead of using stats.beta.pdf because the likelihood stats.binom.pmf is a probability while stats.beta.pdf returns a density. Even if we use the density to calculate the posterior, it won’t change the optimization result. However, if you want the units to match, converting a density into a probability is necessary. (Still not clear? Read “PDF is NOT a probability, while PMF IS a probability”.)

A few things to note:

  1. MAP can be calculated not only by the numerical optimization but also by the closed-form (for certain distributions only), the modified expectation maximization algorithm, or Monte Carlo method.
  2. Computing a posterior using a closed-form is super convenient because you don’t have to do expensive computation. (Look at the integration for every possible θ in the denominator of Bayes formula. Very expensive.)
    In our example above, the beta distribution is a conjugate prior of the binomial likelihood. This means, during the modeling phase, we already know that the posterior will also be a beta distribution. Then after carrying out more experiments, we can compute the posterior simply by adding the number of successes and failures to the existing parameters α and β respectively.
    When can we use this easy-breezy closed-form for the posterior?
    When your prior has a closed-form (of course) AND is a conjugate prior — when the prior is multiplied by the likelihood and divided by the normalizing constant, it is still the same distribution.
  3. Is the normalization (the integration for every possible θ in the denominator) necessary for finding the maximum posterior?

Nope. You can still find the maximum without normalizing. However, if you want to compare posteriors from different models, or calculate the point estimates, you need to normalize it.

4. MAP estimation can be seen as a regularized ML by incorporating more information through the prior.

5. Are you familiar with maximum likelihood estimation (MLE), but not quite with maximum a posteriori (MAP)?
MLE is just a special case of MAP with a uniform prior.

6. I wrote Prior is your best guess about parameters *before* seeing the data”, however, in practice, once we calculate the posterior, the posterior becomes the new prior until the new batch of data comes in. This way, we can iteratively update our prior and posterior. Markov Chain Monte Carlo sampling methods are based on this idea.