Maximum Likelihood Estimation Method – Meaning, Intuition, Bernoulli Distribution use case

by Nicolae Caralicea

In this post I would like to talk about the MLE method, providing a simple use case and some basic code that might help you understand the topic.

MLE is a very important topic in statistics and is heavily used in machine learning and data mining. It sometimes seems to be overlooked, but it is worth our attention.

Maximum Likelihood (MLE) Definition

First, let’s see what Wikipedia says:

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters.


Suppose we are given a sample of n observations. We also know that this sample is derived from a Bernoulli distribution. Derived means that the values of the sample were observed or, as you will see later, randomly generated from a certain distribution (Bernoulli in our case).

Here are the only things we have:

  1. a sample of n observations: x1, x2, …, xn
  2. the above sample is derived from a Bernoulli distribution

Wikipedia’s definition of a Bernoulli distribution is the following:

Bernoulli distribution is the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 − p, i.e., the probability distribution of any single experiment that asks a yes–no question; the question results in a boolean-valued outcome, a single bit of information whose value is success/yes/true/one with probability p and failure/no/false/zero with probability q. It can be used to represent a coin toss where 1 and 0 would represent “head” and “tail” (or vice versa), respectively. In particular, unfair coins would have p ≠ 1/2.


Our goal is to estimate the value of p for which the Bernoulli distribution is most likely to have generated our sample.

Keep in mind that the only things at our disposal are the sample itself and the knowledge that it is derived from a Bernoulli distribution.
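
To make "most likely to generate our sample" concrete, here is a minimal sketch that evaluates the likelihood of a tiny, made-up sample for a few candidate values of p. The sample, the list of candidates, and the helper name `bernoulli_likelihood` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def bernoulli_likelihood(p, sample):
    """Likelihood of an independent 0/1 sample under Bernoulli(p)."""
    sample = np.asarray(sample)
    # Each 1 contributes a factor p, each 0 contributes a factor (1 - p).
    return float(np.prod(np.where(sample == 1, p, 1.0 - p)))

# Hypothetical sample: 7 ones and 3 zeros.
toy_sample = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# Arbitrary candidate values of p to compare.
candidates = [0.3, 0.5, 0.7, 0.9]
likelihoods = {p: bernoulli_likelihood(p, toy_sample) for p in candidates}
for p, value in likelihoods.items():
    print('p = {0}: likelihood = {1:.6f}'.format(p, value))

# Pick the candidate with the largest likelihood.
best_p = max(likelihoods, key=likelihoods.get)
print('Most likely candidate:', best_p)
```

Among these candidates, 0.7 gives the largest likelihood, which matches the fraction of ones in the toy sample.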

Solution & Intuition

Applying the MLE method to a Bernoulli distribution gives an estimate of p equal to the mean of the sample, that is, p hat = (x1 + x2 + … + xn) / n.
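
For readers who want the key steps in one place, here is a short sketch of the standard derivation, assuming the observations are independent and identically distributed. The symbols L and ℓ for the likelihood and the log-likelihood are notation introduced here.

```latex
\begin{align*}
L(p) &= \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} \\
\ell(p) = \log L(p) &= \left(\sum_{i=1}^{n} x_i\right) \log p
  + \left(n - \sum_{i=1}^{n} x_i\right) \log(1-p) \\
\frac{d\ell}{dp} &= \frac{\sum_{i=1}^{n} x_i}{p}
  - \frac{n - \sum_{i=1}^{n} x_i}{1-p} = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i
\end{align*}
```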


So, p hat is the estimated value that maximizes the likelihood of our sample.

As the code below shows, the estimate p hat is very close to the actual probability p of the random variable from which our sample was derived.

To understand how this result is obtained, please look at the following link: Maximum Likelihood (first section, for a Bernoulli distribution)

The above link should give you all the gory details.

Hopefully, the following code will give you some intuition on the practical aspects of the MLE with regard to a Bernoulli distribution.

import numpy as np
from scipy.stats import bernoulli

# Setting
p = 0.7
sample_size = 10000

# Generate a random sample from a Bernoulli distribution
sample = bernoulli.rvs(p, size=sample_size)

# Goal: estimate p using the MLE result for a Bernoulli distribution (the sample mean)
p_estimated = np.mean(sample)

# Results
print('Probability of the random variable: ', p)
print('Estimated probability: ', p_estimated)
print('The two values should be close enough {0} ~ {1}'.format(p_estimated, p))

Here is the output:

Probability of the random variable: 0.7
Estimated probability: 0.697
The two values should be close enough 0.697 ~ 0.7
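
How close is "close enough"? As a rough sketch, the experiment can be repeated for several sample sizes and compared with the standard error of a sample proportion, sqrt(p(1 - p) / n). The seed and the list of sizes below are arbitrary choices for illustration, so the numbers will differ from the output above.

```python
import numpy as np
from scipy.stats import bernoulli

p = 0.7
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

errors = {}
for n in [100, 1000, 10000, 100000]:
    # Draw a fresh sample of size n and estimate p by the sample mean.
    sample = bernoulli.rvs(p, size=n, random_state=rng)
    p_estimated = np.mean(sample)
    # Standard error of a sample proportion for the true p.
    std_error = np.sqrt(p * (1 - p) / n)
    errors[n] = (abs(p_estimated - p), std_error)
    print('n = {0:>6}: estimate = {1:.4f}, |error| = {2:.4f}, std. error = {3:.4f}'.format(
        n, p_estimated, errors[n][0], std_error))
```

The standard error shrinks with the square root of n, so the estimate is expected to sit closer to p as the sample grows.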

I hope this post helped a little in understanding the role of MLE.

Remember that for any sample derived from a known distribution, we can use a similar approach to estimate that distribution's parameters by maximum likelihood.
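
When no closed-form answer is at hand, one common approach is to maximize the log-likelihood numerically. The sketch below does this for the Bernoulli case using `scipy.optimize.minimize_scalar`, so the result can be checked against the sample mean. The function name `negative_log_likelihood`, the seed, and the bounds are choices made here for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import bernoulli

# Arbitrary seed so the sketch is repeatable.
sample = bernoulli.rvs(0.7, size=10000, random_state=42)

def negative_log_likelihood(p, data):
    # Sum the log-probability of every observation, then negate it
    # because the optimizer minimizes rather than maximizes.
    return -np.sum(bernoulli.logpmf(data, p))

# Keep p strictly inside (0, 1) to avoid log(0).
result = minimize_scalar(
    negative_log_likelihood,
    bounds=(1e-6, 1 - 1e-6),
    args=(sample,),
    method='bounded',
)

p_numeric = result.x
p_closed_form = np.mean(sample)
print('Numeric MLE:     ', p_numeric)
print('Closed-form MLE: ', p_closed_form)
```

For another distribution, replacing `bernoulli.logpmf` with that distribution's log-density (and adjusting the bounds) would follow the same pattern.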

If you want to experiment with this, you can find the Python notebook on GitHub at this location: MLE-Bernoulli-intuition.ipynb
