How do we use Maximum Likelihood Estimation for Generative Models (Gaussian Discriminant Analysis, QDA, LDA)?
In Gaussian Discriminant Analysis, we took the generative-model approach: rather than learning only a decision boundary, we want to model the probability distribution of each underlying class. In practice, however, we do not know these distributions exactly. Therefore, we need an estimation tool to determine the prior probability and the class-conditional distribution of the features for each class.
Equivalently,
- We model the class-conditional probability distribution for each class $k$ (often assuming that it is Gaussian).
- We estimate the parameters of these distributions, such as the mean vectors, covariance matrices, and the class priors, from the training data (commonly using methods like Maximum Likelihood Estimation).
MLE is a tool for estimating the parameters of a statistical distribution. In GDA, we need to estimate the normal distribution (continuous) of each class and the prior probability (discrete) of each class.
Coin Flipping Exercise (for prior probability)
Flip a biased coin with head probability $p$ and tail probability $1-p$. Suppose that I flip the coin 10 times and get 8 heads and 2 tails.
Question: what is the value of the bias $p$ that is most likely to lead to this outcome?
Recall that the number of heads follows a binomial distribution:
$$X \sim \mathrm{Binomial}(n, p), \qquad P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$$
Thus, the probability of getting 8 heads in 10 flips is
$$P(X = 8) = \binom{10}{8} p^8 (1-p)^2$$
Let’s define this expression as the “Likelihood”. We will see why we call it a likelihood instead of a probability when we get to the continuous distribution case.
Therefore, the LIKELIHOOD FUNCTION is:
$$\mathcal{L}(p) = \binom{10}{8} p^8 (1-p)^2$$
Optimization problem: find $\hat{p} = \arg\max_p \mathcal{L}(p)$.
We can solve this by finding the critical point of $\mathcal{L}(p)$:
$$\frac{d\mathcal{L}}{dp} = \binom{10}{8}\, p^7 (1-p)\,\bigl(8(1-p) - 2p\bigr) = 0 \;\Longrightarrow\; p = \frac{8}{10}$$
Intuitively, this calculation yields what we expected, which is $\hat{p} = 8/10 = 0.8$.
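As a quick numerical check (not part of the original derivation), the sketch below evaluates the binomial likelihood on a grid of candidate biases and confirms that the maximizer is 0.8; the grid resolution and variable names are arbitrary choices.

```python
import numpy as np
from math import comb

# Observed data: 8 heads out of 10 flips (as in the exercise above).
n_flips, n_heads = 10, 8

# Binomial likelihood L(p) = C(10, 8) * p^8 * (1 - p)^2.
def likelihood(p):
    return comb(n_flips, n_heads) * p**n_heads * (1 - p)**(n_flips - n_heads)

# Evaluate on a fine grid of candidate biases and pick the maximizer.
grid = np.linspace(0.001, 0.999, 999)
p_hat = grid[np.argmax([likelihood(p) for p in grid])]
print(p_hat)  # ~0.8, matching the closed-form MLE n_heads / n_flips
```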
Estimated Prior Probability
Important
Given that we observe an event “A” happening $x$ times out of $n$ trials and we want to estimate the prior probability of event A happening, the estimated prior probability is:
$$\hat{P}(A) = \frac{x}{n}$$
Another definition: suppose our training set has $n$ points, with $x$ of them in class $C$. Then our estimated prior for class $C$ is $\hat{\pi}_C = \frac{x}{n}$.
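For illustration, here is a minimal sketch (with made-up labels, not data from the notes) showing that the estimated priors are just class counts divided by the total number of training points:

```python
import numpy as np

# Hypothetical training labels; in practice these come from your dataset.
y = np.array([0, 0, 1, 2, 1, 0, 2, 0, 1, 0])

# Estimated prior for each class C: pi_hat_C = (# points in class C) / n.
classes, counts = np.unique(y, return_counts=True)
priors = counts / len(y)
print(dict(zip(classes.tolist(), priors.tolist())))  # {0: 0.5, 1: 0.3, 2: 0.2}
```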
Likelihood of a Gaussian
We have training data (sample points) $X_1, X_2, \ldots, X_n$. We would like to find the best-fit Gaussian, meaning the best $\mu$ and $\sigma$ for the given training data.
Difference between Probability and Likelihood
In a continuous distribution, the probability of getting any particular point is zero. For the likelihood, however, we use the value of the probability density at each sample point, which is generally nonzero.
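As a small illustration of this distinction (a made-up example, assuming a standard normal), the density at a particular point is a positive number even though the probability of drawing exactly that point is zero:

```python
import numpy as np

# Standard normal density evaluated at x = 0.5.
x, mu, sigma = 0.5, 0.0, 1.0
density = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
print(density)  # ~0.352: positive, even though P(X == 0.5) = 0 for continuous X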
The Likelihood of a Gaussian is defined as the product of the density values at the sample points:
$$\mathcal{L}(\mu, \sigma; X_1, \ldots, X_n) = N(X_1)\, N(X_2) \cdots N(X_n)$$
To simplify the computation, we take the log of this and call it the Log Likelihood:
$$\ell(\mu, \sigma; X_1, \ldots, X_n) = \ln \mathcal{L} = \sum_{i=1}^{n} \ln N(X_i)$$
Recall that the PDF of the (isotropic) Multivariate Gaussian Distribution is:
$$N(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \exp\!\left(-\frac{\lVert x - \mu \rVert^2}{2\sigma^2}\right)$$
Each $X_i$ contributes a Normal density, thus,
$$\ell(\mu, \sigma) = \sum_{i=1}^{n} \left( -\frac{\lVert X_i - \mu \rVert^2}{2\sigma^2} - d \ln \sigma - \frac{d}{2} \ln(2\pi) \right)$$
Similar to the discrete case, we can take the derivative of the log likelihood to find the critical point (the maximum).
Estimation of $\mu$
Setting $\nabla_{\mu}\,\ell = 0$ and solving gives
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Note that this expression is the same as the Sample Mean.
Estimation of Variance
Setting $\partial \ell / \partial \sigma = 0$ and solving gives
$$\hat{\sigma}^2 = \frac{1}{dn} \sum_{i=1}^{n} \lVert X_i - \hat{\mu} \rVert^2$$
Here $\hat{\mu}$ is used since we do not know the exact mean of the distribution; our best substitute for the true $\mu$ is the estimate $\hat{\mu}$.
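To make these estimates concrete, here is a short Python sketch (synthetic data, isotropic model as assumed above; the variable names are illustrative) that evaluates the closed-form MLE expressions for $\hat{\mu}$ and $\hat{\sigma}^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: n points in d dimensions drawn from an
# isotropic Gaussian with true mean (1, 2) and true sigma = 1.5.
n, d = 1000, 2
X = rng.normal(loc=[1.0, 2.0], scale=1.5, size=(n, d))

# MLE of the mean: the sample mean.
mu_hat = X.mean(axis=0)

# MLE of the (isotropic) variance: average squared distance to mu_hat,
# divided by the dimension d.
sigma2_hat = np.sum((X - mu_hat) ** 2) / (d * n)

print(mu_hat)      # close to [1, 2]
print(sigma2_hat)  # close to 1.5**2 = 2.25
```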
Takeaway
Use the Sample Mean and Sample Variance* of the points in class $C$ to estimate the mean and variance of class $C$’s Gaussian.

*Almost the Sample Variance, except we’re using the estimated mean $\hat{\mu}$.
Conclusion:
- QDA: Estimate the conditional mean $\hat{\mu}_C$ and conditional variance $\hat{\sigma}_C^2$ of each class $C$ separately, and estimate the prior probabilities $\hat{\pi}_C$.
- LDA: Estimate the conditional means $\hat{\mu}_C$ and prior probabilities $\hat{\pi}_C$ in the same way, but use one variance for all classes.
- We define “one variance for all classes” as:
$$\hat{\sigma}^2 = \frac{1}{dn} \sum_{C} \sum_{i:\, y_i = C} \lVert X_i - \hat{\mu}_C \rVert^2$$
Shewchuk
Notice that although LDA is computing one variance for all the data, each sample point contributes with respect to its own class’s mean. This gives a very different result than if you simply use the global mean! It’s usually smaller than the global variance. We say “within-class” because we use each point’s distance from its class’s mean, but “pooled” because we then pool all the classes together.
- Basically, the mean and prior probability calculations remain the same; only the variance calculation differs, in that we use each point’s class mean instead of the overall mean when computing the “one” pooled variance.
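To tie the conclusion together, here is a minimal Python sketch of the QDA/LDA parameter estimation described above, assuming the isotropic (single-variance) Gaussian model used in these notes; the function name `fit_gda` and the synthetic data are illustrative, not part of the original.

```python
import numpy as np

def fit_gda(X, y, shared_variance=False):
    """MLE parameter estimates for isotropic GDA.

    shared_variance=False -> QDA-style (one variance per class)
    shared_variance=True  -> LDA-style (one pooled within-class variance)
    """
    n, d = X.shape
    classes = np.unique(y)
    params = {}
    pooled_sq_dist = 0.0

    for c in classes:
        Xc = X[y == c]
        n_c = len(Xc)
        mu_c = Xc.mean(axis=0)                 # class mean mu_hat_C
        sq_dist = np.sum((Xc - mu_c) ** 2)     # distances to the *class* mean
        pooled_sq_dist += sq_dist
        params[c] = {
            "prior": n_c / n,                  # pi_hat_C = n_C / n
            "mean": mu_c,
            "var": sq_dist / (d * n_c),        # sigma_hat_C^2 (QDA)
        }

    if shared_variance:
        # LDA: pooled within-class variance, shared by every class.
        pooled_var = pooled_sq_dist / (d * n)
        for c in classes:
            params[c]["var"] = pooled_var

    return params


# Tiny synthetic example (hypothetical data, just to exercise the code).
rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], 1.0, size=(60, 2))
X1 = rng.normal([3, 3], 2.0, size=(40, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 60 + [1] * 40)

print(fit_gda(X, y, shared_variance=False))  # QDA: separate variances
print(fit_gda(X, y, shared_variance=True))   # LDA: one pooled variance
```

Note that the pooled variance measures each point’s distance to its own class mean, so it is typically smaller than a variance computed about the global mean, consistent with the Shewchuk quote above.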