Preface
I'm starting this note without much understanding of the models, but by the end of it I will revisit the different models again, hopefully with a better understanding of the concepts.
3 Ways to Build Classifiers
- Generative Models (Linear Discriminant Analysis)
Core Idea: Try to learn the underlying probability distributions that generate the data for each class separately. Similar to figuring out how “each class” creates its data points.
- Discriminative Models (Logistic Regression)
This model directly learns the decision boundary between classes (or the conditional probability) without explicitly modeling the individual class distributions. “The focus is on finding WHAT separates the classes from each other.”
- Find Decision Boundary (Support Vector Machine)
No explicit calculation of probabilities. Directly finds the optimal decision boundary that separates the classes.
⭕ Come back to recognize the advantages and disadvantages of different models.
Gaussian Discriminant Analysis
Caution: Although the name is Gaussian “Discriminant” Analysis, note that this model is a generative model. The key factor is “how it learns”. The overall technique is:
- Assume Gaussian Distributions for each class.
- Estimate the parameters (mean and covariance) of these distributions. (Maximum Likelihood Estimation)
- Use Bayes’ theorem to get $P(Y = C \mid X = x)$, the probability of the class given the features.
GDA first models the probability distribution of each class separately, learning the probability of observing the features given a class (the hallmark of a generative model). That is, it models $P(X \mid Y)$.
Fundamental Assumption: Each class's data follows a multivariate normal (Gaussian) distribution.
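To make the recipe above concrete, here is a minimal NumPy sketch (my own illustration, not part of the lecture) of isotropic GDA: estimate each class's mean, scalar variance, and prior by maximum likelihood, then apply Bayes’ theorem to get the posterior. The helper names `fit_gda` and `posterior` are made up for this note.

```python
import numpy as np

def fit_gda(X, y):
    """MLE estimates of each class's mean, isotropic variance, and prior."""
    n, d = X.shape
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                                  # class mean
        sigma2 = ((Xc - mu) ** 2).sum() / (Xc.shape[0] * d)   # scalar (isotropic) variance
        prior = Xc.shape[0] / n                               # prior pi_C
        params[c] = (mu, sigma2, prior)
    return params

def posterior(x, params):
    """Bayes' theorem: P(Y = c | X = x) for every class c."""
    d = x.shape[0]
    joint = {}
    for c, (mu, sigma2, prior) in params.items():
        # isotropic Gaussian density f_c(x) times the prior pi_c
        f_c = np.exp(-np.sum((x - mu) ** 2) / (2 * sigma2)) / (2 * np.pi * sigma2) ** (d / 2)
        joint[c] = f_c * prior
    total = sum(joint.values())                               # P(X = x), law of total probability
    return {c: v / total for c, v in joint.items()}
```

Predicting a class is then just taking the argmax of the returned posteriors.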
How did we get here?
How did we get this multivariate Gaussian distribution with a scalar $\sigma$ instead of a covariance matrix $\Sigma$?
For each class C, SUPPOSE we know the mean $\mu_C$ and variance $\sigma_C^2$, which give us the class-conditional PDF $f_C(x) = P(X = x \mid Y = C)$, and suppose we also know the prior probability $\pi_C = P(Y = C)$.
🚨 In this example, we are assuming that the variance is a scalar, which results in circular iso-contours rather than ellipses. This is called an isotropic normal distribution because the variance is the same in all directions. For anisotropic Gaussians, the isosurfaces are ellipsoids, and the Bayes decision boundary can be an ellipse.
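For reference, the class-conditional PDF being assumed here is the isotropic Gaussian (standard form, with the covariance matrix $\Sigma$ replaced by $\sigma_C^2 I$):

$$
f_C(x) = \frac{1}{\left(\sqrt{2\pi}\,\sigma_C\right)^{d}}\,\exp\!\left(-\frac{\lVert x - \mu_C\rVert^{2}}{2\sigma_C^{2}}\right)
$$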
From decision theory, we know that the optimal classifier should pick the particular class C that minimizes the expected loss (risk). Since we’re considering the 0-1 loss function, we can also recall the main principle: pick the class with the highest posterior probability $P(Y = C \mid X = x)$.
For the sake of simplifying the math: maximizing $f_C(x)\,\pi_C$ is the same as maximizing $Q_C(x) = \ln\big((\sqrt{2\pi})^d\, f_C(x)\,\pi_C\big)$. The $(\sqrt{2\pi})^d$ term is there just to cancel out the $(\sqrt{2\pi})^{-d}$ part of the normalization constant in the Gaussian PDF. ^f1b0fd
Therefore,

$$Q_C(x) = -\frac{\lVert x - \mu_C\rVert^2}{2\sigma_C^2} - d\ln\sigma_C + \ln\pi_C$$
🔺 Notice that $Q_C(x)$ is a quadratic function of $x$.
In a 2-class problem, we can also account for an asymmetric loss function by adding an appropriate $\ln(\text{loss})$ term to each class's discriminant. In a multi-class problem, though, it is much harder to account for the loss function.
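To fill in the step behind the $Q_C(x)$ formula, here is the substitution of the isotropic Gaussian PDF into the definition (my own expansion of the definitions above):

$$
Q_C(x) = \ln\!\Big((\sqrt{2\pi})^d\, f_C(x)\,\pi_C\Big)
= \ln\!\left(\frac{(\sqrt{2\pi})^d}{(\sqrt{2\pi}\,\sigma_C)^d}\,
\exp\!\left(-\frac{\lVert x-\mu_C\rVert^2}{2\sigma_C^2}\right)\pi_C\right)
= -\frac{\lVert x-\mu_C\rVert^2}{2\sigma_C^2} - d\ln\sigma_C + \ln\pi_C
$$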
Quadratic Discriminant Analysis (QDA)
Suppose we only have two classes, C and D. Then, by Bayes’ theorem:

$$P(Y = C \mid X = x) = \frac{P(X = x \mid Y = C)\,P(Y = C)}{P(X = x)} = \frac{f_C(x)\,\pi_C}{P(X = x)}$$

The Bayes decision boundary (BDB) is where $x$ satisfies $Q_C(x) - Q_D(x) = 0$.
$Q_C(x) - Q_D(x)$ is a quadratic function, and therefore, in 1 dimension, the BDB may consist of 1 or 2 points. In $d$ dimensions, the BDB is a quadric.
🔴 Q: What about the probability that our prediction is correct? This is basically the same as the posterior probability $P(Y = C \mid X = x)$ of the class we pick.
By the law of total probability, $P(X = x) = f_C(x)\,\pi_C + f_D(x)\,\pi_D$, so:

$$P(Y = C \mid X = x) = \frac{f_C(x)\,\pi_C}{f_C(x)\,\pi_C + f_D(x)\,\pi_D}$$

Substitute equation ^25d885 (i.e., $f_C(x)\,\pi_C = (\sqrt{2\pi})^{-d}\,e^{Q_C(x)}$):

$$P(Y = C \mid X = x) = \frac{e^{Q_C(x)}}{e^{Q_C(x)} + e^{Q_D(x)}} = \frac{1}{1 + e^{Q_D(x) - Q_C(x)}}$$
Definition
Logistic Function / Sigmoid Function (maps a real-valued input to a probability) has the form:

$$s(\gamma) = \frac{1}{1 + e^{-\gamma}}$$

Putting our probability equation into sigmoid (monotonically increasing) form:

$$P(Y = C \mid X = x) = s\big(Q_C(x) - Q_D(x)\big)$$
🔴 Answer: Now we can calculate the probability that our prediction is correct using the sigmoid function.
Recall that the decision function is $Q_C(x) - Q_D(x)$. The output of the sigmoid applied to it can be interpreted as the probability that the data point belongs to class C (i.e., the probability that we are correct when we predict C).
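A tiny sketch of this interpretation (the discriminant values below are hypothetical, e.g. produced by a $Q_C(x)$ helper like the one sketched later for multi-class QDA):

```python
import numpy as np

def sigmoid(gamma):
    """Logistic function s(gamma) = 1 / (1 + exp(-gamma))."""
    return 1.0 / (1.0 + np.exp(-gamma))

# Suppose qC = Q_C(x) and qD = Q_D(x) are the two quadratic discriminant values.
qC, qD = -1.2, -3.0              # hypothetical values for illustration
p_C = sigmoid(qC - qD)           # P(Y = C | X = x) ~ 0.858
print(p_C)
```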
Extending to multi-class Quadratic Discriminant Analysis (QDA) is quite natural: we get “multiple decision boundaries that adjoin each other at joints.” The way we do this is by calculating $Q_C(x)$ for each class and choosing the class with the maximum value (a sketch follows the figure notes below).
Notice the variance and the boundary
- The dots are the means of each class, and the variances are the circular shapes in the graph. The circular shapes are not spread out equally across different classes, which means that they have different variances.
- Also notice that the decision boundaries are not linear, since our $Q_C(x)$ are quadratic functions. Come back for updates! Ed Thread
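A minimal sketch of multi-class QDA under the isotropic assumption used in this note (the `q_score` and `qda_predict` names and the 3-class numbers are made up for illustration):

```python
import numpy as np

def q_score(x, mu, sigma2, prior):
    """Q_C(x) = -||x - mu||^2 / (2 sigma^2) - d*ln(sigma) + ln(pi_C)."""
    d = x.shape[0]
    return (-np.sum((x - mu) ** 2) / (2 * sigma2)
            - 0.5 * d * np.log(sigma2)      # equals d * ln(sigma)
            + np.log(prior))

def qda_predict(x, params):
    """Pick the class whose Q_C(x) is largest."""
    return max(params, key=lambda c: q_score(x, *params[c]))

# Hypothetical 3-class example: params[c] = (mean, isotropic variance, prior)
params = {
    "A": (np.array([0.0, 0.0]), 1.0, 1 / 3),
    "B": (np.array([3.0, 0.0]), 0.5, 1 / 3),
    "C": (np.array([0.0, 3.0]), 2.0, 1 / 3),
}
print(qda_predict(np.array([2.5, 0.2]), params))   # -> "B"
```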
Linear Discriminant Analysis (LDA)
Now that we know QDA, let’s explore what happens IF all the Gaussians have the same variance $\sigma^2$.
QDA allows each class to have its own covariance matrix $\Sigma_k$, where $k$ is a class. LDA is a variant of QDA with linear decision boundaries, where all classes share the same covariance matrix $\Sigma$.
LDA is also less likely to overfit, since we assume the same covariance matrix (variance) for all classes, reducing the number of parameters to estimate.
Recall that we defined $Q_C(x)$ to be the natural log of the (scaled) Gaussian PDF times the prior, $Q_C(x) = \ln\big((\sqrt{2\pi})^d f_C(x)\,\pi_C\big)$. ^25d885
But we made an important assumption in LDA: all the $\sigma_C$ are the same (call the shared value $\sigma$). Therefore, the decision boundary simplifies to:

$$Q_C(x) - Q_D(x) = \frac{(\mu_C - \mu_D)\cdot x}{\sigma^2} - \frac{\lVert\mu_C\rVert^2 - \lVert\mu_D\rVert^2}{2\sigma^2} + \ln\pi_C - \ln\pi_D = 0$$
Now the decision boundary equation is evidently in the familiar linear form $w \cdot x + \alpha = 0$, as the quadratic $\lVert x\rVert^2$ terms in $Q_C(x)$ and $Q_D(x)$ cancel each other out.
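Reading off the coefficients from the boundary above (the names $w$ and $\alpha$ match the linear form, though this particular grouping is my own):

$$
w = \frac{\mu_C - \mu_D}{\sigma^2}, \qquad
\alpha = -\frac{\lVert\mu_C\rVert^2 - \lVert\mu_D\rVert^2}{2\sigma^2} + \ln\frac{\pi_C}{\pi_D}
$$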
Similar to what we did in QDA, we can find the correctness probability of our prediction (a.k.a. the posterior probability, in the case of the 0-1 loss function): $P(Y = C \mid X = x) = s\big(Q_C(x) - Q_D(x)\big)$.
To emphasize the linearity, we can rewrite this as:

$$P(Y = C \mid X = x) = s(w \cdot x + \alpha)$$
If $f_C$ is the right-hand Gaussian (in the 1D picture), the logistic (sigmoid) posterior is the right Gaussian divided by the sum of the two Gaussians. 🔺 Another observation is that the logistic function looks 1D even though the Gaussians are 2D. In higher dimensions, the logistic function essentially varies in only one direction (along $w$) and is unchanging in all other directions.
A side note: In logistic regression, we assume that the posterior probability has this “sigmoid form” and don’t care about the underlying class-conditional PDFs.
Centroid Method → Special Case of GDA: Same Prior Probability and Same Variance for all classes.
Class C and class D have the same variance $\sigma^2$ as well as the same prior probability ($\pi_C = \pi_D$). Then, the Bayes decision boundary is:

$$(\mu_C - \mu_D)\cdot x - \frac{\lVert\mu_C\rVert^2 - \lVert\mu_D\rVert^2}{2} = 0$$
This equation is the same as the “centroid method”.
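A quick sketch of why this matches the centroid method: the boundary above is exactly the set of points equidistant from the two means, so classifying by the nearest mean gives the same rule (the function name and numbers below are my own illustration).

```python
import numpy as np

def centroid_classify(x, mu_C, mu_D):
    """Nearest-centroid rule, equivalent to the boundary
    (mu_C - mu_D) . x - (||mu_C||^2 - ||mu_D||^2) / 2 = 0."""
    return "C" if np.linalg.norm(x - mu_C) < np.linalg.norm(x - mu_D) else "D"

mu_C, mu_D = np.array([0.0, 0.0]), np.array([4.0, 0.0])
print(centroid_classify(np.array([1.0, 2.0]), mu_C, mu_D))   # -> "C"
```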
Multi-Class LDA: choose the class C that maximizes the linear discriminant function

$$\frac{\mu_C \cdot x}{\sigma^2} - \frac{\lVert\mu_C\rVert^2}{2\sigma^2} + \ln\pi_C$$
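A minimal sketch of this rule (the function name and example numbers are mine; `sigma2` is the shared variance):

```python
import numpy as np

def lda_predict(x, means, priors, sigma2):
    """Multi-class LDA: argmax of mu_C.x/sigma^2 - ||mu_C||^2/(2 sigma^2) + ln(pi_C)."""
    def ldf(c):
        mu = means[c]
        return mu @ x / sigma2 - mu @ mu / (2 * sigma2) + np.log(priors[c])
    return max(means, key=ldf)

means = {"A": np.array([0.0, 0.0]), "B": np.array([3.0, 1.0])}
priors = {"A": 0.5, "B": 0.5}
print(lda_predict(np.array([2.0, 1.0]), means, priors, sigma2=1.0))   # -> "B"
```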
When we have Classes with Same Variance and Same Prior Probabilities, the decision boundary diagram becomes a Voronoi Diagram.
If we only have Same Variance but different Prior Probabilities, the decision boundary diagram becomes a Power Diagram.
- Voronoi Diagram: A Voronoi diagram divides a space into regions where each region corresponds to a "generator" point. Every location within a region is closer to its corresponding generator than to any other generator.
- Power Diagram: A power diagram, also known as a weighted Voronoi diagram, generalizes the Voronoi diagram by assigning a weight to each generator point. The distance metric is modified to account for these weights, leading to regions defined by the "power distance" rather than the standard Euclidean distance.