- Principal Component Analysis is an unsupervised learning technique, i.e. we have sample points but we don't have labels.
- Therefore there are no classes and no regression, since there are no $y$ values. Nothing to predict.
- What we want to do is discover some underlying structure in the data.
Popular Techniques of unsupervised learning
- Clustering: Partition data into groups of similar / nearby points.
- Dimensionality reduction: Data often lies near a low dimensional subspace (or manifold) in feature space. Think about matrix low-rank approximation (SVD).
- Density Estimation: Fit a continuous distribution to discrete data.
Principal Component Analysis (PCA) Karl Pearson, 1901
Principal Component Analysis falls under dimensionality reduction.
Given sample points in $\mathbb{R}^d$, find $k$ directions that capture most of the variation.

Left is the Feature Space and Right is the Principal Component Space
Example of PCA on hand-written digits, 28 × 28 grayscale bitmaps.
One image = 784-dimensional vector → PCA → 2-dimensional vector.
As we can see here, two dimensions may not be enough to capture all of the information, i.e. two dimensions may not be enough to maximize the variance of the projected data points.
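A minimal sketch of this pipeline (using scikit-learn's built-in 8 × 8 digits, so 64 dimensions rather than 784; the idea is the same):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                        # each image: 8 x 8 bitmap -> 64-dim vector
X = digits.data                               # shape (n, 64)
X_2d = PCA(n_components=2).fit_transform(X)   # each image -> 2-dim vector
print(X.shape, X_2d.shape)                    # (1797, 64) (1797, 2)
```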
Why do we do PCA?
- Reducing number of dimensions makes some computations cheaper. (e.g. regression).
- Identify and remove irrelevant dimensions to reduce overfitting in learning algorithms. This is like subset selection, but the features are not axis-aligned, i.e. the new features are not the original features. PCA "combines" several input features into one or more orthogonal "new" features known as principal components.
- Find a small basis for representing variations in complex things (faces, genes).
- Let $X$ be the $n \times d$ design matrix. No fictitious dimension (no appended column of ones).
- Center $X$ and still call it $X$: find the average across rows, which gives the mean of each feature, and subtract it from every row.
- Let $w$ be a unit vector. The orthogonal projection of a point $x$ onto the vector $w$ is $\tilde{x} = (x \cdot w)\, w$.
- If $w$ is not a unit vector, then $\tilde{x} = \frac{x \cdot w}{\|w\|^2}\, w$.
- The idea is to pick the best $w$, in the sense that the projection onto $w$ captures as much of the original data as possible.
- Given $k$ orthonormal directions $v_1, \dots, v_k$, the projection $\tilde{x} = \sum_{i=1}^{k} (x \cdot v_i)\, v_i$ is a linear combination of the vectors $v_i$.
- The coefficients $x \cdot v_i$ are not coordinates in the original feature space; they are principal coordinates. Often we just want the $k$ principal coordinates in principal component space (just the coefficients, not the entire vector $\tilde{x}$). A small sketch of these formulas follows this list.
- We can compute these principal component directions from the eigenvectors of $X^\top X$ (equivalently, of the sample covariance matrix $\frac{1}{n} X^\top X$), both of which are positive semidefinite matrices.
- The eigenvalues of $X^\top X$ are all $\geq 0$. Sort them in the order $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$.
- Let $v_1, \dots, v_d$ be the corresponding orthogonal unit eigenvectors. These are the principal components.
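A small numerical sketch of these projection formulas, with made-up vectors (not part of the original notes):

```python
import numpy as np

x = np.array([3.0, 1.0, 2.0])                   # one sample point

# projection onto a unit vector w: x_tilde = (x . w) w
w = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
x_tilde = (x @ w) * w

# if w is not a unit vector: x_tilde = ((x . w) / ||w||^2) w
w2 = np.array([2.0, 2.0, 0.0])
x_tilde2 = ((x @ w2) / np.dot(w2, w2)) * w2
print(np.allclose(x_tilde, x_tilde2))           # True: same projection

# k orthonormal directions v_1, ..., v_k: x_tilde = sum_i (x . v_i) v_i;
# the k principal coordinates are just the coefficients x . v_i
V = np.stack([w, np.array([0.0, 0.0, 1.0])], axis=1)   # (d, k), orthonormal columns
coords = x @ V                                  # principal coordinates, shape (k,)
x_tilde_k = V @ coords                          # projection back in feature space
```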
There are three ways to derive PCA:
1. Fit a Gaussian to the data with maximum likelihood estimation.
- Choose the $k$ Gaussian axes of greatest variance.
- Recall that MLE estimates the covariance matrix as $\hat{\Sigma} = \frac{1}{n} X^\top X$ (with $X$ centered).
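A minimal sketch of this maximum-likelihood view, using made-up data (names and sizes are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # toy (n, d) data

# Gaussian MLE: sample mean and covariance (divide by n, not n - 1)
mean = X.mean(axis=0)
Xc = X - mean
Sigma_hat = Xc.T @ Xc / Xc.shape[0]

# The fitted Gaussian's axes are the eigenvectors of Sigma_hat, with axis
# lengths proportional to sqrt(eigenvalues). Keeping the k axes of greatest
# variance is exactly keeping the k top-eigenvalue eigenvectors, i.e. PCA.
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
order = np.argsort(eigvals)[::-1]
gaussian_axes = eigvecs[:, order]               # columns, greatest variance first
```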
2. Find the direction $w$ that maximizes the sample variance of the projected data.
- Recall the Rayleigh quotient of a matrix $M$ and a nonzero vector $w$: $\frac{w^\top M w}{w^\top w}$.
- The variance of the projected points (which have mean zero, since $X$ is centered) is $\mathrm{Var}(\{\tilde{X}_1, \dots, \tilde{X}_n\}) = \frac{1}{n} \sum_{i=1}^{n} \left( X_i \cdot \frac{w}{\|w\|} \right)^2 = \frac{1}{n} \frac{\|Xw\|^2}{\|w\|^2} = \frac{1}{n} \frac{w^\top X^\top X w}{w^\top w}$.
- $X^\top X$ is a positive semidefinite matrix, and therefore we can apply the Rayleigh quotient.
- Thus, maximizing the variance of the projected data points is the same as maximizing the Rayleigh quotient $\frac{w^\top X^\top X w}{w^\top w}$ over $w \neq 0$.
- By the Rayleigh quotient, the optimal direction $w$ for this maximization problem is the eigenvector corresponding to the largest eigenvalue of $X^\top X$.
- If $\lambda_1$ is the largest eigenvalue of $X^\top X$, then the maximum variance is $\frac{1}{n}\lambda_1$.
- Therefore, the eigenvector $v_1$ corresponding to the largest eigenvalue is the first principal component.
- If we constrain $w$ to be orthogonal to $v_1$, we get the second principal component $v_2$.
- Alternatively, using the SVD $X = \sum_i \sigma_i u_i v_i^\top$, this corresponds to subtracting the rank-1 approximation $\sigma_1 u_1 v_1^\top$ from the original matrix $X$ and applying the same procedure to the residual matrix. (A numerical sanity check follows the figure below.)

The blue dots are the projected points; we want to maximize their variance.
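A quick numerical check of this claim, on made-up data (a sketch, not part of the original notes): the projected variance along the top eigenvector is never beaten by a random direction.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)                          # center

def projected_variance(X, w):
    w = w / np.linalg.norm(w)
    return np.mean((X @ w) ** 2)                # (1/n) ||Xw||^2 for unit w

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
v1 = eigvecs[:, -1]                             # eigenvector of the largest eigenvalue

best_random = max(projected_variance(X, rng.normal(size=5)) for _ in range(1000))
print(projected_variance(X, v1))                # = lambda_1 / n, the maximum
print(best_random)                              # always <= the value above
```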
3. Find the direction $w$ that minimizes the mean squared projection distance.
- This is similar to least squares: both minimize a mean squared distance.
- In least squares, we measure the distance in the $y$ direction.
- In PCA, we measure the perpendicular distance from each training point to the subspace (hyperplane).
- Find the unit vector $w$ that minimizes $\sum_{i=1}^{n} \|X_i - \tilde{X}_i\|^2 = \sum_{i=1}^{n} \left( \|X_i\|^2 - (X_i \cdot w)^2 \right)$.
- Since the first term does not depend on $w$, this is the same as maximizing the latter term $\sum_i (X_i \cdot w)^2$, which is $n \times$ the variance of the projected points (from part 2).
- Minimizing the mean squared projection distance = maximizing the variance of the projected data points. (See the sketch after the figure below.)


Least squares vs. Principal Component Analysis distances
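A small numerical illustration of this equivalence, on made-up data (a sketch; `w` is an arbitrary unit direction):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
X = X - X.mean(axis=0)                          # center

w = rng.normal(size=4)
w = w / np.linalg.norm(w)                       # an arbitrary unit direction

proj_lengths = X @ w                            # X_i . w
X_tilde = np.outer(proj_lengths, w)             # projections (X_i . w) w

sq_proj_dist = np.sum((X - X_tilde) ** 2)       # sum_i ||X_i - X~_i||^2
print(np.isclose(sq_proj_dist, np.sum(X ** 2) - np.sum(proj_lengths ** 2)))  # True

# So minimizing sq_proj_dist over w is the same as maximizing
# sum_i (X_i . w)^2, which is n times the variance of the projected points.
```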
PCA Algorithm:
import numpy as np

# X: (n, d) design matrix, one sample point per row
# center matrix X
mean = np.mean(X, axis=0)
X_centered = X - mean
# normalize X (Optional: are the units of measurement different?)
#   yes: normalize
#   no : usually no need to normalize (skip the next two lines)
std = np.std(X_centered, axis=0)
X_centered = X_centered / std
# compute unit eigenvectors and eigenvalues of the sample covariance matrix
# X^T X / n (same eigenvectors as X^T X; eigenvalues scaled by 1/n)
cov_matrix = X_centered.T @ X_centered / X_centered.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov_matrix)
# sort eigenvalues (and corresponding eigenvectors) in descending order
# note: 189 uses the reverse of the following order (ascending order)
sorted_indices = np.argsort(eigvals)[::-1]
eigvals = eigvals[sorted_indices]
eigvecs = eigvecs[:, sorted_indices]
# choose k (Optional: use the eigenvalues to gauge a good k)
k = 3
# for the best k-dimensional subspace, pick the k eigenvectors with the largest eigenvalues
top_k_eigvecs = eigvecs[:, :k]  # (d, k)
# compute the k principal coordinates x . v_i of each training / test point
X_pca = X_centered @ top_k_eigvecs  # (n, k)
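As a sanity check (a sketch, not part of the course code), the same principal coordinates can be obtained from the SVD of the centered matrix; assuming distinct eigenvalues, the columns agree up to sign flips:

```python
import numpy as np

# assumes X_centered, k, and X_pca from the block above
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca_svd = X_centered @ Vt[:k].T               # right singular vectors = PCA directions
print(np.allclose(np.abs(X_pca_svd), np.abs(X_pca)))   # True, up to column signs
```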

How to choose the number of principal components, and whether or not to normalize the data before doing PCA.
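One common heuristic for choosing $k$ (a sketch using the eigenvalues computed above; the 95% threshold is an illustrative choice, not from the notes): keep the smallest $k$ whose components explain most of the variance.

```python
import numpy as np

# assumes eigvals sorted in descending order, as in the algorithm above
explained = eigvals / eigvals.sum()             # fraction of variance per component
cumulative = np.cumsum(explained)
k = int(np.searchsorted(cumulative, 0.95)) + 1  # smallest k explaining >= 95% of variance
print(explained[:5], k)
```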
Applications
- John Novembre: Genes mirror geography within Europe link
- EigenFaces (Face Recognition)
- Suppose we have a matrix $X$ that contains $n$ images of faces with $d$ pixels each.
- Face recognition: given a query face, compare it with all training faces and find the nearest neighbor in $\mathbb{R}^d$.
- Runtime for each query: $\Theta(nd)$.
- Solution: run PCA on the faces and project them onto a $d'$-dimensional subspace, where $d' \ll d$.
- The new runtime: $\Theta(nd')$. (A sketch of this query pipeline follows this list.)
- If you have 500 stored faces with 40,000 pixels each, and you reduce them to 40 principal components, then each query face requires you to read 20,000 stored principal coordinates instead of 20 million pixels.
- Eigenfaces encode both face shape and lighting. Some people say that the first 3 eigenfaces are usually all about lighting, and you sometimes get better facial recognition by dropping the first 3 eigenfaces.
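A minimal sketch of this query pipeline, with made-up array names and sizes (and a smaller $d$ than the 40,000-pixel example, for speed):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, d_prime = 500, 4_000, 40
faces = rng.normal(size=(n, d))                 # stored faces, one per row

# offline: PCA on the stored faces
mean = faces.mean(axis=0)
centered = faces - mean
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
V = Vt[:d_prime].T                              # (d, d') principal directions
faces_pc = centered @ V                         # (n, d') stored principal coordinates

# online: nearest-neighbor comparisons now cost Theta(n d') instead of Theta(n d)
query = rng.normal(size=d)
query_pc = (query - mean) @ V                   # project the query the same way
nearest = int(np.argmin(np.linalg.norm(faces_pc - query_pc, axis=1)))
print(nearest)
```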
- Suppose we have
