Abstract

We will go through “centering”, “decorrelating”, and “sphering”, which are common preprocessing steps for machine learning and statistical analysis.

Understanding the Design Matrix X

  • Suppose X is a design matrix. It contains n data points (samples) with d dimensions (features).
  • Each row represents a single sample (a point in d-dimensional space).
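As a minimal sketch (with made-up random data), a design matrix in numpy is just a 2-D array whose first axis indexes samples and whose second axis indexes features:

```python
import numpy as np

# Hypothetical design matrix: n = 5 samples, d = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

print(X.shape)  # (5, 3): each row is one sample, each column is one feature
```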

Centering the Matrix X

  • Centering means subtracting the mean from all data points. In other words, we subtract the mean of the rows (the column-wise mean) from each row.
  • The new matrix has zero mean in every column (exactly, up to floating-point precision).
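The centering step above can be sketched in numpy (toy data, assumed here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, size=(100, 3))  # toy data with a nonzero mean

X_centered = X - X.mean(axis=0)          # subtract the column-wise mean from each row

# Every column of the centered matrix now has (numerically) zero mean.
print(np.allclose(X_centered.mean(axis=0), 0.0))  # True
```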

Why do we do this?

  • Many statistical methods assume that the data is centered around zero. In PCA, for example, it’s important to center the data before computing the covariance matrix and its eigenvectors.
  • It also simplifies the estimation of the covariance matrix.
  • It prevents bias in machine learning models that might otherwise be influenced by a nonzero mean (offset) in the data.

Sample Covariance Matrix

  • Measures how different features in the data vary together.
  • With a centered matrix X, the sample covariance matrix is defined as:

    Var(R) = (1/n) XᵀX

The general covariance formula is:

    Var(R) = (1/n) Σᵢ₌₁ⁿ (Xᵢ − μ)(Xᵢ − μ)ᵀ

where μ is the mean of the rows; for a centered matrix μ = 0, so this reduces to the formula above.
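A quick sanity check (with made-up data) that the (1/n) XᵀX formula for a centered matrix agrees with numpy's built-in covariance, using the `rowvar=False` convention the notes mention (`bias=True` makes numpy divide by n rather than n − 1):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)          # center the design matrix

# Sample covariance of the centered matrix: (1/n) * Xc.T @ Xc
n = Xc.shape[0]
cov_manual = Xc.T @ Xc / n

# np.cov with rowvar=False treats rows as observations; bias=True divides by n.
cov_numpy = np.cov(X, rowvar=False, bias=True)

print(np.allclose(cov_manual, cov_numpy))  # True
```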

Decorrelating

  • When we decorrelate a dataset, we transform it into a new coordinate system in which the features are uncorrelated. This is useful because many machine learning algorithms work better when the features are uncorrelated.
  • We decorrelate to make the covariance matrix diagonal. To remove correlation, we apply an eigenvector transformation:

    Z = XV

where V is the matrix of eigenvectors of Var(R), that is, Var(R) = VΛVᵀ. The diagonal values of Λ represent the variance along the eigenvector axes. We right-multiply instead of left-multiply because X is a design matrix whose rows are the data points (rowvar = False in numpy covariance).

Thus, the variance of Z will be:

    Var(Z) = Vᵀ Var(R) V = Vᵀ (VΛVᵀ) V = Λ
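A minimal sketch of the decorrelation step, assuming toy correlated data built for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy correlated data: the second feature depends on the first.
X = rng.normal(size=(500, 2))
X[:, 1] += 0.8 * X[:, 0]
Xc = X - X.mean(axis=0)                 # center first

cov = np.cov(Xc, rowvar=False, bias=True)
eigvals, V = np.linalg.eigh(cov)        # Var(R) = V @ diag(eigvals) @ V.T

Z = Xc @ V                              # right-multiply: rows are data points

# The covariance of Z is (numerically) diagonal, with the eigenvalues on the diagonal.
cov_Z = np.cov(Z, rowvar=False, bias=True)
print(np.allclose(cov_Z, np.diag(eigvals), atol=1e-8))  # True
```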

Sphering / Whitening

  • Make all features have unit variance.
  • We get this by applying:

    W = Z Λ^(−1/2) = X V Λ^(−1/2)

so that Var(W) = Λ^(−1/2) Λ Λ^(−1/2) = I, the identity matrix.
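Extending the decorrelation sketch, whitening rescales each decorrelated axis by one over the square root of its eigenvalue (again using made-up correlated data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:, 1] += 0.8 * X[:, 0]                # toy correlated data
Xc = X - X.mean(axis=0)

cov = np.cov(Xc, rowvar=False, bias=True)
eigvals, V = np.linalg.eigh(cov)

# Whitening: decorrelate, then rescale each axis by 1/sqrt(eigenvalue).
W = Xc @ V / np.sqrt(eigvals)

# The whitened covariance is (numerically) the identity matrix.
print(np.allclose(np.cov(W, rowvar=False, bias=True), np.eye(2)))  # True
```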

Why do we do this?

  • In algorithms like SVMs and neural networks, some features might have much larger values than others.
  • These algorithms may give more importance to the features with larger values.
  • Thus, sphering / whitening normalizes the features to unit variance, ensuring that all of them contribute equally.