Abstract
We will go through “centering”, “decorrelating”, and “sphering”, three common preprocessing steps for machine learning and statistical analysis.
Understanding the Design Matrix X
- Suppose X is an n × d design matrix. It contains n data points (samples), each with d dimensions (features).
- Each row represents a single sample (a point in d-dimensional space), and each column represents a single feature.
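As a small illustration of this layout, the sketch below builds a toy design matrix with NumPy. The values and the variable names are made up for the example.

```python
import numpy as np

# Hypothetical toy data: n = 4 samples (rows), d = 2 features (columns).
X = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [5.0, 7.0],
    [7.0, 9.0],
])

n, d = X.shape        # n samples, d features
first_sample = X[0]   # one row = one point in d-dimensional space
first_feature = X[:, 0]  # one column = one feature across all samples
```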
Centering the Matrix X
- Centering means subtracting the mean data point from every data point. In other words, we compute the mean of all rows (a vector of d feature means) and subtract it from each row.
- Each feature (column) of the new matrix has zero mean, up to floating-point rounding.
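A minimal NumPy sketch of centering, assuming the rows-are-samples layout described above. The function name `center` and the toy data are illustrative only.

```python
import numpy as np

def center(X):
    """Subtract the per-feature (column) mean from every row of X."""
    mu = X.mean(axis=0)   # shape (d,): the mean of all rows
    return X - mu         # broadcasting subtracts mu from each row

# Hypothetical toy data: 4 samples, 2 features.
X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 7.0], [7.0, 9.0]])
X_centered = center(X)
col_means = X_centered.mean(axis=0)   # each entry should be (numerically) zero
```

Taking the mean with `axis=0` averages over rows, so the result has one entry per feature.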
Why do we do this?
- Many statistical methods assume that the data is centered around zero. In PCA, for example, it’s important to center the data before computing the eigenvalues.
- It also simplifies the estimation of the covariance matrix.
- It prevents bias in machine learning models that might otherwise be influenced by a constant offset in the data rather than by how the data varies.
Sample Covariance Matrix
- Measures how different features in the data vary together.
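- With a centered matrix $\dot{X}$ (the design matrix $X$ after its column means have been subtracted), the sample covariance matrix is defined as:
$$\mathrm{Var}(R) = \frac{1}{n}\dot{X}^\top \dot{X}$$
Here $R$ can be read as a random point drawn uniformly from the n samples. Note that some texts, and NumPy's np.cov by default, divide by n − 1 instead of n.
The general covariance formula for a random vector $R$ with mean $\mu = E[R]$ is:
$$\mathrm{Var}(R) = E\left[(R - \mu)(R - \mu)^\top\right]$$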
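The sample covariance of a design matrix can be sketched in NumPy as below. This assumes the 1/n normalization; `sample_covariance` is a hypothetical helper name, and the comparison uses `np.cov` with `rowvar=False` (columns are features) and `bias=True` (divide by n).

```python
import numpy as np

def sample_covariance(X):
    """d x d sample covariance of an n x d design matrix (assumes 1/n scaling)."""
    Xc = X - X.mean(axis=0)          # center the columns first
    return Xc.T @ Xc / X.shape[0]    # (1/n) * Xc^T Xc

# Hypothetical toy data: 4 samples, 2 features.
X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 7.0], [7.0, 9.0]])
S = sample_covariance(X)

# Cross-check against NumPy with the same conventions.
S_numpy = np.cov(X, rowvar=False, bias=True)
```

The diagonal of `S` holds the variance of each feature, and the off-diagonal entries hold the covariance between pairs of features.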
Decorrelating
- When we decorrelate a dataset, we transform it into a new coordinate system in which the features are uncorrelated. This is useful because many machine learning algorithms work better when features are not correlated with each other.
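- We want to decorrelate to make the covariance matrix diagonal. To remove correlation, we apply an eigenvector transformation to the centered matrix $\dot{X}$:
$$Z = \dot{X}V$$
where $V$ is the matrix whose columns are the eigenvectors of $\mathrm{Var}(R)$, that is, $\mathrm{Var}(R) = V \Lambda V^\top$. The diagonal values of $\Lambda$ represent the variance along the eigenvector axes. We right-multiply by $V$ rather than left-multiply because $\dot{X}$ is a design matrix whose rows are the data points; this corresponds to passing rowvar=False to NumPy's np.cov.
Thus, the variance of $Z$ will be:
$$\mathrm{Var}(Z) = V^\top \mathrm{Var}(R)\, V = V^\top V \Lambda V^\top V = \Lambda$$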
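The decorrelation step can be sketched as follows. This is an illustration under stated assumptions: 1/n covariance scaling, `np.linalg.eigh` for the symmetric eigendecomposition, and synthetic correlated data. The names `decorrelate` and `A` are made up for the example.

```python
import numpy as np

def decorrelate(X):
    """Rotate centered data onto the eigenvectors of its sample covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]        # assumes 1/n scaling
    eigvals, V = np.linalg.eigh(cov)    # columns of V are orthonormal eigenvectors
    Z = Xc @ V                          # right-multiply: rows are samples
    return Z, eigvals

# Hypothetical correlated data: mix two independent Gaussian features.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.normal(size=(500, 2)) @ A

Z, eigvals = decorrelate(X)
cov_Z = Z.T @ Z / Z.shape[0]   # should be diagonal, with eigvals on the diagonal
```

`eigh` is used rather than `eig` because the covariance matrix is symmetric; it returns real eigenvalues in ascending order.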
Sphering / Whitening
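- Make all features have unit variance while keeping them uncorrelated, so that the covariance matrix becomes the identity.
- We get this by applying the following to the centered matrix $\dot{X}$:
$$W = \dot{X}\,\mathrm{Var}(R)^{-1/2}, \qquad \mathrm{Var}(R)^{-1/2} = V \Lambda^{-1/2} V^\top$$
where $\mathrm{Var}(R) = V \Lambda V^\top$ is the eigendecomposition of the sample covariance matrix. The result satisfies $\mathrm{Var}(W) = I$.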
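A minimal sketch of sphering, assuming the transform is a right-multiplication of the centered data by the inverse square root of the sample covariance (1/n scaling) and that the covariance has full rank. The function name `sphere` and the synthetic data are illustrative.

```python
import numpy as np

def sphere(X):
    """Center X, then map it so its sample covariance becomes the identity.

    Assumes the covariance matrix is full rank (all eigenvalues > 0).
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]                           # assumes 1/n scaling
    eigvals, V = np.linalg.eigh(cov)
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(eigvals)) @ V.T   # cov^(-1/2)
    return Xc @ inv_sqrt

# Hypothetical correlated data with very different feature scales.
rng = np.random.default_rng(0)
A = np.array([[10.0, 0.0], [3.0, 0.1]])
X = rng.normal(size=(500, 2)) @ A

W = sphere(X)
cov_W = W.T @ W / W.shape[0]   # should be close to the 2 x 2 identity
```

If some eigenvalues are zero or very small, the division becomes unstable; a common workaround is to add a small constant to `eigvals` before taking the square root, at the cost of an only approximately identity covariance.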
Why do we do this?
- In algorithms like SVM and neural networks, some features might have much larger values than others.
- These algorithms can give more weight to features with larger values, regardless of how informative those features are.
- Thus, by sphering / whitening, we normalize all the features, ensuring that all the features contribute equally.