Optimization Problem
Find $w$ that minimizes:

$$\|Xw - y\|_2^2 + \lambda \|w'\|_2^2$$

where $w'$ is the weight vector without the bias term (or with the bias term set to 0).
- $\ell_2$ penalization promotes shrinkage of the weights, i.e., it encourages solutions with smaller weights.
- It also makes the least-squares problem always have a unique minimum, because the $\lambda I'$ term makes $X^\top X + \lambda I'$ positive definite instead of just positive semidefinite (see the sketch below).
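As a minimal numerical sketch of that second point (the matrix $X$ here is made up, and the full identity is used for simplicity rather than $I'$):

```python
import numpy as np

# Design matrix with linearly dependent columns, so X^T X is singular:
# positive semidefinite but not invertible (many least-squares minima).
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
lam = 0.1

A = X.T @ X                        # rank 1, positive semidefinite
A_ridge = A + lam * np.eye(2)      # positive definite for any lam > 0

print(np.linalg.eigvalsh(A))        # smallest eigenvalue is (numerically) 0
print(np.linalg.eigvalsh(A_ridge))  # all eigenvalues >= lam > 0, so invertible
```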
Left figure: ill-posed, many minima. Right figure: well-posed, one unique minimum.
$\ell_2$ regularization reduces overfitting by reducing the variance while increasing the bias. In general, we don't want large weights when the data and labels are relatively small.
The Ridge Regression solution lies at the point of tangency between the red isocontours and the blue isocontours. Notice how a higher $\lambda$ pulls the solution towards the minimum of the penalty term (the origin).
If we were to solve the optimization problem

$$\min_w \; \|Xw - y\|_2^2 + \lambda \|w'\|_2^2,$$

the optimal solution would be

$$w = (X^\top X + \lambda I')^{-1} X^\top y.$$
Note that $X^\top X + \lambda I'$ is always positive definite and thus invertible, leading to a unique solution. Here $I'$ is the identity matrix with its bottom-right element set to 0 (because we don't penalize the bias term).
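A minimal NumPy sketch of this closed-form solution, assuming the bias feature is the last (all-ones) column of $X$; the function name `ridge_fit` and the data are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I') w = X^T y, where I' is the identity with its
    bottom-right entry set to 0 so the bias is not penalized.
    Assumes the last column of X is the all-ones bias feature."""
    d = X.shape[1]
    I_prime = np.eye(d)
    I_prime[-1, -1] = 0.0          # do not penalize the bias term
    return np.linalg.solve(X.T @ X + lam * I_prime, X.T @ y)

# Tiny usage example with made-up data
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(20, 3))
y = X_raw @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=20)
X = np.hstack([X_raw, np.ones((20, 1))])   # append bias column
w = ridge_fit(X, y, lam=0.5)
print(w)                                    # last entry approximates the bias 3.0
```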
Connection between Ridge Regression and Variance
Assume that the true data model is $y = Xv + e$, where $e$ is noise drawn from a normal distribution $N(0, \sigma^2)$. Then the variance of the ridge regression prediction at an arbitrary test point $z$ is:

$$\operatorname{Var}\!\left(z^\top (X^\top X + \lambda I')^{-1} X^\top e\right)$$
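A short derivation sketch of that variance, assuming $e \sim N(0, \sigma^2 I)$ so the noise term is the only source of randomness:

$$
\begin{aligned}
\hat{w} &= (X^\top X + \lambda I')^{-1} X^\top y = (X^\top X + \lambda I')^{-1} X^\top (Xv + e) \\
\operatorname{Var}(z^\top \hat{w}) &= \operatorname{Var}\!\left(z^\top (X^\top X + \lambda I')^{-1} X^\top e\right) \\
&= \sigma^2 \, z^\top (X^\top X + \lambda I')^{-1} X^\top X \, (X^\top X + \lambda I')^{-1} z
\end{aligned}
$$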
As $\lambda \to \infty$, the variance approaches 0, but the bias increases. In practice, we need to tune $\lambda$ by cross-validation (a small sketch follows).
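A minimal k-fold cross-validation sketch for choosing $\lambda$, reusing the illustrative `ridge_fit` from above; the data and the candidate grid are made up:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution with an unpenalized bias (last column of X)."""
    I_prime = np.eye(X.shape[1])
    I_prime[-1, -1] = 0.0
    return np.linalg.solve(X.T @ X + lam * I_prime, X.T @ y)

def cv_mse(X, y, lam, k=5):
    """Average validation MSE of ridge regression over k folds."""
    folds = np.array_split(np.random.permutation(len(y)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)

# Illustrative data and candidate grid; pick the lambda with the lowest CV error.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.3 * rng.normal(size=50)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=lambda lam: cv_mse(X, y, lam))
print("best lambda:", best_lam)
```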
Important Note: features should be normalized so that they all have the same variance and their weights are penalized by the same amount.
An alternative way to achieve this is to use a different diagonal matrix in place of $I'$, which applies different penalty weights to different features (see the sketch below).
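A small sketch illustrating both options: standardize the features and use $\lambda I'$, or keep the raw features and swap $I'$ for a diagonal penalty matrix `D` (the specific scales and penalty values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 0.01])  # very different scales
y = rng.normal(size=50)

# Option 1: standardize features to zero mean and unit variance,
# then penalize every non-bias weight equally with lambda * I'.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.hstack([X_std, np.ones((50, 1))])      # bias column last
lam = 1.0
I_prime = np.eye(4)
I_prime[-1, -1] = 0.0                         # leave the bias unpenalized
w_equal = np.linalg.solve(X.T @ X + lam * I_prime, X.T @ y)

# Option 2: keep the raw features and use a diagonal matrix D in place of
# lambda * I', giving each feature its own penalty (and the bias none).
X_unscaled = np.hstack([X_raw, np.ones((50, 1))])
D = np.diag([1.0, 1e-4, 1e4, 0.0])            # illustrative per-feature penalties
w_weighted = np.linalg.solve(X_unscaled.T @ X_unscaled + D, X_unscaled.T @ y)

print(w_equal, w_weighted)
```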
Figure: bias squared vs. variance as $\lambda$ increases.
Warning