Linear Regression and Logistic Regression are two different prediction algorithms.
Linear Regression:
- produces a continuous output that can take any real value.
- Linear Regression Equation: $\hat{y} = w^\top x + \alpha$.
- Linear Regression uses Mean Squared Error (MSE): $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$.
Logistic Regression:
- produces a probability (between 0 and 1) which is then used for classification.
- Logistic Regression Equation: uses the sigmoid function, $\hat{y} = s(w^\top x + \alpha)$ with $s(\gamma) = \frac{1}{1 + e^{-\gamma}}$.
- Logistic Regression uses "Log Loss" (cross-entropy).
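To make the contrast concrete, here is a minimal NumPy sketch (my own illustration, not from the notes; the vectors `x`, `w` and the bias `alpha` are arbitrary made-up values):

```python
# Sketch: the two prediction rules side by side (illustrative values only).
import numpy as np

def linear_predict(x, w, alpha):
    """Linear regression: real-valued, unbounded output w.x + alpha."""
    return np.dot(w, x) + alpha

def logistic_predict(x, w, alpha):
    """Logistic regression: squash the same linear score through the sigmoid
    to get a probability in (0, 1), then threshold it for classification."""
    score = np.dot(w, x) + alpha
    return 1.0 / (1.0 + np.exp(-score))

x = np.array([2.0, -1.0])
w = np.array([0.5, 1.5])
alpha = 0.1
print(linear_predict(x, w, alpha))          # -0.4 (any real value)
print(logistic_predict(x, w, alpha) > 0.5)  # class label from the probability
```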
In GDA, we were regressing over the posterior probability.
- Choose Form of Regression Function h(x; w) with parameters w. (h = hypothesis)
- Choose a Cost function (objective function) to optimize
- Usually based on a Loss Function.
- Empirical Risk (based on the data we have) = the expected loss (risk) computed over the data.
Some Regression Functions:
- Linear: $h(x; w, \alpha) = w^\top x + \alpha$
- Polynomial: $h$ is linear in polynomial features of $x$ (all terms up to some degree $p$).
- Logistic: $h(x; w, \alpha) = s(w^\top x + \alpha)$, where $s(\gamma) = \frac{1}{1 + e^{-\gamma}}$ is the sigmoid function.
Some Loss Functions:
Let $\hat{y} = h(x; w)$ be the prediction and $y$ be the true label.
- $L(\hat{y}, y) = (\hat{y} - y)^2$ - squared error: smooth and easy to optimize.
- $L(\hat{y}, y) = |\hat{y} - y|$ - absolute error: not sensitive to outliers, but harder to optimize.
- $L(\hat{y}, y) = -y \ln \hat{y} - (1 - y)\ln(1 - \hat{y})$ - logistic loss, a.k.a. cross-entropy. IMPORTANT.
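A small NumPy sketch of these three losses for a single prediction (my own illustration; the `eps` clipping in the logistic loss is just an assumption to keep the logarithm finite):

```python
# Sketch of the three per-sample loss functions listed above.
import numpy as np

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2          # smooth, easy to optimize, outlier-sensitive

def absolute_loss(y_hat, y):
    return abs(y_hat - y)            # robust to outliers, not differentiable at 0

def logistic_loss(y_hat, y, eps=1e-12):
    # cross-entropy; y_hat must be a probability in (0, 1), y in {0, 1}
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(squared_loss(2.5, 3.0), absolute_loss(2.5, 3.0), logistic_loss(0.9, 1))
```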
Observations
- Squared error is smooth, quadratic, and convex - it has a closed-form solution: just set the gradient to 0 to find the minimum.
- Logistic loss is also smooth but non-quadratic and non-linear - the function is still convex (a single global minimum), but we need a numerical method to find the optimum.
Some Cost functions to Minimize:
- $J(w) = \frac{1}{n}\sum_{i=1}^{n} L(h(X_i; w), y_i)$ - mean loss (empirical risk). Note that the factor $\frac{1}{n}$ does not matter in the optimization.
- $J(w) = \max_{i=1,\dots,n} L(h(X_i; w), y_i)$ - maximum loss; use it only if you trust your data (it is not robust to outliers).
- $J(w) = \sum_{i=1}^{n} \omega_i\, L(h(X_i; w), y_i)$ - weighted sum, with per-sample weights $\omega_i$.
- cost_1, cost_2, or cost_3 $+\ \lambda \lVert w \rVert_2^2$ - $\ell_2$-penalized / regularized cost.
- cost_1, cost_2, or cost_3 $+\ \lambda \lVert w \rVert_1$ - $\ell_1$-penalized / regularized cost.
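A hedged sketch of how these cost functions combine per-sample losses (the loss values, weights, parameter vector, and $\lambda$ below are made up purely for illustration):

```python
# Sketch: building the cost functions above from per-point losses L(h(X_i; w), y_i).
import numpy as np

losses  = np.array([0.2, 0.05, 1.3, 0.4])   # per-sample losses (made up)
weights = np.array([1.0, 1.0, 0.1, 2.0])    # omega_i from domain knowledge (made up)
w   = np.array([0.5, -1.2])                 # current parameter vector (made up)
lam = 0.1                                   # regularization strength (made up)

mean_loss    = losses.mean()                        # empirical risk
max_loss     = losses.max()                         # only if you trust the data
weighted_sum = np.dot(weights, losses)              # weighted cost
ridge_cost   = mean_loss + lam * np.sum(w ** 2)     # + lambda * ||w||_2^2
lasso_cost   = mean_loss + lam * np.sum(np.abs(w))  # + lambda * ||w||_1
```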
Important Notation
- Loss function = a “measure” for each data point.
- Cost function = a “measure” over all data points.
Some famous regression methods:
- Least Square Linear Regression → Linear Regression Function + Squared Error Loss + Mean Loss cost.
- Weighted Least Square Regression → Linear Regression Function + Squared Error Loss + Weighted Sum cost.
- Ridge Regression → Linear Regression Function + Squared Error Loss + $\ell_2$-penalized Mean Loss.
- LASSO → Linear Regression Function + Squared Error Loss + $\ell_1$-penalized Mean Loss.
- Logistic Regression → Logistic Regression Function + Logistic Loss + Mean Loss.
- Least Absolute Deviations → Linear Regression Function + Absolute Error Loss + Mean Loss.
- Chebyshev Criterion → Linear Regression Function + Absolute Error Loss + Maximum Loss.
- 1 to 3 → quadratic cost: minimize with calculus (set the gradient to 0).
- 4 → quadratic program.
- 5 → convex cost: minimize with gradient descent (see the sketch after this list).
- 6, 7 → linear program.
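As a concrete illustration of item 5, here is a minimal sketch of logistic regression trained with batch gradient descent on the mean logistic loss (my own example; the learning rate, iteration count, and toy data are arbitrary assumptions, not part of the notes):

```python
# Sketch: logistic regression has no closed-form solution, so we descend the
# gradient of the mean cross-entropy cost.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=1000):
    # X: n x d design matrix (with a fictitious column of ones for the bias), y in {0,1}^n
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)          # predicted probabilities
        grad = X.T @ (p - y) / n    # gradient of the mean logistic loss
        w -= lr * grad              # gradient descent step
    return w

# toy usage with made-up data
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])  # fictitious dimension
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = fit_logistic(X, y)
```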
[Least Square] Linear Regression (Gauss, 1801)
Linear Regression Function + Squared Loss Function + Mean Loss Cost Function.
Goal of Least Square Linear Regresssion
Find $w$ and $\alpha$ that minimize $\sum_{i=1}^{n} (X_i \cdot w + \alpha - y_i)^2$; $\alpha$ is a bias term.
The cost function that we use in linear regression is this sum of squared errors (the residual sum of squares).
Design Matrix Convention
The design matrix $X$ is an $n \times d$ matrix of sample points, and $y$ is an $n$-vector of scalar labels. The columns of $X$ are features and the rows are sample points. Typically, $n > d$ if we have enough sample points.
Fictitious Dimension
Rewrite $h(x) = w \cdot x + \alpha$ as $h(x) = \begin{bmatrix} w \\ \alpha \end{bmatrix} \cdot \begin{bmatrix} x \\ 1 \end{bmatrix}$.
Thus, we append a fictitious dimension equal to 1 to every sample point and fold the bias $\alpha$ into $w$, so $X$ becomes $n \times (d+1)$. The linear regression problem with bias term can now be rewritten as:
Interpretation: find $w$ that minimizes the squared error $\lVert Xw - y \rVert_2^2$.
This is a basic Least Squares problem.
Note: In this case, there is always a solution (the normal equations $X^\top X\, w = X^\top y$ are always consistent).
- If $X$ has full column rank (the features are not linearly dependent) ⇒ $X^\top X$ is positive definite ⇒ unique solution.
- Under-constrained ⇒ $X^\top X$ is only positive semidefinite (singular) ⇒ multiple solutions.
- Over-constrained ⇒ no exact solution to $Xw = y$; least squares returns the projection of $y$ onto the column space of $X$ (the closest solution in norm).
We use a linear solver on the normal equations to find $w$.
If $(X^\top X)^{-1}$ exists, then $X$ has full column rank and $w = X^{+} y$, where $X^{+} = (X^\top X)^{-1} X^\top$ is the pseudoinverse of $X$.
See Discussion 6 of CS 189 for more.
Now, let's go back to the original problem of predicting the y values. Suppose we have already calculated the weights $w$; then the predicted y values, projected onto the hyperplane with minimum squared error, are:
$\hat{y} = Xw = X(X^\top X)^{-1} X^\top y.$
$X(X^\top X)^{-1} X^\top$ is also known as $H$, the hat matrix, since it puts the hat on $y$. If you look carefully, you can see that this is just a projection of $y$ onto the column space of $X$.
If $Hy = y$ (i.e., $y$ already lies in the column space of $X$), there is no training error, since all the training points lie exactly on a hyperplane.
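Putting the pieces together, here is a hedged NumPy sketch of the whole least-squares pipeline: the fictitious dimension, the normal equations, the least-squares-solver route, and the hat matrix. The synthetic data and "true" weights are made up, and in practice one would prefer `np.linalg.solve` or `np.linalg.lstsq` over forming an explicit inverse:

```python
# Sketch: least-squares linear regression on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X_raw = rng.normal(size=(n, d))
y = X_raw @ np.array([2.0, -1.0, 0.5]) + 0.7 + 0.05 * rng.normal(size=n)

# fictitious dimension: append a column of ones so the bias alpha folds into w
X = np.hstack([X_raw, np.ones((n, 1))])

# solve the normal equations X^T X w = X^T y (unique since X has full column rank)
w = np.linalg.solve(X.T @ X, X.T @ y)
# equivalently, via a least-squares solver (pseudoinverse route)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# hat matrix H = X (X^T X)^{-1} X^T projects y onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y                                   # same as X @ w
print(np.allclose(w, w_lstsq), np.allclose(y_hat, X @ w))
```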
Advantages of Linear Regression
- Easy to compute; linear system.
- Unique, Stable Solution unless the system is underconstrained.
Disadvantages of Linear Regression
- Very sensitive to outliers (because errors are squared).
- Fails if $X^\top X$ is singular, but this is easily fixable (e.g., with the pseudoinverse or regularization).
Least Squares Polynomial Regression
Kernel Trick
The idea here is to lift the dataset into a higher-dimensional feature space so that we can run a linear regression algorithm (with a linear decision boundary) on the lifted dataset. Lifting the data into higher dimensions makes it easier to separate (or fit) with a linear model, because in the original space the relationship is non-linear.
Replace each $X_i$ with a feature vector $\Phi(X_i)$ containing all terms of degree $0, \dots, p$.
Example (for $d = 2$, $p = 2$): $\Phi(X_i) = \left[\, X_{i1}^2,\ X_{i1}X_{i2},\ X_{i2}^2,\ X_{i1},\ X_{i2},\ 1 \,\right]^\top$.
But we need to be cautious, since it is very easy to overfit (too many parameters). A large amount of data can tame the oscillations of high-degree polynomials. Extrapolation is much harder than interpolation.
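A short sketch of the degree-2 lifting from the example above, followed by ordinary least squares on the lifted design matrix (my own illustration; the data-generating function is an arbitrary assumption):

```python
# Sketch: lift two features to all degree<=2 terms, then fit with linear least squares.
import numpy as np

def lift_degree2(X):
    """Map each row [x1, x2] to [x1^2, x1*x2, x2^2, x1, x2, 1]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x1 * x2, x2**2, x1, x2, np.ones(len(X))])

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 2))
y = 3 * X[:, 0]**2 - X[:, 0] * X[:, 1] + 0.5 + 0.01 * rng.normal(size=200)

Phi = lift_degree2(X)                            # lifted (n x 6) design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # ordinary least squares in lifted space
```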
Weighted Least Squares Regression
Linear Regression Function + Squared Loss Function + Weighted Cost Function.
Assign each sample point a weight $\omega_i$ (this comes from domain knowledge). A greater $\omega_i$ means we focus more on sample $i$, i.e., we try harder to minimize $(X_i \cdot w - y_i)^2$.
Weighted Least Squares can be formulated as: find $w$ that minimizes $\sum_{i=1}^{n} \omega_i (X_i \cdot w - y_i)^2 = (Xw - y)^\top \Omega\, (Xw - y)$, where $\Omega$ is the diagonal matrix with $\Omega_{ii} = \omega_i$.
Solve via the normal equations, $X^\top \Omega X\, w = X^\top \Omega\, y$, obtained (as before) by setting the gradient of the cost to zero.
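A minimal sketch of weighted least squares using the weighted normal equations above (the weights and data are made up; the sqrt-reweighting at the end is an equivalent reformulation, shown only as a sanity check, not something stated in the notes):

```python
# Sketch: weighted least squares via X^T Omega X w = X^T Omega y.
import numpy as np

rng = np.random.default_rng(3)
n, d = 40, 2
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])   # fictitious dimension
y = X @ np.array([1.5, -0.5, 2.0]) + 0.1 * rng.normal(size=n)
omega = rng.uniform(0.1, 5.0, size=n)                        # per-sample weights (made up)

Omega = np.diag(omega)
w = np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)        # weighted normal equations

# equivalent: rescale each row by sqrt(omega_i) and run ordinary least squares
Xs, ys = X * np.sqrt(omega)[:, None], y * np.sqrt(omega)
w_alt, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
print(np.allclose(w, w_alt))
```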