Problem: when the unit output s is close to 0 or 1 for most training points, the derivative s′ = s(1 − s) is going to be very small, making gradient descent very slow and inefficient. Unit is Stuck | Slow Training. (We don’t want the data to land in the flat spots.) Also, the middle (non-flat) part of the sigmoid is roughly a linear region, and we want neural networks to have non-linear activation functions.

Sigmoid and its issues

Solution: Replace Sigmoids with Rectified Linear Units (ReLUs)

Rectified Linear Units (ReLUs)

import numpy as np

def relu(x):
    return np.maximum(0, x)  # elementwise max(0, x)

The subgradient at x = 0 is not a big problem in practice.
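
A minimal sketch of the usual convention, assuming we simply pick 0 as the derivative at x = 0 (the function name is illustrative, not from the notes):

import numpy as np

def relu_subgradient(x):
    # 1 where x > 0, 0 where x < 0; at x == 0 any value in [0, 1]
    # is a valid subgradient, and picking 0 works fine in practice.
    return np.where(x > 0, 1.0, 0.0)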

  • With ReLUs, exploding gradients can be a problem in deep neural networks.
  • Although ReLU is mostly linear, in practice it gives enough non-linearity.
  • Most neural networks today use the ReLU function in their hidden layers.
  • ReLUs can get stuck too, just like sigmoid units, but it is rare in practice.

Output Units: chosen to fit the application, unlike the hidden layers.

  • Regression: linear output (the last layer acts like linear regression)
  • Classification: sigmoid (2 classes) | softmax (multiple classes)

Output Units

Regression

The activation function is the identity function, and the output is usually trained with squared error loss, so it is equivalent to doing least-squares linear regression on the learned features (the hidden units). Loss function: squared error loss.
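
As a rough sketch, assuming a single real-valued target and illustrative names h, w, b (not from the notes), the regression output unit and its loss look like this:

import numpy as np

def regression_output(h, w, b):
    # Identity activation: the prediction is just the linear pre-activation.
    return h @ w + b

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

h = np.array([0.5, -1.2, 3.0])   # learned hidden-unit values
w = np.array([0.1, 0.4, -0.2])   # output-layer weights
print(squared_error(2.0, regression_output(h, w, b=0.3)))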

Binary Classification (Sigmoid)

Given the vector h of unit values in the last hidden layer, the output layer computes the pre-activation value z = w·h + b, and then applies the sigmoid activation function to obtain the prediction ŷ = s(z) = 1/(1 + e^(−z)).

Loss function: logistic loss | Fixes the vanishing gradients at the output (because the shape of the logistic loss compensates for the flat parts of the sigmoid).
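
A minimal sketch of this output unit, assuming illustrative names w, b, h (not from the notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_prediction(h, w, b):
    z = np.dot(w, h) + b      # pre-activation value
    return sigmoid(z)         # prediction y_hat in (0, 1)

def logistic_loss(y, y_hat):
    # -y ln(y_hat) - (1 - y) ln(1 - y_hat)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))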

K-class classification (Softmax)

  • Let y = (y_1, …, y_K) be a vector of labels for training point x.
  • Choose the training labels so that y_i = 1 if x belongs to class i and y_i = 0 otherwise (one-hot encoding).
  • Given the hidden layer h, the output layer computes the pre-activation vector z = Wh + b and applies the softmax activation to obtain the prediction ŷ, where ŷ_i = exp(z_i) / Σ_j exp(z_j).

Loss function: cross-entropy loss | Fixes the vanishing gradient problem at the output.

This differs from the prediction produced by a sigmoid in that the ŷ_i are dependent on each other: the softmax outputs share a common normalizer and sum to 1, whereas a sigmoid output depends only on its own pre-activation value.
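
A minimal sketch of the softmax output and its loss, assuming a one-hot label vector y (names and values are illustrative); the shared normalizer is what couples the ŷ_i together:

import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the shared
    # normalizer makes the outputs depend on each other.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, y_hat):
    # y is a one-hot label vector.
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, -1.0, 0.5])
y = np.array([1.0, 0.0, 0.0])
print(cross_entropy(y, softmax(z)))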


Output + loss pairings:
  • linear + squared error: ŷ = w·h + b, L(y, ŷ) = (y − ŷ)²
  • sigmoid + logistic loss: ŷ = s(w·h + b), L(y, ŷ) = −y ln ŷ − (1 − y) ln(1 − ŷ)
  • softmax + cross-entropy loss: ŷ_i = exp(z_i) / Σ_j exp(z_j), L(y, ŷ) = −Σ_i y_i ln ŷ_i
assuming one-hot labels y for the softmax case.
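
A quick sanity check of why these pairings avoid vanishing gradients at the output: for the sigmoid + logistic-loss pairing, the derivative of the loss with respect to the pre-activation simplifies to ŷ − y, which stays large when the unit saturates on the wrong side (unlike squared error with a sigmoid, where the factor s′(z) vanishes). A minimal finite-difference sketch, with illustrative values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y, z):
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, eps = 2.0, 1.0, 1e-6
analytic = sigmoid(z) - y                                              # y_hat - y
numeric = (logistic_loss(y, z + eps) - logistic_loss(y, z - eps)) / (2 * eps)
print(analytic, numeric)   # both approximately -0.1192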

Backpropagation