Introduction
- Activation Function : a function that induces non-linearity in the output of a neuron for a given input.
- Non-Linearity : functions that do not satisfy the following linearity conditions.
  - $ f(x+y) = f(x)+f(y), \quad f(\alpha x) = \alpha f(x)$
- Why Non-Linearity? : because composites of linear functions are linear again.
  - If the output $f(x)$ of a neuron has the linear form $wx + b$, then even as the depth of the network grows, the model is still just another linear function, $f(f(\dots f(x))) = w'x + b'$ (see the short sketch after this list).
- Role of bias : similar to the constant $b$ of a linear function $y = ax + b$. It allows you to shift the line up and down to fit the predictions to the data better.
  - Without $b$, the line always passes through the origin $(0, 0)$ and you may get a poorer fit.
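A minimal numerical sketch of the "composites of linear functions are linear" point (the weights `w1, b1, w2, b2` and the inputs are arbitrary illustrative values; only basic PyTorch tensor ops are used):

```python
import torch

# Two linear layers with no activation in between collapse into one linear map:
# w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5
x = torch.linspace(-2, 2, 5)

stacked = w2 * (w1 * x + b1) + b2          # "two-layer" network without activation
single  = (w2 * w1) * x + (w2 * b1 + b2)   # equivalent single linear layer

print(torch.allclose(stacked, single))     # True: extra depth added no expressiveness
```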
1. Sigmoid / Logistic

- Advantages :
  - Smooth gradient : prevents "jumps" in output values.
  - Clear predictions : for $x$ above 2 or below -2, the output tends toward the edge of the curve, very close to 0 or 1.
- Disadvantages :
  - Vanishing gradient : for very high or low values of $x$, there is almost no change in the prediction, causing a vanishing gradient problem. This can make the network too slow to reach an accurate prediction.
  - Outputs are not zero-centered.
  - Computationally expensive.

$$ \sigma(x) = \frac{1}{1+e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr) $$
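A small sketch of the vanishing gradient, using PyTorch's built-in `torch.sigmoid` (the sample inputs are arbitrary):

```python
import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
s = torch.sigmoid(x)
grad = s * (1 - s)   # sigma'(x) = sigma(x) * (1 - sigma(x))

print(s)     # outputs squashed into (0, 1), near the edges for |x| > 2
print(grad)  # ~4.5e-5 at |x| = 10: the gradient vanishes for large |x|
```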
2. Tanh

- Properties :
  - Zero-centered : makes it easier to model inputs that are strongly negative, neutral, or strongly positive.
  - Otherwise the same as the sigmoid function, including the vanishing gradient problem.

$$ \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \qquad \tanh'(x) = 1 - \tanh^2(x) $$
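For comparison, a quick sketch with PyTorch's built-in `torch.tanh` (arbitrary inputs), showing the zero-centered outputs and the still-saturating gradient:

```python
import torch

x = torch.tensor([-5.0, -0.5, 0.0, 0.5, 5.0])
t = torch.tanh(x)
grad = 1 - t ** 2    # tanh'(x) = 1 - tanh(x)^2

print(t)     # zero-centered outputs in (-1, 1)
print(grad)  # still saturates: gradient ~1.8e-4 at |x| = 5
```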
3. ReLU (Rectified Linear Unit)

- Advantages :
  - Computationally efficient : allows the network to converge very quickly.
  - Non-linearity : although it looks like a linear function, ReLU has a derivative function and allows for backpropagation.
- Disadvantage :
  - Dying ReLU : when inputs are negative (or sit at zero), the gradient of the function is zero, so those units receive no gradient during backpropagation and stop learning.

$$ \text{ReLU}(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases} $$
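A minimal autograd sketch of the dying-ReLU effect (arbitrary inputs; `torch.relu` is PyTorch's built-in):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0], requires_grad=True)
y = torch.relu(x)
y.sum().backward()

print(y)       # negative inputs are clipped to 0
print(x.grad)  # gradient is 0 wherever x <= 0: those units get no update
```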
4. Leaky ReLU
- Properties :
  - Alleviates dying ReLU : to resolve the dying ReLU problem, Leaky ReLU produces non-zero outputs for negative inputs, so the gradient does not become zero.

$$ \text{Leaky ReLU}(x) = \begin{cases} x & (x > 0) \\ 0.01x & (x \le 0) \end{cases} $$
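The same autograd check with `torch.nn.functional.leaky_relu` (slope 0.01 on the negative side, matching the formula above; the inputs are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 1.0, 3.0], requires_grad=True)
y = F.leaky_relu(x, negative_slope=0.01)
y.sum().backward()

print(y)       # negative inputs are scaled by 0.01 instead of clipped to 0
print(x.grad)  # gradient is 0.01 (not 0) for x < 0, so the unit keeps learning
```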
5. ELU (Exponential Linear Units)
- Properties :
  - Alleviates dying ReLU : like Leaky ReLU, ELU produces non-zero outputs for negative inputs, so the gradient does not become zero.
  - More computation : ELU uses an exponential that ReLU does not, so it costs a bit more than ReLU.

$$ \text{ELU}(x) = \begin{cases} x & (x > 0) \\ \alpha(e^x - 1) & (x \le 0) \end{cases} $$
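A quick sketch with `torch.nn.functional.elu` (default `alpha=1.0`; arbitrary inputs) showing the smooth, non-zero negative side:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 0.5, 2.0], requires_grad=True)
y = F.elu(x, alpha=1.0)   # alpha * (exp(x) - 1) for x <= 0, x otherwise
y.sum().backward()

print(y)       # negative side saturates smoothly toward -alpha
print(x.grad)  # gradient alpha * exp(x) stays positive for x < 0
```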
6. Swish (Sigmoid Linear Unit, SiLU)

- Properties :
  - Unbounded above : unlike the sigmoid and tanh functions, Swish is unbounded above, which avoids saturation; saturation slows training because the gradient approaches zero.
  - Smoothness : the smoothness of the curve plays an important role in generalization and optimization. Unlike ReLU, Swish is a smooth function, which makes it less sensitive to weight initialization and to the learning rate.
  - Bounded below : like most activation functions, Swish is bounded below, which helps provide a strong regularization effect. Unlike ReLU and softplus, Swish produces negative outputs for small negative inputs due to its non-monotonicity, which increases expressivity and improves gradient flow; this matters because many pre-activations fall into this range.

$$ \text{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1+e^{-x}} $$
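A minimal sketch of Swish in the β = 1 case, which matches PyTorch's built-in `torch.nn.functional.silu` (the sample inputs are arbitrary):

```python
import torch
import torch.nn.functional as F

def swish(x):
    # Swish(x) = x * sigmoid(x); for beta = 1 this equals F.silu(x)
    return x * torch.sigmoid(x)

x = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0])
print(swish(x))                             # small negative outputs for small negative x
print(torch.allclose(swish(x), F.silu(x)))  # True
```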
7. SELU (Scaled Exponential Linear Units)

- Properties :
  - Similar to ReLU, SELU enables deep neural networks, since there is no problem with vanishing gradients.
  - In contrast to ReLU, SELU units cannot die.
  - Networks using SELU on their own learn faster and better than those using other activation functions, even when the latter are combined with batch normalization.

$$ \text{SELU}(x) = \lambda \begin{cases} x & (x > 0) \\ \alpha(e^x - 1) & (x \le 0) \end{cases} $$
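A sketch of the definition above, using the fixed constants α ≈ 1.6733 and λ ≈ 1.0507 reported in the SELU paper (PyTorch's built-in `torch.selu` uses the same values; the sample inputs are arbitrary):

```python
import torch

ALPHA = 1.6732632423543772   # alpha from the SELU paper (approximate)
LAMBDA = 1.0507009873554805  # lambda (scale) from the SELU paper (approximate)

def selu(x):
    return LAMBDA * torch.where(x > 0, x, ALPHA * (torch.exp(x) - 1))

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
print(selu(x))
print(torch.allclose(selu(x), torch.selu(x)))  # True: matches the built-in
```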
8. Soft-argmax

- To get the position where the intensity is maximal in a vector, we usually use an argmax function. The problem is that argmax has no derivative.

$$ \text{Softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} $$

- Using Softmax, we get a normalized probability for each $x_i$, and the expected index is the sum of the indices weighted by their respective probabilities.

$$ \mathbb{E}[i] = \sum_i \frac{e^{x_i}}{\sum_j e^{x_j}} \, i $$

- However, this mean value is a weak estimate when there are multiple modes. To raise the maximum and suppress the others, we can multiply $x$ by an arbitrarily large $\beta$.

$$ \mathbb{E}[i] = \sum_i \frac{e^{\beta x_i}}{\sum_j e^{\beta x_j}} \, i $$
```python
import torch

def soft_arg_max_khw(A, beta=1.0, dim=1):
    # Softmax over dim (beta sharpens the distribution), then the expected index
    A_softmax = torch.softmax(beta * A, dim=dim)
    indices = torch.arange(A.size(dim), dtype=A.dtype, device=A.device)
    return torch.matmul(A_softmax, indices)  # sum_i p_i * i, differentiable
```
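For example, continuing from the snippet above (a hypothetical 1×5 score vector; the result is a fractional, differentiable index near the true argmax, and it moves closer as `beta` grows):

```python
A = torch.tensor([[1.0, 2.0, 8.0, 4.0, 1.0]])
print(soft_arg_max_khw(A))             # ~2.02, close to the argmax position 2
print(soft_arg_max_khw(A, beta=10.0))  # sharper softmax -> ~2.00
```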