0. Introduction
- Optimizer : updates the weight parameters to minimize the loss function.
1. Gradient Descent
- Introduction : calculate the gradient $\frac{\partial c}{\partial w} $ of cost function for weight w, and update the weights using calcuated gradient
$$ \theta_{new} = \theta - \eta \bigtriangledown J (\theta;x,y)$$
-
Batch Gradient Descent : use the entire dataset to compute gradient of the cost function for each iteration
-
Stochastic Gradient Descent : use a single datapoint or example to calculate the gradient and update the weights with every iteration
-
Mini Batch Gradient Descent : instead of single training example, mini-batch of sample is used.
2. Momentum
-
Introduction : For updating the weights it takes the gradient of the current step as well as the gradient of the previous time steps.
-
$\eta$ is learning rate, $\gamma$ is coefficient for momentum
$$ v_t = \gamma v_{t-1} + \eta \bigtriangledown J(\theta;x,y) \ \theta = \theta - v_t$$
3. AdaGrad
-
Introduction : We need to tune the lr in momentum. Adagrad is an adaptive learning rate method that perform larger updates for infrequent parameters and smaller updates for frequent parameters.
-
Parameters that have changed a lot so far are assumed to be near their optimum, so they get smaller step sizes; parameters that haven’t changed much get larger step sizes.
-
For $k$ parameters being trained, $G_t$ is a $k$-dimensional vector holding the sum of squared gradients accumulated per parameter up to time step $t$. When updating $\theta$, the learning rate $\eta$ is scaled inversely with $G_t$.
$$ G_t = G_{t-1} + (\bigtriangledown_\theta J(\theta))^2 \ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \bigtriangledown_\theta J(\theta) $$
4. RMSProp
-
With AdaGrad, $G_t$ grows monotonically, so the effective learning rate shrinks until training effectively stalls.
-
RMSProp fixes this by replacing the running sum of squared gradients with an exponentially weighted moving average.
-
This keeps $G_t$ from growing unboundedly while preserving a per-parameter learning-rate difference.
$$ G = \gamma G + (1-\gamma)\bigtriangledown_\theta J(\theta_t))^2 \ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \bigtriangledown_\theta J(\theta)$$
5. Adam
- Adam can be thought of as a combination of RMSProp and Momentum. Like Momentum, it stores an exponential moving average of past gradients ($m_t$); like RMSProp, it also stores an exponential moving average of squared gradients ($v_t$).
$$ m_t = \beta_1 m_{t-1} + (1- \beta_1)\bigtriangledown_\theta J(\theta_t) \ v_t = \beta_2 v_{t-1} + (1- \beta_2)(\bigtriangledown_\theta J(\theta_t))^2$$
-
However, since $m$ and $v$ are initialized to zero, $m_t$ and $v_t$ are biased toward zero early in training, so a bias correction is applied. [details in paper]
-
Expanding $m_t, v_t$ as $\Sigma$ sums and taking the expectation on both sides yields an unbiased expectation; the corrected values are then used for the update.
$$\hat{m_t} = \frac{m_t}{1-\beta^t_1} \ \hat{v_t} = \frac{v_t}{1-\beta^t_2} \ \theta = \theta - \frac{\eta}{\sqrt{\hat{v_t} + \epsilon}}\hat{m_t}$$
-
Typical defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
-
Why use learning-rate decay on top of Adam, which already adapts its learning rate? Because the decayed rate acts as an upper bound on the per-parameter step.
-
The learning rates adapt themselves during train steps, it’s true, but if you want to be sure that every update step do not exceed lambda you can than lower lambda using exponential decay or whatever. It can help to reduce loss during the latest step of training, when the computed loss with the previously associated lambda parameter has stopped to decrease.