Training a network is an iterative process in which we want to minimize a *loss function J(θ)* over the trainable parameters *θ*, and the appropriate method that we use for this is *Gradient Descent*:

*θ = θ − η∇J(θ)*

where *η* is the learning rate.

**Stochastic Gradient Descent**

For normal Gradient Descent we need to process the whole dataset for every update, which is very inefficient and expensive. The solution to this problem is to choose the next example at random and use it to update the trainable parameters. In network training we take a random batch of samples at each iteration and then update *θ*. This is *Stochastic Gradient Descent*, which is widely used for training networks:

*θ = θ − η∇J(θ; x⁽ⁱ⁾, y⁽ⁱ⁾)*

where *(x⁽ⁱ⁾, y⁽ⁱ⁾)* is a randomly chosen sample (or batch).
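As an illustration, here is a minimal SGD loop on a toy least-squares problem (the data and the learning rate here are my own assumptions, not from the post):

```python
import numpy as np

# Toy least-squares problem (assumed for illustration):
# J(theta) = mean over samples of 0.5 * (x_i . theta - y_i)^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                       # noiseless targets

theta = np.zeros(3)
eta = 0.1                                # handcrafted learning rate
for _ in range(2000):
    i = rng.integers(len(X))             # one random sample -> "stochastic"
    grad = (X[i] @ theta - y[i]) * X[i]  # gradient of the per-sample loss
    theta -= eta * grad                  # theta = theta - eta * grad

print(np.round(theta, 3))                # close to true_theta
```

Each iteration touches a single sample instead of the whole dataset, which is exactly the efficiency win described above.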

#### Drawbacks

• We need to find the best learning rate *η* by hand; there is no adaptive way to set it.

• If there is a saddle point surrounded by a plateau, it is very difficult for SGD to escape, since the gradients are close to zero in such regions.

### Adaptive Stochastic Gradient Descent Methods

### Momentum

*vₜ = γvₜ₋₁ + η∇J(θ)*

*θ = θ − vₜ*

where *γ* = 0.9.

Momentum proposes using a velocity *vₜ* to find a good optimum: while we move down toward the optimum (the gradients keep pointing the same way), Momentum increases the velocity so we get there faster. Momentum also works better on plateaus: the accumulated velocity doesn't let us get stuck there.

Momentum uses an *Exponential Moving Average*, which is an approximate summary of the gradient history. This saves memory and avoids recomputation over past values of *θ*.
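A minimal sketch of the momentum update on a toy 1-D function *f(θ) = θ²* (my example, with the post's *γ* = 0.9 and an assumed *η* = 0.1):

```python
# Momentum on f(theta) = theta**2, so the gradient is 2*theta (toy example)
grad = lambda theta: 2.0 * theta

theta, v = 5.0, 0.0
eta, gamma = 0.1, 0.9                  # gamma = 0.9 as suggested in the post
for _ in range(300):
    v = gamma * v + eta * grad(theta)  # velocity: EMA-like sum of past gradients
    theta = theta - v                  # step with the velocity, not the raw gradient

print(abs(theta))                      # near the optimum at 0
```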

#### Applications

Momentum works very well with the *ResNet* architecture on image classification. ResNet is a very deep network, and many researchers say that *Adam* is the best optimizer for it, but my practical experience showed that Momentum is the best for training ResNet.

#### Drawbacks

The learning rate *η* is still a handcrafted hyper-parameter.

### Nesterov Accelerated Gradient

*vₜ = γvₜ₋₁ + η∇J(θ − γvₜ₋₁)*

*θ = θ − vₜ*

where *γ* = 0.9.

NAG uses the same approach as Momentum, but with one modification: NAG approximates the next position first and then decides how large the step should be. It's a lookahead approach that allows it to be more careful around the optimum: NAG calculates the gradient as if it already knew the new *θₜ₊₁*.
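The lookahead can be sketched on the same toy function *f(θ) = θ²* (again my example, not the author's):

```python
# Nesterov Accelerated Gradient on f(theta) = theta**2 (toy example)
grad = lambda theta: 2.0 * theta

theta, v = 5.0, 0.0
eta, gamma = 0.1, 0.9
for _ in range(300):
    lookahead = theta - gamma * v          # approximate the next position first...
    v = gamma * v + eta * grad(lookahead)  # ...then evaluate the gradient there
    theta = theta - v

print(abs(theta))
```

The only change from Momentum is where the gradient is evaluated: at the lookahead point rather than at the current *θ*.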

#### Applications

NAG works well for *RNNs*, where it noticeably improves training.

#### Drawbacks

The learning rate *η* is still a handcrafted hyper-parameter.

### Adagrad

Adagrad (Adaptive Gradient) adapts the learning rate for each parameter separately. Many gradient descent methods use one learning rate for all trainable variables (features); Adagrad instead considers every parameter individually. If a parameter's features occur rarely, we make a bigger step for it, i.e. it gets a bigger learning rate; if its features occur often, we make a smaller step, i.e. it gets a smaller learning rate.

*θₜ₊₁ = θₜ − (η / √(Gₜ + ε)) · gₜ*

where *Gₜ* is the element-wise sum of squared gradients over the whole history and *ε* = 10⁻⁸.

The denominator is an approximation of the Hessian matrix. The Hessian is the matrix of second-order partial derivatives, and it is very expensive to compute at every step. That's why Adagrad works with an approximation: we only have to store the sum of squared gradients across the whole history. As you can see, if a parameter's accumulated gradient is close to zero we take a big step for it, otherwise a smaller one.
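A minimal sketch of the per-parameter scaling on a toy quadratic (my example; note how each coordinate divides by its own accumulated history):

```python
import numpy as np

# Adagrad on f(theta) = theta[0]**2 + theta[1]**2 (toy example);
# every coordinate gets its own effective learning rate eta / sqrt(G + eps)
theta = np.array([5.0, 5.0])
G = np.zeros(2)                          # per-parameter sum of squared gradients
eta, eps = 1.0, 1e-8
for _ in range(1000):
    g = 2.0 * theta
    G += g ** 2                          # accumulate over the whole history
    theta -= eta / np.sqrt(G + eps) * g  # element-wise adaptive step

print(np.abs(theta))
```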

#### Drawbacks

During training we keep accumulating squared gradients, which leads to a large number in the denominator; the learning rate becomes too small and training effectively stops.

#### Applications

It suits sparse data; *GloVe* used this optimizer for training.

***From here on, all formulas are for vectors.***

### RMSProp

RMSProp (Root Mean Square Propagation) was created by the famous *Geoffrey Hinton*, but never published as a paper. The main goal of RMSProp is to fix the drawback of Adagrad: Hinton uses an exponential moving average instead of the sum of the squared gradients.

*E[g²]ₜ = γE[g²]ₜ₋₁ + (1 − γ)gₜ²*

*θₜ₊₁ = θₜ − (η / √(E[g²]ₜ + ε)) · gₜ*

Hinton suggests using *γ* = 0.9 and *η* = 0.001.
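The swap from Adagrad's sum to an EMA can be sketched like this (toy 1-D quadratic of my own, with Hinton's suggested constants):

```python
import numpy as np

# RMSProp on f(theta) = theta**2 (toy example)
theta = 5.0
Eg2 = 0.0                                     # EMA of squared gradients
eta, gamma, eps = 0.001, 0.9, 1e-8            # Hinton's suggested values
for _ in range(10000):
    g = 2.0 * theta
    Eg2 = gamma * Eg2 + (1.0 - gamma) * g**2  # EMA instead of Adagrad's sum
    theta -= eta / (np.sqrt(Eg2) + eps) * g   # denominator no longer grows forever

print(abs(theta))
```

Because the EMA forgets old gradients, the denominator stops growing once gradients stabilize, so the effective step does not vanish the way Adagrad's does.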

#### Drawbacks

There are no obvious drawbacks, except that the learning rate is still handcrafted: the suggested value is not appropriate for every task.

#### Applications

Good for training CNNs and deep networks; *MobileNets* were trained with RMSProp.

### Adadelta

Adadelta was developed at the same time as RMSProp, but was published as a paper. Adadelta has the same goal as RMSProp: to fix the accumulation of squared gradients in the denominator.

1. Adadelta accumulates gradients over a window, but maintaining a sliding window during training can be inefficient. Instead, the method uses an exponential moving average, as RMSProp does:

*E[g²]ₜ = γE[g²]ₜ₋₁ + (1 − γ)gₜ²*

Adadelta suggests using *γ* = 0.9.

2. Adadelta also thinks about the units of the update. SGD and Momentum suffer from a units mismatch: their update is expressed in gradient units, not in the units of *θ*. Adadelta proposes keeping an exponential moving average of the squared updates *∆θₜ* as well, and putting its root in the numerator:

*E[∆θ²]ₜ = γE[∆θ²]ₜ₋₁ + (1 − γ)∆θₜ²*

Now the gradient units in the numerator and the denominator cancel, so *∆θₜ* gets the same units as *θ*.

The final update rule for the parameters:

*∆θₜ = −(RMS[∆θ]ₜ₋₁ / RMS[g]ₜ) · gₜ*

*θₜ₊₁ = θₜ + ∆θₜ*

where

*RMS[x]ₜ = √(E[x²]ₜ + ε)*

for a small constant *ε*.

As we can see there is no *η*, so we don't need to select a learning rate; the method sets it adaptively. In the update rule *RMS[∆θ]* is taken at step *t − 1*: on the first step *t* = 1 we don't know *RMS[∆θ]ₜ₋₁*, and Adadelta proposes setting it to 1. The fraction *RMS[∆θ]ₜ₋₁ / RMS[g]ₜ* is also robust to large sudden gradients, because the denominator increases first, before the numerator can react.
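The whole update can be sketched as follows (toy 1-D quadratic of my own; the *ε* value is an assumption, since the method only specifies that it should be small):

```python
import numpy as np

# Adadelta on f(theta) = theta**2 (toy example) -- note: no learning rate eta
theta = 5.0
Eg2, Edx2 = 0.0, 0.0           # EMAs of squared gradients and squared updates
gamma, eps = 0.9, 1e-6         # eps is an assumed small constant
for _ in range(3000):
    g = 2.0 * theta
    Eg2 = gamma * Eg2 + (1.0 - gamma) * g**2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g  # RMS ratio sets the step
    Edx2 = gamma * Edx2 + (1.0 - gamma) * dx**2         # updated after the step
    theta += dx

print(abs(theta))
```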

#### Drawbacks

There are also no obvious drawbacks, but it is not widely used for deep neural networks.

#### Applications

Good for *NLP*.

### Adam

Adam (Adaptive Moment Estimation) works with the first- and second-order moments of the gradient. The intuition behind Adam is that we don't want to roll so fast that we jump over the minimum; we want to decrease the velocity a little bit for a more careful search.

*mₜ = β₁mₜ₋₁ + (1 − β₁)gₜ*

*vₜ = β₂vₜ₋₁ + (1 − β₂)gₜ²*

*m̂ₜ = mₜ / (1 − β₁ᵗ),  v̂ₜ = vₜ / (1 − β₂ᵗ)*

*θₜ₊₁ = θₜ − (η / (√v̂ₜ + ε)) · m̂ₜ*

where *β₁* = 0.9, *β₂* = 0.999, *ε* = 10⁻⁸.
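A minimal sketch of these update rules on a toy 1-D quadratic (my example; *η* = 0.01 is an assumed value for this toy, not a recommendation):

```python
import numpy as np

# Adam on f(theta) = theta**2 (toy example) with the default betas
theta = 5.0
m, v = 0.0, 0.0
eta, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 3001):                 # t starts at 1 for bias correction
    g = 2.0 * theta
    m = b1 * m + (1.0 - b1) * g          # first moment: mean of gradients
    v = b2 * v + (1.0 - b2) * g**2       # second moment: uncentered variance
    m_hat = m / (1.0 - b1**t)            # bias-corrected estimates
    v_hat = v / (1.0 - b2**t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(abs(theta))
```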

Finally, the formulas for the mean and the uncentered variance are bias-corrected. There are a few reasons to do this:

- Without bias correction, the initial steps can be much larger, because *m* and *v* are initialized at zero.
- The authors of the Adam optimizer wanted to know how the expected value of the exponential moving average at timestep *t* relates to the true second moment *E[gₜ²]*:

*E[vₜ] = E[gₜ²] · (1 − β₂ᵗ) + ξ*

where

*vₜ = (1 − β₂) Σᵢ₌₁ᵗ β₂ᵗ⁻ⁱ gᵢ²*

You can get this formula by recursively substituting the previous value of *v* at step *t − 1*. Let's take the expected value of *v* at step *t* to find the relation:

*E[vₜ] = (1 − β₂) Σᵢ₌₁ᵗ β₂ᵗ⁻ⁱ E[gᵢ²]*

Now we have to consider two cases. 1) If our training process is a *stationary stochastic process* (the true second moment is stationary), the mean and variance are constant during the whole process, so *E[gᵢ²] = E[g²]* is the same at every step:

*E[vₜ] = E[g²] · (1 − β₂) Σᵢ₌₁ᵗ β₂ᵗ⁻ⁱ*

The sum of *β₂ᵗ⁻ⁱ* is a geometric progression, so we can rewrite the formula as

*E[vₜ] = E[g²] · (1 − β₂ᵗ)*

2) In the case of a *non-stationary stochastic process* we don't get a completely different value; it differs only by a small term *ξ*:

*E[vₜ] = E[gₜ²] · (1 − β₂ᵗ) + ξ*

That's why we have this bias correction: dividing *vₜ* by *(1 − β₂ᵗ)* removes the *(1 − β₂ᵗ)* factor.
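The stationary case can be checked numerically (constant gradient of my own choosing): after *t* steps the EMA equals *g²(1 − β₂ᵗ)*, so dividing by *(1 − β₂ᵗ)* recovers the true second moment:

```python
# Bias correction check: EMA of g**2 with a constant gradient g (toy numbers)
b2, g = 0.999, 3.0
v = 0.0
for t in range(1, 11):                  # 10 EMA steps
    v = b2 * v + (1.0 - b2) * g**2

bias = 1.0 - b2**10
print(v, g**2 * bias, v / bias)         # raw EMA, predicted value, corrected value
```

The raw EMA is heavily shrunk toward zero after only 10 steps, while the corrected value equals *g²* = 9.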

In the paper the authors also note the fairly obvious fact that the effective step size depends on the learning rate. For some models we know good optima in advance and can choose an appropriate learning rate, because with the right learning rate Adam converges very fast.

#### Drawbacks

Learning rate is handcrafted.

#### Applications

Very good for deep CNNs. I also trained a ResNet-50 in which all convolutions were pre-trained and only the last fully connected layer was trained, and Adam showed a very good result: fast convergence and a rapid rise to high accuracy. All other optimizers showed worse results. The learning rate was 3e-4, as Andrej Karpathy suggested.

Source: Deep Learning on Medium