Training a network is an iterative process in which we want to minimize a loss function:

J(θ)

where θ denotes the trainable variables.

The standard method for this is gradient descent, with the update rule:

θ = θ − η · ∇J(θ)

where η is the learning rate.
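As a minimal sketch of this update rule in plain Python: the quadratic loss, the function name `gradient_descent`, and the step count are illustrative assumptions, not part of the original post.

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Repeat theta <- theta - lr * grad(theta) for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


# Hypothetical loss J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_min)  # very close to 3.0, the minimizer of the toy loss
```

Every step here uses the exact gradient of the full loss, which is what becomes expensive on a large dataset.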

Stochastic Gradient Descent

Plain gradient descent has to process the whole dataset for every update, which is inefficient and expensive. The solution is to pick the next example at random and use it to update the trainable parameters. In network training we take a random batch of samples at each iteration and then update θ. This is Stochastic Gradient Descent, which is widely used for training networks:

θ = θ − η · ∇J(θ; x(i:i+n), y(i:i+n))

where x(i:i+n), y(i:i+n) is a mini-batch of n samples starting at index i.
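Here is a small sketch of mini-batch SGD in plain Python. It assumes a toy one-parameter model y ≈ w · x with a mean squared error loss; the data, the name `sgd_linear`, and the hyperparameters are made up for illustration.

```python
import random


def sgd_linear(xs, ys, lr=0.05, batch_size=2, epochs=200, seed=0):
    """Fit y ~ w * x with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # random choice of the next samples
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w * x - y)^2), computed on this batch only.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with w = 2
w_hat = sgd_linear(xs, ys)
print(w_hat)  # close to 2.0
```

Each update only touches `batch_size` samples, so its cost does not grow with the size of the dataset.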


SGD has two main drawbacks:

• The best learning rate η has to be found by hand; there is no adaptive mechanism.
• If a saddle point is surrounded by a plateau, it is very difficult for SGD to escape, since the gradients are close to zero in such places.

Adaptive Stochastic Gradient Descent Methods


Momentum updates the parameters through a velocity V_t:

V_t = γ · V_{t−1} + η · ∇J(θ)
θ = θ − V_t

where γ = 0.9.

Momentum uses the velocity V_t to find a good optimum: as long as we keep moving downhill in the same direction, the velocity grows, so we reach the optimum faster. Momentum also copes better with plateaus, because the accumulated velocity keeps the parameters from getting stuck there.

The velocity is an exponentially decaying average of past gradients, which approximates the gradient history. This saves memory and avoids recomputing anything for past steps.
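A minimal sketch of one Momentum step in plain Python, assuming the common form in which the velocity is a decayed sum of past gradient steps. The toy loss J(θ) = θ² and the function name `momentum_step` are illustrative assumptions.

```python
def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    """One update: V_t = gamma * V_{t-1} + lr * grad, then theta <- theta - V_t."""
    velocity = gamma * velocity + lr * grad
    return theta - velocity, velocity


# Hypothetical loss J(theta) = theta^2 with gradient 2 * theta.
theta, velocity = 5.0, 0.0
for _ in range(300):
    theta, velocity = momentum_step(theta, velocity, grad=2.0 * theta)
print(theta)  # close to 0.0, the minimum of the toy loss
```

Only the single number `velocity` is stored between steps, which is the memory saving mentioned above.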


Momentum works very well for the ResNet architecture on image classification problems. ResNet is a very deep network, and many researchers say that Adam is the best choice, but in my practical experience Momentum gave the best results for training ResNet.


Drawback: the learning rate η is still a hand-tuned hyperparameter.

Nesterov Accelerated Gradient

V_t = γ · V_{t−1} + η · ∇J(θ − γ · V_{t−1})
θ = θ − V_t

where γ = 0.9.

NAG uses the same approach as Momentum with one modification: it first approximates the next position and only then decides what the step should be. This lookahead makes it more careful around the optimum. NAG evaluates the gradient as if it already knew the new θ_{t+1}.
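The lookahead can be sketched in a few lines of plain Python. This assumes the common formulation in which the gradient is evaluated at θ − γ · V; the toy loss J(θ) = θ² and the name `nag_step` are illustrative.

```python
def nag_step(theta, velocity, grad_fn, lr=0.01, gamma=0.9):
    """One NAG update: take the gradient at the lookahead point, not at theta."""
    lookahead = theta - gamma * velocity  # rough estimate of the next theta
    velocity = gamma * velocity + lr * grad_fn(lookahead)
    return theta - velocity, velocity


# Hypothetical loss J(theta) = theta^2 with gradient 2 * theta.
theta, velocity = 5.0, 0.0
for _ in range(300):
    theta, velocity = nag_step(theta, velocity, grad_fn=lambda t: 2.0 * t)
print(theta)  # close to 0.0
```

The only difference from the Momentum step is where the gradient is evaluated, so `nag_step` takes a gradient function instead of a gradient value.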

Rough explanation of the difference between NAG and Momentum.


NAG works well for RNNs, where it noticeably improves training.


Drawback: the learning rate η is still a hand-tuned hyperparameter.


Adagrad (Adaptive Gradient) adapts the learning rate for each parameter separately. Many gradient descent methods use a single learning rate for all trainable variables (features); Adagrad treats every parameter individually. Parameters whose features occur rarely get a bigger step, i.e. a bigger learning rate, while parameters whose features occur often get a smaller step, i.e. a smaller learning rate.

θ_{t+1,i} = θ_{t,i} − η / √(G_{t,ii} + ε) · g_{t,i}

where g_{t,i} is the gradient of the loss with respect to parameter i at step t, G_{t,ii} is the sum of the squares of that parameter's past gradients, and ε = 10⁻⁸.

The denominator can be seen as a rough approximation of the Hessian matrix. The Hessian is the matrix of second-order partial derivatives, and computing it at every step is very expensive. That is why Adagrad works with an approximation: we only have to store the sum of squared gradients over the whole history. As you can see, if a parameter's accumulated gradients are close to zero we take a big step, otherwise a smaller one.
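A small plain-Python sketch of the per-parameter behaviour. The gradient sequence is made up: parameter 0 receives a gradient at every step, parameter 1 only once. The name `adagrad_step` and the values of `lr` and `eps` are assumptions for illustration.

```python
import math


def adagrad_step(theta, grad_sq_sum, grad, lr=0.1, eps=1e-8):
    """One Adagrad update; each parameter keeps its own sum of squared gradients."""
    new_theta, new_sum = [], []
    for p, s, g in zip(theta, grad_sq_sum, grad):
        s = s + g * g  # accumulate over the whole history
        new_sum.append(s)
        new_theta.append(p - lr * g / math.sqrt(s + eps))
    return new_theta, new_sum


theta, acc = [1.0, 1.0], [0.0, 0.0]
for g in ([1.0, 0.0], [1.0, 0.0], [1.0, 1.0]):
    theta, acc = adagrad_step(theta, acc, g)
print(theta, acc)
```

On the third step both parameters see the same gradient, but the frequently updated parameter 0 moves by about 0.058 while the rarely updated parameter 1 moves by about 0.1.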


Drawback: during training the squared gradients keep accumulating, which leads to a large number in the denominator; the learning rate becomes too small and training stops.


Adagrad suits sparse data; GloVe was trained with this optimizer.

**From here on, all formulas are written for vectors.**


RMSProp (Root Mean Square Propagation) was created by Geoffrey Hinton, but never published as a paper. Its main goal is to fix the drawback of Adagrad: Hinton uses an exponential moving average of the squared gradients instead of their sum.

E[g²]_t = γ · E[g²]_{t−1} + (1 − γ) · g_t²
θ_{t+1} = θ_t − η / √(E[g²]_t + ε) · g_t

Hinton suggests using γ = 0.9 and η = 0.001.
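A minimal scalar sketch of the RMSProp step in plain Python. The constant gradient of 1.0, the name `rmsprop_step`, and the value of `eps` are illustrative assumptions.

```python
import math


def rmsprop_step(theta, avg_sq, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update using a moving average of squared gradients."""
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad * grad
    theta = theta - lr * grad / math.sqrt(avg_sq + eps)
    return theta, avg_sq


theta, avg_sq = 1.0, 0.0
for _ in range(200):
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad=1.0)
print(avg_sq)  # stays close to 1.0
```

With the same 200 gradients Adagrad's accumulated sum would have grown to 200, while the moving average here stays near 1, so the effective learning rate does not shrink towards zero.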


There are no obvious drawbacks, except that the learning rate is still hand-tuned, because the suggested value is not appropriate for every task.


Good for training CNNs and other deep networks. MobileNets were trained with RMSProp.


Adadelta was developed around the same time as RMSProp, but was published as a paper. It has the same goal as RMSProp: to fix the accumulation of squared gradients in the denominator.

  1. Adadelta accumulates gradients over a window, but maintaining a sliding window during training can be inefficient. Instead, the method uses an exponential moving average, as RMSProp does:

E[g²]_t = γ · E[g²]_{t−1} + (1 − γ) · g_t²

Adadelta suggests using γ = 0.9.

  2. Adadelta also considers the units of the update:

∆θ_t = −η / RMS[g]_t · g_t, where RMS[g]_t = √(E[g²]_t + ε)

Here ∆θ_t is unitless, because the units of the gradient in the numerator and in the denominator cancel. SGD and Momentum also suffer from a units problem. Adadelta proposes to keep an exponential moving average of ∆θ_t² as well and to put its RMS in the numerator:

E[∆θ²]_t = γ · E[∆θ²]_{t−1} + (1 − γ) · ∆θ_t²

The final update rule for the parameters:

∆θ_t = −RMS[∆θ]_{t−1} / RMS[g]_t · g_t
θ_{t+1} = θ_t + ∆θ_t

As we can see, there is no η, so we do not need to select a learning rate; the method sets it adaptively. The update rule uses RMS[∆θ] from step t−1, which means that at the first step t = 1 we do not know RMS[∆θ]_{t−1}, and Adadelta proposes to set it to 1 for the first step. The fraction

RMS[∆θ]_{t−1} / RMS[g]_t

makes the method robust to large sudden gradients, because the denominator increases first, before the numerator can react.
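A scalar sketch of a single Adadelta step in plain Python. It makes one assumption that differs from the initialization mentioned in the text: both accumulators start at zero and a small `eps` inside each RMS term keeps the first step non-zero. The name `adadelta_step` and the toy gradient are illustrative.

```python
import math


def adadelta_step(theta, avg_sq_grad, avg_sq_delta, grad, gamma=0.9, eps=1e-6):
    """One Adadelta update; note that no learning rate appears anywhere."""
    avg_sq_grad = gamma * avg_sq_grad + (1.0 - gamma) * grad * grad
    # RMS of previous updates in the numerator, RMS of gradients in the denominator.
    delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * grad
    avg_sq_delta = gamma * avg_sq_delta + (1.0 - gamma) * delta * delta
    return theta + delta, avg_sq_grad, avg_sq_delta


# Hypothetical loss J(theta) = theta^2 at theta = 1.0, so the gradient is 2.0.
theta, avg_sq_grad, avg_sq_delta = adadelta_step(1.0, 0.0, 0.0, grad=2.0)
print(theta)  # slightly below 1.0
```

The gradient average is updated before the step is computed, while the update average is refreshed only afterwards, which is the "denominator reacts first" behaviour described above.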


There are no obvious drawbacks here either, but it is not widely used for deep neural networks.


Good for NLP.


Adam (Adaptive Moment Estimation) works with the first and second moments of the gradient. The intuition behind Adam is that we do not want to roll so fast that we jump over the minimum; we want to decrease the velocity a little for a more careful search.

m_t = β1 · m_{t−1} + (1 − β1) · g_t
v_t = β2 · v_{t−1} + (1 − β2) · g_t²
m̂_t = m_t / (1 − β1^t)
v̂_t = v_t / (1 − β2^t)
θ_{t+1} = θ_t − η / (√v̂_t + ε) · m̂_t

where β1 = 0.9, β2 = 0.999, ε = 10⁻⁸.

Finally, the estimates of the mean m_t and the uncentered variance v_t are bias-corrected. There are a few reasons why we should do this:

  1. Without bias correction, the initial steps can be much larger.
  2. The authors of the Adam optimizer wanted to know how the expected value of the exponential moving average at timestep t relates to the true second moment:

v_t = (1 − β2) · Σ_{i=1..t} β2^{t−i} · g_i²

You can get this formula by recursively substituting the previous value of v at step t−1. Let's take the expected value of v at step t to find the relation:

E[v_t] = E[(1 − β2) · Σ_{i=1..t} β2^{t−i} · g_i²]

Now we have to consider two cases. 1) If our training process is a stationary stochastic process (the true second moment is stationary), the mean and variance are the same at every timestep; they stay constant during the whole process. This means that E[g_i²] is the same at every step:

E[v_t] = E[g_t²] · (1 − β2) · Σ_{i=1..t} β2^{t−i}

The sum of β2^{t−i} is a geometric progression, so we can rewrite the formula as

E[v_t] = E[g_t²] · (1 − β2^t)

2) For a non-stationary stochastic process we do not get a completely different value; it differs only by a small term ξ:

E[v_t] = E[g_t²] · (1 − β2^t) + ξ

That is why the bias correction divides by (1 − β2^t).
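The stationary case is easy to check numerically. The sketch below, in plain Python, assumes a constant gradient g = 2 at every step; the name `ema_second_moment` is made up for illustration.

```python
def ema_second_moment(grads, beta2=0.999):
    """Exponential moving average of squared gradients, starting from zero."""
    v = 0.0
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
    return v


t = 10
v_t = ema_second_moment([2.0] * t)  # stationary case: g_i = 2 at every step
v_hat = v_t / (1.0 - 0.999 ** t)    # bias-corrected estimate
print(v_t, v_hat)
```

After ten steps the raw average `v_t` is still about 0.04, far from the true second moment of 4, while the corrected `v_hat` recovers it almost exactly.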

The paper also points out that the effective step size depends on the learning rate. For some models we know the region of good optima in advance and can choose an appropriate learning rate accordingly; with the right learning rate Adam converges very quickly.
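Putting the pieces together, here is a scalar sketch of one Adam step in plain Python, assuming the usual form with bias-corrected moments. The name `adam_step` and the toy gradient are illustrative.

```python
import math


def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at timestep t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad         # first moment (mean)
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1 ** t)               # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Hypothetical loss J(theta) = theta^2 at theta = 1.0, so the gradient is 2.0.
theta, m, v = adam_step(1.0, 0.0, 0.0, grad=2.0, t=1)
print(theta)  # about 0.999
```

On the first step the corrected ratio `m_hat / sqrt(v_hat)` is about 1, so the parameter moves by roughly `lr` whatever the scale of the gradient, which illustrates how the learning rate sets the effective step size.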


Drawback: the learning rate is still hand-tuned.


Very good for deep CNNs. I also trained ResNet-50 with all convolutional layers pre-trained, training only the last fully connected layer, and Adam gave very good results: fast convergence and rapidly reaching high accuracy. All other optimizers showed worse results. The learning rate was 3e-4, as Andrej Karpathy suggested.

Source: Deep Learning on Medium