Prerequisites: Basic understanding of neural networks and loss functions


In today’s world, a world that is advancing exponentially (mind this word, it is going to come in handy ahead), you definitely want to stay updated with the latest happenings, be it sports, news, technology, or music. However, current technologies would not function without the basic components that made them possible. Smartphones without transistors... eh?

A similar analogy applies to Momentum, one of the most popular optimization techniques, which is used in many state-of-the-art deep learning models. Before we dive into Momentum, we will briefly look at a few other topics that you need to understand first.


$$L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$

Loss function for Mean Squared Error

Let us say we have a training data set of n training examples and a model that has to be trained. Let us assume a loss function L with parameters y and ŷ (y with a hat). An example of a loss function is given above, where mean squared error is used as the loss function. Here y represents the actual labels of the training data set, and ŷ is the result of the hypothesis function h(W), that is, the output we obtain from the model for the training examples.
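To make the mean squared error concrete, here is a minimal sketch in plain Python. The function name `mean_squared_error` and the sample numbers are made up for illustration and are not from any particular library.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical labels (y) and model outputs (y with a hat).
labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
print(mean_squared_error(labels, predictions))
```

The closer the predictions get to the labels, the closer this value gets to zero, which is what training tries to achieve.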

The loss function is pretty much the bread and butter of learning models. In statistics, a loss function is typically used for parameter estimation, where the loss is some function of the difference between the estimated and true values for an instance of data.

Training a model usually begins by setting the weights (parameters) of the model to some random initial values. As the iterations pass, the weights are updated so as to minimize the loss. The formula for gradient descent for any j-th parameter is as follows:

$$W_j := W_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L\left(y_i, \hat{y}_i\right)}{\partial W_j}$$

α represents the learning rate of the gradient descent algorithm. It controls how big a step we take in each iteration, and it is often considered one of the most important hyper-parameters to tune when training neural networks.

The loss function is differentiated with respect to each weight to get the gradient of the loss function w.r.t. that weight. This tells us the direction in which to update the weights so that the loss decreases.

The value of m in the formula can be 1, in which case it is called Stochastic Gradient Descent (updating the weights on each training example), or any number >1 and <n, in which case it is called Mini-batch Gradient Descent (updating the weights once per set of m training examples). When m = n, the whole training set is used for every update, which is called Batch Gradient Descent.
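The update step and the role of m can be sketched as follows. This is only an illustration under assumptions that are not in the article: a hypothetical one-weight model ŷ = w · x, a squared-error loss, and a tiny noise-free data set. `batch_size` plays the role of m.

```python
import random


def gradient_descent(xs, ys, alpha=0.05, batch_size=1, epochs=200, seed=0):
    """Fit y_hat = w * x with squared error, using batches of size m = batch_size."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)  # random initial weight
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w * x - y)^2 with respect to w over the batch.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= alpha * grad  # step against the gradient, scaled by the learning rate
    return w


# Toy data generated with w = 2 and no noise (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]

print(gradient_descent(xs, ys, batch_size=1))  # stochastic: one example per update
print(gradient_descent(xs, ys, batch_size=2))  # mini-batch
print(gradient_descent(xs, ys, batch_size=4))  # full batch: all n examples per update
```

On this easy problem all three settings end up very close to w = 2; the difference between them shows up mainly in how noisy each individual update is.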


If we have a sequence V with n elements, then the simple moving average is defined as:

$$\bar{V} = \frac{V_1 + V_2 + \dots + V_n}{n}$$


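Now let us take another sequence S with n elements. The exponential moving average defines a new sequence V as:

$$V_t = \beta V_{t-1} + (1 - \beta) S_t$$

Here V is the new sequence and S is the original sequence.

β is called the smoothing constant and lies between 0 and 1. This is how the terms of the new sequence pan out:

$$V_1 = \beta V_0 + (1 - \beta) S_1$$
$$V_2 = \beta V_1 + (1 - \beta) S_2 = \beta^2 V_0 + \beta (1 - \beta) S_1 + (1 - \beta) S_2$$
$$V_3 = \beta V_2 + (1 - \beta) S_3 = \beta^3 V_0 + \beta^2 (1 - \beta) S_1 + \beta (1 - \beta) S_2 + (1 - \beta) S_3$$

On solving a bit more,

$$V_t = (1 - \beta) \sum_{i=0}^{t-1} \beta^i S_{t-i} + \beta^t V_0$$

If the new sequence is started from V_0 = 0, the last term vanishes.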

From this equation we see that the value of the t-th term of the new sequence depends on all the values 1..t of the original sequence S. Each value from S is assigned some weight. This weight is β to the power of i, multiplied by (1 − β), for the (t − i)-th value of S. Because β is less than 1, it becomes even smaller when we raise it to a higher positive power. So the older values of S get a much smaller weight and contribute less to the current value of V, while the recent values of S get a higher weight and contribute more.
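The weighting described above can be checked numerically. The sketch below computes the exponential moving average step by step and again as an explicit weighted sum, and the two agree. It assumes the new sequence starts from V_0 = 0 (a common convention, not stated in the article); the function names and sample values are made up for illustration.

```python
def exponential_moving_average(s, beta):
    """Return V with V_t = beta * V_(t-1) + (1 - beta) * S_t, assuming V_0 = 0."""
    v = 0.0
    out = []
    for s_t in s:
        v = beta * v + (1.0 - beta) * s_t
        out.append(v)
    return out


def ema_unrolled(s, beta, t):
    """V_t (t is 1-indexed) as the weighted sum of (1 - beta) * beta**i * S_(t-i)."""
    return sum((1.0 - beta) * beta ** i * s[t - 1 - i] for i in range(t))


s = [1.0, 2.0, 3.0, 4.0]
beta = 0.9
v = exponential_moving_average(s, beta)

# Weights given to S_t, S_(t-1), S_(t-2), ...: each one is beta times the previous.
weights = [(1.0 - beta) * beta ** i for i in range(len(s))]

print(v[-1], ema_unrolled(s, beta, len(s)))
print(weights)
```

The printed weights shrink by a factor of β at every step back in time, which is exactly why older values matter less.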


We’ve defined a way to get a ‘moving’ average of some sequence, one that changes along with the data. How can we apply this to training neural networks?

We can use it to average over our gradient values. Let me explain how momentum works and why it works.

So, when we update the weights on one example or a small batch at a time, gradient descent does not give us the exact derivative of the loss function; it only gives us an estimate based on those few examples. Therefore, we might not always be headed in the optimal direction, because these estimated derivatives are noisy. This causes slow training and convergence.

The momentum technique helps us solve this issue of slow convergence. Consider a ball on a hilly terrain that is trying to reach the deepest valley. If the slope of the hill at some stage is very steep, the ball gains a lot of momentum and is able to roll over small hills in its way. As the slope decreases, the momentum and speed of the ball decrease, and it eventually comes to rest at the deepest point of the valley.

Exponential! Remember the name??

The momentum technique modifies the Gradient Descent method by introducing a new variable V, representing the velocity, and a friction coefficient (or smoothing constant) β, which controls the value of V. This helps avoid overshooting the minimum while still allowing faster convergence.

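Recall the equation for the Exponential Moving Average we discussed earlier. We can apply that equation along with the Gradient Descent update step to obtain the following momentum update rule:

$$V_t = \beta V_{t-1} + (1 - \beta) \frac{\partial L}{\partial W}$$
$$W := W - \alpha V_t$$

Another way to do it is by dropping the (1 − β) term, which is a little less intuitive:

$$V_t = \beta V_{t-1} + \alpha \frac{\partial L}{\partial W}$$
$$W := W - V_t$$

This is pretty much identical to the first pair of equations. The only difference is the scale of the learning rate: the first pair with learning rate α behaves like the second pair with learning rate (1 − β)α.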
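The relationship between the two forms can be checked with a small sketch. It runs both update rules on a made-up one-dimensional loss (w − 3)², starting from zero velocity, and scales the learning rate of the second form by (1 − β). The function names, the toy loss, and the hyper-parameter values are all assumptions for illustration.

```python
def momentum_with_ema(grad_fn, w, alpha, beta, steps):
    """First form: V = beta * V + (1 - beta) * grad, then W = W - alpha * V."""
    v = 0.0
    path = []
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad_fn(w)
        w = w - alpha * v
        path.append(w)
    return path


def momentum_without_ema_term(grad_fn, w, alpha, beta, steps):
    """Second form: V = beta * V + alpha * grad, then W = W - V."""
    v = 0.0
    path = []
    for _ in range(steps):
        v = beta * v + alpha * grad_fn(w)
        w = w - v
        path.append(w)
    return path


def toy_grad(w):
    """Gradient of the hypothetical loss (w - 3)^2."""
    return 2.0 * (w - 3.0)


alpha, beta = 0.1, 0.9
first = momentum_with_ema(toy_grad, 0.0, alpha, beta, steps=50)
second = momentum_without_ema_term(toy_grad, 0.0, alpha * (1.0 - beta), beta, steps=50)

# Largest gap between the two weight trajectories (only rounding error is expected).
print(max(abs(a - b) for a, b in zip(first, second)))
```

So which form you pick is mostly a matter of taste, as long as you remember that the same α does not mean the same step size in both.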

There are two major reasons why momentum works with gradient descent:

  1. The exponential moving average gives more importance to the most recent derivatives of the loss function, and can provide an estimate that is closer to the actual derivative than our noisy calculations.
  2. Sometimes, loss functions tend to have a structure like this:
[Figure: pathological curve]

The bluish area represents a ravine-like structure.

A ravine is an area where the surface curves much more steeply in one dimension than in another. Ravines are common near local minima in deep learning, and Gradient Descent has trouble navigating them.

If at any iteration we enter this ravine region, the weight updates may keep bouncing between the walls of the ravine like below. This kind of region is known as pathological curvature.

Some may say “Why don’t you just decrease the learning rate?”

Well, it makes sense when you are approaching a minimum, but think of a case where you are in a pathological curvature and you still have a whole lot of distance to cover to reach the minimum. This is where some momentum helps.

When Gradient Descent reaches the middle of the ravine, the momentum technique takes the recent derivatives into account and hence boosts the update in the direction they agree on. In the above image, notice that each gradient update has been resolved into components along the w1 and w2 directions. If we sum these vectors up individually, their components along the w1 direction cancel out, while the components along the w2 direction reinforce each other. In an update, this means the w2 direction is enhanced while the w1 component is damped towards zero, resulting in faster movement towards the minimum.
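Here is a rough sketch of the ravine situation, using a made-up quadratic loss 0.5 · (100 · w1² + w2²) that is a hundred times steeper along w1 than along w2. The loss, the starting point, and the hyper-parameters are assumptions chosen for illustration; the update uses the form without the (1 − β) term, and β = 0 gives plain Gradient Descent.

```python
def loss(w1, w2, a=100.0, b=1.0):
    """Toy ravine: steep along w1 (curvature a), shallow along w2 (curvature b)."""
    return 0.5 * (a * w1 ** 2 + b * w2 ** 2)


def grad(w1, w2, a=100.0, b=1.0):
    """Partial derivatives of the toy loss with respect to w1 and w2."""
    return a * w1, b * w2


def run(alpha, beta, steps, start=(1.0, 1.0)):
    """Gradient descent with momentum; beta = 0.0 reduces to plain gradient descent."""
    w1, w2 = start
    v1 = v2 = 0.0
    for _ in range(steps):
        g1, g2 = grad(w1, w2)
        v1 = beta * v1 + alpha * g1
        v2 = beta * v2 + alpha * g2
        w1, w2 = w1 - v1, w2 - v2
    return w1, w2


plain = run(alpha=0.019, beta=0.0, steps=300)
with_momentum = run(alpha=0.019, beta=0.9, steps=300)

print("plain gradient descent loss:", loss(*plain))
print("with momentum loss:         ", loss(*with_momentum))
```

With these settings, plain Gradient Descent flips sign along w1 at every step (it bounces between the walls) while crawling along w2, so after the same number of steps the momentum run ends at a much lower loss. Raising the learning rate much further is not an option for the plain version here, because the steep w1 direction would start to diverge.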

In practice, the momentum coefficient β is usually initialized at around 0.5 and is slowly annealed to 0.9 or higher. I will also be posting articles about practical simulations of momentum.
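One possible way to anneal the momentum coefficient is a simple linear ramp, sketched below. The article does not specify the shape of the schedule, so the linear ramp, the function name `annealed_momentum`, and the 10-epoch ramp length are all assumptions for illustration.

```python
def annealed_momentum(epoch, ramp_epochs=10, start=0.5, end=0.9):
    """Linearly raise the momentum coefficient from start to end, then hold it at end."""
    if ramp_epochs <= 0:
        return end
    fraction = min(max(epoch / ramp_epochs, 0.0), 1.0)
    return start + fraction * (end - start)


for epoch in (0, 5, 10, 20):
    print(epoch, annealed_momentum(epoch))
```

The value returned for the current epoch would then be used as β in the momentum update for that epoch.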


The basic idea behind momentum is to decrease the convergence time by accelerating Gradient Descent in a relevant and optimal direction. This technique is used in various types of Deep Neural Network models where noisy gradients need to be smoothed out. For more of the detailed math, I suggest you refer to the following article:

Why Momentum Really Works

