ADAM in 2019 — What’s the next ADAM optimizer

If ADAM is fast and SGD + momentum converges better, why couldn’t we start with ADAM and switch to SGD later?
Visualization of loss landscape from https://www.cs.umd.edu/~tomg/projects/landscapes/
  1. Introduction
  2. Quick review of Adam and its family
  3. RAdam
  4. LookAhead Optimizer
  5. LAMB
  6. Dynamics of Learning Rate and Batch Size
  7. What’s the best learning rate schedule?

Introduction

Deep learning has made a lot of progress, and new models come out every few weeks, yet we are still stuck with Adam in 2019. Do you know when the Adam paper came out? It was 2014. For comparison, the BatchNorm paper was published in 2015! 🎈

the original Adam paper
Batch Norm paper was published in 2015

Five years is like a century for the deep learning community. I was curious about recent developments in optimizers, so I Googled around and tried to summarize what I found. Sebastian has a great overview of optimization algorithms, but it is not up to date and misses some of the recent progress.

Meet Adam’s Family

You can easily find a lot of optimizers in the PyTorch documentation: Adadelta, Adagrad, AdamW, Adamax, and the list goes on. It seems that we really cannot get rid of Adam, so let’s have a quick review of it. If you are familiar with Adam already, feel free to skip this part.

Adam = Momentum + RMSProp

Adam is the combination of Momentum and RMSProp. Momentum (v) gives the optimizer short-term memory: instead of fully trusting the current gradient, it combines the previous gradients with the current one as an exponential moving average weighted by β1. RMSProp (s) scales the step by 1 / square root of the moving average of the squared gradient. Intuitively, the optimizer takes a larger step when the variance is small (confident), and vice versa.

Adam update rules
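To make the update concrete, here is a minimal pure-Python sketch of one Adam step on a list of scalar parameters. The names `adam_step` and `minimize_quadratic`, the toy loss f(θ) = Σθ², and the hyperparameter values are illustrative assumptions, not code from the paper or from any library.

```python
import math

def adam_step(theta, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (hypothetical helper). t is the 1-based step count."""
    new_theta, new_v, new_s = [], [], []
    for p, g, v_i, s_i in zip(theta, grad, v, s):
        v_i = beta1 * v_i + (1 - beta1) * g        # Momentum: moving average of gradients
        s_i = beta2 * s_i + (1 - beta2) * g * g    # RMSProp: moving average of squared gradients
        v_hat = v_i / (1 - beta1 ** t)             # bias correction for the zero initialization
        s_hat = s_i / (1 - beta2 ** t)
        new_theta.append(p - lr * v_hat / (math.sqrt(s_hat) + eps))
        new_v.append(v_i)
        new_s.append(s_i)
    return new_theta, new_v, new_s

def minimize_quadratic(steps=3000, lr=0.01):
    """Toy example (an assumption for illustration): minimize f(theta) = sum(theta_i ** 2)."""
    theta = [5.0, -3.0]
    v = [0.0, 0.0]
    s = [0.0, 0.0]
    for t in range(1, steps + 1):
        grad = [2.0 * p for p in theta]            # gradient of the toy loss
        theta, v, s = adam_step(theta, grad, v, s, t, lr=lr)
    return theta
```

Note that on the very first step the bias-corrected ratio v̂ / √ŝ is roughly the sign of the gradient, so each parameter moves by about `lr` regardless of how large its gradient is.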

Adam is great: it is much faster than SGD, and the default hyperparameters usually work fine, but it has its own pitfalls too. Many have argued that Adam has convergence problems, and that SGD + momentum often converges to a better solution given a longer training time. Many papers in 2018 and 2019 were still using SGD.

Recent Progress

So what have we learned about training a neural network? I will try to keep it short and point to articles that cover the details.

RAdam

RAdam update

On the Variance of the Adaptive Learning Rate and Beyond introduces Rectified Adam (RAdam). The paper argues that the initial phase of training with Adam is unstable because there are only a few data points for calculating the exponential moving average of the second moment (s_t) term. RAdam addresses this problem by applying the adaptive learning rate, with a rectification term, only once it is confident about the variance of the gradient. Otherwise, it switches off the adaptive learning rate and essentially falls back to SGD + momentum. A warm-up learning rate has a similar effect, as the learning rate starts from a low value (high variance, few data points) and slowly increases to a larger value (stable variance, more data points). One of the benefits of using RAdam is that it is more robust to the choice of learning rate, and you don’t have to decide the length of the warm-up period.
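The "only if it is confident" rule can be sketched as follows. This is a hedged pure-Python illustration of the rectification term as I understand it from the RAdam paper (adaptive step only when ρ_t > 4). The function names, the scalar parameter, and the `eps` term are assumptions for illustration, and library implementations may use a slightly different threshold.

```python
import math

def radam_rectification(t, beta2=0.999):
    """Return the rectification factor r_t, or None while the variance is not yet tractable."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None                                 # too few data points: no adaptive step
    num = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    den = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(num / den)

def radam_step(theta, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One RAdam update for a single scalar parameter (hypothetical helper)."""
    v = beta1 * v + (1 - beta1) * grad              # first moment (momentum)
    s = beta2 * s + (1 - beta2) * grad * grad       # second moment
    v_hat = v / (1 - beta1 ** t)
    r = radam_rectification(t, beta2)
    if r is None:
        theta = theta - lr * v_hat                  # fall back to SGD + momentum
    else:
        s_hat = s / (1 - beta2 ** t)
        theta = theta - lr * r * v_hat / (math.sqrt(s_hat) + eps)
    return theta, v, s
```

With the default β2 = 0.999, the first four steps take the momentum-only branch, and the rectification factor then grows from a small value toward 1, which is what makes it behave like a built-in warm-up.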

Warm-up learning rate

If you want to learn more about RAdam, you can refer to this nice article.

LookAhead Optimizer

Jimmy Ba was one of the authors of the Adam Optimizer, and we all know about Geoffrey Hinton already. :)

The Lookahead optimizer consists of two parts: slow weights and fast weights. The fast weights are energetic and take several steps for every update of the slow weights. The slow weights are cautious: instead of following the fast weights all the way, they interpolate the parameters and take an intermediate step based on exponential smoothing. The Lookahead optimizer does not work on its own. If you take a look at the algorithm, the inner loop is just the normal mini-batch update, so it requires an inner optimizer like SGD or Adam.

The blue line is the trajectory of the fast weights. There are multiple data points because we only do 1 update of the slow weights (purple) for every k updates of the fast weights (blue), and then we reset the fast weights. Basically, the fast weights act as a sentry: they take a few more steps to make sure the surroundings are safe, and then the slow weights take an intermediate step. In the example below, you can see that the slow weights get closer to the yellow area (high accuracy) with just a single update.
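The slow/fast interplay described above can be sketched in a few lines. This is a minimal illustration that assumes plain SGD as the inner optimizer and a user-supplied gradient function. The name `lookahead_sgd` and the default values of `k`, `alpha`, and `inner_lr` are my own choices for the example, not values prescribed by the paper.

```python
def lookahead_sgd(grad_fn, init, k=5, alpha=0.5, inner_lr=0.1, outer_steps=50):
    """Lookahead wrapped around plain SGD (hypothetical helper).

    grad_fn maps a list of parameters to a list of gradients.
    """
    slow = list(init)
    for _ in range(outer_steps):
        fast = list(slow)                           # fast weights restart from the slow weights
        for _ in range(k):                          # k ordinary inner-optimizer steps
            grads = grad_fn(fast)
            fast = [w - inner_lr * g for w, g in zip(fast, grads)]
        # The slow weights move only part of the way toward where the fast weights ended up.
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow

# Toy usage (an assumption for illustration): minimize f(w) = sum(w_i ** 2).
result = lookahead_sgd(lambda w: [2.0 * x for x in w], [4.0, -2.0])
```

Swapping the inner loop for Adam only changes how `fast` is updated. The interpolation step for `slow` stays the same.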

LARS / LAMB

The highlighted part is the trust ratio calculation. φ is a scaling function; the paper simply uses the identity function φ(z) = z.

LARS (Layer-wise Adaptive Rate Scaling) and its successor, the Layer-wise Adaptive Moments optimizer for Batch training (LAMB), take this one step further. LAMB (1) keeps the square root of the second-moment estimate as part of the adaptive learning rate, like Adam, but (2) adds one extra component called the “trust ratio”. This article explains LAMB in greater detail if you are interested. The gist is that it not only uses the variance of the gradient as a scaling factor, but also uses the ratio of the norm of the layer weights to the norm of the layer update (the gradient itself in LARS). This makes a lot of sense: comparing a layer with large weights to one with small weights, you need to scale the learning rate according to that ratio to have a similar effect. Scaling by the trust ratio normalizes the update of each layer to unit l2-norm before rescaling it by the weight norm, which helps when training deep networks by mitigating the vanishing gradient problem.
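As a rough sketch of the trust ratio for a single layer, assuming the identity scaling φ(z) = z and an Adam-style update direction that has already been computed: the helper names and the fallback to a ratio of 1.0 when a norm is zero are assumptions made for this illustration.

```python
import math

def l2_norm(xs):
    return math.sqrt(sum(x * x for x in xs))

def lamb_layer_step(weights, adam_update, lr=0.01, weight_decay=0.0):
    """Apply a LAMB-style step to one layer (hypothetical helper).

    adam_update is the per-parameter Adam direction v_hat / (sqrt(s_hat) + eps).
    """
    update = [u + weight_decay * w for u, w in zip(adam_update, weights)]
    w_norm = l2_norm(weights)
    u_norm = l2_norm(update)
    # Trust ratio with phi(z) = z; fall back to 1.0 if either norm is zero (assumption).
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    new_weights = [w - lr * trust_ratio * u for w, u in zip(weights, update)]
    return new_weights, trust_ratio

# A layer with norm-5 weights and a unit-norm update gets a trust ratio of 5,
# so the step has norm lr * ||weights|| regardless of the update's own scale.
new_w, ratio = lamb_layer_step([3.0, 4.0], [0.6, 0.8], lr=0.01)
```

Multiplying `adam_update` by 10 in this example leaves `new_w` unchanged, which is the "unit l2-norm" property from the paragraph above.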

Discussion

If ADAM is fast and SGD + momentum converges better, why couldn’t we start with ADAM and switch to SGD later?

I have no good answer to that. I sometimes see people doing this on Kaggle, and this paper suggests that it works. With the goal of training faster and more accurate neural nets, there have been lots of attempts to improve the current optimizers. In deep learning, explanations often come after empirical results. So what do we really know about optimizers, or about the optimization of a neural net at all?

The A Walk with SGD paper has some interesting explanations, and I encourage you to read it. I think it explains some of the connections to recent developments.

Why does linearly scaling the learning rate with the batch size only work up to a certain batch size?

The A Walk with SGD paper compares Gradient Descent and Stochastic Gradient Descent (SGD) and suggests that the learning rate and the batch size play two very different roles in SGD dynamics.

Learning Rate

By observing the loss while interpolating the parameters before and after each update, the paper finds that the SGD valley is roughly convex, which roughly matches our understanding. However, the paper also points out that the floor of the SGD valley is highly non-linear and full of barriers. The learning rate controls the height at which SGD travels above the floor, so a large learning rate simply moves over the barriers instead of crossing them. This is different from our usual understanding: we tend to think of a larger learning rate as causing SGD to bounce harder, which helps it avoid local minima. The hypothetical topology is illustrated in the graph below. It may seem counter-intuitive: how does SGD minimize the loss if the learning rate keeps it at a height above the floor? It turns out that our over-simplified picture of SGD just walking down a static hill is wrong. In reality, the “hill” itself keeps changing as well, as it is just our projection from a high-dimensional space to 3-dimensional space. Check out this beautiful and bizarre video of the loss landscape, created by https://losslandscape.com/gallery/!

Batch Size

Batch size, on the other hand, controls the stochasticity, which helps exploration. The paper supports this argument by observing that SGD leads to a much larger parameter distance compared to Gradient Descent (GD).

The upper graph shows the cosine of the angle between consecutive updates; -1 means they point in opposite directions. The lower graph measures ||parameter distance||² from initialization; a larger value means the parameters are further from the initialization point.

There are a few things to learn from this figure. (1) The angle between consecutive updates has very different characteristics for Gradient Descent and SGD. (2) The parameter distance is much larger for SGD after 40 iterations. The angle tells us some interesting stories. First, for Gradient Descent, it plateaus at -1 quickly, which means GD is likely oscillating in a valley-like landscape. For SGD, the value is larger and goes up and down, indicating that it is oscillating along the walls of a valley but with more exploration, instead of staying in a local valley. The larger parameter distance also supports the argument that SGD helps exploration.

In fact, I think this insight may be related to why linearly scaling the learning rate with the batch size only works up to a certain batch-size threshold. That makes sense if the batch size plays a fundamentally different role in the dynamics. The LAMB optimizer successfully trains BERT with very large batches (65536/131072). It would be interesting to study how LAMB changes the dynamics of batch size and helps the optimization with its layer-wise learning rate scaling.
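For readers who want to reproduce the two diagnostics from the figure on their own runs, here is a small sketch. It assumes you have saved the flattened parameter vector after every update as a list of lists. The function names are hypothetical, and this is not the paper's code.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: +1 same direction, -1 opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def trajectory_stats(params):
    """params is [theta_0, theta_1, ...], each a flat list of floats (assumed input format)."""
    # Update vectors theta_{t+1} - theta_t.
    updates = [[b - a for a, b in zip(p, q)] for p, q in zip(params, params[1:])]
    # Cosine between each update and the next one.
    cosines = [cosine(u, v) for u, v in zip(updates, updates[1:])]
    # Squared L2 distance of every iterate from the initialization theta_0.
    init = params[0]
    sq_dist = [sum((x - x0) ** 2 for x, x0 in zip(p, init)) for p in params]
    return cosines, sq_dist

# A trajectory that bounces between two walls: cosines of -1, no net progress.
zigzag = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
cosines, sq_dist = trajectory_stats(zigzag)
```

In the zigzag example every cosine is -1 and the squared distance keeps returning to 0, which is the oscillating-in-a-valley pattern the text attributes to Gradient Descent.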

What’s the best learning rate schedule?

Based on the RAdam paper, we learned that the warm-up schedule helps to reduce the variance of the adaptive learning rate. Does that mean we do not need warm-up anymore? Probably not, as we know that a high learning rate also helps in terms of optimization. The A Walk with SGD paper also proposes a trapezoidal schedule, which can be separated into 3 stages. fast.ai uses a similar strategy, which they call a modified version of one-cycle training.

(1) Warm-up stage: start from a low learning rate and linearly increase to the maximum
(2) Highway training stage: stay at the maximum learning rate
(3) Converging stage: linearly decay the learning rate to its minimum
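The three stages can be written as a small function of the step index. This is a minimal sketch: the name `trapezoid_lr`, the step counts, and the learning rate values are placeholders chosen for illustration, not values from the paper or from fast.ai.

```python
def trapezoid_lr(step, total_steps, warmup_steps, decay_steps, max_lr=0.1, min_lr=0.001):
    """Trapezoidal learning rate schedule (hypothetical helper).

    step runs from 0 to total_steps; warmup_steps + decay_steps should not exceed total_steps.
    """
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # (1) Warm-up: linear increase from min_lr to max_lr.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    if step < decay_start:
        # (2) Highway: stay at the maximum learning rate.
        return max_lr
    # (3) Converging: linear decay from max_lr back to min_lr.
    progress = min((step - decay_start) / max(decay_steps, 1), 1.0)
    return max_lr - (max_lr - min_lr) * progress

# Example: 100 steps with 10 warm-up steps and 30 decay steps.
schedule = [trapezoid_lr(s, 100, 10, 30) for s in range(101)]
```

In a training loop you would call this once per step and assign the result to the optimizer's learning rate. The shape looks like a trapezoid when plotted, hence the name.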


Keywords: Adam, SGD, RAdam, LookAhead, LAMB

References

[p1] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han. On the Variance of the Adaptive Learning Rate and Beyond.

[p2] Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba. Lookahead Optimizer: k steps forward, 1 step back.

[p3] Chen Xing, Devansh Arpit, Christos Tsirigotis, Yoshua Bengio. A Walk with SGD.

[p4] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

[p5] Leslie N. Smith, Nicholay Topin. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates.

[p6] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson. Averaging Weights Leads to Wider Optima and Better Generalization.

[p7] Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.