Meet AdaMod: a new deep learning optimizer with memory

AdaMod is a new deep learning optimizer that builds on Adam, adding an automatic warmup heuristic and long-term learning rate buffering. From initial testing, AdaMod is a top 5 optimizer: it readily matches or exceeds vanilla Adam, while being much less sensitive to the learning rate hyperparameter, producing a smoother training curve, and requiring no warmup phase.

AdaMod converges to the same point even with learning rates up to two orders of magnitude apart, whereas SGDM and Adam end up at different results.

AdaMod is proposed in the paper “An Adaptive and Momental Bound Method for Stochastic Learning” by Ding, Ren, Luo and Sun.

How AdaMod works: AdaMod maintains an exponential long-term average of the adaptive learning rates themselves, and uses it to clip any excessive adaptive rates throughout training. The result is improved convergence, no need for warmup, and less sensitivity to the actual learning rate chosen. The degree of memory is controlled by a new parameter, Beta3.
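To make the mechanism concrete, here is a pure-Python sketch of the update on a single scalar parameter. This is my own rendering of the idea, not the authors' code; defaults for lr, beta1, beta2 and eps follow the usual Adam conventions, and the state-dict layout is my assumption.

```python
import math

def adamod_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
                beta3=0.999, eps=1e-8):
    """One AdaMod-style update for a scalar parameter (illustrative sketch)."""
    state['t'] += 1
    t = state['t']
    # standard Adam moment estimates with bias correction
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad
    state['v'] = beta2 * state['v'] + (1 - beta2) * grad * grad
    m_hat = state['m'] / (1 - beta1 ** t)
    v_hat = state['v'] / (1 - beta2 ** t)
    # the per-parameter adaptive step size, exactly as in Adam
    eta = lr / (math.sqrt(v_hat) + eps)
    # AdaMod's addition: an exponential long-term average of step sizes...
    state['s'] = beta3 * state['s'] + (1 - beta3) * eta
    # ...used to clip any spiky step size (this also gives the warmup effect,
    # since s starts near zero and grows gradually)
    eta = min(eta, state['s'])
    return theta - eta * m_hat

# minimize f(x) = x^2 from x = 5
state = {'t': 0, 'm': 0.0, 'v': 0.0, 's': 0.0}
x = 5.0
for _ in range(2000):
    x = adamod_step(x, 2 * x, state)
```

Note how the clipped step size starts tiny (because the long-term average `s` is initialized at zero) and ramps up smoothly, which is exactly the built-in warmup behavior described above.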

This longer-term memory, and the clipping derived from it, addresses a key issue with many adaptive optimizers such as vanilla Adam: without longer-term memory, spiky adaptive learning rates can trap the optimizer in bad optima and cause non-convergence. This non-convergence issue was raised by Reddi, Kale and Kumar (2018) and has driven a number of optimizer improvements designed to overcome it.

No warmup needed: Much like Rectified Adam, AdaMod controls the variance of the adaptive learning rates from the start of training, and thus ensures stability in the early stages. The AdaMod researchers ran several experiments showing the effect of warmup on Adam, as well as how AdaMod performs with no warmup.

Blue line is Adam with no warmup…and poor results.

In some cases (transformer models for NLP, for example), Adam simply never gets going without a warmup, as shown in the image above.

The reason for this issue is exactly the same as what the Rectified Adam researchers noted: excessive variance in learning rates at the start of training (without a warmup) means Adam can make poor, excessive jumps at the start and trap itself in a bad minimum with no escape.

Very large learning rates with no warmup vs much lower and controlled steps with warmup.

The image above from the paper is very similar to the image shown in the Rectified Adam paper in that warm-up tames the excessive learning rates. I’ve added a red centerline at 0 in the two images above to better highlight the much lower learning rates that result from Adam with warmup (and thus why warmup is vital if only using regular Adam).

While Rectified Adam manages stable learning rates at the start, AdaMod does the same but also continues to control the variance throughout training, clipping the learning rates whenever they exceed the long-term average. This helps convergence and may avoid some of the non-convergence issues Reddi et al. pointed out for regular Adam.

AdaMod outperforming Adam with warmup

Limitations of AdaMod: While AdaMod generally outperforms vanilla Adam, SGDM can still outperform AdaMod under longer training conditions. The authors note this as an item for future work.

(Note that DiffGrad does outperform SGDM in similar testing, but DiffGrad does not handle training variance (i.e. warmup) as well as AdaMod, which is why I made ‘DiffMod’, combining DiffGrad and AdaMod.)

A good recommendation would be to converge with AdaMod and then run some extra training with SGDM to see if you can push out any additional improvement:

DenseNet training. AdaMod outperforms Adam, but SGDM converges (much later) at better accuracy.

The core algorithm for AdaMod (note line 9, which computes the long-term average, and line 10, which clips if needed).
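To make the two-phase recipe concrete, here is a toy sketch on a quadratic loss, using a plain Adam-style stand-in for the adaptive phase and hand-rolled SGDM for the polish phase. This is illustrative only; in real training you would swap optimizer objects (e.g. in PyTorch) partway through.

```python
# toy loss f(x) = (x - 3)^2 with gradient 2 * (x - 3)
def grad(x):
    return 2.0 * (x - 3.0)

# Phase 1: adaptive optimizer (Adam-style stand-in) to converge quickly
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = grad(x)
    m = 0.9 * m + 0.1 * g             # first-moment estimate
    v = 0.999 * v + 0.001 * g * g     # second-moment estimate
    m_hat = m / (1 - 0.9 ** t)        # bias correction
    v_hat = v / (1 - 0.999 ** t)
    x -= 0.05 * m_hat / (v_hat ** 0.5 + 1e-8)

# Phase 2: switch to SGD with momentum to polish for extra accuracy
buf = 0.0
for _ in range(500):
    buf = 0.9 * buf + grad(x)
    x -= 0.01 * buf
```

The adaptive phase gets close to the minimum quickly; the SGDM phase then damps the residual oscillation and tightens the final solution, mirroring the DenseNet behavior above where SGDM converges later but at better accuracy.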

Using AdaMod: AdaMod is a drop-in replacement for Adam. The only change is a new hyperparameter called B3, or Beta3, which controls the degree of lookback for the long-term clipping average. The authors recommend values from .999 to .9999.

Results can vary based on Beta3 selection.

Note that while the official code uses B3 with the decimal notation as above, I made a slight change in my GitHub repo (which holds some variations of AdaMod, including DiffGrad + AdaMod = DiffMod): you can pass in the total number of batches to remember as “len_memory” instead of a B3 constant, i.e. 1,000 (the same as .999) or 10,000 (the same as .9999). That is easier to remember and track, and also makes it simpler to test values like 5,000 or 2,500.

len_memory is b3 in an easier-to-use format. Specify the memory length, e.g. 5000, and b3 is computed for you.
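The mapping itself is the usual EMA memory-length relation, beta3 = 1 - 1/len_memory, since an exponential average with coefficient beta3 effectively remembers about 1/(1 - beta3) recent steps. A quick sketch (the helper names here are illustrative, not the repo's actual API):

```python
def beta3_from_len_memory(len_memory):
    # an EMA with coefficient beta3 effectively remembers ~1/(1 - beta3)
    # recent steps, so len_memory = 1,000 maps to beta3 = .999
    return 1.0 - 1.0 / len_memory

def len_memory_from_beta3(beta3):
    # inverse mapping, to recover the memory length from a paper-style beta3
    return round(1.0 / (1.0 - beta3))
```

So len_memory = 5,000 corresponds to beta3 = .9998, sitting between the two values the authors recommend.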

Video on AdaMod — I’ve also made a video for AdaMod that reviews the key points of AdaMod, and does a quick walk-through with AdaMod in action on a live server using FastAI:

Source code, so you can use AdaMod:

1 — Official Github repo (PyTorch):

2 — Unofficial AdaMod and optional DiffMod variant (PyTorch, FastAI):

Summary: AdaMod represents another step forward for deep learning optimizers as it provides three improvements:

1 — No need for warmup (similar to Rectified Adam)

2 — Reduced sensitivity to the learning rate hyperparameter (converges to similar results across a wide range of rates)

3 — Usually outperforms Adam in final results due to improved stability throughout training (via the longer-term memory and the clipping derived from it).

In testing AdaMod on some datasets along with other optimizers, I find that AdaMod is consistently a top 5 optimizer.

Ranger and DeepMemory are also in the top 5 for reference, so AdaMod is in an elite group and a strong contender for use in your deep learning training!

