AdaMod is a new deep learning optimizer that builds on Adam, adding an automatic warmup heuristic and long-term learning rate buffering. In my initial testing, AdaMod is a top 5 optimizer: it readily matches or exceeds vanilla Adam, is much less sensitive to the learning rate hyperparameter, produces smoother training curves, and requires no warmup.
AdaMod is proposed in the paper “An Adaptive and Momental Bound Method for Stochastic Learning” by Ding, Ren, Luo and Sun.
How AdaMod works: AdaMod maintains an exponential long-term average of the adaptive learning rates themselves, and uses it to clip any excessive adaptive rates throughout training. The result is improved convergence, no need for warmup, and less sensitivity to the learning rate chosen. The length of this memory is controlled by a new parameter, Beta3.
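To make the mechanism concrete, here is a minimal sketch of one AdaMod update for a single scalar parameter. It is illustrative only (real implementations vectorize over tensors); the structure follows the paper: compute Adam's per-parameter step size, fold it into a Beta3-weighted exponential average, and clip the current step size against that average. Because the average starts at zero, early steps are automatically small, which is where the built-in warmup effect comes from.

```python
import math

def adamod_step(param, grad, state, lr=1e-3,
                beta1=0.9, beta2=0.999, beta3=0.999, eps=1e-8):
    """One AdaMod update for a scalar parameter (illustrative sketch)."""
    state["t"] += 1
    t = state["t"]
    # Standard Adam first/second moment estimates with bias correction
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    # Adam's per-parameter step size
    eta = lr / (math.sqrt(v_hat) + eps)
    # AdaMod's addition: long-term EMA of step sizes, then clip against it.
    # Since s starts at 0, early steps are damped (automatic warmup).
    state["s"] = beta3 * state["s"] + (1 - beta3) * eta
    eta = min(eta, state["s"])
    return param - eta * m_hat

# Toy usage: minimize f(p) = p**2, whose gradient is 2*p
state = {"t": 0, "m": 0.0, "v": 0.0, "s": 0.0}
p = 1.0
for _ in range(10):
    p = adamod_step(p, grad=2.0 * p, state=state)
```

Note how `min(eta, state["s"])` only ever caps the step size; when the adaptive rate is already below its long-term average, the update is plain Adam.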
This longer-term memory, and the clipping it enables, addresses a main issue found with many adaptive optimizers such as vanilla Adam: without longer-term memory, spiky adaptive learning rates can trap the optimizer in bad optima and prevent convergence. This non-convergence issue was raised by Reddi, Kale and Kumar (2018) and has prompted a number of optimizer improvements designed to overcome it.
No warmup needed: Much like Rectified Adam, AdaMod controls the variance of the adaptive learning rates from the start of training, ensuring stability in the early stages. The AdaMod researchers ran several experiments showing the effect of warmup on Adam, as well as how AdaMod performs with no warmup.
In some cases (transformer models for NLP, for example), Adam simply never gets going without a warmup, as shown in the above image.
The reason for this issue is exactly what the Rectified Adam researchers noted: excessive variance in learning rates at the start of training (without a warmup) means Adam can make poor, oversized jumps early on and trap itself in a bad minimum with no escape.
The image above from the paper is very similar to the image shown in the Rectified Adam paper in that warm-up tames the excessive learning rates. I’ve added a red centerline at 0 in the two images above to better highlight the much lower learning rates that result from Adam with warmup (and thus why warmup is vital if only using regular Adam).
While Rectified Adam stabilizes learning rates only at the start of training, AdaMod does this and also continues to control variance throughout training, clipping any adaptive learning rate that rises above its long-term average. This helps convergence and may avoid some of the non-convergence issues Reddi et al. pointed out with regular Adam.
Limitations of AdaMod: While AdaMod generally outperforms vanilla Adam, SGDM can still outperform AdaMod under longer training conditions. The authors note this as an item for future work.
(Note that DiffGrad does outperform SGDM in similar testing, but DiffGrad does not handle training variance (i.e., warmup) as well as AdaMod… hence I made ‘DiffMod’, which combines DiffGrad and AdaMod.)
A good recommendation is to converge with AdaMod first, then run some extra training with SGDM and see if you can push out any additional improvements.
Using AdaMod: AdaMod is a drop-in replacement for Adam. The only change is a new hyperparameter called B3, or Beta3, which controls the degree of lookback for the long-term clipping average. The authors recommend values between 0.999 and 0.9999.
Note that while the official code takes B3 in the decimal notation above, I made a slight change in my GitHub repo (which includes some variations of AdaMod, such as combining DiffGrad + AdaMod = DiffMod): instead of the B3 constant, you can pass the total number of batches to remember as “len_memory”, i.e. 1,000 (same as 0.999) or 10,000 (same as 0.9999), which is easier to remember and track. It also makes it easier to test values like 5,000 or 2,500.
In short, len_memory is b3 in an easier-to-use format: specify the memory length (e.g., 5000) and b3 is computed for you.
Video on AdaMod — I’ve also made a video for AdaMod that reviews the key points of AdaMod, and does a quick walk-through with AdaMod in action on a live server using FastAI: https://youtu.be/vx8thj3XZfw
Source code, so you can use AdaMod:
1 — Official Github repo (PyTorch): https://github.com/lancopku/AdaMod
2 — Unofficial AdaMod and optional DiffMod variant (PyTorch, FastAI): https://github.com/lessw2020/Best-Deep-Learning-Optimizers/tree/master/adamod
Summary: AdaMod represents another step forward for deep learning optimizers as it provides three improvements:
1 — No need for warmup (similar to Rectified Adam)
2 — Reduced sensitivity to the learning rate hyperparameter (converges to similar results across a range of learning rates)
3 — Usually outperforms Adam in final results, due to improved stability throughout training (via the longer-term memory and the clipping it enables)
In my testing on several datasets against other optimizers, AdaMod has consistently been a top 5 performer.
Ranger and DeepMemory are also in the top 5 for reference, so AdaMod is in an elite group and a strong contender for use in your deep learning training!