Meet AdaMod: a new deep learning optimizer with memory

AdaMod is a new deep learning optimizer that builds on Adam, adding an automatic warmup heuristic and long-term learning rate buffering. From initial testing, AdaMod is a top 5 optimizer: it readily matches or exceeds vanilla Adam, while being much less sensitive to the learning rate hyperparameter, producing a smoother training curve, and requiring no warmup phase.

AdaMod converges to the same point even with learning rates up to two orders of magnitude apart, whereas SGDM and Adam end up at different results.

AdaMod is proposed in the paper “An Adaptive and Momental Bound Method for Stochastic Learning” by Ding, Ren, Luo and Sun.

How AdaMod works: AdaMod maintains an exponential long-term average of the adaptive learning rates themselves, and uses it to clip any excessive adaptive rates throughout training. The result is improved convergence, no need for warmup, and less sensitivity to the actual learning rate chosen. The degree of memory is controlled by a new parameter, Beta3.
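To make the mechanism concrete, here is a pure-Python sketch of the update on a single scalar parameter. This is my own rendering of the idea, not the authors' code; defaults for lr, beta1, beta2 and eps follow the usual Adam conventions, and the state-dict layout is my assumption.

```python
import math

def adamod_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
                beta3=0.999, eps=1e-8):
    """One AdaMod-style update for a scalar parameter (illustrative sketch)."""
    state['t'] += 1
    t = state['t']
    # standard Adam moment estimates with bias correction
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad
    state['v'] = beta2 * state['v'] + (1 - beta2) * grad * grad
    m_hat = state['m'] / (1 - beta1 ** t)
    v_hat = state['v'] / (1 - beta2 ** t)
    # the per-parameter adaptive step size, exactly as in Adam
    eta = lr / (math.sqrt(v_hat) + eps)
    # AdaMod's addition: an exponential long-term average of step sizes...
    state['s'] = beta3 * state['s'] + (1 - beta3) * eta
    # ...used to clip any spiky step size (this also gives the warmup effect,
    # since s starts near zero and grows gradually)
    eta = min(eta, state['s'])
    return theta - eta * m_hat

# minimize f(x) = x^2 from x = 5
state = {'t': 0, 'm': 0.0, 'v': 0.0, 's': 0.0}
x = 5.0
for _ in range(2000):
    x = adamod_step(x, 2 * x, state)
```

Note how the clipped step size starts tiny (because the long-term average `s` is initialized at zero) and ramps up smoothly, which is exactly the built-in warmup behavior described above.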

This longer-term memory, and the clipping derived from it, addresses a key issue with many adaptive optimizers such as vanilla Adam: without longer-term memory, spiky adaptive learning rates can trap the optimizer in bad optima and cause non-convergence. This non-convergence issue was raised by Reddi, Kale and Kumar (2018) and has driven a number of optimizer improvements designed to overcome it.

No warmup needed: Much like Rectified Adam, AdaMod controls the variance of the adaptive learning rates from the start of training, and thus ensures stability in the early stages. The AdaMod researchers ran several experiments showing the effect of warmup on Adam, as well as how AdaMod performs with no warmup.

Blue line is Adam with no warmup…and poor results.

In some cases (transformer models for NLP, for example), Adam simply never gets going without a warmup, as shown in the image above.

The reason for this issue is exactly the same as what the Rectified Adam researchers noted: excessive variance in learning rates at the start of training (without a warmup) means Adam can make poor, excessive jumps at the start and trap itself in a bad minimum with no escape.

Very large learning rates with no warmup vs much lower and controlled steps with warmup.

The image above from the paper is very similar to the image shown in the Rectified Adam paper in that warm-up tames the excessive learning rates. I’ve added a red centerline at 0 in the two images above to better highlight the much lower learning rates that result from Adam with warmup (and thus why warmup is vital if only using regular Adam).

While Rectified Adam manages stable learning rates at the start, AdaMod does the same but also continues to control the variance throughout training, clipping the learning rates whenever they exceed the long-term average. This helps convergence and may avoid some of the non-convergence issues Reddi et al. pointed out for regular Adam.

AdaMod outperforming Adam with warmup

Limitations of AdaMod: While AdaMod generally outperforms vanilla Adam, SGDM can still outperform AdaMod under longer training conditions. The authors note this as an item for future work.

(Note that DiffGrad does outperform SGDM in similar testing, but DiffGrad does not handle training variance (i.e. warmup) as well as AdaMod, which is why I made ‘DiffMod’, combining DiffGrad and AdaMod.)

A good recommendation would be to converge with AdaMod and then run some extra training with SGDM to see if you can push out any additional improvement:

DenseNet training. AdaMod outperforms Adam, but SGDM converges (much later) at better accuracy.

The core algorithm for AdaMod (note line 9, which computes the long-term average, and line 10, which clips if needed).
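To make the two-phase recipe concrete, here is a toy sketch on a quadratic loss, using a plain Adam-style stand-in for the adaptive phase and hand-rolled SGDM for the polish phase. This is illustrative only; in real training you would swap optimizer objects (e.g. in PyTorch) partway through.

```python
# toy loss f(x) = (x - 3)^2 with gradient 2 * (x - 3)
def grad(x):
    return 2.0 * (x - 3.0)

# Phase 1: adaptive optimizer (Adam-style stand-in) to converge quickly
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = grad(x)
    m = 0.9 * m + 0.1 * g             # first-moment estimate
    v = 0.999 * v + 0.001 * g * g     # second-moment estimate
    m_hat = m / (1 - 0.9 ** t)        # bias correction
    v_hat = v / (1 - 0.999 ** t)
    x -= 0.05 * m_hat / (v_hat ** 0.5 + 1e-8)

# Phase 2: switch to SGD with momentum to polish for extra accuracy
buf = 0.0
for _ in range(500):
    buf = 0.9 * buf + grad(x)
    x -= 0.01 * buf
```

The adaptive phase gets close to the minimum quickly; the SGDM phase then damps the residual oscillation and tightens the final solution, mirroring the DenseNet behavior above where SGDM converges later but at better accuracy.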

Using AdaMod: AdaMod is a drop-in replacement for Adam. The only change is a new hyperparameter called B3, or Beta3, which controls the degree of lookback for the long-term clipping average. The authors recommend values from .999 to .9999.

Results can vary based on Beta3 selection.

Note that while the official code uses B3 with the decimal notation as above, I made a slight change in my GitHub repo (which holds some variations of AdaMod, including DiffGrad + AdaMod = DiffMod): you can pass in the total number of batches to remember as “len_memory” instead of a B3 constant, i.e. 1,000 (the same as .999) or 10,000 (the same as .9999). That is easier to remember and track, and also makes it simpler to test values like 5,000 or 2,500.

len_memory is b3 in an easier-to-use format. Specify the memory length, e.g. 5000, and b3 is computed for you.
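The mapping itself is the usual EMA memory-length relation, beta3 = 1 - 1/len_memory, since an exponential average with coefficient beta3 effectively remembers about 1/(1 - beta3) recent steps. A quick sketch (the helper names here are illustrative, not the repo's actual API):

```python
def beta3_from_len_memory(len_memory):
    # an EMA with coefficient beta3 effectively remembers ~1/(1 - beta3)
    # recent steps, so len_memory = 1,000 maps to beta3 = .999
    return 1.0 - 1.0 / len_memory

def len_memory_from_beta3(beta3):
    # inverse mapping, to recover the memory length from a paper-style beta3
    return round(1.0 / (1.0 - beta3))
```

So len_memory = 5,000 corresponds to beta3 = .9998, sitting between the two values the authors recommend.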

Video on AdaMod — I’ve also made a video for AdaMod that reviews the key points of AdaMod, and does a quick walk-through with AdaMod in action on a live server using FastAI:

Source code, so you can use AdaMod:

1 — Official Github repo (PyTorch):

2 — Unofficial AdaMod and optional DiffMod variant (PyTorch, FastAI):

Summary: AdaMod represents another step forward for deep learning optimizers as it provides three improvements:

1 — No need for warmup (similar to Rectified Adam)

2 — Reduced sensitivity to the learning rate hyperparameter (converges to similar results across a wide range of rates)

3 — Usually outperforms Adam in final results due to improved stability throughout training (via the longer-term memory and the clipping derived from it).

In testing AdaMod on some datasets along with other optimizers, I find that AdaMod is consistently a top 5 optimizer.

Ranger and DeepMemory are also in the top 5 for reference, so AdaMod is in an elite group and a strong contender for use in your deep learning training!

