Meet DiffGrad: New Deep Learning Optimizer that solves Adam’s ‘overshoot’ issue

Less Wright
5 min read · Dec 26, 2019

Example of short-term gradient changes on the way to the global optimum (center). Image from the paper.

DiffGrad, a new optimizer introduced in the paper “diffGrad: An Optimization Method for Convolutional Neural Networks” by Dubey et al., builds on the proven Adam optimizer by adding an adaptive ‘friction clamp’: it monitors the local change in gradients in order to automatically lock in optimal parameter values that Adam can skip over.

Comparison of results, 300 epochs (from the paper). Note the especially large improvement on CIFAR-100 vs Adam and SGD with momentum (red column).

When local gradient changes begin to shrink during training, this is often indicative of the presence of a global minimum. DiffGrad applies an adaptive clamping effect to lock parameters into that minimum, whereas momentum-driven optimizers like Adam can get close but often fly right past it due to their inability to decelerate rapidly. The result is out-performance vs Adam and SGD with momentum, as shown in the test results above.
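For the curious, here is a minimal sketch of that update for a single parameter tensor, following the equations in the paper. The function name, the state-dictionary layout, and the hyperparameter defaults are my own choices for illustration; this is not the authors’ reference implementation.

```python
import torch

def diffgrad_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    # Illustrative single-tensor diffGrad update (not the official implementation).
    state['step'] += 1
    step = state['step']
    exp_avg, exp_avg_sq, prev_grad = state['exp_avg'], state['exp_avg_sq'], state['prev_grad']

    # Standard Adam first and second moment estimates
    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])

    bias_correction1 = 1 - betas[0] ** step
    bias_correction2 = 1 - betas[1] ** step

    # diffGrad friction coefficient: sigmoid of the absolute change in the gradient.
    # Near a minimum the gradient stops changing, so |diff| -> 0 and the coefficient -> 0.5,
    # clamping the step; where gradients change a lot it -> 1 and the step stays Adam-like.
    dfc = torch.sigmoid((prev_grad - grad).abs())

    denom = (exp_avg_sq / bias_correction2).sqrt().add_(eps)
    step_size = lr / bias_correction1

    # Friction-clamped Adam step
    param.addcdiv_(dfc * exp_avg, denom, value=-step_size)

    state['prev_grad'] = grad.clone()

# Example: one update on a toy parameter tensor
p = torch.zeros(3)
state = {'exp_avg': torch.zeros_like(p), 'exp_avg_sq': torch.zeros_like(p),
         'prev_grad': torch.zeros_like(p), 'step': 0}
diffgrad_step(p, torch.randn(3), state)
```

The key difference from Adam is the single extra line computing the friction coefficient, which scales the momentum term before the step is taken.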

Training fast, but with some regret: Adam and other ‘adaptive’ optimizers rely on computing an exponential moving average of the gradients, which allows them to take much larger steps (greater velocity) during training where the gradients are relatively consistent, versus the fixed, plodding steps of SGD.
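As a toy illustration (my own, not from the paper): with a small but consistent gradient, Adam’s bias-corrected moments cancel out the gradient’s scale, so its effective step is roughly the learning rate itself, while SGD’s step shrinks with the gradient magnitude.

```python
import math

beta1, beta2, lr, eps = 0.9, 0.999, 0.001, 1e-8
grad = 1e-3                      # small but consistent gradient
m = v = 0.0
for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
m_hat = m / (1 - beta1 ** t)     # bias-corrected first moment
v_hat = v / (1 - beta2 ** t)     # bias-corrected second moment

adam_step = lr * m_hat / (math.sqrt(v_hat) + eps)   # roughly lr, regardless of grad scale
sgd_step = lr * grad                                 # shrinks with the gradient
print(f"Adam step ≈ {adam_step:.6f}   SGD step = {sgd_step:.6f}")
```

This speed is exactly what lets Adam overshoot: the accumulated velocity keeps the step large even as the parameters approach a minimum, which is the behavior diffGrad’s friction clamp is designed to damp.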
