Meet DiffGrad: New Deep Learning Optimizer that solves Adam’s ‘overshoot’ issue
DiffGrad, a new optimizer introduced in the paper “diffGrad: An Optimization Method for Convolutional Neural Networks” by Dubey et al., builds on the proven Adam optimizer by adding an adaptive ‘friction clamp’ that monitors the local change in gradients in order to automatically lock in optimal parameter values that Adam can skip over.
When local gradient changes begin to shrink during training, it is often a sign that a global minimum may be nearby. DiffGrad applies an adaptive clamping effect to lock parameters into that minimum, whereas momentum-driven optimizers like Adam can get close but often fly right past it because they cannot decelerate quickly. The result is out-performance vs Adam and SGD with momentum, as shown in the test results above.
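To make the mechanism concrete, here is a minimal sketch of a diffGrad-style update step in plain NumPy, following the update rule described in the paper. The function name and hyperparameter defaults are my own for illustration; this is not the authors’ reference implementation.

```python
import numpy as np

def diffgrad_step(param, grad, prev_grad, m, v, t,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One diffGrad-style update (sketch).

    Identical to Adam except the step is scaled by a 'friction'
    coefficient xi computed from the change in the local gradient:
    a small change drives xi toward 0.5 (the step is damped, helping
    the parameter lock into a minimum), while a large change drives
    xi toward 1 (a full Adam-sized step).
    """
    # Adam's exponential moving averages of the gradient and squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Bias-corrected estimates
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # diffGrad friction coefficient: sigmoid of |change in gradient|
    xi = 1.0 / (1.0 + np.exp(-np.abs(prev_grad - grad)))

    # Friction-clamped Adam step
    param = param - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```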
Training fast but with some regret: Adam and other ‘adaptive’ optimizers rely on computing an exponential moving average of the gradients, which allows them to take much larger steps (or greater velocity) during training wherever the gradients are relatively consistent, vs the fixed, plodding steps of SGD.
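As a quick illustration of that difference, the sketch below contrasts a plain SGD step with the exponential moving average (first moment) that Adam maintains. The names are mine, and Adam’s remaining machinery (second moment, bias correction) is omitted for brevity.

```python
def sgd_step(param, grad, lr=0.01):
    # Plain SGD: a fixed-size step along the current gradient only.
    return param - lr * grad

def adam_first_moment(m, grad, beta1=0.9):
    # Adam's exponential moving average of gradients: when recent gradients
    # consistently point the same way, m builds up 'velocity' in that
    # direction, while noisy, conflicting gradients partially cancel out.
    return beta1 * m + (1 - beta1) * grad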