Label Smoothing & Deep Learning: Google Brain explains why it works and when to use (SOTA tips)
Hinton, Müller and Kornblith from Google Brain released a new paper titled “When does label smoothing help?” that dives deep into how label smoothing affects the final activation layer of deep neural networks. They built a new visualization method to clarify the internal effects of label smoothing. While label smoothing is often used, this paper explains why and how it affects NNs, and offers valuable insight as to when, and when not, to use it.
This article is a summary of the paper’s insights to help you quickly leverage the findings for your own deep learning work. The full paper is recommended for deeper analysis.
What is Label Smoothing?: Label smoothing is a loss function modification that has been shown to be very effective for training deep learning networks. Label smoothing improves accuracy in image classification, translation, and even speech recognition. Our team used it, for example, in breaking a number of FastAI leaderboard records.
The simple explanation of how it works is that it changes the training target for the NN from a hard ‘1’ to ‘1-label smoothing adjustment’, meaning the NN is trained to be a bit less confident of its answers. The default value is usually .1, meaning the target answer is .9 (1 minus .1) and not 1.
Example: assume we want to classify images into dogs and cats. If we see a photo of a dog, we train the NN (via cross entropy loss) to move towards a 1 for dog, and 0 for cat. And if a cat, the reverse, where we train towards 1 for cat, 0 for dog. In other words, a binary or “hard” answer.
However, NNs have a bad habit of becoming ‘over-confident’ in their predictions during training, and this can reduce their ability to generalize and thus perform as well on new, unseen future data. In addition, large datasets can often include incorrectly labeled data, meaning the NN should inherently be a bit skeptical of the ‘correct answer’ so that it does not fit too closely to some number of bad labels.
Thus, what label smoothing does is force the NN to be less confident in its answers by training it to move towards the ‘1-adjustment’ target rather than simply a hard 1, and dividing the adjustment amount over the remaining classes.
For our binary dog/cat example, label smoothing of .1 means the target answer would be .90 (90% confident) it’s a dog for a dog image, and .10 (10% confident) it’s a cat, rather than the previous move towards 1 or 0. Because the network is trained to be a bit less certain, label smoothing acts as a form of regularization and improves its ability to perform on new data.
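To make the dog/cat numbers concrete, here is a minimal sketch in plain Python of how the smoothed targets described above could be built. The function name `smooth_targets` and the class ordering are hypothetical and chosen only for illustration; the sketch assumes the adjustment is split evenly over the non-target classes, as described in the text.

```python
def smooth_targets(true_index, num_classes, epsilon=0.1):
    """Soft targets: 1 - epsilon for the true class, epsilon shared by the rest.

    Assumption: the adjustment is divided evenly over the remaining classes,
    matching the description in the text (other variants exist).
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    off_value = epsilon / (num_classes - 1)
    targets = [off_value] * num_classes
    targets[true_index] = 1.0 - epsilon
    return targets


# Dog/cat example: class 0 = dog, class 1 = cat (hypothetical ordering).
dog_targets = smooth_targets(true_index=0, num_classes=2, epsilon=0.1)
print(dog_targets)  # roughly [0.9, 0.1] instead of the hard [1.0, 0.0]
```

With more classes the same 0.1 is simply spread thinner: for five classes each wrong class would receive 0.025.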
Seeing label smoothing in code may help drive home how it works better than the usual math (from FastAI github). The greek letter that looks like an E (epsilon or ε ) is the label smoothing adjustment factor:
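As a rough illustration of the idea, and not the FastAI implementation itself, here is a minimal plain-Python sketch of a label-smoothed cross entropy loss for a single example. It assumes the uniform formulation used in the paper, where ε is spread over all K classes, so the true class target works out to 1 − ε + ε/K, slightly different from the plain 1 − ε described above. The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def label_smoothing_cross_entropy(logits, true_index, epsilon=0.1):
    """Cross entropy against targets (1 - epsilon) * one_hot + epsilon / K.

    Assumption: epsilon is spread uniformly over all K classes.
    """
    log_probs = log_softmax(logits)
    num_classes = len(logits)
    nll = -log_probs[true_index]               # usual hard-target loss
    uniform = -sum(log_probs) / num_classes    # loss against a uniform target
    return (1.0 - epsilon) * nll + epsilon * uniform


# A very confident, correct prediction (hypothetical logits for [dog, cat]).
confident_logits = [20.0, 0.0]
hard_loss = label_smoothing_cross_entropy(confident_logits, 0, epsilon=0.0)
smooth_loss = label_smoothing_cross_entropy(confident_logits, 0, epsilon=0.1)
print(hard_loss, smooth_loss)
```

With hard targets the extremely confident prediction is nearly free, while the smoothed loss still penalizes it, which is what discourages the network from pushing its logits ever further apart.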
Label smoothing’s effect on Neural Networks: Now we get to the heart of the paper, which shows visually how label smoothing affects the NN’s classification processing.
First, AlexNet classifying “airplane, automobile and bird” during training.
And then validation:
As you can see, label smoothing enforces much tighter groupings for the classification while at the same time enforcing more equidistant spacing between the clusters.
A ResNet example for “beaver, dolphin and otter” is even more clarifying:
As the images visually show, label smoothing produces tighter clustering and greater separation between categories for the final activations.
This is a primary reason why label smoothing produces more regularized and robust neural networks that, importantly, tend to generalize better on future data. However, there’s an additional beneficial effect beyond just better activation centering…
Implicit Network Calibration from Label Smoothing: In this paper, Hinton et al. proceed from the visualizations to show how label smoothing automatically helps calibrate the network and reduce the network’s calibration error, without requiring a manual temperature adjustment.
Previous research (Guo et al.) has shown that neural networks are often over-confident and poorly calibrated relative to their true accuracy. To quantify this, Guo et al. used a calibration measure known as ECE (Expected Calibration Error). Using this measure, they were then able to adjust the calibration of a given neural network with a post-training modifier known as temperature scaling, and move the network’s confidence into better alignment with its true accuracy (reducing ECE). Temperature scaling is performed by rescaling the final logits with a single temperature scalar (dividing them by a temperature T) before passing them to the softmax function. Because this does not change which class scores highest, it corrects the network’s confidence rather than its accuracy.
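Temperature scaling as described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the function name and the example logits are hypothetical, and the temperature is fixed by hand here, whereas in practice it is tuned on held-out data.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature T > 0.

    T > 1 softens the distribution (less confident); T = 1 is plain softmax.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [5.0, 1.0, 0.0]  # hypothetical final-layer outputs for three classes
plain = softmax_with_temperature(logits, temperature=1.0)
softened = softmax_with_temperature(logits, temperature=2.0)
print(max(plain), max(softened))
```

Dividing by a positive temperature keeps the ordering of the logits, so the predicted class is the same in both cases; only the confidence attached to it changes.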
The paper shows a number of examples, but the best example of this is a ResNet trained with and without label smoothing on ImageNet, with both networks compared against a temperature-adjusted network.
As you can see, training with label smoothing produces a network that has much better ECE (Expected Calibration Error) and, more simply put, has a more ideal confidence relative to its own accuracy.
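For readers who want to measure this on their own models, here is a minimal sketch of ECE in plain Python. It assumes the common formulation with equal-width confidence bins, where ECE is the sample-weighted average gap between accuracy and mean confidence in each bin; the function name, the bin count of 10, and the toy inputs are assumptions for illustration, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: predicted probability of the chosen class, per sample.
    correct: 1 if the prediction was right, 0 otherwise, per sample.
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    count = [0] * num_bins
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in last bin
        conf_sum[idx] += conf
        hit_sum[idx] += hit
        count[idx] += 1
    ece = 0.0
    for b in range(num_bins):
        if count[b] == 0:
            continue
        accuracy = hit_sum[b] / count[b]
        avg_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece


# Toy data: an over-confident model (95% sure, right half the time)
# versus a calibrated one (75% sure, right three times out of four).
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
print(overconfident, calibrated)
```

A lower value means the reported confidence tracks the real hit rate more closely, which is the property the paper reports label smoothing improving.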
In effect, a label smoothed network is not ‘over-confident’ and as a result should generalize and perform better on live future data.
Knowledge Distillation (or when not to use label smoothing): The final section of the paper discusses the finding that while label smoothing produces improved neural networks for a variety of tasks…it should not be used if the final model will serve as a teacher for other ‘student’ networks.
The authors note that while training with label smoothing improves the final accuracy of the teacher, it fails to transfer as much knowledge to the student network compared to teachers trained on ‘hard’ targets (no label smoothing).
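For context on what the student actually learns from, here is a minimal plain-Python sketch of a distillation loss. It assumes the commonly used form in which the student is trained with cross entropy against the teacher's temperature-softened output probabilities; the paper's exact training setup may differ, and the function names, temperature, and logits are hypothetical.

```python
import math


def log_softmax_t(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross entropy between softened teacher probabilities and the student's.

    Assumption: both sides use the same temperature, with no hard-label term.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax_t(teacher_logits, temperature)]
    student_log_probs = log_softmax_t(student_logits, temperature)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


teacher = [4.0, 2.5, -1.0]   # hypothetical: class 1 is "similar" to class 0
matching_student = [4.0, 2.5, -1.0]
naive_student = [4.0, -1.0, -1.0]  # right top class, ignores the similarity
print(distillation_loss(matching_student, teacher),
      distillation_loss(naive_student, teacher))
```

The relative sizes of the teacher's non-target outputs are the extra signal the student receives beyond the hard label, which gives one way to read why a teacher that erases those differences has less to pass on.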
The reason label smoothing appears to produce models that are poor teachers is somewhat shown in the initial visualizations. By forcing the final classifications into tighter clusters, the network drops additional details to focus on the core distinctions between classes.
This ‘rounding’ helps the network to perform better on unseen data. However, the information lost ends up negatively impacting its ability to teach new student models.
As a result, a teacher with better accuracy is not necessarily one that distills information to students better.
Summary and SOTA tips: In almost all cases, training with label smoothing produces a better calibrated network that generalizes better and ultimately is more accurate on unseen production data.
Thus label smoothing should be part of most deep learning training by default.
However, the one case where it is not useful is for building networks that will later serve as teachers. There, hard target training will produce a better teacher neural network.
Notes: Similar to the above excellent paper and its guidance for leveraging label smoothing, a number of useful papers have been published in the past three months. I’m thus planning to produce summaries and ‘training tips’ from the latest and greatest SOTA papers for those involved in training and building neural networks (and will use the same on our team’s next effort at breaking more FastAI leaderboard records). Stay tuned!