Comparison of new activation functions for deep learning. Results favor FTSwishPlus
Three new activation functions for deep learning, FTSwish, LiSHT, and TRelu, have recently been proposed, along with modifications to the current workhorse, ReLU, in the form of General ReLU. These new activation functions, along with the additions of ‘mean shift’ and clamping from FastAI’s advanced dev course, are run through initial testing on FastAI’s ImageNette vision dataset and ImageWoof (dog classification) to provide insight into their potential advantages for deep learning and, specifically, image classification tasks.
Executive Summary: FTSwish with mean shifting (FTSwish+) was the top-performing activation, providing both the highest net accuracy and one of the smoothest training curves.
Activation functions provide the non-linearity vital to allowing deep learning networks to perform an impressive range of tasks. ReLU (Rectified Linear Unit) has been the default workhorse in deep learning for some time. However, concerns over ReLU’s removal of all negative values, and the associated dying-gradient problem, have prompted new activation functions that handle negative values differently.
Two new activation functions have been published this year; together with the ReLU-based variants below, they make up the testing lineup:
1 — FTSwish = Fixed Threshold Swish-like activation unit (paper: https://arxiv.org/abs/1812.06247). This provides a swish-like curve for positive values and a fixed negative value (the threshold) for all negative values.
2 — LiSHT = Non-parametric Linearly Scaled Hyperbolic Tangent activation unit (paper: https://arxiv.org/abs/1901.05894#). Effectively, this function scales the tanh function such that negative values are returned as positive, producing a parabola-like curve.
3 — General ReLU: introduced by FastAI, this is ReLU plus a leakiness option to address ReLU’s lack of negative-value handling, along with ‘mean shift’ and ‘clamping’.
4 — ReLU: the default activation for deep learning, and thus naturally included as the baseline.
5 — Finally, TRelu (Threshold ReLU) was developed to isolate whether FTSwish’s performance was driven by the curve on its positive values or simply by the negative threshold activation. It is ReLU plus a flat threshold value of -.25 for all negative values. (Rough PyTorch sketches of these activations are shown after this list.)
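For concreteness, here are the candidates written out as simple functions. The thresholds follow the descriptions above, while the General ReLU leak, shift, and clamp values are illustrative placeholders rather than the exact tested settings:

```python
import torch
import torch.nn.functional as F

def ftswish(x, threshold=-0.25):
    # swish-like curve for x >= 0; every negative input collapses to the fixed threshold
    return torch.where(x >= 0, x * torch.sigmoid(x) + threshold,
                       torch.full_like(x, threshold))

def lisht(x):
    # linearly scaled tanh: negatives come back positive, giving a parabola-like curve
    return x * torch.tanh(x)

def trelu(x, threshold=-0.25):
    # ReLU on the positive side, plus a flat threshold for all negative values
    return torch.where(x >= 0, x, torch.full_like(x, threshold))

def general_relu(x, leak=0.1, sub=0.4, maxv=6.0):
    # leaky ReLU with a mean shift (sub) and clamping (maxv); these constants
    # are placeholders, not necessarily the values used in the experiments
    return F.leaky_relu(x, leak).sub(sub).clamp_max(maxv)
```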
Initial testing was performed using:
1 — FastAI’s XResNet34 on ImageNette: cyclical learning with a max learning rate of 1e-2 and up to 40 epochs. ImageNette is a vision dataset that is a subset of ImageNet, but with only 10 classification categories. Image size was 160px.
2 — FastAI’s XResNet50 on ImageWoof (12,954 images): cyclical learning with a max learning rate of 1e-3 and 80 epochs. Image size was 320px. (A rough sketch of the ImageNette setup is shown after this list.)
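Under the released fastai API, the ImageNette run looks roughly like the following. The original tests used the course’s dev notebooks, so the data-loading calls, batch size, and the act_cls hook for swapping activations into XResNet are my assumptions here:

```python
from fastai.vision.all import *
from torch import nn

# ImageNette at 160px, as used in the tests above
path = untar_data(URLs.IMAGENETTE_160)
dls = ImageDataLoaders.from_folder(path, valid='val', item_tfms=Resize(160), bs=64)

# XResNet34 with the activation under test; nn.ReLU is the baseline, and the
# other activations (e.g. the FTSwishPlus module shown later) are swapped in
# via the same act_cls argument
model = xresnet34(n_out=dls.c, act_cls=nn.ReLU)
learn = Learner(dls, model, metrics=accuracy)

# cyclical (one-cycle) schedule with a max learning rate of 1e-2, up to 40 epochs
learn.fit_one_cycle(40, lr_max=1e-2)
```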
Initial testing had General ReLU as the winner on ImageNette. However, given that this was ReLU plus some additional enhancements, I opted to create LiSHT+ and FTSwish+, which are the respective activations plus mean shifting and clamping.
Ultimately, clamping did not improve the overall results: it usually produced a better starting point, but in the end it hindered final accuracy rather than improving it. Clamping was thus not used for the larger ImageWoof testing.
However, mean shifting for FTSwish (henceforth, FTSwish+) made a tremendous difference.
To clarify mean shifting: this was introduced by FastAI/Jeremy Howard, and the idea is to add a constant offset to the activation function so that its output has a mean of zero and, ideally, a standard deviation of 1 for a random tensor, representing the weights, run through kaiming initialization. In this case, FTSwish requires an adjustment of -.1 to get a mean of zero. This adjustment is then applied every time the FTSwish activation is used. Below is the testing of -.1, showing how it produces a mean-0 result:
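As a minimal sketch of that check (the tensor shapes and the kaiming-normal linear setup are my assumptions, not the exact course test), the idea is to compare the post-activation statistics with and without the shift:

```python
import torch

def ftswish(x, threshold=-0.25, mean_shift=0.0):
    # FTSwish with an optional constant shift added to the output
    out = torch.where(x >= 0, x * torch.sigmoid(x) + threshold,
                      torch.full_like(x, threshold))
    return out + mean_shift

torch.manual_seed(0)
inp = torch.randn(1000, 512)          # random input batch
w = torch.empty(512, 512)
torch.nn.init.kaiming_normal_(w)      # kaiming-initialized weights
pre = inp @ w                         # simulated pre-activations

for shift in (0.0, -0.1):
    act = ftswish(pre, mean_shift=shift)
    print(f"shift {shift}: mean {act.mean():.3f}, std {act.std():.3f}")
```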
In addition, while the original FTSwish paper recommends a threshold of -.2, I also tested -.3 and -.4. Both -.2 and -.3 did well, so I then tested -.25, and that was ultimately the winner overall. Here are the results for both the ImageNette and ImageWoof testing:
One additional bonus for FTSwish+ was that its epoch-to-epoch training path is very smooth. On the ImageNette dataset it never took a step backwards, whereas both General ReLU and ReLU had epochs that dropped below a previous epoch. LiSHT+ was also quite smooth, but never arrived at a competitive final result.
The original paper did not release any source code, so below is my implementation of FTSwish+, i.e. FTSwish with the performance-improving mean shift.
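Here is a minimal PyTorch sketch of FTSwish+ as described above (fixed threshold of -.25, swish-like positive branch, constant -.1 mean shift); the maintained version in the repo below may differ in small details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FTSwishPlus(nn.Module):
    """FTSwish with a constant mean shift (FTSwish+), as described in the text."""
    def __init__(self, threshold=-0.25, mean_shift=-0.1):
        super().__init__()
        self.threshold = threshold
        self.mean_shift = mean_shift

    def forward(self, x):
        # relu(x) * sigmoid(x) equals x * sigmoid(x) for x >= 0 and 0 otherwise,
        # so adding the threshold gives the swish-like positive branch and a
        # flat threshold value for all negatives
        x = F.relu(x) * torch.sigmoid(x) + self.threshold
        # constant shift that centers the post-activation mean near zero
        if self.mean_shift is not None:
            x = x + self.mean_shift
        return x
```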
Source is also available and maintained here: https://github.com/lessw2020/FTSwishPlus
(TRelu and LightRelu (LiSHT+) are also available as repos there).
For the ImageWoof testing with 80 epochs, both FTSwishPlus and TRelu were run twice to verify their results as the first- and second-place finishers from the ImageNette portion. ReLU was also run again (not shown), with much lower results.
Computational efficiency: ReLU averaged 2:01 per epoch, vs an average of 2:40+ for FTSwishPlus. FTSwishPlus thus adds roughly 30-40% in training time, but for maximum accuracy it is the recommended activation function to work with.
Hopefully these results make FTSwishPlus a promising option for you to explore with your own datasets, beyond the basic ReLU!