EfficientNet from Google — Optimally Scaling CNN model architectures with “compound scaling”

Less Wright
4 min read · May 30, 2019


Google recently published both an exciting paper and the source code for a newly designed CNN (convolutional neural network) architecture called EfficientNet, which sets new records for both accuracy and computational efficiency.

This was not a minor improvement: accuracy rose by up to 6% while the models are on the order of 5–10x more efficient than most current CNNs. The underlying findings should serve as solid guides for anyone looking to architect better CNNs in the future, and this article reviews the main tenets the authors uncovered.

Here are the results of EfficientNet at its different scales (B1, B2, etc.) vs. most other popular CNNs.

Image from the EfficientNet paper (https://arxiv.org/abs/1905.11946)

As the image shows, EfficientNet tops the current state of the art both in accuracy and in computational efficiency. How did they do this?

Lesson 1 — They learned that CNNs must be scaled up in multiple dimensions. Scaling a CNN in only one dimension (e.g., depth only) results in rapidly deteriorating gains relative to the computational increase required.

Most CNNs are typically scaled up by adding more layers, i.e., going deeper: ResNet18, ResNet34, ResNet152, etc. The number denotes the total count of layers, and in general, the more layers, the more ‘power’ the CNN has.
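As a quick illustration (my own sketch, not from the paper), torchvision’s stock ResNets show how the parameter count balloons with depth:

```python
# Sketch: parameter count growth as ResNet depth increases (torchvision models).
import torchvision.models as models

for name in ("resnet18", "resnet34", "resnet152"):
    model = getattr(models, name)(weights=None)  # older torchvision: pretrained=False
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```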

This ability to increase depth came largely from Kaiming He’s ResNet paper, which introduced the skip (or identity) connection. The skip connection is what made training deeper networks possible, and ResNet has since been a dominant architecture for computer vision.
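For reference, here is a minimal sketch of a residual block in PyTorch (my own simplified version, not He’s exact block); the addition of x on the output is the skip connection that keeps gradients flowing through very deep stacks:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block: output = relu(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # skip connection: identity path bypasses the convs

block = BasicBlock(64)
y = block(torch.randn(1, 64, 32, 32))  # same shape in, same shape out
```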

However, as this paper shows, simply going deeper saturates the gains rapidly: ResNet-1000 is barely more accurate than ResNet-101, for example, since the gains drop off quickly after 100–150 layers.

Going wider is another often-used scaling method; wider networks tend to capture finer-grained features and can be easier to train. However, the benefits quickly saturate as well.
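For context, width scaling in practice means multiplying channel counts and rounding them to hardware-friendly values; here’s a sketch in the spirit of the round_filters helper in the official EfficientNet code:

```python
def scale_width(channels: int, width_mult: float, divisor: int = 8) -> int:
    """Scale a channel count and round to a multiple of `divisor`."""
    scaled = channels * width_mult
    new_c = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    if new_c < 0.9 * scaled:  # never round down by more than 10%
        new_c += divisor
    return new_c

print(scale_width(32, 1.1))       # rounds back to 32: small multipliers...
print(scale_width(32, 1.1 ** 3))  # ...compound across scaling steps -> 40
```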

Here are the graphs of accuracy gains versus FLOPS (computation) as you scale up depth, width, and image resolution independently:

Image from https://arxiv.org/abs/1905.11946, headers added for clarity

Lesson 2 — To scale up efficiently, all dimensions (depth, width, and resolution) have to be scaled together, and there is an optimal balance for each dimension relative to the others.

The authors discovered that there is a synergy in scaling multiple dimensions together, and after an extensive grid search derived an empirically optimal “compound scaling” formula with the following coefficients:

Depth = 1.20

Width = 1.10

Resolution = 1.15

In other words, for each step up in scale, the network’s depth should increase by 20%, its width by 10%, and the image resolution by 15%, keeping the scaling as efficient as possible while growing the network and improving its accuracy.
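Here’s a minimal sketch of that rule, following the paper’s formulation where a single compound coefficient phi drives all three dimensions at once:

```python
# Compound scaling rule from the paper: alpha, beta, gamma were found by a
# grid search subject to alpha * beta^2 * gamma^2 ~= 2, so each step of phi
# roughly doubles FLOPS (FLOPS scale with depth * width^2 * resolution^2).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: float):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(1, 4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

(The released B1–B7 models round these multipliers to hand-picked per-variant settings, as far as I can tell from the official code, but the rule above is the underlying recipe.)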

This compound scaling formula is used to scale EfficientNet up from B0 to B7 (progressively larger models).

The paper then compares EfficientNet’s accuracy against most other CNNs on ImageNet, with striking results for the number of parameters and computations required: a 5x+ reduction while matching or beating the accuracy of almost every other CNN architecture.

Perhaps more importantly, the authors then test EfficientNet with transfer learning (using weights pre-trained on ImageNet) on multiple datasets. This is how most CNNs are actually put to work in real products, and the results are quite impressive:

EfficientNet showing outstanding results via Transfer Learning on multiple datasets
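For a sense of what that workflow looks like, here’s a hedged sketch using the lukemelas EfficientNet-PyTorch package linked at the end of this article (the from_pretrained API and num_classes argument are taken from that repo’s README):

```python
# Sketch: transfer learning with ImageNet-pretrained EfficientNet weights,
# via the lukemelas/EfficientNet-PyTorch package (linked below).
from efficientnet_pytorch import EfficientNet

# Load ImageNet-pretrained weights; num_classes re-initializes the final layer
model = EfficientNet.from_pretrained('efficientnet-b0', num_classes=10)

# Freeze the backbone and train only the new head for a quick first pass
for name, p in model.named_parameters():
    p.requires_grad = name.startswith('_fc')
```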

As further evidence of the efficiency of scaling in multiple dimensions at once, here’s a heatmap comparison from the paper showing how the EfficientNet architecture homes in on the relevant items in an image, compared with architectures scaled along a single dimension:

The last heatmap shows the results of compound scaling — note how much more cleanly it captures the items.

Finally, the authors show that EfficientNet can be up to 5x+ faster at inference (live use) on mobile phones.
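If you want to check latency yourself, a minimal timing sketch (my own, not the paper’s benchmark setup) might look like this:

```python
import time
import torch
from efficientnet_pytorch import EfficientNet

model = EfficientNet.from_name('efficientnet-b0')  # random weights; fine for timing
model.eval()
x = torch.randn(1, 3, 224, 224)  # one image at B0's native resolution

with torch.no_grad():
    for _ in range(5):   # warmup
        model(x)
    start = time.perf_counter()
    for _ in range(20):
        model(x)
print(f"avg CPU latency: {(time.perf_counter() - start) / 20 * 1000:.1f} ms")
```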

The huge reduction in parameters and computations required by EfficientNet may open up new opportunities for CNNs on mobile platforms and represents a big leap forward for Mobile AI!

Overall, this is a breakthrough in CNN architecture and shows the need for CNN developers to think in multiple dimensions when scaling up architectures.

Here’s the direct link to the paper: https://arxiv.org/abs/1905.11946

Source code is available as well:

1 — TensorFlow official: https://github.com/mingxingtan/efficientnet

2 — PyTorch has multiple implementations, and I’m personally working on one to hopefully roll into FastAI:

A — https://github.com/zsef123/EfficientNets-PyTorch

B — https://github.com/lukemelas/EfficientNet-PyTorch

C — https://github.com/rwightman/pytorch-image-models

(this one has EfficientNet intermingled with a number of other models, so the code is harder to follow, but otherwise it looks really good).

D — In progress (mine): https://github.com/lessw2020/EfficientNet-PyTorch
