Res2Net: a new multi-scale deep learning architecture for improved object detection with existing backbones

As Google Brain’s EfficientNet paper showed, scaling up the usual aspects of CNN architectures (width, depth, resolution) yields rapidly diminishing returns on investment.

A new paper by Gao, Cheng, Zhao et al. (“Res2Net: A New Multi-scale Backbone Architecture”), however, shows that multi-scale representation within a given block, rather than the usual layer-by-layer scaling, is a largely unexplored direction with additional payoffs, especially for object recognition and segmentation.

Most architectures handle scale on a layer-by-layer basis. The innovation here is to employ a hierarchical, cascading set of feature groups (termed ‘scales’) within a given residual block, replacing the single generic 3x3 kernel.

Toward that end, they rebuilt the bottleneck block of the common ResNet architecture, replacing the standard 1x1 – 3x3 – 1x1 layout with a hierarchical residual design in which the middle 3x3 convolution is split into four smaller “scale” groups. The middle convolution thus moves from single-branch to multi-branch, and the resulting network is “Res2Net”.

A diagram makes the idea clear:

The concept is to capture different levels of scale within the image at a more granular level, via increased receptive-field combinations inside the block rather than layer by layer, thereby improving the CNN’s ability to detect and delineate objects in the image.

The authors term the number of feature groups within a Res2Net block the “scale” dimension; the block above is thus a scale-4 Res2Net block.
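To make the hierarchical split concrete, here is a minimal PyTorch sketch of a Res2Net-style bottleneck. It is my own simplified rendering of the idea described above, not the official implementation: batch norm, stride handling, and downsampling shortcuts are omitted for brevity, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class Res2NetBottleneck(nn.Module):
    """Simplified Res2Net bottleneck: the single 3x3 conv of a ResNet
    bottleneck is replaced by a cascade of smaller 3x3 convs, one per
    feature group ("scale"), with each group also receiving the previous
    group's output. Batch norm omitted for brevity."""

    def __init__(self, channels=256, width=26, scale=4):
        super().__init__()
        self.scale = scale
        mid = width * scale                       # total width of the multi-branch stage
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
        # one 3x3 conv per group, except the first group which passes through
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, bias=False)
            for _ in range(scale - 1)
        )
        self.conv3 = nn.Conv2d(mid, channels, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        xs = torch.chunk(out, self.scale, dim=1)  # split into `scale` feature groups
        ys = [xs[0]]                              # group 1: identity pass-through
        for i, conv in enumerate(self.convs):
            # cascade: each later group sees its own split plus the previous output
            inp = xs[i + 1] if i == 0 else xs[i + 1] + ys[-1]
            ys.append(self.relu(conv(inp)))
        out = self.conv3(torch.cat(ys, dim=1))
        return self.relu(out + x)                 # residual connection
```

The cascade is what widens the effective receptive field inside the block: group 4 has been through three stacked 3x3 convs by the time the groups are concatenated, while group 1 has been through none.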

These Res2Net modules can then be plugged into standard ResNet or ResNeXt CNNs to increase their granularity.

Res2Net shines on segmentation-type tasks, where the improved object discernment comes into play. A comparison of semantic segmentation outputs is shown below:

ResNet-101 vs Res2Net-101: the granularity of the Res2Net blocks improves segmentation. (GT = ground truth)

As you can see above, introducing the scale dimension within the bottleneck helps the CNN better outline the items of interest in the image, and thus improves overall accuracy.

Improvements on ImageNet, where the only change is switching to Res2Net blocks within each architecture.

The github for Res2Net is here:

Updating the ResNet for Res2Net = Res2NetPlus

However, I found that the official Res2Net implementation was built on an older-style ResNet in several respects. I therefore took the Res2Net implementation from github @frgfm (based on github @gasvn) and modified it to:

1 — Use Mish instead of ReLU for activation (see my article on Mish here for why)

2 — Replace the stem of the ResNet with a more modern stack of three 3x3 convolutions (stride 2, stride 1, stride 1) instead of the older single 7x7 kernel.

3 — Reverse the BN->activation ordering to activation->BN. This is based on our findings from FastAI research (credit to Ignacio Oguiza) and backed by the Tencent paper “Rethinking the usage of Batchnorm…”:

“we should not place Batch Normalization before ReLU since the non-negative responses of ReLU will make the weight layer update in a suboptimal way…”
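The three changes above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the actual Res2NetPlus source: the intermediate channel widths in the stem (32 → 32 → 64) are my assumption based on common practice for three-conv stems.

```python
import torch
import torch.nn as nn

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x)), used in place of ReLU."""
    def forward(self, x):
        return x * torch.tanh(nn.functional.softplus(x))

def conv_act_bn(cin, cout, stride):
    # note the ordering: activation BEFORE batch norm, per change #3
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        Mish(),
        nn.BatchNorm2d(cout),
    )

def stem(cin=3, cout=64):
    """Three stacked 3x3 convs (stride 2, 1, 1) replacing the single 7x7
    kernel of the older ResNet stem (change #2). Channel widths assumed."""
    return nn.Sequential(
        conv_act_bn(cin, cout // 2, stride=2),
        conv_act_bn(cout // 2, cout // 2, stride=1),
        conv_act_bn(cout // 2, cout, stride=1),
    )
```

The three stacked 3x3 convs cover the same receptive field as a 7x7 while adding two extra non-linearities at lower parameter cost, which is why this stem has become the common replacement.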

PyTorch code is available at:

Usage is simple:

Using Res2NetPlus in the FastAI framework (this creates a scale-4, width-26 Res2Net-50)

Res2NetPlus at work:

In consulting work building a solar panel detector from satellite imagery, I trained a Res2NetPlus-50 from scratch and compared it to a standard ImageNet-pretrained ResNet-50 with a transfer-learning head. I found the Res2NetPlus-50 to have both greater accuracy (+5%) and steadier training.

Ultimately the model went into live production last week, with 97.8% accuracy on the validation data.

Initial production results were in line with training results:


Res2Net, while having computational complexity similar to the equivalent ResNet, still runs slower than its ResNet counterpart (roughly 20% on average).

In addition, on classification tasks such as the FastAI leaderboard datasets, Res2Net set records for validation and training loss (i.e., more confident when correct, less wrong when wrong) but not for final absolute accuracy.

This is one issue I have not figured out how to correct, other than to hypothesize that some classification tasks may not rely as heavily on fine-grained object distinction.

Thus, the optimal use of Res2Net appears to be object recognition and segmentation tasks.

One tip — Res2Net loves advanced data augmentation such as MixUp and CutMix. You can watch the validation losses plummet when these are used, so I highly recommend pairing Res2Net with extensive data augmentation.
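For reference, here is a minimal sketch of the standard MixUp formulation in PyTorch. This is the common input-mixing variant, not code from the Res2Net or FastAI repos; the function name and defaults are my own (alpha around 0.2–0.4 is typical).

```python
import torch

def mixup(x, y, alpha=0.4):
    """MixUp: blend each image (and its one-hot/soft target) with a
    randomly chosen partner from the same batch, using a mixing
    coefficient sampled from Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))          # random partner for each sample
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]     # targets as one-hot / soft labels
    return x_mix, y_mix
```

An equivalent formulation keeps the hard labels and instead mixes the losses: `lam * loss(pred, y) + (1 - lam) * loss(pred, y[perm])`, which avoids converting targets to one-hot form.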


Official repository for Res2Net:

Res2NetPlus architecture:

