What LayerNorm really does for Attention in Transformers

Less Wright
3 min read · May 14, 2023
LayerNorm is more than scaling…

2 things, not 1…

Normalization via LayerNorm has been part and parcel of the Transformer architecture for some time. If you asked most AI practitioners why we have LayerNorm, the generic answer would be that we use LayerNorm to normalize the activations on the forward pass, and in turn the gradients on the backward pass.
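As a minimal PyTorch sketch of what that generic answer describes (the tensor shapes and `d_model` value here are illustrative, not from the post): LayerNorm re-centers and re-scales each token's activation vector, so every position ends up with roughly zero mean and unit variance, plus a learnable gain and bias.

```python
import torch
import torch.nn as nn

d_model = 8
ln = nn.LayerNorm(d_model)  # learnable weight (gain) and bias are enabled by default

# (batch, seq_len, d_model) activations, deliberately shifted and scaled off-norm
x = torch.randn(2, 4, d_model) * 5 + 3
y = ln(x)

# Each token's feature vector is normalized independently across d_model
print(y.mean(dim=-1))                  # ~0 at every (batch, position)
print(y.var(dim=-1, unbiased=False))   # ~1 at every (batch, position)
```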
