What LayerNorm really does for Attention in Transformers
3 min read · May 14, 2023
2 things, not 1…
Normalization via LayerNorm has been part and parcel of the Transformer architecture for some time. If you asked most AI practitioners why we use LayerNorm, the generic answer would be that it normalizes the activations on the forward pass and, in turn, stabilizes the gradients on the backward pass.
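As a refresher, here is a minimal NumPy sketch of that standard LayerNorm computation: per-token normalization over the feature dimension, followed by a learned scale and shift. The names `gamma`, `beta`, and the `eps` value are the usual conventions, not code from any particular library.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (..., d_model) activations; gamma, beta: (d_model,) learned parameters
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per token
    return gamma * x_hat + beta               # learned re-scale and re-shift

# Example: normalize one token's 8-dimensional activation vector
x = np.random.randn(1, 8) * 3 + 5
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(), out.std())  # ~0 and ~1 before the (identity) scale/shift
```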