What LayerNorm really does for Attention in Transformers

Less Wright
3 min read · May 14
[Image: LayerNorm is more than scaling… (image credit)]

2 things, not 1…

Normalization via LayerNorm has been part and parcel of the Transformer architecture for some time. If you asked most AI practitioners why we have LayerNorm, the generic answer would be that we use it to normalize the activations on the forward pass and the gradients on the backward pass.
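As a reminder of what that normalization actually computes, here is a minimal NumPy sketch of LayerNorm: each token's feature vector is normalized to zero mean and unit variance, and then re-scaled and shifted by learned per-feature parameters (`gamma` and `beta`). The function and variable names here are illustrative, not from any particular library.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's feature vector (last axis) to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Then apply the learned elementwise scale (gamma) and shift (beta).
    return gamma * x_hat + beta

d_model = 4
x = np.array([[1.0, 2.0, 3.0, 4.0]])  # one token with 4 features
out = layer_norm(x, np.ones(d_model), np.zeros(d_model))
print(out.mean(), out.std())  # approximately 0 and 1
```

With `gamma` initialized to ones and `beta` to zeros, this is pure normalization; during training those parameters learn to re-scale and shift the normalized activations, which is the second of the "2 things" the title alludes to.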
