Meet ALBERT: a new ‘Lite BERT’ from Google & Toyota with State of the Art NLP performance and 18x fewer parameters.

Less Wright
7 min read · Sep 28, 2019

TL;DR = your previous NLP models are parameter-inefficient and kind of obsolete. Have a great day.

[*Updated November 6 with ALBERT 2.0 and the official source code release]

Google Research and Toyota Technological Institute jointly released a new paper that introduces what is arguably BERT’s successor: a much smaller, smarter Lite BERT called ALBERT (“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”).

ALBERT sets a new SOTA for SQuAD and RACE, beating BERT by +14.5% on RACE … but wait until you compare parameter sizes below. (1M and 1.5M refer to the number of training steps used.)

ALBERT’s results are impressive in their own right (setting a new state of the art for GLUE, RACE, and SQuAD), but the real surprise is the dramatic reduction in model/parameter size.

A combination of two key architecture changes and a training change allows ALBERT to both outperform BERT and dramatically reduce model size. Consider the size comparison below: BERT-xlarge has 1.27 billion parameters, vs. ALBERT-xlarge with 59 million parameters!

1.27 billion params in BERT-xlarge vs. 59M in ALBERT-xlarge for the same size network (hidden size/layers) … ~21.5x smaller.
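To make that gap concrete, here is a quick sketch (not from the original post) that counts parameters for publicly released checkpoints using the Hugging Face transformers library. The checkpoint names and the approximate counts in the comments are my own assumptions, and bert-large-uncased stands in for the paper’s BERT-xlarge configuration, which was never released as a public checkpoint.

```python
# Illustrative sketch: compare parameter counts of released BERT and ALBERT
# checkpoints. Assumes the Hugging Face `transformers` and `torch` packages
# are installed; exact counts may differ slightly from the paper's figures.
from transformers import AlbertModel, BertModel


def count_params(model):
    # Sum the number of elements across every weight tensor in the model.
    return sum(p.numel() for p in model.parameters())


bert = BertModel.from_pretrained("bert-large-uncased")    # roughly 340M params
albert = AlbertModel.from_pretrained("albert-xlarge-v2")  # roughly 60M params

print(f"BERT-large:    {count_params(bert):,} parameters")
print(f"ALBERT-xlarge: {count_params(albert):,} parameters")
```

The ratio you see here is smaller than the 21.5x above only because the released BERT-large is itself much smaller than the BERT-xlarge configuration used in the paper’s comparison.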

There’s a lot to unpack in this paper, and I’ll attempt to delve into all the highlights…
