Meet ALBERT: a new ‘Lite BERT’ from Google & Toyota with State of the Art NLP performance and 18x fewer parameters.

TL;DR = your previous NLP models are parameter inefficient and kind of obsolete. Have a great day.

[*Updated November 6 with Albert 2.0 and official source code release]

Google Research and Toyota Technological Institute jointly released a new paper that introduces the world to what is arguably BERT’s successor, a much smaller/smarter Lite Bert called ALBERT. (“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”).

ALBERT sets new SOTA for SQuAD and RACE, beating BERT by +14.5% on RACE… but wait until you compare parameter sizes below. (1M and 1.5M refer to training steps used.)

ALBERT’s final results are impressive in themselves (setting new state of the art for GLUE, RACE, and SQuAD), but the real surprise is the dramatic reduction in model/parameter size.

A combination of two key architecture changes and a training change allow ALBERT to both outperform, and dramatically reduce the model size. Consider the size comparison below — BERT x-large has 1.27 Billion parameters, vs ALBERT x-large with 59 Million parameters!

1.27 Billion params in BERT vs 59M in ALBERT for the same size network (hidden/layers) … ~21.5x smaller.

There’s a lot to unpack in this paper, and I’ll attempt to delve into all the highlights below.

For NLP, are bigger models always better? No…

Let’s start with an important point for NLP in general: over the past year, much of the progress in NLP has come from scaling up transformer-type models, with each larger pre-trained model progressively improving final task accuracy. The original BERT paper showed that larger hidden sizes, more hidden layers, and more attention heads all produced progressive improvements, testing hidden sizes up to 1024.

The largest NLP model to date is NVIDIA’s recently released Megatron, a huge 8 billion parameter model that is over 24x the size of BERT and nearly 6x OpenAI’s GPT-2. Megatron was trained for 9 days on a setup of 512 GPUs.

However, there is arguably a tipping or saturation point where larger no longer equals better. The ALBERT authors show that a BERT x-large model, with a hidden size of 2048 and 4x the parameters of the original BERT-large, actually performs nearly 20% worse.

Bigger is not always better: doubling BERT’s hidden size (4x the parameter count) degrades accuracy on the RACE dataset.

This is similar to the peaking effect of layer depth in computer vision: scaling up layer depth improves results to a point, and then goes downhill. For example, a ResNet-1000 does not outperform a ResNet-152 even though it has 6.5x the layers. In other words, there is a saturation point where training complexity overwhelms and degrades any gains from additional network power.

Thus, with this in mind, ALBERT’s creators set about improving the architecture and training methods to deliver better results, rather than just building a ‘larger BERT’.

What is ALBERT?

The core architecture of ALBERT is BERT-like: it uses a transformer encoder with GELU activations, and the paper keeps the identical 30K WordPiece vocabulary used in the original BERT (V = 30,000). However, ALBERT makes three substantial and important changes:

Architecture improvements for more efficient parameter usage:

1 — Factorized Embedding Parameterization

ALBERT’s authors note that for BERT, XLNet, and RoBERTa, the WordPiece embedding size E is tied directly to the hidden layer size H.

However, the ALBERT authors point out that WordPiece embeddings are designed to learn context-independent representations, while the hidden-layer representations are designed to be context dependent.

The power of BERT largely comes from learning context-dependent representations via the hidden layers. If you tie E to H, then, because NLP requires a large vocabulary V, your V × E embedding matrix must scale with H. You end up with models that can have billions of parameters, most of which are rarely updated during training.

Therefore, tying two components that serve different purposes means inefficient parameters.

Untying the two results in more efficient parameter usage, and implies that H (context dependent) should always be larger than E (context independent).

To do this, ALBERT splits the embedding parameters into two smaller matrices. Instead of projecting one-hot vectors directly into the hidden space of size H, they are first projected into a smaller, lower-dimensional embedding space of size E, which is then projected up into the hidden space.

Thus, the embedding parameters are reduced from O(V × H) to O(V × E + E × H), a big saving when H is much larger than E.
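To make the saving concrete, here is a quick back-of-the-envelope sketch (the 30K vocabulary and E = 128 match the paper; H = 4096 corresponds to the xxlarge scale):

```python
def embedding_params(V, H, E=None):
    """Embedding parameter count.

    Tied (BERT-style): one V x H matrix          -> O(V*H)
    Factorized (ALBERT-style): V x E plus E x H  -> O(V*E + E*H)
    """
    if E is None:
        return V * H
    return V * E + E * H

V = 30_000   # WordPiece vocabulary size used by BERT and ALBERT
H = 4_096    # hidden size (xxlarge scale)
E = 128      # ALBERT's embedding size

tied = embedding_params(V, H)            # 122,880,000 parameters
factorized = embedding_params(V, H, E)   #   4,364,288 parameters
print(f"{tied / factorized:.1f}x fewer embedding parameters")  # 28.2x
```

Because V is fixed by the tokenizer, the factorized version lets H grow without dragging the huge V-sized matrix along with it.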

2 — Cross Layer Parameter Sharing

ALBERT further improves parameter efficiency by sharing all parameters across all layers: both the feed-forward network parameters and the attention parameters are shared.

As a result, ALBERT’s transitions from layer to layer are smoother than BERT’s, and the authors note that this weight sharing helps stabilize the network parameters.
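Conceptually the change is tiny: BERT stacks `num_layers` distinct blocks, while ALBERT applies one block `num_layers` times. A minimal sketch in plain Python, with a toy stand-in for the transformer block:

```python
def bert_style(x, blocks):
    """BERT: every layer has its own parameters (len(blocks) distinct blocks)."""
    for block in blocks:
        x = block(x)
    return x

def albert_style(x, block, num_layers=12):
    """ALBERT: one parameter set, reused at every layer."""
    for _ in range(num_layers):
        x = block(x)
    return x

# Toy stand-in for a transformer block (an affine map on a scalar):
shared_block = lambda x: 2 * x + 1
out = albert_style(1, shared_block, num_layers=3)  # 2*(2*(2*1+1)+1)+1 = 15
```

Note the saving is in parameters, not compute: the shared block still runs at every layer, which is why the authors flag computational efficiency as future work.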

Training changes — SOP, or Sentence Order Prediction:

ALBERT does use MLM (Masked Language Modeling), just like BERT, but masks spans of up to three consecutive tokens (an n-gram max of 3).
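The paper samples each masked span’s length n with probability proportional to 1/n, so unigrams are most common and trigrams rarest. A sketch of that sampling step (the helper name is mine):

```python
import random

def sample_mask_length(max_n=3, rng=random):
    """Pick a masked-span length n in {1..max_n} with p(n) proportional to 1/n.

    For max_n=3 this gives p(1)=6/11, p(2)=3/11, p(3)=2/11, so single-token
    masks are the most common and 3-grams the rarest.
    """
    lengths = range(1, max_n + 1)
    weights = [1.0 / n for n in lengths]
    return rng.choices(lengths, weights=weights, k=1)[0]
```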

However, where BERT also used NSP (Next Sentence Prediction) in addition to MLM, ALBERT developed its own training objective called SOP.

Why not use NSP? The RoBERTa authors showed that the Next Sentence Prediction (NSP) loss used in the original BERT was not very effective as a training mechanism and simply skipped it. The ALBERT authors theorized about why NSP was ineffective, and leveraged that insight to develop SOP (Sentence Order Prediction).

SOP (ALBERT) vs NSP (BERT) and None (XLNet, RoBERTa)

ALBERT’s authors theorized that NSP (Next Sentence Prediction) conflates topic prediction with coherence prediction. For reference, in NSP the positive example pairs a sentence with the sentence that actually follows it in the same document, while the negative example pairs it with a sentence drawn from a different document.

By contrast, the ALBERT authors felt inter-sentence coherence was really the task/loss to focus on, not topic prediction, and thus SOP is done as follows:

Two sentences are used, both from the same document. In the positive case, the two sentences are in their original order; in the negative case, their order is swapped.

This avoids the topic-prediction shortcut and helps ALBERT learn much finer-grained, discourse-level inter-sentence coherence.
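As a sketch, building one SOP training example from two consecutive segments of the same document might look like this (the function name and the 50/50 positive/negative split are my assumptions):

```python
import random

def make_sop_example(first, second, rng=random):
    """Build one Sentence Order Prediction example.

    `first` and `second` must be consecutive segments from the SAME document,
    so topic gives no signal and only order/coherence matters.
    Label 1 = original order (positive), 0 = swapped (negative).
    """
    if rng.random() < 0.5:
        return (first, second), 1   # positive: sentences in proper order
    return (second, first), 0       # negative: order swapped

pair, label = make_sop_example("He went outside.", "It was raining.")
```

Contrast with NSP, where the negative pair mixes documents: there, a model can cheat by detecting the topic change instead of judging coherence.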

The results of course speak for themselves.

What if we scale ALBERT up?

In line with the earlier note about scaling hitting diminishing returns, the ALBERT authors performed their own scaling tests and found peak points for both layer depth and width (hidden size). The authors thus recommend 12-layer models for ALBERT-style cross-layer parameter sharing.

ALBERT finds that removing dropout and adding data improve performance:

Very much in line with what computer vision has found (see my article on adding data via augmentation and avoiding dropout), ALBERT’s authors report improved performance from avoiding dropout, and of course, training with more data.


ALBERT represents a new state of the art for NLP on several benchmarks and new state of the art for parameter efficiency. It’s an amazing breakthrough that builds on the great work done by BERT one year ago and advances NLP in multiple aspects. It’s especially refreshing to see that AI’s future is not only based on adding more GPUs and simply building larger pre-training models, but will also progress from improved architecture and parameter efficiency. The massive drop in parameters (or massive increase in parameter efficiency) while setting new state of the art records is an ideal mix for usable, practical AI.

The authors note that future work for ALBERT is to improve its computational efficiency, possibly via sparse or block attention. Thus, there’s hopefully even more to come from ALBERT in the future!

  • Update: there is more to come, as Google has released not only the official source code but also a v2 of ALBERT as part of the release!

Here are the improvements from v1 to v2 — depending on the model, it’s a 1–3% average improvement:

Github and official/unofficial source for ALBERT?

Thanks to feedback from Damian Jimenez, I’m pleased to note that Google has now released the official source for ALBERT, v2:

Unofficial PyTorch version: Thanks to a tip from Tyler Kalbach, happy to note that an unofficial PyTorch version of ALBERT is now available!

Unofficial TensorFlow version: Thanks to a tip from Engbert Tienkamp in the comments, an unofficial TensorFlow version of ALBERT has been posted on GitHub here:

Paper link: ALBERT: a Lite BERT for Self-supervised Learning of Language Representations

