BERT vs ALBERT explained

Introduction

Training machine learning and deep learning models at scale requires an immense amount of time and computational resources. In the context of language representation learning in particular, studies have shown that large-scale, full-network pre-training is crucial for achieving state-of-the-art performance. But increasing the model size increases the number of model parameters, which in turn significantly raises the training and computation requirements, a serious challenge in large-scale computing. In this blog, we provide a brief summary of the ICLR paper “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.” The paper introduces two parameter-reduction techniques that lower memory consumption and increase the training speed of the BERT (Bidirectional Encoder Representations from Transformers) architecture. The proposed methods lead to models that scale much better than the original BERT.

What is BERT?

We all know Google’s BERT has changed the NLP landscape, but what is it exactly? BERT is one of the most famous natural language processing (NLP) frameworks; it helps computers understand the meaning of text by using the surrounding text as context. BERT, which stands for ‘Bidirectional Encoder Representations from Transformers’, is built upon the concept of transformers, in which every output element is connected to every input element and the weights between them are computed dynamically. In NLP, this mechanism is commonly known as ‘attention’.
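To make that idea concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. It is a toy illustration with made-up tensor sizes, not BERT’s actual multi-head implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Every output position attends to every input position; the attention
    weights are computed dynamically from query/key similarity."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5  # pairwise similarity scores
    weights = F.softmax(scores, dim=-1)                  # dynamically computed attention weights
    return weights @ value                               # weighted sum of value vectors

# Toy example: a batch of one sequence with 4 tokens of dimension 8.
x = torch.randn(1, 4, 8)
output = scaled_dot_product_attention(x, x, x)           # self-attention over the sequence
print(output.shape)                                      # torch.Size([1, 4, 8])
```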

Now… what is ALBERT?

BERT is known for handling tasks ranging from simple text classification to complex tasks like question answering. While it may seem like the perfect language model, this state-of-the-art architecture involves millions, if not billions, of parameters, which can significantly hamper training speed as we scale these models, since communication overhead is directly proportional to the number of parameters. These issues are addressed by A Lite BERT (ALBERT), which keeps BERT’s architecture but uses far fewer parameters. So, how exactly does ALBERT achieve this? ALBERT incorporates two parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. In addition, it introduces a self-supervised loss for sentence-order prediction.

Wondering what these mean? Let’s now dive into some details!

First, let’s look at the ALBERT model architecture. It is similar to BERT’s: a transformer encoder with GELU non-linearities. The notation used in the paper is summarized below.

Parameter                 | Symbol
Embedding size            | E
Number of encoder layers  | L
Hidden size               | H
Feed-forward/filter size  | 4H
Number of attention heads | H/64
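As a quick sanity check on the table, here is a tiny plain-Python snippet that derives the feed-forward size and the number of attention heads from a given hidden size H; the example values of H are the ones used by ALBERT-base and ALBERT-xxlarge.

```python
# Derive the feed-forward size (4H) and number of attention heads (H/64)
# from the hidden size H, following the notation table above.
def derived_sizes(hidden_size: int) -> dict:
    return {
        "hidden size H": hidden_size,
        "feed-forward/filter size (4H)": 4 * hidden_size,
        "attention heads (H/64)": hidden_size // 64,
    }

print(derived_sizes(768))   # ALBERT-base:    FFN 3072,  12 heads
print(derived_sizes(4096))  # ALBERT-xxlarge: FFN 16384, 64 heads
```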

Let us now look at how these parameter reduction techniques actually work.

1. Factorized embedding parameterization

In BERT, the WordPiece embedding size E is tied to the hidden layer size H. This is suboptimal for the following reasons:

  • NLP tasks require a very large vocabulary size, denoted by V. If the embedding size E is tied to the hidden size H, then increasing H also grows the embedding matrix, which has size V × E (= V × H). This can push the number of model parameters into the billions, circling back to our primary problem.
  • WordPiece embeddings are meant to learn context-independent representations, whereas hidden-layer embeddings learn context-dependent representations. BERT’s power comes primarily from context-dependent representations, which calls for a hidden size H much greater than the embedding size E. If H and E are tied, increasing H also increases E, and with it the total number of model parameters.

To combat this, ALBERT decomposes the embedding parameters into two smaller matrices: the one-hot encoded vectors are first projected into a lower-dimensional embedding space of size E, and then projected up to the hidden space of size H. This reduces the embedding parameters from O(V × H) to O(V × E + E × H), which is a significant saving when H ≫ E.
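Here is a minimal sketch in PyTorch, with illustrative sizes; the class name is mine, not the paper’s reference code. It shows the two-step projection and how much it shrinks the embedding parameters.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Project token ids into a small embedding space of size E,
    then up to the hidden size H (ALBERT-style factorization sketch)."""
    def __init__(self, vocab_size: int, embedding_size: int, hidden_size: int):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_size)    # V x E
        self.embedding_to_hidden = nn.Linear(embedding_size, hidden_size)  # E x H (+ bias)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embedding_to_hidden(self.word_embeddings(input_ids))

V, E, H = 30_000, 128, 4096  # illustrative ALBERT-xxlarge-like sizes
factorized = sum(p.numel() for p in FactorizedEmbedding(V, E, H).parameters())
tied = V * H                 # BERT-style single V x H embedding matrix
print(f"tied: {tied:,}  factorized: {factorized:,}")  # ~122.9M vs ~4.4M
```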

[Table: effect of the vocabulary embedding size E on ALBERT’s performance and parameter count]

The table above shows the performance of ALBERT-based models as the embedding size E varies. Non-shared embeddings (BERT-style) perform better at larger values of E, but not by a significant margin. At the expense of roughly a 1% drop in accuracy, ALBERT reduces the parameter count by around 70-80M, a significant improvement over BERT. Of the values tested, E = 128 appears to perform best.

2. Cross-layer parameter sharing

The main purpose of parameter sharing is a radical reduction in the number of parameters of the network. While accuracy drops slightly with this method, the main goal of parameter reduction is achieved, along with better generalization. There are many ways to share parameters; ALBERT’s default is to share all parameters across layers. The behaviour of BERT and ALBERT can be compared by looking at the L2 and cosine distances between the input and output embeddings of each layer, as shown below.

[Figure: L2 and cosine distances between the input and output embeddings of each layer, for BERT-large and ALBERT-large]

As we can see in the figure above, the transitions from layer to layer are much smoother for ALBERT than BERT. Hence, apart from just parameter reduction, parameter sharing across layers also stabilizes the parameters.
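As a rough illustration, here is a minimal PyTorch sketch of the all-shared strategy: a single encoder layer whose weights are reused for all L layers. This is not the paper’s implementation, just the idea in code.

```python
import torch
import torch.nn as nn

class AllSharedEncoder(nn.Module):
    """One transformer encoder layer whose parameters are reused for every
    layer, instead of L independently parameterized layers (BERT-style)."""
    def __init__(self, hidden_size: int = 768, num_layers: int = 12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=hidden_size // 64,
            dim_feedforward=4 * hidden_size, activation="gelu",
            batch_first=True)
        self.num_layers = num_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_layers):   # same weights applied L times
            x = self.layer(x)
        return x

x = torch.randn(2, 16, 768)                # (batch, sequence, hidden)
print(AllSharedEncoder()(x).shape)         # torch.Size([2, 16, 768])
```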

[Table: effect of cross-layer parameter-sharing strategies on ALBERT, for E = 128 and E = 768]

The table above compares ALBERT-based models under different parameter-sharing configurations, for embedding sizes E = 128 and E = 768. The not-shared (BERT-style) strategy performs best, but at the cost of a large number of parameters. The all-shared (ALBERT-style) strategy hurts performance for both embedding sizes, but the drop is not severe. Given the large parameter savings, the all-shared strategy is therefore used as the default choice.

3. Inter-sentence coherence loss

BERT uses two losses: a masked language modelling (MLM) loss and a next-sentence prediction (NSP) loss, where NSP predicts whether two segments occur consecutively in a text. NSP was found to be unreliable because the task is too easy. ALBERT therefore replaces it with a sentence-order prediction (SOP) loss that focuses on inter-sentence coherence: positive examples are two consecutive segments from the same document, and negative examples are the same two segments with their order swapped. This forces the model to learn finer-grained distinctions about discourse-level coherence, so ALBERT performs better on multi-sentence encoding tasks.
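To make the construction of training examples concrete, here is a small sketch in plain Python (the helper name is hypothetical) of how SOP pairs can be built from two consecutive segments of the same document:

```python
import random

def make_sop_example(segment_a: str, segment_b: str, swap_prob: float = 0.5):
    """Build one SOP example from two consecutive segments of a document:
    original order is the positive class (1), swapped order the negative (0)."""
    if random.random() < swap_prob:
        return (segment_b, segment_a), 0   # negative: order swapped
    return (segment_a, segment_b), 1       # positive: original order

pair, label = make_sop_example("The cat sat on the mat.", "Then it fell asleep.")
print(pair, label)
```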

[Table: effect of the inter-sentence loss (none, NSP, SOP) on intrinsic and downstream tasks]

This table compares the effect of the additional inter-sentence loss: none (XLNet-/RoBERTa-style), NSP (BERT-style), and SOP (ALBERT-style), on both intrinsic and downstream tasks. We can see that SOP solves the NSP task reasonably well and performs much better on the SOP task itself. Downstream performance on multi-sentence encoding tasks is also better with SOP, with an improvement of about 1% on average.

How do these two compare?

1. Comparison by number of parameters

[Table: model configurations and parameter counts for BERT and ALBERT variants]

Now that we have talked about the parameter-reduction methods, let us compare BERT and ALBERT with some actual numbers. ALBERT-large has about 18x fewer parameters than BERT-large: 18M versus 334M! We can also look at it from the perspective of hidden size: an ALBERT-xlarge configuration with H = 2048 has only 60M parameters, and an ALBERT-xxlarge configuration with H = 4096 has 233M parameters, i.e., around 70% of BERT-large’s parameters.

From the comparison above, it is clear that ALBERT is far more parameter-efficient than BERT. But as machine learning enthusiasts, we should also compare performance on a few popular benchmark datasets such as GLUE, SQuAD, and RACE.

2. Comparison on benchmarks

[Table: results on the GLUE, SQuAD, and RACE benchmarks]

ALBERT-xxlarge requires only about 70% of BERT-large’s parameters, yet achieves significant improvements over BERT-large, with the largest gain on RACE (+8.4%).

3. Comparison by training time

[Table: BERT-large vs. ALBERT-xxlarge under comparable training time]

This table compares the models under roughly equal training time rather than an equal number of steps, since longer training generally leads to better performance and the two models have very different data throughput. ALBERT-xxlarge, trained for 125k steps (32 hours), outperforms BERT-large trained for 400k steps (34 hours). Here again, the largest improvement is on RACE (+5.2%).

The authors then decided to get their hands dirty and try out a few add-ons to improve the model! Let’s see what these are.

Additional training data and dropout effects

[Figure: dev-set MLM accuracy with additional training data, and with and without dropout]

Up until this point we have only considered two datasets, Wikipedia and BOOKCORPUS. The figure above shows what happens when the additional data used by XLNet and RoBERTa is added: it gives a significant boost to dev-set MLM accuracy. More surprisingly, even after training for 1M steps, the largest models do not overfit their training data. The authors therefore remove dropout to further increase model capacity, which raises MLM accuracy further, as shown above. It is often said that adding combinations of batch normalization and dropout to CNNs improves accuracy, but there is evidence suggesting that the combination can actually be harmful.
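In practice, removing dropout is just a configuration change. Below is a hedged sketch assuming the Hugging Face transformers library is available, with the dropout probabilities set to zero and ALBERT-base-like sizes chosen purely for illustration:

```python
# Sketch: building an ALBERT masked-LM model with dropout disabled, assuming
# the Hugging Face `transformers` library. Sizes are ALBERT-base-like and
# are for illustration only.
from transformers import AlbertConfig, AlbertForMaskedLM

config = AlbertConfig(
    embedding_size=128,                # factorized embedding size E
    hidden_size=768,                   # hidden size H
    num_hidden_layers=12,              # L encoder layers (parameters shared)
    num_attention_heads=12,            # H / 64
    intermediate_size=3072,            # 4H feed-forward size
    hidden_dropout_prob=0.0,           # no dropout on hidden states
    attention_probs_dropout_prob=0.0,  # no dropout on attention weights
)
model = AlbertForMaskedLM(config)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```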

Conclusion

ALBERT succeeds in reducing the number of parameters while producing powerful contextual representations, thereby giving significantly better results. However, due to its larger structure, ALBERT-xxlarge is computationally more expensive than BERT-large. Many recent works tackle this issue with techniques such as sparse and block attention.

That’s it folks! Hope this was a good and informative read.

Bibliography

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Medium. (2019, September 27). Google’s ALBERT Is a Leaner BERT; Achieves SOTA on 3 NLP Benchmarks. https://medium.com/syncedreview/googles-albert-is-a-leaner-bert-achieves-sota-on-3-nlp-benchmarks-f64466dd583

MachineCurve. (2021, January 6). ALBERT explained: A Lite BERT. https://www.machinecurve.com/index.php/2021/01/06/albert-explained-a-lite-bert/