BERT vs ALBERT explained

Anonymous

17 Jan 2022 (modified: 05 May 2023) · Submitted to BT@ICLR2022
Abstract: Implementing machine learning and deep learning models at scale requires an immense amount of training time and computational resources. In language representation learning in particular, studies have shown that large-scale full-network pre-training is crucial for achieving state-of-the-art performance. However, increasing the model size increases the number of model parameters, which in turn significantly raises the training and computation requirements, a serious challenge in large-scale computing. In this blog, we provide a brief summary of the ICLR paper "ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations." The paper proposes two parameter-reduction techniques that lower memory consumption and increase the training speed of the BERT (Bidirectional Encoder Representations from Transformers) architecture. The proposed methods lead to models that scale much better than the original BERT.
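
One of the two parameter-reduction techniques the ALBERT paper proposes is a factorized embedding parameterization, which decouples the vocabulary embedding size E from the Transformer hidden size H. The short sketch below is our own illustration of the parameter-count arithmetic, not code from the paper or the blog post; V and H follow the BERT-base configuration and E = 128 is the embedding size ALBERT uses, but the exact numbers are only illustrative.

# Illustrative sketch (assumed sizes, not the paper's code): parameter savings
# from ALBERT's factorized embedding parameterization.

V = 30_000   # WordPiece vocabulary size (BERT-base)
H = 768      # Transformer hidden size (BERT-base)
E = 128      # factorized embedding size used by ALBERT

bert_embedding_params = V * H            # BERT ties embedding size to hidden size: O(V*H)
albert_embedding_params = V * E + E * H  # ALBERT splits it into two smaller matrices: O(V*E + E*H)

print(f"BERT-style embedding parameters:   {bert_embedding_params:,}")    # 23,040,000
print(f"ALBERT-style embedding parameters: {albert_embedding_params:,}")  # 3,938,304

With these (assumed) sizes the embedding table shrinks by roughly a factor of six, which, together with cross-layer parameter sharing, is how ALBERT reduces the total parameter count relative to BERT.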
ICLR Paper: https://arxiv.org/abs/1909.11942