Keywords: Natural Language Processing, BERT, Representation Learning
TL;DR: A new pretraining method that establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.
Abstract: Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2018). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.
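To give a rough sense of how the parameter-reduction idea pays off, the sketch below is a hypothetical back-of-the-envelope calculation (not code from the paper or the repository): it compares a BERT-style embedding table of size vocab_size × hidden_size with a factorized parameterization of size vocab_size × embedding_size + embedding_size × hidden_size. The specific sizes are illustrative only.

```python
# Back-of-the-envelope comparison of embedding parameter counts.
# Sizes below are illustrative assumptions; only the factorization idea
# (V*H vs. V*E + E*H) is taken from the paper's parameter-reduction techniques.

V = 30_000   # vocabulary size (illustrative)
H = 4_096    # hidden size of a large Transformer (illustrative)
E = 128      # small embedding size used in the factorization (illustrative)

untied_embedding = V * H                # embedding width tied to the hidden size
factorized_embedding = V * E + E * H    # small lookup table plus a projection

print(f"V*H embedding:        {untied_embedding:,} parameters")
print(f"V*E + E*H factorized: {factorized_embedding:,} parameters")
print(f"reduction factor:     {untied_embedding / factorized_embedding:.1f}x")
```

With these illustrative numbers the factorization shrinks the embedding parameters by roughly 28x, which is the kind of saving that lets the hidden size grow without the vocabulary embedding dominating memory.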
Code: https://github.com/google-research/ALBERT
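For a quick way to try the released models, the snippet below is a minimal inference sketch assuming the Hugging Face `transformers` port of ALBERT and the community checkpoint ID `albert-base-v2`; the repository linked above ships the original TensorFlow code and checkpoints instead.

```python
# Minimal inference sketch using the Hugging Face transformers port of ALBERT.
# Assumes `pip install transformers torch sentencepiece`; the model ID
# "albert-base-v2" is the community-hosted checkpoint, not the TF release above.
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")

inputs = tokenizer("ALBERT shares parameters across layers.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```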
Community Implementations: [34 code implementations](https://www.catalyzex.com/paper/arxiv:1909.11942/code)