Keywords: sparsity, natural language processing, pre-training, computational efficiency
Abstract: Identifying algorithms for computationally efficient unsupervised training of large language models is an important and active area of research.
In this work, we develop and study a straightforward, dynamic always-sparse pre-training approach for BERT language modeling, which leverages periodic compression steps based on magnitude pruning followed by random parameter re-allocation.
This approach enables us to achieve Pareto improvements in terms of the number of floating-point operations (FLOPs) over statically sparse and dense models across a broad spectrum of network sizes.
Furthermore, we demonstrate that training remains FLOP-efficient when using coarse-grained block sparsity, making it particularly promising for efficient execution on modern hardware accelerators.
One-sentence Summary: We present a dynamic sparse pre-training approach for BERT and demonstrate its superior FLOP-efficiency when compared to the dense baseline.
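As a rough illustration of the compression step described in the abstract, the sketch below shows one mask update built from magnitude pruning followed by random re-allocation of the freed parameter budget. It is a minimal sketch, not the authors' implementation; the function name `dynamic_sparse_step`, the `prune_fraction` parameter, and the unstructured (per-weight) granularity are assumptions made for illustration only.

```python
import torch


def dynamic_sparse_step(weight: torch.Tensor, mask: torch.Tensor,
                        prune_fraction: float = 0.3) -> torch.Tensor:
    """Return an updated sparsity mask with the same number of active weights.

    1. Magnitude pruning: drop the fraction of currently active weights
       with the smallest absolute value.
    2. Random re-allocation: re-activate the same number of connections at
       randomly chosen, currently inactive positions.
    (Hypothetical sketch; the paper's approach may differ in details such as
    block granularity and how regrown weights are initialized.)
    """
    active = mask.bool()
    n_active = int(active.sum())
    n_prune = int(prune_fraction * n_active)
    if n_prune == 0:
        return mask

    # Magnitude pruning: inactive entries are masked to +inf so only the
    # smallest-magnitude active weights are selected for removal.
    magnitudes = weight.abs().masked_fill(~active, float("inf"))
    drop_idx = torch.topk(magnitudes.flatten(), n_prune, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0

    # Random re-allocation: grow the same number of connections at random
    # positions that are currently inactive, keeping total sparsity fixed.
    inactive_idx = (new_mask == 0).nonzero(as_tuple=True)[0]
    grow_idx = inactive_idx[torch.randperm(len(inactive_idx))[:n_prune]]
    new_mask[grow_idx] = 1

    return new_mask.view_as(mask)
```

In practice, newly re-activated weights would typically be re-initialized (for example, to zero), and the block-sparse variant mentioned in the abstract would prune and regrow contiguous blocks of weights rather than individual entries.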
Community Implementations: [1 code implementation (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:2108.06277/code)