Towards Structured Dynamic Sparse Pre-Training of BERT

Published: 28 Jan 2022 · Last Modified: 22 Oct 2023 · ICLR 2022 Submission · Readers: Everyone
Keywords: sparsity, natural language processing, pre-training, computational efficiency
Abstract: Identifying algorithms for computationally efficient unsupervised training of large language models is an important and active area of research. In this work, we develop and study a straightforward, dynamic always-sparse pre-training approach for BERT language modeling, which leverages periodic compression steps based on magnitude pruning followed by random parameter re-allocation. This approach enables us to achieve Pareto improvements in terms of the number of floating-point operations (FLOPs) over statically sparse and dense models across a broad spectrum of network sizes. Furthermore, we demonstrate that training remains FLOP-efficient when using coarse-grained block sparsity, making it particularly promising for efficient execution on modern hardware accelerators.
One-sentence Summary: We present a dynamic sparse pre-training approach for BERT and demonstrate its superior FLOP-efficiency when compared to the dense baseline.
Community Implementations: 1 code implementation on [CatalyzeX](https://www.catalyzex.com/paper/arxiv:2108.06277/code)
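
The abstract describes periodic compression steps that magnitude-prune the currently active weights and randomly re-allocate the freed parameters, optionally at block granularity. The sketch below illustrates what one such prune-and-regrow update on a single weight matrix could look like; the `prune_and_regrow` helper, the per-block L1 scoring rule, and the zero-initialization of regrown blocks are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch of one periodic prune-and-regrow step (assumed details, not
# the paper's implementation): magnitude pruning of active blocks followed by
# random re-allocation of the same number of blocks elsewhere in the layer.
import numpy as np

def prune_and_regrow(weights, mask, drop_fraction=0.3, block=(16, 16), rng=None):
    """Drop the lowest-magnitude active blocks and regrow an equal number of
    blocks at random inactive positions, keeping the sparsity level fixed."""
    rng = rng or np.random.default_rng()
    weights, mask = weights.copy(), mask.copy()
    bh, bw = block
    H, W = weights.shape
    assert H % bh == 0 and W % bw == 0, "block size must divide the weight matrix"

    # View the weight matrix and mask as grids of (bh x bw) blocks.
    w_blocks = weights.reshape(H // bh, bh, W // bw, bw)
    m_blocks = mask.reshape(H // bh, bh, W // bw, bw)
    block_active = m_blocks.any(axis=(1, 3))            # which blocks are currently on
    block_score = np.abs(w_blocks).sum(axis=(1, 3))     # L1 magnitude per block (assumed criterion)

    # --- Prune: drop the weakest active blocks. ---
    active_idx = np.argwhere(block_active)
    inactive_idx = np.argwhere(~block_active)
    n_drop = min(int(drop_fraction * len(active_idx)), len(inactive_idx))
    drop = active_idx[np.argsort(block_score[block_active])[:n_drop]]

    # --- Regrow: re-allocate the freed blocks at random inactive positions. ---
    grow = inactive_idx[rng.choice(len(inactive_idx), size=n_drop, replace=False)]

    for r, c in drop:
        m_blocks[r, :, c, :] = 0
    for r, c in grow:
        m_blocks[r, :, c, :] = 1
        w_blocks[r, :, c, :] = 0.0   # regrown weights start at zero (assumption)

    return weights * mask, mask


# Example: a 75%-block-sparse 128x128 layer, updated once.
rng = np.random.default_rng(0)
block_on = rng.random((8, 8)) < 0.25
M = np.kron(block_on, np.ones((16, 16)))              # block-structured binary mask
W = rng.standard_normal((128, 128)) * M
W, M = prune_and_regrow(W, M, drop_fraction=0.3, block=(16, 16), rng=rng)
```

Because the number of active blocks is unchanged by each update, the per-step FLOP budget stays constant throughout training, and scoring and re-allocating whole blocks rather than individual weights is what makes the resulting sparsity pattern amenable to efficient execution on modern hardware accelerators.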
