Empirical Upper Bounds for Unstructured Sparsity in Compute-Efficient Language Modeling

Published: 09 Oct 2024 · Last Modified: 19 Nov 2024 · Compression Workshop @ NeurIPS 2024 · CC BY 4.0
Keywords: sparsity, scaling laws, language models, proximal methods
TL;DR: A case study on why sparsifying methods should be evaluated relative to a scaling law.
Abstract: Sparsity in deep neural networks promises two gains in computational efficiency: fewer FLOPs spent to train the network and fewer FLOPs spent to perform inference. We find that both are best quantified relative to a compute-efficient scaling law. This tool allows us to compare existing methods for training networks with unstructured sparse regularization and parametrization. In this setting, it is natural to focus on the proportion of weights in the network whose magnitude falls below a given threshold and to assume that those weights do not affect the network's output. However, since we may not know where that threshold lies, we aim to separate our analysis from any specific threshold. By evaluating network sparsity at many possible thresholds, we can characterize an empirical upper bound on the advantage of sparsity for pre-training large language models. Testing this bound against existing sparse regularization methods, we find a 15% reduction in pre-training FLOPs or a 30-40% reduction in inference FLOPs, and we further identify decoupled proximal methods as a promising direction.
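For illustration, here is a minimal sketch (assuming PyTorch; the `sparsity_curve` helper and the toy model are hypothetical and not the paper's code) of how one might evaluate the proportion of sub-threshold weights across many candidate thresholds rather than committing to a single one:

```python
import torch

def sparsity_curve(model, thresholds):
    """Fraction of weight magnitudes falling below each candidate threshold.

    Hypothetical helper: traces the (threshold -> sparsity) curve so the
    analysis is not tied to any single pruning threshold.
    """
    # Collect all weight-matrix magnitudes into one flat vector (skip biases).
    magnitudes = torch.cat([
        p.detach().abs().flatten()
        for p in model.parameters() if p.dim() > 1
    ])
    total = magnitudes.numel()
    # For each threshold, count the proportion of weights below it.
    return {t: (magnitudes < t).sum().item() / total for t in thresholds}

# Example usage with a toy stand-in for a language model.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Linear(512, 512))
curve = sparsity_curve(model, thresholds=[1e-4, 1e-3, 1e-2, 1e-1])
print(curve)
```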
Submission Number: 82