How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark

Eldar Kurtic; Torsten Hoefler; Dan Alistarh

Published: 20 Nov 2023, Last Modified: 21 Dec 2023

Venue: CPAL 2024 (Proceedings Track) Oral

License: CC BY 4.0

Keywords: pruning, deep learning, benchmarking

TL;DR: We provide a set of pruning guidelines, and instantiate them to recover accuracy on the challenging Sparsity May Cry benchmark.

Abstract: Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent "Sparsity May Cry" (SMC) benchmark called into question the validity of all existing methods by exhibiting a more complex setup in which many known pruning methods appear to fail. We revisit the question of accurate BERT pruning during fine-tuning on downstream datasets, and propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark. First, we perform a cost-versus-benefit analysis of pruning model components, such as the embeddings and the classification head; second, we provide a simple yet general way of scaling training, sparsification, and learning rate schedules relative to the desired target sparsity; finally, we investigate the importance of proper parametrization for Knowledge Distillation in the context of LLMs. Our simple insights lead to state-of-the-art results on both classic BERT-pruning benchmarks and the SMC benchmark, showing that even classic gradual magnitude pruning (GMP) can yield competitive results with the right approach.
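For readers unfamiliar with gradual magnitude pruning, the following is a minimal illustrative sketch of the general idea: sparsity is ramped up over training steps, and at each pruning step the smallest-magnitude weights are zeroed. It is not the authors' implementation. The cubic ramp is one common choice of schedule and is an assumption here, and the function names `cubic_sparsity` and `magnitude_prune` are hypothetical.

```python
def cubic_sparsity(step, total_steps, target, initial=0.0):
    """Sparsity to enforce at `step`, ramping cubically from `initial` to `target`.

    Assumption: a cubic ramp; the paper's exact schedule may differ.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps outside the pruning window are safe.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return target + (initial - target) * (1.0 - progress) ** 3


def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries set to zero."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    weights = [0.5, -0.1, 2.0, 0.05, -1.2, 0.3, 0.01, -0.7]
    total_steps = 4
    for step in range(total_steps + 1):
        s = cubic_sparsity(step, total_steps, target=0.75)
        # In real training, fine-tuning steps would run between pruning steps.
        weights = magnitude_prune(weights, s)
        print(step, round(s, 3), weights)
```

In this sketch the schedule is parametrized by the target sparsity and the number of steps, which is the kind of knob the abstract refers to when it mentions scaling the sparsification schedule relative to the desired target.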

Track Confirmation: Yes, I am submitting to the proceeding track.

Submission Number: 52
