Keywords: LLM, Network Pruning, Hessian-based Pruning
TL;DR: This paper presents a new state-of-the-art Hessian-based one-shot LLM pruning algorithm that applies to both unstructured and semi-structured sparsity.
Abstract: Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient inference. One classic and prominent line of one-shot LLM pruning leverages second-order information (i.e., the Hessian), exemplified by pioneering works such as SparseGPT (Frantar & Alistarh, 2023). However, the predefined left-to-right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analysis leads us to propose ROSE, a reordered SparseGPT method that processes weight columns with larger potential pruning errors first. Specifically, following the block-wise iterative pruning scheme of SparseGPT, we first perform a pre-pruning step to identify weights that are highly likely to be pruned, from which we compute both column-wise and block-wise pruning losses. Columns within each block are then reordered in descending order of column loss, while blocks are reordered in descending order of block loss. We further analyze different layer types and selectively apply reordering to specific layers. Extensive empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other competing pruning methods.
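To make the reordering step concrete, below is a minimal NumPy sketch of the pipeline the abstract describes. The function name `rose_column_order`, the per-row pre-pruning pass, and the default block size are illustrative assumptions, not the paper's exact implementation; the per-weight saliency w^2 / [H^-1]_jj follows SparseGPT's pruning-error criterion.

```python
import numpy as np

def rose_column_order(W, Hinv_diag, sparsity=0.5, block_size=128):
    """Hypothetical sketch of ROSE's column/block reordering.

    W         : (rows, cols) weight matrix of one linear layer.
    Hinv_diag : (cols,) diagonal of the inverse Hessian, as in SparseGPT.
    Returns a column permutation in which blocks are sorted by descending
    block loss and columns within each block by descending column loss.
    """
    rows, cols = W.shape

    # SparseGPT-style saliency: pruning weight w_j costs w_j^2 / [H^-1]_jj.
    err = W ** 2 / Hinv_diag[None, :]

    # Pre-pruning (assumed per-row here): tentatively mark the
    # lowest-saliency weights up to the target sparsity; these are the
    # weights highly likely to be pruned.
    k = int(sparsity * cols)
    likely_pruned = np.zeros_like(err, dtype=bool)
    idx = np.argpartition(err, k, axis=1)[:, :k]
    np.put_along_axis(likely_pruned, idx, True, axis=1)

    # Column-wise loss: total error of likely-pruned weights per column.
    col_loss = (err * likely_pruned).sum(axis=0)

    # Block-wise loss: sum of the column losses inside each block.
    n_blocks = (cols + block_size - 1) // block_size
    blocks = []
    for b in range(n_blocks):
        cols_b = np.arange(b * block_size, min((b + 1) * block_size, cols))
        # Reorder columns within the block by descending column loss.
        cols_b = cols_b[np.argsort(-col_loss[cols_b])]
        blocks.append((col_loss[cols_b].sum(), cols_b))

    # Reorder blocks by descending block loss, then flatten.
    order = []
    for _, cols_b in sorted(blocks, key=lambda t: -t[0]):
        order.extend(cols_b.tolist())
    return np.array(order)
```

Under this reading, the permutation would be applied to the columns of W (and the corresponding rows/columns of the Hessian) before running the standard SparseGPT block-wise pruning loop, so that high-error columns are processed first.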
Primary Area: foundation or frontier models, including LLMs
Submission Number: 77