Efficient One-Shot Pruning of Large Language Models with Low-Rank Approximation

Published: 01 Jan 2024 · Last Modified: 11 Apr 2025 · SMC 2024 · CC BY-SA 4.0
Abstract: Model pruning, as an effective method for compressing large language models (LLMs), has recently attracted considerable attention in natural language processing. However, existing LLM pruning methods have two main drawbacks: (1) iterative pruning of models with over a billion parameters requires retraining, which incurs significant pruning costs; and (2) pruning is often formulated as a weight reconstruction problem that requires second-order information, incurring expensive computation. To address these issues, we propose a novel pruning method named Eplra (efficient one-shot pruning of large language models with low-rank approximation), which efficiently identifies sparse networks in LLMs. Specifically, we design a novel pruning metric based on input activations for rapid one-shot compression of LLMs. We first incorporate input activations into the calculation of weight importance to enable precise pruning of low-priority weights. Then, we compare weights locally across each output of a linear layer to induce uniform sparsity. Next, we extend Eplra to semi-structured pruning patterns to accommodate various acceleration scenarios. Finally, we employ low-rank parametrized update matrices to fine-tune the pruned model, enabling swift recovery of model performance. Experimental results on various language benchmark datasets demonstrate that Eplra outperforms state-of-the-art methods.
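The abstract does not give Eplra's exact importance metric, so the following is a minimal PyTorch sketch of the pruning step under one plausible reading: a Wanda-style score S_ij = |W_ij| · ||X_j||_2 that multiplies weight magnitude by the input-activation norm, compared locally within each output row, with an n:m variant for semi-structured sparsity. The function names, the `sparsity` ratio, and the calibration-derived `act_norm` are illustrative assumptions, not the paper's implementation.

```python
# Sketch of activation-aware one-shot pruning for a linear layer.
# Assumes a Wanda-style score |W| * ||X||_2; not the paper's exact metric.
import torch


def eplra_prune_linear(weight: torch.Tensor,
                       act_norm: torch.Tensor,
                       sparsity: float = 0.5) -> torch.Tensor:
    """Unstructured pruning: zero the lowest-scoring weights per output.

    weight:   (out_features, in_features) weight of a linear layer.
    act_norm: (in_features,) L2 norm of each input feature's activations,
              accumulated over a small calibration set.
    """
    score = weight.abs() * act_norm          # per-weight importance
    k = int(weight.shape[1] * sparsity)      # weights to drop per output
    # Local comparison: rank weights within each output (row), not globally,
    # so every output keeps the same fraction of weights (uniform sparsity).
    idx = torch.argsort(score, dim=1)[:, :k]
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, idx, False)
    return weight * mask


def eplra_prune_nm(weight: torch.Tensor,
                   act_norm: torch.Tensor,
                   n: int = 2, m: int = 4) -> torch.Tensor:
    """Semi-structured n:m pruning: keep the n highest-scoring weights in
    each group of m consecutive inputs (e.g. 2:4 for GPU sparse kernels).
    Requires in_features to be divisible by m."""
    score = (weight.abs() * act_norm).reshape(weight.shape[0], -1, m)
    idx = torch.argsort(score, dim=2)[:, :, : m - n]  # drop m-n per group
    mask = torch.ones_like(score, dtype=torch.bool)
    mask.scatter_(2, idx, False)
    return weight * mask.reshape(weight.shape)
```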
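For the recovery step, the abstract mentions "low-rank parametrized update matrices", which reads as a LoRA-style adapter: the pruned weight is frozen and only a trainable rank-r delta B·A is learned. The class name, rank, and scaling below are illustrative defaults, not the paper's configuration.

```python
# Sketch of low-rank fine-tuning on top of a pruned linear layer.
import torch
import torch.nn as nn


class LowRankAdapter(nn.Module):
    """Wraps a pruned linear layer with a trainable rank-r update."""

    def __init__(self, pruned: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = pruned
        self.base.weight.requires_grad_(False)   # pruned weights stay frozen
        out_f, in_f = pruned.weight.shape
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, r))  # zero init: identity start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scale * x (BA)^T; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because only A and B are updated, recovery touches a small fraction of the parameters, which is consistent with the abstract's claim of swift performance recovery after one-shot pruning.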