TL;DR: Layer importance is determined automatically from weight and activation information, so that model performance is maintained under high-sparsity pruning.
Abstract: Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code\footnote{The code is available at: \url{https://github.com/ironartisan/DLP}.} to facilitate future research.
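The abstract describes DLP only at a high level: layer importance is derived from weights and input activations, and pruning rates are assigned non-uniformly across layers. The sketch below illustrates what such a scheme could look like, assuming a Wanda-style |W|·‖x‖ importance metric and a simple linear mapping from normalized layer importance to per-layer sparsity; the function names (`layer_importance`, `allocate_sparsity`, `prune_layer`) and the `temperature` parameter are illustrative assumptions, not the released DLP implementation.

```python
# Hedged sketch of non-uniform layerwise pruning driven by a
# weight-and-activation importance score. Names and the mapping from
# importance to sparsity are assumptions for illustration only.
import torch

def layer_importance(weight: torch.Tensor, act_norm: torch.Tensor) -> float:
    """Wanda-style score: mean of |W_ij| * ||x_j||_2 over the layer."""
    # act_norm holds the per-input-channel L2 norm of the calibration activations.
    return (weight.abs() * act_norm.unsqueeze(0)).mean().item()

def allocate_sparsity(scores, target_sparsity=0.7, temperature=0.1):
    """Assign higher sparsity to less important layers.

    Offsets are zero-mean by construction, so the average sparsity stays
    (approximately, up to clamping) equal to the global target.
    """
    s = torch.tensor(scores)
    s = (s - s.mean()) / (s.std() + 1e-8)   # normalize importance scores
    offsets = -temperature * s              # important layer -> lower sparsity
    return (target_sparsity + offsets).clamp(0.0, 0.99).tolist()

def prune_layer(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float):
    """Zero out the lowest-scoring fraction of weights in place."""
    metric = weight.abs() * act_norm.unsqueeze(0)
    k = int(metric.numel() * sparsity)
    if k > 0:
        threshold = metric.flatten().kthvalue(k).values
        weight[metric <= threshold] = 0.0
```

In this sketch, a single calibration pass would supply the activation norms, layer scores would be collected across the model, and `allocate_sparsity` would then fix the per-layer pruning ratios before each layer is pruned independently.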
Lay Summary: Large language models are powerful, but they run slowly and are costly to deploy. Traditional pruning methods remove the same proportion of parameters from every layer, and accuracy degrades sharply once most weights are removed. We asked whether we could do better by pruning more from less important layers and preserving the important ones, without manually tuning thresholds. Our Dynamic Layerwise Pruning (DLP) method retains critical features by applying lower sparsity rates to important layers and higher sparsity rates to less important ones. At high sparsity levels, it outperforms previous approaches and integrates seamlessly with other compression and fine-tuning techniques. Our work facilitates the deployment of LLMs on resource-constrained devices and contributes to the sustainability of LLM technology.
Primary Area: Deep Learning->Large Language Models
Keywords: Pruning, Large Language Models, Model Compression
Submission Number: 1605