Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision

Published: 01 May 2025, Last Modified: 18 Jun 2025
Venue: ICML 2025 poster
License: CC BY 4.0
TL;DR: We propose an iterative method for refining pruned neural network weights, aiming to improve model performance while maintaining sparsity
Abstract: Pruning is a widely used technique for compressing large neural networks by eliminating weights that have minimal impact on the model's performance. Current pruning methods, exemplified by magnitude pruning, assign an importance score to each weight based on its magnitude and remove weights with scores below a certain threshold. However, these methods often create a gap between the original dense model and the pruned sparse model, potentially impairing performance, and the gap becomes more pronounced as the sparsity ratio increases. To mitigate this issue, we introduce a method that bridges the gap left by pruning by using a low-rank approximation of the difference between the dense and sparse matrices. Our method iteratively refines the sparse weight matrix, augmented by a low-rank adjustment. This technique captures and retains essential information that is often lost during pruning, thereby improving the performance of the pruned model. Furthermore, we offer a comprehensive theoretical analysis of our approach, emphasizing its convergence properties and establishing a solid basis for its efficacy. Experimental results on LLaMa models validate its effectiveness on large language models across various pruning techniques and sparsity levels. Our method shows significant improvements: at 50% sparsity, it reduces perplexity by 53.9% compared to conventional magnitude pruning on LLaMa-7B. Furthermore, to achieve a specific performance target, our approach enables an 8.6% reduction in model parameters while maintaining a sparsity ratio of about 50%.
Lay Summary: Large language models like ChatGPT contain billions of parameters, making them powerful but computationally expensive. Researchers use "pruning" to remove less important connections and make models faster, but this often hurts performance significantly. We developed a method to recover the lost performance by identifying what information was removed during pruning and adding back a compressed low-rank component. Our approach works without additional training data and maintains the sparse structure needed for efficient hardware execution.
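The abstract describes the idea only at a high level: a sparse weight matrix is refined iteratively and augmented by a low-rank approximation of the difference between the dense and sparse matrices. The following NumPy sketch shows one plausible reading of that idea as an alternating scheme. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`magnitude_prune`, `low_rank`, `refine`), the use of a truncated SVD, the re-pruning of the residual at every step, and the default rank and iteration count are all hypothetical choices made here.

```python
import numpy as np


def magnitude_prune(w, sparsity):
    """Zero the smallest-magnitude entries so that at least `sparsity` of them are zero."""
    k = int(round(sparsity * w.size))  # number of entries to remove
    if k == 0:
        return w.copy()
    # k-th smallest absolute value; entries at or below it are dropped
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return w * (np.abs(w) > threshold)


def low_rank(m, rank):
    """Best rank-`rank` approximation of m in Frobenius norm (truncated SVD)."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def refine(w, sparsity=0.5, rank=8, iters=20):
    """Alternate between a low-rank fit of (w - s) and re-pruning (w - l).

    Assumption: this alternating update is one way to realize "iterative
    refinement of the sparse matrix plus a low-rank adjustment"; the paper's
    actual update rule may differ.
    """
    s = magnitude_prune(w, sparsity)  # plain magnitude pruning as the start
    l = np.zeros_like(w)
    for _ in range(iters):
        l = low_rank(w - s, rank)             # capture what pruning removed
        s = magnitude_prune(w - l, sparsity)  # re-prune what the low-rank part misses
    return s, l


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64))
    s, l = refine(w)
    print("pruning only :", np.linalg.norm(w - magnitude_prune(w, 0.5)))
    print("sparse + low :", np.linalg.norm(w - s - l))
```

In this sketch each step solves its subproblem optimally in Frobenius norm (truncated SVD for the low-rank part, hard thresholding for the sparse part), so the reconstruction error `||w - s - l||` does not increase from one iteration to the next. A different importance score could be substituted for `magnitude_prune` to mirror the abstract's claim that the approach applies across pruning techniques.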
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: Pruning, Large Language Model, Model Compression
Submission Number: 2068