Keywords: large language models, model compression, efficient fine-tuning
TL;DR: We approximate each weight matrix as the sum of a low-rank matrix and a sparse matrix and efficiently fine-tune the compressed model.
Abstract: Large Language Models (LLMs) have recently emerged as a significant advancement in natural language processing; however, their large scale and computational complexity make deployment a challenge. Model pruning is a widely used post-training strategy to reduce LLMs' memory and computation needs. Despite notable progress, these techniques incur a reduction in performance and require post-pruning fine-tuning to recover it. To address these problems, we introduce $\textbf{ELSA}$, a novel method combining pruning and low-rank decomposition for better compression and recovery. We first use an alternating projections method to decompose the weight matrices into sparse matrices and low-rank matrices, which we validate from both theoretical and empirical perspectives; we then freeze the sparse matrices and update the low-rank matrices to efficiently recover performance. To demonstrate the effectiveness and efficiency of the method, we conduct experiments on various language tasks (seven zero-shot tasks and language modeling) and on models from different families (LLaMA, OPT, and Qwen) at different scales. The experiments show that the method outperforms state-of-the-art pruning methods while maintaining comparable inference efficiency.
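The abstract describes decomposing each weight matrix into a sparse plus a low-rank component via alternating projections. Below is a minimal, illustrative sketch of such an alternating-projections decomposition in PyTorch; the function name, the `rank` and `num_sparse` hyperparameters, and the fixed iteration count are assumptions for illustration and are not taken from the paper.

```python
import torch

def low_rank_plus_sparse(W, rank, num_sparse, n_iters=30):
    """Alternating projections (illustrative sketch, not the paper's exact algorithm):
    approximate W ≈ L + S with rank(L) <= `rank` and S having at most
    `num_sparse` nonzero entries."""
    L = torch.zeros_like(W)
    S = torch.zeros_like(W)
    for _ in range(n_iters):
        # Project the residual W - L onto the sparse set by keeping
        # its `num_sparse` largest-magnitude entries.
        R = W - L
        S = torch.zeros_like(R)
        _, idx = torch.topk(R.abs().flatten(), num_sparse)
        S.view(-1)[idx] = R.view(-1)[idx]
        # Project the residual W - S onto the set of rank-`rank` matrices
        # via a truncated SVD.
        U, sigma, Vh = torch.linalg.svd(W - S, full_matrices=False)
        L = (U[:, :rank] * sigma[:rank]) @ Vh[:rank, :]
    return L, S
```

In the recovery stage described in the abstract, S would be kept frozen while the (factored) low-rank component L is fine-tuned, so only a small number of parameters are updated.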
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18533