Dynamic Low-Rank Sparse Adaptation for Large Language Models

Weizhong Huang; Yuxin Zhang; Xiawu Zheng; Liuyang; Jing Lin; Yiwu Yao; Rongrong Ji

Dynamic Low-Rank Sparse Adaptation for Large Language Models

Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Liuyang, Jing Lin, Yiwu Yao, Rongrong Ji

Published: 22 Jan 2025, Last Modified: 24 Mar 2025ICLR 2025 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Large Language Models; Network Sparsity; Low-Rank Adaptation

TL;DR: We present Dynamic Low-Rank Sparse Adaptation, an efficient fine-tuning method to enhance the performance of sparse Large Language Models.

Abstract: Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it endures significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune the sparse LLMs offers an intuitive approach to counter this predicament, while it holds shortcomings include: 1) The inability to integrate LoRA weights into sparse LLMs post-training, and 2) Insufficient performance recovery at high sparsity ratios. In this paper, we introduces dynamic $\textbf{Lo}$w-rank $\textbf{S}$parse $\textbf{A}$daptation $\textbf{(LoSA)}$, a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing the inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, thus guaranteeing that the LoRA module can be integrated into the sparse LLMs post-training. Besides, to achieve the optimal sparse model architecture, LoSA leverages Representation Mutual Information (RMI) as an indicator to determine the importance of layers, thereby dynamically determining the optimal layer-wise sparsity rates during fine-tuning. Predicated on this, LoSA adjusts the rank of the LoRA module based on the variability in layer-wise reconstruction errors, allocating an appropriate fine-tuning for each layer to reduce the output discrepancies between dense and sparse LLMs. Extensive experiments tell that LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden. For example, LoSA reduced the perplexity of sparse LLaMA-2-7B by $\textbf{68.73}$$\downarrow$ and increased zero-shot accuracy by $\textbf{16.32}$%$\uparrow$, achieving a $\textbf{2.60$\times$}$ speedup on CPU and $\textbf{2.23$\times$}$ speedup on GPU, requiring only $\textbf{45 minutes}$ of fine-tuning on $\textbf{a single}$ NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 417

Loading