SP-LoRA: Sparsity-Preserved Low-Rank Adaptation for Sparse Large Language Model

ICLR 2025 Conference Submission2070 Authors

20 Sept 2024 (modified: 13 Oct 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: sparsity, parameter efficient fine-tuning, low rank adaptation, large language model
TL;DR: We propose a low-rank fine-tuning method for sparse LLMs and address the challenge of high memory overhead for preserving model sparsity.
Abstract: Large Language Models (LLMs) have shown remarkable performance across natural language processing tasks but suffer from substantial hardware resource requirements and inference latency due to their vast parameter counts. To mitigate these challenges, several post-training pruning techniques such as SparseGPT, Wanda, and RIA have been developed to reduce the effective parameter count. However, these methods often leave performance gaps, particularly for smaller models, and lack efficient fine-tuning strategies that preserve sparsity. This paper introduces SP-LoRA, a novel approach that combines the benefits of low-rank adaptation (LoRA) with the efficiency of sparse models. SP-LoRA addresses the density reversion that occurs when merging LoRA adapters into sparse weight matrices by introducing a mask matrix $\mathcal{M}$, ensuring that sparsity is maintained. Furthermore, since maintaining sparsity tends to incur a large memory overhead, we propose gradient checkpointing and memory reuse techniques to optimize GPU memory usage during fine-tuning, achieving efficiency comparable to standard LoRA. Through extensive evaluations of LLMs pruned with methods such as Wanda and SparseGPT and then fine-tuned with SP-LoRA, we demonstrate its effectiveness in both zero-shot scenarios and domain-specific tasks. Our key contributions include a parameter-efficient fine-tuning method for sparse LLMs, an optimized algorithm with reduced GPU memory overhead, and comprehensive empirical validation across diverse models.
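
The abstract describes masking the merged weight so that pruned positions stay zero. Below is a minimal PyTorch sketch of how such a sparsity-preserving LoRA layer might look under our reading of the abstract; the class name SPLoRALinear, the rank/alpha defaults, and the merge helper are illustrative assumptions rather than the authors' implementation, and the sketch omits the gradient checkpointing and memory reuse optimizations the paper proposes to avoid materializing the masked effective weight.

```python
# Illustrative sketch (not the authors' released code): a LoRA layer whose
# update is masked by the sparsity pattern of the frozen pruned weight, so
# merging the adapter never re-densifies the matrix.
import torch
import torch.nn as nn


class SPLoRALinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(weight, requires_grad=False)        # sparse, frozen W
        self.register_buffer("mask", (weight != 0).to(weight.dtype))   # binary mask M
        out_f, in_f = weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight (W + s * B A) ⊙ M keeps pruned positions at zero.
        delta = (self.lora_B @ self.lora_A) * self.scaling
        w_eff = (self.weight + delta) * self.mask
        return nn.functional.linear(x, w_eff)

    @torch.no_grad()
    def merge(self) -> torch.Tensor:
        # Merging the adapter preserves the original sparsity pattern of W.
        return (self.weight + (self.lora_B @ self.lora_A) * self.scaling) * self.mask
```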
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2070