Dobi-SVD: Differential SVD for LLM Compression and Some New Perspectives

ICLR 2025 Conference Submission 5809 Authors

Published: 22 Jan 2025, Last Modified: 22 Jan 2025 · ICLR 2025 · CC BY 4.0
Keywords: Model Compression, Low-Rank Decomposition, SVD, Efficient LLM, Differentiable
TL;DR: We are the first to theoretically prove that truncating activations outperforms truncating weights, and we propose Dobi-SVD, the first SVD-based method to significantly compress LLM weights with minimal performance drop.
Abstract: Large language models (LLMs) have sparked a new wave of AI applications; however, their substantial computational costs and memory demands pose significant challenges to democratizing access to LLMs for a broader audience. Singular Value Decomposition (SVD), a technique studied for decades, offers a hardware-independent and flexibly tunable solution for LLM compression. In this paper, we present new directions for using SVD: we first theoretically analyze the optimality of truncating weights versus truncating activations, and we then identify three key issues in SVD-based LLM compression: (1) How can we determine the optimal truncation position for each weight matrix in an LLM? (2) How can we efficiently update the weight matrices based on the truncation positions? (3) How can we address SVD's inherent "injection" property, which results in information loss? We propose an effective approach, **Dobi-SVD**, to tackle these three issues. First, we propose a **differentiable** truncation-value learning mechanism, along with gradient-robust backpropagation, enabling the model to adaptively find the optimal truncation positions. Next, we use the Eckart-Young-Mirsky theorem to derive a theoretically **optimal** weight-update formula through rigorous mathematical analysis. Finally, by observing and leveraging the quantization-friendly nature of matrices after SVD, we reconstruct the mapping between truncation positions and memory requirements, establishing a **bijection** from truncation positions to memory. Experimental results show that, at a 40\% parameter-compression rate, our method achieves a perplexity of 9.07 on the WikiText-2 dataset with the compressed LLaMA-7B model, a 78.7\% improvement over the state-of-the-art SVD-based LLM compression method. We emphasize that Dobi-SVD is the first method to achieve such high-ratio LLM compression with minimal performance drop. We also extend Dobi-SVD to VLM compression, achieving a 20\% increase in throughput with minimal performance degradation. We hope that the resulting inference speedups (up to 12.4x on 12GB NVIDIA Titan Xp GPUs and 3x on 80GB A100 GPUs for LLMs, and 1.2x on 80GB A100 GPUs for VLMs) will bring significant benefits to the broader community, such as robotics.
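As a point of reference for the abstract's first claim, the minimal NumPy sketch below (our illustration, not the authors' code; all shapes and data are hypothetical) shows why truncating a layer's output activations can never incur more output reconstruction error than truncating its weights: by the Eckart-Young-Mirsky theorem, the rank-k SVD truncation of the output Y = WX is the *best* rank-k approximation of Y, whereas W_k X is merely *some* matrix of rank at most k.

```python
import numpy as np

# Hypothetical sizes: a d x d weight matrix, n calibration tokens, target rank k.
rng = np.random.default_rng(0)
d, n, k = 512, 1024, 128
W = rng.standard_normal((d, d))   # stand-in for an LLM weight matrix
X = rng.standard_normal((d, n))   # stand-in for calibration activations

def rank_k(M, k):
    """Best rank-k approximation of M in Frobenius norm (Eckart-Young-Mirsky)."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    # Storing the two thin factors (U[:, :k] * S[:k]) and Vt[:k, :] instead of
    # the dense matrix is where the parameter savings come from: 2*d*k numbers
    # instead of d*d.
    return (U[:, :k] * S[:k]) @ Vt[:k, :]

Y = W @ X  # the layer's true output on calibration data

# (a) Truncate the weights, then apply the rank-k weights to X.
err_weight = np.linalg.norm(Y - rank_k(W, k) @ X)

# (b) Truncate the activations/output: take the SVD of Y itself.
err_act = np.linalg.norm(Y - rank_k(Y, k))

print(f"output error, weight truncation:     {err_weight:.2f}")
print(f"output error, activation truncation: {err_act:.2f}")
assert err_act <= err_weight  # guaranteed by Eckart-Young-Mirsky
```

This sketch only demonstrates the baseline inequality; Dobi-SVD's contributions (differentiable truncation-position learning, the optimal weight-update formula, and the truncation-to-memory bijection) are described in the paper itself.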
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5809