Keywords: NLP, LLM, LLM Compression
TL;DR: This method learns, through gradient descent, the optimal per-layer ranks to use when compressing LLMs via low-rank decomposition
Abstract: Approaches for large-language model compression using low-rank decomposition have made strides, particularly with the introduction of activation- and loss-aware Singular Value Decomposition (SVD), which improves the trade-off between decomposition rank and downstream task performance. Despite these advancements, a persistent challenge remains: selecting the optimal ranks for each layer to jointly optimize compression rate and downstream task accuracy. Current methods either rely on heuristics that can yield sub-optimal results due to their limited discrete search space, or are gradient-based but, without post-compression fine-tuning, do not match the performance of heuristic approaches. To address these issues, we propose Learning to Low-Rank Compress (LLRC), a gradient-based approach that directly learns the weights of masks that select singular values in a fine-tuning-free setting. Using a calibration dataset of only 3,000 documents, our training procedure teaches the model to select progressively fewer singular values while minimizing the divergence of intermediate activations from those of the original model. Our approach outperforms competing fine-tuning-free rank selection approaches, such as Sensitivity-based Truncation Rank Searching (STRS), Adaptive Rank Selection (ARS), and LLM-Pruner, on Llama-2-7B, Llama-3-8B, Gemma-7B, and Llama-2-13B across various compression rates on common-sense reasoning and open-domain question-answering tasks. For instance, at a compression rate of 20%, our approach outperforms the competitive STRS on MMLU, BoolQ, and OpenbookQA by 12%, 3.5%, and 4.4%, respectively, using Llama-2-13B. More remarkably, our fine-tuning-free approach consistently outperforms LLM-Pruner, even after fine-tuning, on NQ-Open, MMLU, BoolQ, and OpenbookQA with Llama-2-7B.
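The abstract describes learning soft masks over singular values so that per-layer ranks emerge from gradient descent rather than a discrete search. The sketch below is an illustration of that general idea only, not the authors' LLRC implementation: the class name, the sigmoid gating, and the `rank_penalty` weight are all assumptions introduced for exposition.

```python
# Illustrative sketch (not the authors' LLRC code): a linear layer whose SVD
# singular values are gated by learnable mask logits, trained to reproduce the
# original layer's activations while penalizing the number of kept ranks.
import torch
import torch.nn as nn

class MaskedSVDLinear(nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        # Frozen SVD factors of the original weight matrix (shape: out x in).
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        # One learnable logit per singular value; sigmoid gives a soft keep/drop mask.
        self.mask_logits = nn.Parameter(torch.full_like(S, 3.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.mask_logits)           # soft rank-selection mask
        w = self.U @ torch.diag(self.S * mask) @ self.Vh  # masked reconstruction
        return x @ w.T

def step_loss(layer: MaskedSVDLinear, orig_layer: nn.Module,
              x: torch.Tensor, rank_penalty: float = 1e-3) -> torch.Tensor:
    # Match the original layer's activations on calibration inputs and
    # discourage keeping singular values (i.e., push toward lower rank).
    act_loss = nn.functional.mse_loss(layer(x), orig_layer(x))
    expected_rank = torch.sigmoid(layer.mask_logits).sum()
    return act_loss + rank_penalty * expected_rank
```

Under this reading, optimizing `step_loss` on calibration data drives mask logits for unimportant singular values toward zero; thresholding the masks afterwards yields a concrete, possibly different, rank per layer.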
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12005