Keywords: model compression, quantization, efficient finetuning, LLM
TL;DR: ClusComp outperforms existing quantization methods across various bit-widths, and is both parameter- and memory-efficient for finetuning.
Abstract: As large language models (LLMs) continue to scale, model compression becomes increasingly important for enabling edge deployment and ensuring accessibility to users with limited resources. Weight-only quantization is a key technique for model compression, allowing for a substantial reduction in model size while preserving performance. However, as bit-width decreases, the performance of quantized LLMs tends to degrade significantly. Additionally, because quantization involves non-differentiable operations, standard finetuning of quantized LLMs is not supported, and alternative finetuning approaches often fail to match the effectiveness of full finetuning. In this paper, we introduce ClusComp, a novel and simple model compression paradigm. ClusComp first clusters the weight matrices to generate codebooks, and then tunes these codebooks block-by-block to reconstruct intermediate activations. Despite its simplicity, ClusComp (1) consistently achieves better performance than existing methods at 2-4 bit precision; (2) pushes the compression limit to the 1-bit level, outperforming existing ultra-low-bit methods with only limited finetuning steps; and (3) facilitates seamless and efficient finetuning, surpassing existing quantization-based or memory-efficient finetuning methods and even rivaling full finetuning of the FP16 model. Notably, these procedures can be executed on a single NVIDIA A6000-48GB GPU for LLMs with as many as 70B parameters.
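To make the clustering-to-codebook idea in the abstract concrete, below is a minimal, hypothetical sketch: a weight matrix is split into small sub-vectors, the sub-vectors are clustered with plain k-means to form a codebook, and the matrix is reconstructed from codebook entries plus assignment indices. The function names and hyperparameters (cluster_weight, group=4, n_codes=256) are illustrative assumptions; the actual ClusComp settings and its block-by-block codebook tuning against intermediate activations are not specified in the abstract and are omitted here.

```python
# Hypothetical sketch of codebook-based weight compression via clustering.
# Not the paper's implementation; group size, codebook size, and the
# subsequent block-wise codebook tuning are assumptions / omitted.

import torch


def cluster_weight(weight: torch.Tensor, group: int = 4, n_codes: int = 256, iters: int = 20):
    """Cluster `group`-sized sub-vectors of `weight` into `n_codes` centroids (plain k-means)."""
    out_dim, in_dim = weight.shape
    assert in_dim % group == 0
    vecs = weight.reshape(-1, group)  # (out_dim * in_dim / group, group)

    # Initialize centroids from a random subset of sub-vectors.
    idx = torch.randperm(vecs.size(0))[:n_codes]
    codebook = vecs[idx].clone()

    for _ in range(iters):
        # Assign each sub-vector to its nearest codebook entry.
        assign = torch.cdist(vecs, codebook).argmin(dim=1)
        # Update each centroid as the mean of its assigned sub-vectors.
        for c in range(n_codes):
            mask = assign == c
            if mask.any():
                codebook[c] = vecs[mask].mean(dim=0)

    return codebook, assign, (out_dim, in_dim)


def reconstruct(codebook: torch.Tensor, assign: torch.Tensor, shape):
    """Rebuild the approximate weight matrix from codebook entries and assignments."""
    return codebook[assign].reshape(shape)


if __name__ == "__main__":
    w = torch.randn(256, 256)
    cb, assign, shape = cluster_weight(w)
    w_hat = reconstruct(cb, assign, shape)
    print("reconstruction MSE:", torch.mean((w - w_hat) ** 2).item())
```

In such a scheme, only the small codebook and the per-sub-vector indices need to be stored, which is what yields the low effective bit-width; the codebook entries remain continuous parameters, which is consistent with the abstract's claim that the codebooks can be tuned (and finetuned) with standard gradient-based training.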
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 440