Extreme composite compression of large language models through joint optimization

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: model quantization, model compression, sparsification, joint optimization
Abstract: Post-Training Quantization (PTQ) and Post-Training Sparsification (PTS) are dominant methods in the compression of Large Language Models (LLMs) due to their minimal resource usage and generalizability. Integrating quantization and sparsification in a unified framework is a natural idea, but doing so often results in substantial accuracy losses. Here we argue that the key lies in optimization. This paper introduces a novel joint optimization strategy that concurrently mitigates errors induced by both sparsification and quantization. Unlike sequential approaches, our method employs learnable transformation matrices to simultaneously optimize errors across both dimensions, preventing the misalignments typically associated with sequential optimization. Furthermore, we present a reordering mechanism within the learnable-mask sparsification process to maintain a consistent sparsity ratio. This mechanism ensures that the least important weights are prioritized during each update iteration, thus enhancing the stability of the compression process. Our approach demonstrates considerable performance gains across diverse models and datasets, with the most notable improvements observed under extremely low-bit quantization and high sparsity ratios. For example, on the LLaMA2-13b model with 2-bit weight quantization and a 75% sparsity configuration, our method surpasses the state-of-the-art (SOTA) by 9.03% in average accuracy across five zero-shot tasks. Meanwhile, on the more recent LLaMA3-8b model with 3-bit weight quantization and a 50% sparsity configuration, our method outperforms the SOTA by 4.58% (56.86% vs. 52.28%) on zero-shot tasks and achieves a perplexity reduction of 4.45 on the WikiText2 dataset (10.78 vs. 15.23).
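To make the abstract's idea concrete, the sketch below illustrates (in simplified form, not the authors' implementation) how a weight matrix might be sparsified with a learnable, re-ranked mask and fake-quantized, while a learnable transformation T is optimized jointly so that the compressed layer still reproduces the original output on calibration data. All names (`fake_quant`, `learnable_mask`, `joint_compress`), the straight-through-estimator tricks, and the reconstruction loss are assumptions made for illustration.

```python
# Minimal PyTorch sketch of joint quantization + sparsification with a learnable
# transformation and a re-ranked (fixed-ratio) learnable mask. Illustrative only.
import torch

def ste_round(x):
    """Round in the forward pass, identity gradient in the backward pass."""
    return x + (torch.round(x) - x).detach()

def fake_quant(w, n_bits=2):
    """Symmetric per-tensor fake quantization (assumed scheme for this sketch)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return ste_round(w / scale).clamp(-qmax, qmax) * scale

def learnable_mask(scores, sparsity=0.75):
    """Hard mask that zeros the lowest-scoring weights; re-ranked on every call,
    so the sparsity ratio stays fixed. Gradients reach the scores via a
    straight-through soft-sigmoid surrogate."""
    soft = torch.sigmoid(scores)
    k = int(sparsity * scores.numel())
    thresh = soft.flatten().kthvalue(k).values
    hard = (soft > thresh).float()
    return hard + soft - soft.detach()

def joint_compress(w, x, n_bits=2, sparsity=0.75, steps=200, lr=1e-3):
    """Jointly optimize a transformation T and importance scores so that the
    sparsified-then-quantized weight matches the layer output x @ w.T."""
    d_in = w.shape[1]
    T = torch.eye(d_in, requires_grad=True)                 # learnable transform
    scores = w.abs().clone().detach().requires_grad_(True)  # learnable importance
    opt = torch.optim.Adam([T, scores], lr=lr)
    target = x @ w.t()
    for _ in range(steps):
        wt = w @ T                                           # transformed weight
        wc = fake_quant(wt * learnable_mask(scores, sparsity), n_bits)
        out = (x @ torch.linalg.inv(T).t()) @ wc.t()         # undo T on activations
        loss = (out - target).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return wc.detach()

# Toy usage: compress a random 256x256 weight with a small calibration batch.
w = torch.randn(256, 256)
x = torch.randn(32, 256)
w_compressed = joint_compress(w, x, n_bits=2, sparsity=0.75, steps=50)
```

The point of the sketch is the joint update: the mask, the quantizer input, and the transformation are optimized against a single reconstruction loss, rather than sparsifying first and quantizing afterwards.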
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8839