1-Bit Quantization Meets Structured Pruning: Towards Extreme Compression of Large Language Models

13 Sept 2025 (modified: 16 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: large language model, 1-bit quantization, network quantization, structured pruning
Abstract: Large language models (LLMs) have achieved remarkable success across a wide range of AI applications. However, their massive parameter scales pose substantial challenges for practical deployment. Quantization, a widely adopted compression technique, reduces parameter precision to as low as 1 bit, substantially shrinking the size and storage footprint of LLMs. While existing 1-bit quantization methods have reached the theoretical lower bound of bit-width, they remain confined to element-level quantization decisions and fail to fully exploit structured redundancy for further compression. This is because prior work mainly focuses on element-wise weight saliency and overlooks the structured distribution of the weight saliency map. As a first attempt, this paper explores a unified framework that integrates structured pruning with 1-bit quantization, leveraging the strengths of both approaches for more effective compression. To this end, we introduce a novel Structured Saliency Score metric to identify which structured units within the LLM should be pruned and which should be quantized. Through theoretical analysis, we show that the proposed metric effectively coordinates the synergy between quantization and pruning. Extensive experiments on diverse LLMs and benchmarks demonstrate that our approach not only surpasses existing 1-bit quantization methods but also achieves further memory savings while maintaining competitive performance.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4740
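
The abstract does not define the Structured Saliency Score, so the sketch below is only a hypothetical illustration of the general pipeline it describes: aggregate an element-wise saliency proxy into per-row (structured-unit) scores, prune the lowest-scoring rows, and binarize the remaining weights with a per-row scaling factor. The saliency proxy (|weight| times a calibration activation norm), the prune ratio, and all function names are assumptions, not the paper's actual metric or method.

```python
# Illustrative sketch only: the Structured Saliency Score is not specified in
# the abstract, so this uses a simple, hypothetical proxy for it.
import torch


def structured_saliency(weight: torch.Tensor, act_norm: torch.Tensor) -> torch.Tensor:
    """Aggregate element-wise saliency into one score per output row (structured unit).

    weight:   (out_features, in_features) linear-layer weight matrix
    act_norm: (in_features,) per-input-channel activation norm from calibration
              data (hypothetical element-wise saliency proxy)
    """
    elementwise = weight.abs() * act_norm        # element-wise saliency map
    return elementwise.sum(dim=1)                # structured, per-row score


def prune_and_binarize(weight: torch.Tensor, act_norm: torch.Tensor,
                       prune_ratio: float = 0.2):
    """Prune the lowest-scoring rows, binarize the remaining weights to 1 bit."""
    scores = structured_saliency(weight, act_norm)
    n_prune = int(prune_ratio * weight.shape[0])
    keep_idx = scores.argsort(descending=True)[: weight.shape[0] - n_prune]

    kept = weight[keep_idx]                          # structured pruning
    scale = kept.abs().mean(dim=1, keepdim=True)     # per-row scaling factor
    binary = torch.sign(kept)                        # 1-bit weights via the sign function
    return keep_idx, scale, binary                   # reconstruction: scale * binary


if __name__ == "__main__":
    W = torch.randn(8, 16)
    a = torch.rand(16)                               # fake calibration statistics
    idx, s, b = prune_and_binarize(W, a, prune_ratio=0.25)
    print(idx.shape, s.shape, b.shape)               # 6 rows kept, 2 pruned
```

In this toy version the per-row score simply sums element-wise saliency, so a row is kept or dropped as a whole unit; the paper's contribution is a metric that makes this prune-versus-quantize decision in a way that is provably coordinated with 1-bit quantization error, which the sketch does not attempt to reproduce.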