TL;DR: A structured mixed-precision quantization method for large language models (LLMs), achieving both hardware efficiency and improved language performance.
Abstract: Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise with high accuracy. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration optimizes quantizer parameters by incorporating element-level salience, retaining essential information. With its structured group-wise partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while significantly improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%. Our code is available at https://github.com/Aaronhuang-778/SliM-LLM.
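For intuition, the following is a minimal NumPy sketch of salience-driven, group-wise mixed-precision quantization in the spirit described above. The salience proxy (|w| scaled by a per-input-channel activation norm), the budget-preserving {target-1, target, target+1} bit assignment, and the salience-weighted clipping search are simplifying assumptions made for illustration, not the paper's exact formulation; function names such as slim_quantize are hypothetical.

```python
# Illustrative sketch only: the salience metric, bit-assignment rule, and
# quantizer calibration below are assumptions, not SliM-LLM's exact method.
import numpy as np

def group_salience(W, act_norm, group_size):
    """Per-group salience: mean of an element-wise proxy |w| * activation norm."""
    elem = np.abs(W) * act_norm[None, :]               # (out_dim, in_dim)
    groups = elem.reshape(W.shape[0], -1, group_size)  # (out_dim, n_groups, group_size)
    return groups.mean(axis=(0, 2))                    # (n_groups,)

def allocate_bits(salience, target_bits=2):
    """Give +1 bit to the most salient groups and -1 bit to the least salient,
    keeping the average bit-width at target_bits."""
    n = len(salience)
    order = np.argsort(salience)
    bits = np.full(n, target_bits, dtype=int)
    k = n // 4                                         # fraction of groups shifted (assumption)
    if k > 0:
        bits[order[-k:]] = target_bits + 1             # most salient groups
        bits[order[:k]] = target_bits - 1              # least salient groups
    return bits

def quantize_group(Wg, bits, sal_w, n_grid=50):
    """Asymmetric uniform quantization with a salience-weighted clipping search."""
    lo, hi = Wg.min(), Wg.max()
    best, best_err = None, np.inf
    for shrink in np.linspace(1.0, 0.5, n_grid):       # candidate clipping ranges
        scale = max((hi - lo) * shrink, 1e-8) / (2 ** bits - 1)
        zero = np.round(-lo * shrink / scale)
        q = np.clip(np.round(Wg / scale) + zero, 0, 2 ** bits - 1)
        deq = (q - zero) * scale
        err = np.sum(sal_w * (Wg - deq) ** 2)          # salience-weighted reconstruction error
        if err < best_err:
            best_err, best = err, deq
    return best

def slim_quantize(W, act_norm, group_size=128, target_bits=2):
    """Quantize a weight matrix group-wise with salience-determined bit-widths."""
    sal = group_salience(W, act_norm, group_size)
    bits = allocate_bits(sal, target_bits)
    W_q = np.empty_like(W)
    for g in range(W.shape[1] // group_size):
        sl = slice(g * group_size, (g + 1) * group_size)
        sal_w = np.abs(W[:, sl]) * act_norm[None, sl]
        W_q[:, sl] = quantize_group(W[:, sl], bits[g], sal_w)
    return W_q, bits

# Toy usage: quantize a random layer to an average of 2 bits per weight.
W = np.random.randn(256, 512).astype(np.float32)
act_norm = np.abs(np.random.randn(512)).astype(np.float32)  # assumed given by calibration data
W_q, bits = slim_quantize(W, act_norm, group_size=128, target_bits=2)
print(bits)  # e.g. a mix of 1-, 2-, and 3-bit groups averaging 2 bits
```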
Lay Summary: When large language models (LLMs) are deployed, they can require a lot of memory and computing power. To make these models smaller and more efficient, we use a process called quantization, which reduces the amount of information the model needs to store. However, standard quantization methods can hurt the model’s performance, and previous mixed-precision approaches are not hardware-friendly.
To solve this, we developed SliM-LLM, a smarter way of quantizing models. Instead of treating all parts of the model the same, SliM-LLM uses a "mixed-precision" approach that is structured within each weight matrix. It assigns more storage (a higher bit-width) to the important groups in each matrix and less to the unimportant ones, based on their importance, or "salience." This method keeps the model accurate while staying efficient.
Our tests show that SliM-LLM dramatically improves performance compared to other methods, especially at very low bit-widths. For example, it reduces memory use by nearly 6x for the LLaMA-7B model while maintaining inference speed and outperforming other approaches in accuracy.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Aaronhuang-778/SliM-LLM
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Low-bit Quantization, Machine Learning, Mixed-Precision
Submission Number: 9428