TL;DR: A structured mixed-precision quantization method for large language models (LLMs), achieving both hardware efficiency and improved language performance.
Abstract: Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise with high accuracy. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration optimizes quantizer parameters by incorporating element-level salience, retaining essential information. With its structured group-wise partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while significantly improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%. Our code is available at https://github.com/Aaronhuang-778/SliM-LLM.
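For intuition, the following is a minimal NumPy sketch of salience-driven, group-wise mixed-precision quantization in the spirit described above. The salience proxy (|w| scaled by a per-input-channel activation norm), the budget-preserving {target-1, target, target+1} bit assignment, and the salience-weighted clipping search are simplifying assumptions made for illustration, not the paper's exact formulation; function names such as slim_quantize are hypothetical.

```python
# Illustrative sketch only: the salience metric, bit-assignment rule, and
# quantizer calibration below are assumptions, not SliM-LLM's exact method.
import numpy as np

def group_salience(W, act_norm, group_size):
    """Per-group salience: mean of an element-wise proxy |w| * activation norm."""
    elem = np.abs(W) * act_norm[None, :]               # (out_dim, in_dim)
    groups = elem.reshape(W.shape[0], -1, group_size)  # (out_dim, n_groups, group_size)
    return groups.mean(axis=(0, 2))                    # (n_groups,)

def allocate_bits(salience, target_bits=2):
    """Give +1 bit to the most salient groups and -1 bit to the least salient,
    keeping the average bit-width at target_bits."""
    n = len(salience)
    order = np.argsort(salience)
    bits = np.full(n, target_bits, dtype=int)
    k = n // 4                                         # fraction of groups shifted (assumption)
    if k > 0:
        bits[order[-k:]] = target_bits + 1             # most salient groups
        bits[order[:k]] = target_bits - 1              # least salient groups
    return bits

def quantize_group(Wg, bits, sal_w, n_grid=50):
    """Asymmetric uniform quantization with a salience-weighted clipping search."""
    lo, hi = Wg.min(), Wg.max()
    best, best_err = None, np.inf
    for shrink in np.linspace(1.0, 0.5, n_grid):       # candidate clipping ranges
        scale = max((hi - lo) * shrink, 1e-8) / (2 ** bits - 1)
        zero = np.round(-lo * shrink / scale)
        q = np.clip(np.round(Wg / scale) + zero, 0, 2 ** bits - 1)
        deq = (q - zero) * scale
        err = np.sum(sal_w * (Wg - deq) ** 2)          # salience-weighted reconstruction error
        if err < best_err:
            best_err, best = err, deq
    return best

def slim_quantize(W, act_norm, group_size=128, target_bits=2):
    """Quantize a weight matrix group-wise with salience-determined bit-widths."""
    sal = group_salience(W, act_norm, group_size)
    bits = allocate_bits(sal, target_bits)
    W_q = np.empty_like(W)
    for g in range(W.shape[1] // group_size):
        sl = slice(g * group_size, (g + 1) * group_size)
        sal_w = np.abs(W[:, sl]) * act_norm[None, sl]
        W_q[:, sl] = quantize_group(W[:, sl], bits[g], sal_w)
    return W_q, bits

# Toy usage: quantize a random layer to an average of 2 bits per weight.
W = np.random.randn(256, 512).astype(np.float32)
act_norm = np.abs(np.random.randn(512)).astype(np.float32)  # assumed given by calibration data
W_q, bits = slim_quantize(W, act_norm, group_size=128, target_bits=2)
print(bits)  # e.g. a mix of 1-, 2-, and 3-bit groups averaging 2 bits
```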
Lay Summary: When large language models (LLMs) are deployed, they can require a lot of memory and computing power. To make these models smaller and more efficient, we use a process called quantization, which reduces the amount of information the model needs to store. However, standard quantization methods can hurt the model’s performance, and previous mixed-precision approaches are not hardware-friendly.
To solve this, we developed SliM-LLM, a smarter way of quantizing models. Instead of treating all parts of the model the same, SliM-LLM uses a "mixed-precision" approach that is structured within each weight matrix. It assigns more storage (a higher bit-width) to the important groups in each matrix and less to the unimportant ones, based on their importance, or "salience." This method keeps the model accurate while staying efficient.
Our tests show that SliM-LLM dramatically improves performance compared to other methods, especially at very low bit-widths. For example, it reduces memory use by nearly 6x for the LLaMA-7B model while maintaining inference speed and outperforming other approaches in accuracy.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Aaronhuang-778/SliM-LLM
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Low-bit Quantization, Machine Learning, Mixed-Precision
Submission Number: 9428