SliderQuant: Accurate Post-Training Quantization for LLMs

ICLR 2026 Conference Submission 17118 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large language models, post-training quantization, low-bit neural networks, model compression
TL;DR: This paper presents SliderQuant, a new post-training quantization framework for LLMs that outperforms existing methods in both weight-only and weight-activation quantization.
Abstract: In this paper, we address post-training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre-trained LLM, the predominant sequential quantization framework treats different layers equally, but this may not be optimal in challenging bit-width settings. We empirically study the quantization impact of different layers on model accuracy and observe that: (1) shallow and deep layers are usually more sensitive to quantization than intermediate layers; (2) among the shallow/deep layers, the most sensitive ones are the first and last layers, which exhibit significantly larger quantization error than the others. These observations imply that the quantization design for LLMs should operate at multiple levels rather than a single level shared across all layers. Motivated by this, we propose a new PTQ framework termed **Sliding**-lay**er** **Quant**ization (SliderQuant), which relies on a simple adaptive sliding quantization concept facilitated by a few learnable parameters. The base component of SliderQuant is inter-layer sliding quantization, which incorporates three types of sliding window designs tailored to address the varying layer sensitivity to quantization. The other component, intra-layer sliding quantization, leverages an incremental strategy to quantize each window. As a result, SliderQuant effectively reduces quantization errors across layers. Extensive experiments on various language generation and reasoning tasks with different LLMs show that our method outperforms previous works for both weight-only quantization and weight-activation quantization. Code will be made publicly available.
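
To make the sliding-window idea concrete, below is a minimal, hypothetical sketch of sliding-window layer-wise PTQ. It is not the authors' implementation: the paper's learnable parameters, the three window designs, and the intra-layer incremental strategy are omitted, and all function names, the fixed window size, the uniform per-tensor fake quantizer, and the "give higher precision to first/last layers" rule are illustrative assumptions only.

```python
# Hypothetical sketch of sliding-window layer-wise PTQ (not the authors' code).
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Uniform symmetric per-tensor fake quantization of a weight tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale


def quantize_window(layers, calib_x, n_bits):
    """Quantize a window of adjacent layers, then propagate the calibration
    activations through the quantized window so the next window sees the
    accumulated quantization error (the usual sequential-PTQ convention)."""
    x = calib_x
    for layer in layers:
        layer.weight.data = fake_quantize(layer.weight.data, n_bits)
        x = layer(x)
    return x


def sliding_layer_ptq(model_layers, calib_x, n_bits=4, window=2, sensitive_bits=8):
    """Toy sliding-window PTQ: windows containing the first or last layer
    (empirically the most quantization-sensitive) get a higher bit-width;
    intermediate windows use the target bit-width. This is an assumption for
    illustration, not the mechanism proposed in the paper."""
    n = len(model_layers)
    x = calib_x
    for start in range(0, n, window):
        block = model_layers[start:start + window]
        bits = sensitive_bits if start == 0 or start + window >= n else n_bits
        x = quantize_window(block, x, bits)
    return model_layers


if __name__ == "__main__":
    torch.manual_seed(0)
    layers = [nn.Linear(16, 16) for _ in range(6)]
    calib = torch.randn(8, 16)  # toy calibration batch
    sliding_layer_ptq(layers, calib, n_bits=4, window=2)
```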
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17118