Keywords: Large Language Models, Model Compression, Post-Training Quantization
TL;DR: FlexibleLLM, a finetuning-free weight-only Post-Training Quantization framework, makes low-bit quantization more efficient and flexible by focusing on the spatial distribution and intrinsic attributes of outliers as well as the noise they introduce.
Abstract: Low-bit quantization is crucial for deploying Large Language Models (LLMs) on resource-constrained hardware. However, existing Post-Training Quantization (PTQ) methods are limited by a monolithic view of outliers, failing to address their dual spatial distribution (both discrete and clustered) and overlooking "attribute outliers" (weights that are sensitive to quantization but not numerically large). Furthermore, these methods generally ignore the critical issue of quantization errors accumulating and amplifying across layers. To overcome these challenges, we introduce FlexibleLLM, a novel finetuning-free, weight-only PTQ framework founded on a new theoretical analysis of outliers. FlexibleLLM holistically addresses the outlier problem through three synergistic components: (1) To handle clustered outliers, the Self-Adaptive Block-Level Greedy Bit Search (SBGBS) module enables highly flexible, fractional-level bit-width allocation (e.g., 2.1 bits), optimizing the trade-off between hardware utilization and model accuracy. (2) For discrete outliers, the Discrete Outlier Suppression and Aware (DOSA) module employs a dual strategy: it innovatively uses Hadamard transforms for computationally efficient suppression of numerical outliers and a Hessian-aware mechanism to precisely handle overlooked "attribute outliers". (3) To combat error propagation, the Layer-Level Feedback and Denoising (LFD) module introduces a dynamic correction mechanism that mitigates the accumulation of "activation noise" from a global, cross-layer perspective. Extensive experiments demonstrate that FlexibleLLM achieves state-of-the-art performance, significantly outperforming not only existing finetuning-free methods but also many finetuning-based approaches, all while requiring substantially fewer computational resources. Code is available at https://anonymous.4open.science/r/FlexibleLLM.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 6872