Adaptive Layer-Wise Transformations for Post-Training Quantization of Large Language Models

16 Sept 2025 (modified: 21 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Post-training Quantization, Large Language Models
Abstract: Large language models require significant computational resources for deployment, making quantization essential for practical applications. However, the main obstacle to effective quantization is systematic outliers in activations and weights, which cause substantial degradation in LLM performance, especially at low-bit settings. While existing transformation-based methods such as affine and rotation transformations successfully mitigate outliers, they adopt a homogeneous transformation setting, i.e., the same transformation type across all layers, ignoring the heterogeneous distribution characteristics within LLMs. In this paper, we propose an adaptive transformation selection framework that systematically determines the optimal transformation on a per-layer basis. We first formulate transformation selection as a differentiable optimization problem that identifies the most suitable transformation type for each layer. However, searching for optimal layer-wise transformations for every model is computationally expensive. We therefore establish a connection between weight-distribution kurtosis and the appropriate transformation type. Specifically, we propose an outlier-guided layer selection method based on robust $z$-score normalization that achieves performance comparable to the differentiable search at significantly reduced overhead. Comprehensive experiments on LLaMA-family models demonstrate that our adaptive approach consistently outperforms the widely used homogeneous transformation setting. For example, on LLaMA-3-8B under the aggressive W3A3K2V2 quantization setting, our method improves perplexity by up to 4.58 points and average six-task zero-shot accuracy by 2.11\% over the strongest existing method, FlatQuant, demonstrating the necessity of heterogeneous transformation selection for optimal LLM quantization.
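To make the outlier-guided selection idea concrete, here is a minimal sketch of how per-layer weight kurtosis could be normalized with a robust $z$-score (median and MAD rather than mean and standard deviation) and thresholded to pick a transformation type per layer. The function names, the threshold value, and the mapping from high-kurtosis layers to one transformation type versus another are illustrative assumptions, not the paper's actual implementation.

```python
import torch


def robust_zscores(values: torch.Tensor) -> torch.Tensor:
    # Robust z-score: center on the median and scale by the median
    # absolute deviation (MAD), so a few extreme layers do not skew
    # the normalization the way mean/std would.
    median = values.median()
    mad = (values - median).abs().median()
    return 0.6745 * (values - median) / (mad + 1e-12)


def select_transform_per_layer(layer_weights, z_threshold=1.0):
    """Assign a transformation type to each layer from its weight kurtosis.

    `layer_weights` is a list of 2-D weight tensors, one per layer.
    As an illustrative rule (an assumption, not the paper's), layers whose
    kurtosis is flagged as an outlier by the robust z-score are given a
    rotation transformation, and the remaining layers an affine one.
    """
    kurtoses = []
    for w in layer_weights:
        x = w.flatten().float()
        x = (x - x.mean()) / (x.std() + 1e-12)
        kurtoses.append((x ** 4).mean())  # fourth standardized moment
    z = robust_zscores(torch.stack(kurtoses))
    return ["rotation" if zi > z_threshold else "affine" for zi in z]


if __name__ == "__main__":
    # Toy example: random matrices stand in for real LLM layer weights,
    # with one layer given injected outliers to inflate its kurtosis.
    torch.manual_seed(0)
    layers = [torch.randn(256, 256) for _ in range(7)]
    heavy = torch.randn(256, 256)
    heavy[0, :8] *= 50.0
    layers.append(heavy)
    print(select_transform_per_layer(layers))
```

Because this rule only requires one pass over the pretrained weights, it avoids re-running the differentiable search for every new model, which is the efficiency argument made in the abstract.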
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6515