Abstract: We introduce GPTAQ, a novel finetuning-free quantization method for compressing large-scale transformer architectures.
Unlike the previous GPTQ method, which calibrates each layer independently, we calibrate each quantized layer to match the exact output of the corresponding layer in the full-precision model, a scheme we call *asymmetric calibration*. This scheme effectively reduces the quantization error accumulated in earlier layers. We analyze the problem through the lens of optimal brain compression and derive a closed-form solution that explicitly minimizes both the quantization error and the accumulated asymmetry error. Furthermore, we apply several techniques to parallelize the computation of this solution, including channel parallelization, neuron decomposition, and a Cholesky reformulation for matrix fusion. As a result, GPTAQ is easy to implement, requiring only about 20 more lines of code than GPTQ while improving its performance under low-bit quantization. Remarkably, on a single GPU, we quantize a 405B-parameter language transformer as well as EVA-02, the top-ranked vision transformer that achieves 90% ImageNet accuracy. Code is available at [Github](https://github.com/Intelligent-Computing-Lab-Yale/GPTAQ).
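To make the distinction concrete, below is a minimal, illustrative sketch (not the authors' code; all tensors and the toy quantizer are hypothetical) contrasting the layer-wise objective of standard GPTQ-style calibration with the asymmetric calibration objective described above, which targets the full-precision model's output rather than the quantized-input output.

```python
# Illustrative sketch only; assumes random calibration data and a toy quantizer.
import torch

d_in, d_out, n_tokens = 64, 32, 256
W = torch.randn(d_out, d_in)                 # full-precision layer weight
X_fp = torch.randn(d_in, n_tokens)           # layer input from the full-precision model
X_q = X_fp + 0.05 * torch.randn_like(X_fp)   # layer input from the partially quantized model

def fake_quant(w, n_bits=4):
    # Toy symmetric uniform quantizer, for illustration only.
    scale = w.abs().max() / (2 ** (n_bits - 1) - 1)
    return torch.round(w / scale) * scale

W_q = fake_quant(W)

# Symmetric (GPTQ-style) objective: match the layer's own output on the
# quantized-model input, so error from earlier layers is not compensated.
sym_err = torch.norm(W_q @ X_q - W @ X_q) ** 2

# Asymmetric calibration objective: match the full-precision model's output,
# so accumulated error from earlier quantized layers is also reduced.
asym_err = torch.norm(W_q @ X_q - W @ X_fp) ** 2

print(f"symmetric error: {sym_err:.3f}, asymmetric error: {asym_err:.3f}")
```

The sketch only states the two objectives; the paper's contribution is a closed-form, parallelizable solution that minimizes the asymmetric objective without any fine-tuning.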
Lay Summary: We provide an approach to compress existing large language models and vision foundation models. We asked whether an existing compression method could better follow the original model's behavior during compression, in particular by matching the compressed model's intermediate outputs to those of the original model.
We derive a closed-form solution to this problem and show how to execute the algorithm very efficiently. Moreover, our solution can be integrated into widely supported GPTQ APIs with only about 20 additional lines of code, and it improves their performance.
Link To Code: https://github.com/Intelligent-Computing-Lab-Yale/GPTAQ
Primary Area: Optimization->Discrete and Combinatorial Optimization
Keywords: Transformers, Quantization, Discrete Optimization
Submission Number: 1623