decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points

26 Apr 2024 (modified: 06 Nov 2024) · Submitted to NeurIPS 2024 · CC BY-NC-ND 4.0
Keywords: quantization; large language model; optimization
TL;DR: transform model quantization into a constrained optimization problem
Abstract: Quantization has emerged in recent years as one of the most promising compression technologies for deploying large models efficiently. However, existing quantization schemes suffer from significant accuracy degradation at very low bit-widths, or require additional computational overhead at deployment time, making them difficult to apply at industrial scale. In this paper, we propose decoupleQ, which achieves a substantial increase in model accuracy, especially at very low bit-widths. decoupleQ abandons the traditional heuristic quantization paradigm: it decouples the model parameters into an integer part and a floating-point part, then transforms the quantization problem into a constrained mathematical optimization problem, which is solved alternately with off-the-shelf methods. decoupleQ dispenses with any tricks for handling outliers, sensitive channels, etc., and focuses only on the basic optimization objective to achieve high model accuracy under extremely low-bit quantization. Quantization via decoupleQ is linear and uniform, making it more hardware-friendly than non-uniform counterparts and allowing the idea to be migrated to higher-bit quantization to enhance robustness. decoupleQ achieves accuracy comparable to fp16/bf16 for 2-bit quantization of large speech models in our company. The code (including the W2 CUDA kernels) is attached and will be made public.
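To make the decoupling idea concrete, below is a minimal sketch (not the paper's actual algorithm) of the parameterization the abstract describes: each weight row is decoupled into an integer part w_int on a uniform 2-bit grid and a floating-point part (per-channel scale and zero-point), and the two parts are refined alternately. The function name, the min/max initialization, and the use of a plain reconstruction objective ||W - W_hat||_F^2 are illustrative assumptions; the paper's constrained objective and solver may differ.

```python
import numpy as np

def decoupled_quantize_sketch(W, bits=2, iters=5):
    """Illustrative alternating optimization for one weight matrix.

    W: (out_channels, in_channels) float weights.
    Decouples W into an integer part w_int and a floating-point part
    (per-channel scale, zero), alternately refining each part to reduce
    ||W - (scale * w_int + zero)||_F^2.  Hypothetical simplification.
    """
    qmin, qmax = 0, 2**bits - 1
    # Floating-point part: per-channel scale and zero-point (min/max init).
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / qmax, 1e-8)
    zero = w_min
    for _ in range(iters):
        # Integer part: project onto the uniform integer grid given (scale, zero).
        w_int = np.clip(np.round((W - zero) / scale), qmin, qmax)
        # Floating-point part: per-channel least-squares fit of (scale, zero)
        # so that scale * w_int + zero best reproduces the original row.
        for c in range(W.shape[0]):
            A = np.stack([w_int[c], np.ones_like(w_int[c])], axis=1)
            sol, *_ = np.linalg.lstsq(A, W[c], rcond=None)
            scale[c, 0], zero[c, 0] = sol
    W_hat = scale * w_int + zero
    return w_int.astype(np.int8), scale, zero, W_hat
```

In the paper's setting the objective would be posed over layer outputs on calibration data rather than the weights alone; the sketch only illustrates the alternation between the integer and floating-point parts.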
Primary Area: Optimization for deep networks
Supplementary Material: zip
Submission Number: 836