Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two (PoT) quantization is a general tool for mitigating this difficulty. Although weights quantized by previous PoT methods can be dequantized efficiently on CPUs using fixed-point addition, these methods are less effective on GPUs because dequantization entangles the sign bit and requires sequential bit manipulations. We propose a novel PoT quantization framework for LLM weights that (i) outperforms the accuracy of state-of-the-art methods in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization. To maintain the accuracy of the quantized model, we introduce a two-step post-training algorithm: (i) initialize the quantization scales with a robust starting point, and (ii) refine these scales using a minimal calibration set. Our PoT post-training algorithm surpasses the current state-of-the-art in integer quantization, particularly at low precisions such as 2- and 3-bit formats. Our PoT quantization also accelerates the dequantization step required for floating-point inference, yielding a 3.67× speedup on an NVIDIA V100 and a 1.63× speedup on an NVIDIA RTX 4090, compared to uniform integer dequantization.
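To make the contrast between the two dequantization schemes concrete, the sketch below dequantizes weights from a uniform integer code and from a PoT code. The 3-bit layout (one sign bit plus a two-bit exponent), the function names, and the NumPy implementation are illustrative assumptions for exposition, not the kernel described in the paper.

```python
import numpy as np

def dequant_uniform(q, scale, zero_point):
    # Uniform integer dequantization: w ~= scale * (q - zero_point).
    # One integer subtraction and one floating-point multiply per weight.
    return scale * (q.astype(np.float32) - zero_point)

def dequant_pot(sign, exponent, scale):
    # Power-of-two dequantization: w ~= scale * (-1)^sign * 2^(-exponent).
    # The 2^(-exponent) factor can in principle be folded into the exponent
    # field of `scale` (an exponent/fixed-point addition) instead of a multiply.
    return scale * np.where(sign == 0, 1.0, -1.0) * np.exp2(-exponent.astype(np.float32))

# Toy 3-bit PoT codes: high bit = sign, low two bits = exponent (assumed layout).
codes = np.array([0b000, 0b101, 0b011, 0b110], dtype=np.uint8)
sign = codes >> 2
exponent = codes & 0b11
print(dequant_pot(sign, exponent, scale=0.5))        # [0.5, -0.25, 0.0625, -0.125]

# Uniform 3-bit codes for comparison, with an assumed zero point of 4.
q = np.array([3, 1, 7, 0], dtype=np.uint8)
print(dequant_uniform(q, scale=0.5, zero_point=4))   # [-0.5, -1.5, 1.5, -2.0]
```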