Keywords: Quantization, Large Language Model, Efficient Model
Abstract: Large Language Models (LLMs) are powerful but resource-intensive. Power-of-two (PoT) quantization offers hardware-friendly compression but often struggles with accuracy, especially on GPUs, due to sign bit entanglement and sequential dequantization. We propose PoTPTQ, a novel PoT quantization framework for LLM weights that achieves state-of-the-art accuracy at extremely low precision (2- and 3-bit) and enables faster inference via efficient dequantization. Our two-step post-training algorithm initializes quantization scales robustly and refines them with a minimal calibration set. PoTPTQ surpasses integer quantization baselines at low precisions and achieves dequantization speedups of up to $3.67\times$ on NVIDIA V100 and $1.63\times$ on NVIDIA RTX 4090 compared to uniform integer dequantization.
Submission Number: 137
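To illustrate why PoT dequantization can be faster than uniform integer dequantization, here is a minimal NumPy sketch, not the paper's GPU kernel: multiplying a float scale by $2^{-e}$ only changes its exponent field, so the float multiply of integer dequantization can be replaced with integer arithmetic on the exponent bits, and the sign can be injected directly into the sign bit. All variable names (`scale_pot`, `signs`, `e`, etc.) are illustrative assumptions, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Uniform integer dequantization: weight ~= q * scale ------------------
scale_int = np.float32(0.02)
q = rng.integers(-4, 4, size=8, dtype=np.int8)          # 3-bit signed levels
w_int = q.astype(np.float32) * scale_int                # requires a float multiply per weight

# --- Power-of-two dequantization: weight ~= sign * 2**(-e) * scale --------
scale_pot = np.float32(0.02)                            # positive per-group scale
signs = rng.choice(np.array([-1, 1], dtype=np.int8), size=8)
e = rng.integers(0, 4, size=8)                          # 2-bit exponent codes

scale_bits = scale_pot.view(np.uint32)                  # reinterpret float32 bits
mag_bits = scale_bits - (e.astype(np.uint32) << 23)     # subtract e from the exponent field
sign_bits = np.where(signs < 0, np.uint32(0x80000000), np.uint32(0))
w_pot = (mag_bits | sign_bits).view(np.float32)         # sign handled by the top bit

# Sanity check against the reference formula.
assert np.allclose(w_pot, signs * scale_pot * np.float32(2.0) ** (-e))
```

The sketch assumes a normal (non-denormal) positive scale so the exponent subtraction cannot underflow; a production kernel would also pack the exponent and sign codes densely and decode them in parallel rather than elementwise as here.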