Keywords: quantization, LLM, transformers
Abstract: Large language models (LLMs) are compute- and energy-intensive at inference time. While quantization improves efficiency, naive approaches often degrade performance due to outliers. We introduce FPTQuant, a method that enables effective transformer quantization through four novel, lightweight function-preserving transforms (FPTs): (1) a pre-RoPE transform for queries/keys, (2) a value transform, (3) an MLP scaling transform, and (4) a dynamic residual scaling. These FPTs exploit transformer equivariances to reshape activations without altering model function, require no custom kernels, and add negligible inference overhead. FPTQuant enables static INT4 quantization with minimal overhead and achieves a state-of-the-art (SOTA) speed-up of up to $3.9\times$ over the FP baseline.
Empirically, FPTQuant has an excellent accuracy-speed trade-off---it performs on par with or better than most prior work, and shows only slightly lower accuracy than a method that is up to 29\% slower.
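To illustrate the core idea behind function-preserving transforms, the sketch below shows how an invertible matrix $T$ can be folded into a value projection and its inverse into the corresponding output projection, reshaping the value activations (e.g., to suppress outliers) while leaving the attention output unchanged. This is a minimal illustration in the spirit of the value transform mentioned above, not the FPTQuant implementation; all names (`d_model`, `W_v`, `W_o`, `T`) and the choice of `T` are illustrative assumptions.

```python
import torch

# Minimal sketch (not the official FPTQuant code): a function-preserving transform
# on the value/output projection pair. An invertible matrix T is merged into W_v and
# its inverse into W_o, reshaping the value activations without changing the output.
torch.manual_seed(0)
torch.set_default_dtype(torch.float64)

d_model, n_tokens = 64, 8

# Toy value and output projection weights of a single attention head.
W_v = torch.randn(d_model, d_model)
W_o = torch.randn(d_model, d_model)

# Any invertible T keeps the function intact; an FPT-style method would choose T
# so that the transformed value activations are easier to quantize.
T = torch.randn(d_model, d_model) + 3.0 * torch.eye(d_model)

# Merge T into the weights offline, so inference needs no extra kernels or ops.
W_v_fpt = W_v @ T
W_o_fpt = torch.linalg.inv(T) @ W_o

X = torch.randn(n_tokens, d_model)                          # block input activations
A = torch.softmax(torch.randn(n_tokens, n_tokens), dim=-1)  # attention probabilities

out_orig = A @ (X @ W_v) @ W_o         # original attention output path
out_fpt = A @ (X @ W_v_fpt) @ W_o_fpt  # transformed path: same output, reshaped values

print(torch.allclose(out_orig, out_fpt))  # True -- the transform preserves the function
```

Because the transform is merged into the weights ahead of time, the quantized model runs with the same operations as the original, which is why such transforms add no inference overhead.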
Submission Number: 28