Research Area: Compute efficient LMs
Keywords: Quantization, Large Language Models, Mixed Precision, Error Reconstruction, Low-Rank Decomposition
TL;DR: We propose an adaptive mixed-precision quantization method with low-rank quantization error reconstruction, which automatically searches for optimal parameter settings in a discrete space.
Abstract: Large language models (LLMs) have demonstrated superior performance on various downstream tasks. However, their practical application is hindered by immense memory and computation requirements. Although recent post-training quantization methods can effectively reduce memory usage and improve computational efficiency, they often overlook the varying sensitivity of different layers' weights to bit precision. Moreover, previous methods suffer significant accuracy loss under low-bit quantization (2-3 bits). To address these limitations, we propose Adaptive Mixed Precision and Low-Rank Quantization Error Reconstruction for LLMs (AMLQ), which achieves state-of-the-art performance at a comparable average bit width. Furthermore, we introduce a low-rank decomposition that reconstructs the quantization error based on the output features. Experimental results demonstrate that this method can be effectively combined with various quantization techniques and yields considerable performance gains. Our approach jointly considers model performance and inference efficiency, offering more than a 3$\times$ speedup over FP16 execution.
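To make the low-rank error-reconstruction idea concrete, the following is a minimal PyTorch sketch. It applies a plain truncated SVD to the weight residual after quantization, not the submission's output-feature-aware variant, and all function names (`quantize_rtn`, `low_rank_error_reconstruction`) and parameter choices are illustrative assumptions, not the authors' implementation.

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Symmetric round-to-nearest quantization with per-row scales (assumed quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def low_rank_error_reconstruction(w: torch.Tensor, bits: int = 3, rank: int = 32):
    """Quantize w, then approximate the residual w - w_q with a rank-`rank`
    factorization kept in FP16, so that w ~= w_q + U @ V."""
    w_q = quantize_rtn(w, bits)
    residual = w - w_q
    # Truncated SVD of the quantization error; the paper reconstructs the error
    # based on output features, whereas this sketch uses the raw weight residual.
    U, S, Vh = torch.linalg.svd(residual.float(), full_matrices=False)
    U_r = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    V_r = Vh[:rank, :]
    return w_q, U_r.half(), V_r.half()

# Usage: for y = x @ w.T, the corrected forward pass becomes
#   y ~= x @ w_q.T + (x @ V.half().T) @ U.half().T
w = torch.randn(4096, 4096)
w_q, U, V = low_rank_error_reconstruction(w, bits=3, rank=32)
```

The rank trades accuracy for the extra FP16 storage and the two small matmuls added to each layer's forward pass.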
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 604