AdaQuantLM: LLM Quantization with Adaptive Bit-Widths

Published: 09 Oct 2024 · Last Modified: 19 Nov 2024 · Compression Workshop @ NeurIPS 2024 · CC BY 4.0
Keywords: LLM Quantization, Adaptive Bit-width
TL;DR: AdaQuantLM enables adaptive bit-width quantization, eliminating the need for separate fine-tuning for each bit-width. By leveraging additive codewords, it efficiently converts between bit-widths without storing full-precision weights.
Abstract: Current LLM quantization methods focus on single bit-width quantization, requiring time-consuming fine-tuning and benchmarking for each bit-width version, which limits their adaptability to different deployment scenarios. To address these challenges, we propose AdaQuantLM, a method for LLM quantization with adaptive bit-widths. Inspired by techniques such as AdaBits and Additive Quantization for Language Models (AQLM), AdaQuantLM leverages the additivity of codewords in quantized models. This allows efficient conversion between bit-widths by adding or removing specific codewords, eliminating the need to store full-precision weights. Our approach jointly quantizes and fine-tunes LLMs across multiple bit-widths, enabling the model to adapt to devices with varying computational resources while maintaining performance. We demonstrate the effectiveness of AdaQuantLM through experiments on the Gemma-2b model, highlighting its potential for broad applicability in the efficient deployment of LLMs.
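To make the additive-codeword idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: the group size, number of codebooks, and codebook size are illustrative assumptions. It shows how, in an additive-quantization scheme, a weight group reconstructed as a sum of codewords can be converted to a lower effective bit-width by dropping a codebook's contribution, and restored by adding it back, without any full-precision weights.

```python
import numpy as np

# Illustrative sketch only (hypothetical sizes, not the paper's configuration):
# each group of weights is reconstructed as the sum of one codeword per codebook,
# in the spirit of additive quantization (AQLM). Using fewer codebooks yields a
# lower effective bit-width; using more restores precision.

rng = np.random.default_rng(0)

GROUP_SIZE = 8          # weights per quantized group (assumed)
CODEBOOK_SIZE = 256     # 2**8 entries -> 8 bits per code index (assumed)
NUM_CODEBOOKS = 3       # full precision uses all 3; fewer -> lower bit-width
NUM_GROUPS = 4          # example weight groups

# Learned codebooks: one (CODEBOOK_SIZE x GROUP_SIZE) table per additive level.
codebooks = [rng.normal(size=(CODEBOOK_SIZE, GROUP_SIZE)) for _ in range(NUM_CODEBOOKS)]

# Per-group code indices: one index into each codebook.
codes = rng.integers(0, CODEBOOK_SIZE, size=(NUM_GROUPS, NUM_CODEBOOKS))

def dequantize(codes, codebooks, active_codebooks):
    """Reconstruct weight groups as the sum of the first `active_codebooks` codewords."""
    out = np.zeros((codes.shape[0], codebooks[0].shape[1]))
    for m in range(active_codebooks):
        out += codebooks[m][codes[:, m]]
    return out

# Converting between bit-widths = adding or removing codeword contributions.
w_high = dequantize(codes, codebooks, active_codebooks=3)  # highest precision
w_low = dequantize(codes, codebooks, active_codebooks=2)   # reduced precision
delta = codebooks[2][codes[:, 2]]                          # dropped codewords
assert np.allclose(w_low + delta, w_high)                  # additivity of codewords
```

The key property exploited above is that the higher-precision reconstruction differs from the lower-precision one only by the contribution of the extra codebooks, so switching bit-widths at deployment time amounts to adding or removing those codeword sums.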
Submission Number: 57