Keywords: large vision-language model, model quantization, salience-driven optimization, hierarchical search
Abstract: In this paper, we propose an extreme sparse-coding quantization framework for 2-bit large vision-language models (LVLMs) to enable efficient multimodal reasoning. Conventional codebook-based quantization methods assign the same number of codewords to all weights, ignoring the large variance in weight salience, which leads to substantial discretization errors. In contrast, we flexibly assign an optimal codeword combination to each weight based on its salience, mitigating performance degradation with negligible complexity overhead. Specifically, we first select the number of codewords for each weight based on a salience evaluation that uses second-order information. We then propose hierarchical codeword selection to efficiently search for appropriate codeword combinations in the extremely large codebook and obtain an optimal sparse representation. The high-level candidate search selects representative codeword subsets with minimal quantization error, within which the low-level subset refinement discovers the optimal fine-grained codeword combination for each weight. Finally, we optimize the visual encoder to concentrate the weight salience distribution, which reduces computational overhead because fewer codewords are required for the aggregated salient weights. Experimental results demonstrate that our method achieves a 5.58× reduction in model size while outperforming state-of-the-art quantization methods by a margin of 2.78 on the 13B LLaVA model, a notable improvement obtained at comparable computational cost for LVLM quantization.
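The sketch below is a minimal, illustrative rendering of the two ideas summarized in the abstract: salience-driven codeword allocation from second-order information, and a two-level (high-level candidate search, low-level subset refinement) codeword selection. It is not the paper's implementation; the scalar per-weight formulation, the function names, the salience criterion, and all thresholds and budgets are assumptions chosen for clarity.

```python
import numpy as np


def second_order_salience(weights, hessian_diag):
    # Assumed salience criterion: second-order sensitivity w_i^2 * H_ii
    # (an OBS/GPTQ-style approximation; the paper's exact metric may differ).
    return (weights ** 2) * hessian_diag


def allocate_codewords(salience, budgets=(1, 2, 3), quantiles=(0.5, 0.9)):
    # Assign more codewords to more salient weights.
    # Illustrative thresholds: bottom 50% -> 1 codeword, next 40% -> 2, top 10% -> 3.
    thresholds = np.quantile(salience, quantiles)
    counts = np.full(salience.shape, budgets[0], dtype=int)
    counts[salience > thresholds[0]] = budgets[1]
    counts[salience > thresholds[1]] = budgets[2]
    return counts


def hierarchical_select(w, codebook, n_codewords, top_k=8):
    # High-level candidate search: keep only the top_k codewords nearest to w,
    # pruning the large codebook to a representative subset.
    subset = codebook[np.argsort(np.abs(codebook - w))[:top_k]]
    # Low-level subset refinement: greedy (matching-pursuit-style) selection of
    # n_codewords entries from the subset whose sum approximates w.
    residual, chosen = w, []
    for _ in range(n_codewords):
        best = subset[np.argmin(np.abs(subset - residual))]
        chosen.append(best)
        residual -= best
    return np.array(chosen), w - residual  # selected codewords, reconstructed value


# Toy usage on random data (stand-ins for real weights and Hessian diagonals).
rng = np.random.default_rng(0)
w = rng.normal(size=1024)
hessian_diag = rng.uniform(0.1, 1.0, size=1024)
codebook = np.linspace(-1.0, 1.0, 16)

counts = allocate_codewords(second_order_salience(w, hessian_diag))
codes, w_hat = hierarchical_select(w[0], codebook, counts[0])
```

In this toy version, the per-weight codeword count is what realizes the variable-rate allocation, and the two-stage search keeps the per-weight cost proportional to the small candidate subset rather than the full codebook, mirroring the complexity argument in the abstract.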
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14941