Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: This paper presents the first study to explore the potential of parameter quantization for multimodal large language models (MLLMs) as a way to alleviate the significant resource constraints encountered during vision-language (VL) instruction tuning. We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. The method is grounded in two key innovations: (1) learning group-wise scale factors for the quantized LLM weights, which mitigates the quantization error arising from activation outliers and enables more effective VL instruction tuning; and (2) a multimodal warmup that progressively integrates linguistic and multimodal training samples, preventing the quantized model from overfitting to multimodal data while ensuring stable adaptation to downstream VL tasks. Extensive experiments demonstrate that models quantized by QSLAW perform on par with, or even surpass, their full-precision counterparts, while reducing VL tuning time and GPU consumption by up to 1.4 times.
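The group-wise scale learning in (1) can be pictured as attaching a small set of trainable dequantization scales to an otherwise frozen, quantized linear layer. The PyTorch sketch below is only an illustrative approximation under an assumed symmetric group-wise quantizer; the class name QuantLinearWithScale, the group size, and the multiplicative-correction formulation are assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class QuantLinearWithScale(nn.Module):
    """Illustrative sketch (not the paper's code): a frozen group-wise-quantized
    linear layer whose per-group dequantization scales are the only trainable
    parameters."""

    def __init__(self, weight: torch.Tensor, group_size: int = 128, n_bits: int = 4):
        super().__init__()
        out_f, in_f = weight.shape
        assert in_f % group_size == 0
        qmax = 2 ** (n_bits - 1) - 1

        # Group-wise symmetric quantization of the pretrained weight (kept frozen).
        w_groups = weight.reshape(out_f, in_f // group_size, group_size)
        base_scale = (w_groups.abs().amax(dim=-1, keepdim=True) / qmax).clamp_min(1e-8)
        q = torch.clamp(torch.round(w_groups / base_scale), -qmax - 1, qmax)
        self.register_buffer("q_weight", q.to(torch.int8))   # frozen integer weights
        self.register_buffer("base_scale", base_scale)        # frozen base scales

        # Learnable multiplicative correction per group: the "scale learning" part.
        self.scale_factor = nn.Parameter(torch.ones_like(base_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize with the learned group-wise scales, then apply the linear map.
        w = self.q_weight.float() * (self.base_scale * self.scale_factor)
        return x @ w.reshape(w.shape[0], -1).t()


if __name__ == "__main__":
    layer = QuantLinearWithScale(torch.randn(256, 512), group_size=128, n_bits=4)
    y = layer(torch.randn(2, 512))
    trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
    print(y.shape, trainable)  # torch.Size([2, 256]) ['scale_factor']
```

Because only scale_factor receives gradients in this sketch, the number of trainable parameters grows with the number of weight groups rather than with the full weight matrix, which is what makes this style of quantized adaptation light on memory.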
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Large language models (LLMs) exhibit formidable comprehension capabilities, prompting researchers to develop multimodal large language models (MLLMs) through vision-language (VL) instruction tuning. This approach enables LLMs to execute multimodal tasks in an autoregressive fashion. One notable example is LLaVA, which fully harnesses the power of a pre-trained LLM, thereby significantly enhancing its visual comprehension capabilities. However, current VL instruction tuning for MLLMs incurs considerable computation and memory overhead. For instance, LLaVA-13B fully fine-tunes the entire LLM during VL instruction tuning, often requiring hundreds of GPU hours, which poses great challenges to the rapid adaptation of LLMs for cross-modal tasks. Our paper explores the potential of parameter quantization for MLLMs and proposes Quantization-aware Scale LeArning based on multimodal Warmup, termed QSLAW, aiming to alleviate the extensive training demands of VL instruction tuning while preserving the original performance. MLLMs quantized by QSLAW perform on par with, or even surpass, their full-precision counterparts, while reducing VL tuning time and GPU consumption.
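The multimodal warmup described above amounts to a data-mixing schedule that gradually shifts training batches from linguistic toward multimodal samples early in tuning. Below is a hypothetical sketch of one such schedule; the linear ramp, the function name multimodal_warmup_batch, and its parameters are illustrative assumptions rather than the paper's exact recipe.

```python
import random
from typing import Any, Sequence


def multimodal_warmup_batch(
    text_pool: Sequence[Any],
    vl_pool: Sequence[Any],
    step: int,
    warmup_steps: int = 1000,
    batch_size: int = 8,
) -> list:
    """Hypothetical warmup sampler: the fraction of vision-language (VL) samples
    per batch ramps linearly from 0 to 1 over `warmup_steps`, with the remainder
    drawn from text-only data."""
    vl_ratio = min(1.0, step / max(1, warmup_steps))
    n_vl = round(batch_size * vl_ratio)
    batch = random.sample(vl_pool, n_vl) + random.sample(text_pool, batch_size - n_vl)
    random.shuffle(batch)
    return batch


if __name__ == "__main__":
    text_data = [f"text_{i}" for i in range(100)]
    vl_data = [f"vl_{i}" for i in range(100)]
    for step in (0, 500, 1000):
        batch = multimodal_warmup_batch(text_data, vl_data, step)
        print(step, sum(s.startswith("vl") for s in batch), "VL samples of", len(batch))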
Supplementary Material: zip
Submission Number: 578