From Bulk to Budget: Best Practices To Compress Multimodal Large Language Models

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Multimodal large language models, model pruning, knowledge distillation, model compression
TL;DR: Best practices for MLLM compression
Abstract: Multimodal large language models (MLLMs) are increasingly developed to meet diverse deployment needs, varying in scale and computational demand. While recent research has focused on building MLLMs from Small Language Models (SLMs), these efforts remain limited in flexibility and are still data- and compute-intensive. In this paper, we present the first comprehensive study on flexibly compressing and recovering existing MLLMs in a data-efficient manner. In doing so, we address a critical gap in the literature by empirically analyzing best practices for adapting MLLMs to specific hardware or resource limitations. Our study investigates pruning and knowledge distillation techniques, examining their impact on downstream performance across various model compression strategies, including pruning paradigms, recovery training schemes, and data requirements. Key findings reveal that widthwise pruning is particularly effective in resource-constrained scenarios. For smaller compression ratios, finetuning the multimodal projector alone can restore most performance, while combining finetuning with hidden-state knowledge distillation proves most effective across all compression levels. Notably, we demonstrate efficient model downsizing using as little as 5% of the original dataset for moderate compression. Our analysis distills best practices for compressing MLLMs for resource-efficient deployment. Following these practices, Bunny-v1.0-3B retains over 95% of its original performance and LLaVA-v1.5-7B more than 97%, at compression ratios below 30%.
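To make the recovery scheme mentioned in the abstract concrete, the sketch below shows one possible training step that combines standard finetuning with hidden-state knowledge distillation from the uncompressed teacher. This is not the authors' implementation: the names (`student`, `teacher`, `proj`, `recovery_step`), the Hugging Face-style model outputs, and the loss weighting are all assumptions made for illustration.

```python
# Hedged sketch of recovery training: finetuning + hidden-state KD.
# Assumes `student` is the pruned MLLM, `teacher` is the frozen original,
# and `proj` is a small linear layer mapping the pruned (narrower) hidden
# size back to the teacher's hidden size. Model outputs are assumed to
# follow the Hugging Face convention (`.loss`, `.hidden_states`).
import torch
import torch.nn.functional as F


def hidden_state_kd_loss(student_hiddens, teacher_hiddens, proj):
    """Mean MSE between projected student hidden states and teacher hidden states."""
    loss = 0.0
    for h_s, h_t in zip(student_hiddens, teacher_hiddens):
        loss = loss + F.mse_loss(proj(h_s), h_t)
    return loss / len(student_hiddens)


def recovery_step(student, teacher, proj, batch, optimizer, alpha=1.0):
    """One recovery-training step: LM finetuning loss plus hidden-state KD."""
    student.train()
    teacher.eval()

    # Batch is assumed to contain `labels`, so `.loss` is the LM cross-entropy.
    out_s = student(**batch, output_hidden_states=True)
    with torch.no_grad():
        out_t = teacher(**batch, output_hidden_states=True)

    lm_loss = out_s.loss
    kd_loss = hidden_state_kd_loss(out_s.hidden_states, out_t.hidden_states, proj)

    loss = lm_loss + alpha * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, widthwise pruning is assumed to preserve the layer count while shrinking the hidden dimension, which is why a single projection suffices to align student and teacher hidden states layer by layer; for projector-only finetuning at small compression ratios, one would simply freeze all student parameters except the multimodal projector before calling the step.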
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6400