Abstract: Multimodal large language models (MLLMs) are increasingly developed to meet diverse deployment needs, varying in scale and computational demand. While recent research has focused on building MLLMs from Small Language Models (SLMs), these efforts remain limited in flexibility and are still data- and compute-intensive. In this paper, we present the first comprehensive study on flexibly compressing existing MLLMs through structural pruning and recovery training in a data-efficient manner, addressing a critical gap in the literature by empirically analyzing best practices for adapting MLLMs to specific hardware or resource limitations. Our study investigates pruning and knowledge distillation techniques, examining their impact on downstream performance across various model compression strategies, including pruning paradigms and recovery training schemes. We further investigate the feasibility of performing recovery training with only a small fraction of the available data. Key findings reveal that widthwise pruning is more effective than layerwise pruning in resource-constrained scenarios. For smaller compression ratios, finetuning the multimodal projector alone can restore most of the performance, while combining finetuning with hidden-state knowledge distillation proves most effective across all compression levels. Notably, we demonstrate efficient model downsizing using as little as 5% of the original dataset for moderate compression, achieving over 95% of the performance obtained with the full dataset. With these best practices, Bunny-v1.0-3B retains over 95% of its original performance and LLaVA-v1.5-7B more than 97%, at compression ratios below 30%.
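To make the described recovery scheme concrete, below is a minimal sketch (not the authors' code) of one recovery-training step that combines the task loss of the pruned model with a hidden-state distillation term from the original model, while finetuning only the multimodal projector. The parameter-name filter "mm_projector", the HuggingFace-style model interface, and the loss weight alpha are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def freeze_all_but_projector(student):
        # Finetune the multimodal projector only; the substring is an
        # assumed parameter name, adjust to the actual model definition.
        for name, param in student.named_parameters():
            param.requires_grad = "mm_projector" in name

    def recovery_step(student, teacher, batch, alpha=1.0):
        # Teacher (uncompressed MLLM) is frozen and only provides targets.
        with torch.no_grad():
            t_out = teacher(**batch, output_hidden_states=True)
        s_out = student(**batch, output_hidden_states=True)

        # Standard next-token task loss on the pruned (student) model.
        task_loss = s_out.loss

        # Hidden-state KD on the final layer; shapes are assumed to match
        # (widthwise pruning may require a learned projection in practice).
        kd_loss = F.mse_loss(s_out.hidden_states[-1], t_out.hidden_states[-1])

        return task_loss + alpha * kd_loss

In this sketch, dropping the kd_loss term recovers the "projector finetuning only" setting reported to suffice at smaller compression ratios.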
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yen-Chang_Hsu1
Submission Number: 4475