MLLM-Pruner: Efficient Activation-aware Pruning for Multimodal LLMs

12 Sept 2025 (modified: 18 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Post-training Pruning, Activation-aware Pruning, MLLM
TL;DR: Efficient Activation-aware MLLM Pruning.
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive performance across a wide range of vision-language tasks. However, the increasing scale of these models incurs substantial deployment costs. Post-training pruning has emerged as an effective compression technique to address this challenge. Recent pruning studies on large language models (LLMs) have shown that activation-aware pruning strategies, which combine weight magnitude with the $\ell_2$-norm of input activations, can achieve superior performance. Nevertheless, directly applying these approaches to MLLMs often leads to substantial performance degradation. This is because the $\ell_2$-norm assumes all activations contribute equally, whereas in MLLMs, visual and textual tokens exhibit divergent activation patterns. Moreover, the text-only calibration datasets used in LLM pruning are inadequate for capturing modality-specific dependencies, which further limits their ability to evaluate weight importance. In this paper, we propose MLLM-Pruner, a novel activation-aware pruning framework tailored specifically for MLLMs. To address these issues, MLLM-Pruner introduces two key innovations: (1) we construct a representative multimodal calibration dataset comprising general-domain text, instruction-tuning data, and visual instruction-tuning data to comprehensively preserve the language generation, instruction-following, and visual reasoning abilities of MLLMs; (2) we design a modality-sensitive importance estimation metric that leverages the singular value decomposition (SVD) of attention distributions to reweight the input activations, effectively capturing activation contributions across modalities and reducing pruning error. MLLM-Pruner does not rely on expensive iterative reconstruction or retraining. Extensive experiments on LLaVA-based MLLMs across diverse benchmarks demonstrate that MLLM-Pruner consistently outperforms state-of-the-art pruning methods while maintaining efficient compression. Our code, model weights, and multimodal calibration dataset will be made publicly available upon publication.
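For concreteness, below is a minimal PyTorch sketch of the activation-aware baseline metric the abstract refers to, the $|W_{ij}| \cdot \|X_j\|_2$ score combining weight magnitude with the $\ell_2$-norm of input activations (as popularized by Wanda), together with a hypothetical per-token reweighting hook showing where a modality-sensitive weighting could enter. The function and variable names (`wanda_importance`, `reweighted_importance`, `token_weights`) are illustrative, not from the paper, and the SVD-based attention reweighting of MLLM-Pruner itself is not reproduced here.

```python
import torch

def wanda_importance(weight: torch.Tensor, acts: torch.Tensor) -> torch.Tensor:
    """Activation-aware importance: S_ij = |W_ij| * ||X_j||_2.

    weight: (out_features, in_features) layer weight matrix.
    acts:   (num_tokens, in_features) calibration input activations.
    """
    # Per-input-channel l2-norm over all calibration tokens.
    act_norm = acts.norm(p=2, dim=0)             # (in_features,)
    return weight.abs() * act_norm.unsqueeze(0)  # (out_features, in_features)

def reweighted_importance(weight: torch.Tensor, acts: torch.Tensor,
                          token_weights: torch.Tensor) -> torch.Tensor:
    """Hypothetical modality-aware variant: scale each token's activation
    before taking the channel norm, so visual and textual tokens can
    contribute unequally. MLLM-Pruner derives such weights from an SVD of
    attention distributions; here token_weights is just a given vector.
    """
    weighted_acts = acts * token_weights.unsqueeze(1)  # (num_tokens, in_features)
    act_norm = weighted_acts.norm(p=2, dim=0)
    return weight.abs() * act_norm.unsqueeze(0)

def prune_by_importance(weight: torch.Tensor, importance: torch.Tensor,
                        sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the lowest-importance weights in each output row."""
    k = int(weight.shape[1] * sparsity)
    # Indices of the k least important weights per row.
    _, idx = importance.topk(k, dim=1, largest=False)
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)
    return weight * mask

if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(8, 16)   # toy linear layer
    X = torch.randn(32, 16)  # 32 calibration tokens
    S = wanda_importance(W, X)
    W_pruned = prune_by_importance(W, S, sparsity=0.5)
    print((W_pruned == 0).float().mean())  # ~0.5 sparsity
```

Note that pruning happens in a single pass over the calibration activations, which is what allows such methods, and by extension MLLM-Pruner, to avoid iterative reconstruction or retraining.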
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4314