UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Multimodal Model, Model Compression, Vision-Language Transformers
TL;DR: We propose UPop, the first multimodal compression approach for vision-language Transformers from the perspective of pruning.
Abstract: Data from the real world contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Meanwhile, researchers have spent much effort on model compression to reduce the huge memory and computational consumption of increasingly large models. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes Unified and Progressive Pruning (UPop), which compresses vision-language Transformers via pruning. UPop incorporates 1) unifiedly searching countless multimodal subnetworks in a continuous optimization space derived from the uncompressed model; and 2) progressively and simultaneously retraining the subnetwork. The subnetworks are learned over multiple components, including the self-attention modules and MLPs in both the vision and language branches, as well as the cross-attention modules. To ease the process of pruning, we design \textit{Unified Pruning} to automatically assign the optimal pruning ratio to each compressible component, instead of manually assigning each component a pruning ratio. To explore the limit of the compression ratio, we propose \textit{Progressive Pruning} to maintain convergence between search and retraining. In addition, UPop enables zero-cost subnetwork selection after searching countless multimodal subnetworks, and the selected subnetwork can be used without any retraining. Experiments on multiple discriminative and generative vision-language tasks demonstrate the versatility of the proposed UPop. For example, on the image captioning task on the COCO dataset, we achieve \textbf{2$\times$} compression and \textbf{1.66$\times$} FLOPs reduction with a \textbf{0.8} SPICE drop, and \textbf{4$\times$} compression and \textbf{2.96$\times$} FLOPs reduction with a \textbf{2.1} SPICE drop.
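The core mechanism described in the abstract can be illustrated with a minimal sketch (the names below, e.g. MaskedLinear and progressive_prune, are hypothetical and not from the paper): each compressible component carries learnable importance scores optimized jointly with the model weights (the "unified" search in a continuous space), and the fraction of neurons actually zeroed out is raised gradually from 0 to the target ratio during the same run (the "progressive" aspect), so the searched subnetwork and the retrained weights stay consistent.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer whose output neurons carry learnable importance scores.
    Scores are optimized jointly with the weights during search; neurons whose
    hard mask has been set to 0 are effectively pruned."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Continuous relaxation of a binary neuron mask (one score per output unit).
        self.score = nn.Parameter(torch.ones(out_features))
        self.register_buffer("hard_mask", torch.ones(out_features))

    def forward(self, x):
        # Soft score scales each neuron during search; pruned neurons contribute nothing.
        return self.linear(x) * self.score * self.hard_mask


def progressive_prune(masked_layers, step, total_steps, target_ratio):
    """Raise the pruned fraction linearly from 0 to `target_ratio` over
    `total_steps`, zeroing out the globally lowest-scoring neurons across all
    layers, so per-layer pruning ratios are assigned automatically."""
    current_ratio = target_ratio * min(1.0, step / total_steps)
    scores = torch.cat([m.score.detach().abs().flatten() for m in masked_layers])
    k = int(current_ratio * scores.numel())
    if k == 0:
        return
    threshold = scores.kthvalue(k).values
    for m in masked_layers:
        m.hard_mask.copy_((m.score.detach().abs() > threshold).float())
```

In this sketch, one would wrap the compressible projections of the self-attention, MLP, and cross-attention modules with such masked layers, add a sparsity penalty on the scores to the task loss, and call progressive_prune periodically while retraining; once the schedule reaches the target ratio, the hard masks define the final subnetwork. This is only an illustration of the unified and progressive ideas, not the paper's exact formulation.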
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning