UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Multimodal Model, Model Compression, Vision-Language Transformers
TL;DR: We propose UPop, the first multimodal compression approach for vision-language Transformers from the perspective of pruning.
Abstract: Data from the real world contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Meanwhile, researchers have spent much effort on model compression to reduce the huge memory and computational consumption of increasingly large models. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes Unified and Progressive Pruning (UPop), which compresses vision-language Transformers via pruning. UPop incorporates 1) unifiedly searching countless multimodal subnetworks in a continuous optimization space derived from the uncompressed model; and 2) progressively and simultaneously retraining the subnetwork. The subnetworks are learned over multiple components, including the self-attention modules and MLPs in both the vision and language branches, as well as the cross-attention modules. To ease the process of pruning, we design \textit{Unified Pruning} to automatically assign the optimal pruning ratio to each compressible component, instead of manually assigning each component a pruning ratio. To explore the limit of the compression ratio, we propose \textit{Progressive Pruning} to maintain convergence between search and retraining. In addition, UPop enables zero-cost subnetwork selection after searching countless multimodal subnetworks, and the selected subnetwork can be used without any retraining. Experiments on multiple discriminative and generative vision-language tasks demonstrate the versatility of the proposed UPop. For example, on the image captioning task on the COCO dataset, we achieve \textbf{2$\times$} compression and \textbf{1.66$\times$} FLOPs reduction with a \textbf{0.8} SPICE drop, and \textbf{4$\times$} compression and \textbf{2.96$\times$} FLOPs reduction with a \textbf{2.1} SPICE drop.
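The core mechanism described in the abstract can be illustrated with a minimal sketch (the names below, e.g. MaskedLinear and progressive_prune, are hypothetical and not from the paper): each compressible component carries learnable importance scores optimized jointly with the model weights (the "unified" search in a continuous space), and the fraction of neurons actually zeroed out is raised gradually from 0 to the target ratio during the same run (the "progressive" aspect), so the searched subnetwork and the retrained weights stay consistent.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer whose output neurons carry learnable importance scores.
    Scores are optimized jointly with the weights during search; neurons whose
    hard mask has been set to 0 are effectively pruned."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Continuous relaxation of a binary neuron mask (one score per output unit).
        self.score = nn.Parameter(torch.ones(out_features))
        self.register_buffer("hard_mask", torch.ones(out_features))

    def forward(self, x):
        # Soft score scales each neuron during search; pruned neurons contribute nothing.
        return self.linear(x) * self.score * self.hard_mask


def progressive_prune(masked_layers, step, total_steps, target_ratio):
    """Raise the pruned fraction linearly from 0 to `target_ratio` over
    `total_steps`, zeroing out the globally lowest-scoring neurons across all
    layers, so per-layer pruning ratios are assigned automatically."""
    current_ratio = target_ratio * min(1.0, step / total_steps)
    scores = torch.cat([m.score.detach().abs().flatten() for m in masked_layers])
    k = int(current_ratio * scores.numel())
    if k == 0:
        return
    threshold = scores.kthvalue(k).values
    for m in masked_layers:
        m.hard_mask.copy_((m.score.detach().abs() > threshold).float())
```

In this sketch, one would wrap the compressible projections of the self-attention, MLP, and cross-attention modules with such masked layers, add a sparsity penalty on the scores to the task loss, and call progressive_prune periodically while retraining; once the schedule reaches the target ratio, the hard masks define the final subnetwork. This is only an illustration of the unified and progressive ideas, not the paper's exact formulation.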
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning