Simultaneously Training and Compressing Vision-and-Language Pre-Training Model

Qiaosong Qi, Aixi Zhang, Yue Liao, Wenyu Sun, Yongliang Wang, Xiaobo Li, Si Liu

Published: 2023, Last Modified: 08 Apr 2025. IEEE Trans. Multim. 2023. CC BY-SA 4.0

Abstract: Model compression is an essential step toward the practical application of large-scale pre-training models and their deployment on edge devices. However, applying conventional compression methods that follow the two-phase ‘pre-training then compressing’ pipeline to Vision-and-Language Pre-training (VLP) models incurs high computation and memory overhead. In this work, we break the two-phase pipeline and propose an efficient and effective one-phase VLP model compression mechanism, named REDUCER, which stands for ‘simultaneously training and compREssing’ VLP model via progressive moDUle replaCing and nEtwork Rewiring. Specifically, REDUCER consists of three designs. First, we design a one-phase compression framework that trains and compresses the VLP model simultaneously, avoiding the extra computation and memory cost of the isolated compression phase in the conventional two-phase pipeline. Second, we propose an adaptive progressive module replacing mechanism that compresses the model depth without explicit knowledge distillation losses, relieving the multi-task optimization problem. Third, we integrate pruning techniques into VLP model compression to compress the model in width and depth simultaneously. Overall, we obtain a lightweight VLP model with only one pre-training phase, and REDUCER is the first one-phase compression method for VLP models. Extensive experiments on representative VLP models, i.e., ClipBERT and VICTOR, show a superior trade-off between performance and efficiency.
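
To make the idea of progressive module replacing concrete, the following is a minimal sketch and not the authors' implementation. It assumes that groups of original layers are swapped at random for compact replacement modules, with a swap probability that grows during training until only the compact modules remain. The names `replace_prob` and `mixed_forward`, the linear schedule, and the toy layers are all hypothetical; the abstract does not specify REDUCER's adaptive rule, and pruning is not shown.

```python
import random


def replace_prob(step, total_steps, p0=0.3):
    """Hypothetical linear schedule: the replacement probability rises
    from p0 at step 0 to 1.0 at total_steps (an assumption, not the
    paper's adaptive rule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p0 + (1.0 - p0) * frac


def mixed_forward(x, original_blocks, compact_modules, p, rng):
    """Pass x through the network. Each group of original layers is
    replaced by its compact module with probability p; otherwise the
    original layers in that group run in order."""
    for block, compact in zip(original_blocks, compact_modules):
        if rng.random() < p:
            x = compact(x)
        else:
            for layer in block:
                x = layer(x)
    return x


# Toy stand-ins for real layers: four original layers in two groups,
# and one compact module per group (assumed for illustration only).
def add_one(v):
    return v + 1


def add_two(v):
    return v + 2


original_blocks = [[add_one, add_one], [add_one, add_one]]
compact_modules = [add_two, add_two]

if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 500, 1000):
        p = replace_prob(step, 1000)
        out = mixed_forward(0, original_blocks, compact_modules, p, rng)
        print(step, round(p, 2), out)
```

In this sketch the compact modules are trained only through the ordinary task loss of the mixed network, which is one way to read "free from explicit knowledge distillation losses". Once the probability reaches 1.0, the forward pass uses only the compact modules, so the original layers can be discarded.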