Abstract: The computational cost of vision-and-language pretrained models (VL-PTMs) limits their deployment on resource-constrained devices that require low latency. One existing solution is to apply the early exiting (EE) strategy to accelerate inference: the model is forced to make a prediction using only the first few transformer layers. However, these early layers behave differently from the final classifier, which inevitably degrades performance. To counter this limitation, self-distillation is commonly introduced to enhance the representation ability of the EE classifiers. Yet this leaves a semantic gap, since the EE classifiers are trained to mimic the outputs of the final classifier directly, without access to modality-specific behaviors. This study proposes a multimodality self-distillation method for the fast inference of VL-PTMs. To fill the semantic gap between modalities, we split the multimodal input into its separate modalities and add them as extra inputs, encouraging effective distillation of each modality. Furthermore, a mean squared error (MSE) loss is introduced to minimize the distance between feature maps and further enhance the representation ability of the EE classifiers. Experiments show that the proposed method outperforms previous EE strategies at the same inference time and performs competitively even when the model exits very early.
DOI: 10.1109/TMM.2024.3384060
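As a rough illustration of the objective described in the abstract (not the authors' implementation), the following PyTorch sketch combines a soft-label distillation term on the logits of an early-exit classifier with an MSE term that aligns its feature maps with those of the final layer; in the proposed method such a loss would be applied per modality as well as on the joint input. All names (ee_logits, final_logits, ee_feat, final_feat, temperature, alpha) are hypothetical.

```python
import torch
import torch.nn.functional as F

def ee_distillation_loss(ee_logits, final_logits, ee_feat, final_feat,
                         temperature: float = 2.0, alpha: float = 0.5):
    """Distill the final classifier into an early-exit (EE) classifier."""
    # Soft-label distillation: the EE classifier mimics the final classifier's
    # output distribution, softened by the temperature.
    kd = F.kl_div(
        F.log_softmax(ee_logits / temperature, dim=-1),
        F.softmax(final_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Feature-map alignment: MSE pulls the EE layer's representation toward
    # the final layer's, strengthening the EE classifier's representation ability.
    feat_mse = F.mse_loss(ee_feat, final_feat)
    return alpha * kd + (1.0 - alpha) * feat_mse
```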