CLIPPING: Distilling CLIP-based Models for Video-Language Understanding

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission
Keywords: Knowledge Distillation, Vision-Language Understanding, Model Compression
Abstract: Pre-training a vision-language model and then fine-tuning it on downstream tasks has become a popular paradigm. However, pre-trained vision-language models built on the Transformer architecture usually have a large number of parameters and long inference times. Knowledge distillation is an efficient technique for transferring the capability of a large model to a small one while maintaining accuracy, and it has achieved remarkable success in natural language processing. However, collecting the data needed for pre-training-stage knowledge distillation requires enormous manual effort in multi-modality applications. In this paper, we propose a novel knowledge distillation method, named CLIPPING, in which the plentiful knowledge of a large teacher model that has been fine-tuned for video-language tasks on top of the powerful pre-trained CLIP is effectively transferred to a small student at the fine-tuning stage only. In particular, a new layer-wise alignment is proposed for distilling the intermediate layers of the Transformer into the CNN in CLIPPING, which enables the student model to absorb the knowledge of the teacher well. In addition, we present an effective cross-modality knowledge distillation, which includes both the knowledge of the global video-caption distributions from the teacher model and the knowledge of the local video-caption distributions from the pre-training model (CLIP). Finally, CLIPPING with MobileViT-v2 as the vision encoder, and without any vision-language pre-training, achieves 91.5%-95.3% of the performance of its teacher on three video-language retrieval benchmarks, with a vision encoder that is 19.5x smaller. CLIPPING also significantly outperforms a state-of-the-art small baseline (ALL-in-one-B) on the MSR-VTT dataset, obtaining a relative 7.4% performance gain with 29% fewer parameters and 86.9% fewer FLOPs. Moreover, CLIPPING is comparable or even superior to many large pre-training models.
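The abstract does not spell out how the video-caption distribution knowledge is matched between teacher and student. A common way to realize this kind of cross-modality distillation is to align softmax-normalized video-caption similarity matrices of teacher and student with a KL divergence; the sketch below illustrates that idea only. The function names, the temperature value, and the symmetric two-direction loss are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def similarity_distribution(video_emb, text_emb, temperature=0.05):
    """Row-wise softmax over cosine similarities within a video/caption batch."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    return F.softmax(logits, dim=-1)

def cross_modality_kd_loss(student_v, student_t, teacher_v, teacher_t, temperature=0.05):
    """KL divergence between teacher and student video->caption distributions,
    averaged with the symmetric caption->video direction (illustrative sketch)."""
    p_teacher_v2t = similarity_distribution(teacher_v, teacher_t, temperature)
    p_student_v2t = similarity_distribution(student_v, student_t, temperature)
    loss_v2t = F.kl_div(p_student_v2t.log(), p_teacher_v2t, reduction="batchmean")

    p_teacher_t2v = similarity_distribution(teacher_t, teacher_v, temperature)
    p_student_t2v = similarity_distribution(student_t, student_v, temperature)
    loss_t2v = F.kl_div(p_student_t2v.log(), p_teacher_t2v, reduction="batchmean")
    return (loss_v2t + loss_t2v) / 2
```

Under this reading, the "global" term would use the fine-tuned teacher's embeddings as the target distribution and the "local" term would use frozen CLIP embeddings in the same way; how the two terms are weighted is left to the paper.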
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
TL;DR: In this paper, we propose a novel knowledge distillation method that is specially designed for small vision-language models.
Supplementary Material: zip
