Adaptively Building a Video-language Model for Video Captioning and Retrieval without Massive Video Pretraining
Abstract: Large-scale pretrained image-language models have recently shown remarkable performance. However, building a video-language model is more challenging due to the complexity of video and the difficulty of collecting high-quality video data. This paper builds a video-language model in an adaptive manner: it transfers knowledge from the image domain and achieves state-of-the-art performance without any further massive video pretraining. The main contributions are a Visual Perception Adapter, which seamlessly and efficiently adapts a pretrained image-language model to the video domain, and a fine-grained contrastive learning scheme with Inter-modal Token Alignment, which bridges the semantic gaps between vision, audio, and language with less data. The proposed model is evaluated on video captioning and video retrieval. Experiments demonstrate that it is competitive with models pretrained on millions of video-text pairs. Notably, its CIDEr and R@1 scores on the MSR-VTT dataset exceed the existing state of the art by 6.3% and 1.3%, respectively.
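Since the abstract describes the two components only at a high level, the following is a minimal, hypothetical PyTorch sketch of how a bottleneck-style adapter over frozen frame features and a token-level contrastive alignment loss could look. The adapter design (bottleneck MLP with a temporal convolution), the max/mean token aggregation, and all names (VisualPerceptionAdapter, inter_modal_token_alignment_loss, temperature) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualPerceptionAdapter(nn.Module):
    """Hypothetical lightweight adapter adding temporal context to frozen image features."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame-level features from a frozen image encoder
        h = F.gelu(self.down(x))                               # (B, T, bottleneck)
        h = self.temporal(h.transpose(1, 2)).transpose(1, 2)   # mix information across frames
        return x + self.up(h)                                  # residual keeps pretrained features intact


def inter_modal_token_alignment_loss(vision_tokens, text_tokens, temperature=0.07):
    """Fine-grained contrastive loss: each text token is matched to its most similar
    visual token, and the aggregated score is contrasted across the batch (InfoNCE-style)."""
    v = F.normalize(vision_tokens, dim=-1)                     # (B, Tv, D)
    t = F.normalize(text_tokens, dim=-1)                       # (B, Tt, D)
    # Pairwise token similarities for every (video, text) pair in the batch.
    sim = torch.einsum("bvd,ctd->bcvt", v, t)                  # (B, B, Tv, Tt)
    # Max over visual tokens, mean over text tokens -> one score per (video, text) pair.
    scores = sim.max(dim=2).values.mean(dim=2) / temperature   # (B, B)
    labels = torch.arange(scores.size(0), device=scores.device)
    return 0.5 * (F.cross_entropy(scores, labels) + F.cross_entropy(scores.t(), labels))


if __name__ == "__main__":
    B, T, L, D = 4, 8, 16, 512
    frames = torch.randn(B, T, D)   # frozen image-encoder features per frame
    words = torch.randn(B, L, D)    # text-encoder token features
    adapter = VisualPerceptionAdapter(D)
    loss = inter_modal_token_alignment_loss(adapter(frames), words)
    print(loss.item())
```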
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our research focuses on obtaining a better video-language model at a lower training cost, aligning with the conference's "Multimedia Content Understanding" theme.
Supplementary Material: zip
Submission Number: 1870