Adaptively Building a Video-language Model for Video Captioning and Retrieval without Massive Video Pretraining
Abstract: Large-scale pretrained image-language models have recently shown remarkable performance. However, building a video-language model is more challenging due to the complexity of video and the difficulty of collecting high-quality video data. This paper builds a video-language model in an adaptive manner: it transfers knowledge from the image domain and achieves state-of-the-art performance without any further massive video pretraining. The main contributions are a Visual Perception Adapter, which seamlessly and efficiently adapts a pretrained image-language model to the video domain, and a fine-grained contrastive learning scheme with Inter-modal Token Alignment, which bridges the semantic gaps between vision, audio, and language with less data. The proposed model is evaluated on video captioning and video retrieval. Experiments demonstrate that it is competitive with models pretrained on millions of video-text pairs. Notably, its CIDEr and R@1 scores on the MSR-VTT dataset exceed the existing state of the art by 6.3% and 1.3%, respectively.
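Since the abstract describes the two components only at a high level, the following is a minimal, hypothetical PyTorch sketch of how a bottleneck-style adapter over frozen frame features and a token-level contrastive alignment loss could look. The adapter design (bottleneck MLP with a temporal convolution), the max/mean token aggregation, and all names (VisualPerceptionAdapter, inter_modal_token_alignment_loss, temperature) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualPerceptionAdapter(nn.Module):
    """Hypothetical lightweight adapter adding temporal context to frozen image features."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame-level features from a frozen image encoder
        h = F.gelu(self.down(x))                               # (B, T, bottleneck)
        h = self.temporal(h.transpose(1, 2)).transpose(1, 2)   # mix information across frames
        return x + self.up(h)                                  # residual keeps pretrained features intact


def inter_modal_token_alignment_loss(vision_tokens, text_tokens, temperature=0.07):
    """Fine-grained contrastive loss: each text token is matched to its most similar
    visual token, and the aggregated score is contrasted across the batch (InfoNCE-style)."""
    v = F.normalize(vision_tokens, dim=-1)                     # (B, Tv, D)
    t = F.normalize(text_tokens, dim=-1)                       # (B, Tt, D)
    # Pairwise token similarities for every (video, text) pair in the batch.
    sim = torch.einsum("bvd,ctd->bcvt", v, t)                  # (B, B, Tv, Tt)
    # Max over visual tokens, mean over text tokens -> one score per (video, text) pair.
    scores = sim.max(dim=2).values.mean(dim=2) / temperature   # (B, B)
    labels = torch.arange(scores.size(0), device=scores.device)
    return 0.5 * (F.cross_entropy(scores, labels) + F.cross_entropy(scores.t(), labels))


if __name__ == "__main__":
    B, T, L, D = 4, 8, 16, 512
    frames = torch.randn(B, T, D)   # frozen image-encoder features per frame
    words = torch.randn(B, L, D)    # text-encoder token features
    adapter = VisualPerceptionAdapter(D)
    loss = inter_modal_token_alignment_loss(adapter(frames), words)
    print(loss.item())
```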
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our research focuses on obtaining a better video-language model at a lower training cost, aligning with the conference's "Multimedia Content Understanding" theme.
Supplementary Material: zip
Submission Number: 1870