Adaptively Building a Video-language Model for Video Captioning and Retrieval without Massive Video Pretraining

Published: 20 Jul 2024, Last Modified: 05 Aug 2024 · MM2024 Poster · CC BY 4.0
Abstract: Large-scale pretrained image-language models have recently shown remarkable performance. However, building a video-language model is more challenging due to the complexity of video and the difficulty of collecting high-quality video-text data. This paper builds a video-language model in an adaptive manner that transfers knowledge from the image domain and achieves state-of-the-art performance without any further massive video pretraining. The main contributions are a Visual Perception Adapter, which seamlessly and efficiently adapts a pretrained image-language model to the video domain, and a fine-grained contrastive learning scheme with Inter-modal Token Alignment, which bridges the semantic gaps between vision, audio, and language with less data. The proposed model is evaluated on video captioning and retrieval. Experiments demonstrate that it performs competitively with models pretrained on millions of video-text pairs. Notably, its CIDEr and R@1 scores on the MSR-VTT dataset exceed the existing state of the art by 6.3% and 1.3%, respectively.
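To make the two contributions more concrete, below is a minimal PyTorch-style sketch of what an adapter-based video transfer and a token-level inter-modal contrastive loss could look like. All module names, shapes, and hyper-parameters here (VisualPerceptionAdapter, inter_modal_token_alignment, the bottleneck size, the max-over-video/mean-over-text pooling) are assumptions for illustration; the abstract does not specify the paper's actual architecture or loss.

# Illustrative sketch only -- names, shapes, and design choices are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualPerceptionAdapter(nn.Module):
    """Lightweight bottleneck adapter that adds temporal modelling on top of
    frame features from a frozen image-language backbone (assumed design)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # Depth-wise temporal convolution mixes information across frames.
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_size=3,
                                  padding=1, groups=bottleneck)
        self.up = nn.Linear(dim=bottleneck, out_features=dim) if False else nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) per-frame features from the frozen encoder
        b, t, n, d = x.shape
        h = self.down(x)                                  # (b, t, n, bottleneck)
        h = h.permute(0, 2, 3, 1).reshape(b * n, -1, t)   # fold tokens, convolve over time
        h = self.temporal(h)
        h = h.reshape(b, n, -1, t).permute(0, 3, 1, 2)    # back to (b, t, n, bottleneck)
        return x + self.up(h)                             # residual connection keeps the backbone frozen

def inter_modal_token_alignment(video_tokens, text_tokens, temperature: float = 0.07):
    """Fine-grained contrastive loss sketch: each text token is matched to its most
    similar video token, and pooled scores are contrasted across the batch."""
    v = F.normalize(video_tokens, dim=-1)   # (B, Nv, D)
    t = F.normalize(text_tokens, dim=-1)    # (B, Nt, D)
    # Token-level similarity for every (video, text) pair in the batch.
    sim = torch.einsum('bnd,cmd->bcnm', v, t)                  # (B, B, Nv, Nt)
    # Max over video tokens, mean over text tokens -> pairwise alignment score.
    scores = sim.max(dim=2).values.mean(dim=2) / temperature   # (B, B)
    labels = torch.arange(scores.size(0), device=scores.device)
    return 0.5 * (F.cross_entropy(scores, labels) +
                  F.cross_entropy(scores.t(), labels))

The same token-alignment idea would extend to audio-text and audio-video pairs by reusing inter_modal_token_alignment with the corresponding token sequences; again, this is an assumption based on the abstract's description rather than the paper's documented formulation.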
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our research focuses on building a better vision-language model at a lower training cost, aligning with the conference theme of "Multimedia Content Understanding".
Supplementary Material: zip
Submission Number: 1870