Abstract: Pre-trained large vision-language models, especially large language models (LLMs), have shown promising results for language-related tasks such as question answering. Most state-of-the-art video-language models are built from image-language models; videos, however, unlike images, have an additional temporal dimension. Under computing resource constraints, how to efficiently and effectively sample image frames from a video is the main challenge for video-related tasks. With new advances in LLMs, new challenges also emerge for cross-modal tasks, such as how to properly ingest visual information from videos into LLMs and what information to feed them. In this work, we propose an Efficient Video-Language Alignment (VLAP) network that tackles efficient frame sampling and cross-modal alignment together. In our VLAP network, we design a learnable frame prompter module to sample the most important frames and introduce a new cross-modal temporal distillation module to reduce inference computation cost while keeping the temporal information. Meanwhile, we introduce a Text-Visual-Text modeling strategy to best align the visual and language modalities and to leverage the pre-trained LLMs. We show through an ablation study that this modeling strategy creates the best alignment across modalities. Overall, our VLAP network outperforms state-of-the-art methods on video question answering and video captioning benchmarks.
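To make the frame-prompter idea concrete, below is a minimal sketch of a learnable frame-selection module: a small scoring head ranks per-frame features and keeps the top-k frames before they are passed to the language model. The abstract does not specify the actual implementation, so the module structure, dimensions, and class name here are illustrative assumptions, not the paper's method.

```python
# Minimal sketch of a learnable frame prompter (assumption: not the paper's
# exact design). A small MLP scores each frame feature and the k highest-
# scoring frames are kept, preserving their temporal order.
import torch
import torch.nn as nn


class FramePrompter(nn.Module):
    """Scores per-frame features and keeps the k most important frames."""

    def __init__(self, feat_dim: int = 768, num_keep: int = 8):
        super().__init__()
        self.num_keep = num_keep
        # Hypothetical scoring head; the real module may differ.
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.GELU(),
            nn.Linear(feat_dim // 2, 1),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        scores = self.scorer(frame_feats).squeeze(-1)        # (batch, num_frames)
        topk = scores.topk(self.num_keep, dim=1).indices      # (batch, num_keep)
        topk, _ = topk.sort(dim=1)                            # keep temporal order
        idx = topk.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1))
        return frame_feats.gather(1, idx)                     # (batch, num_keep, feat_dim)


if __name__ == "__main__":
    prompter = FramePrompter(feat_dim=768, num_keep=8)
    video = torch.randn(2, 32, 768)   # 2 videos, 32 candidate frames each
    kept = prompter(video)
    print(kept.shape)                 # torch.Size([2, 8, 768])
```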