Abstract: Pre-trained large vision-language models, especially large language models (LLMs), have shown promising results for language-related tasks such as question answering. Most state-of-the-art video-language models are built from image-language models; videos, however, unlike images, have an additional temporal dimension. Under computing resource constraints, how to efficiently and effectively sample image frames from a video is the main challenge for video-related tasks. With new advances in LLMs, new challenges also emerge for cross-modal tasks, such as how to properly ingest visual information from videos into LLMs and what information to feed them. In this work, we propose an Efficient Video-Language Alignment (VLAP) network that tackles efficient frame sampling and cross-modal alignment together. In our VLAP network, we design a learnable frame prompter module to sample the most important frames and introduce a new cross-modal temporal distillation module to reduce inference computation cost while keeping the temporal information. Meanwhile, we introduce a Text-Visual-Text modeling strategy to best align the visual and language modalities and to leverage the pre-trained LLMs. We show through an ablation study that this modeling strategy creates the best alignment across modalities. Overall, our VLAP network outperforms state-of-the-art methods on video question answering and video captioning benchmarks.
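To make the frame-prompter idea concrete, below is a minimal sketch of a learnable frame-selection module: a small scoring head ranks per-frame features and keeps the top-k frames before they are passed to the language model. The abstract does not specify the actual implementation, so the module structure, dimensions, and class name here are illustrative assumptions, not the paper's method.

```python
# Minimal sketch of a learnable frame prompter (assumption: not the paper's
# exact design). A small MLP scores each frame feature and the k highest-
# scoring frames are kept, preserving their temporal order.
import torch
import torch.nn as nn


class FramePrompter(nn.Module):
    """Scores per-frame features and keeps the k most important frames."""

    def __init__(self, feat_dim: int = 768, num_keep: int = 8):
        super().__init__()
        self.num_keep = num_keep
        # Hypothetical scoring head; the real module may differ.
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.GELU(),
            nn.Linear(feat_dim // 2, 1),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        scores = self.scorer(frame_feats).squeeze(-1)        # (batch, num_frames)
        topk = scores.topk(self.num_keep, dim=1).indices      # (batch, num_keep)
        topk, _ = topk.sort(dim=1)                            # keep temporal order
        idx = topk.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1))
        return frame_feats.gather(1, idx)                     # (batch, num_keep, feat_dim)


if __name__ == "__main__":
    prompter = FramePrompter(feat_dim=768, num_keep=8)
    video = torch.randn(2, 32, 768)   # 2 videos, 32 candidate frames each
    kept = prompter(video)
    print(kept.shape)                 # torch.Size([2, 8, 768])
```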