Keywords: Video understanding, video pre-trained model, vision-language model, collaborative learning
Abstract: Leveraging video pre-trained models has led to significant advancements in video understanding tasks. However, due to the inherent bias towards temporal learning in video pre-training, these models fail to capture comprehensive spatial cues. Additionally, the widely used supervised adaptation methods lack fine-grained semantic guidance, since single action labels cannot precisely depict intra-class diversity. To address these challenges, we incorporate the general capabilities of large Vision-Language Models (VLMs) and propose a cross-modal collaborative knowledge transfer method to enhance video understanding. First, we propose an attentive spatial knowledge transfer method that distills spatial knowledge from the VLM's image encoder, enabling the precise capture of spatial information. Second, we design a contrastive textual knowledge transfer approach that achieves detailed video representations through fine-grained text-video alignment. Owing to this cross-modal knowledge transfer, the video representations attend to informative spatial regions and align with fine-grained texts that carry rich semantics. Extensive experiments demonstrate that our method achieves state-of-the-art performance across various datasets, validating its effectiveness.
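The two transfer objectives sketched in the abstract can be illustrated with a minimal toy example: an MSE-style distillation term pulling the video model's spatial features toward a frozen VLM image encoder, plus a symmetric InfoNCE loss aligning video and text embeddings. This is a hedged sketch with random toy features, not the paper's implementation; the dimensions, the temperature, the equal loss weighting, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
B, D = 4, 16  # batch of 4 videos, 16-dim embeddings

# Spatial knowledge transfer (sketch): the video model's spatial features
# are regressed toward the frozen VLM image encoder's features.
student_spatial = rng.normal(size=(B, D))
teacher_spatial = rng.normal(size=(B, D))
spatial_loss = np.mean((student_spatial - teacher_spatial) ** 2)

# Contrastive textual knowledge transfer (sketch): align each video with
# its fine-grained text description via a symmetric InfoNCE objective.
def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

video = l2norm(rng.normal(size=(B, D)))
text = l2norm(rng.normal(size=(B, D)))
tau = 0.07                               # assumed temperature
logits = video @ text.T / tau            # (B, B) similarity matrix
labels = np.arange(B)                    # matched pairs lie on the diagonal

def cross_entropy(logits, labels):
    # Numerically stable log-softmax cross-entropy.
    shifted = logits - logits.max(axis=1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

contrastive_loss = 0.5 * (cross_entropy(logits, labels)
                          + cross_entropy(logits.T, labels))

# Equal weighting is an assumption for illustration only.
total_loss = spatial_loss + contrastive_loss
print(total_loss)
```

In practice the teacher features would come from a frozen VLM image encoder applied per frame, and the text side from fine-grained captions rather than single action labels, as the abstract describes.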
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 120