AdaptIP: Transferring Cross-modal Information to Temporal Modeling for Video-Language Representation Learning

24 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: video-language representation learning, transfer learning
Abstract: Driven by the data-centric AI paradigm, pre-trained image-text representations exhibit a nuanced alignment of visual and textual concepts. Since images can be regarded as a subset of videos, recent work has focused on transferring pre-trained image-text representations to the video-language domain, attracting widespread attention. Nevertheless, these efforts rely on training strategies such as full fine-tuning or post-pretraining, which are not necessarily optimal for transferring general pre-trained representations. In this paper, we resort to the increasingly popular parameter-efficient transfer learning (PETL) paradigm and propose AdaptIP to adapt the pre-trained CLIP model to video-language representation learning. AdaptIP devises a hierarchical cross-modal adaptation approach, focusing on intra-modal temporal modeling and inter-modal fine-grained alignment in the video-language domain. Additionally, the pre-trained CLIP backbone is frozen to preserve its pre-trained prior and keep training efficient. Comprehensive experiments on video-text retrieval, video question answering, and video captioning benchmarks highlight the versatility, superiority, and efficiency of AdaptIP. Code will be available soon.
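
To make the general pattern concrete, below is a minimal PyTorch sketch of the PETL setup the abstract describes: a frozen per-frame backbone (standing in for CLIP's image encoder) with a small trainable adapter performing temporal modeling over frame features. The adapter design here (bottleneck projection plus temporal self-attention, in classes named TemporalAdapter and AdapterVideoEncoder) is purely illustrative and is not the paper's actual AdaptIP architecture, which is not specified in this abstract.

import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Hypothetical lightweight trainable module over frozen frame features."""

    def __init__(self, dim: int, bottleneck: int = 64, num_heads: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-projection to bottleneck
        self.attn = nn.MultiheadAttention(bottleneck, num_heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)     # up-projection back to model dim
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) per-frame embeddings
        x = self.down(self.norm(frame_feats))
        x, _ = self.attn(x, x, x)                # temporal self-attention across frames
        return frame_feats + self.up(x)          # residual connection

class AdapterVideoEncoder(nn.Module):
    """Frozen per-frame encoder plus a trainable temporal adapter."""

    def __init__(self, frame_encoder: nn.Module, dim: int):
        super().__init__()
        self.frame_encoder = frame_encoder
        for p in self.frame_encoder.parameters():  # freeze the backbone, as in PETL
            p.requires_grad = False
        self.adapter = TemporalAdapter(dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, num_frames, channels, height, width)
        b, t = video.shape[:2]
        with torch.no_grad():                    # backbone stays frozen
            feats = self.frame_encoder(video.flatten(0, 1))  # (b*t, dim)
        feats = feats.view(b, t, -1)
        feats = self.adapter(feats)              # only the adapter is trained
        return feats.mean(dim=1)                 # pooled video-level embedding

Under this setup, only the adapter's parameters would be passed to the optimizer, which is what makes the transfer parameter-efficient relative to full fine-tuning.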
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9460