CrossTVR: Multi-Grained Re-Ranker for Text Video Retrieval with Frozen Image Encoders

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Text-Video Retrieval, CLIP, Frozen, Multimodal
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: State-of-the-art text-video retrieval (TVR) methods commonly use CLIP and cosine similarity for efficient retrieval. Meanwhile, cross attention methods, which employ a transformer decoder to compute attention between text queries and video frames, offer a more comprehensive interaction between multimodal information. Complementary to the existing one-stage text-video retrieval approaches above, we propose a re-ranker called CrossTVR that further explores the fine-grained and comprehensive interaction between text and all the vision tokens of a given video at the frame level and the video (clips or segments) level. Furthermore, we employ the frozen CLIP model strategy for fine-grained retrieval, enabling scalability to larger pre-trained vision models like ViT-G and resulting in further improved retrieval performance. This yields a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with a cosine similarity network to efficiently obtain text-video candidate pairs. In the second stage, the proposed re-ranker is applied for fine-grained retrieval. Experimental results on text-video retrieval datasets demonstrate the effectiveness and scalability of the proposed re-ranker when combined with existing mainstream one-stage text-video retrieval approaches.
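As a rough illustration of the two-stage pipeline sketched in the abstract, the following toy example performs coarse retrieval by cosine similarity and then re-ranks the top candidates with a cross-attention scorer. All module names, dimensions, and data here are hypothetical stand-ins; this is a minimal sketch of the general idea, not the actual CrossTVR re-ranker, its frozen CLIP backbone, or its multi-grained attention.

```python
# Hypothetical sketch of a two-stage text-video retrieval pipeline:
# stage 1 = cosine-similarity candidate selection, stage 2 = cross-attention re-ranking.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, K = 512, 5  # embedding dim and number of stage-1 candidates (illustrative values)

class CrossAttnReRanker(nn.Module):
    """Toy stand-in for a cross-attention re-ranker: a text query attends to
    all vision tokens of one candidate video and produces a matching score."""
    def __init__(self, dim=D, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, text_emb, vision_tokens):
        # text_emb: (1, D); vision_tokens: (num_tokens, D) for one video
        q = text_emb.unsqueeze(0)                 # (1, 1, D)
        kv = vision_tokens.unsqueeze(0)           # (1, num_tokens, D)
        attended, _ = self.cross_attn(q, kv, kv)  # (1, 1, D)
        return self.score(attended).squeeze()     # scalar matching score

# Toy gallery: 100 videos, each with 32 (frozen) vision tokens, plus pooled embeddings.
gallery_tokens = torch.randn(100, 32, D)
gallery_pooled = F.normalize(gallery_tokens.mean(dim=1), dim=-1)
text_emb = F.normalize(torch.randn(1, D), dim=-1)

# Stage 1: efficient coarse retrieval by cosine similarity -> top-K candidate videos.
coarse_scores = text_emb @ gallery_pooled.T              # (1, 100)
topk = coarse_scores.topk(K, dim=-1).indices.squeeze(0)  # (K,)

# Stage 2: fine-grained re-ranking of the K candidates via cross attention
# over all vision tokens of each candidate.
reranker = CrossAttnReRanker()
fine_scores = torch.stack([reranker(text_emb, gallery_tokens[i]) for i in topk])
best = topk[fine_scores.argmax()]
print(f"stage-1 candidates: {topk.tolist()}, re-ranked best: {best.item()}")
```

Because the expensive cross-attention scoring is restricted to the K stage-1 candidates rather than the full gallery, the re-ranker adds fine-grained interaction at a small incremental cost, which is the design motivation the abstract describes.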
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4495