The Devil is in the Word: Video-Conditioned Text Representation Refinement for Text-to-Video Retrieval
Keywords: Text-to-Video Retrieval, Video-conditioned Text Representation Enhancement
Abstract: Pre-trained vision-language models (VLMs) such as CLIP have shown remarkable success in text-to-video retrieval thanks to their strong vision-language representations learned from large-scale paired image-text data. However, compared to a video, text is often short and concise, making it difficult to fully capture the rich and redundant semantics present in a video spanning thousands of frames. Recent advances have focused on using text features to extract key information from these redundant video frames. However, text representations generated without considering video information can be biased and lack the expressiveness needed to capture the key words that could improve retrieval performance. In this study, we first conduct preliminary experiments demonstrating the importance of enhancing text representations; these experiments reveal that text representations generated from text input alone often misinterpret critical information. To address this, we propose a simple yet efficient method, VICTER (video-conditioned text representation refinement), which enriches text representations through a versatile module. Specifically, we introduce a video abstraction module that extracts representative features from multiple video frames, followed by a video-conditioned text enhancement module that refines the original text features by reassessing individual word features and extracting key words using the generated video features. Empirical evidence shows that VICTER not only effectively captures the relevant key words in the input text but also complements various existing frameworks. Our experimental results demonstrate consistent improvements of VICTER over several baseline frameworks (0.4%–1.0% gains in R@1). Furthermore, VICTER achieves state-of-the-art performance on three benchmark datasets: MSRVTT, DiDeMo, and LSMDC. Code will be made available.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3584