DIANE: Zero-Shot Video Retrieval via Index Time Alignment and Enrichment

ACL ARR 2025 February Submission 6141 Authors

16 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: While recent progress in video retrieval has been driven by supervised representation learning, which can be regarded as a training-time alignment strategy, in this paper we focus on index-time alignment: transforming videos into text to bridge the representation gap between video and query. However, naively generating captions from videos is suboptimal, as such captions often miss crucial details and nuances. In this work, we go a step further and explore an index-time enrichment strategy, enhancing the text representation of each video with diverse information. Specifically, we design a novel relevance-boosted caption generation method that brings additional relevant details into video captions by using LLMs. To emphasize key information, we also extract key visual tokens from captions and videos. Moreover, to highlight the unique characteristics of each video, we propose a distinctiveness analysis method that infuses these key features into the text representation. Benefiting from these methods, extensive experiments on several video retrieval benchmarks demonstrate the superiority of DIANE over existing fine-tuned and pretraining-based methods without using any training data. A comprehensive study with both human and automatic evaluations shows that the enriched captions capture key details while introducing little noise. Code and data will be released.
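As a rough illustration of the index-time alignment and enrichment idea the abstract describes, the Python sketch below indexes LLM-enriched captions and retrieves videos by plain text-to-text matching. It is a minimal sketch under stated assumptions, not the authors' implementation: the `llm` callable, the enrichment prompt, and the bag-of-words scorer are placeholders for whatever components the paper actually uses.

```python
# Minimal sketch of index-time alignment + enrichment (assumed components,
# not the paper's actual pipeline): each video is represented as text, the
# caption is enriched by an LLM, and retrieval is text-to-text similarity.
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class VideoDoc:
    video_id: str
    caption: str        # base caption from some video captioner (assumed)
    enriched: str = ""  # caption after LLM enrichment


def enrich(doc: VideoDoc, llm) -> VideoDoc:
    """Relevance-boosted caption generation: `llm` is any text-to-text
    callable; this prompt is hypothetical, the abstract does not give one."""
    prompt = f"Expand this video caption with key visual details: {doc.caption}"
    doc.enriched = llm(prompt)
    return doc


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, index: list[VideoDoc]) -> list[tuple[str, float]]:
    """Rank videos by similarity between the query and the enriched caption;
    a real system would use a stronger text retriever than bag-of-words."""
    q = Counter(tokenize(query))
    scored = [
        (d.video_id, cosine(q, Counter(tokenize(d.enriched or d.caption))))
        for d in index
    ]
    return sorted(scored, key=lambda x: -x[1])


# Example usage with a trivial stand-in LLM:
#   index = [enrich(VideoDoc("v1", "a dog runs"), lambda p: p), ...]
#   retrieve("dog running on a beach", index)
```

Because all enrichment happens at index time, query-time retrieval in this sketch is a cheap text match, which is the practical appeal of the approach over training-time alignment.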
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, NLP Applications
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 6141