EASIER: Relevance-Boosted Captioning and Structural Information Extraction for Zero-Shot Video-Text Retrieval

ACL ARR 2024 June Submission 3774 Authors

16 Jun 2024 (modified: 02 Jul 2024), ACL ARR 2024 June Submission, Readers: Everyone, License: CC BY 4.0
Abstract: Recent progress in video-text retrieval (VTR) has largely been driven by supervised representation learning. In this paper, we present EASIER, a novel zero-shot VTR framework that retrieves videos and text using off-the-shelf captioning methods, large language models (LLMs), and text retrieval methods. Specifically, we first map videos into captions and then perform retrieval between video captions and text using text retrieval methods, without any model training or fine-tuning. However, due to the limited power of captioning methods, the captions often miss important content in the video, resulting in unsatisfactory retrieval performance. To translate more information into video captions, we design a novel relevance-boosted caption generation method that uses LLMs to bring extra relevant details into the captions. Moreover, to emphasize key information and reduce the noise introduced by imagination, we propose structural information extraction: we extract key visual tokens from captions and design different templates for structuring these tokens, further boosting retrieval performance. Benefiting from the enriched captions and structured information, EASIER, without using any training data, outperforms existing fine-tuned and pretraining methods in extensive experiments on several video-text retrieval benchmarks. A comprehensive study with both human and automatic evaluations shows that the enriched captions capture the key details while introducing little noise. Code and data will be released.
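The caption-then-retrieve idea in the abstract can be illustrated with a minimal sketch. This is not the EASIER implementation: the abstract does not say which captioner, LLM, or text retrieval method is used, so the sketch assumes captions are already available as plain strings and swaps in a simple TF-IDF cosine ranking as the text retriever. The video ids, captions, and function names (`tokenize`, `build_index`, `retrieve`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(captions):
    """Compute per-video term counts and smoothed IDF weights over all captions."""
    docs = {vid: Counter(tokenize(cap)) for vid, cap in captions.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for term_counts in docs.values():
        doc_freq.update(term_counts.keys())
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    return docs, idf


def vectorize(term_counts, idf):
    """Turn term counts into a sparse TF-IDF vector; unseen terms are dropped."""
    return {t: c * idf[t] for t, c in term_counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, captions, top_k=1):
    """Rank videos by similarity between the text query and each video's caption."""
    docs, idf = build_index(captions)
    q_vec = vectorize(Counter(tokenize(query)), idf)
    scored = [(cosine(q_vec, vectorize(tc, idf)), vid) for vid, tc in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by id
    return [vid for _, vid in scored[:top_k]]


# Hypothetical captions standing in for the output of an off-the-shelf captioner.
example_captions = {
    "v1": "a man is cooking pasta in a kitchen",
    "v2": "a dog runs on the beach",
    "v3": "two people play guitar on stage",
}
print(retrieve("dog running along a beach", example_captions))
```

No model is trained or fine-tuned here: retrieval quality depends entirely on how much of the video content the captions capture, which is the limitation the relevance-boosted captioning and structural information extraction steps are meant to address.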
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, NLP Applications
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3774