Multi-Modal Inductive Framework for Text-Video Retrieval

Published: 20 Jul 2024, Last Modified: 01 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Text-video retrieval (TVR) identifies relevant videos given textual queries. Existing methods are limited in their ability to understand and connect different modalities, which makes retrieval more difficult. Amid the rapid evolution of Large Language Models (LLMs), we propose a generation-based TVR paradigm facilitated by LLM distillation to better learn and capture deep retrieval knowledge. Specifically, we first design a fine-tuned large vision-language model that leverages knowledge learned from language models to enhance the alignment of semantic information between the text and video modalities. The model also incorporates an inductive reasoning mechanism that injects important temporal and spatial features into the video embeddings. We further design question prompt clustering to select the most important prompts according to their contribution to retrieval performance. Experimental results show that our approach outperforms competing methods on two benchmark datasets.
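To make the prompt-selection step concrete, below is a minimal, hypothetical sketch of question prompt clustering: candidate prompts are embedded, clustered, and the prompt nearest each cluster centroid is kept as a representative. The encoder, cluster count, and function names are illustrative assumptions, not the configuration used in the paper.

```python
# Hypothetical sketch of question prompt clustering: embed candidate prompts,
# cluster them, and keep the prompt closest to each centroid as a representative.
# The encoder choice and number of clusters are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer


def select_representative_prompts(prompts, n_clusters=8):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed text encoder
    embeddings = encoder.encode(prompts)               # shape: (num_prompts, dim)

    kmeans = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
    labels = kmeans.fit_predict(embeddings)

    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        # keep the prompt whose embedding lies closest to the cluster centroid
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        selected.append(prompts[members[np.argmin(dists)]])
    return selected
```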
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This work contributes to multimedia/multimodal processing by introducing a generation-based Text-Video Retrieval (TVR) paradigm based on Large Language Model (LLM) distillation. The proposed Multi-Modal Inductive framework, termed MMI-TVR, leverages knowledge learned from language models to enhance the alignment of semantic information between the text and video modalities. By incorporating an inductive reasoning mechanism, the framework injects significant temporal and spatial features into the video embeddings. This not only bridges the gap between text and video data but also enables the model to better comprehend the context and interactions within videos. Furthermore, the fine-tuned large vision-language model designed in this framework uses knowledge distilled from language models to understand and connect the nuances and contexts of both modalities. The text-video knowledge distillation component generates fine-grained multi-modal knowledge that is both relevant and informative, facilitating improved retrieval and problem-solving. Additionally, question prompt clustering selects the most pertinent prompts, further enhancing retrieval performance.
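The following is a minimal PyTorch sketch of how a text-video alignment objective could be combined with a distillation term: a symmetric InfoNCE loss aligns text and video embeddings, while a KL term matches the student's similarity distribution to a teacher signal (e.g., an LLM-derived similarity matrix). All function names, temperatures, and the loss weighting are assumptions for illustration; they are not the paper's exact losses.

```python
# Sketch of text-video alignment with a distillation term (illustrative only).
import torch
import torch.nn.functional as F


def alignment_with_distillation(text_emb, video_emb, teacher_sim,
                                tau=0.07, tau_t=0.1, alpha=0.5):
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    sim = text_emb @ video_emb.t() / tau                     # (B, B) similarity logits

    targets = torch.arange(sim.size(0), device=sim.device)
    contrastive = 0.5 * (F.cross_entropy(sim, targets) +     # text -> video
                         F.cross_entropy(sim.t(), targets))  # video -> text

    # Distillation: match the student's similarity distribution to the teacher's.
    kd = F.kl_div(F.log_softmax(sim, dim=-1),
                  F.softmax(teacher_sim / tau_t, dim=-1),
                  reduction="batchmean")
    return contrastive + alpha * kd
```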
Supplementary Material: zip
Submission Number: 2259