Bidirectional Generative Retrieval with Multi-Modal LLMs for Text-Video Retrieval

25 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Video Retrieval, Multi-modal Large Language Model
Abstract: In recent years, multi-modal large language models (MLLMs) have shown remarkable progress on a variety of multi-modal understanding tasks by leveraging the powerful knowledge of large language models (LLMs). Extending MLLMs to text-video retrieval enables handling complex multi-modal queries beyond the simple uni-modal queries of traditional search engines. It also offers a new opportunity to incorporate search into a unified conversational system, yet MLLM-based text-video retrieval remains underexplored in the literature. To this end, we investigate MLLMs' capabilities in text-video retrieval as a generation task, namely, generative retrieval, in two directions. An intuitive direction is $\textit{content generation}$, which directly generates the content given a query. The other direction is $\textit{query generation}$, which generates the query given the content. Interestingly, we observe that in both text-to-video and video-to-text retrieval, query generation suffers less from bias and significantly outperforms content generation. In this paper, we propose a novel framework, Bidirectional Text-Video Generative Retrieval (BGR), that handles both text-to-video and video-to-text retrieval by measuring relevance with both generation directions. Our framework trains MLLMs by simultaneously optimizing two objectives, $\textit{i.e.}$, video-grounded text generation (VTG) and text-grounded video feature generation (TVG). At inference, our framework ensembles the predictions of the two generation directions. We also introduce Prior Normalization, a simple plug-and-play module, to further alleviate the $\textit{prior bias}$ induced by the likelihood of uni-modal content data, which often overwhelms the relevance between query and content.
Our extensive experiments on multi-modal benchmarks demonstrate that BGR and Prior Normalization are effective in alleviating the prior bias, especially the text prior bias from LLMs' pretrained knowledge in MLLMs, achieving state-of-the-art performance.
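The abstract's scoring idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the log-likelihood values, the mixing weights `alpha`/`beta`, and the exact combination rule are all hypothetical assumptions; it only shows how ensembling two generation directions while subtracting an unconditional content prior can change a ranking.

```python
import math

# Hypothetical per-candidate log-likelihoods from an MLLM (illustrative
# numbers, NOT from the paper): for each video v and a fixed text query t,
#   vtg[v]   ~ log p(t | v)  (video-grounded text generation, i.e. query generation)
#   tvg[v]   ~ log p(v | t)  (text-grounded video feature generation)
#   prior[v] ~ log p(v)      (uni-modal content prior)
vtg = {"vid_a": -12.3, "vid_b": -11.8, "vid_c": -14.0}
tvg = {"vid_a": -20.1, "vid_b": -22.5, "vid_c": -21.0}
prior = {"vid_a": -18.0, "vid_b": -23.0, "vid_c": -19.5}

def prior_normalized_score(v, alpha=1.0, beta=0.5):
    """Ensemble both generation directions and subtract part of the content
    prior so that generically likely videos do not dominate the ranking.
    alpha and beta are assumed mixing weights, not values from the paper."""
    return vtg[v] + alpha * (tvg[v] - beta * prior[v])

# Rank candidates by the prior-normalized bidirectional score.
ranked = sorted(vtg, key=prior_normalized_score, reverse=True)
print(ranked)  # vid_b's low prior boosts it above vid_a here
```

With these toy numbers, subtracting the prior promotes `vid_b`, whose unconditional likelihood is lowest, illustrating how a prior-normalization step counteracts content-popularity bias.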
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4090