Abstract: Multimodal Retrieval-Augmented Generation (MM-RAG) has emerged as a promising approach for enhancing the reliability and factuality of large vision-language models (LVLMs).
Because end-to-end loss backpropagation is infeasible due to non-differentiable operations in the forward process, current methods primarily focus on component-level optimization, which necessitates extensive component-specific training data and suffers from a gap between local and global optimization objectives.
In this paper, we propose a new paradigm that backpropagates global rewards from the system output to each component and then transforms these rewards into component-specific local losses, enabling each component to perform gradient descent and thereby achieving end-to-end optimization.
Specifically, we first insert two lightweight multimodal components, a query translator and an adaptive reranker, to address the heterogeneity of multimodal knowledge and the varying knowledge demands of different questions, and then tune only these inserted components with our proposed paradigm to integrate the entire system.
Our method achieves SOTA performance on multiple knowledge-intensive multimodal benchmarks with high training efficiency, relying exclusively on supervision signals from an external reward model. Experimental results and our detailed analysis of how the components evolve during training collectively reveal the advantages and considerable potential of this paradigm as a promising direction for MM-RAG research.
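As a rough illustration of the reward-to-local-loss idea described above (not the authors' implementation), the sketch below converts a single scalar reward from a stubbed external reward model into two local losses: a REINFORCE-style loss for a hypothetical reranker and a similarity surrogate for a hypothetical query translator. All class names, dimensions, and loss forms here are assumptions made for the example.

```python
# Minimal sketch: one global reward from an external reward model is turned
# into local losses for two lightweight components, so each can run ordinary
# gradient descent even though the retrieval step between them is
# non-differentiable. Names and loss forms are hypothetical placeholders.
import torch
import torch.nn as nn


class QueryTranslator(nn.Module):
    """Maps a fused image+text query embedding into the retriever's space."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, query_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(query_emb)


class AdaptiveReranker(nn.Module):
    """Scores retrieved passages; scores define a sampling distribution."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, passage_embs: torch.Tensor) -> torch.Tensor:
        return self.score(passage_embs).squeeze(-1)  # shape: (num_passages,)


def external_reward(answer: str) -> float:
    """Stub for the external reward model judging the final system output."""
    return 1.0 if "correct" in answer else 0.0


translator = QueryTranslator()
reranker = AdaptiveReranker()
optimizer = torch.optim.Adam(
    list(translator.parameters()) + list(reranker.parameters()), lr=1e-4
)

# One illustrative training step on dummy tensors.
query_emb = torch.randn(256)        # fused multimodal query embedding
passage_embs = torch.randn(8, 256)  # embeddings of retrieved passages

translated = translator(query_emb)                    # component 1 acts
scores = reranker(passage_embs)                       # component 2 acts
dist = torch.distributions.Categorical(logits=scores)
picked = dist.sample()                                # non-differentiable choice

# The frozen generator would produce the final answer; stubbed here.
answer = "a correct answer"
reward = external_reward(answer)                      # single global signal

# Reward backpropagation: the shared global reward weights a local loss
# for each trainable component.
rerank_loss = -reward * dist.log_prob(picked)
translate_loss = -reward * torch.cosine_similarity(
    translated, passage_embs[picked], dim=0
)
loss = rerank_loss + translate_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```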
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: large vision language models, retrieval augmented generation, multimodality
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 828