Abstract: Multimodal Retrieval-Augmented Generation (MM-RAG) has emerged as a promising approach for enhancing the reliability and factuality of large vision-language models (LVLMs).
Because end-to-end loss backpropagation is infeasible due to non-differentiable operations in the forward process, current methods primarily focus on component-level optimization, which necessitates extensive component-specific training data and suffers from a gap between local and global optimization objectives.
In this paper, we propose a new paradigm that backpropagates global rewards from the system output to each component and then transforms these rewards into component-specific local losses, enabling each component to perform gradient descent and thereby achieving end-to-end optimization.
Specifically, we first insert two lightweight multimodal components, a query translator and an adaptive reranker, to address the heterogeneity of multimodal knowledge and the varying knowledge demands of different questions, and then tune only these inserted components with our proposed paradigm to integrate the entire system.
Our method achieves SOTA performance on multiple knowledge-intensive multimodal benchmarks with high training efficiency, relying exclusively on supervision signals from an external reward model. Experimental results and our detailed analysis of how the components evolve during training collectively reveal the advantages and considerable potential of this paradigm as a promising direction for MM-RAG research.
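As a rough illustration of the reward-to-local-loss idea described above (not the authors' implementation), the sketch below converts a single scalar reward from a stubbed external reward model into two local losses: a REINFORCE-style loss for a hypothetical reranker and a similarity surrogate for a hypothetical query translator. All class names, dimensions, and loss forms here are assumptions made for the example.

```python
# Minimal sketch: one global reward from an external reward model is turned
# into local losses for two lightweight components, so each can run ordinary
# gradient descent even though the retrieval step between them is
# non-differentiable. Names and loss forms are hypothetical placeholders.
import torch
import torch.nn as nn


class QueryTranslator(nn.Module):
    """Maps a fused image+text query embedding into the retriever's space."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, query_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(query_emb)


class AdaptiveReranker(nn.Module):
    """Scores retrieved passages; scores define a sampling distribution."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, passage_embs: torch.Tensor) -> torch.Tensor:
        return self.score(passage_embs).squeeze(-1)  # shape: (num_passages,)


def external_reward(answer: str) -> float:
    """Stub for the external reward model judging the final system output."""
    return 1.0 if "correct" in answer else 0.0


translator = QueryTranslator()
reranker = AdaptiveReranker()
optimizer = torch.optim.Adam(
    list(translator.parameters()) + list(reranker.parameters()), lr=1e-4
)

# One illustrative training step on dummy tensors.
query_emb = torch.randn(256)        # fused multimodal query embedding
passage_embs = torch.randn(8, 256)  # embeddings of retrieved passages

translated = translator(query_emb)                    # component 1 acts
scores = reranker(passage_embs)                       # component 2 acts
dist = torch.distributions.Categorical(logits=scores)
picked = dist.sample()                                # non-differentiable choice

# The frozen generator would produce the final answer; stubbed here.
answer = "a correct answer"
reward = external_reward(answer)                      # single global signal

# Reward backpropagation: the shared global reward weights a local loss
# for each trainable component.
rerank_loss = -reward * dist.log_prob(picked)
translate_loss = -reward * torch.cosine_similarity(
    translated, passage_embs[picked], dim=0
)
loss = rerank_loss + translate_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```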
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: large vision language models, retrieval augmented generation, multimodality
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 828