Keywords: Multimodal Embedding Retrieval; Bayesian Data Reweighting; Retrieval-Augmented Generation
Abstract: Knowledge-based Visual Question Answering (VQA) requires models to retrieve and incorporate external knowledge, e.g., documents, to answer questions. Existing retrievers are typically optimized with standard contrastive learning, which treats all non-positive pairs as equally informative, leading to false-negative bias and difficulties in hard-negative mining. To overcome these issues, we propose \textbf{Bayesian Data Reweighting (BDR)}, a probabilistic framework that assigns learnable importance weights to query-document pairs and performs Bayesian inference over these weights. We derive closed-form posterior updates under conjugate priors and develop an efficient EM algorithm for weight estimation. This approach adaptively emphasizes informative pairs without explicit hard-negative mining. Experiments on two representative multimodal retrievers demonstrate consistent improvements: BDR achieves gains of up to $8.6$ points on individual datasets and an average recall of $68.6$ across all M2KR datasets, surpassing the previous state-of-the-art.
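The abstract's idea of EM-estimated per-pair importance weights can be illustrated with a toy sketch. This is not the paper's derivation (BDR's conjugate priors and closed-form posteriors are specified in the paper, not here); it is a minimal stand-in that models per-pair contrastive losses as a two-component Gaussian mixture ("clean" vs. "noisy" pairs) and uses the posterior responsibility of the clean component as each pair's weight. All names and the mixture model itself are assumptions for illustration.

```python
import numpy as np

def em_pair_weights(losses, n_iters=20):
    """Toy EM over per-pair importance weights (illustrative only, not BDR's
    exact posterior updates). Losses are modeled as a two-component Gaussian
    mixture; the posterior probability of the low-loss ("clean") component
    serves as the pair's weight."""
    losses = np.asarray(losses, dtype=float)
    # Initialize component means at the extremes of the loss range.
    mu = np.array([losses.min(), losses.max()])
    sigma = np.full(2, losses.std() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iters):
        # E-step: responsibility of each component for each pair.
        dens = pi * np.exp(-0.5 * ((losses[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form Gaussian-mixture parameter updates.
        nk = resp.sum(axis=0)
        mu = (resp * losses[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (losses[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(losses)
    # Weight = posterior of the clean (low-loss) component.
    return resp[:, 0]

# Low-loss pairs receive weights near 1; outlier high-loss pairs near 0.
weights = em_pair_weights([0.1, 0.2, 0.15, 2.5, 0.12, 3.0])
```

In a full training loop, such weights would multiply each pair's contrastive loss term, down-weighting likely false negatives without explicit hard-negative mining.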
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2195