Boosting Multimodal Retrieval-Augmented Generation for Knowledge-Based VQA with One-pass Ladder Reranking

ACL ARR 2026 January Submission1307 Authors

29 Dec 2025 (modified: 20 Mar 2026)
Venue: ACL ARR 2026 January Submission
License: CC BY 4.0
Keywords: Knowledge-Based VQA, Multimodal Retrieval-Augmented Generation
Abstract: Evidence selection remains a major bottleneck in multimodal retrieval-augmented generation (RAG) for knowledge-based visual question answering (VQA). Current rerankers typically score candidates in isolation or employ single-round selection, failing to model the inherently comparative nature of evidence ranking. As a result, they struggle with hard negatives: candidates that are either visual near-duplicates of the query but textually irrelevant, or textually plausible yet visually inconsistent with the image. To address these problems, we propose the Multimodal One-pass Ladder Tournament (MOLT), which reformulates reranking as a sequential ladder-style tournament. Instead of assigning absolute ranking scores, MOLT progressively filters distractors through explicit multimodal pairwise comparisons in a single decoding pass. To ensure robust learning, we introduce a two-stage training strategy: (1) supervised fine-tuning (SFT) initialized via distillation from a strong teacher model, followed by (2) reinforcement learning using Group Relative Policy Optimization (GRPO) with a composite reward that jointly optimizes output format compliance, step-wise logical consistency, and final selection accuracy. Experiments on two widely used benchmarks show that MOLT achieves state-of-the-art performance, outperforming the compared methods by up to 7.3 percentage points. The code is available at https://anonymous.4open.science/r/molt.
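Illustrative sketch (not the authors' implementation): the abstract describes reranking as a sequential ladder-style tournament built from pairwise comparisons. One plausible reading is that a running champion is compared against each remaining candidate in turn, with the winner carried forward. In the Python sketch below, `ladder_rerank`, `toy_prefer`, and the `visual`/`text` scores are hypothetical names and values introduced only for illustration; in MOLT the comparisons are reportedly produced by a multimodal model within a single decoding pass, which this toy comparator does not attempt to reproduce.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

C = TypeVar("C")


def ladder_rerank(
    candidates: Sequence[C],
    prefer: Callable[[C, C], bool],
) -> Tuple[C, List[Tuple[C, C, C]]]:
    """Run a ladder tournament: the current champion meets each challenger once.

    `prefer(challenger, champion)` returns True when the challenger should
    replace the champion. Returns the final champion and a trace of
    (champion, challenger, winner) triples, one per comparison.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    champion = candidates[0]
    trace: List[Tuple[C, C, C]] = []
    for challenger in candidates[1:]:
        winner = challenger if prefer(challenger, champion) else champion
        trace.append((champion, challenger, winner))
        champion = winner
    return champion, trace


def toy_prefer(challenger: dict, champion: dict) -> bool:
    # Hypothetical stand-in for a multimodal pairwise judgment: a candidate
    # must be consistent in BOTH modalities, so compare the weaker score.
    return min(challenger["visual"], challenger["text"]) > min(
        champion["visual"], champion["text"]
    )


# Toy candidates mirroring the two hard-negative types named in the abstract.
candidates = [
    {"id": "near_duplicate_image", "visual": 0.9, "text": 0.1},
    {"id": "plausible_text", "visual": 0.2, "text": 0.8},
    {"id": "relevant", "visual": 0.7, "text": 0.7},
]

best, trace = ladder_rerank(candidates, toy_prefer)
print(best["id"])
```

With n candidates the ladder needs n - 1 comparisons, and each step's decision is recorded in `trace`, which is the kind of step-wise structure a consistency check could inspect.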
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: multimodal QA, knowledge base QA, reasoning
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1307