Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking

ACL ARR 2026 January Submission6906 Authors

06 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | License: CC BY 4.0
Keywords: Multi-modal Retrieval-Augmented Generation, Re-ranking, Reinforcement Learning
Abstract: Multi-modal retrieval-augmented generation (MM-RAG) depends on re-ranking to surface the most relevant evidence for image-question queries. We propose Region-R1, which treats query-side region cropping as a decision problem during re-ranking, allowing the system to retain the full image or focus on a question-relevant region before scoring candidates with a fixed vision-language encoder. The policy is trained with reinforcement learning using designed rewards. Across two challenging benchmarks, Region-R1 delivers consistent gains in top-heavy ranking, improving top-1 retrieval over prior re-rankers and increasing conditional Recall@1 by up to 20% in relative terms. These results show that query-side adaptation is a simple way to strengthen MM-RAG re-ranking.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: cross-modal information extraction, multimodality, cross-modal application
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 6906
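
Illustrative sketch (not the authors' implementation): the abstract describes a policy that either keeps the full query image or crops a question-relevant region, after which candidates are scored with a fixed encoder. The minimal Python sketch below shows only that inference-time control flow. Every name in it (`crop`, `rerank`, `toy_encode_query`, the `(left, top, right, bottom)` box format, and the two toy policies) is a hypothetical stand-in. The real system would use a learned policy and a vision-language encoder, and the reinforcement-learning training and reward design are not shown because the abstract does not specify them.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical types: a crop box and a toy grayscale image (rows of pixels).
Box = Tuple[int, int, int, int]  # (left, top, right, bottom)
Image = List[List[float]]
Policy = Callable[[Image, str], Optional[Box]]
Encoder = Callable[[Image, str], List[float]]


def crop(image: Image, box: Optional[Box]) -> Image:
    """Return the full image when box is None, otherwise the boxed region."""
    if box is None:
        return image
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(image: Image, question: str, candidates: Sequence[Sequence[float]],
           policy: Policy, encode_query: Encoder) -> List[int]:
    """Apply the crop decision, embed the query, and sort candidates by score."""
    box = policy(image, question)            # None means "keep the full image"
    query_vec = encode_query(crop(image, box), question)
    scores = [cosine(query_vec, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def toy_encode_query(image: Image, question: str) -> List[float]:
    """Toy stand-in for a fixed encoder: [mean pixel, 1 - mean pixel].

    The question is ignored here; a real encoder would condition on it.
    """
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [mean, 1.0 - mean]


def keep_full(image: Image, question: str) -> Optional[Box]:
    """Toy policy that always keeps the full image."""
    return None


def crop_right_half(image: Image, question: str) -> Optional[Box]:
    """Toy policy that always crops the right half of the image."""
    width, height = len(image[0]), len(image)
    return (width // 2, 0, width, height)


image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
candidates = [[1.0, 1.0], [1.0, 0.0]]  # pre-computed candidate embeddings (toy)

full_order = rerank(image, "what is on the right?", candidates,
                    keep_full, toy_encode_query)
crop_order = rerank(image, "what is on the right?", candidates,
                    crop_right_half, toy_encode_query)
print(full_order)  # [0, 1]
print(crop_order)  # [1, 0]
```

In this toy setup the candidate encoder and embeddings never change; only the query-side view does, and that alone is enough to flip the top-1 result. That is the effect the abstract attributes to query-side adaptation.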