Keywords: Retrieval, Multimodal Large Language Models, Contrastive Learning
TL;DR: A benchmark for multimodal document retrieval.
Abstract: Retrieval is essential for multimodal large language models (MLLMs) to handle long contexts and improve factual accuracy. However, existing benchmarks focus on end-to-end answer generation, making retrieval itself difficult to evaluate in isolation. To address this, we introduce VisR-Bench, a benchmark for question-driven retrieval in scanned documents. Our queries do not explicitly contain their answers, preventing models from relying on keyword matching. Additionally, they avoid ambiguous references to figures or tables by ensuring that each query includes the descriptive information necessary to locate the correct content. The dataset spans English and 15 other languages: English queries enable fine-grained evaluation across answer modalities (tables, text, figures), while non-English queries focus on multilingual generalization. VisR-Bench provides a comprehensive framework for evaluating retrieval in document understanding.
Submission Number: 78