Abstract: Long-form question answering (LFQA) aims to generate grounded, paragraph-length answers by leveraging external documents. However, existing LFQA research has largely overlooked multimodality. We introduce RefLVQA, the first LFQA dataset featuring visual questions and multimodal documents. The dataset comprises 157K visual QA pairs, each annotated with sentence-level reference documents in the form of citations. To evaluate a model's ability to support its responses with external documents, we propose a citation-based evaluation approach in which models must append appropriate citations to back up their answers. Our key findings are threefold: (1) naïve multimodal RAG methods struggle because they rely excessively on textual documents and lack sufficient grounding in image-based documents; (2) our proposed Two-step MultiRAG outperforms unimodal RAG approaches, demonstrating the benefits of leveraging multimodal documents over unimodal ones; and (3) our qualitative analysis reveals that models frequently generate responses ungrounded in the referenced image documents.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodality, retrieval-augmented generation, benchmarking
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Keywords: vision question answering, multimodality, retrieval-augmented generation, benchmarking
Submission Number: 152