Abstract: Long-form question answering (LFQA) aims to generate grounded, paragraph-length answers by leveraging external documents. However, existing LFQA research has largely overlooked multimodality. We introduce RefLVQA, the first LFQA dataset featuring visual questions and multimodal documents. The dataset comprises 157K visual QA pairs, each annotated with sentence-level reference documents in the form of citations. To evaluate a model's ability to support its responses with external documents, we propose a citation-based evaluation approach in which models must append appropriate citations to back up their answers. Our key findings are threefold: (1) naïve multimodal RAG methods struggle because they rely excessively on textual documents and lack sufficient grounding in image-based documents; (2) our proposed Two-step MultiRAG outperforms unimodal RAG approaches, demonstrating the benefits of leveraging multimodal documents over unimodal ones; and (3) our qualitative analysis reveals that models frequently generate responses ungrounded in the referenced image documents.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodality, retrieval-augmented generation, benchmarking
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Keywords: vision question answering, multimodality, retrieval-augmented generation, benchmarking
Submission Number: 152