VinQA: Visual Elements Interleaved Answer Generation for Question Answering on Complex Real-World Multimodal Documents
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled the extension of Retrieval-Augmented Generation (RAG) to multimodal inputs. Prior work has explored retrievers designed to retrieve multimodal contexts relevant to a given query. These contexts typically consist of multiple document pages spanning various modalities, including text and diverse visual elements such as charts, tables, diagrams, and photos. However, relatively little attention has been paid to generating visual-element-interleaved answers from such complex multimodal contexts. In this paper, we introduce the Visual Elements Interleaved Answer Generation in Question Answering (VinQA) dataset. VinQA is constructed by simulating a multimodal RAG pipeline over real-world documents, yielding complex multimodal contexts. The answers interleave visual elements at appropriate positions, along with their textual descriptions. We evaluate various proprietary and open-source models on the VinQA test set using two encoding methods: Page Encoding, which encodes document pages as images to capture their full visual appearance, and Modality Encoding, which encodes each modality separately for fine-grained understanding. The evaluation assesses grounded answer quality and the effective integration of visual elements. Our results show that Modality Encoding generally outperforms Page Encoding in the zero-shot setting. However, after training on the VinQA training set, both methods exhibit substantial improvements and the performance gap becomes marginal.
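The abstract contrasts two ways of presenting retrieved multimodal context to an MLLM. The Python sketch below illustrates one plausible reading of that contrast; the class and field names (e.g., `Page`, `VisualElement`, `page_image`) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the two context-encoding strategies described in the
# abstract. All data structures and message formats here are assumptions.
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualElement:
    kind: str           # e.g., "chart", "table", "diagram", "photo"
    image_bytes: bytes  # cropped region of the element
    description: str    # textual description of the element


@dataclass
class Page:
    page_image: bytes   # full-page rendering of the retrieved document page
    text: str           # extracted page text
    elements: List[VisualElement] = field(default_factory=list)


def page_encoding(pages: List[Page], question: str) -> list:
    """Page Encoding: pass each retrieved page as a single image,
    preserving the page's full visual appearance and layout."""
    messages = [{"type": "text", "content": question}]
    for page in pages:
        messages.append({"type": "image", "content": page.page_image})
    return messages


def modality_encoding(pages: List[Page], question: str) -> list:
    """Modality Encoding: pass page text and each visual element separately,
    giving the model fine-grained access to individual modalities."""
    messages = [{"type": "text", "content": question}]
    for page in pages:
        messages.append({"type": "text", "content": page.text})
        for el in page.elements:
            messages.append({"type": "image", "content": el.image_bytes})
            messages.append(
                {"type": "text", "content": f"[{el.kind}] {el.description}"}
            )
    return messages
```

Under this reading, Page Encoding trades fine-grained element access for faithful layout, while Modality Encoding does the reverse, which is consistent with the reported zero-shot gap narrowing after fine-tuning.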
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: multimodality, multimodal QA, retrieval-augmented generation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 6623