Few-Shot Multimodal Explanation for Visual Question Answering

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM 2024 Poster, CC BY 4.0
Abstract: A key objective in eXplainable Artificial Intelligence (XAI) is to create intelligent systems capable of reasoning over and explaining real-world data to facilitate reliable decision-making. Recent studies have acknowledged the importance of providing user-friendly and verifiable explanations to build trustworthy Visual Question Answering (VQA) systems. This paper aims to advance explainable VQA from both the data and method perspectives. First, we propose a new Standard Multimodal Explanation (SME) dataset and a new Few-Shot Multimodal Explanation for VQA (FS-MEVQA) task, which aims to generate multimodal explanations of the underlying reasoning process for solving visual questions with few training samples. Our SME dataset includes 1,028,230 samples composed of questions, images, answers, and multimodal explanations, which can facilitate research in both traditional MEVQA and FS-MEVQA. To the best of our knowledge, this is the first large-scale dataset with joint language-vision explanations based on standard English and additional visual grounding tokens, thereby bridging MEVQA to the broader field of Natural Language Processing (NLP). Second, we propose a training-free Multimodal Explaining Agent (MEAgent) method based on an LLM agent with multimodal open-world tools to infer answers and generate multimodal explanations for visual questions. Our MEAgent learns multimodal explaining from merely $N(=16)$ training samples and leverages open-world abilities to perform FS-MEVQA on test samples. Comprehensive experimental results, evaluated with language quality metrics, a visual detection metric, and visual attribution metrics on our SME dataset, demonstrate the superiority of our method for FS-MEVQA over state-of-the-art MEVQA methods and the multimodal LLM GPT-4V.
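As a purely illustrative sketch of the kind of training-free, few-shot pipeline the abstract describes (the names, the prompt layout, the `<obj>...</obj>` grounding-token convention, and the `detect_objects` tool below are assumptions for illustration, not the authors' released code), an LLM agent could assemble the N in-context examples and then resolve grounding tokens in its explanation with an open-world detection tool:

```python
# Illustrative sketch only: the prompt format, the "<obj>...</obj>" grounding-token
# convention, and the detect_objects tool are assumptions, not the authors' code.
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class FewShotExample:
    question: str
    answer: str
    explanation: str  # standard English with grounding tokens, e.g. "<obj>red car</obj>"

def build_prompt(examples: List[FewShotExample], question: str) -> str:
    """Compose an in-context prompt from the N (e.g. 16) few-shot samples."""
    parts = ["Answer the question and explain your reasoning; "
             "wrap visual evidence in <obj>...</obj> tokens."]
    for ex in examples:
        parts.append(f"Q: {ex.question}\nA: {ex.answer}\nExplanation: {ex.explanation}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def ground_explanation(
    explanation: str,
    image_path: str,
    detect_objects: Callable[[str, str], List[Box]],
) -> List[Tuple[str, List[Box]]]:
    """Resolve each <obj>...</obj> phrase to bounding boxes via a detection tool."""
    phrases = re.findall(r"<obj>(.*?)</obj>", explanation)
    return [(phrase, detect_objects(image_path, phrase)) for phrase in phrases]
```

Here `detect_objects` stands in for whatever open-vocabulary detector the agent calls as a tool; the point of the sketch is only that the textual explanation and its visual grounding are produced jointly from a handful of in-context samples, with no gradient-based training.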
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Generative Multimedia, [Content] Media Interpretation, [Experience] Multimedia Applications
Relevance To Conference: In this work, we propose a new Standard Multimodal Explanation (SME) dataset with 1,028,230 samples for Multimodal Explanation for Visual Question Answering (MEVQA) and a new Few-Shot MEVQA (FS-MEVQA) task. Moreover, we propose a training-free Multimodal Explaining Agent (MEAgent) method for FS-MEVQA, which significantly outperforms traditional MEVQA methods and the multimodal LLM GPT-4V.
Supplementary Material: zip
Submission Number: 4931