A Simple Baseline for Zero-shot Visual Question Answering via Synthetic Data Generation

ACL ARR 2024 June Submission969 Authors

13 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Zero-shot Visual Question Answering (VQA) is a challenging and important task in vision-and-language reasoning, requiring models to answer questions about images without human annotation. Previous approaches mainly transform images into captions and rely on language model knowledge to answer visual questions. Despite promising results, this paradigm suffers from hallucination and high inference costs. In this paper, we propose MKDG, a zero-shot VQA framework that transfers knowledge from large language models (LLMs) and multi-modality models through synthetic data generation, thus exploiting the ability of LLMs while mitigating hallucination. Specifically, our method introduces a three-step synthetic data generation and training pipeline that first creates pseudo questions and answers with a caption model and LLMs. To alleviate hallucination and the unbalanced data distribution in the synthetic data, we propose a CLIP-based filtering and data selection strategy. Finally, we fine-tune a moderate-sized generative vision-language model on the automatically curated synthetic dataset to perform the VQA task. Experimental results on popular VQA benchmarks demonstrate the effectiveness of MKDG: we achieve superior performance, outperforming strong baselines that incorporate GPT-3, at significantly lower inference cost.
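The abstract does not specify how the CLIP-based filtering step is implemented. Below is a minimal, hypothetical sketch of how such filtering of synthetic question-answer pairs could work, assuming image-text similarity scoring with an off-the-shelf CLIP model from the `transformers` library; the model name, similarity threshold, and text template are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch: CLIP-based filtering of synthetic (image, question, answer) triples.
# The checkpoint, threshold, and text template are assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def filter_synthetic_pairs(samples, threshold=0.25):
    """Keep triples whose answer text is sufficiently grounded in the image
    according to CLIP image-text cosine similarity."""
    kept = []
    for image_path, question, answer in samples:
        image = Image.open(image_path).convert("RGB")
        # Score the candidate answer against the image; a low score suggests
        # the LLM-generated answer may be hallucinated.
        text = f"a photo of {answer}"
        inputs = processor(text=[text], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
            image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
            text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
            similarity = (image_emb @ text_emb.T).item()
        if similarity >= threshold:
            kept.append((image_path, question, answer))
    return kept
```

A balanced-sampling step over the retained pairs (e.g., capping the number of examples per answer) could then address the unbalanced distribution mentioned in the abstract; that step is likewise not detailed here.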
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, cross-modal content generation
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 969