SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

Published: 01 May 2025, Last Modified: 18 Jun 2025ICML 2025 oralEveryoneRevisionsBibTeXCC BY 4.0
TL;DR: We introduce SKVQA, a large-scale synthetic multimodal dataset, for improving context-aware generation capability of multimodal.
Abstract: Multimodal retrieval-augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where models should effectively integrate additional knowledge to generate a response. However, existing vision and language models (VLMs) are not inherently designed for context-augmented generation, limiting their effectiveness in such tasks. While synthetic data generation has recently gained attention for training large VLMs, its application for context-augmented generation remains underexplored. To address this gap, we introduce SKVQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, each associated with external knowledge sources to determine the final answer. Compared to previous datasets, SKVQA exhibits 11× more unique questions, greater domain diversity, and a broader spectrum of image sources. Through human evaluations, we confirm the high quality of the generated question-answer pairs and their contextual relevance. Extensive experiments show that SKVQA serves both as a challenging benchmark for knowledge-based VQA and as an effective training resource for adapting generative multimodal models to context-augmented generation. Our results further indicate that models trained on SKVQA demonstrate enhanced generalization in both context-aware VQA and multimodal RAG settings.
Lay Summary: This paper introduces SK-VQA, a large-scale dataset containing over 2 million question-answer pairs created to help AI models better understand and answer questions about images, especially when answering those questions requires additional background knowledge.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/IntelLabs/multimodal_cognitive_ai/tree/main/SK-VQA
Primary Area: Applications->Everything Else
Keywords: Multimodal, retrieval augmented generation, data generation
Submission Number: 5513
Loading