SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

Xin Su; Man Luo; Kris W Pan; Tien Pei Chou; Vasudev Lal; Phillip Howard

Published: 01 May 2025, Last Modified: 11 Aug 2025, ICML 2025 oral, CC BY 4.0

TL;DR: We introduce SK-VQA, a large-scale synthetic multimodal dataset for improving the context-aware generation capability of multimodal models.

Abstract: Multimodal retrieval-augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where models should effectively integrate additional knowledge to generate a response. However, existing vision and language models (VLMs) are not inherently designed for context-augmented generation, limiting their effectiveness in such tasks. While synthetic data generation has recently gained attention for training large VLMs, its application to context-augmented generation remains underexplored. To address this gap, we introduce SK-VQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, each associated with external knowledge sources needed to determine the final answer. Compared to previous datasets, SK-VQA contains 11× more unique questions, greater domain diversity, and a broader spectrum of image sources. Through human evaluations, we confirm the high quality of the generated question-answer pairs and their contextual relevance. Extensive experiments show that SK-VQA serves both as a challenging benchmark for knowledge-based VQA and as an effective training resource for adapting generative multimodal models to context-augmented generation. Our results further indicate that models trained on SK-VQA demonstrate enhanced generalization in both context-aware VQA and multimodal RAG settings.

Lay Summary: This paper introduces SK-VQA, a large-scale dataset containing over 2 million question-answer pairs created to help AI models better understand and answer questions about images, especially when answering those questions requires additional background knowledge.

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Link To Code: https://github.com/IntelLabs/multimodal_cognitive_ai/tree/main/SK-VQA

Primary Area: Applications->Everything Else

Keywords: Multimodal, retrieval augmented generation, data generation

Submission Number: 5513
