A Survey on Bridging VLMs and Synthetic Data

09 May 2025 (modified: 29 Oct 2025) · Anonymous Preprint Submission · CC BY 4.0
Keywords: Multimodal AI, Vision-Language Models, Synthetic Data, Generative AI
Abstract: Vision-language models (VLMs) have significantly advanced multimodal AI by learning joint representations of visual and textual data. However, their progress is hindered by challenges in acquiring high-quality, aligned datasets, including issues of cost, privacy, and scarcity. Synthetic data, created with generative AI—which can itself include VLMs—offers a scalable and cost-effective solution to these challenges. This paper presents the first comprehensive survey on bridging VLMs and synthetic data, examining both the role of synthetic data in VLMs and the role of VLMs in synthetic data generation. We first provide preliminaries by briefly explaining the architectures of two basic VLMs, and then, drawing on a large body of prior work, present an extensive survey of previously proposed methodologies and potential future directions in this area.
Author Ids: ~Mohammad_Ghiasvand_Mohammadkhani1, ~Saeedeh_Momtazi2, ~Hamid_Beigy1
Submission Number: 677