Keywords: Synthetic Dataset, Generative Models, Noise Optimization
Abstract: Recent advances in diffusion models have enabled the generation of synthetic images nearly indistinguishable from real ones, making them attractive for dataset construction. However, synthetic images often contain features that differ from those of real images, which can hinder the training of Vision-Language Models (VLMs).
In this paper, we propose a method for constructing synthetic image datasets that enable more effective VLM training. The proposed method reduces the gap between real and synthetic images by optimizing the initial noise of the diffusion model. Our approach enhances the alignment between text conditions and generated images within the embedding spaces of multiple models in a plug-and-play manner. It also reduces characteristic discrepancies from real images, yielding higher-quality synthetic image data and ultimately improving VLM training.
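For intuition, the following is a minimal, self-contained PyTorch sketch of the noise-optimization idea, not the paper's implementation: `Generator`, `ImageEncoder`, the random caption embeddings, and the averaged cosine-distance loss are all illustrative assumptions standing in for a frozen diffusion sampler, frozen CLIP-style encoders, and the method's actual alignment objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Toy stand-in for a frozen diffusion sampler: initial noise -> image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(4, 3, 3, padding=1), nn.Tanh())

    def forward(self, z):
        return self.net(z)

class ImageEncoder(nn.Module):
    """Toy stand-in for one frozen image encoder (e.g., a CLIP image tower)."""
    def __init__(self, in_dim=3 * 32 * 32, emb_dim=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, emb_dim)

    def forward(self, x):
        return F.normalize(self.proj(x.flatten(1)), dim=-1)

torch.manual_seed(0)
generator = Generator().eval()
encoders = [ImageEncoder(), ImageEncoder()]  # "multiple models"
for module in (generator, *encoders):
    module.requires_grad_(False)  # everything is frozen except the noise

# Frozen caption embeddings, one per embedding space (random placeholders
# here; in practice they would come from the matching text encoders).
text_embs = [F.normalize(torch.randn(1, 64), dim=-1) for _ in encoders]

# Optimize the initial noise so the generated image aligns with the caption
# in every encoder's embedding space (average cosine distance as the loss).
z = torch.randn(1, 4, 32, 32, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for step in range(200):
    img = generator(z)
    loss = sum(1.0 - (enc(img) * t).sum(dim=-1).mean()
               for enc, t in zip(encoders, text_embs)) / len(encoders)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because only `z` carries gradients, the same loop can wrap any frozen generator and any set of frozen encoders, which is what makes the approach plug-and-play.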
Using the CC3M dataset as a baseline, we generate synthetic datasets conditioned on the same captions. Experiments show that CLIP models trained on our datasets achieve 23.69\% average Recall@1 in zero-shot retrieval and 17.97\% zero-shot classification accuracy on ImageNet-1K, outperforming models trained on naïvely generated data.
Furthermore, our method demonstrates strong scalability and sample efficiency, achieving even better performance with up to 40\% fewer synthetic images.
Submission Number: 16