Training with Real instead of Synthetic Generated Images Still Performs Better

Published: 09 Apr 2024, Last Modified: 09 Apr 2024
Venue: SynData4CV
License: CC BY 4.0
Keywords: Synthetic data, domain adaptation, vision-and-language, generative models
TL;DR: Our paper asks: does training on synthetic data from a generative model provide any gain beyond training on the upstream data used to train the generative model?
Abstract: Recent advances in text-to-image models have inspired many works that seek to train models on synthetic images, capitalizing on the ability of modern generators to control what we synthesize and thus train on. However, synthetic images ultimately originate from the upstream data pool used to train the generative model we sample from: does the intermediate generator add any gain over simply training on the relevant parts of the upstream data directly? In this paper, we study this question in the setting of task adaptation by comparing training on task-targeted synthetic data generated with Stable Diffusion (a generative model trained on the LAION-2B dataset) against training on targeted real images retrieved directly from LAION-2B. We show that while targeted synthetic data can aid model adaptation, it largely lags behind targeted real data. Overall, when we have access to the generator's upstream data pool, we should be cautious in our use of generated synthetic data. Opportunities for future work include studying synthetic data in settings where the upstream data is not accessible (for instance, due to copyright or privacy concerns) and searching for benefits from synthetic data even when the upstream data is available.
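The "targeted real data" baseline the abstract describes (selecting task-relevant images directly from the upstream pool rather than synthesizing new ones) can be sketched as embedding-based retrieval. The snippet below is a toy illustration, not the paper's actual pipeline: the pool items, embeddings, and function names are hypothetical stand-ins for LAION-2B image embeddings and a task-description embedding (which in practice would come from a model like CLIP).

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_targeted(pool, query, k):
    """Return the ids of the k pool items most similar to the task query.

    `pool` is a list of (item_id, embedding) pairs standing in for upstream
    (e.g. LAION-2B) image embeddings; `query` stands in for an embedding of
    the downstream task. Both are illustrative, not the paper's method.
    """
    scored = sorted(pool, key=lambda p: cosine(p[1], query), reverse=True)
    return [item_id for item_id, _ in scored[:k]]

# Toy upstream pool: three items with 2-d embeddings.
pool = [("cat_photo", [1.0, 0.1]),
        ("dog_photo", [0.9, 0.2]),
        ("car_photo", [0.0, 1.0])]

# Hypothetical query embedding for a "pets" adaptation task.
print(retrieve_targeted(pool, [1.0, 0.0], k=2))  # → ['cat_photo', 'dog_photo']
```

The retrieved subset would then be used as real training data for adaptation, to be compared against an equally targeted set of images synthesized by the generator.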
Submission Number: 42